Compare commits

...

385 Commits

Author SHA1 Message Date
736d791056 Merge pull request 'feat(ingest): 진행 로그 개선 — 파일명/phase/heartbeat/slowest 요약' (#204) from feat/ingest-log-improve into main
Reviewed-on: #204
2026-06-03 11:04:19 +00:00
6c9c8df43e chore(version): 0.27.0 → 0.26.1 — 새 bump 규칙상 patch
진행 로그 개선은 검색·색인 결과 불변 + 새 명령/플래그/config 없음 + additive-only
wire(asset_phase)라 CLAUDE.md 신규 규칙(기능/인터페이스 변경=minor, 없으면 patch)상
patch 가 맞음. version·라벨·HOTFIXES 헤더를 0.26.1 로 정정.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-03 11:02:16 +00:00
0263667684 docs(claude): pre-1.0 버전 bump 규칙 명확화 — 기능/인터페이스 변경=minor, 없으면 patch
minor 과다 사용 교정. 기능(behavior)·인터페이스(CLI/config/breaking wire/결과·동작/
migration) 변경 시 minor, 그 외(bug fix·refactor·관측성/로깅·additive-only wire·문서)
는 patch. 경계 예시 포함(진행로그+additive wire=patch, arctic provider/config=minor).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-03 11:00:49 +00:00
4918983d9c chore(ingest): PR #204 회차1 리뷰 반영 — 버전 라벨 v0.26.0 → v0.27.0
신규 진행로깅 표면(asset_phase / ocr_ms / caption_ms + progress.rs heartbeat·
slowest 주석)이 v0.26.0 으로 잘못 표기돼 있던 것을 v0.27.0(실제 추가 버전)으로
정정. wire schema 의 "추가 버전" 정확성(외부 통합 참조). 로직 변경 없음(주석/doc).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-03 10:57:17 +00:00
aeaa18a564 feat(ingest): 진행 로그 개선 — 파일명/phase/heartbeat/slowest 요약
OCR/caption 켜진 볼트 ingest 가 중간부터 느릴 때 TTY 진행바가 파일명·phase·
모델·경과시간을 안 보여 "멈춤"처럼 보이던 문제 해결.
- 신규 wire AssetPhase{idx,total,phase,model} + AssetTimings.ocr_ms/caption_ms
  (additive, ingest_progress.v1 유지)
- app: apply_ocr/apply_caption/embed 진입 시 AssetPhase emit + ocr/caption 시간 측정
- cli: TTY 진행바에 현재 파일명 + phase(model) + asset 경과초(heartbeat),
  종료 시 최장 소요 파일 top-5 요약(quiet 여도 출력, --json 미출력)
- wire schema / README / HANDOFF / HOTFIXES 동기화, version 0.26.0 → 0.27.0

검증(리더): clippy 0, kebab-app/cli 61그룹·parse-image/tui 14그룹 0실패(-j8).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-03 10:52:26 +00:00
c91ff909ce docs(spec): ingest 로그 개선 spec + plan
arctic 도그푸딩에서 OCR/caption 켜진 Obsidian 볼트가 중간부터 느려졌으나
TTY 진행바가 파일명·phase·모델·경과시간을 안 보여 "멈춤"처럼 보인 문제.
파일명 표시 + 느린 phase(OCR/caption/embed)+모델 실시간 + asset 경과 heartbeat
+ 종료 시 최장 소요 top-N 요약. additive wire(asset_phase, ocr_ms/caption_ms).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-03 09:14:15 +00:00
8dee610a97 docs(hotfixes): arctic 종단 도그푸딩 evidence (recall@10 130/132)
kebab v0.26.0 실제 파이프라인(ollama arctic)으로 namu 재색인 → 확장 골든 eval
recall@10 130/132·recall@50 132/132·fully_consistent 22/24 종단 재현. 측정→구현
→실파이프라인 삼중 확인. 릴리스 전 도그푸딩 trigger(embedder 모델 변경) 충족.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-03 07:19:19 +00:00
d71ed2516b Merge pull request 'feat(embed): arctic-embed-l-v2.0 임베더(candle+ollama)' (#203) from feat/arctic-embedder into main
Reviewed-on: #203
2026-06-03 06:27:55 +00:00
095c9f37a2 docs(smoke): embedding config 블록 v0.26.0 동기화
SMOKE.md 의 [models.embedding] 예시 주석이 stale: provider 목록에 ollama 누락 +
"candle 은 e5-large 만 지원"(arctic 추가로 더 이상 사실 아님) + endpoint/arctic
미기재. CLAUDE.md §"README Configuration + SMOKE config 블록 동시 갱신" 규칙대로
보완 — provider 4종, arctic 모델(candle/ollama 태그), endpoint(ollama 전용, llm
endpoint fallback), e5↔arctic cascade 주석 추가.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-03 05:11:03 +00:00
16ddb1dfc3 docs: arctic 임베더 문서 동기화 (README/ARCHITECTURE/HANDOFF/HOTFIXES)
README Configuration: provider candle/ollama + arctic 모델(candle CLS / ollama 태그)
+ endpoint + e5→arctic cascade 경고. ARCHITECTURE: 백엔드 그래프 노드(embedollama)
+ 임베딩 백엔드 결정표(채택 근거 측정 recall@10 130) + 디렉토리 트리. HANDOFF 1줄.
HOTFIXES 2026-06-03 arctic dated entry(레지스트리/pooling/prefix/cascade + 수동
cosine 0.999984 실측 결과).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-03 04:59:23 +00:00
72c99c452c feat(config,app): embedding provider=ollama 배선 + endpoint, version 0.26.0
kebab-config: EmbeddingModelCfg.endpoint: Option<String>(serde default, ollama용,
None→models.llm.endpoint 폴백) + provider 문서에 ollama + env
KEBAB_MODELS_EMBEDDING_ENDPOINT. kebab-app embedder(): provider match 에 ollama
분기(facade 경유). workspace member += kebab-embed-ollama, app dep 추가.
version 0.25.0 → 0.26.0(minor, +Cargo.lock) — 신규 임베딩 백엔드/모델은 CLAUDE.md
§Release 의 surface 변경 트리거.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-03 04:59:23 +00:00
cbcae69abf feat(embed): candle 모델 레지스트리 + arctic-embed-l-v2.0 (CLS pooling)
e5 하드코딩(HF_MODEL/SUPPORTED_MODEL/mean/query:+passage:) → 모델 레지스트리
EmbedModelSpec{name,hf_repo,pooling,query_prefix,doc_prefix,dim,version_tag}.
e5(mean, query:/passage:) + arctic(CLS, query:/무접두어). pooling 모델별 분기
(mean=attention-mask-weighted / CLS=hidden[:,0,:]), tokenize/forward/L2 공유.
arctic pooling=CLS 는 HF 1_Pooling/config.json(pooling_mode_cls_token:true) 확인.
model_version 은 arctic 일 때 +arctic-cls 태그(embedding_version cascade 트리거);
e5 는 fastembed-e5 호환(NUMA 드롭인) 위해 plain config.version 유지.

correctness 게이트: tests/arctic_ollama_parity.rs (#[ignore], live Ollama) —
candle arctic vs Ollama snowflake-arctic-embed2 per-sentence 코사인>0.99.
수동 실측 cosine_min=0.999984 (recall@10 130 재현 보장).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-03 04:59:11 +00:00
7505645008 feat(embed): kebab-embed-ollama 신규 크레이트 — Ollama /api/embed Embedder
arctic-embed-l-v2.0 의 폴백 백엔드(측정에 쓴 경로 그대로). reqwest::blocking
POST {endpoint}/api/embed {model,input:[...]} → embeddings. batch 48 +
fail-soft 재시도 3, 결과 L2 정규화(Ollama raw 반환 → 일관성), dim 검증.
query/doc prefix 는 모델 태그로 추론(arctic-embed→query:/무접두어, e5→query:/passage:).
model_version=ollama:{model}. endpoint=models.embedding.endpoint ?? models.llm.endpoint.
wiremock 테스트 3종(L2 정규화/dim mismatch/empty no-op) + 단위 5.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-03 04:59:11 +00:00
e2ae9a4589 docs(spec): arctic-embed-l-v2.0 임베더 통합 spec + plan
별칭 제거 후 설명형 recall 보강 최선책(측정 arctic recall@10 130/132,
용어 무손실). candle 다중모델화(CLS pooling+query: prefix) 우선 + Ollama
embed provider 폴백. correctness 게이트 = candle≈Ollama 코사인>0.99.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-03 04:14:26 +00:00
1dfab6dfc5 Merge pull request 'refactor(app): doc-side expansion(별칭) 기능 제거' (#202) from refactor/remove-doc-expansion into main
Reviewed-on: #202
2026-06-03 00:39:26 +00:00
fc5103642e docs: 별칭 제거 문서 동기화 + version 0.25.0
HOTFIXES 2026-06-03 dated entry, 2026-05-30 design spec 제거 banner,
HANDOFF 1줄, README(별칭 섹션/config/명령표 정리), ARCHITECTURE(결정 표 +
디렉토리 트리), SMOKE/DOGFOOD config-migrate 예시 정정. workspace version
0.24.0 → 0.25.0 (+ Cargo.lock).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-02 21:37:58 +00:00
e03d03cb26 test: 별칭 전용 테스트 삭제 + 영향 테스트/fixture 갱신
kebab-search/tests/lexical.rs 의 alias 채널 테스트 + insert_chunk_with_aliases
헬퍼 제거(body 회수 회귀 테스트로 대체). Chunk 리터럴 aliases: None 제거
(embedding_records_fk/idempotency/inspect). chunk 스냅샷 fixture 의 aliases
키 제거. config_migrate 는 ingest.code 앵커로, corpus_revision/search_lexical
주석은 V013 비-bump 명시로 갱신.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-02 21:37:58 +00:00
16aadea222 feat(store): V013 마이그레이션 — chunk_aliases_fts + chunks.aliases DROP
forward-only 마이그레이션으로 V010 이 만든 chunk_aliases_fts(+트리거)와
chunks.aliases 컬럼 제거. 과거 V010 은 freeze 무수정. 순수 구조 변경 —
corpus_revision bump 안 함(spec §결정: 본문/임베딩 불변, in-process LRU 는
프로세스별·query 이전 실행이라 bump 무의미).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-02 21:37:44 +00:00
a48c405826 refactor(wire): ExpansionProgress 이벤트 + 렌더 제거
IngestEvent::ExpansionProgress variant + 직렬화 테스트 제거(AssetChunked/
AssetTimings 유지). CLI/TUI 의 expansion 렌더 제거, AssetTimings 한 줄에서
expand 세그먼트 제거. ingest_progress.v1 schema 의 expansion_progress kind
제거, expansion_ms 설명을 "값 0 유지"로 갱신.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-02 21:37:44 +00:00
21e02d8a93 refactor(app): ingest 별칭 생성·캐시·sentinel 벡터 루프 제거
ingest_one_asset 의 청크당 별칭 LLM 생성·derivation_cache 조회/저장·
embed_aliases sentinel 벡터(`{orig}#alias#N`) upsert 루프 제거.
expansion_ms 는 wire 호환 위해 0 고정. alias_sentinel_ids_to_delete 와
orphan purge 3개 호출부를 본문 chunk_id 직접 삭제로 단순화.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-02 21:37:43 +00:00
a64c31ee94 refactor(search): alias lexical arm 제거
run_alias_query / merge_body_alias 제거, 검색을 body_rows 직접 사용으로
단순화. build_match_string_for_column 의 column 매개변수 인라인(text 고정).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-02 21:36:56 +00:00
ec96648956 refactor(config): ExpansionCfg + [ingest.expansion] 제거
IngestExpansionCfg struct + IngestCfg.expansion 필드 + Default +
KEBAB_INGEST_EXPANSION_* env 파싱 + 테스트 제거. migrate.rs 의
ingest.expansion 섹션 주석 제거 — config migrate 테스트는 ingest.code 앵커로
정정(forward-compat: 기존 [ingest.expansion] 섹션은 serde 가 무시).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-02 21:36:56 +00:00
ecaf224381 refactor(chunk): Chunk 생성부의 aliases 리터럴 + store 컬럼 제거
kebab-chunk/* AST·md·tier2·pdf chunker 의 aliases: None 리터럴 삭제,
store-sqlite documents.rs chunks INSERT 컬럼/바인딩 + get_chunk 매핑에서
aliases 제거.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-02 21:36:44 +00:00
b1c5feb3f3 refactor(core): Chunk.aliases 필드 제거
doc-side expansion(별칭) 제거 — Chunk 의 aliases: Option<String> 필드와
serde default 테스트 제거. Metadata.aliases(Vec, 문서 메타)는 유지.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-02 21:36:44 +00:00
ca8c0645fb docs(spec): doc-side expansion(별칭) 제거 spec + plan
연구문서(2026-06-03) 결론 따라 별칭 기능 제거 설계. 유지(Metadata.aliases,
AssetChunked/AssetTimings, embedding 캐시)/제거(Chunk.aliases, expansion.rs,
ExpansionCfg, chunk_aliases_fts, run_alias_query, ExpansionProgress) 경계 +
신규 DROP 마이그레이션 + wire expansion_progress 제거 결정. 구현 plan 8 task.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-02 20:13:12 +00:00
c7af6612b7 docs(research): expansion 비용 재고 + 별칭 대체 딥리서치
별칭(doc-side per-chunk LLM expansion)이 ingest 임계경로 병목으로 확정된 뒤
대안 조사. 동시성(OLLAMA_NUM_PARALLEL 최대 1.28×)·모델스왑(qwen3.5 중국어/
degeneration)·백그라운드(총량·treadmill 불변) 모두 실측 소진. Step 0 측정:
별칭 없이도 cross-lingual recall@10 완벽(en/ko/syn/abbr), 약점은 설명형뿐
→ 별칭 ROI 음수. Step 1: bge-m3 dense 는 lateral(설명형 +3 / 용어 -3, 순0).
4-agent 딥리서치: 잔존 = reverse-dictionary 과제, 측정-우선 계층(heading
enrichment → arctic-ko 임베더 → bge-reranker-v2-m3 → near-tie 게이트 expansion).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-02 20:07:42 +00:00
acb4fa6c65 Merge pull request 'feat(ingest): asset 내부 phase 진행 로깅 (asset_chunked/expansion_progress/asset_timings)' (#201) from feat/ingest-progress-detail into main
Reviewed-on: #201
2026-06-02 17:26:47 +00:00
8bfa4ba76e fix(ingest-progress): 리뷰 반영 — store_ms 경계 정정 + 중복 expansion 프레임 가드
- store_ms 에서 stale-vector orphan purge(LanceDB I/O) 제거 → embed/vector phase
  (embed_ms)로 이동. store_ms 가 이제 SQLite put_* 만 의미(진단 정확도; 편집
  재색인 시 920ms 오귀속 제거). purge 는 여전히 unconditional + upsert 이전.
- 최종 expansion_progress 프레임을 done != last_done 로 가드 (throttle 배수 시
  중복 프레임 + chunks==0 시 0/0 프레임 제거).
- schema/HOTFIXES: store_ms/embed_ms 설명 정정 + dangling IMPL_REPORT 참조 제거.

clippy -D warnings 0, test 312 passed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-02 14:49:02 +00:00
ad0ccf4ccf chore(ingest-progress): remove process artifacts before PR
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-02 14:13:36 +00:00
b351523e51 docs(worktree): IMPL_BRIEF + IMPL_REPORT for ingest-progress-detail
작업 입력(brief)과 산출 증거(report: 변경/이벤트/exit-code 검증/smoke 샘플/
잔여 리스크). 메인 세션이 PR 정리 시 드롭 가능한 worktree 메타 아티팩트.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-02 13:58:33 +00:00
a48b055358 feat(ingest): asset 내부 phase 진행 로깅 (asset_chunked/expansion_progress/asset_timings) + v0.24.0
asset(문서) 단위뿐이던 ingest 진행 이벤트에 문서 내부 phase 가시성을 추가.
큰 문서가 expansion(별칭 LLM, 청크당 순차)으로 수십 분 걸려도 진행바가
1/N 에 멈춘 듯 보이던 문제 해결.

wire ingest_progress.v1 additive (backward-compat):
- asset_chunked {idx,total,chunks} — 청킹 직후, markdown/image/pdf 전 경로
- expansion_progress {idx,total,done,chunks} — expansion 루프 스로틀
  (25청크 또는 1s, 종료 시 done==chunks). 캐시 히트도 done 에 포함
- asset_timings {idx,total,parse_ms,chunk_ms,expansion_ms,embed_ms,store_ms}
  — markdown 경로 phase별 wall-clock

설계: timing 은 kebab_core::IngestItem(wire-stable) 변경을 피해 신규
AssetTimings 이벤트로 ingest_one_asset 가 직접 emit (AssetFinished 무변경).

CLI(progress.rs): 진행바 sub-message(→ N chunks / 별칭 확장 done/chunks) +
asset 종료 시 phase timing 한 줄(fmt_ms). TUI reducer no-op arm.

검증: clippy -D warnings exit 0; cargo test -p kebab-app -p kebab-cli
312 passed/0 failed. ordering-invariant 테스트 재작성 + 신규 직렬화 테스트.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-02 13:58:27 +00:00
581e1d5d55 feat(cli): ingest 시 임베딩 백엔드/디바이스 한 줄 표시 + README KB 이전 문서 (v0.23.1)
- kebab-cli ingest: 시작 시 `임베딩 백엔드: <provider> (Metal/GPU 빌드|CPU) · 모델 …`
  를 stderr 로 표시 (--json/--quiet 억제). Metal 표기는 cfg!(feature=embed_metal)
  기반; 확정 런타임 디바이스는 kb.log(`candle device = …`).
- README: '외부 계산 + 로컬 검색' 절에 복사 대상(kebab.sqlite/sqlite, lancedb/vector_dir)
  + [storage] config 키 + models/assets 복사 불필요 + 동일 버전/모델 조건 + rsync 예시.
- 버전 0.23.0 → 0.23.1 (CLI 출력 + 문서만, 동작/schema 불변).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-02 12:25:45 +00:00
c17d6e67a8 Merge pull request 'feat(embed): candle Metal (Apple Silicon GPU) opt-in build feature' (#200) from feat/embed-candle-metal into main
Reviewed-on: #200
2026-06-02 11:40:52 +00:00
af8fd34716 docs(embed): README 에 cargo install --features embed_metal 안내 추가
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-02 11:38:28 +00:00
369aeb3d24 feat(embed): candle Metal (Apple Silicon GPU) opt-in build feature + v0.23.0
- kebab-embed-candle: `metal` feature → candle metal backend; select_device()
  picks Device::new_metal(0) (CPU fallback) under the feature, else Device::Cpu.
  .contiguous() before to_vec2 (Metal rejects strided views; CPU tolerates).
- feature passthrough: kebab-app/embed_metal → kebab-cli/embed_metal.
  Build on macOS: cargo build --release --features embed_metal.
- default (non-metal) path unchanged: clippy 0, candle units + thread_cap + parity pass.
- README + HOTFIXES: Mac-GPU-ingest → copy sqlite+lancedb → server CPU-query workflow.
- version 0.22.0 → 0.23.0 (opt-in build surface).

macOS-only compile; Metal execution/speed/parity validated by user on M4 Pro
(not buildable on the Linux CI/dev machine).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-02 11:37:08 +00:00
99f8cfa691 Merge pull request 'feat(embed): candle 임베딩 provider (NUMA-안전, opt-in)' (#199) from feat/embed-candle into main
Reviewed-on: #199
2026-06-02 10:14:39 +00:00
d85d7348a5 docs(embed-candle): 도그푸딩 + A1 반증 + MKL 부정결과 증거 기록
- HOTFIXES + release-notes: candle 전체 도그푸딩 997 docs/23,151 chunks/에러 0 (9.5h)
- A1(taskset -c 0-3) 실서버 반증: 4코어 제한에도 onnxruntime segfault → candle 만이 실 해법
- MKL 가속 부정 결과: 코어 더 쓰나 38~50% 느림 → 미채택, 순수-Rust 유지
- 패리티 2.01e-7 재확인, 성능 트레이드오프 명시

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-02 09:08:12 +00:00
edac3ae737 chore(embed-candle): PR #199 회차 1 리뷰 반영 — SMOKE.md candle 모델 주의 명시
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-01 17:01:35 +00:00
6ec4e6809f fix(embed-candle): address round-1 review
- commit track-spec + meta-spec/plan into branch (HIGH: dangling `amends:` ref)
- inline parity evidence (cosine 1.0, max_abs_diff 2.01e-7) into HOTFIXES +
  release notes; drop refs to deleted IMPL_REPORT/SPIKE_REPORT (MEDIUM)
- model guard: reject non-e5-large `model` before the 2GB download so
  model_id() can't mislabel vectors (MEDIUM) + unit test
- parity test now covers BOTH query: and passage: prefixes (MEDIUM)
- guard encodings.first() index; document zero-attention/pooling invariant;
  clarify embed_batch prefixing doc (LOW)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-01 16:54:20 +00:00
1011c75fff chore(embed-candle): remove spike/impl process artifacts before PR
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-01 16:37:46 +00:00
8f7b6ee538 feat(embed): candle 임베딩 provider (NUMA-안전, opt-in) + v0.22.0
duo-socket NUMA 서버에서 fastembed(onnxruntime)가 intra-op 스레드를 48개로
하드코딩해 NUMA 힙 손상 → double-free 로 ingest 가 죽는 문제를 회피하기 위해,
같은 multilingual-e5-large 모델을 순수 Rust(candle)로 돌리는 opt-in 임베딩
provider 를 추가한다.

- 신규 crate kebab-embed-candle: CandleEmbedder (kebab_core::Embedder).
  hf-hub safetensors → XLMRobertaModel forward → mask mean-pool → L2 → e5
  prefix. candle 의존성 트리를 이 crate 에 격리 (core/config 외 kebab-* 의존 0).
- 스레드 캡: [models.embedding].num_threads + env KEBAB_EMBED_THREADS →
  글로벌 rayon 풀 1회 캡 (NUMA-안전 레버).
- kebab-app::embedder() 가 provider 분기 (fastembed/onnx/"" → 기존 경로 불변,
  candle → CandleEmbedder, 미지값 → 에러).
- Phase 0 스파이크 crate 제거 (production 흡수).
- 버전 0.21.1 → 0.22.0 (신규 config surface, pre-1.0 minor bump).

패리티: cosine_min=1.000000, max abs diff=2.01e-7 (< 1e-5) → embedding_version
유지, 재색인 0. fastembed default 동작/벡터 불변. wire schema 변경 없음.

검증(파일+exit code): clippy -D warnings EXIT=0(warning 0), test EXIT=0
(candle unit 5 + thread_cap rayon=4 + config 68), parity #[ignore] EXIT=0.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-01 14:52:25 +00:00
76841af7d3 spike(embed-candle): candle e5-large 타당성 검증 — VERDICT PASS
Track 1 / Phase 0 격리 스파이크. candle(순수 Rust)로
intfloat/multilingual-e5-large 를 돌려 기존 onnxruntime
FastembedEmbedder 와 비교.

결과:
- 패리티: 한/영 10문장 cosine min=mean=1.000000 (완전 일치)
- padding_idx: XLM-R 규약 정상 (소스 + 패리티 이중 확인)
- 스레드 제어: RAYON_NUM_THREADS=4 로 컴퓨트 스레드 12→4 캡 확인
  (fastembed 4.9.1 의 48-하드코딩+override불가 문제 구조적 부재)
- latency: batch=32 candle 2.161s vs fastembed 0.536s (~4×, 4 vs 12 스레드)

→ candle 본 구현 진행 권고 (GREEN). 상세 SPIKE_REPORT.md.

candle 의존성은 crates/spike-embed-candle 에만 격리. 프로덕션
crate 동작 변경 없음. 결정적 NUMA 검증은 그 듀얼소켓 서버에서
사용자 실행 필요 (meta-spec §4.3).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-01 14:23:51 +00:00
980e20fd8d docs: SMOKE/DOGFOOD 에 config migrate 플레이북 추가
SMOKE 에 config migrate 스모크 단계(dry-run/적용/멱등/--json), DOGFOOD §9 에
스키마 마이그레이션 시나리오(.bak byte-identical·값 보존·가시화·멱등·doctor).
v0.21.1 에 포함되도록 태그 이동.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-31 13:58:08 +00:00
cd79ed326c chore: bump version 0.21.0 → 0.21.1
config 마이그레이션(kebab config migrate, PR #198) — 신규 CLI 서브커맨드 +
doctor 체크 + init 섹션 주석 + wire config_migration.v1 + schema_version 1→2.
additive 변경(데이터 무효화 아님)이라 patch bump.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-31 13:51:56 +00:00
9dbf9d781d Merge pull request 'feat(config): config.toml 마이그레이션 (kebab config migrate)' (#198) from feat/config-migration into main
Reviewed-on: #198
2026-05-31 13:48:10 +00:00
9501edd82b docs: config migrate surface 동기화 (README/HOTFIXES/HANDOFF)
README Configuration 에 kebab config migrate 불릿, HOTFIXES 에 dated entry
(메커니즘 + 도그푸딩 evidence 표 + 한계), HANDOFF 한 줄. lib.rs 백업 경로는
with_extension 유지(리뷰 nit: .toml config 엔 정상 동작, 회귀 위험 회피).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-31 13:25:42 +00:00
4b4a4c0b32 fix(config): init 헤더에 지원 확장자 상세 목록 유지
annotated_default_document 의 HEADER 가 기존 init 헤더의 '처리 가능한 형식'
상세 목록(.md / .png .jpg .jpeg / .pdf)을 보존하도록 복원. p9-fb-25 의
init_template 계약(지원 확장자 안내) 유지.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-31 12:46:45 +00:00
f2cc325cf3 feat(cli): kebab config migrate 서브커맨드 + wire config_migration.v1
- Cmd::Config { Migrate { --dry-run } }, --json 시 config_migration.v1.
- wire_config_migration (ConfigMigrationReport 가 schema_version 자체 보유).
- schema.rs WIRE_SCHEMAS 에 config_migration.v1 등록 + JSON schema 파일.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-31 12:09:31 +00:00
b7e022a5e3 feat(app): config migrate facade + init 주석 공유 + doctor 체크
- config_migrate_with_config_path: 백업(.bak)+atomic write(tmp→rename)+dry-run,
  round-trip 검증으로 실패 시 원본 보존. ConfigMigrationReport 반환.
- init_workspace 가 annotated_default_document() 사용(섹션 주석 포함).
- doctor 에 config_migration 체크 추가(미동기 시 ok=false + hint).
- tests/config_migrate.rs 4개(백업/atomic/dry-run/멱등/doctor) 통과.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-31 12:09:31 +00:00
bd7c4fd7ef feat(config): config 마이그레이션 엔진 (reconcile + step 체인)
- toml_edit 0.22 의존성 추가
- migrate.rs: CURRENT_SCHEMA_VERSION=2, annotated_default_document(주석
  카탈로그 공유 원천), reconcile(빠진 섹션/키 주석과 함께 추가, 값 불가침),
  step_1_to_2(workspace.include 제거), migrate_document(step+reconcile+stamp)
- schema_version default 1 → 2
- 56 tests green, clippy -D warnings clean

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-31 12:09:31 +00:00
4dcb4a45d6 feat(config): migrate 모듈 스캐폴딩 + toml_edit 의존성
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-31 11:41:32 +00:00
6d86214060 docs(plan): config 마이그레이션 구현 계획 (TDD, 13 tasks)
reconcile(additive)+step 체인(non-additive) 분리, init/migrate 공유
annotated_default_document, app facade 백업+atomic write, doctor 체크,
CLI config migrate, wire config_migration.v1. bite-sized TDD steps.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-31 11:39:31 +00:00
6bbb8f854b docs(spec): config 마이그레이션 설계 계약
kickoff 인계(#197)의 brainstorm 결과를 확정한 spec. 트리거=명시 명령
`kebab config migrate`+doctor 안내, 주석 보존=toml_edit 부분 편집,
메커니즘=reconciliation(additive)+step 체인(non-additive) 하이브리드.
init/migrate 가 주석 달린 default 문서를 공유. 안전 3축(멱등·백업·dry-run)
+ atomic write. wire schema config_migration.v1 신설.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-31 11:34:19 +00:00
2a4df4d48d Merge pull request 'docs: config 마이그레이션 작업 인계 kickoff' (#197) from docs/config-migration-kickoff into main 2026-05-31 11:11:17 +00:00
16f3d6eef2 docs: config 마이그레이션 작업 인계 kickoff
config.toml 스키마 진화 시 기존 사용자 파일 자동 마이그레이션 기능의
별도 세션 인계 문서. 현황(serde default forward-compat 있음/파일 마이그레이션
없음/schema_version 장식), 핵심 난점(주석 보존), 설계 3안(전체재작성/toml_edit
append/백업), 트리거(명령 vs 자동), 방법론(v0.21.0 PR #195/#196 패턴) 정리.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-31 11:11:08 +00:00
fa89c7b561 Merge pull request 'docs(readme): v0.21.0 전면 재구성' (#196) from feat/doc-side-expansion into main 2026-05-31 10:44:38 +00:00
a4c81fed86 docs(readme): v0.21.0 전면 재구성
Quick start 를 맨 앞(빠른 사용), 핵심 기능을 중간, 아키텍처·설계를 뒤로
재배치. kebab 무관 내용(ollama sudo-less tarball 설치, CPU 모델 트러블슈팅)
과 구식 버전 태그(fb-XX, p9-fb, V009, v0.17~v0.20.x 산재), stale 버전 문구
제거. v0.21.0 기준(doc-side expansion 별칭, 파생물 캐시, 외부 계산 워크플로)
서술. 302→206 줄.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-31 10:44:29 +00:00
5b7c02fe13 Merge pull request 'feat(expansion): doc-side expansion 별칭 개별 dense 벡터 + 파생물 캐시(V012)' (#195) from feat/doc-side-expansion into main 2026-05-31 10:25:45 +00:00
88c5b83dea docs: derivation-cache spec/handoff 독자 관점 보강
PR #195 구현(e9b5202) 기준으로 빠졌던 디테일 보강:
- chunk_id(위치 기반 벡터 식별자) vs cache_key(내용 해시 조회 키) 구분 callout
- §7 호환성/마이그레이션 신설: 본문 재색인 불필요, V012 가산이나 binary 교체 필요,
  별칭 sentinel 묶음→개별 변경의 기존 KB 영향(레거시 호환)
- version_key 에 kind 토큰("doc|") 반영, orphan sentinel cleanup(LIKE prefix) 명시
- embed_with_cache 순서 보존 불변, 별칭 개별 벡터 근거(희석 13/18→16/18)
- 정정: derivation_cache_gc 는 메서드만 존재하고 미연결(캐시 현재 무한 누적, 후속)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-31 10:25:00 +00:00
2619b7bff7 test(chunk): AST snapshot fixture에 aliases:null 필드 반영
Chunk 구조체에 aliases 필드가 추가된(별칭 인프라) 뒤 chunk-*-ast-v1
snapshot fixture 들이 미갱신 상태로 남아 drift FAIL 이었다. chunk_id·
text·policy_hash·tokenized 는 전부 불변 — 직렬화에 "aliases": null 한
필드만 추가됐다(청크 생성 로직 무변경, 회귀 아님). UPDATE_SNAPSHOTS=1 로
10개 fixture(code c/cpp/go/java/js/kotlin/python/rust/ts + long_section)
재베이크.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-31 09:57:16 +00:00
e9b520216e fix(expansion): per-alias sentinel orphan cleanup + 캐시 견고성 (PR #195 리뷰)
MAJOR: 별칭 dense 벡터의 chunk_id 가 레거시 단일 `{id}#alias` 에서 줄별
`{id}#alias#0`, `#alias#1`, … 로 바뀌었으나 orphan cleanup 이 단일 sentinel
하나만 삭제해 `#alias#N` 벡터가 LanceDB / embedding_records 에 누수됐다.

- kebab-app: `alias_sentinel_ids_to_delete` 헬퍼 추가(접근법 A) — 본문 +
  legacy `{id}#alias` + `{id}#alias#0`..`{id}#alias#{max-1}` 를 모두 delete-set
  에 포함. max=expansion.max_aliases_per_chunk(= parse_aliases 의 하드 cap)와
  일치. parser-bump / edited-asset / deleted-file 세 LanceDB cleanup 경로 모두
  이 헬퍼를 사용.
- kebab-store-sqlite: embedding_records 명시 DELETE 4 경로(put_chunks /
  purge_*_except_doc_id / purge_orphan_at_workspace_path /
  purge_deleted_workspace_path)를 정확 일치(`|| '#alias'`)에서 `{id}#alias%`
  프리픽스 LIKE 로 전환. 본문 chunk_id 는 32자 hex 라 LIKE 와일드카드 없음.

MINOR 1: alias 캐시 히트 시 비-UTF8 payload 를 미스로 강등(재생성 분기로)
— embedding 경로의 decode-실패→미스 강등과 동작 일치.
MINOR 2: embedding version_key 맨 앞에 kind 토큰("doc") 추가 — 임베더가
kind 별 프리픽스를 붙이므로 미래에 query 임베딩이 같은 캐시를 타도 충돌 방지.

회귀 테스트:
- kebab-app: alias_sentinel_ids_to_delete 단위 테스트 2건.
- kebab-store-sqlite: per-alias sentinel embedding_records 가 세 cleanup
  경로 모두에서 사라지는지 핀하는 통합 테스트 3건.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-31 09:14:34 +00:00
a8fd76499c feat(expansion): doc-side expansion 별칭 개별 dense 벡터 + 파생물 캐시(V012)
별칭을 줄별 개별 dense 벡터(sentinel `{chunk}#alias#N`)로 색인하고
boilerplate 청크는 별칭 생성을 skip. 묶음 1벡터 방식은 평균화로 특정
표현이 희석돼 오히려 회귀(13/18)했던 것을 폐기. 변형 일관성 14/18 →
16/18, mean_spread@10 0.222 → 0.111 (나무위키 ~1000 문서 CS corpus).
`kebab-core::strip_alias_suffix` 가 suffix 형과 per-alias 형 둘 다 처리.

파생물 캐시(V012): embedding 벡터 + 별칭 LLM 결과를 청크 내용 해시
키로 캐싱해 재색인 시 내용 불변 청크의 재계산을 skip. cache_key =
blake3(kind ‖ text_blake3 ‖ version_key)[:32], version_key 에
model/prompt/dimensions 포함 → §9 cascade 와 정합(버전 bump 시 자동
miss). 측정: 정답 3개 cold 1879s → warm 13s ≈ 145배. 순수 가산이라
corpus_revision bump 없음. search/ask 는 kebab.sqlite+lancedb 만으로
동작 → 외부 서버 색인 후 DB 만 복사하는 이식 워크플로 가능.

V012 schema migration + 신규 surface 로 workspace version 0.20.2 →
0.21.0 (minor) bump. README/HANDOFF/ARCHITECTURE/HOTFIXES sync.
known limitation: stack·svm 설명형 2개 잔존 + grounded 판정이 부분
인용을 grounded 로 오분류(후속 후보).

측정 상세: docs/superpowers/handoffs/2026-05-31-namu-wiki-alias-cache-study.md

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-31 08:24:04 +00:00
0282a81c67 fix(store): CASCADE 대체 4번째 경로 + V011 CHECK 복원 (Task 4.5 리뷰)
리뷰 MAJOR: purge_document_at_workspace_path_except_doc_id(parser-bump 경로)에
원본+sentinel embedding_records 명시 DELETE 누락 → tombstone 누적. 추가 +
회귀 테스트. MINOR: V011 status CHECK(pending/committed/tombstone) 복원.
NIT: foreign_keys PRAGMA no-op 주석.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-30 14:02:46 +00:00
f3587b7143 feat(store): filter_chunks sentinel 별칭 candidate strip (committed 통과)
LanceDB 후보의 sentinel chunk_id({orig}#alias)는 chunks JOIN 에서 탈락해
VectorRetriever strip 이전에 사라진다. candidate 를 kebab_core::strip_alias_suffix
로 원본 chunk_id 로 strip 해 IN-list/JOIN 에 넣어(committed 판정은 원본 body chunk
기준) 통과시키되, 반환은 입력 candidate 형태(sentinel 유지) — VectorRetriever 가
그 sentinel 을 받아 strip+dedup 한다. SQL replace 대신 (b) Rust strip 채택(명확).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-30 13:41:28 +00:00
483b1ec06b feat(store): V011 embedding_records FK 제거 + CASCADE 대체 명시 DELETE (sentinel 별칭 벡터)
별칭 dense 벡터를 sentinel chunk_id({orig}#alias)로 색인하려면 chunks 에 없는
chunk_id 가 embedding_records 에 들어가야 한다. V001 의 chunk_id REFERENCES chunks
ON DELETE CASCADE FK 가 이를 SQLite 787 로 막으므로 테이블을 FK 없이 재생성한다.
status/vector_committed(V003) + 3개 인덱스 보존, chunks_bd_tombstone_embeddings
trigger 무수정. DROP→RENAME 시 dangling trigger 재파싱을 피하려 legacy_alter_table=ON.

사라진 CASCADE 는 put_chunks + purge 두 경로(purge_orphan_at_workspace_path,
purge_deleted_workspace_path)의 명시 DELETE 로 대체 — chunks 삭제 직전 원본 +
{id}#alias sentinel embedding_records 를 함께 정리. corpus_revision baseline 2→3.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-30 13:41:20 +00:00
d279f343e7 docs(spec,plan): 별도 벡터 인프라 — FK 제거(V011) + CASCADE 대체 + filter_chunks
PoC: 별칭 순수 벡터가 영어 설명형 rank 7~30 (concat 본문 희석으로 미회복) →
별도 벡터 명분. 차단요인 3건: embedding_records FK(787, V011 재생성),
CASCADE 대체(명시 DELETE), filter_chunks sentinel strip. plan Task 4.5/4.6.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-30 13:25:45 +00:00
b56469f010 fix(core): clippy uninlined_format_args — strip_alias 테스트 (리뷰 MAJOR-1)
workspace clippy --all-targets -D warnings 게이트 통과. format! 인자 인라인.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-30 11:24:04 +00:00
6ba8cb2c88 feat(search): VectorRetriever sentinel 별칭 strip + dedup
별칭 dense 벡터({orig}#alias) hit 을 원본 chunk_id 로 strip 해 hydrate,
body+alias 중복은 첫(높은 score) 하나만 유지. overfetch 2→3 (dedup 후 k
확보). wire/RetrievalDetail 무변경. vector/hybrid 회귀 0, clippy green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-30 11:09:32 +00:00
afa8af0f88 feat(app): 별칭 dense 별도 벡터 색인 + purge (sentinel) 2026-05-30 10:48:58 +00:00
b9d20d23d1 feat(config): ingest.expansion.embed_aliases flag (default off) 2026-05-30 10:31:07 +00:00
86b4e1ebd0 feat(core): ALIAS_SUFFIX + strip_alias_suffix (dense alias vectors) 2026-05-30 10:31:03 +00:00
825543549d docs(plan): 별칭 dense 별도 벡터 구현 plan
ALIAS_SUFFIX(core) → embed_aliases flag → ingest sentinel 벡터+purge →
VectorRetriever strip+dedup → 측정. TDD, 완성 코드. doc-side expansion PR.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-30 10:28:43 +00:00
bcb8b93751 docs(spec): 별칭 dense 별도 벡터 설계 spec
PoC(concat) 측정: dense 별칭이 6/0/2/0.25 (설명형은 dense 본령 실증), 단
영어 설명형 2개는 concat 본문 희석으로 미회복. 처방: 별칭을 sentinel
chunk_id 별도 벡터로 색인(본문 벡터 불변=회귀 안전, 별칭 순수 신호).
flag ingest.expansion.embed_aliases default off. lexical 완화는 폐기.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-30 10:26:24 +00:00
116b3e6377 fix(app): clippy unused_self — build_request 를 associated fn 으로
CI 게이트(clippy --workspace --all-targets -D warnings) 통과. 동작 동일.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-30 03:47:06 +00:00
69b53d1c97 docs(spec): doc-side expansion 검색 메커니즘을 shipped 구현에 맞춰 정정
Task 6 리뷰 MINOR-1: spec 본문이 단일 UNION ALL+GROUP BY 로 기술됐으나
shipped = 2-query(run_query+run_alias_query) + Rust merge_body_alias(body 우선).
서로 다른 FTS 테이블 bm25 절대값 비교가 무의미해 body-우선 merge 가 더 깨끗.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-30 03:20:13 +00:00
a271352e33 feat(search): lexical body+alias 병합 검색 (pool-rescue)
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 03:12:14 +00:00
cde4d75f6b feat(app): ingest 별칭 생성 hook (flag off 기본, fail-soft) 2026-05-30 03:03:09 +00:00
bddcd53688 fix(app): parse_aliases 접두 제거가 숫자/하이픈 선두 별칭 손상 (Task 4 리뷰 MAJOR-1)
탐욕적 trim_start_matches → 명시적 strip_list_marker(마커+공백 패턴만 1회).
"3D 렌더링"/"2단계"/"-fast" 보존, "- "/"1. " 마커만 제거. 회귀 테스트 2개.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-30 02:49:25 +00:00
2a207f9868 feat(app): ExpansionGenerator — 청크당 별칭 생성 (fail-soft)
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 02:36:20 +00:00
cc31868d24 feat(config): [ingest.expansion] flag (default off) 2026-05-30 02:26:41 +00:00
0df47febf0 test(store): doc-side expansion Task 2 리뷰 보강 (M1/M2/N1)
- M1: chunk_aliases trigger 가드에 AND aliases <> '' (빈 문자열 미색인)
- M2: 재색인 멱등 테스트 (재-put 후 별칭 행 1개)
- N1: 본문 격리 음성 단언 (별칭 term 이 chunks_fts 로 누출 안 됨)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-30 02:24:24 +00:00
b12a616ab2 feat(store): V010 chunk_aliases_fts + put_chunks 별칭 영속화
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 02:15:27 +00:00
848b75c069 feat(core): Chunk.aliases 필드 (doc-side expansion)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-30 02:09:39 +00:00
467a974901 docs(plan): doc-side expansion 구현 plan + spec 정제 (별도 FTS 테이블)
spec: chunks_fts §5.5 verbatim 충돌 회피 → 별도 chunk_aliases_fts 테이블 +
lexical 내부 body+alias 병합(RetrievalDetail/wire schema 무변경)으로 정제.
plan: 7 task TDD (Chunk 필드 → V010 → config → ExpansionGenerator →
ingest hook → lexical 병합 → 측정/문서). 완성 코드 + 빌드 규약.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-30 02:04:58 +00:00
098413922b docs(spec): 색인시 doc-side expansion 설계 spec (Phase 2)
brainstorm 확정: 청크당 별칭 생성(같은언어+한↔영 번역), additive+수동
재색인, 1차 단순 품질제어. 별도 FTS5 aliases 채널 → RRF 3채널 융합.
flag off 기본, kebab eval variants 로 on/off 측정.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-30 01:54:46 +00:00
695010ea7a Merge pull request 'docs: Phase 2 doc-side expansion 킥오프 + 구현 방법론 핸드오프' (#194) from docs/phase2-doc-expansion-kickoff into main
Reviewed-on: #194
2026-05-30 01:23:48 +00:00
8bb7c276d0 docs: Phase 2 doc-side expansion 킥오프 + 구현 방법론 핸드오프
새 세션이 Phase 2(색인시 doc-side expansion)를 자립적으로 이어받을 컨텍스트 문서.
배경(rerank 반증→재정의→Phase1 진단 B우세→딥리서치→PoC), 설계 방향(KO↔EN 번역 별칭
+ 별도 FTS5 필드 + RRF, flag off), 이미 만든 측정 도구(kebab eval variants + dogfood golden),
그리고 지금까지와 동일한 구현 방법론(brainstorm→spec→plan→OMC teammate sequential 구현+리뷰
+독립검증, 모델 라우팅, 빌드 redirect+exit, 측정=variant eval 프록시금지, gitea-pr 리뷰루프)을 담음.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-30 01:19:14 +00:00
01a03463a6 Merge pull request 'feat(eval): 변형 일관성(query-paraphrase robustness) 평가 프레임워크' (#193) from feat/paraphrase-robustness-eval into main
Reviewed-on: #193
2026-05-30 01:12:26 +00:00
b6ad947378 docs: README 명령 표 슬림 + ARCHITECTURE 상세 이전·동기화
README 의 괴물 셀(ingest 2891→544, search 2952→687, ask 1244→415, tui 2300→453자)을
"무엇 + 핵심 flag + 포인터"로 축소. 빠진 구조 detail 은 ARCHITECTURE 로 이전:
- symbol path 형식에 Go/Java/Kotlin/C/C++ 추가 + code chunk provenance(citation.kind/code_lang/repo)
- Markdown title 자동 채움 순서(md-frontmatter-v2)
- RAG groundedness 검증(mDeBERTa-v3 XNLI, nli_threshold gate) 결정 행 신설
- TUI 행을 P9-1~4 완료 + F1 cheatsheet 로 최신화 (stale "진행 예정" 제거)
flag 망라는 --help, TUI 키는 in-app F1 cheatsheet(권위 런타임 소스)로 위임 — stale 방지.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-30 01:10:39 +00:00
1529e6d991 docs(readme): PR #193 회차 1 리뷰 반영 — eval 명령 표에 aggregate/variants 추가
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-30 01:00:32 +00:00
5ad1f98227 docs(handoff): doc-side expansion 딥리서치 + PoC 결과 (Phase 2 방향 확정)
딥리서치(104 agent): 어휘격차 pool-miss 최선책 = 색인시 doc-side expansion.
PoC(dogfood KB): recall@50=0 이던 3쿼리가 별칭 추가로 rank1~2 부활(hybrid+vector,
골든 verbatim 아님=일반화). 핵심 미검증 고리 실 corpus 정량 확인.
Phase 2 = 색인시 doc-side expansion(KO↔EN 번역 별칭) → 별도 FTS5 필드 → RRF, flag off.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-30 00:53:24 +00:00
a58cae2ff3 docs(research): 어휘격차 pool-miss 해결 딥리서치 레퍼런스
deep-research 워크플로(104 agent, 5각도, 22소스, 25 claim 3-vote 검증, 22 confirmed/3 killed).
결론: 색인시 doc-side expansion(doc2query)이 pool-miss 최선책 — pool 자체를 키우고
per-query 지연 ~0(색인시 1회), 정확매칭 보존(별도 필드 append). 단 vanilla mt5는 같은언어라
한/영 갭은 색인시 KO↔EN 대체 query 생성 필요. query-side(HyDE=거부된 per-query LLM,
Vector-PRF=recall 주장 기각)는 부적합. 검증은 기존 variant eval 로 가능.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-30 00:53:24 +00:00
7a1dff1684 docs(handoff): query-paraphrase robustness Phase 1 완료 + (A)/(B) 진단
8그룹×4변형(dogfood) 측정: groups=8 A_dominant=2 B_dominant=4 spread@10=0.750.
진단 — 문제는 한/영이 아니라 어휘 거리(영어 풀어쓴 문장도 miss, 일부 한국어는 OK).
B(어휘격차, recall@50=0, rerank 불가) 우세 → 쿼리 확장/번역 처방 신호. A(순위출렁)는
cap_theorem/vector_database 2그룹뿐. "측정 먼저" 논제 정량 검증(rerank 단독은 부분해법).
Phase 2 처방 결정 대기.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-30 00:53:24 +00:00
0988f66331 feat(cli): kebab eval variants <run_id> — 변형 일관성 진단 리포트
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-30 00:53:24 +00:00
82e02aa4fe fix(eval): 변형 일관성 리뷰 H1/M1 — pool truncation 방어 + answer 판정 정렬
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-30 00:53:24 +00:00
db4af0cc72 feat(eval): 변형 일관성 메트릭 + A/B(순위출렁/어휘격차) 분류
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 00:53:24 +00:00
ab20202241 test(eval): Task1 리뷰 nit — 3+멤버 그룹/group=None 테스트 + 에러 메시지에 divergent query id
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-30 00:53:24 +00:00
a51e6395c0 feat(eval): GoldenQuery.group + 그룹 정합성 검증 (변형 일관성 기반) 2026-05-30 00:53:24 +00:00
fe4c854673 docs(plan): query-paraphrase robustness Phase 1 구현 계획
5개 task: (1) GoldenQuery.group + 그룹 정합성 검증, (2) 변형 일관성 메트릭
모듈 + A/B(순위출렁/어휘격차) 분류, (3) kebab eval variants CLI, (4) dogfood
golden 변형 그룹 큐레이션, (5) 측정 + 진단 리포트. TDD bite-sized, 완성 코드.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-30 00:53:24 +00:00
1de3f4ffca docs(spec): query-paraphrase robustness 평가 프레임워크 설계 (측정 먼저)
목표 재정의: 한/영 overlap → 같은 의미의 다양한 표현(동의어·다른 어휘·풀어쓴
문장·한영)에서 일관된 답변 품질. 지난 reranker 실험이 overlap 프록시 최적화로
헛돈 교훈 반영 — 처방 전 진짜 지표(변형 일관성)를 직접 재는 평가부터.

Phase 1(본 spec 구현): kebab-eval golden suite에 변형 그룹(intent group) +
변형 일관성 메트릭(recall_spread, answer_consistency) + recall@pool vs recall@k로
(A)순위출렁/(B)어휘격차 자동 판별. Phase 2(처방)는 측정 결과 게이트 뒤 조건부.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-30 00:53:24 +00:00
7fbfec647d Merge pull request 'feat: v0.20.2 — Ask 응답언어 rag-v3 + 8 dogfood findings + 검색 품질 eval baseline' (#192) from fix/v0-20-2-dogfood-findings into main
Reviewed-on: #192
2026-05-29 05:27:52 +00:00
ca8c83b1ba chore(hotfixes): PR #192 회차 1 리뷰 반영 — refusal marker 표기 정정
`<REFUSE>` marker → citation marker(`[#번호]`) 유무 기반 (pipeline.rs:463-486).
release-notes 정정과 일관.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-29 05:22:57 +00:00
6c611990d8 chore: bump version 0.20.1 -> 0.20.2
v0.20.2 — Ask 응답언어 자동매칭(rag-v3) + 8 dogfood findings
+ eval --config facade 패치 + 검색 품질 golden baseline
(hybrid hit@3=1.0 / MRR=0.833, lexical hit@3=1.0).
evidence: tasks/HOTFIXES.md 2026-05-29, docs/release-notes/v0.20.2-draft.md.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-29 05:10:42 +00:00
166b1404e4 docs(release-notes): correct refusal判정 mechanism + O-2 phrasing
leader review of writer draft: refusal 판정은 citation marker(`[#번호]`)
유무 기반이며 `<REFUSE>` 특수 마커가 아님. O-2 문구 예시도 실제 rag-v3
규칙으로 정정.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-29 04:58:08 +00:00
2d0168b7ab docs(sync): README + HANDOFF v0.20.2 surface 반영
CLAUDE.md docs-split 규칙에 따라 사용자 visible surface 변경 동기화.

README:
- [rag] prompt_template_version default rag-v2 → rag-v3 (v0.20.2)
- v3 규칙 설명 (답변 언어 = 질문 언어)
- O-2 known limitation (소형 모델 refusal 언어 불일치)

HANDOFF:
- 머지 후 발견된 버그/결정 에 v0.20.2 1줄 요약 추가
- 검색 품질 baseline (hybrid MRR=0.833) + O-2 known limitation 언급

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-29 04:54:26 +00:00
4afcaf96d2 docs(release-notes): add v0.20.2-draft (rag-v3 응답언어 + 검색 품질 eval 인프라)
v0.20.2 릴리즈 노트 초안 작성. 사용자 영향 4단락 구조로 각 finding 기술.

- Finding #1/O-2: rag-v3 응답언어 자동 매칭 + refusal 언어중립화
- Finding #2: bulk search input schema 확정 (15필드)
- Finding #3: list docs human-readable path 보강
- Finding #7: index_version 두 곳 구분 (vector vs FTS5)
- eval --config facade + 검색 품질 baseline (hybrid hit@3=1.0 / MRR=0.833)
- Finding #4/#5/#6/#8: docs/schema 정비
- version cascade 주의 (rag-v3 → eval compare)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-29 04:54:19 +00:00
16c4579399 docs(hotfixes): add 2026-05-29 v0.20.2 dogfood findings + 검색 품질 baseline
8-finding 도그푸딩 라운드 및 검색 품질 baseline 결과를 HOTFIXES 에 기록.

- 8 findings 요약 표 (rag-v3, bulk schema, list docs, index_version 등)
- Finding O-2 known limitation (소형 모델 refusal 언어 불일치)
- 검색 품질 baseline 표 (hybrid MRR=0.833, lexical MRR=0.7)
- golden 큐레이션 교훈 (dispatch.py 정답 정정 → hit@3 0.9→1.0)
- eval logs cross-link

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-29 04:54:11 +00:00
40d7faee71 docs(dogfood): add §10.2 search quality baseline scenario (v0.20.2 golden suite)
eval --config facade 패치로 dogfood KB 직접 평가 가능해짐에 따라
§10 Eval 에 §10.2 검색 품질 baseline 섹션 추가.

- golden suite 실행 명령 (hybrid + lexical eval run → aggregate)
- v0.20.2 metric baseline 표 (hybrid hit@3=1.0 / MRR=0.833)
- 정성 체크리스트 (한국어 2자 hit@3, empty=0, MRR 임계치)
- golden 큐레이션 절차 + dispatch.py 오류 교훈
- §10.1 로 기존 basic eval run 재구성

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-29 04:54:03 +00:00
a3bb2580bf test(rag): add rag-v3 dispatch integration test + refresh stale rag-v2 docs (code-review)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-29 04:46:27 +00:00
2429189447 docs(spec): reflect search-quality critic round-1 (eval --config, lang-filter non-goal, curation)
Incorporates all critic (opus) round-1 findings into the dogfood
search-quality eval design spec:

BLOCKER-1: §4.4 execution commands now use --config /build/dogfood/config.toml
  (Task A facade-rule patch makes this the canonical path).  §5.1 re-titled
  from "(후속 패치)" to "Task A로 적용됨 — 권장 운영 경로"; XDG workarounds
  demoted to "패치 전 fallback".  Intro paragraph updated accordingly.

MAJOR-1: §3 Non-Goals gains an explicit bullet: lang/media/code_lang
  SearchFilters validation is out of scope for this harness (runner uses
  SearchFilters::default(), runner.rs:151).  §4.1 "code 검색" row no longer
  claims code_lang filter coverage.

MINOR-1: §4.3 step 3 now names kebab inspect doc <id> as the primary
  chunk-selection path (breaks chunk-level curation loop); search hits
  demoted to "보조 확인용".

MINOR-2: §4.1 golden category table gains two new rows — 한국어 N-gram
  fallback query (복합어/신조어 coverage) and 영어 whole-token exact query
  (separates substring artefacts).

MINOR-3: §4.1 YAML header note added: record corpus_revision in golden
  file so stale-bail root cause is immediately traceable.

NIT: §9 References line numbers corrected (runner.rs:31, metrics.rs:116/144);
  runner.rs:151 SearchFilters::default() reference added.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-29 03:43:00 +00:00
d93b757cf1 fix(cli): thread --config through kebab eval run/aggregate/compare (facade-rule)
Cmd::Eval now loads Config via cli.config (same pattern as all other
subcommands) before dispatching to the inner match.  Each arm now calls
the *_with_config variant:

  run_eval(&opts)             → run_eval_with_config(&cfg, &opts)
  compute_aggregate(run_id)   → compute_aggregate_with_config(&cfg, run_id)
  store_aggregate(run_id, ..) → store_aggregate_with_config(&cfg, run_id, ..)
  Compare already called compare_runs_with_config but sourced cfg from
  Config::load(None) — that redundant load is removed; cfg comes from
  the shared binding above.

Fixes the same facade-rule regression pattern as P3-5 / P4-3: previously
`kebab --config /build/dogfood/config.toml eval run` silently evaluated
the XDG-default (empty) KB instead of the dogfood KB.

Also fixes runner.rs test that hardcoded rag-v2 after commit 5719969
bumped the default prompt_template_version to rag-v3.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-29 03:42:40 +00:00
571996938c docs(contract): bump default prompt_template_version to rag-v3 (Todo #1)
line 899: V1만 legacy → V1/V2 둘 다 legacy, v0.20.2 부터 rag-v3 default 선언.
line 1349 (★): config 예시 default rag-v2 → rag-v3.
line 1533 (★): §9 cascade table 코드 상수 rag-v2 → rag-v3.
line 287 이후: answer.v1 예시 블록에 historical snapshot 주석 추가 (n1 — model+ptv stale, 값 변경 안 함).
task spec grep 판단: tasks/p9/p9-fb-15 의 rag-v2 언급 2줄은 rag-v2 도입 시점 historical 기술 → frozen 유지.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-29 02:45:13 +00:00
be79bdb83d docs(#7): distinguish vector-store vs FTS5 index_version (Todo #7)
schema.schema.json models.index_version: vector store (LanceDB) version 임을 명시.
search_hit.schema.json index_version: lexical (FTS5) version 임을 명시.
search_hit.schema.json retrieval: 내부 필드 목록 + hybrid 전용 fusion 설명 추가 (hunk 공유).
README kebab schema 행: index_version 두 곳의 의미가 다름을 주의 표기 추가.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-29 02:45:04 +00:00
4e76f103c1 docs(#5,#6): clarify retrieval.* nesting + single-mode score relation (Todo #5/#6)
README Score 해석 절에 score ↔ retrieval.* 구조 설명 추가:
- fusion_score/lexical_score/vector_score/lexical_rank/vector_rank 는 retrieval 내부 (top-level 아님).
- single-mode 에서 score==fusion_score==lexical/vector_score 가 같은 값인 것은 정상 (Finding X).
search_hit.schema.json score 필드에 score_kind 관계 + single-mode 동일값 이유 설명 추가.
search_hit.schema.json retrieval/index_version 설명은 Task 12 커밋에 포함 (같은 hunk).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-29 02:44:56 +00:00
4fd672193f docs(#4): clarify lang vs code_lang semantic and und=code (Todo #4)
lang_breakdown description에 code 문서는 자연어 감지 미수행(lang="und" 정상) 사실 추가.
README에 lang vs code_lang 설명 절 신규 추가.
task spec grep: tasks/p9/p9-fb-15 의 rag-v2 언급은 historical 기술 → frozen 유지.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-29 02:44:34 +00:00
1454321b12 fix(rag): rag-v3 refusal/hedge phrasing follows answer language (Finding O-2)
SYSTEM_PROMPT_RAG_V3: 한국어 리터럴 refusal/hedge 문구를 언어 중립으로 교체.
- 근거가 부족하면 "근거가 부족하다"고 답한다. → 답변 언어로 근거가 부족함을 밝히고 [#번호] 인용 없이 답한다.
- 근거가 모호하면 "확실하지 않다" 라고 명시한다. → 근거가 모호하면 답변 언어로 불확실함을 명시한다.

MULTI_HOP_SYNTHESIZE_SYSTEM_PROMPT: 동일 패턴 두 곳 교체.
- 근거가 부족하면 "근거가 부족하다"고 답한다. → 답변 언어로 근거가 부족함을 밝히고 [#번호] 인용 없이 답한다.
- self-check 의 즉시 "근거가 부족하다" 라고만 답한다. → 즉시 답변 언어로 근거가 부족하다고만 답한다.

refusal 판정 로직(citation marker 기반)은 무변경 — 문구만 언어 중립화.
test rag_v3_contains_v2_rules_plus_language_rule: "확실하지 않다" assert → "불확실함" assert 로 갱신.
task spec grep: tasks/p9/p9-fb-15 의 rag-v2 언급은 도입 시점 historical 기술 → frozen 유지.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-29 02:43:28 +00:00
649ec35108 feat(cli): add Ollama remote-endpoint hint to kebab init (Todo #8)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-29 02:19:34 +00:00
dece5e89fc feat(bulk): document bulk search input schema + error shape hint (Todo #2)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-29 02:16:21 +00:00
3cb49f1f9b feat(cli): show title + doc_path in list docs human output (Todo #3)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-29 02:03:10 +00:00
f5ff823984 docs(dogfood): add RAG response-language scenario (Todo #1 verification) 2026-05-28 21:28:22 +00:00
b82eaec21a feat(config): default prompt_template_version rag-v2 -> rag-v3 (Todo #1) 2026-05-28 21:20:59 +00:00
6daa43375b feat(rag): add rag-v3 system prompt with response-language matching (Todo #1) 2026-05-28 21:20:18 +00:00
85efeeca3e docs(plan): v0.20.2 dogfood findings 구현 plan (15 task)
planner(opus) 작성 → critic 리뷰 시도 → leader 좌표 검증.
8 todo → 15 task: 코드 4 (rag-v3 / list docs / bulk / init) + 각 finding 후
전체 도그푸딩 검증 task 4 + docs-only 3 + contract + HOTFIXES/release-notes + version bump.

plan critic round-1 은 환경 도구 손상으로 좌표 blocker(B-1/B-2/M-1/M-2)를 오진 →
leader 가 pipeline.rs/config/cli/bulk/Cargo.toml 을 직접 grep 검증해 plan 좌표 정확 확인,
executor 용 "anchor grep 재확인" + binary 경로 주의 헤더 추가.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-28 21:09:39 +00:00
2b4ba8e104 docs(spec): v0.20.2 dogfood findings 설계 + round-1 critic 반영
v0.20.1 전체 도그푸딩에서 발견된 8 todo (Ask 응답언어 rag-v3 / doc.lang
docs / bulk input / list title / fusion_score·score_kind / schema
index_version / Ollama hint) 를 단일 patch release 로 설계.

writer worker 초안 → opus critic round-1 리뷰 반영:
- B1: top-level score placeholder → 확정 (score_kind 가 의미 선언, search.rs:95-99)
- M1: 이미지 caption 언어 강제 out-of-scope 명시
- M2: config default 테스트(lib.rs:1316) 갱신 필요 명시
- M3: bulk input 전체 필드 (query/mode/k/trust_min/ingested_after/media/tag/lang)
- M4: rag-v3 의 eval_runs.config_snapshot_json cascade 영향

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-28 20:37:42 +00:00
b08941d6ab docs(handoff): mark brainstorming-needed items in v0.20.1 findings todo
각 todo 에 fix path classification 추가:
- 🧠 필수: design / user expectation 결정 필요 — brainstorming skill 우선
- 🔧 mechanical: spec drift 또는 명확한 fix — 별 brainstorming X
- 📝 mild discussion: spec drafter self-review 로 trade-off 결정 가능

Classification:
- 🧠 필수 (2): #1 Ask 영어→한국어 response policy, #4 doc.lang semantic
- 🔧 mechanical (4): #2 bulk schema, #5/#6 docs sync, #7 schema rename
- 📝 mild (3): #3 list title, #8 Ollama default

추가 brainstorming 후보 (직접 finding 외):
- BS-A: HTML corpus 지원 (1415 file skipped)
- BS-B: Tier 1/2/3 chunker UX visibility
- BS-C: kebab dogfood subcommand (자동화)
- BS-D: 영문 code chunk 의 tokenized_korean_text 효율
- BS-E: builtin_blacklist 명세 노출

권장 워크플로:
1. brainstorming 단계 먼저 (#1, #4 + BS-x 별 검토)
2. mechanical batch (#2, #5, #6, #7) — 한 PR
3. mild discussion batch (#3, #8)
4. dogfood retest → v0.20.2 patch release
2026-05-28 19:45:53 +00:00
6bf4e82e62 docs(handoff): v0.20.1 full dogfood findings todo for next session
머지 후 v0.20.1 의 full dogfood (사용자 실제 corpus 6293 file, 3.5
시간 ingest, §1~§11 시나리오) 발견된 findings 를 새 session 의 self-
contained todo handoff 로 정리.

P0 (bug / 의도와 다른 동작):
- #1 Ask 영어 query → 한국어 응답 (rag-v2 prompt template 강제)
- #2 bulk search input format 불명확 (wire schema 미명시)
- #3 list docs title 중복 (heading-based, doc_path 보조 필요)
- #4 doc.lang = und 53% (code file 의 lang detection 실패)

P1 (docs drift):
- #5 fusion_score 위치 (.retrieval.fusion_score)
- #6 score_kind="bm25" 의미 (lexical mode 의 fusion_score)
- #7 schema index_version vs lexical_index_version 혼동

P2 (setup):
- #8 Ollama endpoint default 가 localhost (사용자 환경 remote)

각 todo 별 severity, scenario, suspected location, action item 명시.
새 session 시작 명령 + branch 권장 + 도그푸딩 재실행 절차 + finding
cumulative table 포함.

Repo state: main HEAD=a0c7fa3, clean. v0.20.1 binary OK. /build/dogfood/
KB (3940 docs, 34896 chunks) preserved for regression test.
2026-05-28 19:40:10 +00:00
a0c7fa3d1a docs(claude): dogfood 보관소를 /build/dogfood/ 로 통합 + 분류 정책 명시
사용자 정정 따라 dogfood data layout 갱신:

1. 위치: /build/cache/dogfood/ → /build/dogfood/. /build/cache 는 의미상
   캐시 (regeneratable downloads/models) 이지 test data 아님.
   /build/dogfood 는 sudo 로 신설 + chown.

2. 분류 정책: kebab version / 생성 시점 / scenario name prefix 금지
   (v0.20.1-dogfood/, dogfood-v018/ 같은 디렉토리 신설 X). 모든 분류
   는 문서 의미 / 종류 / 형식 기준만. 자세한 layout 은
   /build/dogfood/README.md.

3. 단일 디렉토리 정책: source 문서 + KB state + logs 모두 /build/
   dogfood/ 안 하나로. 매 도그푸딩 run 마다 kb/ 만 reset, 별 디렉토리
   신설 X.

4. 금지 위치 명시: /tmp/kebab-*, /build/cache/dogfood*, /home/altair823/
   KnowledgeBase, XDG paths 신규 사용 금지.

Source dirs 정리 (이번 commit 외 별 작업으로 완료):
- /build/cache/dogfood{,-p10b,-v017,-v018,-v0.19.0} 모두 삭제 (move 후).
- /home/altair823/KnowledgeBase, kebab-dogfooding 도 /build/dogfood/ 로 이동.
- XDG paths 는 /build/dogfood/_archive/xdg-state/ 로 snapshot.

최종 corpus: 6293 files (markdown/code/html/manifest/resources), 554M.
2026-05-28 14:44:23 +00:00
ebc6bf45c4 Merge pull request 'feat(v0.20.1): 한국어 morphological tokenizer (V009) + N-gram supplement + eager backfill' (#191) from feat/korean-morphological-tokenizer into main
v0.20.1 — V009 한국어 morphological tokenizer + N-gram supplement (Bug #8 closure).

22 implementation commit + 2 docs polish (dogfood data consolidation + CLAUDE.md trigger policy).

Dogfood evidence: KnowledgeBase 1781 doc / 9050 chunk backfill 26.6초, '한국' query 0 → 10 hit.

PR-level review verdict: APPROVED WITH NOTES (4 minor docs polish, all addressed).
2026-05-28 14:17:13 +00:00
d8fdc815be docs(claude): dogfood trigger + 통합 data 보관소 정책
사용자 요청 — 사용자가 누적된 ad-hoc 도그푸딩 데이터를 /build/cache/
dogfood/ 한 곳에 collection 한 후, 도그푸딩의 필요 시점을 추론해
CLAUDE.md 에 정책 section 추가.

신규 section `## Dogfood trigger` (사이 Release 와 Naming):
- 도그푸딩이 필요한 시점 (6 trigger 분류: schema/migration, wire
  schema/CLI, search/RAG, performance, language/locale, file/asset).
- Release-level: bump commit 이전에 evidence 명시 필수.
- 도그푸딩 데이터 보관소: /build/cache/dogfood/ 의 디렉토리 구조 +
  README.md cross-link + /tmp/kebab-* 신규 사용 금지.
- 도그푸딩 결과 기록: HOTFIXES dated entry + release notes draft 의
  4-단락 풀어쓰기 + DOGFOOD.md scenario catalog cascade.

실 작업:
- /build/cache/tmp/v0.20.1-* 5 디렉토리, /tmp/dogfood-* 2 디렉토리,
  관련 log file 모두 /build/cache/dogfood/ 로 mv. config.toml 의
  hard-coded path 자동 sed-replace.
- /build/cache/dogfood/README.md 신규 — 디렉토리 구조 + 신규 시나리오
  시작 절차 + V007 시뮬레이션 패턴 + 정리 정책.

기대 효과: 도그푸딩 evidence 의 git-tracked HOTFIXES + draft release
notes 외에도 raw data 가 한 곳에서 자유롭게 재사용 가능. 새 release
의 도그푸딩이 이전 KB 위에서 incremental 확인 가능.

Spec: docs/superpowers/specs/2026-05-28-v0.20.x-korean-morphological-tokenizer-spec.md §9 (도그푸딩 evidence cascade)
Plan: post-implementation infrastructure
2026-05-28 14:06:24 +00:00
9f2a56d091 docs(hotfixes): large-scale KnowledgeBase dogfood evidence (N-gram supplement)
사용자 실제 /home/altair823/KnowledgeBase/ (1781 markdown / 9050 chunk)
를 v0.20.1+N-gram supplement 포함 binary 로 backfill 재실행:

- Backfill duration: 26.6초 (9050 chunk, OnceLock 캐시 + 1000-row
  batch transaction). ~3 ms/chunk amortized.
- '한국' query: V007 의 0 hit → V009 + N-gram 의 10 hit (Bug #8
  functional closure 실측 검증).
- '한국어' query: 5 → 10 hit (morpheme + N-gram 동시 매칭).
- 영어 whole-token: 'token'/'pipeline'/'config' = 10 hit each
  (V009 회귀 측면 정상).

Snippet evidence: KB 의 testdata/coding-md-corpus/*/...md 의
"문서를 한국어로 다시 정리하기" 패턴이 ko-dic 분해 + N-gram window
로 '한국' query 매칭 demonstrate.

기타 한국어 (서울, 지하철, 대한민국 등) 0 hit 는 KB corpus 의
단어 자체 부재 — data limitation, V009 implementation limitation X.

Test data 위치:
- /home/altair823/KnowledgeBase/ (사용자 실제 KB, 1781 markdown)
- /build/cache/tmp/v0.20.1-dogfood/kb/ (ingested SQLite + LanceDB)
- /build/cache/tmp/v0.20.1-dogfood2/corpus/ (한국어 wiki fixture)
- /build/cache/tmp/v0.20.1-v007strict/corpus/no-space.md (whitespace-less)
- /build/cache/tmp/v0.20.1-ngram/corpus/extra.md (대한민국, 한국정부, 주민등록번호)

Spec: docs/superpowers/specs/2026-05-28-v0.20.x-korean-morphological-tokenizer-spec.md §9 + Appendix B
Plan: docs/superpowers/plans/2026-05-28-v0.20.x-korean-morphological-tokenizer-plan.md (dogfood evidence final)
2026-05-28 14:02:02 +00:00
fe20be8195 feat(chunk): N-gram supplement (Option β) — sub-token emit for Korean compounds
#4 (사용자 요청): spec §6.2 의 Option β (sub-token 추가 emit) 를
v0.21.x P9 follow-up 에서 v0.20.1 implementation 으로 promote.
dogfood 의 ko-dic compound noun limitation (`대한민국`, `한국정부`,
`주민등록번호` 등 단일 token 정책) 해소.

Implementation (`crates/kebab-chunk/src/lib.rs::tokenize_korean_morphological`):
- 신규 helper `is_hangul()` — 한글 음절 (U+AC00..D7A3) + 자모
  (U+1100..11FF, U+3130..318F) 판정.
- lindera output 의 각 morpheme 에 대해, 한글만 + 길이 ≥ 3 인 경우
  sliding window 2-gram 추가 emit. `[한국정부, 한국, 국정, 정부]`
  형태로 token list expand.
- 영어 / 숫자 / 혼합 token 은 supplement X (false positive 회피).

Tests (`crates/kebab-chunk/tests/tokenize_korean.rs`):
- `tokenize_korean_morphological_emits_2gram_for_long_morpheme`: 5 probe
  fixture 중 supplement 발화 case 확인 (실측 `서울특별시` →
  `[서울, 특별시, 특별, 별시]`, `대한민국` → `[대한민국, 대한,
  한민, 민국]`).
- `tokenize_korean_morphological_no_2gram_for_english`: Rust optimization
  fixture 에서 영어 substring (`Rus`, `ust`, `imi`) emit 없음 보장.

Dogfood evidence (`tasks/HOTFIXES.md` 2026-05-28 entry 보강):
- '대한', '한민', '민국' query 모두 hit (대한민국 의 sliding window).
- '특별', '주민', '등록' 같은 sub-token query hit.
- 영어 'tokenizer' query 는 corpus 부재로 0 hit (supplement X).
- Trade-off: DB size +20-30% (Korean-heavy), false positive 작은 risk.

Spec: docs/superpowers/specs/2026-05-28-v0.20.x-korean-morphological-tokenizer-spec.md §6.2 (Option β promote)
Plan: docs/superpowers/plans/2026-05-28-v0.20.x-korean-morphological-tokenizer-plan.md (post-implementation enhancement)
2026-05-28 13:48:05 +00:00
028d9ad4ea docs(release): v0.20.1 release notes draft + spec/plan dogfood cross-link
#1 (사용자 요청): release notes draft 작성 + spec/plan 의 dogfood
evidence cross-link 보강.

docs/release-notes/v0.20.1-draft.md (신규):
- 4 단락 본문 (한국어 2자 query 지원 + 영어 substring 회귀 + V007→V009
  자동 backfill + ingest 성능 영향).
- Migration cascade table (lexical_index_version, corpus_revision,
  wire schema shape preservation).
- API + dependency 변경 (lindera v3, lindera-ko-dic v3, retired
  short_query_hint helper, 새 facade APIs).
- Breaking changes 명시 (영어 substring 회귀, 첫 부팅 latency, DB/
  binary 크기 증가).
- Upgrade 절차 + Known limitation + 14 dogfood scenario reference.

spec Appendix B (segmentation evidence):
- "Empirical verification (2026-05-28 dogfood — post-merge update)"
  subsection 신규. prior-knowledge 가정 vs 실측 결과 table. Scenario
  1-4 모두 verified 표시. ko-dic 의 '서울특별시' → '[서울, 특별시]'
  분해 증거 명시.

plan Changelog:
- post-implementation entry: 22 commit on branch, S3 blockers, S7
  cascade, S11 sanity regression updates, opus PR review 4 finding
  fixes.
- dogfood evidence entry: 14 scenario verify pass, ko-dic 분해
  evidence, HOTFIXES + spec Appendix B cross-link.

Spec: …spec…md Appendix B
Plan: …plan…md (post-implementation + dogfood evidence Changelog)
Release notes: docs/release-notes/v0.20.1-draft.md
2026-05-28 13:34:33 +00:00
a3513c9110 docs(hotfixes): V009 dogfood verification evidence (2026-05-28)
V009 한국어 morphological tokenizer 의 dogfood 검증 결과를 HOTFIXES
2026-05-28 entry 에 보강. 14 scenario 의 hit count + ko-dic 의
compound noun 분해 evidence (서울특별시 → [서울, 특별시]) + Option α
acceptance 의 known limitation 명시.

Reference corpus: DOGFOOD.md §2.1bis 의 korea-overview.md +
korea-compound.md (10 KB 합계, 2 markdown). KB ingest + 14 query
검증 모두 expected.

사용자 KnowledgeBase 같은 영어/code 중심 KB 에서 한국어 lexical
0-hit 가 정상임을 reference fixture evidence 와 분리해 사용자
오인 방지.

Spec: docs/superpowers/specs/2026-05-28-v0.20.x-korean-morphological-tokenizer-spec.md §9
Plan: docs/superpowers/plans/2026-05-28-v0.20.x-korean-morphological-tokenizer-plan.md (S11 + dogfood evidence)
2026-05-28 13:24:29 +00:00
f2a76cfe94 docs(dogfood): V009 morphological tokenizer scenarios + fixture evidence
v0.20.1 dogfood verification 의 fixture + scenario 를 DOGFOOD.md /
SMOKE.md 에 반영. 사용자 KnowledgeBase 같은 영어/code 중심 KB 에서
한국어 0-hit 가 정상 (token 부재) 임을 명시하고, ko-dic 의 morpheme
분해 동작을 검증할 reference fixture (korea-overview.md +
korea-compound.md) 를 inline 으로 제공.

DOGFOOD.md §2.1 갱신:
- description: trigram → unicode61 + 형태소 column.
- scenarios: 한국어 2-char (한국, 서울) + compound noun (서울특별시) +
  영어 whole-token 회귀 + 1-char filter 등 7 case 로 확장.
- §2.1bis 신규: V009 dogfood evidence reference corpus + 검증 명령 +
  예상 snippet (lindera 분해 증거) + known limitation (ko-dic
  compound 단일 token 정책, Option α acceptance).

SMOKE.md 'V009 morphological 검색' 갱신:
- trigram 시절 hint advisory + 3자 키워드 권장 시나리오 제거.
- v0.20.1 의 2-char Korean / compound noun / 1-char filter / 영어
  whole-token 회귀 scenario 로 교체.

Reference fixture (실측 verify pass):
- korea-overview.md: '한국' / '서울' / '지하철' 모두 hit.
- korea-compound.md: '한국어' / '한국문화' / '서울특별시' compound hit.

Spec: docs/superpowers/specs/2026-05-28-v0.20.x-korean-morphological-tokenizer-spec.md §9
Plan: docs/superpowers/plans/2026-05-28-v0.20.x-korean-morphological-tokenizer-plan.md (dogfood evidence)
2026-05-28 13:22:43 +00:00
8c56ef3010 docs(superpowers): v0.20.x C 한국어 morphological tokenizer spec + plan artifacts
본 commit 은 v0.20.x C task (Bug #8 — 한국어 2자 query 0-hit) 의
4-stage workflow artifact 5 파일을 archive:

- spec.md (668 line, status=accepted): Option A/B/C 비교 + lindera
  Path A (영어 substring 회귀 인정) 결정 + 12 section + 4 Appendix
  (B segmentation evidence, C cost evidence, D license evidence).
- spec-critic-r1.md: 3 critical + 6 major finding (NEEDS_REWRITE).
- spec-critic-r2.md: r1c rewrite 후 traceability matrix (ACCEPT).
- plan.md (750 line, status=accepted): 11 step + dependencies +
  cost optimization routing + 9 closure micro-patches 적용.
- plan-closure-r1.md: traceability matrix + 9 MP 의 origin.

이 artifact 들은 implementation 머지 후 frozen reference. 후속
deviation 은 tasks/HOTFIXES.md 가 source of truth.

Workflow stage:
1. spec drafter (omc team writer, opus)
2. spec critic R1 (omc team critic, opus) — NEEDS_REWRITE
3. spec rewriter r1c (omc team writer, opus) — 7 item fix
4. spec closure R2 (in-process verifier, sonnet) — ACCEPT
5. plan drafter (omc team planner, opus)
6. plan closure (in-process verifier, sonnet) — ACCEPT + 9 MP
7. subagent-driven-development implementation (11 step + 5 follow-up
   + 1 docs polish = 17 commit)
8. PR-level final code review (in-process code-reviewer, opus) —
   Approved with notes (4 minor docs finding, merge as-is)

Branch: feat/korean-morphological-tokenizer
Version: 0.20.1
2026-05-28 12:53:31 +00:00
5d9ea588ed docs(v0.20.1): polish PR-review findings (README/HOTFIXES/schema/SKILL)
opus PR-level final review (Approved with notes) 의 4 minor finding
mechanical 정정:

1. README.md — `kebab search` row 의 영어 substring 매칭 표현이
   V007 시절 그대로였음. V009 의 whole-token 회귀 (substring → V002
   동작) 를 정직히 명시 + vector/hybrid mode 권장 안내.
2. tasks/HOTFIXES.md — 2026-05-28 entry 의 file path 정정. lexical.rs
   는 lindera 호출자가 아니라 build_match_string 의 MIN_QUERY_CHARS
   3→2 갱신만; lindera helper 의 실제 owner 는 kebab-chunk/src/lib.rs.
   ingest.rs 는 본 PR scope 외, eager backfill hook 위치는 kebab-app/
   src/app.rs::App::open_with_config.
3. docs/wire-schema/v1/search_response.schema.json — `hint` field
   description 이 V007 trigram 3-char minimum 시절 advisory 시그니처
   그대로. v0.20.1 에서 helper retired + always-omit 사실 명시
   (forward-compat 차원에서 field 만 schema 에 보존).
4. integrations/claude-code/kebab/SKILL.md — `hint` field 설명의
   self-contradiction ("present only with trigram in edge cases" vs
   "Korean 2-char now supported") 해소. retired + reuse 가능 명시.

PR-level reviewer recommendation: "Merge as-is — block 사유 아님 (모든
finding minor)". 본 commit 은 reviewer 의 옵션 1 (별 docs hotfix
commit) 채택.

Spec: docs/superpowers/specs/2026-05-28-v0.20.x-korean-morphological-tokenizer-spec.md
Plan: docs/superpowers/plans/2026-05-28-v0.20.x-korean-morphological-tokenizer-plan.md (PR-level finding follow-up)
2026-05-28 12:53:00 +00:00
53ec9b4dc5 test(chunk): regenerate AST + long-section snapshots for V009 chunk field
S3 의 Chunk struct 갱신 (kebab-core 의 tokenized_korean_text:
Option<String> field 추가) 가 모든 chunk snapshot JSON 의 serde
serialize 결과를 변경시킴. 10 snapshot fixture (9 AST chunker +
markdown long-section) 의 baseline 을 V009 형태로 regenerate.

각 snapshot 의 변경 = chunk JSON 마다 `"tokenized_korean_text":
null` field 추가 (대부분의 fixture 가 영어 코드라 lindera 의 None
fallback). 동작 변경 없음 — serde representation 의 cascade만.

Spec: docs/superpowers/specs/2026-05-28-v0.20.x-korean-morphological-tokenizer-spec.md §6.2
Plan: docs/superpowers/plans/2026-05-28-v0.20.x-korean-morphological-tokenizer-plan.md (S3 follow-up via S11 sanity)
2026-05-28 12:27:37 +00:00
21b52bc285 style: cargo fmt --all (S3+S4+S5+S7 follow-up)
V009 morphological tokenizer 작업 (S3 chunk + S4 backfill + S5
short_query_hint 제거 + S7 신규 tests) 의 형식 정리. 동작 변경 없음.

Spec: docs/superpowers/specs/2026-05-28-v0.20.x-korean-morphological-tokenizer-spec.md
Plan: docs/superpowers/plans/2026-05-28-v0.20.x-korean-morphological-tokenizer-plan.md (S11)
2026-05-28 12:06:01 +00:00
97fd895a10 chore(release): bump version 0.20.0 → 0.20.1
CLAUDE.md §Release / binary version bump 의 두 트리거 모두 hit:
- 사용자 도그푸딩 필요 (Bug #8 한국어 2자 query 해소 — '한국', '서울',
  '지하철' 검색 검증).
- frozen design contract 변경 (§5.5 chunks_fts 의 unicode61 + CASE
  expression triggers + tokenized_korean_text column).

V009 + lindera ko-dic 형태소 분석기 통합 외에도 v0.20.x 의 logging
round 2 enhancement (PR #190) 가 같은 v0.20.x 시리즈에 포함되어
v0.20.1 patch release 시점에 함께 cut.

Build verification: ./target/release/kebab --version → kebab 0.20.1.

Spec: docs/superpowers/specs/2026-05-28-v0.20.x-korean-morphological-tokenizer-spec.md §12.1
Plan: docs/superpowers/plans/2026-05-28-v0.20.x-korean-morphological-tokenizer-plan.md (S10)
2026-05-28 12:05:31 +00:00
d13eb87401 docs(v0.20.x): sync README + HANDOFF + ARCH + SKILL + HOTFIXES for V009
V009 한국어 morphological tokenizer 의 사용자 visible surface 변경 +
release notes scope 를 5 docs 에 cascade.

- README.md: kebab search 명령 row 에 한국어 2자 query 지원 명시.
- integrations/claude-code/kebab/SKILL.md: V007 3-char hint 제거 +
  V009 2자 한국어 query 지원 1줄.
- HANDOFF.md: C task status 완료 flip + v0.20.1 release notes scope
  에 본 변경 추가 + 머지 후 발견 summary 행.
- docs/ARCHITECTURE.md: embedding upgrade (e5-small → e5-large),
  lindera-ko-dic FTS5 한국어 지원, version notes 추가.
- tasks/HOTFIXES.md: 2026-05-28 entry — Bug #8 V009 해소, lindera-ko-dic
  실제 crate name (spec deviation), cargo-deny deferred, Path A
  영어 substring 회귀 명시.

Spec: tasks/p9/p9-9-v0.20.x-korean-morphological-tokenizer-spec.md §7.4
Plan: docs/superpowers/plans/2026-05-28-v0.20.x-korean-morphological-tokenizer-plan.md

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2026-05-28 11:55:25 +00:00
26f3a7756c test(eval): regenerate runner_per_query_snapshot for V009 baseline
V009 FTS5 tokenizer (trigram → unicode61 + 형태소) 로 인한 BM25
distribution + hit ordering 변경의 의도된 cascade. eval runner 의
per-query snapshot (run-1.json) 을 V009 baseline 으로 regenerate.

Regenerate 절차: UPDATE_SNAPSHOTS=1 cargo test -p kebab-eval
--test runner runner_per_query_snapshot_matches_fixture -j 4.
회귀 0 — kebab-eval workspace 의 모든 test (runner / loader /
metrics_and_compare) pass.

Spec: docs/superpowers/specs/2026-05-28-v0.20.x-korean-morphological-tokenizer-spec.md §11.3
Plan: docs/superpowers/plans/2026-05-28-v0.20.x-korean-morphological-tokenizer-plan.md (S8)
2026-05-28 11:51:49 +00:00
881f949fcb test(search): regenerate lexical_snapshot_run_1 for V009 BM25 distribution
V009 FTS5 tokenizer 가 trigram → unicode61 + tokenized_korean_text
column 로 갱신되면서 BM25 IDF 계산이 변화 → fusion_score 의 부동
소수점 값이 미세하게 다름 (히트 순서·snippet·구조 동일). S1 머지
직후 update 누락되었던 snapshot 을 V009 baseline 으로 regenerate.

회귀 없음: `cargo test -p kebab-search --test lexical lexical_snapshot_run_1 -j 4`
exit 0.

Spec: docs/superpowers/specs/2026-05-28-v0.20.x-korean-morphological-tokenizer-spec.md §11.3
Plan: docs/superpowers/plans/2026-05-28-v0.20.x-korean-morphological-tokenizer-plan.md (S1 follow-up via S7 reviewer)
2026-05-28 11:49:15 +00:00
c5de5f812b test(fts,app): V009 morphological tokenizer integration tests
신규 4 test 추가:

- crates/kebab-store-sqlite/tests/fts.rs:
  - fts_v009_korean_morphological_2char_query_hits: tokenized_korean_text
    column 이 채워진 chunk 의 '한국' 2-char query hit.
  - fts_v009_english_whole_token_only: V007 trigram substring 매칭
    회귀 (Path A) — 'token' query 가 'tokenizer' chunk 에서 0-hit.
- crates/kebab-app/tests/search_korean.rs:
  - korean_morphological_2char_query_lexical_mode: end-to-end
    한국어 wiki fixture ingest → '한국' / '서울' query hit.
  - korean_morphological_mixed_english_korean_query: 'Rust' English
    whole-token + '최적화' Korean morpheme hit.

crates/kebab-search/src/lexical.rs:
  - build_match_string() 의 MIN_TRIGRAM_CHARS(3) → MIN_QUERY_CHARS(2).
    V009 unicode61 은 최소 token 길이 제한 없어 2자 한국어 morpheme
    query 가 통과되어야 함. 1자 단독은 여전히 필터.
  - 관련 unit test 2개 V009 동작으로 갱신.

fixture text 는 lindera ko-dic 의 실제 segmentation 동작에 의존
(spec Appendix B prior-knowledge 예측). 실측 시 fixture 조정 가능.

Spec: docs/superpowers/specs/2026-05-28-v0.20.x-korean-morphological-tokenizer-spec.md §9.1, §9.2
Plan: docs/superpowers/plans/2026-05-28-v0.20.x-korean-morphological-tokenizer-plan.md (S7)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-28 11:38:52 +00:00
f94e0c4a9b feat(app): bump lexical_index_version to V009 (fts5-v009-korean-morphological)
V009 의 FTS5 tokenizer 가 trigram → unicode61 + 한국어 형태소 분해
column 로 갱신됨. lexical_index_version 의 format 에
`fts5-v009-korean-morphological` suffix 추가하여 V007 baseline 과
구별. eval runner 의 config_snapshot 및 search cache 무효화에
자동 picks up.

기존 format: lex:{chunker_version}
신규 format: lex:{chunker_version}:fts5-v009-korean-morphological

Wire schema shape 변경 없음 (SearchHit.index_version 의 string
content 만 변화). lexical_index_version_is_returned_unchanged test
는 IndexVersion 의 임의 string 을 사용해 unchanged.

Spec: docs/superpowers/specs/2026-05-28-v0.20.x-korean-morphological-tokenizer-spec.md §11.1, §11.3
Plan: docs/superpowers/plans/2026-05-28-v0.20.x-korean-morphological-tokenizer-plan.md (S6)
2026-05-28 11:23:13 +00:00
923b959610 refactor(app): retire short_query_hint helper, keep wire field as None
V009 unicode61 + 형태소 tokenizer 환경에서 2-char 한국어 query 가
hit 가능해졌으므로 V007 시기의 "3자 이상 권장" hint 가 obsolete.
SearchResponse.hint field 는 wire schema 보존 위해 struct 에 유지 +
항상 None.

- kebab-app/src/app.rs: short_query_hint 함수 + doc-comment 삭제.
  2 호출 site 가 hint = None 으로 정리.
- kebab-app/src/lib.rs: re-export 에서 short_query_hint 제거.
- kebab-tui/{app.rs,search.rs,run.rs}: short_query_hint field + 4
  호출 cascade 제거.
- kebab-cli/tests/wire_search_response.rs:
  search_plain_emits_short_query_hint_to_stderr test 삭제.
  search_json_emits_hint_field_for_short_query →
  search_json_hint_absent_for_short_query_v009 으로 교체
  (hint 항상 None 검증).
- kebab-search/src/lexical.rs::build_match_string: V007 의 trigram
  multi-token OR-combine 분기는 V009 환경에서 redundant 하나 보존
  (future 확장성) — doc-comment 1 줄 추가.

Wire schema shape 변경 없음 (search_response.schema.json:33 의 hint
field 보존, struct 에 None 으로 항상 셋팅).

Spec: docs/superpowers/specs/2026-05-28-v0.20.x-korean-morphological-tokenizer-spec.md §7.2, §7.3, §11.3
Plan: docs/superpowers/plans/2026-05-28-v0.20.x-korean-morphological-tokenizer-plan.md (S5)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-28 11:13:45 +00:00
b63af20b72 feat(app): first-boot eager backfill for tokenized_korean_text
V007 → V009 업그레이드 시 기존 chunks 의 tokenized_korean_text 가
NULL — 첫 App::open_with_config 호출 시 자동으로 lindera ko-dic
으로 분해 후 UPDATE. chunks_au trigger 가 chunks_fts 를 자동 재-index.
사용자 재-ingest 불필요.

- crates/kebab-store-sqlite/src/store.rs:
  backfill_tokenized_korean_text(progress_cb, tokenize) API. 1000 row 마다
  commit + progress 콜백. idempotent (IS NULL 필터로 partial
  completion 재실행 안전). tokenizer 를 파라미터로 받아 §8 dep 경계 유지.
- crates/kebab-app/src/app.rs::open_with_config: run_migrations 직후
  backfill 호출. 실패 시 warn log 만 (App open 은 성공 — vector/hybrid
  mode 계속 가능). 500 row 마다 info log progress.
- crates/kebab-store-sqlite/tests/fts.rs:
  backfill_tokenized_korean_text_populates_nullable_rows 단위 test
  (idempotency 포함).
- clippy pre-existing 오류 수정 (redundant_closure, map_unwrap_or,
  cast_lossless, uninlined_format_args — kebab-app/ingest_log.rs,
  pdf_ocr_apply.rs, app.rs, tests/ocr_inspect_smoke.rs).

Spec: docs/superpowers/specs/2026-05-28-v0.20.x-korean-morphological-tokenizer-spec.md §8.1, §8.2
Plan: docs/superpowers/plans/2026-05-28-v0.20.x-korean-morphological-tokenizer-plan.md (S4)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-28 11:01:00 +00:00
e8f44a57e3 fix(test): update first_ingest_bumps_corpus_revision baseline for V009
V004 seeds corpus_revision=0, V009 migration bumps to 1 (spec §5.2 —
LRU cache invalidation). Test previously asserted fresh store = 0;
now reads post-migration baseline dynamically and verifies that the
ingest commit increments past it.

Spec: docs/superpowers/specs/2026-05-28-v0.20.x-korean-morphological-tokenizer-spec.md §5.2
Plan: docs/superpowers/plans/2026-05-28-v0.20.x-korean-morphological-tokenizer-plan.md (S3 follow-up)
2026-05-28 10:47:09 +00:00
4b4a8cbb3a fix(test): retire V007 trigram-only fts tests, add V009 unicode61 sanity
V007 trigram tokenizer 의 substring 매칭을 검증하던 3 test 는 V009
unicode61 으로 의도된 회귀 (spec §3 Non-Goals Path A) 가 발생하므로
obsolete:

- fts_trigram_korean_3char_substring_hits: '발생한' → '발생한다' hit
  은 trigram 의 substring 매칭이라 V009 의 whole-token 매칭에서 fail.
- fts_trigram_korean_short_query_zero_hit_pinned: 2-char Korean
  query 의 0-hit 동작은 V009 의 형태소 column 으로 해소되므로 이 핀
  자체가 obsolete (S7 이 신규 2-char hit test 로 대체).
- fts_trigram_english_substring_hits: 'token' → 'tokenizer' hit 은
  V009 unicode61 의 whole-token only 에서 fail.

신규 추가:
- fts_v009_unicode61_space_separated_korean_token_hits: V009 unicode61
  의 whole-token 매칭 sanity (token '충돌은' hit, substring '발생한'
  0-hit). S7 이 추가할 morphological 검증 test 와 별개의 baseline.

S7 (plan §2 Step 7) 가 v009_korean_morphological_2char_query_hits +
v009_english_whole_token_only 를 추가하여 회귀 + 신규 동작 모두 핀할
예정.

Spec: docs/superpowers/specs/2026-05-28-v0.20.x-korean-morphological-tokenizer-spec.md §3, §9.2
Plan: docs/superpowers/plans/2026-05-28-v0.20.x-korean-morphological-tokenizer-plan.md (S3 follow-up)
2026-05-28 10:36:32 +00:00
4dc1c10be1 fix(test): update corpus_revision baseline to post-V009 migration
V009 migration bumps corpus_revision by 1 at apply time (spec §5.2 —
invalidates pre-V009 LRU search cache). Existing tests assumed V004
seed (0) was the final baseline; updated to expect 1 after migration:

- fresh_store_starts_at_zero → fresh_store_starts_at_post_migration_baseline
- bump_increments_monotonically: expected 1,2,3 → 2,3,4 (post-baseline)
- revision_persists_across_reopen: 2 → 3 (V009 +1, +bump,+bump)

Spec: docs/superpowers/specs/2026-05-28-v0.20.x-korean-morphological-tokenizer-spec.md §5.2
Plan: docs/superpowers/plans/2026-05-28-v0.20.x-korean-morphological-tokenizer-plan.md (S3 follow-up)
2026-05-28 10:33:50 +00:00
bd86f61c9c fix(chunk): close S3 reviewer blockers — get_chunk read + AST chunker cascade
S3 spec compliance reviewer (sonnet) 가 2 blocker 발견:

1. crates/kebab-store-sqlite/src/documents.rs: get_chunk SELECT 가
   tokenized_korean_text column 을 미조회 → DB 의 값이 read 시 유실.
   SELECT column list + row → Chunk 변환 시 row.get 인덱스 추가.
   ChunkRow struct + chunk_row_from_sql + get_chunk Chunk 생성 cascade.

2. crates/kebab-chunk/src/code_*_ast_v1.rs (9 file): make_chunk 가
   tokenized_korean_text: None 하드코딩 → 한국어 주석을 가진 코드
   파일이 FTS hit 안 됨. tier2_shared 와 동일 패턴으로
   tokenize_korean_morphological(text) 호출 cascade.

이 commit 은 S3 의 rework — amend 아닌 별 commit (S3 boundary
유지). spec §6.2 invariant ("모든 chunker 가 chunk emit 직전에
tokenize 호출") 충족.

Spec: docs/superpowers/specs/2026-05-28-v0.20.x-korean-morphological-tokenizer-spec.md §6.2
Plan: docs/superpowers/plans/2026-05-28-v0.20.x-korean-morphological-tokenizer-plan.md (S3 rework)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-28 10:30:53 +00:00
b134ae9dd5 feat(chunk): integrate lindera korean morphological tokenizer
V009 의 tokenized_korean_text column 에 들어갈 morpheme sequence
를 lindera ko-dic 으로 분해. chunk builder pipeline 의 chunk 생성
직후 시점에서 호출 → chunk struct 의 field 에 pre-fill → store
의 put_chunks 가 단일 transaction 안에서 INSERT.

- crates/kebab-core/src/chunk.rs: Chunk struct 에
  tokenized_korean_text: Option<String> field 추가 (#[serde(default)]).
- crates/kebab-chunk/src/lib.rs: tokenize_korean_morphological()
  helper + OnceLock 캐싱 + fallback (None) 정책.
- crates/kebab-chunk/Cargo.toml: lindera features = ["embed-ko-dic"]
  추가 (DictionaryKind::KoDic 활성화에 필요).
- 모든 chunker (tier2_shared, md_heading_v1, pdf_page_v1, 9개
  code AST v1): Chunk 리터럴에 tokenized_korean_text pre-fill.
- crates/kebab-store-sqlite/src/documents.rs::put_chunks: INSERT
  SQL column list + placeholder + binding 갱신 (12번째 column).
- crates/kebab-chunk/tests/tokenize_korean.rs: 단위 테스트 2개.

lindera 3.0.7 API 정정: load_dictionary_from_kind →
load_embedded_dictionary, Token.text → Token.surface.

Spec: docs/superpowers/specs/2026-05-28-v0.20.x-korean-morphological-tokenizer-spec.md §6.2
Plan: docs/superpowers/plans/2026-05-28-v0.20.x-korean-morphological-tokenizer-plan.md (S3)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-28 10:22:15 +00:00
597d8b70ad feat(deps): add lindera + lindera-ko-dic for korean morphological tokenizer
Workspace dependency 만 추가 — 실제 사용은 S3 의 kebab-chunk
tokenize_korean_morphological() helper.

- Cargo.toml (workspace): lindera = "3", lindera-ko-dic = "3" 추가.
- crates/kebab-chunk/Cargo.toml: per-crate dep (lindera-ko-dic 에
  embed-ko-dic feature 로 KO-DIC 딕셔너리 embedded blob 활성화).
- crates/kebab-app/Cargo.toml: [features] 에 fts_korean_morphological
  (spec §6.3 Option A — marker role only, disable path 없음).

License: lindera = MIT, lindera-ko-dic = MIT
(cargo info 로 확인). cargo deny 도입은 P9 follow-up.

Spec: docs/superpowers/specs/2026-05-28-v0.20.x-korean-morphological-tokenizer-spec.md §6.1, §10.1
Plan: docs/superpowers/plans/2026-05-28-v0.20.x-korean-morphological-tokenizer-plan.md (S2)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-28 10:03:58 +00:00
b106120e93 feat(fts): add V009 korean morphological tokenizer migration
V007 trigram tokenizer 의 한국어 2자 query 0-hit 한계 (Bug #8) 해소를
위한 V009 migration 추가. unicode61 tokenizer 로 환원 + 한국어 형태소
분해 결과를 별 column `tokenized_korean_text` 에 pre-fill 하는 방식.

- migrations/V009__fts_korean_morphological.sql 신규: column ADD,
  chunks_fts DROP+재정의, 3 trigger CASE expression, backfill INSERT,
  corpus_revision bump.
- design §5.5 갱신: trigram → unicode61 + 형태소 column. CASE
  expression trigger 본문.
- crates/kebab-store-sqlite/tests/fts.rs: V007 verbatim test 를
  V009 source-of-truth 로 rename. v009_bumps_corpus_revision unit
  test 추가.
- store.rs: clippy bool_to_int_with_if + cast_lossless 기존 경고 수정
  (pdf_ocr_events 관련 코드, S1 작업 중 발견).

영어 substring 매칭은 V002 (whole-token only) 로 회귀 — spec §3
Non-Goals + 후속 release notes (v0.20.1) 에서 정직히 기술.

Spec: docs/superpowers/specs/2026-05-28-v0.20.x-korean-morphological-tokenizer-spec.md
Plan: docs/superpowers/plans/2026-05-28-v0.20.x-korean-morphological-tokenizer-plan.md (S1)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-28 09:48:46 +00:00
43366b1b15 docs(handoff): C 한국어 morphological tokenizer (Bug #8) 새 session handoff
v0.20.0 sub-item 1 + bugfix 1~4 + ingest log r1+r2 머지 후, 다음 우선순위
C (한국어 morphological tokenizer) 의 self-contained context.

새 session 의 첫 step + workflow patterns + 환경/memory references + cascade
risk + 가능한 fix paths (Option A jieba-rs / B bi-gram supplement / C
query-side expansion). spec/plan/executor cycle 동일 패턴.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-28 08:18:04 +00:00
70507e94ca docs(handoff): logging round 2 closed (PR #190 merged) + v0.20.1 release notes scope 갱신
PR #190 (2026-05-28 commit 7bbdc89a) merge 후:
- "Logging feature future enhancements" section →  closed (4 enhancement 모두 PR #190 에서 ship).
- G (v0.20.1 release) section 의 release notes scope 갱신 — V008 migration + pdf_ocr_events + CLI inspect + retention 추가.

다음 작업 priorities 그대로 C → B → A → G.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-28 08:15:07 +00:00
7bbdc89ae3 Merge pull request 'feat: ingest log round 2 — image_w/h + V008 SQLite mirror + CLI inspect + retention' (#190) from feat/ingest-log-round2-enhancements into main
Reviewed-on: #190
2026-05-28 08:12:25 +00:00
7c24734cc7 docs(superpowers): v0.20.x logging r2 spec + plan artifacts
logging round 2 (4 enhancement: image_w/h + V008 SQLite mirror + CLI inspect + retention) 의 spec/plan ACCEPT 후 round artifacts.

- spec: 751 line (ACCEPT, 7/7 critic round 1 finding + 7/7 closure r2 traceability)
- plan: 576 line (ACCEPT, 6/6 step + 13/13 AC + G1/G2/G3 plan-level resolve)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-28 08:04:32 +00:00
9a36a06f97 style: cargo fmt --all (v0.20.x logging r2 feature follow-up)
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-28 06:34:01 +00:00
35c987df1c feat(app): log retention — keep_recent_runs + retention_days (Enhancement 4)
LoggingCfg gains two fields with serde defaults: keep_recent_runs
(default 100, top-N file retention) and retention_days (default 30,
time-based retention for both ndjson files and the SQLite mirror).

IngestLogWriter::open now runs cleanup_old_logs before creating a new
ingest-*.ndjson — delete iff (idx >= keep_recent) OR (modified <=
cutoff). ingest_with_config_opts also calls
SqliteStore::prune_pdf_ocr_events(retention_days) at ingest start so
the SQLite mirror tracks the same retention window.

Backward compat (AC-9): both new fields use #[serde(default = ...)],
so a pre-v0.20.x config with only [logging] ingest_log_enabled +
ingest_log_dir parses unchanged. kebab init writes the new defaults
automatically via Config::default() -> toml::to_string_pretty (AC-12).

docs/SMOKE.md config example synced.

Closure r1 F5: explicit OR-on-stale comment inside cleanup_old_logs.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-28 06:17:47 +00:00
d9ec7b8dc3 feat(cli): kebab inspect ocr-stats + ocr-failures (Enhancement 3 + wire schema additive minor)
Two new wire schemas land as additive minor: ocr_stats.v1 (corpus-wide
aggregate — total_events, success_rate, p50/p90/p99/max_ms, by_engine,
top-10 by_doc by failure count) and ocr_failures.v1 (per-doc or
corpus-wide recent failures, with --doc-id + --limit). Both ship via
new CLI subcommands `kebab inspect ocr-stats` / `inspect ocr-failures`.

App gains four facade methods: inspect_ocr_stats /
inspect_ocr_failures plus their *_with_config companions — required by
CLAUDE.md "the facade rule" so `--config <path>` is honored. The CLI
dispatch arms thread cfg explicitly into the _with_config form.

Runtime introspection emit (WIRE_SCHEMAS in schema.rs) gains two
entries; the meta JSON Schema (schema.schema.json) is untouched
because its wire.schemas is pattern-based, not enum-based.

ingest_log::percentiles extended to (p50, p90, p99, max). p99 surfaces
only via inspect ocr-stats; IngestSummary (round 1) stays 3-percentile.

SKILL.md synced with the two new schemas (AC-13).

Closure r2 G2 (facade *_with_config pair) + G3 (runtime emit, not
meta schema file) + closure r1 F4 (p99) resolved.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-28 06:13:08 +00:00
4e451c9f7c feat(app): dual-write PDF OCR events to SQLite + ndjson (Enhancement 2 wiring)
Pre-capture canonical.doc_id and Arc<SqliteStore> before the OCR
emit_progress closure so both the ndjson file and the SQLite mirror
carry the same doc_id for every event. File write is durable
(errors propagate); SQLite insert is non-critical (tracing::warn on
failure, ingest does not abort) per spec R-1.

LogEvent::Ocr gains a doc_id: Option<&str> field as an additive
Serde change — round 1 ndjson logs deserialize with doc_id=None.

Closure r1 F1: doc_id NULL in dual-write resolved via
let doc_id_for_log = canonical.doc_id.0.clone() pre-capture.
Closure r2 G1: Arc::clone(&app.sqlite) reused instead of opening a
second SqliteStore — eliminates double-open lock contention and
duplicate migration runs.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-28 06:06:03 +00:00
6482bf1321 feat(store): V008 pdf_ocr_events migration + record/prune API (Enhancement 2)
Add migrations/V008__pdf_ocr_events.sql with the events table + 3
indices (doc_id, run_id, ts). SqliteStore gains two pub fn:
record_pdf_ocr_event (insert one OCR sample) and prune_pdf_ocr_events
(delete rows older than retention_days; returns the affected row
count). Both follow the existing Mutex<Connection> lock pattern.

Wiring into ingest path lands in the next commit.

Closure r1 F2: explicit lock acquisition in both methods.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-28 05:56:54 +00:00
5977c8cdf1 feat(app): capture image_width/height in PDF OCR raster decode (Enhancement 1)
Add extract_image_dimensions(bytes) helper using image::ImageReader
and fill the 2 PdfOcrProgress::Finished emit points in pdf_ocr_apply.rs
where page_image_bytes is in scope (OCR error path + success path).
The no-DCTDecode skip path leaves None as page_image_bytes is absent.
Result: LogEvent::Ocr carries non-null image_width/image_height on
successful raster decode, enabling future size-conditioned timeout tuning.

Closure r1 F3: kebab-app/Cargo.toml image features += "jpeg" added as
direct [dependencies] entry (not relying on feature unification via
kebab-parse-image).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-28 05:54:55 +00:00
89d334a92b docs(handoff): v0.20.0 sub-item 1 머지 후 next priorities (C/B/A/G order)
PR #189 (2026-05-28 commit 09333d0) 머지 후 다음 작업 priorities 기록:

- C (다음 우선): 한국어 morphological tokenizer (Bug #8 V007 trigram 2-char 한계 follow-up)
- B: OCR dense page coverage (metro-korea.pdf page 8/13 timeout — max_pixels 동적 / column-level OCR / 점진적 timeout 축소)
- A: v0.20 의 deferred sub-items (sub-item 2 multi-region image / sub-item 3 PDF normalize integration / TODO #4 figure-table / TODO #5 Enricher trait)
- G: v0.20.1 patch release + release notes (180s timeout + [logging] section + wire schema additive + CLI fix)

Non-blocking known: Bug #12 falsified, ask phrasing-sensitive refusal.

Logging feature 후속 enhancements (low priority): image_width/height capture, SQLite mirror, CLI inspect query, log retention.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-28 04:48:44 +00:00
09333d0b05 Merge pull request 'feat(pdf): scanned PDF OCR via qwen2.5vl:3b vision LLM (v0.20.0 sub-item 1)' (#189) from feat/pdf-scanned-ocr into main
Reviewed-on: #189
2026-05-28 04:37:41 +00:00
685007789a style: cargo fmt --all (round 4 ingest log feature follow-up)
Phase C4 executor 의 마지막 `fix(test): clippy + fmt fixes` commit 이
test file 부분만 fmt 적용. workspace 전체 fmt 누락 발견 → cargo fmt --all
적용. 모든 import alphabetical reorder + line wrapping 정합.

추가 untracked artifact 동시 commit:
- docs/superpowers/specs/2026-05-28-v0.20-ingest-log-spec.md (491 line, ACCEPT)
- docs/superpowers/plans/2026-05-28-v0.20-ingest-log-plan.md (616 line, ACCEPT)

workspace test: 1370 passed / 0 failed / 50 ignored, ingest_log_smoke green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-28 04:18:40 +00:00
445b096215 fix(test): clippy + fmt fixes for logging_roundtrip and ingest_log_smoke
* kebab-config/tests/logging_roundtrip.rs: r#"..."# → plain string
    (clippy::unnecessary_hashes).
  * kebab-app/tests/ingest_log_smoke.rs: |e| e.ok() → Result::ok,
    |s| s.as_u64() → Value::as_u64 (clippy::redundant_closure).
  * cargo fmt --all applied to pre-existing formatting drift.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-28 03:26:42 +00:00
415227bf76 test(app): ingest_log_smoke integration test (AC-9)
crates/kebab-app/tests/ingest_log_smoke.rs 신규:

  * ingest_log_smoke (AC-9): tempdir + 1 md + 1 scanned PDF →
    ingest → assert log file exists + 각 line valid JSON +
    각 kind ∈ {ocr,parse_error,skip,error,summary} + last
    line kind=summary + scanned>0.

  * ingest_log_disabled_emits_no_file (AC-6): enabled=false 일
    때 log_dir 안 ingest-*.ndjson 파일 0개 verify.

fixture: ../kebab-parse-pdf/tests/fixtures/scanned_page1.pdf
재사용 (OCR disabled — Ollama 없이 smoke 실행).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-28 03:06:43 +00:00
f9dc0f749f feat(app): wire IngestLogWriter into 5 ingest emit hooks (Arc<Mutex> sync)
v0.20.x ingest log feature 의 ingest pipeline wiring. 5 emit hook:

  Hook 1: ingest_with_config_opts entry/exit (writer init + summary write + flush)
  Hook 2: apply_ocr_to_pdf_pages closure (PdfOcrProgress::Finished → LogEvent::Ocr)
  Hook 3: ingest_one_*_asset Err arm (LogEvent::Error)
  Hook 4: scan 직후 fs_skips.events enumerate (LogEvent::Skip)
  Hook 5: (Hook 3 통합) per-asset fatal error → LogEvent::Error

Hook 4 의 skip event carry 위해 kebab-source-fs 의 FsScanSkips 에
events: Vec<FsSkipEvent> field 추가 (kebab-source-fs 가 kebab-app
재호출 안 함 — cycle 회피).

Ownership: Option<Arc<Mutex<IngestLogWriter>>> binding 1 곳, 5 hook 이
clone+lock+write. ocr_ms_samples (Vec<u64> success-only) 는 Arc<Mutex>
로 share, summary stage 가 sort+p50/p90/max 계산. single-threaded
per-asset loop 라 deadlock/contention 위험 없음.

Writer 실패는 ingest 자체 fail 시키지 않음 (tracing::warn + 진행).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-28 03:05:07 +00:00
bef0c98867 feat(wire): PdfOcrProgress.Finished + ingest_progress.v1 additive 4 fields
v0.20.x ingest log feature 의 wire side. additive minor cascade:

  * PdfOcrProgress::Finished + IngestEvent::PdfOcrFinished 의 4 field:
      - image_byte_size: Option<u64>
      - image_width:     Option<u32>
      - image_height:    Option<u32>
      - failure_reason:  Option<String>
  * docs/wire-schema/v1/ingest_progress.schema.json — 4 추가 property
    (모두 optional, required 변경 없음 = additive minor)
  * integrations/claude-code/kebab/SKILL.md — wire schema description 동기

기존 ingest_progress.v1 consumer (CLI wire dump, integration test
fixture, kebab-cli wire_search/wire_ask) 는 4 추가 field 의
Option::None 으로 backward-compat. version bump 0 (additive minor =
binary-version cascade trigger 아님 per CLAUDE.md §Versioning cascade).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-28 02:57:59 +00:00
f8a4c79727 feat(app): IngestLogWriter + LogEvent enum (per-ingest-run ndjson log)
v0.20.x ingest log surface 의 module side. crates/kebab-app/src/
ingest_log.rs 신규:
  * IngestLogWriter — open/write_event/write_summary/flush + Drop flush
  * LogEvent enum 4 variant (ocr / parse_error / skip / error)
  * IngestSummary struct (kind="summary" literal + 11 stat field)
  * generate_run_id (ISO 8601 prefix + uuid v7 마지막 8 hex)
  * expand_log_dir ({state_dir} placeholder 의 hand-roll expand)
  * now_ts (Rfc3339 UTC helper)
  * percentiles helper (sorted Vec p50/p90/max)

uuid v7 = workspace dep, rand 신규 의존 회피 (spec §6 R-5).

본 step 은 self-contained writer + 5 unit test. ingest pipeline 의
emit hook 5개 wiring 은 step 4.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-28 02:53:09 +00:00
f60304beb4 feat(config): add [logging] section (ingest_log_enabled + ingest_log_dir)
v0.20.x ingest log surface 의 config side. LoggingCfg struct 신설:
  * ingest_log_enabled (bool, default true)
  * ingest_log_dir (PathBuf, default "{state_dir}/logs")

#[serde(default)] tag 로 pre-v0.20 config 가 [logging] section 부재
시 LoggingCfg::default() 자동 init (AC-10 backward compat).

{state_dir} placeholder 의 실제 expand 는 step 2 (IngestLogWriter)
의 expand_log_dir helper 가 담당 (kebab-config 의 expand_path_with_base
는 {state_dir} 미지원, spec §6 R-3).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-28 02:44:21 +00:00
6a9551e0fa fix(config): pdf.ocr.request_timeout_secs default 60 → 180 (Bug #11 follow-up)
Round 3 final dogfood (2026-05-28) 에서 60s default 가 dense Korean page
(metro-korea.pdf page 8/9/13) 의 OCR 을 강제 timeout — round 2 대비 1 page
더 indexed 손실. user perspective: cost vs coverage trade-off 가 60s 에선
coverage 쪽으로 너무 깎임.

Sweet spot 점진적 축소 정책 채택 — conservative starting point 180s 부터
dogfood evidence (OCR 평균 ms 분포) 기반 점진적 축소. 60s 같은 짧은 default
로 직접 jump 안 함.

- crates/kebab-config/src/lib.rs::default_pdf_ocr_request_timeout_secs() = 180
- unit test rename (_is_60s → _is_180s) + assertion 180
- crates/kebab-config/tests/pdf_ocr.rs assert_eq 180
- tasks/HOTFIXES.md 2026-05-28 follow-up entry 추가

User override path 보존 — config.toml [pdf.ocr] request_timeout_secs = N
로 user 가 직접 tune.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-28 01:40:23 +00:00
46e99470eb docs(superpowers): v0.20 sub-item 1 bugfix1/2/3 specs + plans + DOGFOOD.md
3-round dogfood-driven fix cycle 의 산출물:

- bugfix1 (Bug #2/#3/#4): spec 964 line + plan 848 line
- bugfix2 (Bug #6/#7, #8 falsified): spec 308 line + plan 388 line
- bugfix3 (Bug #9/#10/#11/#13/#14, #12 falsified): spec 410 line + plan 1043 line
- docs/DOGFOOD.md: 전방위 dogfood checklist 의 전체 (§0 environment ~ §13 reference corpus)

각 round 의 spec/plan 가 critic + verifier round 2 closure ACCEPT 후 frozen. dogfood-driven evidence 기반.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-28 01:21:34 +00:00
9b44e27dfe test(app): update schema_report assertion for streaming_ask=true (Bug #9 follow-up)
schema_report_reflects_freshly_ingested_kb 가 `!streaming_ask` 를 assert 했으나
Bug #9 fix (760eee8) 로 streaming_ask 가 true 로 정정됨. assertion 을 반전.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 23:58:10 +00:00
854a180365 fix(cli): add active_parsers + active_chunkers to Models test fixture in wire.rs (Bug #13)
Step 4 의 Models struct 확장 (active_parsers / active_chunkers 추가) 이
crates/kebab-cli/src/wire.rs 의 테스트 fixture 초기화를 누락 → E0063 컴파일 에러.
#[serde(default)] 는 serde 역직렬화 전용 — struct literal 초기화 에는 모든 field 필요.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 23:18:27 +00:00
5bba95fd71 docs(spec): HOTFIXES entry + parent spec cross-link for Bug #11 timeout deviation
Bug #11 (이전 commit `fix(config): pdf.ocr.request_timeout_secs default 600 → 60`)
의 frozen-spec deviation handoff.

- tasks/HOTFIXES.md: 2026-05-27 dated subsection — Discovered / Symptom / Root cause /
  Fix / Amends 5-field 포맷 (기존 entries 와 일치).
- docs/superpowers/specs/2026-05-27-pdf-scanned-ocr-spec.md: PDF OCR config block
  line 1000 (default value) + OQ-1 line 1628 에 inline HTML 주석 2 줄 cross-link.
  prose 변경 0 — parent spec frozen contract 보존, HTML 주석은 markdown render 시 invisible.

HOTFIXES entry 가 live source of truth (CLAUDE.md "Spec contract" 규칙).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 23:16:18 +00:00
2c7fa7142a fix(cli): empty query emits error.v1 invalid_input for search + ask (Bug #14)
이전: `kebab search "" --json` / `kebab search "  " --json` / `kebab ask "" --json`
모두 exit=0 + silent 0 hit (search) 또는 LLM 빈 prompt round-trip (ask). user
mistake (typo, shell expansion 실수) 가 silent → debugging 비용.

이후: 양쪽 arm 에서 `query.trim().is_empty()` → kebab_app::StructuredError
(ErrorV1, code=invalid_input, hint 포함). exit=2 (StructuredError → 기존
exit_code() 의 generic non-zero path).

--bulk mode 는 영향 0 (bulk arm 이 query 무시).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 23:16:08 +00:00
d9c7aabce1 feat(schema): add active_parsers + active_chunkers arrays to schema.v1.models (Bug #13)
이전: schema.v1.models 가 parser_version / chunker_version 단일 값만 보고 →
multi-medium corpus (md + pdf + code Rust/Python + dockerfile + k8s + manifest)
의 version cascade audit 누락 risk.

이후: additive minor — Models struct 에 active_parsers + active_chunkers Vec<String>
추가. backward compat: 기존 단일 field 보존 (markdown default), 신규 array 는
optional (#[serde(default)] + JSON schema required 미포함).

source:
- kebab_store_sqlite::fetch_distinct_parser_versions() 가
  documents.parser_version DISTINCT + ORDER BY 반환.
- fetch_distinct_chunker_versions() 가 chunks.chunker_version 동일 pattern.
- collect_models 가 매 schema 호출마다 재계산 (cache 없음 — R-3 자동 해결).

wire schema additive only — 메이저 bump 불필요. v0.20.1 minor 로 충분.
integrations/claude-code/kebab/SKILL.md 동기 갱신.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 23:15:58 +00:00
10b0e2f4f2 fix(config): pdf.ocr.request_timeout_secs default 600 → 60 per dogfood evidence (Bug #11)
metro-korea.pdf v0.20 final-dogfood (2026-05-27):
- page 8 + page 13 양쪽 모두 600s default 까지 완전 timeout
  (`ms: 600000, chars: 0, skipped: true`)
- 결과: 본문 indexed 안 됨 + page 당 20분 cost 낭비

cloud GPU Ollama 의 실측 per-page throughput 는 6-32s (parent spec 가정 105s 보다
훨씬 빠름). 60s 면 production-friendly upper-bound. dense/고해상도 page 는
config.toml override (`[pdf.ocr] request_timeout_secs = N`) 로 user 가 늘릴 수
있음 — Step 6 에서 HOTFIXES + parent spec cross-link.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 23:15:21 +00:00
28f513795e fix(config): emit error.v1 code=config_not_found for missing --config path (Bug #10)
이전: `kebab search "rust" --config /tmp/nonexistent.toml --json` 가 exit=0 +
`{"hits":[]}` silent fallback to XDG default. typo / wrong path 가 0-hit 으로만
surface — debugging nightmare.

이후: kebab_config::ConfigNotFound thiserror::Error 추가, Config::load 의
`Some(p) if !p.exists()` arm 이 anyhow::Error::new(ConfigNotFound { path })
return. kebab_app::error_wire::classify 가 downcast → ErrorV1 code=config_not_found,
hint, details.path 채워서 stderr 에 ndjson 으로 emit.

R-1 (relative path): std::path::Path::exists() 는 cwd-relative — 별도 작업 없이
absolute + relative 모두 cover. integration test 두 개로 검증.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 23:14:54 +00:00
760eee89c8 fix(app): flip streaming_ask + single_file_ingest capabilities to actual surface (Bug #9)
capabilities_snapshot() 가 streaming_ask + single_file_ingest 를 hardcoded false 로
보고했으나 실제 구현은 v0.20 final-dogfood 에서 production-grade:
- kebab ask --stream → answer_event.v1 ndjson 191 event 정상 emit
- kebab ingest-file <path> / kebab ingest-stdin --title <T> → ingest_report.v1 정상

MCP host + Claude Code skill 등 agent 가 schema.capabilities 로 routing 결정 시
false negative → 사용자가 실제 동작 feature 를 사용 불가능하다고 오인.

http_daemon 은 false 유지 (별도 sub-item 의 non-impl).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 23:13:57 +00:00
f763049923 test(cli): assert 'code' in search --help output (Bug #7 regression pin)
Why: Step 2 의 doc-comment edit 가 향후 누군가 value list 를 재정렬
하거나 alias section 으로 분리할 때 silently 사라질 risk. clap 의
--help 렌더링 가 doc-comment 의 free-form text 라 grep-only smoke 가
유일한 검출 수단.

Change: 신규 test file (kebab-cli convention `cli_*` prefix 답습).
CARGO_BIN_EXE_kebab 으로 fresh binary 실행, stdout 의 `code` substring
assert. spec §4.4 의 acceptance row 1:1 mapping.

Refs: docs/superpowers/specs/2026-05-27-v0.20-sub1-bugfix2-spec.md §4.4
/ §5 (acceptance row 4).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 15:47:55 +00:00
8cf73d1f43 docs(cli): list 'code' in --media help string + SKILL.md (Bug #7)
Why: kebab search --media code 가 v0.18.0 부터 functional support 됨
(MEDIA_KINDS 외 path 로 first-class 처리, schema.v1.media_breakdown.code
존재). 그러나 SearchArgs 의 clap doc-comment + SKILL.md line 57 의
value list 가 stale — `code` 누락. user 가 --help 만 보고 code 미지원이라
오해 가능.

Change: 2 surface 동기 — main.rs line 158-160 의 multi-line clap
doc-comment + integrations/claude-code/kebab/SKILL.md line 57.
Rust binary surface / wire schema 변경 0.

Out of scope (follow-up): crates/kebab-mcp/tools/search.rs:44,
crates/kebab-core/src/search.rs:32+52, crates/kebab-app/src/
ingest_progress.rs:69, crates/kebab-cli/tests/wire_schema_breakdowns.rs:35
도 동일 stale list 보유. spec ACCEPT (round 1c) 의 grep boundary
밖이므로 본 round 미포함.

Refs: docs/superpowers/specs/2026-05-27-v0.20-sub1-bugfix2-spec.md
§4.3 / §4.3a.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 15:47:16 +00:00
a58ee10dfb fix(parse-pdf): strip Identity-H Unimplemented marker + dominance heuristic in compute_valid_char_ratio (Bug #6)
Why: metro-korea.pdf (Identity-H CID font without ToUnicode CMap) 의
ingest 가 pdf_ocr_pages=0 으로 잘못 종료. lopdf 0.32.0 의 emit
`?Identity-H Unimplemented?` marker 28 ASCII char 가 is_valid_text_char()
의 0x0020..=0x007E range 통과 → ratio=1.0 → OCR fallback 0.5
threshold bypass.

Change: MOJIBAKE_MARKERS const + compute_valid_char_ratio() 4-단계
(strip → trim-empty zero → dominance cap-0.3 → 기존 ratio). marker
list extensible. is_valid_text_char() 본체 변경 0.

Tests: +2 unit (dominance + minority) on top of 기존 8. parser_version
/ wire schema 변경 0.

Refs: docs/superpowers/specs/2026-05-27-v0.20-sub1-bugfix2-spec.md
§4.1 / §4.2 / §6 R-1.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 15:42:59 +00:00
e674ff474b fix(parse-pdf): F4 mojibake.pdf via pikepdf surgery; preserve 1-page invariant (Bug #4)
v0.20.0 sub-item 1 dogfood report 의 Bug #4 — F4 mojibake.pdf 의 lopdf
`get_pages()` count = 0 (Pages tree broken). root cause = 기존 byte-
level `re.sub` + manual startxref edit 가 lopdf strict load 통과시키지만
Pages dict 의 `/Kids` reference 깨짐.

- `tests/fixtures/_synth/mojibake.py`: full rewrite — replace byte-level
  `re.sub` + manual startxref with pikepdf open+inject-dummy-ToUnicode+
  del+save (auto xref regen). HYSMyeongJo-Medium CID font: CID font 이
  ToUnicode 를 자체 생성하지 않아 dummy stream 을 inject 후 strip
  (removed=1 invariant). Exit codes 2/3/4 for invariant fail.
- `crates/kebab-parse-pdf/tests/fixtures/mojibake.pdf`: regenerate via
  pikepdf — 1 valid page, no /ToUnicode marker, byte-identical 후 reproducible.
- `crates/kebab-parse-pdf/tests/snapshots/vector_pdf_canonical.json`:
  regen via 2-run cargo test pattern (hand-rolled unwrap_or_else baseline
  bootstrap, no insta crate).
- `crates/kebab-parse-pdf/tests/text_extractor_regression.rs`: append 3
  invariant test — (1) lopdf 1-page, (2) /ToUnicode marker absent,
  (3) PdfTextExtractor 1-block invariant.
- `crates/kebab-parse-pdf/src/text_quality.rs`: f4_fixture_ratio_under_threshold
  threshold 0.3 → 0.5 (production valid_ratio_threshold 기본값). 구 broken
  fixture (pages=0) 는 extract_text="" → ratio=0.0; 신 fixed fixture 는
  CID 2-byte fallback decode → ratio≈0.375 — 여전히 OCR trigger 조건 충족.

spec: docs/superpowers/specs/2026-05-27-v0.20-sub1-bugfix-spec.md (§5)
plan: docs/superpowers/plans/2026-05-27-v0.20-sub1-bugfix-plan.md (Step 4)
prior: 241ded5 (Step 3 integration test)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-27 14:02:17 +00:00
241ded59df test(app): multi-scanned PDF chunk_id collision-free integration test (Bug #3 regression)
v0.20.0 sub-item 1 bugfix Step 3 (Group C) — integration-level regression
for Bug #3 (intra-doc chunk_id collision under aggressive overlap).

- `crates/kebab-app/tests/common/mod.rs`: `pub mod mock_ocr;` 1 line append.
- `crates/kebab-app/tests/common/mock_ocr.rs` (new): MockOcrEngine lift +
  `single` / `per_page` ctor (backward-compat single + per-page cursor).
- `crates/kebab-app/tests/pdf_ocr_apply.rs`: inline MockOcrEngine 제거 +
  `mod common; use common::mock_ocr::MockOcrEngine;` import. 10 ctor call
  site migration (`MockOcrEngine { .. }` → `MockOcrEngine::single(...)`).
- `crates/kebab-app/tests/multi_scanned_pdf_ingest_no_chunk_id_collision.rs`
  (new): F1 + F2 scanned PDF + Bug #3 trigger shape (10 char "가" + ". " +
  500 char "나") via mock OCR. assertion: chunk_id global uniqueness (HashSet
  dedup) across F1 + F2; F2 trigger text produces ≥2 chunks (collision shape).
- C1 decision: Option A (share via tests/common/mock_ocr.rs). Facade mock
  injection unavailable (OllamaVisionOcr hardcoded) — helper-level chain test
  (apply_ocr_to_pdf_pages → PdfPageV1Chunker) adds value beyond unit B5.

spec: docs/superpowers/specs/2026-05-27-v0.20-sub1-bugfix-spec.md (§4.5)
plan: docs/superpowers/plans/2026-05-27-v0.20-sub1-bugfix-plan.md (Step 3)
prior: 436fd01 (Step 2 Bug #3 chunk_id fix)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-27 13:45:38 +00:00
436fd015a2 fix(chunk): chunk_id collision under aggressive overlap; bump pdf-page-v1 → pdf-page-v1.1 (Bug #3)
v0.20.0 sub-item 1 dogfood report 의 Bug #3 (Critical). scanned_page2.pdf
(1580 char OCR text) ingest 시 `chunks.chunk_id` PRIMARY KEY violation —
`per_chunk_hash = #c{char_start}` 가 post-overlap `actual_start` 사용 +
overlap walk floor 가 `prev_min` 으로 collapse → segment 1/2 동일 `#c0`.

- `crates/kebab-chunk/src/pdf_page_v1.rs`: `chunk_page` returns 4-tuple
  (segment_start, actual_start, chunk_end, slice); caller `per_chunk_hash`
  suffix uses `segment_start` (pre-overlap boundary, strictly increasing)
  instead of `char_start` (post-overlap, may collapse to prev_min).
- VERSION_LABEL `"pdf-page-v1"` → `"pdf-page-v1.1"` (design §9 cascade,
  explicit user-facing audit trail). `crates/kebab-app/tests/pdf_pipeline.rs:
  168, 368` 의 hardcoded literal 도 v1.1 로 갱신.
- module docs (`pdf_page_v1.rs:47-60`): workaround description 의
  `#c{char_start}` reference 를 `#c{segment_start}` 로 갱신 + segment_start
  invariant 명문 + HOTFIXES.md cross-ref.
- `pdf_page_v1.rs::tests`: `multi_chunk_page_with_aggressive_overlap_produces_unique_chunk_ids`
  regression pin (10 char "가" + ". " + 500 char "나" — multi-chunk +
  overlap walk collapse trigger).
- `tasks/HOTFIXES.md`: 2026-05-27 entry (symptom F2 1580 char OCR,
  intra-doc collision root cause, second-iteration patch rationale).

spec:  docs/superpowers/specs/2026-05-27-v0.20-sub1-bugfix-spec.md (§4)
plan:  docs/superpowers/plans/2026-05-27-v0.20-sub1-bugfix-plan.md (Step 2)
prior: d9acda5 (Step 1 Bug #2 walker fix)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 13:32:09 +00:00
d9acda517a fix(source-fs): apply size limit only to code files; PDF/image/markdown bypass walker cap (Bug #2)
v0.20.0 sub-item 1 dogfood report 의 Bug #2 — `[ingest.code].max_file_bytes`
가 walker 단계의 모든 file 에 일률 적용 → PDF/image/markdown 의 대부분 (256 KB+)
이 walker pre-extract skip. fix:

- `crates/kebab-source-fs/src/code_meta.rs`: `pub(crate) fn is_code_file(path)
  -> bool` helper 추가 (= `code_lang_for_path(path).is_some()`).
- `crates/kebab-source-fs/src/connector.rs:168-190`: walker size-cap check 가
  `is_code_file(&abs_path) && is_oversized(...)` short-circuit. PDF/image/
  markdown 는 walker bypass — parser 의 자체 size control (lopdf load_mem,
  image OCR max_pixels) 가 cover.
- `crates/kebab-source-fs/src/connector.rs` 기존 mod tests 안 추가:
  `size_cap_skips_only_code_files` — 300 KB PDF + MD + .rs 의 walker 결과
  검증. 기존 sibling test (huge.rs / longfile.rs, fixture 명 `.rs`) regression 0.

spec:  docs/superpowers/specs/2026-05-27-v0.20-sub1-bugfix-spec.md (§3)
plan:  docs/superpowers/plans/2026-05-27-v0.20-sub1-bugfix-plan.md (Step 1)
prior: b4d9e60 (PR #189)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 13:20:38 +00:00
b4d9e60816 chore(release): bump version 0.19.0 → 0.20.0 — v0.20.0 sub-item 1 scanned PDF OCR
# v0.20.0 — scanned PDF OCR via Ollama vision LLM

v0.20.0 의 핵심 변경 = embedded text 가 없는 scanned PDF (책 스캔, 영수증,
카메라 page) 의 OCR ingest. PoC 의 5 engine 비교 (Tesseract / EasyOCR /
PaddleOCR / gemma4:e4b / qwen2.5vl:3b) 에서 qwen2.5vl:3b 의 alnum 94.79%
(page1) / 81.56% (받침) 가 모든 다른 engine 을 능가 — 본 release 의 default
vision OCR.

## 1. OCR opt-in 사용법

`[pdf.ocr]` config 의 `enabled = true` 또는 `KEBAB_PDF_OCR_ENABLED=true` env
로 활성화. default off — OCR 한 page 당 45-100s (qwen2.5vl:3b on CPU,
remote Ollama) 의 cost 가 책 archive 외 비-OCR KB 에 부적합.

```toml
[pdf.ocr]
enabled = true
model = "qwen2.5vl:3b"
# 다른 default 는 README 참조
```

qwen2.5vl:3b 의 Ollama pull:

```bash
ollama pull qwen2.5vl:3b   # 3GB Ollama image
```

## 2. v0.19 indexed scanned PDF 의 force-reingest

v0.19 binary 로 scanned PDF 를 ingest 한 KB 는 자동으로 OCR path 진입 안
함 — parser_version "pdf-text-v1" 보존 (CLAUDE.md §Versioning cascade 의
trigger 회피 결정, H-4). 따라서 v0.20 binary upgrade + config
`pdf.ocr.enabled = true` 만 적용 시 try_skip_unchanged 의 Unchanged path 가
OCR 실행을 skip. 명시적 재처리:

```bash
kebab ingest --root /path/to/kb --force
```

## 3. DCTDecode-only v1 scope (FlateDecode / CCITTFax page 처리)

v0.20.0 의 PDF page image extract = lopdf 의 image XObject 의 /Filter ==
DCTDecode 만 cover (JPEG passthrough). 다른 encoding (FlateDecode raw
pixel, CCITTFaxDecode bilevel, JPXDecode JPEG2000) 은 warning event 발행 +
해당 page skip.

scanned PDF 의 일부 page 가 FlateDecode 또는 CCITTFax 로 encoded 시:

```bash
qpdf --object-streams=disable --recompress-flate input.pdf normalized.pdf
```

v1 의 의도 = single binary 원칙 (image crate 도입 0). v1.1+ 또는 별
sub-item 에서 multi-filter 지원 검토.

## 4. Family asymmetry (image OCR gemma4:e4b vs PDF OCR qwen2.5vl:3b)

image OCR (P6) 의 default 는 gemma4:e4b 그대로 (변경 0). PDF OCR (v0.20)
만 qwen2.5vl:3b. 사용자가 [image.ocr] model = "qwen2.5vl:3b" 으로 통일
가능 단 default 는 family asymmetric 보존.

## Dogfood + test 결과

- workspace test: 178 result lines, 0 failure.
- workspace clippy (-D warnings): exit 0.
- alnum e2e (real Ollama, manual invoke):
  - F1 (한국어 page1): 94.79% (≥ 0.85 threshold).
  - F2 (받침-intensive): 81.56% (≥ 0.70 threshold).
- integration smoke + vector PDF regression: pass.

## 변경된 surface

- new config: [pdf.ocr] (11 field) + 11 env override KEBAB_PDF_OCR_*.
- new wire: IngestEvent::PdfOcrStarted/Finished (additive minor).
- new wire: IngestItem.pdf_ocr_pages/ms_total (additive minor).
- new CLI line: "📷 OCR page N..." / "✓ OCR page N (chars chars, msms via ollama-vision)".
- new module: kebab-parse-pdf::{page_image, text_quality} + kebab-app::pdf_ocr_apply.
- dep: workspace lopdf = "0.32" 통합.
- fixture: 5 PDF (F1/F2/F4/F6/F7) under crates/kebab-parse-pdf/tests/fixtures/.

## 변경되지 않은 surface (invariant)

- Extractor::extract trait body byte-identical (PR #187).
- PdfTextExtractor body 변경 0 — post-extract enrichment pattern 으로 분리.
- parser_version "pdf-text-v1" 보존.
- chunker_version "pdf-page-v1" 보존.
- workspace.dependencies 의 production dep graph 변경 0 (-e normal baseline 보존).

## sub-item 의 11 commit history

9d7faab Step 1: foundation + cargo tree baselines
aeeff36 Step 2: lopdf /Filter probe + 5 fixture commit (F1/F2/F4/F6/F7)
fb3952d Step 2 fix: F7 conversion engine record correction
c2cd3a7 Step 3: page_image + text_quality modules (10 test)
8d81bc1 Step 3 fix: clippy pedantic in page_image
9f003ef Step 4: pdf_ocr_apply helper (10 test, F7 split + cancel)
fd918a6 Step 5: [pdf.ocr] config section + PdfOcrOpts doc
4672cba Step 5 fix: clippy::bool_assert_comparison in pdf_ocr tests
b9ee09f Step 6: wire PDF OCR enrichment + cancel propagation
4c5ccd5 Step 7: wire schema additive — IngestEvent + IngestItem + skipped
c9e0594 Step 8: CLI printer activation + ingest_progress test + spec literal
4819768 Step 9: integration smoke + vector regression + alnum e2e
1d4e301 Step 9 follow-up: Cargo.lock for dev-dep additions
90726ab Step 10: docs sync (README + HANDOFF + ARCHITECTURE + SMOKE)

## § Acceptance §9 verifier evidence

K5 의 15 row scriptable verifier 모두 green (또는 manual real-Ollama row 의 결과 보고):
- Row #4 (vector PDF byte-identical): pass.
- Row #5 (Extractor::extract trait byte-identical): 0 line diff.
- Row #6 (wire schema additive): jq + diff exit 0.
- Row #7-#8 (clippy / workspace test): exit 0.
- Row #9-#10 (dep graph baseline -e normal): empty diff.
- Row #11 (docs sync): grep evidence.
- Row #12 (version bump): "0.20.0" + Cargo.lock cascade ≥ 22.
- Row #14 (PR #187 invariant): extract_for(&asset.media_type) ≥ 1.
- Row #15 (DCTDecode-only v1, F6/F7 skip): test green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 11:03:44 +00:00
90726ab283 docs(v0.20): sync README + HANDOFF + ARCHITECTURE + SMOKE for scanned PDF OCR (post-extract enrichment, qwen2.5vl:3b, DCTDecode-only v1)
Step 10 (Group J) of v0.20.0 sub-item 1 (scanned PDF OCR) plan.

J0 — release notes path decision: commit body (RELEASE_NOTES.md /
docs/RELEASE_NOTES_*.md 부재, v0.17.x/v0.18.0 patterns 의 commit body
release notes 형식 따름). Step 11 K1 commit body 안 inline.

J1 — README.md:
- Configuration section 의 toml table list 에 `[pdf.ocr]` 추가.
- 새 sub-section `### [pdf.ocr] — scanned PDF OCR (v0.20.0+)`: 11 field
  toml example + `KEBAB_PDF_OCR_*` 11 env override + force-reingest UX
  ("v0.19 indexed scanned PDF 가 v0.20 upgrade 후 자동 OCR 미적용,
  `kebab ingest --force` 필요").

J2 — HANDOFF.md:
- phase status P7 row 확장: 3/3 component + post-extract OCR enrichment
  (v0.20.0 sub-item 1, qwen2.5vl:3b vision LLM).
- "머지 후 발견된 결정" entry: v0.20 sub-item 1 의 design + scope
  (H-1 post-extract pattern + DCTDecode-only v1 + parser_version 보존 + H-4 UX).

J3 — docs/ARCHITECTURE.md:
- OCR row 분리: `OCR (image)` (gemma4:e4b 그대로) + `OCR (PDF, v0.20.0+)`
  (qwen2.5vl:3b, post-extract enrichment via kebab-app::pdf_ocr_apply,
  DCTDecode-only v1, family asymmetry — PoC alnum 94.79% vs gemma4 27%).
- PDF parser row 확장: page_image::extract_dctdecode_page_image (v0.20.0) +
  parser_version "pdf-text-v1" 보존 + provenance event 차별화.

J3 — docs/SMOKE.md:
- `[pdf.ocr]` 격리 config example (enabled=true, model=qwen2.5vl:3b).
- 새 dogfood section `### v0.20 force-reingest (scanned PDF OCR)`:
  v0.19 → v0.20 upgrade path 의 명시적 `kebab ingest --force` invoke.

J4 — release notes draft (Step 11 K1 commit body 의 source):
- result file 안 record (4 topic: opt-in + force-reingest + DCTDecode-only +
  family asymmetry + dogfood/test 결과).

spec:  docs/superpowers/specs/2026-05-27-pdf-scanned-ocr-spec.md (§6.4)
plan:  docs/superpowers/plans/2026-05-27-pdf-scanned-ocr-plan.md (Step 10 J0-J4)
prior: 1d4e301 (Step 9 + Cargo.lock follow-up)
contract: §9

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 10:34:24 +00:00
1d4e301e5e chore(deps): Cargo.lock for Step 9 dev-dep additions (strsim + kebab-parse-image)
Step 9 (commit 4819768) 에서 추가된 dev-dep (strsim 0.11 + kebab-parse-image
path) 의 Cargo.lock cascade. worker 가 명시적 commit 에 포함 안 함 — follow-up
commit 으로 lock 동기화.

dep graph baseline (-e normal) 영향 0 (dev-dep 만 추가).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 10:12:38 +00:00
48197687b7 test(pdf): integration smoke (w/ search + cancel) + vector regression + alnum e2e (#[ignore]) for v0.20 sub-item 1
Step 9 (Group I) of v0.20.0 sub-item 1 (scanned PDF OCR) plan.

I3 — crates/kebab-app/tests/ingest_pdf_ocr_smoke.rs (신규):
- ingest_with_mock_ocr_yields_pdf_ocr_summary — `#[ignore]` real Ollama,
  ingest_with_config production path + IngestItem.pdf_ocr_pages verify.
- ocr_text_indexed_and_searchable — `#[ignore]` real Ollama, app.search
  의 OCR text indexed verify (§ Acceptance #2).
- ingest_with_cancel_aborts_mid_pdf — production cancel chain (pre-set
  cancel=true + dummy endpoint, no panic/deadlock verify).

I4 — crates/kebab-parse-pdf/tests/text_extractor_regression.rs (신규):
- vector_pdf_extract_byte_identical_to_baseline — F4 mojibake.pdf 의 vector
  PDF path canonical 의 byte-identical 보존 (Step 1-8 모든 변경 전후 invariant).
- baseline 신규 = tests/snapshots/vector_pdf_canonical.json (first run create).
- normalize_provenance_timestamps inline helper (R-3 mitigation, workspace
  전체 부재 — 신규 12-line).

I5 — crates/kebab-parse-pdf/tests/ocr_e2e.rs (신규):
- f1_alnum_accuracy_ge_85 / f2_alnum_accuracy_ge_70 — `#[ignore]` real
  Ollama qwen2.5vl:3b, § Acceptance §9 #3 의 implementation.
- alnum metric = strsim::levenshtein (dev-dep 추가).
- truth file copy from PoC scratch (page1.txt + page2-batchim.txt) →
  scanned_page1_truth.txt + scanned_page2_truth.txt.
- kebab-parse-image dev-dep 추가 (OllamaVisionOcr::from_parts 호출용).
  parser isolation invariant 의 dev-dep exception (spec §3.1, dep graph
  baseline -e normal 보존).

spec:  docs/superpowers/specs/2026-05-27-pdf-scanned-ocr-spec.md
plan:  docs/superpowers/plans/2026-05-27-pdf-scanned-ocr-plan.md (Step 9 I3+I4+I5)
prior: c9e0594 (Step 8 CLI printer)
contract: §9

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-27 10:10:58 +00:00
c9e05941c5 feat(cli): activate per-page PDF OCR progress printer + test(app): ingest_progress emit verify + spec(pdf-ocr): align §4.6.1 literal with option_A (ms/chars)
Step 8 (Group H) of v0.20.0 sub-item 1 (scanned PDF OCR) plan +
Step 7 reviewer concern fix (spec literal deviation).

H1 — kebab-cli/src/progress.rs printer activation:
- 구 no-op stub `IngestEvent::PdfOcr* { .. } => {}` (Step 6 placeholder)
  를 사람-친화 stderr line printer 로 활성화.
- spec §4.6.1 line 1085-1086 wording 그대로:
  - PdfOcrStarted → `  📷 OCR page {page}...`
  - PdfOcrFinished (skipped=false) → `  ✓ OCR page {page} ({chars} chars, {ms}ms via {ocr_engine})`
  - PdfOcrFinished (skipped=true)  → `  ⊘ OCR page {page} skipped (no DCTDecode or engine fail, {ms}ms)` (M-4 의 skipped field carry 활용)
- `!quiet` gate 정합 (AssetStarted/Finished pattern mirror).

H2 — crates/kebab-app/tests/ingest_progress.rs 의 새 test:
- pdf_ocr_progress_emits_started_finished_events (real Ollama 의존, `#[ignore]`).
- F1 fixture (scanned_page1.pdf) ingest 시 pdf_ocr_started + pdf_ocr_finished
  event 가 emit 됨을 verify. Started count == Finished count invariant.
- Manual invoke: `KEBAB_PDF_OCR_ENABLED=true cargo test -p kebab-app --test
  ingest_progress --ignored`.
- mock OcrEngine inject path 부재 (Step 6 의 eager build), Step 9 I5 의
  ocr_e2e pattern (real Ollama + `#[ignore]`) 와 동일.

Step 7 reviewer concern fix — spec §4.6.1 literal:
- line 1076-1077 의 `ocr_ms` / `ocr_chars` literal 을 wire schema 의 실제
  field name `ms` / `chars` (option_A, Rust serde 와 정합) 로 갱신.
- line 1087 의 printer wording 도 `{ocr_chars}` / `{ocr_ms}` → `{chars}` / `{ms}`.
- line 1556 의 rationale 참조 `pdf_ocr_finished.ocr_ms` → `.ms`.
- `skipped` field 도 명시 (Step 6 reviewer M-4 결과).

spec:  docs/superpowers/specs/2026-05-27-pdf-scanned-ocr-spec.md (§4.6.1)
plan:  docs/superpowers/plans/2026-05-27-pdf-scanned-ocr-plan.md (Step 8 H1+H2)
prior: 4c5ccd5 (Step 7 wire schema) — Step 7 reviewer concern 1 의 fix
contract: §9 (additive minor wire bump — Step 7 commit 에서 완료)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-27 09:18:49 +00:00
4c5ccd5447 feat(wire): additive minor — IngestEvent kind 의 pdf_ocr_* + ingest_report.items[] 의 pdf_ocr_pages/ms_total + skipped field carry (Step 6 M-4/M-2)
Step 7 (Group G) of v0.20.0 sub-item 1 (scanned PDF OCR) plan +
Step 6 code reviewer Important M-4 (skipped field carry) + Minor M-2
(ordering invariant doc) fix.

G3 — JSON Schema sync (additive minor — schema_version 보존):

ingest_progress.schema.json:
- kind enum 2 추가: pdf_ocr_started + pdf_ocr_finished.
- 새 field: page (1-based PDF page), ocr_engine (engine_name), skipped (bool).
- 기존 ms / chars field 의 description 갱신 (pdf_ocr_finished carry 추가).

ingest_report.schema.json:
- items.items.properties 신규 정의 (이전 stub ["array", "null"] 만).
- pdf_ocr_pages + pdf_ocr_ms_total (nullable integer).
- 모든 기존 IngestItem field 도 명시화 (kind, doc_path, byte_len, ...).

Step 6 reviewer M-4 (Important) — skipped field carry:
- IngestEvent::PdfOcrFinished 에 skipped: bool 추가.
- ingest_one_pdf_asset 의 emit closure (lib.rs:~1864) 가 source
  PdfOcrProgress::Finished { skipped } 를 discard 않고 propagate.

Step 6 reviewer M-2 (Minor) — ordering invariant doc:
- crates/kebab-app/src/ingest_progress.rs 의 ordering text 갱신:
  ScanStarted < ScanCompleted < (AssetStarted [< (PdfOcrStarted <
  PdfOcrFinished)*] < AssetFinished)* < (Completed | Aborted).

.md doc (docs/wire-schema/v1/*.md) 부재 — plan §3 Step 7 G3 의 .md
deliverable retro N/A (해당 file 0).

spec:  docs/superpowers/specs/2026-05-27-pdf-scanned-ocr-spec.md
plan:  docs/superpowers/plans/2026-05-27-pdf-scanned-ocr-plan.md (Step 7 G3)
prior: b9ee09f (Step 6 wiring) + Step 6 reviewer M-4/M-2 권고
contract: §9 (additive minor wire bump — schema_version 보존)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-27 08:51:51 +00:00
b9ee09f176 feat(app): wire PDF OCR enrichment + cancel propagation into ingest_one_pdf_asset (H-5 eager init + post-extract hook + per-page cancel) + workspace lopdf dep (Step 4 M-4)
Step 6 (Group E) of v0.20.0 sub-item 1 (scanned PDF OCR) plan +
Step 7 spillover (IngestEvent variant + IngestItem field for compile
boundary) + Step 4 reviewer Minor M-4 fix.

E1 — eager PDF OCR engine build at `ingest_with_config_opts` entry,
mirror of image OCR pattern (lib.rs:338-347). `pdf.ocr.enabled ||
always_on` 시 `OllamaVisionOcr::from_parts(endpoint, model, ...)` 호출
+ fail-fast `?`. App field 추가 0 (local var only, spec L-1 / Step 1
A1 cosmetic fix 정합).

E2 — `ingest_one_pdf_asset` signature extension: +3 param
(`pdf_ocr_engine: Option<&OllamaVisionOcr>`, `progress: Option<&
mpsc::Sender<IngestEvent>>`, `cancel: Option<&Arc<AtomicBool>>`).
`ingest_one_asset` dispatch wrapper + caller (dispatch loop) update.

E3 — post-extract enrichment block at `extract_for` 직후 (line 1779).
`pdf.ocr.enabled || always_on` 시 `apply_ocr_to_pdf_pages` 호출,
PdfOcrProgress → IngestEvent emit (PdfOcrStarted / PdfOcrFinished
with ocr_engine), summary 의 pages_ocrd/ms_total 을 IngestItem field
로 carry. PR #187 registry dispatch invariant 보존
(`extract_for(&asset.media_type, ...)` 그대로).

E4 — cancel handle propagation: ingest_with_config_cancellable →
IngestOpts.cancel → ingest_with_config_opts → ingest_one_asset →
ingest_one_pdf_asset (new `cancel` param) → PdfOcrOpts.cancel chain.
spec §4.8 line 1159 production wiring.

Step 7 spillover (compile boundary):
- `kebab_app::ingest_progress::IngestEvent`: PdfOcrStarted { page } +
  PdfOcrFinished { page, ms, chars, ocr_engine }. serde discriminant
  `pdf_ocr_started` / `pdf_ocr_finished` (Step 7 G3 wire schema 와 일치).
- `kebab_core::IngestItem`: pdf_ocr_pages: Option<u32> +
  pdf_ocr_ms_total: Option<u64> (warnings/error 사이). 11 non-PDF
  IngestItem construct site 가 `None` 채움.
- `kebab-cli/src/progress.rs` + `kebab-tui/src/ingest_progress.rs`:
  새 variant no-op handler (v1에서 per-page progress 미노출, future
  refinement 시 활성화 가능).
- `kebab-store-sqlite/tests/ingest_report_snapshot.rs` + snapshot
  `ingest_report.snapshot.json`: 2 IngestItem fixture 의 새 field 추가.
- Step 7 의 JSON Schema 갱신 + CLI printer activation + snapshot
  regenerate 는 별 commit (G3/H1/H2 deliverable).

M-4 (Step 4 reviewer Minor) — lopdf workspace dep 통합:
- workspace `Cargo.toml [workspace.dependencies] lopdf = "0.32"`.
- kebab-app + kebab-parse-pdf 의 direct dep → `{ workspace = true }`.

Verifier evidence:
- workspace test (`cargo test --workspace --no-fail-fast -j 1`):
  175 test result summary lines, 0 failures, 0 FAILED.
- workspace clippy (`-D warnings`): exit 0, 0 warning.
- dep graph baseline (`.omc/state/pdf-ocr-{parse-pdf,app-parse}-deps.baseline.txt`):
  empty diff for both.

spec:  docs/superpowers/specs/2026-05-27-pdf-scanned-ocr-spec.md (§4.4 + §4.6 + §4.8)
plan:  docs/superpowers/plans/2026-05-27-pdf-scanned-ocr-plan.md (Step 6 E1-E4 + Step 7 partial G1+G2)
prior: 4672cba (Step 5 fix) + fd918a6 (Step 5) + 9f003ef (Step 4 helper)
contract: §9 (additive minor wire bump — Step 7 JSON Schema 완료 시)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 08:18:34 +00:00
4672cba6c6 fix(config): satisfy clippy::bool_assert_comparison in pdf_ocr tests
fd918a6 의 F2 test file (crates/kebab-config/tests/pdf_ocr.rs) 의 4 line
`assert_eq!(bool_field, true|false)` 가 workspace clippy pedantic
의 `bool_assert_comparison` 위반 → CI gate
`cargo clippy --workspace --all-targets -- -D warnings` exit 1.

각 assertion 의 canonical form 적용:
- assert_eq!(x, false) → assert!(!x)
- assert_eq!(x, true)  → assert!(x)

semantic + behavior 동일, 4 line edit, logic 변경 0.

review trail:
- impl result: .omc/reviews/2026-05-27-pdf-ocr-step-05-impl-result.md
- spec review: .omc/reviews/2026-05-27-pdf-ocr-step-05-spec-review-result.md (I-1)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 07:17:46 +00:00
fd918a60ce feat(config): add [pdf.ocr] section — qwen2.5vl:3b default, opt-in + env overrides + doc(app): PdfOcrOpts field doc (Step 4 I-1)
Step 5 (Group F) of v0.20.0 sub-item 1 (scanned PDF OCR) plan +
Step 4 reviewer Important I-1 fix (PdfOcrOpts field doc) 동봉.

F1 — `kebab-config::PdfCfg` + `PdfOcrCfg` + 4 default fn:
- PdfCfg { ocr: PdfOcrCfg }.
- PdfOcrCfg with 11 field (enabled/always_on/engine/model/endpoint/
  languages/max_pixels/request_timeout_secs/valid_ratio_threshold/
  min_char_count/lang_hint).
- defaults: opt-in (enabled=false), qwen2.5vl:3b, 0.5 threshold, 20 char.
- mirror of image OCR cfg pattern (spec §4.5).

Config struct extension:
- `pdf: PdfCfg` field with `#[serde(default = "PdfCfg::defaults")]`.

11 env var override (parallel to KEBAB_IMAGE_OCR_*):
KEBAB_PDF_OCR_{ENABLED,ALWAYS_ON,ENGINE,MODEL,ENDPOINT,LANGUAGES,
MAX_PIXELS,REQUEST_TIMEOUT_SECS,VALID_RATIO_THRESHOLD,MIN_CHAR_COUNT,
LANG_HINT}.

F2 — `crates/kebab-config/tests/pdf_ocr.rs` (신규):
- toml roundtrip (11 field).
- defaults (opt-in + qwen2.5vl:3b).
- env override (4 key sample + default preservation).

F3 (Step 4 I-1) — `pdf_ocr_apply.rs` 4 public item 의 doc comment:
- PdfOcrOpts struct + 6 field.
- PdfOcrSummary struct + 2 field.
- apply_ocr_to_pdf_pages fn (Errors block 포함).
- PdfOcrProgress enum + 2 variant + 5 field.
body 변경 0, doc-only.

spec:  docs/superpowers/specs/2026-05-27-pdf-scanned-ocr-spec.md (§4.5)
plan:  docs/superpowers/plans/2026-05-27-pdf-scanned-ocr-plan.md (Step 5 F1+F2)
prior: 9f003ef (Step 4) — code reviewer Important I-1 resolution
contract: §9 (additive minor wire bump — Step 7)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 07:07:18 +00:00
9f003ef1cd feat(app): add pdf_ocr_apply helper (10 test, F7 split + cancel) — post-extract OCR enrichment for PDF (H-1 resolution)
Step 4 (Group D) of v0.20.0 sub-item 1 (scanned PDF OCR) plan.

D1 — `apply_ocr_to_pdf_pages(&mut canonical, &dyn OcrEngine, &bytes, &opts, emit_progress)`
in `kebab-app::pdf_ocr_apply`. spec §4.1 line 381-599 body 그대로 +
PdfOcrOpts.cancel field + per-page cancel check (verifier LOW L-1).

post-extract enrichment pattern (H-1 resolution): kebab-parse-pdf 가
kebab-parse-image::OcrEngine 을 import 하지 않음 (parser isolation 보존).
helper 가 kebab-app 의 facade 안 — both parser crate 의 cross-import 회피.

Per-page decision matrix (spec §4.1 line 459-464):
- always_on=true → 모든 page OCR (dual-block, ordinal = page-1 + page_count).
- always_on=false + needs_ocr → in-place OCR (text-detect block mutate).
- needs_ocr=false → skip.

DCTDecode-only v1 (H-3): FlateDecode / CCITTFaxDecode page 는
extract_dctdecode_page_image=None → Warning event + skip + emit_progress(skipped=true).

OcrEngine.recognize 실패 → Warning event + skip + emit_progress(skipped=true).

D3 — per-page cancel handle (verifier LOW L-1 + spec §4.8 line 1159):
PdfOcrOpts.cancel: Option<Arc<AtomicBool>>. set→true 시
`anyhow::bail!("PDF OCR cancelled mid-PDF at page N")`.

lopdf = "0.32" added to [dependencies] (already transitive via kebab-parse-pdf;
no new crate introduced — dep graph kebab-parse-* baseline unchanged).

Integration test (`tests/pdf_ocr_apply.rs`, 10 test):
- f1_input_with_ocr_enabled_replaces_empty_block — in-place mutate.
- f3_input_with_ocr_enabled_keeps_text_detect_blocks — vector PDF skip.
- f1_input_with_ocr_disabled_keeps_empty_block — disabled no-op.
- f4_input_with_ocr_enabled_replaces_mojibake_block — mojibake → in-place mutate.
- f3_input_with_always_on_pushes_dual_blocks — always_on dual-block.
- f6_flatedecode_skipped_with_warning — FlateDecode skip + Warning event.
- f7_ccittfax_skipped_with_warning — CCITTFax skip + Warning event (verifier M-4 split).
- ocr_engine_failure_surfaces_as_warning — OCR failure → Warning event.
- dual_block_ordinals_are_deterministic_and_unique — ordinal invariant.
- cancel_handle_aborts_mid_pdf — cancel handle 의 production source (D3).

MockOcrEngine fixture: spec §5.5 line 1284-1299. F3 fixture 부재 →
mock CanonicalDocument construction + F1 bytes reuse pattern (Option B:
PdfTextExtractor::extract 를 통한 실제 production path canonical 생성).

spec:  docs/superpowers/specs/2026-05-27-pdf-scanned-ocr-spec.md (§4.1 + §5.5)
plan:  docs/superpowers/plans/2026-05-27-pdf-scanned-ocr-plan.md (Step 4 D1+D2+D3)
prior: c2cd3a7 (Step 3) + 8d81bc1 (Step 3 clippy fix)
contract: §9 (additive minor wire bump — 후속 step)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-27 06:42:01 +00:00
8d81bc1071 style(parse-pdf): satisfy clippy pedantic in page_image (uninlined_format_args + map_unwrap_or)
c2cd3a7 의 `extract_dctdecode_page_image` 에 workspace clippy pedantic 위반 2 건
잔존 → CI gate (cargo clippy --workspace --all-targets -- -D warnings) fail.
두 lint 모두 1-line edit + semantic 동일, logic 변경 0.

- L20  uninlined_format_args: format!("page {} not in get_pages()", page_num)
                              → format!("page {page_num} not in get_pages()")
- L48-52 map_unwrap_or:        .map(|n| n == b"Image").unwrap_or(false)
                              → .is_some_and(|n| n == b"Image")

cargo clippy --workspace --all-targets -j 4 -- -D warnings → exit 0.
cargo test -p kebab-parse-pdf -j 4 → 21 passed (regression 0).

review trail:
- spec review: .omc/reviews/2026-05-27-pdf-ocr-step-03-spec-review-result.md (SPEC_COMPLIANT)
- code review: .omc/reviews/2026-05-27-pdf-ocr-step-03-code-review-result.md (Critical C-1)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 06:14:00 +00:00
c2cd3a7ab7 feat(parse-pdf): add page_image (DCTDecode passthrough, 2 test) + text_quality (valid char ratio, 8 unit test) modules
Step 3 (Group C) of v0.20.0 sub-item 1 (scanned PDF OCR) plan.

C1 — `page_image::extract_dctdecode_page_image(pdf_doc, page_num)` ->
Result<Option<Vec<u8>>>. lopdf 의 Resources/XObject traverse, 첫 image
XObject 의 /Filter 검사 (single Name OR Array form 모두 cover, spec §4.1
line 642-664), DCTDecode + JPEG magic 검증 통과 시 raw bytes 반환. 다른
encoding 또는 image XObject 부재 시 Ok(None). v1 scope = DCTDecode
passthrough only (H-3 invariant, image crate 도입 0).

Integration test (`tests/page_image.rs`, 2 test):
- f1_fixture_yields_dctdecode_jpeg_bytes — F1 fixture happy path.
- flate_raw_fixture_yields_none — F6 fixture negative path.

C2 — `text_quality::compute_valid_char_ratio(s) -> f32`. valid char =
ASCII printable + Hangul (Jamo/Compatibility/Syllables) + CJK + Latin
Extended + common Korean punctuation. 빈 string → 0.0. caller
(`kebab-app::pdf_ocr_apply`) 가 threshold 와 비교 (default 0.5).

Unit test (`mod tests`, 7 + F4 conditional):
- empty / pure ASCII / pure Hangul / pure PUA / mixed half / CJK / Hangul Jamo.
- f4_fixture_ratio_under_threshold: active (case A — lopdf extract_text 가
  ToUnicode CMap 부재 시 빈 string 반환 → valid_ratio = 0.0000 < 0.3).

Also: Cargo.toml description 갱신 ("Text PDF extractor + scanned-page
image extract helpers ...", Step 1 A2 이연분).

fixture fix: mojibake.pdf 의 startxref 22130 → 22114 (16-byte offset 오차
수정 — lopdf strict parser 가 xref 를 찾지 못하는 버그 해결).

spec:  docs/superpowers/specs/2026-05-27-pdf-scanned-ocr-spec.md (§4.1 line 600-722)
plan:  docs/superpowers/plans/2026-05-27-pdf-scanned-ocr-plan.md (Step 3 C1+C2)
prior: aeeff36 (Step 2 fixtures) + fb3952d (Step 2 F7 record fix)
contract: §9 (additive minor wire bump — 후속 step)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 05:59:10 +00:00
fb3952d54f docs(pdf-ocr): correct F7 conversion engine record in PoC doc (gs, not ImageMagick)
aeeff36 의 PoC doc append (engine-comparison.md L134, L141) 가 F7 (`ccitt.pdf`)
의 conversion engine 을 "ImageMagick `convert -compress Group4`" 로 기록했으나,
실제 tests/fixtures/_synth/flate_ccittfax.sh:77-83 은
`gs -sDEVICE=pdfwrite -dMonoImageFilter=/CCITTFaxEncode -dEncodeMonoImages=true`
flag 사용 (ImageMagick `convert` 호출 0회).

fixture binary (`/Filter [ /CCITTFaxDecode ]`, 2060 bytes) 는 invariant 충족 OK
(Step 2 spec compliance + code quality review verified). historical record 의
factual correction only.

review trail:
- impl result: .omc/reviews/2026-05-27-pdf-ocr-step-02-impl-result.md
- spec review: .omc/reviews/2026-05-27-pdf-ocr-step-02-spec-review-result.md
- code review: .omc/reviews/2026-05-27-pdf-ocr-step-02-code-review-result.md (I1)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 05:36:56 +00:00
aeeff3635b poc+test(pdf-ocr): lopdf /Filter probe + 5 fixture commit (F1/F2/F4/F6/F7) for v0.20 sub-item 1
Step 2 (Group B) of v0.20.0 sub-item 1 (scanned PDF OCR) plan.

B1 — lopdf /Filter probe (Python re + shell grep on synthesized fixtures,
result appended to docs/superpowers/poc/2026-05-27-pdf-ocr-engine-comparison.md).

Key findings:
- reportlab default (useA85=1) yields /Filter [ /ASCII85Decode /DCTDecode ];
  useA85=0 gives pure /Filter [ /DCTDecode ] with JPEG magic ffd8ffe0.
- Pillow RGB.save('.pdf','PDF') uses DCTDecode — F6 FlateDecode requires
  manual PDF construction via zlib.compress.
- ghostscript pdfwrite rejects TIFF input (/undefined in II*) —
  ImageMagick `convert -compress Group4` used for F7 CCITTFax.

B2 — 5 fixture 합성·commit under crates/kebab-parse-pdf/tests/fixtures/:
- F1 scanned_page1.pdf — /Filter [ /DCTDecode ], JPEG magic ffd8ffe0 (page1-clean.png, 한국어).
- F2 scanned_page2.pdf — /Filter [ /DCTDecode ], JPEG magic ffd8ffe0 (page2-clean.png, 받침).
- F4 mojibake.pdf       — DejaVu TTF + ToUnicode CMap stripped (count=0);
                          Noto CJK TTC has PostScript outlines unsupported by reportlab.
- F6 flate_raw.pdf      — /Filter /FlateDecode, DCTDecode absent (skip path input).
- F7 ccitt.pdf          — /Filter [ /CCITTFaxDecode ], DCTDecode absent (skip path input).

Synth scripts under tests/fixtures/_synth/:
- scanned_pdf.py    — F1/F2 reportlab drawImage + JPEG passthrough (useA85=0).
- mojibake.py       — F4 reportlab DejaVu TTF + ToUnicode strip.
- flate_ccittfax.sh — F6 manual zlib PDF + F7 Pillow TIFF group4 + ImageMagick convert.

spec:  docs/superpowers/specs/2026-05-27-pdf-scanned-ocr-spec.md (§5.1)
plan:  docs/superpowers/plans/2026-05-27-pdf-scanned-ocr-plan.md (Step 2 B1+B2)
contract: §9 (additive minor wire bump — 후속 step)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 04:04:47 +00:00
9d7faab650 docs+chore(plan-bootstrap): apply spec L-1 cosmetic fix + capture cargo tree baselines for v0.20 sub-item 1 verifier gates
Step 1 (Group A) of v0.20.0 sub-item 1 (scanned PDF OCR) implementation plan.

A1 — spec §4.2 line 740 prose pseudo-code fix: `app.pdf_ocr_engine.as_ref()`
→ local `pdf_ocr_engine: Option<OllamaVisionOcr>` built in
`ingest_with_config_opts` (정합 with §4.4 eager init, App field 도입 0).

A2 — Cargo.toml dep invariant verified (image crate 미도입 — H-3 DCTDecode-only
v1 invariant 보존; kebab-parse-pdf + kebab-parse-image 가 kebab-app 의 기존
dep). description 갱신은 Step 3 (module 추가 후) 으로 이연.

A3 — cargo tree baseline 캡처 — K5 row #9/#10 의 ground-truth
(.omc/state/pdf-ocr-{app-parse,parse-pdf}-deps.baseline.txt). 본 sub-item
의 다른 step 의 dep graph 변경 0 invariant 의 verifier 의 baseline.
Note: .omc/ 는 .gitignore 대상 — baseline files 는 로컬 파일로 존재.

spec:  docs/superpowers/specs/2026-05-27-pdf-scanned-ocr-spec.md
plan:  docs/superpowers/plans/2026-05-27-pdf-scanned-ocr-plan.md (round 1c ACCEPT)
contract: §9 (additive minor wire bump — 후속 step)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 03:40:49 +00:00
bcd1e37dab chore(repo): .omc/ ignore + AGENTS·GEMINI symlinks + release notes 작성 가이드 강화
- .gitignore: .omc/ (OMC state directory) 추가 — .claude/, .superpowers/ 와 동급
- AGENTS.md / GEMINI.md: CLAUDE.md 로의 symlink — Codex / Gemini CLI 도 동일 지침 따르도록
- CLAUDE.md release 절차: release notes 가 commit subject 단순 나열 대신 사용자 친절한 설명 + 도그푸딩/테스트 결과 포함하도록 가이드 강화

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-26 23:58:04 +00:00
e7a4330798 Merge pull request 'docs: v0.20 image+pdf handoff + sub-item 3 spec/plan backfill' (#188) from docs/v0-20-image-pdf-handoff into main
Reviewed-on: #188
2026-05-26 23:36:49 +00:00
574e1b1ca1 docs: v0.20 image+pdf handoff + sub-item 3 spec/plan backfill
v0.19.0 release 후 다음 session 인계용 handoff 문서 + 사후 backfill.

- docs/superpowers/handoffs/2026-05-26-v0.20-image-pdf-normalize-handoff.md (540 lines, 9 section)
  - sub-item 1/2/3 머지 결과 + 도그푸딩 baseline (1781 doc / 9050 chunks) + user memory + OMC workflow + 빌드 환경
  - 현재 구현 상태 (v0.19.0, image+pdf) — 정확한 file:line + struct/fn signature + flow
  - 8 TODO 상세 (problem + scope + affected files + risk + trigger 조건)
  - 우선순위 + sequencing 권장 + 새 session 첫 단계 제안

- docs/superpowers/specs/2026-05-26-extractor-dispatch-unification-spec.md (sub-item 3 spec)
- docs/superpowers/plans/2026-05-26-extractor-dispatch-unification-plan.md (sub-item 3 plan)

PR #187 머지 시 source code 만 들어가고 spec/plan 누락 — 동일 PR 의 reference link 가 main 에서 404. 본 commit 으로 backfill.

Assisted-by: Claude Code
2026-05-26 23:34:17 +00:00
c1e82cca92 Merge pull request 'refactor(app): extract dispatch polymorphism — App.extract_for(...) + 11 Extractor registry' (#187) from refactor/extractor-dispatch-unification into main
Reviewed-on: #187
2026-05-26 21:07:20 +00:00
2c05dbd0dd refactor(app): extract dispatch polymorphism — App.extract_for(...) + 11 Extractor registry
kebab-app 의 hardcoded extract dispatch (`ImageExtractor` + `PdfTextExtractor` + 9 AST `*Extractor` 의 `::new().extract(…)` callsite 11곳 + 9 AST arm match) 를 `App::extract_for(&MediaType, &ExtractContext, &[u8])` 단일 polymorphic call 로 통합. trait 변경 0, parser source 변경 0, wire schema 변경 0 (success path).

핵심 변경:
- App struct 에 `pub(crate) extractors: Vec<Box<dyn Extractor + Send + Sync>>` field + `pub(crate) fn extract_for(...)` helper method.
- App::open_with_config 의 registry init = 11 Extractor (image + pdf + 9 AST).
- ImagePipeline struct 의 `extractor: &'a ImageExtractor` field 제거 + lib.rs:356 local + lib.rs:1235 alias 삭제 (atomic block).
- 9 AST arm (lib.rs:2012-2047 의 12 arm = 11 explicit + 1 wildcard) → 4 arm (9 AST grouped + 7 manifest + 1 shell + 1 other-bail).
- in-crate unit test (app.rs 의 `mod tests_extractor_dispatch`) 3 class: registry length 11 / mutually-exclusive supports() grid (16 sample MediaType) / extract_for error path (Audio).

scope = AST 9-arm + image + pdf extract callsite only. MarkdownExtractor / Tier 2/3 / outer 4-arm / inner 4 match / Chunker dispatch 모두 future-defer (별 PR — spec §11).

Wire schema (success path) 변경 0 — ingest_report.v1 / search_response.v1 / answer.v1 byte-identical (4-medium SMOKE 비교 검증). error.v1.message 의 internal context string wording 변경 (예: `kb-parse-image::ImageExtractor::extract` → `kb-app::extract_for (image)`) 은 spec §5.5 risk acceptance — `error.v1.code` + `error.v1.schema_version` 보존, user-visible surface 외. Cargo workspace.version bump 0.

Refs:
- docs/superpowers/specs/2026-05-26-extractor-dispatch-unification-spec.md (2 round APPROVE)
- docs/superpowers/plans/2026-05-26-extractor-dispatch-unification-plan.md (3 round APPROVE)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-26 17:43:44 +00:00
96766406aa Merge pull request 'refactor(parse-md): absorb kebab-normalize + kebab-parse-types — 24 → 22 crates + §3.7b 재작성' (#186) from refactor/normalize-absorption into main
Reviewed-on: #186
2026-05-26 15:37:21 +00:00
710945c4b0 refactor(parse-md): absorb kebab-normalize + kebab-parse-types — 24 → 22 crates + §3.7b 재작성
design §3.7b 의 thin layer (ParsedBlock 류) 가 4 parser 중 1개 (markdown) 만 lift 를
경유하는 현실 — fan-in/fan-out 모두 1 → layer 의미 잃음. kebab-normalize (1097 LOC)
+ kebab-parse-types (98 LOC) 둘을 kebab-parse-md 로 흡수.

설계: docs/superpowers/specs/2026-05-26-normalize-absorption-spec.md
플랜: docs/superpowers/plans/2026-05-26-normalize-absorption-plan.md
HOTFIXES: tasks/HOTFIXES.md 의 2026-05-26 entry (design deviation)

- 5 사용 type + 3 forward-declared struct → kebab-parse-md::types module 의 pub explicit re-export.
- build_canonical_document + derive_title + warning_agent → kebab-parse-md::normalize module.
- 4 hard-coded agent literal (lib.rs:122/128/134/153) + warning_agent body return + tracing target literal 모두 보존 — stage label 일관성.
- kebab-app callsite (lib.rs:51 use + :1119 context string) + Cargo.toml 의 2 dep (regular + dead) 제거.
- kebab-chunk + kebab-store-sqlite 의 [dev-dependencies] kebab-normalize → 제거 (kebab-parse-md 로 갈음). 통합 test source 의 use shift.
- test file 이동 (kebab-normalize/tests/normalize_snapshot.rs → kebab-parse-md/tests/).
- workspace Cargo.toml: Hunk (a) members 2 entry 삭제 + Hunk (b) version 0.18.0 → 0.19.0 (frozen contract 변경).
- design §3.7b 4-단락 재작성 (원래 intent 보존 + 현재 상태 + 보존된 surface + future re-extraction trigger).
- design §8 graph 갱신 (3 edge 제거 + 2 forbidden bullet 의미 갱신 + commentary).
- ARCHITECTURE.md crate graph + directory tree mechanical 갱신.
- tasks/INDEX.md L169 closure mention + "Future work / deferred" 섹션 신설 (image/pdf normalize integration entry).
- tasks/HOTFIXES.md 신규 entry (4-block — design deviation Symptom).
- HANDOFF.md cross-link 한 줄.
- 3 dead struct (ParsedImageRegion / ParsedPdfPage / ParsedAudioSegment) 는 보존 — v0.20+ image/pdf normalize integration 의 future surface (spec §11).

Wire / surface impact: 0건. CLI / TUI / MCP / --json 출력 / config / XDG path /
parser_version 모두 unchanged. wire-invisible provenance.events[].agent + tracing target
literal "kb-normalize" 도 보존 — old DB row 와 new DB row 의 audit log 일관성.

Verification: cargo test --workspace --no-fail-fast -j 1 → 1313 passed / 0 failed (172 result blocks).
cargo clippy --workspace --all-targets -j 1 -- -D warnings → 0 warning (5m 46s).
cargo metadata --no-deps --format-version 1 | jq '.workspace_members | length' = 22.
cargo tree -p kebab-app --depth 2 | grep -E "kebab_(parse_types|normalize)" = 0 줄.
2026-05-26 15:00:59 +00:00
d4395a306b Merge pull request 'refactor(source-fs): drop kebab-parse-code dep — 9 tree-sitter grammars drag 제거' (#185) from refactor/source-fs-dep-lightening into main
Reviewed-on: #185
2026-05-26 12:31:29 +00:00
bd48baa19a refactor(source-fs): drop kebab-parse-code dep — 9 tree-sitter grammars drag 제거
kebab-source-fs 가 kebab-parse-code 의 9 tree-sitter grammars 를 drag 했던 무거운 의존성 제거. 4 surface (code_lang_for_path / is_generated_file / is_oversized / BUILTIN_BLACKLIST) 만 사용하지만 dep 그래프에서 9 grammar 전체 link → kebab-source-fs::code_meta 로 이전 + kebab-parse-code 측 cleanup.

핵심 변경:
- kebab-source-fs::code_meta 신설: 4 surface 이전 (BUILTIN_BLACKLIST `pub` for frozen contract + 3 helper fn `pub(crate)`). lib.rs 의 `pub use code_meta::BUILTIN_BLACKLIST` 1 줄 추가 (Option A — 다른 mod surface 무근거 확장 0).
- callsite migration: media.rs (1) + walker.rs (2) + connector.rs (2) 모두 `kebab_source_fs::code_meta::*` 로 갱신.
- kebab-parse-code 측 cleanup: skip.rs 삭제 + lang.rs narrow edit (code_lang_for_path body + unit test 2 + Path import 삭제, module_path_for_* 보존) + lib.rs 헤더 doc rewrite (migration breadcrumb 포함).
- tests/{lang,skip}.rs 13 test 이동 — 12 unit (`src/code_meta.rs::tests`) + 1 integration (`tests/code_meta.rs` for BUILTIN_BLACKLIST frozen contract).
- design §8 graph: edge 제거 + p10-2 inline note. ARCHITECTURE.md 산문 1 줄 갱신. kebab-core::metadata.rs:36 stale dep reference 정정.

G1+G5: cargo tree -p kebab-source-fs | grep tree-sitter = 0 줄.
G2+G3: workspace test 회귀 0 + 13 test 1:1 이동.
G4: design §8 + ARCHITECTURE.md 갱신.

Wire 영향: 없음 (internal Rust crate-API surface 만, user-facing 0). Cargo workspace.version bump 불필요.

Refs:
- docs/superpowers/specs/2026-05-26-source-fs-dep-lightening-spec.md (v3, 4-round APPROVE)
- docs/superpowers/plans/2026-05-26-source-fs-dep-lightening-plan.md (v4, 4-round ACCEPT)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-26 12:19:32 +00:00
b02ac8200e Merge pull request 'fix(rag): S3 NLI unavailable — hypothesis char budget + token-count fallback retry' (#184) from fix/s3-nli-model-unavailable-diagnose into main
Reviewed-on: #184
2026-05-26 09:17:12 +00:00
336962715a fix(rag): S3 NLI unavailable — hypothesis char budget + token-count fallback retry
S3 dogfood query 의 `nli_model_unavailable` consistent fail root cause = mDeBERTa-v3 tokenizer 의 `OnlyFirst` strategy + 949-token hypothesis. 기존 char-budget 단독 fix 의 KR-extreme density 미해결 → token-count fallback retry + RC1-residual trait dispatch 정합.

핵심 변경:
- kebab-nli::NliVerifier: `hypothesis_token_count(&str) -> Result<usize>` trait method 추가 (default `Ok(0)` backward-compat). `OnnxNliVerifier` 가 *trait impl block* 안에서 real mDeBERTa tokenize override — vtable 등록 보장 (round-3 critic RC1-residual closure).
- kebab-rag::pipeline: `MAX_NLI_HYPOTHESIS_CHARS_INITIAL = 1200` + `MAX_NLI_HYPOTHESIS_CHARS_MIN = 150` const + `pub(crate) fn truncate_chars` pure-fn + `pub fn truncate_hypothesis_for_nli_with_budget` retry helper (char budget 반감 retry, min floor 시 graceful unavailable). step 8.5 hook 의 callsite explicit `match` + `return self.refuse_nli_model_unavailable` 패턴 (`?` 금지 — round-2 plan critic CRITICAL #1 closure).
- SpyNliVerifier 신규 helper (closure score_fn + hypothesis_token_count_fn, 2-arg constructor).
- §5.1 의 2 ignored test (EN-long err + vtable dispatch RC1-residual pin) + §5.2 의 4 boundary test (truncate_chars) + §5.3 의 3 mock multi-hop test (long_en_grounded / long_kr_retries / unrelenting_fallback). +7 new tests (2 ignored default skip).
- tasks/HOTFIXES.md 신규 dated entry `## 2026-05-26 — S3 NLI unavailable ...` — Symptom / Root cause / Action / Amends 4-block.
- spec + plan (`docs/superpowers/{specs,plans}/2026-05-26-s3-nli-model-unavailable-diagnose-*.md`) — 4 round spec + 3 round plan OMC reviewer ACCEPT 산출물.

검증:
- cargo test -p kebab-nli -j 1 → 11/11 pass + 7 ignored default skip.
- cargo test -p kebab-rag -j 1 → 19+3+3+... 전체 pass + 3 new mock + 4 new boundary.
- cargo test --workspace --no-fail-fast -j 1 → **1313 pass (+7 new)**, 0 failed. 회귀 0 (HOTFIX #15 이미 fixed, no remaining flaky).
- cargo clippy --workspace --all-targets -j 1 -- -D warnings clean (type_complexity allow on Arc<dyn Fn> type aliases).

KR safe (token-count retry path) + graceful fallback (min floor 시 기존 unavailable wire 유지, regression 0). Wire 영향 없음 (additive trait method). Cargo bump 불필요.

Refs:
- spec: docs/superpowers/specs/2026-05-26-s3-nli-model-unavailable-diagnose-spec.md (4 round APPROVE — analyst → critic + verifier × 4 rounds)
- plan: docs/superpowers/plans/2026-05-26-s3-nli-model-unavailable-diagnose-plan.md (3 round ACCEPT — planner → critic-plan + verifier-plan × 3 rounds)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-26 09:12:21 +00:00
1a224bf983 Merge pull request 'fix(mcp): HOTFIX #15 — MCP ask multi_hop dispatch-divergence assertion (fixture 보강)' (#183) from fix/hotfix-15-mcp-ask-multi-hop-flaky into main
Reviewed-on: #183
2026-05-26 07:02:26 +00:00
a210bf5d52 docs(rag): HOTFIX #15 spec + plan (3 round OMC reviewer approve)
OMC team `hotfix-15-mcp-flaky` 의 spec + plan 작성 + 리뷰 산출물.

- spec: analyst 가 진단 (root cause = PR-7 probe-first 가 PR-5 test 의 stale empty-KB contract 와 mismatch) + Option A 권장 (test-only fix). 3 round review (critic + verifier): CRITICAL C1 (fixture/query FTS5 0 hits) + MAJOR M1/M2 + 등 closure.
- plan: planner 가 7 steps + subagent dispatch task 작성. 3 round review (critic-plan + verifier-plan): empirical SQLite REPL 검증, level-1 dated entry placement, actual KebabHandler/KebabAppState pattern 정합.

implementation = 429287f commit (executor).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-26 06:52:04 +00:00
429287f6cb fix(mcp,tests): HOTFIX #15 — MCP ask multi_hop dispatch-divergence assertion (fixture 보강)
PR-7 (v0.18 dogfood probe-first) 머지 후 PR-5 의 test `ask_tool_routes_multi_hop_true_to_decompose_first` 가 stale empty-KB contract 로 deterministic fail. test-only fix — production code 0 touch.

- `minimal_config`: `score_gate = 0.0` (probe 의 second gate `top_score < score_gate` 우회, test config isolation).
- fixture `workspace_root/note.md`: "This note is about a compound containing X and Y in detail." — build_match_string 의 token_and branch (FTS5 implicit-AND) 가 `compound` + `about` + `and` 셋 다 매칭 필요. empirical SQLite REPL (V007 trigram DDL) 로 1 hit 확정.
- 기존 assertion 보존, single-pass branch 도 query "anything" 으로 fixture 미매칭 → NoChunks refusal 유지.
- 신규 `_multi_hop_short_circuits_when_probe_empty` test (REQUIRED — round-1 critic HIGH + verifier 격상): probe-empty short-circuit 의 MCP-layer wire shape pin (kebab-rag::multi_hop_empty_probe_pool_refuses_before_any_llm_call 은 RAG-layer 만 pin, MCP-layer 안전망 부재).
- module doc 갱신: 두 test 가 각각 pin 하는 contract enumerate. inline 주석 (line 94-101) 도 새 contract 정합.
- HOTFIXES.md 신규 dated entry \`## 2026-05-26 — HOTFIX #15 ...\` (date-top convention).

검증: cargo test --workspace -j 1 — 회귀 0 (known flaky 1 → 0). cargo clippy --workspace --all-targets -j 1 -- -D warnings clean.

Wire / behavior / version cascade: 0.

Refs: docs/superpowers/specs/2026-05-26-hotfix-15-mcp-ask-multi-hop-flaky-spec.md (review 3 rounds APPROVE)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-26 06:51:06 +00:00
08495eb425 Merge pull request 'chore(release): bump version 0.17.2 → 0.18.0 + cut fb-41 multi-hop' (#182) from chore/v0-18-0-cut into main
Reviewed-on: #182
2026-05-26 05:36:05 +00:00
98cf4e8a04 chore(release): bump version 0.17.2 → 0.18.0 + cut fb-41 multi-hop
v0.18.0 cut PR. fb-41 multi-hop RAG + NLI verification 의 user-visible surface (PR #176-180) + post-PR9 cleanup/refactor (PR #181) ship 마무리.

## 변경 사항

### Version
- workspace `Cargo.toml`: 0.17.2 → 0.18.0. Cargo.lock 자동 cascade (24 kebab-* crate 모두 0.18.0).

### Frozen design contract
- `docs/superpowers/specs/2026-04-27-kebab-final-form-design.md`:
  - §3.8 RAG types — RefusalReason 에 NliVerificationFailed + NliModelUnavailable + MultiHopDecomposeFailed 추가 + Multi-hop RAG + NLI verification 의 ask_multi_hop facade + step 8.5 NLI hook + HopRecord / VerificationSummary 명시.
  - §9 versioning rules 표 — nli_model_version row 신규 (선택 — v0.19+ second adapter 시 wire surface candidate).

### Status transitions
- `docs/superpowers/specs/2026-05-25-p9-fb-41-finalize-spec.md`: status approved-by-team → completed.
- `docs/superpowers/plans/2026-05-25-p9-fb-41-finalize-plan.md`: status approved-by-team → completed (spec_status 도).

### User-facing docs
- `README.md`: 명령 표의 `kebab ask` row 에 `--multi-hop` flag + NLI 옵션 안내 한 단락 (mDeBERTa-v3 XNLI 280 MB 자동 다운로드 / RAM peak ~7-8 GB / threshold tuning 0.5 prod / 0.0 disable).
- `docs/SMOKE.md`: `[rag] nli_threshold = 0.0` config 예시 + 활성화 절차 + first-run download + RAM 권장 inline 안내.

### Handoff + dashboard
- `HANDOFF.md`: 한 줄 요약 의 현재 version 0.17.2 → 0.18.0. v0.18.0 cut entry 추가 (fb-41 multi-hop + NLI + cleanup ship). Component 카운트 단락에 fb-41 PR-9 의 kebab-nli + ask_multi_hop 추가 명시. 머지 후 결정 절 맨 위에 v0.18.0 fb-41 entry 신규.
- `tasks/INDEX.md`: p9-fb-41  머지 (v0.18.0). v0.18.0 subsection 신규 — PR #176-181 의 6 sub-PR + cleanup 각 한 줄 요약.

## 비범위 / 별 작업
- HOTFIXES.md 의 fb-41 entry 는 이미 PR #180 (PR-9d closure) 에서 작성 완료 — 본 cut PR 에서 추가 anchor 불필요.
- SKILL.md 의 v0.18+ NLI 안내는 이미 PR-9c-2 에서 inline 추가 완료.

## 검증
- `cargo check --workspace -j 1` 통과 (모든 24 crate v0.18.0 확인).
- frozen design 의 RefusalReason enum 확장이 kebab-core 의 production code 와 정합 (PR-9c-1 시점부터 동일 variants 있음).

Wire 영향: 없음 (additive minor 는 PR-9c-1 에서 이미 ship, 본 commit 은 documentation cascade only).
Behavior 영향: 없음.

머지 후 `gitea-release v0.18.0` 으로 tag + release notes 작성.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-26 05:18:08 +00:00
4030f04f37 Merge pull request 'chore: workspace-wide cleanup — clippy::pedantic baseline + auto-fix' (#181) from chore/workspace-wide-cleanup-pre-v0-18 into main
Reviewed-on: #181
2026-05-26 04:48:50 +00:00
7c27633df2 chore(rag): post-PR9 refactor — H1/H2/H3/D/E + test coverage + post-refactor dogfood retest
OMC team `post-pr9-refactor` 의 architectural cleanup. architect priorities 분석 후 executor + test-engineer 가 file edits, system-architect 가 component-level review 로 *pre-cut nothing — all v0.18.1+ defer* 결론.

## Executor 작업 (H1/H2/H3/D/E)

- **H1** (kebab-nli/src/onnx.rs): `[models.nli]` config wire 활성화. `DEFAULT_MODEL_ID` const 제거 (kebab-config 의 NliCfg::defaults 가 single source). OnnxNliVerifier::new 가 config.models.nli.model 읽고 config.models.nli.provider 가 "onnx" 아니면 anyhow::bail. 3 stale "PR-9c-1 will wire this" 코멘트 제거. 2 unit test 추가 (`new_uses_config_model_id`, `new_rejects_unsupported_provider`).
- **H2** (kebab-rag/src/pipeline.rs): `truncate_for_nli(premise: &str, _hypothesis: &str)` → `truncate_for_nli(premise: &str)`. v0.18.1 placeholder doc 제거. 4 callsite (tests/multi_hop.rs) 갱신 + test rename `multi_hop_truncate_for_nli_preserves_hypothesis` → `multi_hop_truncate_for_nli_char_budget` (contract 정합).
- **H3** (kebab-rag/src/pipeline.rs:1041): `was_truncated` 가 tracing::debug! 으로 surface (observability 추가, signature 보존 — caller logging contract).
- **D** (kebab-mcp/tests/tools_call_ask_multi_hop.rs): request_timeout_secs 2 → 5 (slow CI 안정성), `mh_code` discriminator 제거. dispatch contract = `mh.is_error.unwrap_or(false)` (기존 assertion 으로 충분).
- **E** (tasks/HOTFIXES.md + pipeline.rs:1633-1638): fb-41 PR-9 closure entry 의 sibling 으로 "### PR-9 NLI refusal: terminal Synthesize hop omitted from hops trace" subsection 추가. pipeline 의 "cleanup deferred to a follow-up" → "// See tasks/HOTFIXES.md ... for follow-up" cross-link.

## Test-engineer 작업 (T1/T2/T3/T4, 9 new tests)

- **T1** (kebab-nli/src/onnx.rs::tests): sanitize_model_id 3 unit (replaces_slash / idempotent / leaves_other_chars).
- **T2** (kebab-rag/tests/multi_hop_nli_panic.rs 신규): 2 panic-path tests — facade invariant (`expect("verifier must be Some when nli_threshold > 0.0")`) 의 #[should_panic] + threshold=0 의 companion.
- **T3** (kebab-rag/tests/multi_hop_nli_stream.rs 신규): 2 StreamEvent::Final tests — refuse_nli_verification + refuse_nli_model_unavailable 의 stream_sink Final 분기 wire shape pinning.
- **T4** (kebab-app/tests/open_with_config_nli.rs 신규): 2 NLI failure path — model_dir 가 unwritable 일 때 App::open_with_config 의 Result<App> Err (with "OnnxNliVerifier" in chain) + threshold=0 일 때 graceful skip.

## System-architect 결론

3 lenses (absorption / duplication / under-engineered interface) 분석 결과 — *pre-cut nothing*. Top-3 items 모두 v0.18.1+ defer:
- Lens 1: kebab-normalize + kebab-parse-types 흡수 가능 (parse-md 만 사용, 5 parsers 우회) → v0.18.1+.
- Lens 3: Extractor + Chunker trait 의 dead polymorphism (모든 callsite 가 hardcoded) → v0.18.1+.
- Lens 1 bundled: kebab-source-fs 가 kebab-parse-code 의 9 tree-sitter grammars drag → low-risk dep-graph win, v0.18.1+ bundled.
- Defer-with-intent: LanguageModel async refactor (cloud-LLM 시), NliVerifier::score_batch + typed NliError (2nd impl 시), compute_stale → kebab-core::stale.

보고서: /build/cache/tmp/post-pr9-refactor-priorities.md, /build/cache/tmp/system-architecture-priorities.md (둘 다 repo 외 — analysis 보존).

## 검증

- cargo test -p kebab-nli -j 1 → 11/11 pass.
- cargo test -p kebab-rag -j 1 → 75/75 pass (5 NLI multi-hop + 4 신규 T2/T3 포함).
- cargo test -p kebab-app -j 1 → 23 pass + 2 ignored (T4 의 2 포함).
- cargo test -p kebab-mcp --test tools_call_ask_multi_hop -j 1 → 1 pass + 1 pre-existing flaky (HOTFIX #15, no_chunks short-circuit, executor D fix 와 무관 — line 86 의 base assertion 이 fixture 없어서 fail).
- cargo clippy --workspace --all-targets -j 1 -- -D warnings clean.
- cargo test --workspace --no-fail-fast -j 1 → 1304 passed (+11 new) + 1 pre-existing flaky 동일.
- **Post-refactor dogfood retest byte-identical** (PR-9d / post-cleanup / post-refactor 3번 모두): S7 0.0035389824770390987, S1 0.058334656059741974, S10 0.0027875436935573816, S3 nli_model_unavailable.

docs/dogfood/v0.18.0/SUMMARY.md 에 "Post-architectural-refactor retest" section 추가.

Wire 영향: 없음.
Behavior 영향: 없음 (H1 의 config wiring 가 default 와 같은 model → byte-identical).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-26 04:42:37 +00:00
3712d005cc chore(rag): post-cleanup dogfood retest — byte-identical 회귀 0
workspace-wide cleanup commit 직후 동일 dogfood S7/S1/S10/S3 재실행. NLI score byte-identical 확인:

- S7: nli_verification_failed, 0.0035389824770390987 (PR-9d 동일)
- S1: nli_verification_failed, 0.058334656059741974 (PR-9d 동일)
- S10: nli_verification_failed, 0.0027875436935573816 (PR-9d 동일)
- S3: nli_model_unavailable (PR-9d 동일, cleanup 무관 — v0.18.1 follow-up)

cleanup = mechanical refactor only. behavior 회귀 0. cut PR v0.18.0 진행 가능.

docs/dogfood/v0.18.0/SUMMARY.md 에 "Post-cleanup retest" section 추가.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-26 03:19:15 +00:00
7c85de065a chore: workspace-wide cleanup — clippy::pedantic baseline + auto-fix
cut PR v0.18.0 전 마지막 정리. 사용자 요청: "전체 코드베이스를 깔끔하고 알아보기 쉽게".

## Workspace lints

- `Cargo.toml` 의 `[workspace.lints.clippy]` 에 `pedantic = "warn"` (priority -1) + 의도적 allow-list 추가:
  - cast_possible_truncation / cast_possible_wrap / cast_sign_loss / cast_precision_loss — ONNX i64 / hash modular reduction 등 의도적 truncation.
  - doc_markdown / missing_errors_doc / missing_panics_doc — cosmetic doc style.
  - too_many_lines / module_name_repetitions / must_use_candidate / needless_pass_by_value / manual_let_else / items_after_statements / similar_names — informational only.
  - format_collect / match_wildcard_for_single_variants / trivially_copy_pass_by_ref / unnecessary_wraps — intentional patterns (exhaustive match, future Result variants 등).
  - default_trait_access — `Foo::default()` 가 idiomatic.
  - float_cmp — NLI / RRF score 의 explicit threshold 비교 의도.
  - struct_excessive_bools / case_sensitive_file_extension_comparisons / naive_bytecount / ignore_without_reason — domain-specific 의도.
  - format_push_string / return_self_not_must_use / match_same_arms — builder / wire-label / hot-path 패턴 보존.
  - needless_continue / used_underscore_binding / nonminimal_bool / unreadable_literal / many_single_char_names / doc_link_with_quotes / assigning_clones / collapsible_str_replace / trivial_regex / elidable_lifetime_names / range_plus_one / explicit_iter_loop / implicit_hasher / ref_option — remaining low-value style.
- 각 24 crate `Cargo.toml` 에 `[lints] workspace = true` 추가.

## Auto-fix

`cargo clippy --workspace --all-targets --fix` 적용 — 128 files changed, 552 insertions / 472 deletions. 주로:
- uninlined_format_args (~18): `format!("{}", x)` → `format!("{x}")`.
- redundant_closure_for_method_calls (~33): `.map(|x| x.foo())` → `.map(T::foo)`.
- 그 외 mechanical refactor.

## 검증

- `cargo clippy --workspace --all-targets -j 1 -- -D warnings` clean (pedantic + 모든 lint group).
- `cargo test --workspace --no-fail-fast -j 1` — **1293 tests pass + 1 pre-existing flaky fail** (`kebab-mcp::tools_call_ask_multi_hop::ask_tool_routes_multi_hop_true_to_decompose_first`, HOTFIX candidate, cleanup 무관). 회귀 0.

Wire 영향: 없음.
Behavior 영향: 없음 (mechanical refactor only).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-26 03:01:58 +00:00
a0ccc7b021 Merge pull request 'feat(rag): fb-41 PR-9d — dogfood retest + HOTFIXES closure + corpus 보존' (#180) from feat/fb-41-pr-9d-dogfood-retest into main
Reviewed-on: #180
2026-05-26 02:01:56 +00:00
a8fd6994d2 chore(rag): PR-9d SUMMARY 의 latency 표 정정
PR-8 baseline 의 S1/S10 latency 추정값 (~150s, ~80s) 이 부정확. `results/s1-multihop.json` + `results/s10-multihop.json` 가 실제로 614s / 589s (`jq '.usage.latency_ms'` 측정) — *PR-8 시점 baseline 이 아닌 더 이전 timeline*. S7 만 `results/post-pr8/` 에 retest 보존되어 비교 의미 있음 (158s baseline → PR-9 241s with NLI first-run download).

SUMMARY.md 의 latency 표를 정정 — S1/S10 의 *동일 시점 baseline 부재* 명시 + S7 의 단일 비교만 의미 있음 caveat.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-26 01:47:38 +00:00
505b3889fb feat(rag): fb-41 PR-9d — dogfood retest + HOTFIXES PR-9 closure + docs/dogfood/v0.18.0/ 보존
PR-9 의 진짜 작동 확인 — PR-1~PR-9c-2 머지 후 `/build/cache/dogfood-v018/` corpus 의 S7/S1/S3/S10 multi-hop retest.

핵심 결과: **S7 hallucination root cause 해결 확정**.
- PR-8 baseline: `grounded=true, refusal_reason=null`, **답변=Adam gradient 공식** (caffeine 질문에 무관 hallucination, silent).
- PR-9 retest: `refusal_reason=nli_verification_failed, nli_score=0.0035` (graceful refuse, NLI 가 entailment 0.35% 검출).

전체 비교 (4 case):
- S7  hallucination FIXED.
- S1  둘 다 reject, NLI 가 더 deterministic (0.058).
- S3 ⚠ consistent fail (`nli_model_unavailable`, 313s) — *v0.18.1 follow-up* (kebab-nli 의 특정 input 의존 fail, debug log emit 안 됨 → 진단 어려움).
- S10  둘 다 reject, NLI 가 더 deterministic (0.0028).

- docs/dogfood/v0.18.0/SUMMARY.md (sanitized 보고서) + s{1,3,7,10}-multihop-post-pr9.json (sample wire output, repo 보존).
- tasks/HOTFIXES.md 의 fb-41 PR-9 entry: "예정" → "완료 (2026-05-26)" + 비교 표 inline + S3 follow-up subsection (v0.18.1 candidate).

RAM: 5-6 GB → 7-8 GB (ONNX session ~600 MB), 16 GB 안전.
Disk: NLI model cache 1.1 GB (XDG default 또는 storage.model_dir).
Wire 영향: 없음 (PR-9c-1 의 schema 변경만 + 측정값 sample 보존).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-26 01:44:57 +00:00
772575d8f0 Merge pull request 'feat(rag): fb-41 PR-9c-2 — pipeline integration + mock test + SKILL.md (★ NLI 실 활성화)' (#179) from feat/fb-41-pr-9c-2-pipeline-integration into main
Reviewed-on: #179
2026-05-26 01:03:18 +00:00
00ffe9c792 feat(rag): fb-41 PR-9c-2 — pipeline integration + mock test + SKILL.md (★ NLI 실 활성화)
PR-9c-1 의 wire surface 위에 behavior 활성화 — `ask_multi_hop` 의 step 8.5 hook 가 `[rag] nli_threshold > 0` 일 때 NLI 검증 실 수행. **첫 user-visible behavior change** in PR-9.

- crates/kebab-rag/src/pipeline.rs:
  - ask_multi_hop step 8.5 NLI hook (empty answer 가드 + truncate_for_nli + verifier.score + verification field + refusal 분기).
  - refuse_nli_verification helper (verification: Some(...) + RefusalReason::NliVerificationFailed).
  - refuse_nli_model_unavailable helper (verification: None + RefusalReason::NliModelUnavailable).
  - truncate_for_nli helper (module-level pub fn, MAX_NLI_PREMISE_CHARS = 4 * 400 = 1600 chars 의 chars-based budget, _hypothesis 미사용 placeholder — v0.18.1 token-budget 갱신 candidate).
  - PR-9c-1 의 #[allow(dead_code)] 두 곳 제거 (verifier field + with_verifier builder; doc 의 transitional sentence 도 정리). round-1 PR-9c-1 review N1 carry-forward closure.
- crates/kebab-app/src/app.rs:
  - App::open_with_config 의 NliVerifier construction — config.rag.nli_threshold > 0 → OnnxNliVerifier::new + Arc::new wrap + 후속 RagPipeline 초기화 시 with_verifier 호출. 실패 시 ? 전파 (시그니처 Result<Self> 그대로 — caller cascading 0).
  - kebab-app/Cargo.toml 에 kebab-nli path 의존 추가.
- crates/kebab-rag/tests/multi_hop.rs + tests/common/mod.rs:
  - MockNliVerifier (pass / fail / err 생성자 + score call_count instrumented).
  - multi_hop_nli_pass_keeps_grounded — entailment 0.9 → grounded=true, verification.nli_passed=true.
  - multi_hop_nli_fail_refuses — entailment 0.1 → refusal=NliVerificationFailed.
  - multi_hop_nli_disabled_skip_verify — threshold 0.0 → verify skip, verification=None.
  - multi_hop_nli_model_unavailable_refuses — verifier Err → refusal=NliModelUnavailable.
  - multi_hop_truncate_for_nli_preserves_hypothesis — long premise truncation + hypothesis 보전.
- integrations/claude-code/kebab/SKILL.md: mcp__kebab__ask 절에 NLI 안내 한 단락 (verification.nli_passed 의미 + threshold tuning + nli_verification_failed/nli_model_unavailable refusal handling).

검증: cargo test --workspace -j 1 — 5 신규 multi-hop pass + 회귀 0 (pre-existing kebab-mcp::tools_call_ask_multi_hop 동일 flaky). cargo clippy --workspace --all-targets -j 1 -- -D warnings clean.
Wire 영향: PR-9c-1 의 schema 변경에 *behavior wiring* — answer.v1.verification field 가 multi-hop happy path + refuse path 양쪽에서 채움.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-26 00:55:02 +00:00
681c48b2a3 Merge pull request 'feat(rag): fb-41 PR-9c-1 — core types + wire scaffolding (NLI verification)' (#178) from feat/fb-41-pr-9c-1-core-types-wire into main
Reviewed-on: #178
2026-05-26 00:09:03 +00:00
546c1564b0 feat(rag): fb-41 PR-9c-1 — core types + wire scaffolding (NLI verification)
Surface-only PR (no behavior wiring — that's PR-9c-2):
- kebab-core: RefusalReason::NliVerificationFailed + NliModelUnavailable (serde rename_all="snake_case", wire = identical strings).
- kebab-core: Answer.verification: Option<VerificationSummary> field (additive minor wire — pre-v0.18 reader 무영향).
- kebab-core: VerificationSummary { nli_score: f32, nli_threshold: f32, nli_passed: bool } struct + lib.rs 재-export.
- kebab-config: NliCfg { model, provider } + ModelsCfg.nli (default Xenova/mDeBERTa-v3-base-xnli-multilingual-nli-2mil7).
- kebab-config: RagCfg.nli_threshold: f32 (default 0.0 = disabled, spec §2.6 single gate).
- kebab-config: env override KEBAB_MODELS_NLI_MODEL/PROVIDER + KEBAB_RAG_NLI_THRESHOLD (parse 실패 시 tracing::warn + default 유지).
- kebab-rag: RagPipeline.verifier: Option<Arc<dyn NliVerifier>> field + with_verifier builder (모두 #[allow(dead_code)] — PR-9c-2 의 step 8.5 hook 가 활성화 시 제거). RagPipeline::new signature 유지 (round-2 NEW-M1 Option B).
- kebab-rag: Cargo.toml 에 kebab-nli path 의존 추가.
- kebab-store-sqlite + kebab-tui: 두 신규 RefusalReason variant 에 대한 exhaustive match arm 추가 (snake_case label / 표시 문구).
- 모든 Answer 구축 site (rag 6 + cli/tui/eval 3 fixture) 에 verification: None 추가.
- wire schemas: answer.schema.json verification field + \$defs.VerificationSummary + refusal_reason.enum 2 추가. error.schema.json code.enum + details.description 2 추가 (forward-looking reserved).
- docs/ARCHITECTURE.md: Mermaid Adapters subgraph 의 nli 노드 + rag→nli + app→nli (forward-looking) + nli→config edges. nli→core edge 는 skip (kebab-nli/Cargo.toml direct dep 가 config 만, ARCHITECTURE 컨벤션 = direct deps only). 디렉토리 트리에 crates/kebab-nli/ 추가.

Tests: kebab-core 3 (serde rename + verification skip + struct shape) + kebab-config 6 (defaults + legacy + env + malformed env) + kebab-cli wire 5 (schema verification + enum 검증).
검증: cargo test --workspace -j 1 회귀 0 (pre-existing kebab-mcp::tools_call_ask_multi_hop flaky 1개 동일 — spec 에 명시된 known-flaky). cargo clippy --workspace --all-targets -D warnings clean.
Wire 영향: additive minor — answer.v1 의 verification optional + refusal_reason.enum 확장 + error.v1.code 확장.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-25 23:27:36 +00:00
79ad6e376f Merge pull request 'feat(nli): fb-41 PR-9b — OnnxNliVerifier ONNX inference + model download' (#177) from feat/fb-41-pr-9b-onnx-nli-inference into main
Reviewed-on: #177
2026-05-25 22:24:01 +00:00
6ffbe0a5a3 chore(nli): PR #177 회차 1 리뷰 반영 (N1 cache-hit probe + N2 test pollution)
- N1: fetch 의 cache-hit 검사 경로가 실제로는 download 트리거 (ApiRepo::get 가 cache miss 시 download 후 path 반환). log 의 "NLI artifact cache hit" 가 *방금 download 한 직후* 출력 — misleading. hf_hub::Cache::new(cache_dir).repo(repo).get(filename).is_some() 로 변경 — Cache::get 은 fs lookup only, 네트워크 안 탐. actual download 횟수는 변화 없음 (1번), log accuracy 만 개선.
- N2: new_succeeds_on_default_config / score_empty_hypothesis_returns_err 가 XDG 실 디렉토리 (`~/.local/share/kebab/models/nli/...`) 를 create_dir_all → test pollution. tempdir_config() 헬퍼 추가 — TempDir 으로 storage.data_dir override, model_dir 는 `{data_dir}/models` 그대로 두어 expand_path 의 substitution 검증도 유지.

cargo test -p kebab-nli -j 1 → 6 passed / 0 failed (unit) + 5 ignored (integration, manual).
cargo clippy -p kebab-nli --all-targets -j 1 -- -D warnings clean.
inference.rs 미수정 → manual --ignored smoke 결과 (5/5 PASS) 그대로 유효.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-25 22:22:30 +00:00
ab3408cb49 chore(nli): PR-9b inference test 2 의 expectation 정정
기존 expectation `entailment < 0.3` 가 너무 strict — mDeBERTa-v3 multilingual NLI 가 두 caffeine 사실 (premise: "Caffeine is a stimulant.", hypothesis: "The chemical formula of caffeine is C8H10N4O2.") 의 *neutral* 을 0.53 으로, entailment 를 0.43 으로 판단함 (서로 entail 안 하지만 모순도 아님 = 정확히 neutral).

spec §3 PR-9b 의 "entailment 낮음 — neutral/contradiction 이 winning channel" 의 *spirit* 은 *neutral 이 max* 임. expectation 을 `s.neutral > s.entailment && s.neutral > s.contradiction` 로 변경.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-25 22:10:51 +00:00
b807fd5aa5 feat(nli): fb-41 PR-9b — OnnxNliVerifier 의 ONNX inference + model download
- OnnxNliVerifier fields: model_id, cache_dir (XDG model_dir/nli/<sanitized>), session/tokenizer OnceLock.
- new(): eager cache_dir stamp만 — actual model download + Session::commit_from_file 는 첫 score 호출 시 ensure_loaded() 가 lazy 수행.
- score(): ensure_loaded → tokenizer.encode(pair, OnlyFirst truncation max_length=512) → ndarray Array2<i64> → ort::Session::run → logits[1,3] → NliScores::from_xnli_logits.
- empty hypothesis edge: defense-in-depth bail (spec §2.3 의 caller-side skip 외 추가).
- sanitize_model_id helper: "/" → "_".
- 5 #[ignore] integration tests (EN self-entailment, EN unrelated, KR entailment, long premise truncation, empty hypothesis err) — manual smoke 가 PR description 첨부.

Cargo.toml: `download-binaries` feature 를 kebab-nli 의 ort dep 에 활성화 (PR-9b prep commit 의 후속). 단독 `cargo test -p kebab-nli` 의 per-crate feature 유니온은 fastembed 없이 ort/download-binaries 가 OFF 되어 ort-sys link 가 실패 — kebab-nli 측에서 명시적으로 켜 줘야 standalone build 가 ONNX 런타임 link 됨. workspace 전체 빌드에서는 fastembed 의 동일 opt-in 과 union 되어 부작용 없음.

Verification:
- cargo test -p kebab-nli -j 1 — PR-9a 의 6 unit pass (`score_returns_err_in_skeleton` → `score_empty_hypothesis_returns_err` 로 stub→실 path 갱신, 갯수 유지).
- cargo clippy -p kebab-nli --all-targets -- -D warnings clean.
- cargo build --workspace -j 1 — 회귀 0.
- Manual --ignored smoke 결과 PR body 첨부.

Wire 영향: 없음 (crate-internal).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-25 21:56:22 +00:00
93436f9eca feat(nli): fb-41 PR-9b prep — activate ort/tokenizers/hf-hub/ndarray/tracing deps in kebab-nli
PR-9a 의 workspace.dependencies 만 declared 였던 5 crate 의존을 kebab-nli/Cargo.toml 에 활성화. PR-9b 의 OnnxNliVerifier 실 구현이 본 commit 위에서 빌드 가능.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-25 21:42:07 +00:00
11ce7847a1 Merge pull request 'feat(nli): fb-41 PR-9a — kebab-nli crate skeleton + workspace deps' (#176) from feat/fb-41-pr-9a-kebab-nli-crate into main
Reviewed-on: #176
2026-05-25 21:34:49 +00:00
1d88dccf8a chore(nli): PR #176 회차 1 리뷰 반영
- lib.rs::NliScores::faithfulness doc 의 `rag.nli_faithfulness_min` → `rag.nli_threshold` (spec §2.5/§2.6 의 실 config knob 이름 정합).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-25 21:25:44 +00:00
1eb0bbecb3 feat(nli): fb-41 PR-9a — kebab-nli crate skeleton + workspace deps
- 신규 crate kebab-nli (trait + impl 동일 crate, v0.18 scope = ONNX adapter 1개).
- NliVerifier trait + NliScores struct (XNLI 3-channel: entailment/neutral/contradiction).
- private softmax3 (log-sum-exp 안전).
- OnnxNliVerifier placeholder (PR-9b 가 ONNX inference + model download 추가).
- workspace.dependencies 추가: ort 2.0-rc.9, tokenizers 0.21 (default-features=false, onig), hf-hub 0.4, ndarray 0.16.

Pre-flight (PR-9 design contract 의 gate):
- HF Xenova/mDeBERTa-v3-base-xnli-multilingual-nli-2mil7 model.onnx + tokenizer.json → HTTP/2 302 (HF S3 routing, file 존재).
- tokenizers --no-default-features -F onig 의 standalone repro: SentencePiece mDeBERTa tokenizer.json 로드 OK (KR 9 tokens / EN 11 tokens 정상 encode).
- Cargo features 결정 trace: tokenizers = { default-features = false, features = ["onig"] } lock.

Tests: 6 unit (softmax3 정규화 + 불변성 + XNLI logits 변환 + faithfulness + new + score stub) — 통과.
Verification: cargo test -p kebab-nli -j 1 (6/6) + cargo clippy -p kebab-nli --all-targets -j 1 -- -D warnings clean.
Workspace: cargo test --workspace -j 1 — pre-existing kebab-mcp::tools_call_ask_multi_hop 1 fail (main baseline 동일 fail, PR-9a 무관 — ingest fixture/Ollama 의존 flaky).

Wire 영향: 없음 (crate 도입만).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-25 21:22:38 +00:00
44fbffff26 docs(rag): fb-41 PR-9 spec + plan — NLI verification + v0.18.0 cut
fb-41 multi-hop RAG 의 dogfood S7 hallucination root cause = LLM-self-judge ceiling.
대응 = NLI-based post-synthesis verification (mDeBERTa-v3 XNLI, 280 MB ONNX).

산출물:
- docs/superpowers/specs/2026-05-25-p9-fb-41-finalize-spec.md (review_round=5,
  4 OMC reviewer APPROVE: 1 CRITICAL + 9 MAJOR + 3 MINOR → 1 NIT carry-forward).
- docs/superpowers/plans/2026-05-25-p9-fb-41-finalize-plan.md (plan_review_round=3,
  4 OMC reviewer APPROVE: 15 issues → 0 actionable).

5 sub-PR (PR-9a~9d) + cut PR. 작업 21-31h / wall time 28-44h.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-25 21:22:20 +00:00
63aece3ea1 Merge pull request 'fix(rag): fb-41 PR-8 multi-hop synthesize safety in depth (pool 15 + self-check rule)' (#175) from feat/fb-41-pr-8-multi-hop-synthesize-safety into main 2026-05-25 12:51:46 +00:00
28a8bbeace chore(rag): PR #175 회차 1 리뷰 반영
HOTFIXES.md 의 fb-41 entry 에 *post-PR-7 dogfood retest + PR-8 partial
mitigation* sub-section 추가 + *PR-9 NLI plan* anchor + 사용자 영향
절 갱신. config.rs 의 doc reference 가 정확한 entry sub-section
가리키도록 조정 — dangling reference 해소.

검증
- `cargo test -p kebab-config -j 1` — 모든 test 통과.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-25 12:51:15 +00:00
52a97303dc fix(rag): fb-41 PR-8 — multi-hop synthesize safety in depth (pool 15 + self-check rule)
v0.18 cut 전 fb-41 multi-hop RAG **layered defense** — PR-7 의 pre-decompose
probe gate 위에 추가 safety. PR-7 의 fix 만으로는 hybrid mode 의 RRF
top_score 가 gate 통과 시 (도그푸딩 S7 의 caffeine query) hallucination
여전히 발생 — synthesize 단계 자체의 safety 보강 필요.

**중요**: 본 PR 만으로는 S7 hallucination 완전 차단 안 됨 (gemma3:4b 의
prompt-following 한계 — 추가 dogfood S7 retest 에서 확인). 진짜 fix 는
PR-9 (NLI-based post-synthesis verification). PR-8 은 그 사이의 *partial
mitigation + safety in depth* — latency 4× 개선 (614s → 158s) + future
larger LLM 용 prompt rule.

설계: docs/superpowers/specs/2026-05-25-p9-fb-41-multi-hop-rag-design.md
계획: /build/cache/dogfood-v018/results/PR-9-DESIGN.md (사용자 결정 후
spec/plan 으로 promotion)

## 변경

- `crates/kebab-config/src/lib.rs`:
  - `RagCfg::multi_hop_max_pool_chunks` default **30 → 15**.
  - rationale doc — gemma3:4b 가 30-chunk large prompt 에서 citation
    rule 잃는 측정 결과.
  - 2 unit test (`default_*` rename + `legacy_*` assert) 갱신.
- `crates/kebab-rag/src/pipeline.rs`:
  - `MULTI_HOP_SYNTHESIZE_SYSTEM_PROMPT` 에 **답하기 전 self-check** rule
    추가 — "[원본 질문] 의 핵심 entity (고유명사, 화학식, 수치 단위,
    코드명, 약자) 가 [근거] 본문에 literal 으로 등장하지 않으면 다른
    entity 의 정보로 답을 합성하지 말고 '근거가 부족하다' 답한다". example
    (caffeine + Adam optimizer chunk) 도 명시.

## 도그푸딩 결과 (retest with PR-7 + PR-8)

| query | path | grounded | latency | answer |
|---|---|---|---|---|
| caffeine formula | single-pass | false (LlmSelfJudge) | 30s | "근거가 부족하다" ✓ |
| caffeine formula | multi-hop pre-fix | true ✗ | 141s | hallucination |
| caffeine formula | multi-hop PR-7 | true ✗ | 143s | hallucination (probe gate top_score 0.5 > 0.30) |
| caffeine formula | multi-hop PR-8 | true ✗ | **158s** | hallucination (LLM 가 새 rule 무시) — **latency 4× 개선** |

PR-8 의 부분 성과:
- pool 30→15 로 synthesize prompt size ↓ → latency 614s → 158s.
- prompt rule 은 future larger LLM (gemma2:9b, qwen2.5:7b 등) 에서 가치 ↑.

PR-8 의 한계:
- gemma3:4b 의 prompt-following 한계 — strong rule 도 무시하고 다른 entity
  chunk (Adam optimizer formula) 의 본문을 caffeine 화학식 출처로 인용.
- LLM-self-judge 기반 safety 의 ceiling.

## 진짜 fix → PR-9 (별 PR)

학계 / industry 표준 검색 결과 (Self-RAG, CRAG, Auto-GDA, MedTrust-RAG):
deterministic post-synthesis verification 이 정답 path. **NLI-based
groundedness check** — mDeBERTa-v3-base-xnli (280 MB multilingual) ONNX
model 이 (premise=packed_chunks, hypothesis=answer) entailment 검사. score
< 0.5 면 refuse. PR-8 위에 layered defense.

## 검증

- `cargo test -p kebab-config -p kebab-rag -j 1` — 모든 test 통과
  (config default test 2개 갱신, rag tests 영향 없음).
- `cargo clippy -p kebab-config -p kebab-rag --all-targets -j 1 --
  -D warnings` clean.
- 단일 crate 직렬 build (16 GB RAM 제약).
- S7 dogfood retest — hallucination 여전 (PR 본문에 정직 명시).

## 변경 없음

- Wire schema — additive (config knob default 만 변경).
- PR-7 의 probe gate — 그대로 작동 (gate 통과 시 PR-8 의 추가 safety
  layer).
- 다른 도그푸딩 P1 항목 (citation 일관성, binary path) — 별 PR.

## 다음

- **PR-9a/b/c**: NLI-based post-synthesis verification — 진짜 fix.
- PR-9 머지 후 dogfood S7 재검증 (예상: refuse + nli_score < 0.5).
- v0.18.0 cut.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-25 12:44:31 +00:00
71fb2cbcb3 Merge pull request 'fix(rag): fb-41 PR-7 multi-hop pre-decompose score-gate (S7 hallucination 회귀 핀)' (#174) from feat/fb-41-pr-7-multi-hop-score-gate-fix into main 2026-05-25 12:05:23 +00:00
85855ef596 chore(rag): PR #174 회차 1 리뷰 반영
`ask_multi_hop` 의 probe_hits 가 gate 검사 후 throw away 되는 의도
명시 — pool 초기값으로 재사용 안 하는 *invariant clarity* rationale 을
코드 안에 doc. 향후 retrieve cost 가 multi-hop bottleneck 이 될 경우
재검토 hint 도 함께.

검증
- `cargo test -p kebab-rag -j 1 --test multi_hop` 10 모두 통과.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-25 12:04:53 +00:00
da25ce330b fix(rag): fb-41 PR-7 — multi-hop pre-decompose score-gate (S7 hallucination 회귀 핀)
v0.18 cut 전 fb-41 multi-hop RAG 도그푸딩에서 발견된 **safety regression**
fix. 자세한 도그푸딩 결과는 `tasks/HOTFIXES.md` 의 2026-05-25 fb-41
pre-v0.18 entry + `/build/cache/dogfood-v018/results/SUMMARY.md` 참조.

## 문제 (S7)

Query: `What is the chemical formula of caffeine?` (KB 에 없는 fact).

- Single-pass `kebab ask`: retrieve top score 가 default `rag.score_gate
  = 0.30` 미만 → `refuse_score_gate` → 안전한 refusal.
- Multi-hop `kebab ask --multi-hop`: **`grounded = true`**, 본문
  `"카페인의 화학식은 C₉H₁₅N₃O 입니다 [#6]"` (hallucination — 실제
  C₈H₁₀N₄O₂) + `[#6]` 가 Adam optimizer chunk 의 `g_t = ∂L/∂θ_i` 본문을
  인용 (시각적 short structured token 매칭 trigger).

원인: `ask_multi_hop` 의 score-gate 검사가 *pool 의 top_score* 만 봤다.
multi-hop 의 pool 은 5 sub-queries 의 union — 한 sub-query 의 top score
가 gate 위면 다른 chunks 가 원본 query 와 무관해도 gate 통과 + synth →
LLM hallucinate.

## Fix

`ask_multi_hop` entry 에 **pre-decompose probe** 추가:

1. *원본 query* 로 retrieve 한 번 (LLM call 0회, ~ms).
2. probe empty → `refuse_no_chunks(None)` (decompose 안 함, hops=None).
3. probe top_score < gate → `refuse_score_gate(None)` (decompose 안 함).
4. probe pass → 기존 decompose / decide / synthesize flow 그대로.

Multi-hop 의 safety floor 가 single-pass 와 정확히 일치 — multi-hop 은
*원본 query 가 이미 KB 범위 내* 일 때만 cross-doc reasoning 추가.

비용: 한 번의 retrieve (수 ms), LLM call 없음. multi-hop 의 LLM-dominated
latency 대비 무시 가능.

## Tests

신규 3 회귀 핀 (`crates/kebab-rag/tests/multi_hop.rs`):

- `multi_hop_below_probe_gate_refuses_before_any_llm_call` — **S7 직접
  회귀 핀**. low-score chunk + empty LM script → score_gate refusal, LM
  calls 0회, hops=None. fix revert 시 즉시 panic.
- `multi_hop_empty_probe_pool_refuses_before_any_llm_call` — empty
  retrieve 시 NoChunks refusal, LM calls 0회.
- `multi_hop_above_probe_gate_proceeds_to_decompose` — probe pass 시
  full multi-hop flow 정상 (decompose + decide + synth).

기존 7 multi-hop test 의 `ScriptedRetriever` 에 *probe-pass entry*
prepend + `retriever_handle.calls()` expectation +1. test 2 / test 4
처럼 entry 두 개였던 곳도 prepend (3 entries).

`multi_hop_refuse_no_chunks_preserves_hops_trace` /
`multi_hop_refuse_score_gate_preserves_hops_trace` 의 의미 좁힘 — 이제
*decompose-driven* refusal (probe pass 후 sub-query retrieve 가 empty
또는 below-gate) 만 검증. *probe-driven* refusal 은 hops=None
(decompose 안 함) — 신규 test 가 그 path 핀.

## 검증

- `cargo test -p kebab-rag -j 1` — 10 multi-hop (7 갱신 + 3 신규) + 19
  pipeline + 31 unit + 3 prompt_template + 3 streaming 모두 통과. 회귀
  없음.
- `cargo clippy -p kebab-rag --all-targets -j 1 -- -D warnings` clean.
- 단일 crate 직렬 build (16 GB RAM 제약).

## 변경 없음

- Wire schema — `Answer.hops` shape 동일, `refusal_reason` enum 동일.
- 다른 도그푸딩 발견 (synthesize citation 일관성, latency, binary path
  confusion) — v0.18.1 또는 별 PR 의 책임. HOTFIXES 의 "다른 도그푸딩
  발견" 절에 명시.

## 다음

PR-7 머지 후:
1. Workspace `Cargo.toml` version 0.17.2 → 0.18.0 (minor bump).
2. HANDOFF.md / INDEX.md 갱신 + frozen design §3.8 multi-hop sub-section.
3. `gitea-release v0.18.0 --auto-notes`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-25 12:02:11 +00:00
5bfea3c28b Merge pull request 'feat(tui): fb-41 PR-6 TUI Ask multi-hop toggle + hop trace summary' (#173) from feat/fb-41-pr-6-tui-multi-hop-toggle into main 2026-05-25 09:30:06 +00:00
b6756f8ce3 chore(tui): PR #173 회차 1 리뷰 반영
test `spawn_snapshot_multi_hop_into_askopts` →
`ask_state_multi_hop_field_default_false_and_round_trips` 로 rename.
이전 이름은 spawn 동작 검증을 약속했으나 본문은 단순 field
default + setter round-trip 만 검증 — name 과 실제 의도의 mismatch.
새 이름이 실제 검증 (field shape pin) 과 정확히 일치.

doc string 도 spawn 동작은 별 path (live dogfood) 로 검증된다고
명확히 표기 — test 의 책임 범위가 무엇인지 reader 가 즉시 파악.

검증
- `cargo test -p kebab-tui -j 1 --test ask` — 42 test (6 multi-hop
  포함) 모두 통과.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-25 09:29:36 +00:00
016f380428 feat(tui): fb-41 PR-6 — TUI Ask multi-hop toggle + hop trace summary
fb-41 multi-hop RAG 의 **마지막 component PR** (PR-5 머지 직후). TUI Ask
패널의 user-facing surface — F2 toggle, multi-hop badge, status panel
의 hop count summary, cheatsheet 안내. v0.18.0 cut 준비.

설계: docs/superpowers/specs/2026-05-25-p9-fb-41-multi-hop-rag-design.md
계획: docs/superpowers/plans/2026-05-25-p9-fb-41-multi-hop-rag.md (PR-6 단락)

## TUI surface

- `crates/kebab-tui/src/app.rs`:
  - `AskState.multi_hop: bool` field + Default false. 사용자 토글
    상태를 인-패널 보존, 대화 history 와 직교 — F2 flipping mid-
    conversation 도 turns 보존 (다음 turn 만 다른 pipeline 으로 route).
- `crates/kebab-tui/src/ask.rs`:
  - `handle_key_ask` 에 `(KeyCode::F(2), _) → s.multi_hop = !s.multi_hop`.
    Mode-agnostic (physical function key — Normal/Insert 양쪽 작동, typing
    ambiguity 없음). Briefing 의 candidate (F2 vs Ctrl-T) 중 F2 채택 —
    Ctrl-M 은 Enter 와 collision 이미 명시, F2 가 cleanest.
  - `spawn_ask_worker` 의 `AskOpts.multi_hop` 가 spawn 시점에 토글값
    snapshot. 이후 F2 flip 은 다음 Enter 부터 적용 (in-flight turn 무영향).
  - `render_input` 의 input pane title 에 `F2=multi-hop` binding 안내
    추가 + prompt row 에 `multi-hop` badge (Success 녹색, toggled-on 일
    때만). 사용자가 어떤 pipeline 으로 다음 query 를 보낼지 항상 가시.
  - `render_status` 의 status panel 에 `multi-hop: N hops` line 추가
    (last_answer.hops 가 Some 일 때만). forced_stop 발생 시
    `forced_stop=K` suffix — depth/pool cap tuning 단서.
- `crates/kebab-tui/src/cheatsheet.rs`:
  - Ask section 에 `F2 toggle multi-hop pipeline` entry 추가.

## 변경 없음 (의도된 deferral)

- `InspectTarget::Hop(turn_index)` variant — plan 의 PR-6 stretch goal.
  per-iter hop trace detail 을 Inspect 패널에 노출하는 기능은 별 PR
  (PR-6b 또는 v0.18 dogfood follow-up). PR-6 의 핵심 가치
  (사용자가 multi-hop pipeline 을 토글하고 결과의 hop count 를 본다)
  는 status panel 의 한 줄 summary 로 100% cover. Inspect 진입은
  multi-hop 사용자가 *드물게* 필요한 surface — v0.18 cut 부담 회피.
- prompt_template_version (`rag-multi-hop-v1`) — 그대로.
- MCP / CLI surface — PR-4 / PR-5 의 책임.

## Tests (`tests/ask.rs` 신규 6 multi-hop pins)

- `f2_toggles_multi_hop_flag_from_insert_mode`: Insert 에서 F2 toggle
  (fresh_app default mode).
- `f2_toggles_multi_hop_flag_from_normal_mode`: Normal 에서도 동일
  — mode-agnostic 회귀 핀.
- `input_pane_shows_multi_hop_badge_when_toggled_on`: 토글 on 시
  prompt row 에 `multi-hop` 등장 + title 의 `F2=multi-hop` binding
  hint 등장.
- `input_pane_omits_multi_hop_badge_when_toggled_off`: 토글 off 시
  prompt row 의 badge 부재 (title hint 는 유지 — 사용자 discoverability).
- `status_panel_summarizes_hops_when_answer_has_trace`: 3-hop trace
  (Decompose + Decide + Synthesize) → `multi-hop: 3 hops` line.
- `status_panel_omits_hops_summary_for_single_pass`: hops=None → 본문
  에 summary line 부재 (title binding hint 만).
- `spawn_snapshot_multi_hop_into_askopts`: AskState.multi_hop 의
  field shape 회귀 핀 (default false / settable / round-trip).

## 검증

- `cargo test -p kebab-tui -j 1` — 신규 6 multi-hop + 기존 ask /
  search / library / mode / cheatsheet / inspect / status_bar 모두
  통과 (42 ask test + 10 mode + 기타). 회귀 없음.
- `cargo clippy -p kebab-tui --all-targets -j 1 -- -D warnings` clean.
- 단일 crate 직렬 build (16 GB RAM 제약).

## v0.18.0 cut (다음 단계)

- Workspace `Cargo.toml` version 0.17.2 → 0.18.0 (minor — surface
  확장 + new prompt_template_version `rag-multi-hop-v1`).
- HANDOFF.md / HOTFIXES.md / INDEX.md 갱신 (fb-41 entry 정리).
- `gitea-release v0.18.0 --auto-notes`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-25 09:26:29 +00:00
bf28a1e4d9 Merge pull request 'feat(mcp): fb-41 PR-5 MCP ask multi_hop arg + SKILL.md 안내' (#172) from feat/fb-41-pr-5-mcp-multi-hop-arg into main 2026-05-25 09:09:06 +00:00
24221826ed chore(mcp): PR #172 회차 1 리뷰 반영
`ask_tool_routes_multi_hop_true_to_decompose_first` 의 error code
검증을 더 견고하게 — `model_unreachable | timeout` 둘 다 accept.
환경 차이 (즉시 ECONNREFUSED vs connect timeout) 가 다른 wire code
로 분류돼도 dispatch divergence 자체 (schema_version=error.v1 +
isError=true vs single-pass 의 answer.v1 grounded=false) 는 동일하게
검증.

검증
- `cargo test -p kebab-mcp -j 1 --test tools_call_ask_multi_hop` 2 통과.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-25 09:08:40 +00:00
8a2f7affa6 feat(mcp): fb-41 PR-5 — MCP ask multi_hop arg + SKILL.md 안내
fb-41 multi-hop RAG 의 **PR-5** (PR-4 머지 직후). PR-4 의 CLI `--multi-hop`
flag 와 sister surface — agent (Claude Code 등 MCP host) 가 `mcp__kebab__ask`
호출 시 `multi_hop: true` 옵션 사용 가능.

설계: docs/superpowers/specs/2026-05-25-p9-fb-41-multi-hop-rag-design.md
계획: docs/superpowers/plans/2026-05-25-p9-fb-41-multi-hop-rag.md (PR-5 단락)

## MCP surface

- `crates/kebab-mcp/src/tools/ask.rs`:
  - `AskInput.multi_hop: Option<bool>` 추가. JsonSchema derive 가 tools/list
    에 자동 반영 — agent capability discovery 가 새 필드 인식.
  - `handle()` 가 `AskOpts.multi_hop = input.multi_hop.unwrap_or(false)` —
    기존 caller (필드 누락 / null) 는 single-pass 그대로.
- `crates/kebab-mcp/src/lib.rs` (tools/list):
  - `ask` tool description 에 multi-hop 한 줄 (decompose → retrieve →
    synthesize, 2-5× LLM cost, per-hop trace on Answer.hops).

## SKILL.md 안내

- `integrations/claude-code/kebab/SKILL.md` 의 `mcp__kebab__ask` 절:
  - Input shape JSON 예제에 `multi_hop: false` 추가.
  - Returns 절에 `hops` (multi-hop only) 추가.
  - 신규 bullet (p9-fb-41) — opt-in 조건 / 비용 trade-off / 사용 케이스
    (compound questions / prereq chains / cross-doc reasoning) /
    `Answer.hops` 의 per-hop trace shape / `multi_hop_decompose_failed`
    refusal 처리.

## Tests (`tests/tools_call_ask_multi_hop.rs` 신규, 2 Ollama-free pins)

- `ask_tool_routes_multi_hop_true_to_decompose_first`: dispatch
  divergence 핀. invalid LLM endpoint (`http://127.0.0.1:1`,
  request_timeout_secs=2) 로 force unreachable. multi_hop=true 는
  decompose 먼저 호출 → `error.v1` (code=model_unreachable) /
  isError=true. multi_hop=false (single-pass) 는 empty KB 에서 retrieve
  먼저 → no LLM call → `answer.v1` grounded=false / isError=false. 두
  shape 의 분기가 dispatch 가 실제로 다른 path 로 라우팅됨의 증거.
- `ask_input_schema_advertises_multi_hop_field`: AskInput 의 JsonSchema
  가 `multi_hop` property 노출 — MCP host capability discovery
  (tools/list 의 input schema) 회귀 핀.

기존 `tools_call_ask.rs` 의 AskInput literal 도 `multi_hop: None`
추가 (struct field 추가에 따른 minimal cascade).

## 변경 없음

- `prompt_template_version` (`rag-multi-hop-v1`) — 그대로.
- TUI surface — PR-6 의 책임.
- error.v1 매핑 — PR-4 의 enum reservation 그대로 (no error_wire
  promotion).

## 검증

- `cargo test -p kebab-mcp -j 1` — 신규 tools_call_ask_multi_hop 2 +
  기존 ask / search / bulk_search / fetch / ingest / schema / doctor /
  tools_list / initialize 등 모두 통과 (회귀 없음).
- `cargo clippy -p kebab-mcp --all-targets -j 1 -- -D warnings` clean.
- 단일 crate 직렬 build (16 GB RAM 제약).

## 다음 PR

- PR-6: TUI Ask 패널 multi-hop toggle (F2 / Ctrl-T) + hop trace render +
  cheatsheet 갱신.
- v0.18.0 cut (PR-6 머지 후).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-25 09:06:28 +00:00
f28a422f79 Merge pull request 'feat(cli): fb-41 PR-4 CLI --multi-hop flag + answer.v1 / error.v1 wire' (#171) from feat/fb-41-pr-4-cli-multi-hop-flag into main 2026-05-25 08:48:08 +00:00
c56242d04f chore(cli): PR #171 회차 1 리뷰 반영
`answer.schema.json` 의 `refusal_reason` description 의 PR 번호 정정:
`multi_hop_decompose_failed` 도입 시점 = PR-2 (#167, RefusalReason variant
+ ask_multi_hop decompose-failure 분기). PR-3a (#168) 는 `Answer.hops`
field + RagCfg knob 만 — refusal variant 와 무관.

검증
- `cargo test -p kebab-cli -j 1 --test wire_ask_multi_hop` 4 모두 통과.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-25 08:47:35 +00:00
17c48a0ee6 feat(cli): fb-41 PR-4 — CLI --multi-hop flag + answer.v1 / error.v1 wire 확장
fb-41 multi-hop RAG 의 **PR-4** (PR-3b-ii 의 ScriptedLm + tests 위에서
user-facing CLI surface + JSON Schema 확장). PR-3b-i / PR-3b-ii 의 multi-hop
pipeline 을 `kebab ask --multi-hop` 으로 사용자에게 노출.

설계: docs/superpowers/specs/2026-05-25-p9-fb-41-multi-hop-rag-design.md
계획: docs/superpowers/plans/2026-05-25-p9-fb-41-multi-hop-rag.md (PR-4 단락)

## CLI surface

- `kebab ask --multi-hop <query>` — 새 flag (default false). `AskOpts.multi_hop`
  로 전달, stream + non-stream 두 callsite 모두 갱신.
- `--show-citations` / `--hide-citations` / `--stream` / `--session` 등 기존
  flag 와 orthogonal.
- `--json` 모드에서 `Answer.hops` 배열이 multi-hop happy path / refusal-with-
  partial-trace 양쪽 경로에서 노출됨 (PR-3b-i + PR-3b-ii 의 wiring).

## Wire schema 확장

- `docs/wire-schema/v1/answer.schema.json`:
  - 신규 `hops: array | null` 필드 (optional, additive). `HopRecord` 의
    `$defs` 추가 — `iter` / `kind` (decompose|decide|synthesize) /
    `sub_queries` / `context_chunks_added` / `forced_stop` / `llm_call_ms`
    6 필드 + per-field doc.
  - `refusal_reason` 필드를 `anyOf [enum, null]` 로 명시 — 6 variant
    (`score_gate`, `llm_self_judge`, `no_index`, `no_chunks`,
    `llm_stream_aborted`, `multi_hop_decompose_failed`). 이전 schema 는
    `type: string|null` 만 명시 → enum 명시는 agent / consumer 의 strict
    validate 강화 (additive — 기존 producer 값 모두 enum 안).
  - `$id` / `schema_version` 변경 없음 — additive minor.
- `docs/wire-schema/v1/error.schema.json`:
  - `code` enum 에 `multi_hop_decompose_failed` 추가. **이는 forward-looking
    enum extension** — 현재 RefusalReason 은 `Answer.refusal_reason` (stdout)
    으로만 노출되고 `error.v1` (stderr) 경로 안 거침. 미래 PR 에서 fatal
    promotion 정책 결정 시 trigger 가능하도록 enum 만 미리 reserve.
  - details.description 의 per-code 안내에 `multi_hop_decompose_failed: {}`
    note 추가 — reserved 상태 명시.

## Tests

- `crates/kebab-cli/tests/wire_ask_multi_hop.rs` 신규 (4 Ollama-free pins):
  - `cli_ask_help_advertises_multi_hop_flag`: clap-level smoke, `kebab ask
    --help` 출력에 `--multi-hop` 등장 확인.
  - `answer_schema_declares_hops_property_with_hop_record_defs`: `hops`
    property 존재 + `$defs.HopRecord` 의 `kind` enum 3 variant
    (decompose/decide/synthesize) 회귀 핀.
  - `answer_schema_refusal_reason_enum_includes_multi_hop_decompose_failed`:
    6 variant 모두 enum 에 존재 — 기존 5 도 함께 핀 (회귀 방지).
  - `error_schema_code_enum_includes_multi_hop_decompose_failed`: 신규
    code enum 확장 + 기존 code (config_invalid / not_indexed / ...) 보존 핀.

End-to-end multi-hop ask 의 live Ollama 검증은 후속 `#[ignore]` test 로
(같은 `wire_ask_stale.rs` 패턴). PR-4 의 범위 = clap + schema 정합성 만.

## 변경 없음

- `crates/kebab-app/src/error_wire.rs` — plan 의 "error_wire 매핑" 항목은
  현재 RefusalReason 가 `Answer.refusal_reason` 로만 노출 (anyhow chain 안
  거침) 라 trigger 가 없음. enum reservation 만으로 충분, 매핑 코드는 dead
  code 회피. 향후 fatal-promotion 정책 (refusal → error.v1) 결정 시 PR-4b
  로 split.
- `prompt_template_version` — `rag-multi-hop-v1` 그대로.
- TUI / MCP surface — PR-5 / PR-6 에서.

## 검증

- `cargo test -p kebab-cli -j 1` — 모든 test 통과 (신규 wire_ask_multi_hop 4 +
  기존 ask / search / schema / ingest / mcp / reset 등 모두).
- `cargo clippy -p kebab-cli --all-targets -j 1 -- -D warnings` clean.
- 단일 crate 직렬 build (16 GB RAM 제약).

## 다음 PR

- PR-5: MCP `ask` tool 의 `multi_hop: bool` argument + `integrations/claude-
  code/kebab/SKILL.md` 의 ask 절 갱신.
- PR-6: TUI Ask 패널 multi-hop toggle (F2 / Ctrl-T) + hop trace render.
- v0.18.0 cut (PR-6 머지 후): `Cargo.toml` 0.17.2 → 0.18.0 + HANDOFF /
  HOTFIXES / INDEX 갱신 + gitea-release.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-25 08:45:01 +00:00
64a009314c Merge pull request 'feat(rag): fb-41 PR-3b-ii ScriptedLm + multi-hop tests + refusal hop trace' (#170) from feat/fb-41-pr-3b-ii-scripted-lm-tests into main 2026-05-25 08:25:42 +00:00
ddfe7ba099 chore(rag): PR #170 회차 2 리뷰 반영
test 7 의 `i32_below_gate_chunk` helper rename → `seed_low_score_chunk` +
반환 shape 을 `(chunk_id, doc_id)` tuple 로 확장. `i32` prefix 가 Rust
integer 타입과 충돌하던 가독성 문제 해소 + 호출자가 `id32("d_low")` 를
재계산하지 않도록 id 페어를 single source of truth 로 통합.

검증
- `cargo test -p kebab-rag -j 1 --test multi_hop` — 7 모두 통과.
- `cargo clippy -p kebab-rag --all-targets -j 1 -- -D warnings` clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-25 08:24:36 +00:00
104363a0db chore(rag): PR #170 회차 1 리뷰 반영
(A) ScriptedLm doc 의 `Arc<Vec<String>>` 표기 → 실제 구현 (`Vec<String>` +
    `AtomicUsize`, 외부에서 `Arc::new(ScriptedLm::new(...))` 로 wrap)
    반영.
(B) ScriptedLm::new doc 의 미존재 `with_*` builder 언급 제거.
(C) refuse path 의 hops 보존 회귀 핀 2 건 추가 (`tests/multi_hop.rs`):
    - `multi_hop_refuse_no_chunks_preserves_hops_trace`: empty pool →
      `refuse_no_chunks(Some(hops))` → Answer.hops = Some([Decompose,
      Decide]).
    - `multi_hop_refuse_score_gate_preserves_hops_trace`: top score 0.10
      < 0.30 gate → `refuse_score_gate(Some(hops))` → 같은 shape.
    refuse_* widening + ask_multi_hop 의 forwarding wiring 이 reverting
    되면 두 test 가 회귀 잡음.
(D) test 5 의 redundant `assert_ne!(.., Some(MultiHopDecomposeFailed))`
    제거 — `assert_eq!(.., None)` 이미 함의. 메시지에 의도 통합.

검증
- `cargo test -p kebab-rag -j 1 --test multi_hop` — 7 (5+2) 모두 통과.
- `cargo clippy -p kebab-rag --all-targets -j 1 -- -D warnings` clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-25 08:22:58 +00:00
6188a50c1c feat(rag): fb-41 PR-3b-ii — ScriptedLm + 5 multi-hop tests + refusal hop trace + carry-over
PR-3b 의 분할 두 번째 PR — PR-3b-i 의 dynamic decide loop 위에서:

1. **ScriptedLm + ScriptedRetriever helper** (kebab-rag tests/common/mod.rs)
   per-call 다른 response 반환. decompose / decide×N / synthesize 의 각
   LLM call 을 구분하는 다단계 multi-hop 시나리오를 mock-only 로 exercise
   가능. `Vec<&str>` / `Vec<Vec<SearchHit>>` 받아 call sequence 순서대로
   emit. Send + Sync.

2. **5 multi-hop integration tests** (kebab-rag tests/multi_hop.rs 신규)
   - decide_stop_triggers_synthesize: decide [] → 즉시 synthesize
   - decide_continue_adds_more_chunks: decide ["q2"] → iter 2 retrieve + pool 확장
   - max_depth_force_stops: depth cap → forced_stop + decide LLM call skip
   - pool_chunks_dedup_by_chunk_id: 같은 chunk_id 두 sub-query 에서 1 회
   - decide_parse_failure_falls_through_to_synthesize: parse fail = graceful
     synthesize (refusal 아님, spec §9)

3. **refuse_* helper hops trace 보존** (회차 1 carry-over)
   refuse_no_chunks / refuse_score_gate 시그니처에 `hops:
   Option<Vec<HopRecord>>` 인자 추가. ask_multi_hop 의 score-gate /
   no-chunks refusal 시 누적된 hops 그대로 Answer.hops 에 보존.
   single-pass ask 는 None 전달 — wire 변동 없음 (skip_serializing_if).

4. **HopRecord doc 보강** (회차 1 carry-over)
   sub_queries 의 per-kind 의미 명시 (Decompose=initial / Decide=next-iter
   or empty=stop / Synthesize=always empty). llm_call_ms=0 의 ambiguity
   (no call vs 0ms call) doc 명시.

5. **MULTI_HOP_MAX_SUB_QUERIES_DEFAULT → _HARD_CAP rename** (회차 1 carry-over)
   const 의 의도 명확화 — config knob `multi_hop_max_sub_queries_per_iter`
   (5, prompt-side soft hint) 와 const (10, parse-side hard ceiling)
   분리. 두 layer 의 책임 doc 동기화. test 도 rename.

6. **decide guard 단순화 + preview budget doc** (회차 1 carry-over)
   parse_decompose_response 의 post-condition (Some=non-empty 보장)
   doc 명시. defensive `Some(qs) if !qs.is_empty()` →
   `decide_result.unwrap_or_default()` 단순화. decide preview 의
   snippet-only path (full chunk text 안 fetch) 의도 doc.

검증
- `cargo test -p kebab-rag -j 1` — 31 unit + 19 pipeline + 5 multi_hop
  + 3 prompt_template + 3 streaming 모두 통과.
- `cargo clippy -p kebab-rag --all-targets -j 1 -- -D warnings` clean.

Spec / plan
- design: docs/superpowers/specs/2026-05-25-p9-fb-41-multi-hop-rag-design.md
- plan: docs/superpowers/plans/2026-05-25-p9-fb-41-multi-hop-rag.md (PR-3b 단락)

다음 단계 = PR-4 (CLI --multi-hop + wire schema + error_wire).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-25 08:17:37 +00:00
94e6146013 Merge pull request 'feat(rag): fb-41 PR-3b-i dynamic decide loop + helpers' (#169) from feat/fb-41-pr-3b-decide-loop into main 2026-05-25 07:32:25 +00:00
12c7dc9efb feat(rag): fb-41 PR-3b-i — dynamic decide loop + helpers + format! named arg
PR-3b 의 분할 첫 PR. ask_multi_hop 의 fixed depth=2 → dynamic N-hop.
ScriptedLm helper + 5+ integration tests (happy-path 통합 검증) 는
PR-3b-ii 분리. 본 PR 의 회귀 핀 = 기존 PR-2 의 2 integration test
통과 (decompose garbage refusal + multi_hop=false single-pass keep).

- `RagPipeline::multi_hop_decompose` 시그니처 변경 — `Result<
  (Option<Vec<String>>, u32)>` (parsed result + LLM call latency_ms).
  caller (`ask_multi_hop`) 가 hop trace 의 `llm_call_ms` stamp.
- `RagPipeline::multi_hop_decide` helper 신규. decide LLM call →
  `parse_decompose_response` 으로 `Option<Vec<String>>` 반환. None
  또는 empty array 가 stop signal (refusal 아닌 graceful degrade).
- `MULTI_HOP_DECIDE_SYSTEM_PROMPT` const 신규.
- `MULTI_HOP_DECOMPOSE_USER_TEMPLATE` const 제거 + `format!` named
  arg 사용 (PR-2 회차 1 carry-over fix). compile-time substitution
  check — 사용자 query 안에 `{max_sub_queries}` literal 있어도
  mis-replace 회피.
- `ask_multi_hop` 의 §1 (Decompose) + §2 (Retrieve) 영역을 dynamic
  loop 으로 재작성:
  - iter 0 = decompose, HopRecord 추가 (kind=Decompose).
  - iter 1..=max_depth = retrieve current_sub_queries → pool dedup
    → decide LLM call (forced_stop / pool_cap_hit 시 skip).
    HopRecord 추가 (kind=Decide, sub_queries=new_sub_queries,
    context_chunks_added, forced_stop, llm_call_ms).
  - `max_pool_chunks` 도달 시 `pool_cap_hit = true` → 그 iter 의
    HopRecord 가 `forced_stop = true` + decide LLM call skip.
  - depth 도달 (`iter >= max_depth`) 시 동일하게 forced_stop.
  - decide parse failure 또는 empty array → loop break (early
    synthesize, NOT refusal — spec §9 graceful degrade).
- §6 (Generate) 시작 시 `synthesize_started: Instant::now()` 별
  stamp → §8 Build Answer 직전 `HopRecord { kind=Synthesize,
  llm_call_ms = synth_ms }` 추가. happy path 의 Answer literal
  `hops: Some(hops)` 채움 (`hops: None` → `Some(...)` 변경).
- doc comment 갱신: "PR-2 scope (fixed depth=2)" → "PR-3b-i
  scope (dynamic N-hop)". refusal path 의 hops trace 손실 caveat
  명시 (PR-3b-ii / follow-up 에서 helper signature 확장 시 해결).

기존 회귀 핀 (PR-2 의 2 integration test):
- `ask_multi_hop_dispatches_and_decompose_garbage_refuses`:
  decompose garbage → RefusalReason::MultiHopDecomposeFailed +
  정확히 1 LLM call. PR-3b-i 의 시그니처 변경 후도 통과.
- `ask_with_multi_hop_false_keeps_single_pass_path`: 영향 없음.

56 unit + integration test 모두 통과 (kebab-rag).

Wire 영향: `Answer.hops` 가 multi-hop happy path 에서 emit. JSON
Schema additionalProperties default `true` 라 wire breaking 아님
(PR-3a 의 review 확인). schema.json 명시 갱신은 별 PR
(PR-3b-ii 또는 PR-4 의 schema sweep).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-25 07:29:46 +00:00
cd1d4fb807 Merge pull request 'feat(rag): fb-41 PR-3a HopRecord wire + RagCfg multi-hop knobs' (#168) from feat/fb-41-pr-3-dynamic-decide-loop into main 2026-05-25 07:18:27 +00:00
7150c376bb feat(rag): fb-41 PR-3a — HopRecord wire + RagCfg multi-hop knobs
PR-3 의 분할 첫 PR. wire additive (HopRecord + HopKind + Answer.hops
field) + RagCfg 의 multi_hop_* 3 노브. RAG pipeline 동작 미변경 —
모든 Answer literal 의 `hops = None`. PR-3b (후속) 가 ask_multi_hop
의 happy path 에서 dynamic decide loop 구현 + hops trace 채움.

분할 이유: 원래 PR-3 가 wire + cfg + decide loop + ScriptedLm +
helper refactor + 5+ tests 단일 PR 였는데 ~1500 줄 단일 patch 가
review 부담 + 회기 위험 ↑. additive foundation 부터 ship 후 decide
loop 별 PR — 사용자 결정 (2026-05-25).

- `kebab_core::HopRecord` (iter, kind, sub_queries,
  context_chunks_added, forced_stop, llm_call_ms) + `HopKind`
  (Decompose / Decide / Synthesize) — wire-additive shape.
- `kebab_core::Answer.hops: Option<Vec<HopRecord>>` —
  `#[serde(default, skip_serializing_if = "Option::is_none")]`,
  single-pass / refusal path 는 None, PR-3b 의 multi-hop happy
  path 가 Some.
- `kebab_config::RagCfg` 에 3 신규 노브:
  - `multi_hop_max_depth: u32` (default 3)
  - `multi_hop_max_sub_queries_per_iter: u32` (default 5)
  - `multi_hop_max_pool_chunks: u32` (default 30)
  3 모두 `#[serde(default)]` + env override
  (`KEBAB_RAG_MULTI_HOP_MAX_*`) + legacy parse 핀
  (`LEGACY_PRE_TIMEOUT_TOML` 공유).
- 9 Answer literal site (pipeline.rs ×6 + kebab-cli + kebab-tui
  tests + kebab-eval test) 에 `hops: None` 명시 추가. exhaustive
  field check 가 자동 guard — 빠진 site 시 compile fail.
- plan 의 PR-3 단락 → PR-3a / PR-3b 분할 명시 + scope 정정.

Tests (163 passing across kebab-config + kebab-core + kebab-rag):
- 5 신규 multi-hop knob test (default / env override / legacy parse).
- 기존 50+57+31+19+3+3 test 모두 hops:None 추가 후도 통과.

Wire 영향: `answer.v1` 의 optional `hops` 필드 — `skip_serializing_
if = None` 이라 single-pass response 에 emit 안 됨. wire breaking
아님, JSON Schema 갱신은 PR-3b 또는 PR-4 (실제 emit 시점).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-25 07:15:01 +00:00
6280abf2df Merge pull request 'feat(rag): fb-41 PR-2 ask_multi_hop skeleton (fixed depth=2)' (#167) from feat/fb-41-pr-2-ask-multi-hop-skeleton into main 2026-05-25 06:50:02 +00:00
192da45dbf chore(rag): PR #167 회차 1 리뷰 반영
- `parse_decompose_response_drops_partial_empty_keeps_valid` 신규
  회귀 핀 — `["", "valid q", "  "]` → `["valid q"]` (trim+filter
  chain 동작 pin).
- `multi_hop_decompose` 의 `stop: Vec::new()` 옆 doc comment 추가 —
  의도 명시 (instruction-following 모델 기대 + prose 추가 시
  MultiHopDecomposeFailed refusal 가 policy). 회차 1 question 의
  답변.
- plan 의 PR-3 implementation order 에 회차 1 carry-over 추가:
  1) ask + ask_multi_hop 의 §4-§9 mirror → 공통 helper 추출,
  2) decompose template 의 substitution corner case → format!
     named arg 으로 교체.

회차 1 의 다른 suggestion (mirror refactor, substitution corner
case, history block helper) 는 PR-3 합리적 timing 으로 plan 에
명시 — 회차 2 reply 에 정리.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-25 06:49:21 +00:00
cf35f36f88 feat(rag): fb-41 PR-2 — RagPipeline::ask_multi_hop skeleton (fixed depth=2)
PR-2 of fb-41 multi-hop RAG. Decompose + retrieve + synthesize 3-stage
pipeline가 `opts.multi_hop=true` 일 때 dispatch. Dynamic decide loop
는 PR-3.

- `AskOpts.multi_hop: bool` 필드 추가 + `impl Default for AskOpts`
  도입 (HOTFIXES 2026-05-07 의 known limitation 해소). 9 explicit
  init site 모두 `multi_hop: false` 추가 — Default 도입으로 향후
  `..Default::default()` 점진 migrate 가능.
- `RagPipeline::ask` 의 entry 에 dispatcher 한 줄
  (`if opts.multi_hop { return self.ask_multi_hop(...) }`).
- `RagPipeline::ask_multi_hop` 신규 method. 1) decompose LLM call
  → JSON array of strings parse, 2) 각 sub-query 로 retrieve +
  chunk_id dedup pool, 3) score gate / no-chunks 가드, 4)
  pack_context (single-pass 와 helper 공유), 5) synthesize LLM
  call w/ MULTI_HOP_SYNTHESIZE_SYSTEM_PROMPT, 6) citation extract
  + Answer build. `prompt_template_version` = "rag-multi-hop-v1"
  로 stamp — eval `compare` 가 single-pass vs multi-hop 분리.
- Prompt const 신규: MULTI_HOP_DECOMPOSE_SYSTEM_PROMPT +
  MULTI_HOP_DECOMPOSE_USER_TEMPLATE + MULTI_HOP_SYNTHESIZE_SYSTEM_PROMPT
  + PROMPT_TEMPLATE_VERSION_MULTI_HOP + MULTI_HOP_MAX_SUB_QUERIES_DEFAULT.
- `kebab_core::RefusalReason::MultiHopDecomposeFailed` variant 신규.
  Cascade: kebab-store-sqlite `refusal_reason_label` + kebab-tui `ask
  refusal render` exhaustive match 갱신.
- `parse_decompose_response` + `strip_markdown_json_fence` helper —
  markdown code fence (```json / ```) strip + JSON array of strings
  parse + trim + drop empty + cap at MULTI_HOP_MAX_SUB_QUERIES_DEFAULT.
  None 반환 시 caller 가 `MultiHopDecomposeFailed` refusal.

Tests (55 passing total, 8 신규):
- 6 unit (parse_decompose_response 의 bare array / fence variants /
  garbage / cap / trim 회귀 핀).
- 2 integration: `ask_multi_hop_dispatches_and_decompose_garbage_refuses`
  (decompose garbage → MultiHopDecomposeFailed + 정확히 1 LLM call) +
  `ask_with_multi_hop_false_keeps_single_pass_path` (회귀 핀, 기존
  caller 자동 backwards-compat).

Happy-path multi-hop (decompose 성공 → synthesize) 의 integration
test 는 ScriptedLm helper 가 PR-3 의 decide loop 와 함께 도입될
때 같이 추가. 현 `MockLanguageModel` 는 canned single response 라
2-LLM-call sequence 핀 불가.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-25 06:45:32 +00:00
ed34f2e03f Merge pull request 'feat(eval): fb-41 multi-hop golden set + spec/plan' (#166) from feat/fb-41-multi-hop-eval-golden into main 2026-05-25 06:27:06 +00:00
624b44c46b chore(eval): PR #166 회차 1 리뷰 반영
- `mh-s-004` 의 `must_contain: ["i"]` 한 글자 → `["INSERT", "i 입력모드"]`
  보강. trigram 0-hit + noise 매칭 위험 해소.
- 3 question 영어 변경 (`mh-c-005` / `mh-i-001` / `mh-s-002`) — fixture
  의 lang 다양성 mix (12 ko + 3 en). 영어 dogfood 시 measurement gap
  회피.
- plan 의 PR-1 단락이 outdated (kebab-eval crate 미survey 단계 작성 →
  실제 PR 와 deviation). actual 변경 명시 + 초안 대비 deviation 명시.

회차 1 의 다른 2 suggestion (mh-c-002 의 `v0.17.2` hard-coded, 15
question / 5-per-bucket 회귀 핀의 frozen size) 은 baseline anchor 의도
적 freeze — 회차 2 reply 에 명시.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-25 06:26:15 +00:00
caf690dc72 feat(eval): fb-41 multi-hop golden set + spec/plan
PR-1 of fb-41 multi-hop RAG (spec: docs/superpowers/specs/2026-05-25-
p9-fb-41-multi-hop-rag-design.md, plan: docs/superpowers/plans/2026-
05-25-p9-fb-41-multi-hop-rag.md).

XL 작업의 첫 PR — baseline 측정 anchor 만 추가. RAG pipeline 미변경,
fixture file + parse 회귀 핀.

사용자 결정 4 axis (2026-05-25):
- approach: query decomposition (LLM 서브-질문)
- trigger: explicit `--multi-hop` flag
- MVP scope: dynamic N-hop (LLM 이 depth 결정, decompose seed +
  ReAct-style decide loop hybrid)
- eval: multi-hop golden set 먼저 (본 PR)

본 PR:
- `fixtures/multi_hop_golden.yaml` 신규. 15 question (5 cross-doc +
  5 intra-doc + 5 single-fact negative). 기존 `GoldenQuery` struct
  그대로 사용 — 별 loader / type 변경 없음. `expected_chunk_ids`
  비어 있어 curator 가 `kebab ingest` 후 채울 수 있는 template
  형태. `must_contain` 으로 baseline 측정 가능 (P5-2 metric).
- `crates/kebab-eval/tests/loader.rs::loads_multi_hop_golden_fixture`
  신규 회귀 핀. fixture parse OK + 15 question + 5/5/5 bucket
  분포 + 모든 question 에 must_contain 최소 1 개.

baseline 측정 protocol (별 run, commit 에 artifact 안 포함):
1. v0.17.2 binary 로 single-pass `kebab eval run --fixture
   multi_hop_golden.yaml` 실행
2. P@5, P@10, must_contain pass rate, citation_coverage 캡처
3. PR-3 (dynamic iter 머지) 후 동일 fixture + `multi_hop=true` 로
   재실행 → Δ 비교

PR 분할 6 단계 (plan 참조): PR-1 (본 PR — fixture only), PR-2
(RagPipeline::ask_multi_hop fixed depth=2), PR-3 (dynamic iter),
PR-4 (CLI flag + wire), PR-5 (MCP + SKILL.md), PR-6 (TUI toggle +
trace render). 마지막 PR 후 v0.18.0 cut.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-25 06:22:08 +00:00
1640ecf288 chore: bump version 0.17.1 → 0.17.2
v0.17.1 post-dogfood polish cut. 두 PR 묶어 release:

- PR #164 — `[image.ocr] request_timeout_secs` 별 노브 (v0.17.1
  미진행 closure). LLM 패턴을 OCR 어댑터에 동일 적용, 별 노브로
  분리 (OCR vs LLM 의 cold start 패턴 차이로 독립 조절).
- PR #165 — `heading_path` FTS5 column filter 로 text-only 매칭
  + raw-mode escape hatch (2026-05-24 v0.17.0 trigram entry 의
  JSON 노이즈 closure). lexical.rs 가 non-raw 분기 결과를
  `text : (<expr>)` 로 wrap, 색인 자체는 V007 verbatim 그대로
  유지. raw mode `'heading_path : <token>'` 로 opt-in 가능.

둘 다 additive (옛 config 호환) + re-ingest 불필요. binary 교체만.

HANDOFF 한 줄 요약 + 머지 후 결정 절에 v0.17.2 entry 추가.
HOTFIXES 의 두 entry anchor 가 `post-v0.17.1 dogfood` → `v0.17.2`
로 갱신.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-25 05:55:50 +00:00
90e77631a8 Merge pull request 'feat(search): heading_path FTS5 text column filter' (#165) from feat/heading-text-column-filter into main 2026-05-25 05:48:22 +00:00
fa251db48f chore(search): PR #165 회차 2 리뷰 반영
HOTFIXES entry 의 **MCP / agent 가시성** 단락이 회차 1 의 SKILL.md
추가 결정과 contradiction (`별도 SKILL.md 갱신 불필요` 잘못된
표기). 갱신 사실 + 새 escape hatch 가 v0.17.0 raw mode pattern
위에 build 됐다는 점 명시.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-25 05:45:41 +00:00
3114c31841 chore(search): PR #165 회차 1 리뷰 반영
- HOTFIXES test 카운트 표기 정정: `9 신규 / 갱신 unit test` 의 산수
  ambiguity → `9 unit test (8 갱신 + 1 신규) + 2 신규 통합 test = 11
  total` 로 명시.
- SKILL.md (Claude Code integration) 의 search 절에 column scoping +
  heading_path raw-mode escape hatch 안내 한 bullet 추가. 회차 1
  의 follow-up suggestion 반영 — heading 검색 의도 agent 가 새
  escape hatch `'heading_path : <token>'` 를 발견 가능.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-25 05:44:21 +00:00
271329efbd feat(search): heading_path FTS5 text column filter (default text-only matching)
v0.17.0 trigram tokenizer entry 가 미수정으로 남겨둔
heading_path_json JSON 노이즈 (HOTFIXES 2026-05-24) closure.
trigram 이 chunks_fts.heading_path 컬럼 (V002/V007 트리거가
chunks.heading_path_json 그대로 INSERT) 의 JSON 표기 + 안의 path
세그먼트 (app, src) 까지 3-gram 색인해서 query 가 우연히 false
positive hit 하는 문제. column filter 채택 — heading 색인 유지
(V007 verbatim 불변), 매칭 대상만 text 컬럼 한정.

- build_match_string 가 non-raw 분기에서 combined expression 을
  `text : (<expr>)` 로 wrap. FTS5 column filter syntax 가 OR/AND
  sub-expression 허용.
- Raw mode (`'...'`) 는 그대로 — 사용자가 명시 의도로
  `'heading_path : agent'` 같은 explicit opt-in 가능 (escape hatch).
- 8 기존 build_match_string unit test expected string 갱신 +
  `build_match_string_raw_mode_preserves_heading_filter` 신규.
- `lexical_heading_only_token_does_not_hit_default_mode` 신규 회귀 핀
  (heading-only unique token 이 default mode 에서 0 hit).
- `lexical_raw_mode_can_opt_into_heading_path_filter` 신규 — 같은
  fixture 가 raw mode 로 hit 확인 (escape hatch 동작 핀).

사용자 영향: lexical / hybrid 검색의 본문 precision ↑. recall
변화 없음 (text 본문 token 매칭은 동일). re-ingest 불필요 (FTS
query 시점 매칭만 변경). lexical_snapshot_run_1 + hybrid_snapshot
도 fixture regenerate 불필요 (text 본문 매칭 query 라 BM25 동일).

HOTFIXES: 2026-05-24 v0.17.0 entry 의 `heading_path_json` 노이즈
항목 closure 표기 + 새 2026-05-25 post-v0.17.1 dogfood entry 추가.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-25 05:40:51 +00:00
f2867540d2 Merge pull request 'feat(ocr): request_timeout_secs config knob' (#164) from feat/ocr-timeout-config into main 2026-05-25 05:14:27 +00:00
e118844256 chore(ocr): PR #164 회차 1 리뷰 반영
- HOTFIXES 헤더 `v0.17.2` (vaporware) → `post-v0.17.1 dogfood`
  로 변경, release tag 결정과 무관하게 정확한 anchor.
- HOTFIXES caller 수 `6 (5+3)` → `9 call site (6+3)` 으로 정정.
- OcrCfg.request_timeout_secs doc 의 edge case 가 LlmCfg sister
  doc 과 동일한 구체 예제 (`u64::MAX`, `86400`) + reqwest 0.12.x
  명시 주석으로 강화.
- LLM + OCR 양쪽의 legacy TOML fixture (78 줄 거의 동일) 를
  module-level `LEGACY_PRE_TIMEOUT_TOML` const 로 추출. 두 test
  가 동일 source 공유 → 옛 schema 가 또 변하면 한 곳만 수정.

reqwest::Duration::ZERO fact-check (회차 1 점 5) 는 회차 2
reply 에서 검증 결과 보고.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-25 05:13:09 +00:00
41c5edc517 feat(image.ocr): request_timeout_secs config knob + closure of v0.17.1 미진행
v0.17.1 (PR #162) 가 LLM 쪽 hard-coded 300s 를 [models.llm]
request_timeout_secs 로 풀어준 것과 같은 패턴을 OCR 어댑터에 적용.
사용자 결정으로 별 노브 분리 ([image.ocr] request_timeout_secs) —
OCR 는 LLM 대비 cold start 패턴이 달라 독립 조절이 편함.

- OcrCfg.request_timeout_secs: u64 (serde default 300)
- KEBAB_IMAGE_OCR_REQUEST_TIMEOUT_SECS env override
- OllamaVisionOcr::build / from_parts 시그니처에 timeout 인자 추가
- REQUEST_TIMEOUT 상수 제거
- 3 신규 unit test (default / env / legacy parse) — LlmCfg 패턴 그대로
- HOTFIXES 2026-05-25 v0.17.1 entry 의 두 미진행 항목 모두 closure
  (OCR timeout = 본 PR, --stream docs = PR #163 에서 이미 완료)

기존 config / 옛 KB 영향 없음 — 새 필드는 default 로 채워지고
동작도 동일 (300s). vision 모델 cold start 가 길면 env 또는
config 로 늘릴 수 있음.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-25 05:06:53 +00:00
d02149c010 docs(v0.17.1): HANDOFF + INDEX — v0.17.1 cut sync
- HANDOFF 한 줄 요약 v0.17.0 → v0.17.1 + release URL 추가
  (v0.17.1 cut: PR #162 + #163 한 묶음 안내).
- 머지 후 발견 deviation 절: 2026-05-25 v0.17.1 entry 추가.
- INDEX P10 Dogfooding Feedback section 하단에 'v0.17.1
  post-dogfood polish' subsection 추가 (PR #162, #163 각 한 줄).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-25 03:35:58 +00:00
0c69b9621b chore: bump version 0.17.0 → 0.17.1
v0.17.1 patch release — v0.17.0 post-dogfood follow-up 두 PR 머지 후.

- PR #162: [models.llm] request_timeout_secs config + 권장 모델 가이드
- PR #163: sudo 없이 ollama 설치 가이드 + kebab ask --stream UX 권장

둘 다 additive only (config field) + docs only — wire breaking 없음,
기존 사용자 영향 없음. patch bump.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-25 03:34:12 +00:00
0d69d85757 Merge pull request 'docs: sudo 없이 ollama 설치 + ask --stream 권장 (v0.17.0 post-dogfood)' (#163) from docs/ollama-install-and-stream into main
Reviewed-on: #163
2026-05-25 03:26:24 +00:00
a67300317b docs(ollama): sudo 없이 설치 가이드 + ask --stream 권장 (v0.17.0 post-dogfood)
확장 도그푸딩에서 사용된 두 패턴을 README + SMOKE 에 옮김.

(1) sudo / systemd 없이 격리 디렉토리에 ollama 설치 — tarball 받아
    /opt/ollama/{bin,models,logs} 같은 사용자 디렉토리에 풀고
    OLLAMA_MODELS env 로 모델 위치 분리. 컨테이너 / WSL2 / 회사
    머신 등 root 권한 제약 환경에 유용. 도그푸딩 머신에서
    /build/cache/ollama 로 같은 패턴 검증.

(2) cold start 가 긴 모델 (8B+ 또는 첫 호출) 은 `kebab ask --stream`
    권장 — 동일 inference 시간이라도 progressive 토큰이 5분 timeout
    한도 안에서 빠르게 surface 됨. p9-fb-33 의 streaming 경로를
    UX 개선 권고로 명시.

코드 변경 없음 — docs only. README + SMOKE 두 군데 동일 패턴
sub-bullet + bash snippet.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-25 03:23:35 +00:00
abb05ebc23 Merge pull request 'feat: [models.llm] request_timeout_secs config + 권장 모델 가이드' (#162) from feat/llm-timeout-config into main
Reviewed-on: #162
2026-05-25 03:21:19 +00:00
26fdc4f344 docs(llm-timeout): 0-as-disable 함정 명시 + HOTFIXES typo + 용어 정리
PR #162 워커 리뷰 반영.

- MEDIUM (W2) + LOW (W1): request_timeout_secs = 0 이 reqwest 의
  의미상 disable 이 아닌 instant timeout (모든 요청 즉시 실패).
  LlmCfg field rustdoc + ollama.rs module-level comment + README
  세 군데에 명시 + u64::MAX / 86400 같은 large finite 값 권장.
- NIT (W1): HOTFIXES 2026-05-25 entry 의 '답변이 인 5분' typo →
  '답변이 5분' (1자 삭제).
- NIT (W1): README + HOTFIXES 의 '확장 도그푸딩' 내부 jargon →
  '후속 도그푸딩' 으로 통일.

코드 동작 변경 없음 — doc only. cargo test request_timeout 3 PASS.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-25 03:14:41 +00:00
3f5e0e6e90 feat(llm): [models.llm] request_timeout_secs config + 권장 모델 가이드
v0.17.0 확장 도그푸딩 (2026-05-25) 에서 발견된 두 가지를
한 PR 에 묶음.

(1) llm.generate_stream 의 hard-coded 300s timeout 을 config 노브로
    빼냄. 8B+ 모델 (gemma4:e4b 등) 은 CPU only 환경에서 5분
    안에 첫 RAG 답변 못 마치고 `error: kb-rag: llm.generate_stream`
    으로 떨어지던 문제.

    - kebab-config::LlmCfg 에 request_timeout_secs: u64 additive
      필드 (#[serde(default = "default_llm_request_timeout_secs")]
      default 300). 옛 config 가 키 누락해도 그대로 파싱 + 동일
      동작.
    - env override KEBAB_MODELS_LLM_REQUEST_TIMEOUT_SECS.
    - kebab-llm-local::ollama.rs 의 REQUEST_TIMEOUT 상수 제거 →
      OllamaLanguageModel::new 가 Duration::from_secs(
      llm.request_timeout_secs) 로 reqwest client 빌드. doc
      comment 도 동일 갱신.
    - 신규 unit test 3 — default 300 핀 / env override / legacy
      config (필드 누락) backward-compat.

(2) docs — README 사전 요구 절 + docs/SMOKE.md ollama 안내에 한 단락:
    CPU only / RAM ≤ 16 GB 환경 ⇒ ≤ 4B Q4 모델 권장
    (gemma3:4b / qwen2.5:3b / phi3:mini). 8B+ 시도 시 timeout
    패턴 사전 안내. request_timeout_secs 노브 사용법.

    HOTFIXES 2026-05-25 entry — 위 두 변경 + 미진행 사항
    (kebab-parse-image OCR 의 같은 hard-coded 300s 는 scope 외
    follow-up 으로 등재 + ask --stream 권장 강조 후속) 기록.

workspace cargo test -j 1 + clippy 통과. 코드 변경은 backwards-compat
(additive serde field) 라 기존 사용자 영향 없음.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-25 03:01:03 +00:00
578a60e3bb docs(v0.17.0): HANDOFF — version + PR-A/B/C closure entries (R1)
- 한 줄 요약 v0.16.1 → v0.17.0 + release notes URL + PR-A/B/C
  한 줄 요약.
- 머지 후 발견 deviation 절: PR-A 외 PR-B / PR-C 의 2026-05-24
  closure entry 추가.
- '다음 task 후보' 의 P10 round 2 follow-up 라인: 세 항목 모두
  v0.17.0 closure 표시.
- 'P10 dogfooding 백로그' 의 chunk_breakdown + C typedef 두
  항목도  v0.17.0 closure.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-24 20:55:47 +00:00
64f518e08e docs(v0.17.0): HANDOFF + INDEX — v0.17.0 cut sync (R1)
- HANDOFF 한 줄 요약 v0.16.1 → v0.17.0, release notes URL,
  PR-A/B/C 셋 한 줄 요약. 머지 후 발견 deviation 절에 PR-B / PR-C
  closure entry 추가. "다음 task 후보" + "P10 백로그" 의 세 항목
   v0.17.0 closure 표시.
- INDEX 의 P10 섹션 하단에 신규 "P10 Dogfooding Feedback (v0.17.0)"
  subsection — PR-A/B/C 3 항목 listup (Gemini round 2 권장 형식).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-24 20:54:39 +00:00
fa9f91ead4 chore: bump version 0.16.1 → 0.17.0
v0.17.0 release cut — PR-A (한국어 trigram FTS tokenizer + lexical
builder + hint surface) + PR-B (C typedef alias unit + parser_version
cascade + orphan purge) + PR-C (code_lang_chunk_breakdown additive
wire field) 셋 머지 후.

Breaking changes:
- V007 migration (chunks_fts unicode61 → trigram) — chunks 원본 /
  embedding / vector 불변, FTS shadow 자동 backfill. 사용자는 다음
  open 시 V007 즉시 적용 (re-ingest 불필요). kebab.sqlite 파일 크기
  ~2-5배 또는 수백 MB 증가.
- 영어 lexical 검색이 substring 매칭으로 동작 변경 (token →
  tokenization/tokenizer 도 hit, recall ↑ / 단어 경계 ↓).
- C parser_version code-c-v1 → code-c-v2 (typedef alias 추출
  cascade). 같은 file 의 옛 doc/chunks/vector 는 same-workspace_path
  orphan purge 가 자동 정리.

Additive (backwards-compat):
- SearchResponse.hint additive field — 한국어 2자 query 등 trigram
  비호환 시 안내.
- schema.v1.stats.code_lang_chunk_breakdown additive field — chunk
  단위 언어별 분포.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-24 20:52:14 +00:00
9ee89c2a94 Merge pull request 'feat: v0.17.0 PR-C — code_lang_chunk_breakdown additive wire field' (#161) from feat/code-lang-chunk-breakdown into main
Reviewed-on: #161
2026-05-24 20:35:28 +00:00
13a3361ba2 docs(v0.17.0/PR-C): rustdoc — code_lang_breakdown / repo_breakdown 가
실제로 doc count 임을 명시 (PR #161 워커 리뷰 MEDIUM 반영)

JSON schema description 은 PR-C 본체에서 'code chunk count' →
'doc count' 로 정정했으나 Rust struct field 의 rustdoc 은 같은
오기재를 그대로 carry — Gemini round 2 가 JSON schema 만 봤고
rustdoc 은 miss. 워커 둘 다 동일 finding (MEDIUM).

implementation 변경 없음 — 의미가 doc count 였던 사실이 처음부터
일관. wording 만 맞춤.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-24 20:35:01 +00:00
0def913abd feat(v0.17.0/PR-C): code_lang_chunk_breakdown additive wire field
closure of HOTFIXES 2026-05-22 "code_lang_breakdown chunk granularity"
LOW. Chunk-level companion of the existing doc-count metric.

- crates/kebab-store-sqlite/src/store.rs: code_lang_chunk_breakdown()
  method. chunks INNER JOIN documents → COUNT(c.chunk_id) GROUP BY
  metadata_json.code_lang, NULL skipped. BTreeMap<String, u32>.
  + lib unit test code_lang_chunk_breakdown_counts_chunks_not_docs
  (1 rust doc + 3 chunks → rust=3 chunks vs rust=1 doc).
- crates/kebab-app/src/schema.rs: Stats.code_lang_chunk_breakdown
  additive field + collect_stats builder. tests_stats_ext 의
  stats_includes_code_lang_and_repo_breakdown_fields 가 신규 필드도
  검증.
- docs/wire-schema/v1/schema.schema.json: 신규 additive 필드
  명세 + 기존 code_lang_breakdown / repo_breakdown description
  정정 ("code chunk count" → "doc count", Gemini round 2 권고).
- tasks/HOTFIXES.md: 2026-05-24 PR-C closure entry.

wire additive, schema_version bump 불필요. v0.16.x 호출 호환.
cargo test --workspace --no-fail-fast -j 1 + clippy 통과.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-24 20:35:01 +00:00
ff9d5f5f86 Merge pull request 'feat: v0.17.0 PR-B — C typedef-wrapped struct/enum/union → typedef alias unit' (#160) from feat/c-typedef-struct-unit into main
Reviewed-on: #160
2026-05-24 20:33:15 +00:00
70a5068c0d docs(v0.17.0/PR-B/B2): HOTFIXES 2026-05-24 closure + p10-1d Risks 갱신
- tasks/HOTFIXES.md: 새 2026-05-24 PR-B closure entry — extractor 의
  type_definition 분기, PARSER_VERSION bump, same-workspace_path
  orphan purge, 사용자 영향, 잔여 nested typedef Risks.
- tasks/HOTFIXES.md: 기존 2026-05-21 typedef 항목의 Status / Next step
  을 v0.17.0 closure 표현으로 갱신 (관찰 기록은 frozen 유지).
- tasks/p10/p10-1d-c-cpp-ast-chunker.md: Risks 의 typedef idiom 라인
  을 closure  + 잔여 nested typedef 안내로 갱신.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-24 20:32:36 +00:00
93ddece111 feat(v0.17.0/PR-B/B1): C typedef extractor + parser_version bump + orphan purge cascade
closure of HOTFIXES 2026-05-21. C typedef-wrapped anonymous
struct/enum/union 이 typedef alias 이름으로 symbol unit 방출.

- crates/kebab-parse-code/src/c.rs: type_definition 분기 추가.
  inner anonymous struct_specifier / enum_specifier / union_specifier
  탐지 → declarator field 의 type_identifier 재귀 추출 → synthetic
  unit (typedef alias). named inner aggregate / plain alias 는
  기존대로 glue. PARSER_VERSION code-c-v1 → code-c-v2.
  recover_typedef_alias + extract_typedef_alias_name helper 추가.

- crates/kebab-store-sqlite/src/store.rs: 두 helper 신규
  (parser_version bump cascade 용 doc-id 기반 orphan purge).
  - stale_chunk_ids_for_workspace_path_except_doc_id(workspace_path,
    keep_doc_id) — sister of stale_chunk_ids_at, doc_id 기반.
  - purge_document_at_workspace_path_except_doc_id(workspace_path,
    keep_doc_id) — CASCADE document/chunks 제거, assets 보존.
  keep_doc_id="" 가 "모든 doc 제거" 사용.

- crates/kebab-app/src/lib.rs: try_skip_unchanged 의 parser_mismatch
  분기에서 purge_workspace_path_for_parser_bump 호출. helper 가
  app.vector() 로 lazy 접근 + delete_by_chunk_ids + SQLite document
  row 제거. Ok(None) 반환 전 cleanup 끝나서 caller 의 새 INSERT 시
  idx_docs_workspace_path UNIQUE 충돌 회피.

- tests:
  - c.rs unit tests 4 신규 — typedef_struct_emits_unit /
    typedef_enum_emits_unit / typedef_union_emits_unit /
    typedef_to_existing_type_stays_glue (negative).
  - tier1_c_ingest_searchable: parser_version assertion code-c-v1 →
    code-c-v2.
- 회귀: bytes-edit 경로 (asset_id 변경) 의 기존 purge_orphan_at_workspace_path
  + purge_vector_orphans_for_workspace_path 는 그대로 — 신규 분기와
  공존, 기존 test 모두 PASS.

미해결 (Risks): nested typedef (typedef struct { struct {...} inner; }
Outer;) 의 inner 익명 struct 는 여전히 glue — v2 의 1차 범위는
top-level typedef alias 만.

cargo test --workspace --no-fail-fast -j 1 + clippy 통과.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-24 20:30:57 +00:00
67559fb3ce Merge pull request 'feat: v0.17.0 한국어 trigram FTS tokenizer + lexical builder + hint surface' (#159) from feat/korean-trigram-tokenizer into main
Reviewed-on: #159
2026-05-24 20:29:00 +00:00
d79e432916 test(v0.17.0/A5): CLI hint surface e2e coverage (worker-1 nit)
PR #159 worker-1 review 의 LOW 가독성 nit 반영 — CLI stderr [hint]
line + --json hint shape 통합 test 가 없었음.

- search_plain_emits_short_query_hint_to_stderr — 빈 KB + 2자 query
  → stderr 가 "[hint]" + "3자 이상" 포함 확인.
- search_json_emits_hint_field_for_short_query — 동일 입력 --json
  → search_response.v1.hint 필드 set + 표준 advisory 문자열 정합.
- search_json_omits_hint_field_when_query_is_long_enough — 3자
  query → hint 필드 absent (additive serializer 의 None 제외 동작).

wire_search_response 5 → 8 PASS.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-24 12:45:11 +00:00
0ee18149e7 test(v0.17.0/A5 follow-up): trigram tokenizer downstream test fixes
trigram tokenizer 가 snippet 단위 + 단어 경계 + BM25 raw score 분포를
모두 바꿔서 unicode61 assumption 기반의 3 test 가 regression.

- wire_search_response::search_json_truncates_with_max_tokens +
  search_plain_emits_truncated_hint_to_stderr: 단일 doc + 작은
  max_tokens 로는 snippet 이 짧아서 budget loop 가 trip 안 함.
  다중 doc fixture (5 doc) + budget 30 token 으로 hit-pop 경로
  통해 truncated=true 보장.
- fetch_integration::fetch_chunk_with_context_returns_neighbors:
  fixture body 의 2-char tokens (A1/A3 등) 가 trigram 비호환으로
  0-hit. apples/banana/cherry/durian/elder 5-char unique words
  로 갱신, query 도 cherry 로 deterministic pin.
- eval/runner::runner_per_query_snapshot_matches_fixture: trigram
  token stream 으로 BM25 raw score 변동. UPDATE_SNAPSHOTS=1 로
  regenerate.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-24 12:21:34 +00:00
8a68289499 docs(v0.17.0/A6): HANDOFF + HOTFIXES + README + SMOKE + SKILL — 한국어 trigram closure
- HOTFIXES: 새 2026-05-24 절 — v0.17.0 closure 영향 (한국어
  lexical 3-gram, 영어 substring 변경, BM25 분포, 디스크 용량,
  heading_path JSON 노이즈 관찰). 기존 2026-05-22 한국어 lexical
  항목의 Status / Next step 을 closure 표현으로 갱신.
- HANDOFF: 머지 후 발견 deviation 절에 2026-05-24 entry +
  기존 2026-05-22 항목을 closure cross-link 로 정리. P10
  백로그 한국어 tokenizer 항목  v0.17.0 + "다음 task 후보"
  follow-up 라인의 상태 갱신.
- README: 검색 명령 행에 trigram 동작 + hint + 디스크 용량 한 줄.
- SMOKE: 새 "한국어 trigram 검색 (v0.17.0)" 절 — 도그푸딩 query
  시퀀스 (충돌은 raw / 해시 충돌 multi-token / Rust 충돌은
  mixed / 충돌 2자 + stderr / --json hint 검증) + 영어 substring
  동작 변경 안내.
- SKILL.md: search 절에 hint 필드 안내 한 줄 — agent 가
  short query 케이스에서 같은 query 재시도 대신 사용자에게
  surface 하도록.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-24 11:54:44 +00:00
6ac7fea7b9 feat(v0.17.0/A5): trigram-aware build_match_string + SearchResponse.hint
PR-A 본체. plan Task A4 Step 1c + A5.

- lexical.rs::build_match_string 재설계: whole-phrase + token-AND
  OR-combined, 3자 미만 토큰 drop, 후보 없음 시 None (빈 MATCH
  회피). raw single-quote mode 유지.
- SearchResponse.hint additive — empty result + trimmed < 3 chars
  + non-raw 케이스에 short_query_hint helper 가 set.
- CLI 'kebab search' 가 [hint] stderr 한 줄 (text mode).
- TUI SearchState.short_query_hint + poll_worker stale-aware set
  + fire_search/mark_input_changed reset + dynamic_status 표시.
- docs/wire-schema/v1/search_response.schema.json hint additive.
- 신규 unit tests (lexical 9 PASS, 기존 2 expectation 갱신) +
  통합 회귀 (search_korean: multi_token + mixed, 3 PASS) +
  BM25 snapshot regen (trigram token stream).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-24 11:54:25 +00:00
fe123c0c6d test(A4): korean + english trigram matching at FTS level
3개 신규 unit tests in tests/fts.rs §7:

1. fts_trigram_korean_3char_substring_hits — Codex sqlite 3.45.1 검증
   동작 5개 assert pin: raw 3자 substring hit (충돌은/발생한),
   quoted phrase hit (\"해시 충돌\"/\"시 충\"), raw 해시충 0-hit (원문
   미존재).
2. fts_trigram_korean_short_query_zero_hit_pinned — 2자 한국어 query
   (충돌·키) 0-hit 회귀 감지. trigram 구조 변경 시 먼저 fail.
3. fts_trigram_english_substring_hits — substring recall 동작 변경
   pin (token→tokenizer, to 0-hit).

검증: cargo test -p kebab-store-sqlite --test fts → 13/13 PASS
(신규 3 + 기존 10).

Step 1c (multi-token 한국어 query e.g. \"해시 충돌\") 와 Step 5
(lexical BM25 snapshot 갱신) 는 Task A5 의 build_match_string()
재설계 후 진행.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-23 00:57:37 +00:00
753b1ff5e5 task(A4-step0): synthetic korean fixture for trigram tests
도그푸딩 실 한국어 위키 문서 (hash-table.md, 4512줄 mediawiki HTML,
CC-BY-SA) 는 크기·라이선스 부담으로 직접 commit 회피. 대신 도그푸딩
query 들 (해시 충돌·충돌은·시 충·해시충·충돌) 을 모두 cover 하는 합성
fixture 작성. trigram tokenizer 의 정확한 매칭 동작 (3자 substring
hit, 2자 0-hit, raw vs quoted phrase) 검증용.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-23 00:54:30 +00:00
8dcedc4b11 feat(p10-r2): V007 trigram migration + design §5.5 + fts diff-check
Task A2 + A3 한 묶음.

migrations/V007__fts_trigram.sql 신규:
- chunks_fts shadow 를 DROP + 재생성 (tokenize = trigram).
- chunks_ai/ad/au trigger 재생성 (V002 와 동일).
- chunks 에서 backfill INSERT — 사용자 re-ingest 불필요, V007 자동.
- V002 는 historical cold-upgrade replay 위해 그대로 유지.

design §5.5 갱신:
- verbatim block 의 tokenize 만 trigram 으로 교체.
- §5.5 본문 상단에 한국어 채택 사유 + trade-off (영어 lexical 변경,
  BM25 분포, 디스크 ~2-10x, contentless 아님) prose 한 단락 추가.

crates/kebab-store-sqlite/tests/fts.rs:
- fts_v002_matches_design_section_5_5_verbatim →
  fts_v007_matches_design_section_5_5_verbatim 으로 rename.
- extract_migration_5_5_verbatim_block() 의 include_str! path 를
  V007__fts_trigram.sql 로 변경. 주석/assertion msg V007 로.
- V002 cold-upgrade test 들 (fts_v002_backfill_*) 은 그대로 유지.

검증: cargo test -p kebab-store-sqlite --test fts → 10/10 PASS
(`fts_v007_matches_design_section_5_5_verbatim` 포함).

Codex round 1/2 의 design §5.5 contentless 정정·trigram tokenizer
채택 사유 명시 발견 반영.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-23 00:52:40 +00:00
8781c6112b task(A1): builder baseline + sqlite version + snapshot locations
Task A1 step 1-3 완료. plan A5 의 baseline 노트 슬롯 채움.

핵심 발견:
- build_match_string() (lexical.rs:177-200): trim → strip_single_quotes
  raw FTS verbatim / 그 외 whitespace split + escape_fts5_token (\"...\"
  + inner doubling) + space join (implicit AND).
- raw mode = single quote '...' 가 trimmed 전체 감쌈 (lexical.rs:167).
- SQLite: rusqlite 0.32 + libsqlite3-sys 0.30.1 bundled (in-tree, SQLite
  ~3.46.x) → trigram 사용 가능.
- Snapshot: tests/lexical.rs::lexical_snapshot_run_1 + tests/hybrid.rs::
  hybrid_snapshot_run_1 (KEBAB_UPDATE_SNAPSHOTS=1 로 regenerate).
  inline normalize_bm25_top_score 는 numerical 무관.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-23 00:47:24 +00:00
14197b5e02 docs(p10-round-2): HANDOFF + HOTFIXES sync for v0.17.0 follow-up
P10 도그푸딩 round 2 의 follow-up 후보를 HANDOFF "다음 task" /
"P10 백로그" 절에 반영. HOTFIXES 의 round 2 항목 (한국어 lexical
한계 + code_lang_breakdown + ranking deferred) 정합.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-23 00:43:31 +00:00
584247f1ea spec+plan(v0.17.0): korean trigram tokenizer + dogfood fixes
P10 도그푸딩 round 2 (2026-05-22) follow-up. SQLite FTS5 tokenizer
unicode61 → trigram 으로 교체해 한국어 lexical 검색 지원 + 작은
버그픽스 2 (C typedef-wrapped struct 미노출, code_lang_breakdown
집계 단위).

Codex + Gemini round 1/2/3 리뷰 반영:
- [r1] 2자 한국어 query 0-hit, build_match_string() multi-token 깨짐,
  contentless → shadow, parser_version cascade, BM25/heading_path/디스크
- [r2] same-workspace_path orphan purge (parser bump cascade 실제 동작),
  trigram 테스트 예시 sqlite 3.45.1 검증, builder 권장안 (whole phrase OR)
- [r3] SMOKE 시나리오 정정, TUI stale hint 방지, search_response.v1 hint
  필드, new purge helpers, single quote raw mode 통일, fixture 도입

PR 구성: PR-A (trigram + builder + 안내), PR-B (C typedef + orphan
purge), PR-C (stats + wire). 셋 머지 후 v0.17.0 release cut.

design: docs/superpowers/specs/2026-05-22-korean-trigram-tokenizer-design.md
plan:   docs/superpowers/plans/2026-05-22-korean-trigram-tokenizer.md

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-23 00:43:31 +00:00
a0c0dca321 fix(dogfood): k8s multi-resource YAML chunk_id collision (#158) 2026-05-21 23:57:49 +00:00
667495ae6a docs(dogfood): HOTFIXES entry for k8s multi-resource chunk_id collision
PR #158 code-reviewer recommendation. Records the dogfood-discovered
k8s multi-resource chunk_id collision + the deliberate decision NOT to
bump chunker_version (dogfood-only stage, single-resource k8s chunk_id
shift is benign churn). Cross-link added to p10-2 spec Risks/notes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-21 23:57:34 +00:00
08d72a12e0 chore: bump version 0.16.0 → 0.16.1 (k8s multi-resource chunk_id fix)
Patch bump — bug fix only (P10 dogfood-discovered k8s multi-resource
chunk_id collision). New binary needed to resume dogfooding. No wire
schema change, no DB migration.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-21 23:54:33 +00:00
1969c8e3b5 fix(dogfood): k8s multi-resource YAML chunk_id collision
P10 dogfooding found that a k8s manifest with 2+ documents (e.g.
Deployment + Service in one file) fails to ingest:
  UNIQUE constraint failed: chunks.chunk_id

Root cause: tier2_shared::push_chunks_with_oversize's non-oversize branch
hardcoded split_key = None. K8sManifestResourceV1Chunker calls it once per
resource; with split_key None every resource from the same document gets
the same id_hash (= base_policy_hash) → identical chunk_id. p10-3's
code_text_paragraph_v1 had the same bug (fixed in df3c5b8) but it calls
build_chunk_no_symbol directly — the push_chunks_with_oversize path was
never fixed.

Fix: push_chunks_with_oversize gains a base_split_key parameter for the
non-oversize single-chunk case. k8s chunker passes Some(resource.line_start)
so each resource gets a distinct chunk_id; dockerfile / manifest pass None
(1 chunk per file — no sibling collision, chunk_id stays stable).

Regression coverage: k8s_multi_doc_emits_one_chunk_per_resource now asserts
chunk_id distinctness; new integration test
tier2_k8s_multi_resource_yaml_ingests_without_collision ingests a real
2-document YAML end-to-end.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-21 23:49:37 +00:00
c6207d196e chore(p10-1d-followup): reviewer nit cleanup — C extractor tests + HOTFIXES + cpp snapshot (#157) 2026-05-21 22:47:38 +00:00
840c6c40a6 test(p10-1d-followup): cpp snapshot exercises actual CppAstExtractor
Reviewer nit #3: the hand-built fixed_doc() only verified chunker 1:1
mapping. New tests invoke CppAstExtractor against tests/fixtures/sample.cpp
and snapshot the real extractor → chunker pipeline (14 blocks emitted
covering namespace::chunk::Class, ctor/dtor/operator/template/free-fn
convention, glue <top-level> blocks between units).

Adds kebab-parse-code as a dev-dep of kebab-chunk (same precedent as
kebab-parse-md). Both the existing hand-built test AND the new
extractor-driven tests are kept — the former for fast chunker-only
validation, the latter for end-to-end regression detection.

Added tests:
- code_cpp_ast_extractor_snapshot: asserts all 8 named symbol units are present
- code_cpp_ast_extractor_chunks_deterministic: chunker output is stable

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-21 22:43:57 +00:00
b81574afa9 docs(p10-1d-followup): HOTFIXES entry — typedef-wrapped struct/enum in C falls into glue
PR #156 reviewer nit #2. Documents the tension between spec body
("struct_specifier (named, top-level) → 1 unit") and the actual behavior
for the C idiom `typedef struct { ... } Foo;` — the inner struct_specifier
is anonymous, so the extractor falls into glue. Workaround: dogfood-driven
revisit if frequent pain point emerges.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-21 22:40:04 +00:00
6beff35a2f test(p10-1d-followup): add in-file unit tests to C AST extractor
Mirrors the cpp.rs 15-test pattern. Covers function_definition (incl.
pointer-return, static/extern/inline), struct_specifier / enum_specifier /
union_specifier (named), anonymous struct/enum/union → glue, typedef-wrapped
struct → glue (per spec risks note), preprocessor directives → glue, empty
file → <module> post-pass, preprocessor-only → <module>, mixed fn + glue →
<top-level> present, determinism (20 runs). 17 tests total.

Reviewer nit #1 (PR #156 code-reviewer).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-21 22:39:36 +00:00
75a4207aa1 feat(p10-1d): C + C++ AST chunkers — P10 Tier 1 chunker family complete (#156) 2026-05-21 15:48:34 +00:00
86aa180ad7 chore: bump version 0.15.0 → 0.16.0 (p10-1d C + C++ AST chunkers)
Minor bump — additive new chunker_versions code-c-ast-v1 + code-cpp-ast-v1
+ new routing langs c / cpp + new tree-sitter-c / tree-sitter-cpp workspace
deps. P10 Tier 1 chunker family complete. No DB migration, no wire schema
major bump.

Also lands the missing p10-3 try_skip_unchanged fallback-aware fix (Option
B1 — 7th param) that PR #155 was supposed to ship but never made it to main
(implementer reported commit SHA 2a39513 that didn't exist in the merged
branch). Same commit extends tier3_fallback_cv to include c/cpp.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-21 15:38:00 +00:00
802c573c07 docs(p10-1d): README/HANDOFF/ARCHITECTURE/SMOKE/INDEX sync
P10 Tier 1 chunker family complete (Rust + Python + TS + JS + Go + Java +
Kotlin + C + C++).

- README adds C/C++ to the ingest row + --code-lang c/cpp + Mermaid brace.
- HANDOFF flips p10-1D to  (v0.16.0), updates 한 줄 요약 + 다음 후보.
- ARCHITECTURE adds C/C++ to the code-parser row, extends flowchart pcode
  node, adds chunker tree entries.
- SMOKE adds P10-1D walkthrough section + verification checklist entry.
- tasks/INDEX + tasks/p10/INDEX flip p10-1D to .

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-21 14:35:59 +00:00
438870ee25 docs(p10-1d): activate C + C++ in frozen design §10
P10 Tier 1 chunker family complete (Rust + Python + TS + JS + Go + Java +
Kotlin + C + C++).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-21 14:32:26 +00:00
192835e5bf test(p10-1d): integration smoke tests for C + C++
Verifies end-to-end ingest + search + Citation::Code shape:
- tier1_c_ingest_searchable: .c file → --code-lang c search → symbol
  = function name (no nesting), lang = "c", chunker_version = "code-c-ast-v1".
- tier1_cpp_ingest_searchable: .cpp file → --code-lang cpp search →
  symbol starts with namespace::Class prefix, lang = "cpp",
  chunker_version = "code-cpp-ast-v1".

Brings code_ingest_smoke to 18 tests (Tier 1: 9 → 11, Tier 2: 3,
Tier 3: 4).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-21 14:31:35 +00:00
1034de25a2 fix(p10-3+p10-1d): land the missing try_skip_unchanged fallback-aware fix
PR #155 (p10-3) merged WITHOUT the reviewer's required Option B1 fix —
the implementer reported a commit SHA (2a39513) that never made it to main.
Result: every reingest of a Tier 3-fallback file (non-k8s YAML, invalid
YAML, AST extractor failure) re-runs full extract + chunk + embed because
the parser/chunker version comparison can never match (stored is
code-text-paragraph-v1 / none-v1, but caller uses Tier 1/2 dispatch
values).

This commit:
1. Adds the 7th param `fallback_chunker_version: Option<&ChunkerVersion>`
   to try_skip_unchanged + the stored_is_tier3_fallback detection branch
   (skip parser/chunker equality, keep embedder check).
2. Threads `None` through non-code call sites (md / image / pdf).
3. Code call site computes tier3_fallback_cv covering all Tier 1/2 langs
   that can fall back: rust / python / ts / js / go / java / kotlin /
   yaml / dockerfile / toml / json / xml / groovy / go-mod / c / cpp
   (p10-1D additions).
4. Adds tier3_yaml_fallback_reingest_is_unchanged + tier3_shell_reingest_is_unchanged
   regression tests (the originally-promised PR #155 regression coverage
   that also never made it to main).

Smoke tests: 14 + 2 = 16 PASS.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-21 14:19:17 +00:00
d1560be80d feat(p10-1d): activate C + C++ in ingest_one_code_asset dispatch
Extends 4-arm match (parser_version / chunker_version / extract / chunks)
+ allowlist + tier3_fallback_cv with "c" + "cpp" arms. C uses CAstExtractor
+ CodeCAstV1Chunker; C++ uses CppAstExtractor + CodeCppAstV1Chunker. Both
langs are Tier 3-fallback-eligible (e.g. .h file with C++ syntax may fail
tree-sitter-c parse → Tier 3 paragraph fallback per p10-3 wrapper).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-21 13:56:45 +00:00
b2a2902e38 feat(p10-1d): code-cpp-ast-v1 chunker + snapshot test
Identical chunker body to code-c-ast-v1 (per-language work happens in the
CppAstExtractor, Task C). Snapshot fixture covers nested namespace + class
+ ctor/dtor + method + operator overload + template fn + free fn + top-level
main, verifying namespace::Class::method symbol convention per design §3.4.

5 chunks emitted:
- <top-level> (includes, namespace opening)
- kebab::chunk::MdHeadingV1Chunker (class unit)
- kebab::identity (template function)
- kebab::global_helper (free function in namespace)
- main (top-level main function)

Template function symbols emit without <T> parameters per spec convention.
Namespace::Class::method pattern verified. All tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-21 13:46:12 +00:00
03cd41c48f feat(p10-1d): code-c-ast-v1 chunker + snapshot test
Mirrors code-go-ast-v1's chunker pattern. Snapshot test against
tests/fixtures/sample.c (function + typedef struct + typedef enum +
preprocessor) verifies symbol list + lang=c stamping.

Chunks produced (4 total):
- <top-level> glue: includes, defines, static vars, typedefs (lines 1-18)
- parse_record function (lines 20-23)
- print_record function (lines 25-27)
- main function (lines 29-33)

All chunks stamped with lang=c and chunker_version=code-c-ast-v1.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-21 13:41:19 +00:00
926042049c feat(p10-1d): C++ AST extractor (tree-sitter-cpp)
Symbol = namespace::Class::method via recursive build_blocks. namespace_definition
pushes namespace name (anonymous → <anonymous>). nested_namespace_specifier
(outer::inner) flattens all segments and pushes them. class_specifier / struct_specifier
(named) emit class unit + recurse with class name pushed. function_definition emits
method unit; symbol resolution unpacks declarator chain (pointer_declarator /
reference_declarator → function_declarator → identifier / field_identifier /
qualified_identifier / operator_name / destructor_name).

operator_cast (conversion operators, e.g. operator bool) handled as a direct
declarator kind on function_definition. template_declaration recurses with same
prefix (template params NOT in symbol). enum_specifier + concept_definition emit
type-level units. linkage_specification (extern "C") recurses into body with same
prefix. Other top-level nodes → <top-level> glue.

All 15 unit tests pass; build and clippy clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-21 13:37:58 +00:00
e0a29225da feat(p10-1d): C AST extractor (tree-sitter-c)
Top-level units: function_definition (symbol = fn name from declarator's
innermost identifier), struct_specifier, enum_specifier, union_specifier
(each emits 1 unit with the named identifier as symbol). Preprocessor
directives + top-level declarations group into a <top-level> glue chunk.
Empty file or zero units → <module> post-pass.

C symbol = function name only — no namespace, no class nesting (design §3.4).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-21 13:29:36 +00:00
b541567946 build(p10-1d): add tree-sitter-c + tree-sitter-cpp workspace deps
Standard crate names resolved cleanly: tree-sitter-c v0.24.2 and
tree-sitter-cpp v0.23.4 are both compatible with workspace tree-sitter 0.26.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-21 13:19:00 +00:00
a58d400abd docs(p10-1d): implementation plan (11 tasks A-K, subagent-driven)
Tasks: workspace deps / C extractor / C++ extractor / C chunker + snapshot /
C++ chunker + snapshot / ingest dispatch + tier3_fallback_cv extension /
2 smoke tests / frozen design §10 / docs sync / workspace test gate /
version bump 0.15.0 → 0.16.0 + gitea PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-21 13:15:22 +00:00
8add684ffc docs(p10-1d): task spec for C + C++ AST chunkers
Frozen contract: single PR with code-c-ast-v1 + code-cpp-ast-v1. C symbol
= function name only (no nesting). C++ symbol = namespace::Class::method
(recursion). .h → C (design §3.5); C++ headers' parse failure picked up
by p10-3 Tier 3 fallback. tree-sitter-c + tree-sitter-cpp workspace deps,
version bump 0.15.0 → 0.16.0.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-21 13:12:11 +00:00
7a90df1485 feat(p10-3): Tier 3 paragraph + line-window fallback chunker — shell direct + Tier 1/2 0-chunk/Err 자동 picked up (#155) 2026-05-21 12:27:18 +00:00
46f408dc0f chore: bump version 0.14.0 → 0.15.0 (p10-3 Tier 3 paragraph fallback)
Minor bump — additive new chunker_version "code-text-paragraph-v1" + new
routing lang "shell" + new Tier 1/2 → Tier 3 fallback wrapper behavior.
No DB migration, no wire schema major bump (Citation::Code.lang values
remain a free string field).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-21 12:05:53 +00:00
49e60fb314 docs(p10-3): README/HANDOFF/ARCHITECTURE/SMOKE/INDEX sync
- README adds Tier 3 to the ingest row (shell + fallback) and the Mermaid
  chunker enumeration; --code-lang shell admitted.
- HANDOFF flips p10-3 to  (v0.15.0) and updates the 한 줄 요약 + next
  candidates.
- ARCHITECTURE adds Tier 3 to the code-parser row, extends the flowchart
  pcode node, and lists code_text_paragraph_v1.rs in the chunker tree.
- SMOKE adds a P10-3 walkthrough (shell + non-k8s YAML fallback) and a
  verification checklist entry.
- tasks/INDEX + tasks/p10/INDEX flip p10-3 to .

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-21 11:43:38 +00:00
6bc7a83d3c docs(p10-3): activate Tier 3 in frozen design §10.1
Add p10-3 activation log entry for Tier 3 paragraph fallback chunker
(code-text-paragraph-v1) with shell direct routing and fallback wrapper
for invalid YAML / AST failures.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-21 11:39:49 +00:00
df3c5b8caf test(p10-3): integration smoke tests for Tier 3 (shell + yaml fallback)
Two new tests verify end-to-end Tier 3 wiring:
- tier3_shell_ingest_searchable: .sh file → --code-lang shell search →
  Citation::Code { symbol: None, lang: "shell" }, chunker_version
  "code-text-paragraph-v1".
- tier3_yaml_fallback_picks_up_non_k8s_yaml: docker-compose-shaped yaml
  (no apiVersion/kind) triggers k8s chunker's Ok(vec![]) result, fallback
  retries with Tier 3 → Citation::Code { symbol: None, lang: "yaml" } and
  chunker_version "code-text-paragraph-v1".

Also fixes a bug in CodeTextParagraphV1Chunker (Task B): short paragraphs
(≤80 lines) were emitted with split_key=None, causing all paragraphs from the
same document to share the same chunk_id (UNIQUE constraint violation at
put_chunks). Fix: always use para.line_start as split_key so every paragraph
gets a distinct id regardless of size.

Brings code_ingest_smoke to 14 tests (Tier 1: 9, Tier 2: 3, Tier 3: 2).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-21 11:37:44 +00:00
5051ea7534 feat(p10-3): Tier 1/2 → Tier 3 fallback wrapper in ingest_one_code_asset
After the chunks match resolves, an Ok(empty) result (Tier 2 invalid YAML
/ non-k8s YAML / similar) or Err (Tier 1 extractor / chunker failure) is
retried against CodeTextParagraphV1Chunker. On retry, chunker_version is
swapped to "code-text-paragraph-v1" and canonical.parser_version to
"none-v1" so downstream stamping + try_skip_unchanged remain consistent.

Extract failure is handled similarly — when a Tier 1 extractor errors
(e.g. tree-sitter parse failure), a synthesize_tier2_document-shaped
fallback doc is built from raw bytes and routed through Tier 3 chunker
directly (extract_fell_back guard).

shell direct path + Tier 2 extract synthesize_tier2_document failures
are exempted from the fallback chain (they ARE Tier 3 already, or the
error is real).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-21 11:32:49 +00:00
88d7fbc182 feat(p10-3): activate shell direct routing through Tier 3 chunker
Extends ingest_one_code_asset's allowlist + 4-arm match (parser_version /
chunker_version / extract / chunks) to admit code_lang "shell" and route it
to CodeTextParagraphV1Chunker. parser_version "none-v1" + synthesize_tier2_document
reused.

Tier 1/2 fallback wrapper lands in the next commit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-21 11:28:41 +00:00
0b7d8af759 feat(p10-3): code-text-paragraph-v1 chunker — paragraph + line-window fallback
Blank-line paragraph segmentation (whitespace-only lines as boundaries,
blank lines themselves never in any chunk's range). Paragraphs > 80 lines
split into 80-line windows with 20-line overlap (stride 60), sharing the
input lang and symbol=None per spec §9.3. tier2_shared exposes a new
build_chunk_no_symbol helper so Chunk id/hash/token semantics stay
identical with Tier 1/2. Extracts build_chunk_from_span as private core
so build_chunk and build_chunk_no_symbol share mechanics without drift.

4 unit tests cover multi-paragraph shell (4 paragraphs, blank-line
boundaries verified), 200-line oversize line-window split (chunks
1-80 / 61-140 / 121-200), empty file, and lang preservation when
input is yaml.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-21 11:22:48 +00:00
9342b9543f refactor(p10-3): expose tier2_shared::build_chunk as pub(crate)
Tier 3 chunker (next task) needs to call the same Chunk-construction helper
to keep id / hash / token-count / policy_hash semantics identical with
Tier 2. Visibility-only change; signature and body unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-21 11:17:51 +00:00
a8aa03042f docs(p10-3): implementation plan (9 tasks A-I, subagent-driven)
Tasks: tier2_shared visibility upgrade / Tier 3 chunker + 4 unit tests /
shell direct routing / Tier 1/2 fallback wrapper / 2 smoke tests / frozen
design §10.1+§10 / docs sync (6 files) / workspace test gate / version
bump 0.14.0→0.15.0 + gitea PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-21 11:16:55 +00:00
9d4a60aac5 docs(p10-3): task spec for Tier 3 paragraph + line-window fallback chunker
Frozen contract for p10-3 single PR: code-text-paragraph-v1 chunker
(blank-line paragraph split + 80-line/20-overlap line-window for oversize),
shell direct routing, Tier 1/2 fallback wrapper (0-chunk or Err → Tier 3
retry with chunker_version + parser_version swap), tier2_shared::build_chunk
pub(crate) exposure, frozen design §10.1 + §10 deltas, version bump
0.14.0 → 0.15.0.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 14:55:16 +00:00
8ce7a911ee chore(p10-2-followup): reviewer nit cleanup — Mermaid + 주석 + oversize test (#154) 2026-05-20 14:44:39 +00:00
75c1c7b911 test(p10-2-followup): cover tier2_shared oversize fallback with >200-line k8s ConfigMap
Spec p10-2 risks section calls out "거대 ConfigMap" but no test exercised
the line-window split branch of tier2_shared::push_chunks_with_oversize.
This adds a 256-line ConfigMap fixture (generated inline) and asserts:
- ≥2 chunks emitted (split happened),
- all chunks share symbol `ConfigMap/prod/big`,
- chunk_ids all distinct (id_for_chunk's #L{k} suffix disambiguation),
- line ranges form a contiguous partition (prev.line_end + 1 == next.line_start).

Reviewer nit #1 (PR #153 code-reviewer).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 14:41:16 +00:00
b5c12ecb6f docs(p10-2-followup): clarify synthesize_tier2_document path resolution comment
Earlier comment claimed the function "mirrors RustAstExtractor pattern" but
the two differ: RustAstExtractor joins ctx.workspace_root to handle relative
paths, while Tier 2 trusts FsSourceConnector's absolute-path invariant.
Rephrase to document the actual rationale + the Kb URI fallback.

Reviewer nit #3 (PR #153 code-reviewer).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 14:39:02 +00:00
a1192ce3b2 docs(p10-2-followup): README Mermaid chunker_version list — Java/Kotlin + Tier 2
p10-1C-JK 이후 누락된 code-java-ast-v1 / code-kotlin-ast-v1 + p10-2 의
k8s-manifest-resource-v1 / dockerfile-file-v1 / manifest-file-v1 추가.
표기 단순화를 위해 code-* 는 brace 묶음.

Reviewer nit #2 (PR #153 code-reviewer).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 14:35:20 +00:00
17ee400fd5 feat(p10-2): Tier 2 resource-aware chunkers (k8s + Dockerfile + manifest) — 코드 색인 외 리소스 파일 활성화 (#153) 2026-05-20 14:22:55 +00:00
217dddb4ba chore: bump version 0.13.0 → 0.14.0 (p10-2 Tier 2 resource-aware)
Minor bump — additive code_lang values (xml / groovy / go-mod) + 3 new
chunker_version labels (k8s-manifest-resource-v1 / dockerfile-file-v1 /
manifest-file-v1) + frozen design §3.5 / §10.1 deltas. No DB migration,
no wire schema major bump.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 14:14:38 +00:00
308666dbd5 docs(p10-2): README/HANDOFF/ARCHITECTURE/SMOKE/INDEX sync + tasks/p10/INDEX
User-visible surface sync per the docs-split rule:
- README adds Tier 2 langs (yaml / dockerfile / toml / json / xml / groovy / go-mod) to the ingest支援 list and --code-lang options.
- HANDOFF flips p10-2 phase row to  (v0.14.0) and updates the next-task candidates.
- ARCHITECTURE extends crates/kebab-chunk/src/ tree with k8s_manifest_resource_v1.rs / dockerfile_file_v1.rs / manifest_file_v1.rs / tier2_shared.rs, plus a Tier 2 note on the code-parser row and flowchart node.
- SMOKE adds a Tier 2 smoke walkthrough (k8s yaml + Dockerfile + Cargo.toml ingest + --code-lang search) and a P10-2 entry in the verification checklist.
- tasks/INDEX + tasks/p10/INDEX flip p10-2 to  (v0.14.0).

Workspace test gate (-j 1) + clippy --workspace pass cleanly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 14:10:13 +00:00
522ae7b8bc docs(p10-2): activate Tier 2 in code-ingest design §10.1 + §3.5 mappings
§3.5: add code_lang_for_path mappings xml / groovy / go-mod.
§10.1: add deactivation log entry for p10-2 (3 Tier 2 chunkers active).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 13:24:16 +00:00
166e1ddfaf test(p10-2): integration smoke tests for Tier 2 (k8s yaml + Dockerfile + Cargo.toml)
Three new tests in code_ingest_smoke.rs verifying isolated-TempDir ingest +
--code-lang filter + Citation::Code.lang / .symbol shape for each Tier 2
chunker. Brings the suite to 12 tests (Rust 3 + Python 1 + TS 1 + JS 1 +
Go 1 + Java 1 + Kotlin 1 + yaml 1 + dockerfile 1 + manifest 1).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 13:23:01 +00:00
226ce8b744 feat(p10-2): activate Tier 2 chunkers in ingest_one_code_asset dispatch
Adds yaml / dockerfile / toml / json / xml / groovy / go-mod arms to the
existing 7-arm AST match. parser_version unified to "none-v1" for Tier 2.
synthesize_tier2_document builds a minimal Document (single Block::Code
with raw file text) since Tier 2 has no parse step. allowlist in
ingest_one_asset extended to admit Tier 2 langs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 13:19:54 +00:00
22d4161728 feat(p10-2): manifest-file-v1 chunker (whole-file 1 chunk, symbol <manifest>)
Emits 1 Chunk per manifest file (Cargo.toml / pyproject.toml / package.json /
tsconfig.json / pom.xml / build.gradle / go.mod). Symbol unified to
"<manifest>"; manifest type distinguished by code_lang (toml / json / xml /
groovy / go-mod) read from Block::Code.lang. Oversize >200 lines splits via
tier2_shared::push_chunks_with_oversize.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 13:11:46 +00:00
51004ac593 feat(p10-2): dockerfile-file-v1 chunker (whole-file 1 chunk, symbol <dockerfile>)
Reads entire Dockerfile / Dockerfile.* / *.dockerfile content and emits a
single Chunk with symbol "<dockerfile>", code_lang "dockerfile", line range
1..EOF. Oversize >200 lines splits into line-windows sharing the symbol via
tier2_shared::push_chunks_with_oversize.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 13:09:13 +00:00
8996e73282 feat(p10-2): k8s-manifest-resource-v1 chunker + tier2_shared helper
Splits multi-document YAML by ^---\s*$, requires apiVersion + kind string
fields per document, emits 1 chunk per recognized k8s resource. Symbol =
<kind>/<namespace>/<name> or <kind>/<name> (cluster-scoped). Invalid YAML
returns 0 chunks (handled by p10-3 paragraph fallback). Oversize >200 lines
splits into line-windows sharing the same symbol.

tier2_shared module hosts the oversize fallback + Chunk-construction helper
mirroring code_rust_ast_v1's Chunk shape. Task E (dockerfile) and Task F
(manifest) will reuse it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 13:06:47 +00:00
22dba09857 refactor(p10-2): media.rs delegates code lang to code_lang_for_path
Replaces 1A-1 era inline match block with a single call to
kebab_parse_code::code_lang_for_path, per design §3.5 single-source-of-truth
rule. Adds Tier 2 routing test (yaml / dockerfile / toml / json / xml /
groovy / go-mod) and preserves all non-code extension branches.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 13:01:14 +00:00
aaa90b1754 feat(p10-2): extend code_lang_for_path with Tier 2 basenames + extensions
Adds basename-first matching for Dockerfile / Cargo.toml / pyproject.toml /
package.json / tsconfig.json / go.mod / pom.xml / build.gradle plus
Dockerfile.* prefix variant. Extension fallback adds .yaml/.yml/.dockerfile/
.toml/.json/.xml/.gradle → yaml/dockerfile/toml/json/xml/groovy.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 12:59:11 +00:00
077f92f41e build(p10-2): add serde_yaml dep to kebab-chunk for k8s-manifest-resource-v1
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 12:57:06 +00:00
5ce7f60932 docs(p10-2): implementation plan (11 tasks A-K, subagent-driven)
Branch feat/p10-2-tier2-resource. Tasks: serde_yaml dep / lang.rs basenames /
media.rs source-of-truth consolidation / 3 chunkers (k8s + dockerfile +
manifest) + tier2_shared helper / ingest dispatch / smoke tests / frozen
design §3.5+§10.1 / docs sync / version bump 0.13.0→0.14.0.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 12:55:36 +00:00
47857b2622 docs(p10-2): task spec for Tier 2 resource-aware chunkers (k8s + Dockerfile + manifest)
Frozen contract for the p10-2 single PR: 3 chunker activation, k8s
identification via apiVersion+kind, Dockerfile/manifest basename matching,
code_lang_for_path source-of-truth consolidation, frozen design §3.5 +
§10.1 deltas, and version bump 0.13.0 → 0.14.0.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 12:43:34 +00:00
1e4cff879b Merge pull request 'feat(p10-1C-JK): Java + Kotlin AST chunkers — JVM family 코드 색인 활성화' (#152) from feat/p10-1c-jk into main 2026-05-20 11:57:39 +00:00
2d7a566624 docs(p10-1c-jk): README/HANDOFF/ARCHITECTURE/SMOKE/INDEX + design §10.1; chore: bump version 0.12.0 → 0.13.0
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 11:38:40 +00:00
813bdd1a16 test(p10-1c-jk): code-java-ast-v1 + code-kotlin-ast-v1 chunker snapshots
Mirrors code_go_ast_snapshot pattern. In-memory CanonicalDocument (no
kebab-parse-code dep — boundary §6.3).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 10:57:37 +00:00
ff1bedbef5 feat(p10-1c-jk): activate Kotlin in ingest_one_code_asset dispatch
Replaces Kotlin bail! arms with KotlinAstExtractor + CodeKotlinAstV1Chunker.
Adds kotlin_file_ingests_and_searches_as_code_citation integration test —
asserts citation.lang=kotlin, symbol=com.foo.Foo.bar, code_lang=kotlin.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 10:54:55 +00:00
30e03c7a12 feat(p10-1c-jk): code-kotlin-ast-v1 chunker (1:1 + oversize split)
Duplicate of code-java-ast-v1 with language-agnostic body unchanged. Cross-
chunker policy_hash identity asserted vs md-heading-v1.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 10:52:24 +00:00
2ce6ae47c5 feat(p10-1c-jk): tree-sitter-kotlin-ng AST extractor (KotlinAstExtractor)
Uses tree-sitter-kotlin-ng (bare tree-sitter-kotlin is stuck on tree-sitter
0.21-0.23, incompatible with our 0.26). Mirrors JavaAstExtractor (JVM family,
source-side package extraction + class-nesting) with Kotlin grammar quirks:

- Root is `source_file`, not `program`.
- `package_header` child is `qualified_identifier` (its slice text is the
  dotted path); the bare `identifier` shape is also accepted as a fallback.
- `class_declaration` is the single node kind for `class` / `data class` /
  `sealed class` / `interface` / `enum class` — distinguished only by its
  `modifiers` child. Body is `class_body` for non-enum, `enum_class_body`
  for enum class; neither carries a `body` field name, so the extractor
  looks the body up by node kind rather than `child_by_field_name("body")`.
- `companion_object` is its own node kind (NOT object_declaration with a
  modifier); its `name` field is optional, so the extractor fills in the
  implicit Kotlin convention name `Companion`.
- `function_declaration` is allowed at top level (unlike Java), emitted as
  `<pkg>.<fn_name>`; the same node kind nested in `class_body` becomes
  `<pkg>.<...>.<Class>.<method>` via the same mod_path mechanism.
- `secondary_constructor` has no `name` field; symbol uses the enclosing
  class name (Java duplication convention: `<pkg>.<...>.<Class>.<Class>`).
- Enum bodies (`enum_class_body`) are NOT recursed — `enum_entry` is not
  emitted as a unit (matches Java 1차 scope).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 10:49:57 +00:00
ebc4ef2eea feat(p10-1c-jk): activate Java in ingest_one_code_asset dispatch
Replaces Java bail! arms with JavaAstExtractor + CodeJavaAstV1Chunker. Adds
java_file_ingests_and_searches_as_code_citation integration test — asserts
citation.lang=java, symbol=com.foo.Foo.bar, code_lang=java.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 10:44:05 +00:00
7bda1509b7 feat(p10-1c-jk): code-java-ast-v1 chunker (1:1 + oversize split)
Duplicate of code-rust-ast-v1 / code-go-ast-v1 with language-agnostic body
unchanged. Cross-chunker policy_hash identity asserted vs md-heading-v1.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 10:41:27 +00:00
61d48d67a3 feat(p10-1c-jk): tree-sitter-java AST extractor (JavaAstExtractor)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 10:39:02 +00:00
f4c840b994 refactor(p10-1c-jk): add java + kotlin to dispatch allowlist (bail until Tasks F/I)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 10:33:27 +00:00
15244b7494 feat(p10-1c-jk): route .java/.kt/.kts to MediaType::Code
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 10:31:29 +00:00
a7f7ab9f93 build(p10-1c-jk): add tree-sitter-java + tree-sitter-kotlin-ng workspace deps
Bare tree-sitter-kotlin v0.3.8 requires tree-sitter >=0.21,<0.23 which
conflicts with the workspace's tree-sitter 0.26 (links = "tree-sitter"
is a singleton). tree-sitter-kotlin-ng v1.1.0 (from
tree-sitter-grammars/tree-sitter-kotlin) uses the tree-sitter-language
0.1 shim which is compatible with tree-sitter 0.26. Using
tree-sitter-kotlin-ng as the Kotlin grammar crate.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 10:30:03 +00:00
1b19e33a4f docs(p10-1c-jk): task spec + implementation plan
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 10:27:13 +00:00
9c9e391b15 Merge pull request 'feat(p10-1C-Go): tree-sitter-go AST extractor + chunker — Go 코드 색인 활성화' (#151) from feat/p10-1c-go into main 2026-05-20 10:16:09 +00:00
f95cd55484 docs(p10-1c-go): README/HANDOFF/ARCHITECTURE/SMOKE/INDEX + design §10.1; chore: bump version 0.11.1 → 0.12.0
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 10:02:21 +00:00
ab288135e9 test(p10-1c-go): code-go-ast-v1 chunker snapshot + full-suite gate
Mirrors code_python_ast_snapshot / code_ts_ast_snapshot patterns. In-memory
CanonicalDocument (no kebab-parse-code dep — boundary §6.3 respected).

verify:
- cargo test -p kebab-chunk --test code_go_ast_snapshot → 2/2
- cargo test --workspace --no-fail-fast -j 1 → 0 failures (all green)
- cargo clippy --workspace --all-targets -- -D warnings → clean
- SMOKE: chunk.ParseDoc symbol + code_lang_breakdown {"go": 1} 확인

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 09:54:17 +00:00
c19aa006d0 feat(p10-1c-go): activate Go in ingest_one_code_asset dispatch
Replaces Go bail! arms with GoAstExtractor + CodeGoAstV1Chunker. Adds
go_file_ingests_and_searches_as_code_citation integration test — asserts
citation.lang=go, symbol=chunk.ParseDoc, code_lang=go.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 09:13:47 +00:00
f1a4f67e12 feat(p10-1c-go): code-go-ast-v1 chunker (1:1 + oversize split)
Duplicate of code-rust-ast-v1 / code-{python,ts,js}-ast-v1 with language-agnostic
body unchanged. Cross-chunker policy_hash identity asserted vs md-heading-v1.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 09:11:14 +00:00
6463c52827 feat(p10-1c-go): tree-sitter-go AST extractor (GoAstExtractor)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 09:08:46 +00:00
2559d0d95a refactor(p10-1c-go): add go to ingest dispatch allowlist (bail until Task F)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 09:03:28 +00:00
4524830306 feat(p10-1c-go): route .go to MediaType::Code(go)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 09:01:29 +00:00
8cdd3903c7 build(p10-1c-go): add tree-sitter-go workspace dep
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 09:00:04 +00:00
8b89961ada docs(p10-1c-go): task spec + implementation plan
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 08:58:45 +00:00
eec90996aa chore: bump version 0.11.0 → 0.11.1
dogfood semantic cleanup (PR #150) lands: document-centric fetch_span +
assets.workspace_path 'last-registered' semantic explicitly documented.

patch bump 사유: 외부 wire / CLI / config surface 변경 없음. 새 internal
trait method (get_asset) + caller refactor + doc-comment 갱신. twin file
의 fetch_span 잘못 분기 가능성 fix (rare).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 08:09:46 +00:00
ce1c778b4a Merge pull request 'fix(dogfood): document-centric fetch_span + assets.workspace_path semantic doc' (#150) from fix/dogfood-asset-flip-flop-cleanup into main 2026-05-20 08:08:55 +00:00
453ec15df4 fix(dogfood): document-centric fetch_span + remove get_asset_by_workspace_path
assets.workspace_path is INTENTIONALLY 'last-registered path' for twin
files (identical content at different paths share one asset row PK'd by
blake3 content hash). PR #146 made try_skip_unchanged document-centric;
PR #149 made reset --orphans-only document-centric; this PR removes the
last caller of get_asset_by_workspace_path (fetch.rs:193 in fetch_span,
which used it to reject PDF/audio media — for twins this could read the
wrong asset's media_type and pick the wrong branch).

Replaced with the natural 2-step lookup: get_document_by_workspace_path
(PR #146) → doc.source_asset_id → get_asset (NEW trait method, asset_id
is PRIMARY KEY so flip-flop-immune by construction).

Then removed get_asset_by_workspace_path trait method + SqliteStore impl
— 0 callers after the refactor.

UPSERT doc-comment refreshed in store.rs to make the 'last-registered'
semantics explicit so future readers don't try to 'fix' the flip-flop.

Dogfood follow-up (PR #142 1B + multi-root corpus).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 08:03:38 +00:00
1e6de9fe9f chore: bump version 0.10.0 → 0.11.0
dogfood follow-up (PR #149) lands: kebab reset --orphans-only explicit
complement to PR #148's conservative sweep.

minor bump 사유: 새 CLI flag (--orphans-only) + 새 ResetScope variant +
ResetReport additive 필드 = surface 확장. design §10.4 트리거 충족.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 07:53:55 +00:00
9fa2a1ebac Merge pull request 'feat(dogfood): kebab reset --orphans-only — explicit complement to PR #148 sweep' (#149) from feat/dogfood-reset-orphans-only into main 2026-05-20 07:50:43 +00:00
749c6ae240 docs(dogfood): sync reset_report schema + README for --orphans-only (PR #149 review)
Round 1 review found 2 doc gaps:
- docs/wire-schema/v1/reset_report.schema.json: 'orphans_only' missing
  from scope enum; orphans_purged/purged_paths properties absent
- README: --orphans-only not listed in the reset prose

Schema additions are additive minor (default values keep back-compat).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 07:47:44 +00:00
5f2bd9e97e feat(dogfood): kebab reset --orphans-only — purge stored docs outside walker scope
PR #148 auto-purges only filesystem-missing files (conservative — leaves
on-disk-but-out-of-scope docs alone for data safety). This is the explicit
complement: when the user has narrowed include / widened exclude / removed
a sub-directory from the workspace and WANTS the stored docs reconciled,
they invoke 'kebab reset --orphans-only'.

Confirm prompt with orphan count + sample paths; --yes required in
non-TTY. SQLite purge via existing purge_deleted_workspace_path (PR #148)
+ vector store delete_by_chunk_ids when configured. No fs existence
check — orphans-only is the explicit 'I know what I'm doing' variant.

dogfood follow-up to PR #148 (file deletion auto-purge).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 07:38:10 +00:00
471 changed files with 72818 additions and 4514 deletions

1
.gitignore vendored
View File

@@ -1,6 +1,7 @@
.superpowers/
.worktrees/
.claude/
.omc/
/target
**/*.rs.bk
Cargo.lock.bak

1
AGENTS.md Symbolic link
View File

@@ -0,0 +1 @@
CLAUDE.md

View File

@@ -81,11 +81,71 @@ Bump 자체는 단순 minor / patch 한 줄 수정 (`Cargo.toml` workspace `vers
Release 절차:
1. `gitea-release v<X.Y.Z>` (gitea-ops skill) 으로 tag + push + release notes.
2. release notes 는 사용자 도그푸딩에 영향 가는 surface 변경 위주 — wire schema 추가, CLI flag 신규, TUI 키 변경, V00X migration 등.
3. 프리-1.0 (`0.x.y`) 단계: minor bump 시 wire schema additive / surface 변경 누적, patch bump 시 bug fix only.
2. release notes 는 사용자 도그푸딩에 영향 가는 surface 변경 위주 — wire schema 추가, CLI flag 신규, TUI 키 변경, V00X migration 등 — 다룬다. 이때 추가된 기능과 변경사항은 유저가 이해할 수 있도록 친절하고 자세하게 풀어서 설명해야 하며, 단순히 commit subject 를 나열하는 형태로 끝내면 안 된다. 필요하다면 도그푸딩이나 테스트 결과도 함께 적어 둔다.
3. 프리-1.0 (`0.x.y`) 단계 bump 규칙 — **기능(behavior) 또는 인터페이스(interface) 변경 여부**로 판정:
- **minor bump** (`0.x.0`): 기능 또는 인터페이스에 *실질적* 변경이 있을 때. 인터페이스 = 신규/변경/삭제된 CLI subcommand·flag, config 키, wire schema 의 **breaking** 변경, 임베딩/검색/RAG 등 사용자가 받는 **결과·동작**의 변화, V00X migration, frozen 설계 변경. 기능 = 새 source 형식·검색 모드·백엔드 등 *할 수 있는 일*의 추가/변경.
- **patch bump** (`0.x.y`): 기능·인터페이스 변경이 **없을** 때. bug fix, 내부 refactor, 성능 개선, 로깅/진행표시 등 **관측성(observability) 개선**, **additive-only wire 변경**(backward-compat 신규 필드/이벤트라 기존 소비자 무영향), 문서. ← 즉 "결과가 같고 새 명령/플래그/config 도 없으면 patch".
- 경계 예: 진행 로그에 phase/파일명 추가 + additive wire 이벤트(asset_phase) = **patch** (검색·색인 결과 불변, 새 명령/플래그/config 없음). arctic 임베더 provider + 신규 config 키 = **minor** (인터페이스 추가). 별칭 기능 제거 + migration = **minor** (동작·인터페이스 변경).
**bump 시점 = release 시점 같은 commit**. 즉 commit `chore: bump version 0.x → 0.y` 직후 같은 commit 에 tag. v0.1.0 (`2319206`) 처럼 bump 없이 tag 만 찍는 패턴은 후속 release 가 대상 commit 을 헷갈리게 함 — pre-release snapshot 은 SHA reference 로 충분.
## Dogfood trigger
도그푸딩 = 새 binary 를 실제 KB / 실제 query 로 돌려보고 user-visible 동작이 spec 의 의도와 일치하는지 확인하는 종단 검증. unit / integration test 가 못 잡는 회귀 (UX 어색함, performance regression, 의외의 token 처리, embedding drift, RAG hallucination) 를 catch 함. PR 머지 전 또는 머지 직후 release notes 작성 전에 실시.
### 도그푸딩이 필요한 시점
다음 트리거 중 하나라도 hit 시 도그푸딩 필수. **모두 release-level 또는 user-visible behavior 변경 임**.
**Schema / migration**:
- 신규 V00X migration (예: V007 trigram, V008 OCR mirror, V009 morphological) — `corpus_revision` cascade + auto-backfill 정책의 사용자 경험 확인.
- frozen design contract 변경 (`docs/superpowers/specs/2026-04-27-kebab-final-form-design.md` §X 갱신) — verbatim CI diff-check 외의 user-visible side effect 확인.
**Wire schema / CLI surface**:
- 신규 `--json` 필드, exit code 변경, 또는 schema major bump (v1 → v2) — agent / external integration 의 호환성 검증.
- `kebab` 의 subcommand 또는 flag 추가/삭제/rename — agent skill / muscle memory 영향.
**Search / RAG behavior**:
- FTS5 tokenizer / chunker / embedder 모델 / RAG prompt template 변경 — 같은 query 의 hit ordering, snippet, RAG citation 패턴이 자연스럽게 변화하는지.
- score gate, RRF fusion ratio, NLI threshold 같은 ranking 파라미터 default 변경.
**Performance**:
- ingest / search / ask latency 의 의도된 변화 (예: lindera tokenize, OCR 추가, multi-hop RAG) — actual wall-clock 측정 + release notes 에 명시.
- 대용량 KB (수천 doc / 만 chunk) 의 first-boot eager backfill 시간이 사용자 hang 인지에 영향 안 가는지.
**Language / locale**:
- 한국어 / 일본어 / 중국어 lexical 동작 변경 (V007 trigram, V009 morphological, future N-gram).
- 영어 substring 매칭 같은 ad-hoc 부산물의 회귀.
**File / asset surface**:
- 신규 source 형식 (PDF OCR, audio, video) — extractor / chunker 의 실제 corpus 동작.
- `.kebabignore` / `_external/` 같은 workspace 정책 변경.
**Release-level**: 위 트리거 중 하나가 hit 되어 `Cargo.toml` workspace `version` bump 가 필요하면, **bump commit 이전에 도그푸딩 evidence 가 HOTFIXES + release notes 에 명시** 되어 있어야 함. evidence 없는 release 는 사용자가 "왜 bump 했는지" 추적 불가.
### 도그푸딩 데이터 보관소
모든 도그푸딩 source 문서 + KB state + 로그는 `/build/dogfood/` 한 디렉토리에 누적 보관한다. **분류는 문서 의미 / 종류 / 형식 기준만** — kebab version, 생성 시점, scenario name 같은 prefix 금지 (`v0.20.1-dogfood/`, `dogfood-v018/` 같은 디렉토리 신설 X). 자세한 layout 은 `/build/dogfood/README.md` 참조.
- `/build/dogfood/corpus/` — source 문서 (read-only). format 별 분류 (`markdown/`, `code/`, `html/`, `images/`, `pdf/`, `manifest/`, `resources/`) + 각 format 내 category 별 (예: `markdown/{korean,english,bilingual,tech-docs,coding-md-corpus,topics,notes,edge-cases}`, `code/{rust,python,...}`). 새 fixture 는 적절한 category subdir 에 추가.
- `/build/dogfood/kb/` — 도그푸딩 run 의 KB 출력 (SQLite + LanceDB + assets + models). 매 run 마다 reset 가능. 별 KB 디렉토리 신설 X.
- `/build/dogfood/logs/` — 누적 실행 로그 (ndjson + stderr + summary).
- `/build/dogfood/config.toml` — canonical 도그푸딩 config (없으면 `kebab init` 후 path override).
- `/build/dogfood/_archive/` — regeneratable stale state (이전 run 의 sqlite/lancedb, XDG snapshot). 디스크 압박 시 wipe 가능.
`/tmp/kebab-smoke/`, `/tmp/kebab-*`, `/build/cache/dogfood*`, `/home/altair823/KnowledgeBase`, `~/.config/kebab/`, `~/.local/share/kebab/`, `~/.local/state/kebab/` 같은 위치 신규 사용 금지 — 모두 `/build/dogfood/` 로 일관. ad-hoc fixture 가 필요하면 `corpus/<format>/<category>/` 에 추가.
### 도그푸딩 결과 기록
도그푸딩 evidence 는 두 곳에 cascade:
1. **`tasks/HOTFIXES.md` 의 dated entry** — 시나리오 별 hit count 표 + snippet evidence + known limitation. 미래에 spec drift 의심 시 git history 외 immediate reference 가 됨.
2. **`docs/release-notes/v<X.Y.Z>-draft.md`** (또는 gitea release body) — 사용자 도그푸딩 영향에 영향이 가는 surface 변경을 4 단락 (변경 사실 / trade-off / mitigation / upgrade 절차) 으로 풀어서 설명. evidence link.
도그푸딩 단계에서 *발견된 bug* (spec 과 실제 동작의 mismatch, performance regression, UX 어색함) 는 즉시 fix → re-dogfood. fix 가 별 PR 으로 빠지면 머지 후 HOTFIXES 에 dated entry.
DOGFOOD scenario catalog (§1~§13) 는 `docs/DOGFOOD.md`. 신규 release 마다 §관련 section 의 scenario list 갱신 + 신규 scenario 추가.
## Naming + paths
- Crate prefix: `kebab-` (kebab-case package, `kebab_` snake_case in Rust modules).

1442
Cargo.lock generated

File diff suppressed because it is too large Load Diff

View File

@@ -2,17 +2,17 @@
resolver = "3"
members = [
"crates/kebab-core",
"crates/kebab-parse-types",
"crates/kebab-config",
"crates/kebab-source-fs",
"crates/kebab-parse-md",
"crates/kebab-normalize",
"crates/kebab-chunk",
"crates/kebab-store-sqlite",
"crates/kebab-store-vector",
"crates/kebab-search",
"crates/kebab-embed",
"crates/kebab-embed-local",
"crates/kebab-embed-candle",
"crates/kebab-embed-ollama",
"crates/kebab-llm",
"crates/kebab-llm-local",
"crates/kebab-rag",
@@ -24,6 +24,7 @@ members = [
"crates/kebab-tui",
"crates/kebab-mcp",
"crates/kebab-parse-code",
"crates/kebab-nli",
]
[workspace.package]
@@ -31,7 +32,95 @@ edition = "2024"
rust-version = "1.85"
license = "MIT OR Apache-2.0"
repository = "https://github.com/altair823/kebab"
version = "0.10.0"
version = "0.26.1" # v0.26.1 — ingest 진행 로그 개선: TTY 진행바에 현재 파일명 + 느린 phase(ocr/caption/embed)+모델명 실시간 + 경과초 heartbeat `(Ns)`, 종료 시 최장 소요 파일 top-5 요약. 신규 wire 이벤트 `asset_phase{idx,total,phase,model}` + `asset_timings.ocr_ms`/`caption_ms` 추가(additive, ingest_progress.v1 유지, serde default 0). 기본 동작 불변. — CLAUDE.md §Release
# pre-v0.18 workspace-wide cleanup: enable clippy::pedantic group with
# intentional allow-list. The allowed lints are either cosmetic (doc style),
# informational (function size), or carry intentional truncation we accept
# (numeric casts in tokenizer/ONNX inputs, hash modular reduction, etc).
[workspace.lints.clippy]
pedantic = { level = "warn", priority = -1 }
# Intentional u32 ↔ i64 casts in kebab-nli (ONNX i64 inputs from tokenizer u32 ids).
# u64 ↔ usize across kebab-store-sqlite row counts. Wide truncation is auditable
# at use site, not lint-wide.
cast_possible_truncation = "allow"
cast_possible_wrap = "allow"
cast_sign_loss = "allow"
cast_precision_loss = "allow"
# Doc markdown style is cosmetic; we run rustdoc on demand.
doc_markdown = "allow"
missing_errors_doc = "allow"
missing_panics_doc = "allow"
# Informational only — splitting a long pipeline function isn't always cleaner.
too_many_lines = "allow"
# `Foo::default()` is concise and idiomatic here; `<Foo as Default>::default()`
# adds noise without surfacing intent.
default_trait_access = "allow"
# Module name prefix on public items keeps the wire/log surface readable
# (`refusal_reason::no_chunks` etc).
module_name_repetitions = "allow"
# We use `#[must_use]` deliberately on public results, not blanket.
must_use_candidate = "allow"
# `String` arg sometimes signals "I'll consume this" — let signature decide.
needless_pass_by_value = "allow"
# Idiomatic single-line bindings stay; let-else expansion isn't always clearer.
manual_let_else = "allow"
# `use` after `let` is a common kebab pattern (scoped imports next to use site).
items_after_statements = "allow"
# Naming pairs like `chunk_id` / `chunks_id` are intentional domain terms.
similar_names = "allow"
# `iter.map(format!).collect::<String>()` is idiomatic when the per-element
# string is genuinely independent — `fold` only wins on accumulation patterns.
format_collect = "allow"
# Exhaustive `match` with explicit variant arms (vs `_`) catches future
# variant additions at compile time (kebab core's `RefusalReason` pattern).
match_wildcard_for_single_variants = "allow"
# Copy types under `&self` keep call-site discipline; auto-deref noise > tiny perf gain.
trivially_copy_pass_by_ref = "allow"
# `unnecessary_wraps` flags helpers that could drop `Result`, but keeping the
# Result allows future error variants without churning callers.
unnecessary_wraps = "allow"
# NLI score / RRF fusion / similarity threshold comparisons are intentional —
# floats live in the `[0, 1]` band and are compared with explicit thresholds.
float_cmp = "allow"
# File-extension dispatch is keyed on ASCII conventions; case sensitivity
# is part of the spec for `.md`, `.pdf`, etc.
case_sensitive_file_extension_comparisons = "allow"
# Config / opts structs intentionally bundle boolean flags (ingest options,
# search modes, etc) — splitting them into enums would obscure the wire shape.
struct_excessive_bools = "allow"
# `bytecount` crate would be a new dep just for one-off ASCII counts.
naive_bytecount = "allow"
# `#[ignore]` annotations on tests document via the test name + nearby comment.
ignore_without_reason = "allow"
# `format!` push patterns are a hot path for kebab-tui's progressive rendering;
# `write!` rewrite needs a verified-equal benchmark before swapping.
format_push_string = "allow"
# Builder-style `with_*` methods return `Self`; the existing `#[must_use]`
# discipline lives on aggregate constructors, not every chainable setter.
return_self_not_must_use = "allow"
# Match arms grouped by side-effect over body equality (e.g. snake_case wire
# label tables) — fanning them out keeps adding a new variant trivial.
match_same_arms = "allow"
# Remaining style-only warnings: trailing `continue` is sometimes clearer than
# rewriting, `_x` underscored bindings document intent at the use site, and
# `!(a == b)` reads better than `a != b` when paired with a complementary check.
needless_continue = "allow"
used_underscore_binding = "allow"
nonminimal_bool = "allow"
# Other one-off cosmetic items: large literal formatting, doc link quoting,
# `Clone::clone_from` swap, `str::replace` chaining, `Iterator::any` ergonomics.
unreadable_literal = "allow"
many_single_char_names = "allow"
doc_link_with_quotes = "allow"
assigning_clones = "allow"
collapsible_str_replace = "allow"
trivial_regex = "allow"
elidable_lifetime_names = "allow"
range_plus_one = "allow"
explicit_iter_loop = "allow"
implicit_hasher = "allow"
ref_option = "allow"
[workspace.dependencies]
anyhow = "1"
@@ -54,6 +143,7 @@ proptest = "1"
# p9-fb-19: LRU cache for `App::search` results. Bounded capacity
# from `config.search.cache_capacity` (default 256, ~1.3 MB cap).
lru = "0.12"
lopdf = "0.32"
# fastembed-rs ships ONNX runtime via the `ort-download-binaries` feature
# in its default set (which also pulls `hf-hub` for first-run model
# downloads). Pinned to the 4.x line per task p3-2 (current 5.x release
@@ -94,6 +184,31 @@ tree-sitter-rust = "0.24"
tree-sitter-python = "0.25.0"
tree-sitter-typescript = "0.23.2"
tree-sitter-javascript = "0.25.0"
# Go grammar for code ingest (kebab-parse-code, p10-1C-Go).
tree-sitter-go = "0.25.0"
# JVM family grammars for code ingest (kebab-parse-code, p10-1C-JK).
tree-sitter-java = "0.23.5"
tree-sitter-kotlin-ng = "1.1.0" # bare tree-sitter-kotlin requires ts <0.23; -ng uses tree-sitter-language 0.1 (ts 0.26 compat)
# C/C++ family grammars for code ingest (kebab-parse-code, p10-1D).
tree-sitter-c = "0.24.2"
tree-sitter-cpp = "0.23.4"
# fb-41 PR-9 (kebab-nli): mDeBERTa-v3 XNLI verifier deps. Versions match
# the fastembed 4.9 transitive set so the ONNX Runtime + tokenizer stack
# stays single-versioned across the workspace. ort `default-features=false`
# drops the bundled binary downloader (fastembed already provides one);
# tokenizers `default-features=false, onig` swaps the default `esaxx` regex
# backend for `onig` so the build doesn't need libstdc++ headers (verified
# via PR-9a pre-flight: SentencePiece tokenizer.json loads + KR/EN encode).
# hf-hub uses `ureq + rustls-tls` to stay aligned with kebab-embed-local's
# pure-Rust TLS stack.
ort = { version = "=2.0.0-rc.9", default-features = false, features = ["ndarray"] }
tokenizers = { version = "0.21", default-features = false, features = ["onig"] }
hf-hub = { version = "0.4", default-features = false, features = ["ureq", "rustls-tls"] }
ndarray = "0.16"
# Korean morphological tokenizer (FTS v0.20.x, §6.1). lindera-ko-dic bundles
# the KO-DIC dictionary as an embedded blob via the embed-ko-dic feature.
lindera = "3"
lindera-ko-dic = "3"
# Disk-footprint trim for dev / test builds. Codegen, opt-level, and
# behavior are unchanged — only DWARF debug info is reduced (line

1
GEMINI.md Symbolic link
View File

@@ -0,0 +1 @@
CLAUDE.md

View File

@@ -4,7 +4,7 @@
## 한 줄 요약
P0P5 + P6 + P7 + P9-1/2/3/4 (Library / Search / Ask / Inspect) 머지 완료. `kebab ingest` 가 markdown / image / PDF 모두 처리. `kebab search` / `kebab ask` 가 매체 가로질러 결과 + page citation 반환. `kebab tui` 가 4 패널 (Library + Search + Ask + Inspect) 제공 — 사용자가 `?` 로 ask, `/` 로 search, Library Enter / Search `i` 로 inspect, Search `g` 로 editor jump. 다음 후보 = P9-5 (desktop tauri) 또는 보류 중인 P8 (audio) 의 시스템 dep brainstorm.
P0P5 + P6 + P7 + P9-1/2/3/4 (Library / Search / Ask / Inspect) + P10 전체 머지 완료 (현재 **v0.18.0**). `kebab ingest` 가 markdown / image / PDF / 소스코드 (Rust / Python / TS / JS / Go / Java / Kotlin / C / C++) / Tier 2 리소스 파일 (yaml/k8s / dockerfile / toml / json / xml / groovy / go-mod) + Tier 3 paragraph fallback (shell / 비-k8s YAML / AST 실패 케이스) 처리. `kebab search` / `kebab ask` 가 매체 가로질러 결과 + page / code citation 반환. `kebab tui` 가 4 패널 (Library + Search + Ask + Inspect) 제공. **v0.17.0 cut (2026-05-24)**: 한국어 trigram FTS5 tokenizer (PR #159) + C typedef alias unit (PR #160) + `code_lang_chunk_breakdown` additive (PR #161). **v0.17.1 cut (2026-05-25)**: 확장 도그푸딩 후 `[models.llm] request_timeout_secs` config 노브 (PR #162) + sudo 없이 ollama 설치 + `kebab ask --stream` UX 권장 docs (PR #163). **v0.17.2 cut (2026-05-25)**: v0.17.1 post-dogfood polish — `[image.ocr] request_timeout_secs` 별 노브 (PR #164, v0.17.1 미진행 closure) + `heading_path` FTS5 column filter 로 text-only 매칭 + raw-mode escape hatch (PR #165, 2026-05-24 v0.17.0 trigram entry 의 JSON 노이즈 closure). **v0.18.0 cut (2026-05-26)**: fb-41 multi-hop RAG + NLI verification ship (PR #176-180) — `kebab ask --multi-hop` 의 decompose → decide → synthesize loop + mDeBERTa-v3 XNLI ONNX post-synthesize entailment 검사. dogfood S7 caffeine hallucination 의 silent LLM-self-judge ceiling 해결 (nli_score 0.0035 graceful refuse). 추가 `chore: workspace-wide cleanup + post-PR9 refactor` (PR #181) — clippy::pedantic baseline + H1 config wiring + 9 new tests. 자세한 영향은 [v0.17.0 release notes](https://gitea.altair823.xyz/altair823-org/kebab/releases/tag/v0.17.0) + [v0.17.1 release notes](https://gitea.altair823.xyz/altair823-org/kebab/releases/tag/v0.17.1) + [v0.17.2 release notes](https://gitea.altair823.xyz/altair823-org/kebab/releases/tag/v0.17.2) + [v0.18.0 release notes](https://gitea.altair823.xyz/altair823-org/kebab/releases/tag/v0.18.0). 구조적으로 남은 component 는 P9-5 (desktop tauri) 하나뿐, P8 (audio) 는 사용자 보류.
## Phase 로드맵
@@ -17,21 +17,38 @@ P0P5 + P6 + P7 + P9-1/2/3/4 (Library / Search / Ask / Inspect) 머지 완료.
| **P4** | Local LLM + RAG + grounded answer | `kebab-llm`, `kebab-llm-local`, `kebab-rag` | P3 | ✅ 완료 |
| **P5** | Golden query / regression eval | `kebab-eval` | P4 | ✅ 완료 |
| **P6** | 이미지 ingestion (OCR + caption) | `kebab-parse-image` | P5 | ✅ 완료 (4/4 component, OCR/caption Ollama-vision) |
| **P7** | PDF text + page citation | `kebab-parse-pdf` | P5 | ✅ 완료 (3/3 component, page-level chunker + ingest wiring) |
| **P7** | PDF text + page citation + scanned OCR (v0.20.0 sub-item 1) | `kebab-parse-pdf` + `kebab-app::pdf_ocr_apply` | P5 + P6 | ✅ 완료 (3/3 component, page-level chunker + ingest wiring + post-extract OCR enrichment via qwen2.5vl:3b vision LLM) |
| **P8** | 음성 transcription + timestamp citation | `kebab-parse-audio` | P5 | ⏸ 보류 (whisper-rs 시스템 dep brainstorm 필요) |
| **P9** | TUI + desktop app | `kebab-tui`, `kebab-desktop` | P5 | 🟡 진행 (4/5 component — P9-1/2/3/4 완료 [Library / Search / Ask / Inspect], P9-5 desktop 예정 · 도그푸딩 피드백 **20/20 ✅**) |
| **P10** | code ingest framework | `kebab-parse-code` | P5 | 🟡 진행 중 — 1A-1 ✅ (wire schema + parse-code skeleton + filter flags), 1A-2 ✅ (Rust AST chunker, tree-sitter-rust, `code-rust-ast-v1` — v0.7.0), **1B 🟡 PR 오픈** (Python `code-python-ast-v1` + TypeScript `code-ts-ast-v1` + JavaScript `code-js-ast-v1`3 언어 dogfooding 가능, v0.8.0 대기) |
| **P10** | code ingest framework | `kebab-parse-code` | P5 | 🟡 진행 중 — 1A-1 ✅ (wire schema + parse-code skeleton + filter flags), 1A-2 ✅ (Rust AST chunker, `code-rust-ast-v1` — v0.7.0), 1B ✅ (Python/TS/JS AST chunkers — v0.8.0 이후), **1C-Go ✅ (Go AST chunker, `code-go-ast-v1` — v0.12.0)**, **1C-JavaKotlin ✅ (Java + Kotlin AST chunkers, `code-java-ast-v1` / `code-kotlin-ast-v1` — v0.13.0)**, **2 ✅ (Tier 2 resource-aware: yaml/k8s + dockerfile + manifest, `k8s-manifest-resource-v1` / `dockerfile-file-v1` / `manifest-file-v1` — v0.14.0)**, **3 ✅ (Tier 3 paragraph fallback: code-text-paragraph-v1 — v0.15.0)**, **1D ✅ (C + C++ AST chunkers, code-c-ast-v1 + code-cpp-ast-v1 — v0.16.0)** |
P0~P5 직렬. P6~P9 P5 이후 병렬 가능.
## Component 카운트
총 33 component task — spec 시점 31 개 + 후속 wiring task 3 (P3-5 / P6-4 / P7-3) 가 머지 시점에 추가됨. per-component 진행 + status 는 [tasks/INDEX.md](tasks/INDEX.md).
총 33 component task — spec 시점 31 개 + 후속 wiring task 3 (P3-5 / P6-4 / P7-3) 가 머지 시점에 추가됨. v0.18.0 cut 시점에 fb-41 multi-hop RAG + NLI verification (PR-9 5 sub-PRs) 가 P9 추가 component 로 ship — `kebab-nli` 신규 crate (mDeBERTa-v3 XNLI ONNX verifier) + `kebab-rag::ask_multi_hop` (decompose/decide/synthesize loop + step 8.5 NLI hook). per-component 진행 + status 는 [tasks/INDEX.md](tasks/INDEX.md).
## 머지 후 발견된 버그 / 결정 (요약)
- **candle 임베딩 백엔드 다변화** (2026-06-01, Track 1, v0.22.0): `provider = "candle"` opt-in 추가 — 같은 `multilingual-e5-large` 모델을 순수 Rust(candle)로 돌려 듀얼소켓 NUMA 서버의 onnxruntime 48-스레드 double-free 를 회피. `[models.embedding].num_threads`(+env `KEBAB_EMBED_THREADS`)로 CPU 스레드 캡. fastembed default 동작·벡터 불변, `embedding_version` 유지(재색인 0). Phase 0 스파이크 패리티 cosine 1.000000. 상세 HOTFIXES 동일 일자.
- **config 마이그레이션** (2026-05-31, PR #198): `kebab config migrate` 추가 — 기존 config.toml 에 빠진 섹션을 주석과 함께 채우고 deprecated 정리(멱등·`.bak`·dry-run, 값/주석 보존). `schema_version` 1→2, `init` 도 섹션 주석 포함, doctor 에 `config_migration` 체크. 상세 HOTFIXES 동일 일자.
머지 후 발견된 모든 deviation / hotfix 의 dated 로그는 [tasks/HOTFIXES.md](tasks/HOTFIXES.md). 본 요약은 \"누군가가 인수받을 때 알아두면 시간을 많이 절약하는\" 항목만:
- **2026-06-03 ingest 진행 로그 개선** — v0.26.1. 이미지/PDF + OCR/caption on 볼트 ingest 가 "멈춘 듯" 보이던 문제 해소: TTY 진행바에 현재 파일명 + 느린 phase(ocr/caption/embed)+모델명 + 경과초 `(Ns)` heartbeat, 종료 시 최장 소요 파일 top-5 요약. 신규 wire `asset_phase{idx,total,phase,model}` + `asset_timings.ocr_ms`/`caption_ms`(additive, `ingest_progress.v1` 유지, serde default 0). 이미지·PDF 경로도 `asset_timings` emit(이전 markdown 만). 기본 동작 불변. 자세한 내용: `tasks/HOTFIXES.md` (2026-06-03 ingest 진행 로그), spec/plan `docs/superpowers/{specs,plans}/2026-06-03-ingest-log-improve-*.md`.
- **2026-06-03 arctic-embed-l-v2.0 임베더 통합** — v0.26.0. 별칭 제거 후 설명형 query recall 보강(측정 recall@10 130/132, e5 +7). `kebab-embed-candle` 모델 레지스트리화(e5 mean + `snowflake-arctic-embed-l-v2.0` CLS, 모델별 pooling/prefix) + 신규 `kebab-embed-ollama`(`provider="ollama"`, `/api/embed`). config `endpoint: Option<String>` 추가. 기본 e5 유지(opt-in), arctic 전환은 embedding_version cascade → 재색인. candle↔Ollama cosine>0.99 게이트로 pooling/prefix 정확성 고정(`#[ignore]`). 자세한 내용: `tasks/HOTFIXES.md` (2026-06-03 arctic), spec `docs/superpowers/specs/2026-06-03-arctic-embedder-spec.md`.
- **2026-06-03 doc-side expansion(별칭) 기능 완전 제거** — v0.25.0. 아래 2026-05-31 항목의 색인-시 청크당 LLM 별칭 생성 + 별칭 검색 채널을 **전부 제거**(ROI 음수: cross-lingual 은 e5-large 단독으로 충분, 기여는 설명형 +2 그룹뿐인데 대가가 청크당 색인-시 LLM). `Chunk.aliases`/`expansion.rs`/`IngestExpansionCfg`/alias lexical arm/`expansion_progress` wire kind 제거, 신규 마이그레이션 **V013**`chunk_aliases_fts`+`chunks.aliases` DROP. 별칭 default-off 였어 사용자 체감 0, 기존 KB 도 재색인 불요(잔존 별칭 벡터는 `strip_alias_suffix` graceful 매핑/`reset` 정리). `AssetTimings.expansion_ms` 는 wire 호환 위해 값 0 으로 유지. 자세한 내용: `tasks/HOTFIXES.md` (2026-06-03), spec `docs/superpowers/specs/2026-06-03-remove-doc-expansion-spec.md`.
- **2026-05-31 Phase 2 doc-side expansion 별칭(개별 dense 벡터) + 파생물 캐시(V012)** — v0.21.0 cut. 색인 시 LLM 이 청크별 별칭("같은 의미 다른 표현")을 생성, 줄별 **개별 dense 벡터**(sentinel `{chunk}#alias#N`)로 색인 (묶음 1벡터는 평균화 희석으로 회귀 → 폐기) + boilerplate 청크 skip. `[ingest.expansion]` default off. 측정(나무위키 ~1000 문서 CS corpus): 변형 일관성 14/18 → **16/18**, spread 0.222→0.111, 대조군 false-positive 별칭 무죄. 비용 병목(별칭 18문서 2.5h)은 **파생물 캐시(V012, 청크 내용 해시 키)**로 해소 — 정답 3개 cold 1879s → warm 13s **≈ 145배**, embedding+별칭 LLM 캐싱, version_key cascade 정합. search/ask 가 `kebab.sqlite`+`lancedb` 만으로 동작 → 외부 서버 색인 후 DB 만 복사하는 이식 워크플로 가능. **결정/known limitation**: grounded/refusal 판정이 부분 인용을 grounded 로 오분류(정직한 거부가 false-positive 로 집계) — 별도 개선 후보. stack·svm 설명형 2개 잔존. 자세한 내용: `tasks/HOTFIXES.md` (2026-05-31), 측정: `docs/superpowers/handoffs/2026-05-31-namu-wiki-alias-cache-study.md`.
- **2026-05-29 v0.20.2 dogfood findings + 검색 품질 baseline** — 8-finding 라운드 완료. (1) Ask 응답언어: rag-v3 default (질문 언어 = 답변 언어). (2) eval `--config` facade 패치 로 dogfood KB 직접 eval 가능. (3) 검색 품질 baseline — hybrid hit@3=1.0 / MRR=0.833, lexical hit@3=1.0 / MRR=0.7 (golden 10 query). **O-2 known limitation**: 소형 모델(gemma4:e4b) refusal 메시지의 query 언어 불일치 가능 — 판정은 정상, 표시 문구만 해당. 자세한 내용: `tasks/HOTFIXES.md` (2026-05-29).
- **v0.20 sub-item 1 (scanned PDF OCR via qwen2.5vl:3b)**: post-extract enrichment pattern (`kebab-app::pdf_ocr_apply`, H-1 resolution), DCTDecode-only v1 scope (FlateDecode/CCITTFax page 는 warning + skip), parser_version `"pdf-text-v1"` 보존 + force-reingest UX 명문 (H-4).
- **2026-05-26 kebab-normalize + kebab-parse-types 흡수 (24 → 22 crates, design §3.7b 재작성)** — v0.19.0 cut. 4 parser 중 markdown 한 갈래만 lift 를 경유하는 reality 가 design §3.7b 의 fan-in ≥ 2 가정과 diverge → thin layer (`kebab-parse-types`) + `kebab-normalize` 두 crate 가 `kebab-parse-md` 로 흡수. 5 사용 type + 3 forward-declared struct 모두 `kebab-parse-md::{types,normalize}` module 의 `pub` re-export 로 보존. wire / surface impact = 0 (CLI / TUI / MCP / `--json` / config / XDG / parser_version 모두 unchanged). 자세한 내용: `tasks/HOTFIXES.md` (2026-05-26 design deviation entry).
- **2026-05-26 v0.18.0 fb-41 multi-hop RAG + NLI verification ship (PR #176-180) + post-PR9 cleanup (PR #181)** — pre-v0.18.0 dogfood (`/build/cache/dogfood-v018/`, 33 assets / 205 chunks, gemma3:4b CPU only / 16 GB RAM) 에서 발견된 S7 caffeine hallucination 의 root cause = LLM-self-judge ceiling (synthesize 가 chunks 와 무관한 Adam optimizer gradient 식을 silent emit, self-judge 가 reject 못함). 학계 표준 (Self-RAG, CRAG, Auto-GDA, MedTrust-RAG) 결론 = deterministic post-synthesis verification. mDeBERTa-v3 XNLI ONNX (280 MB, Xenova HF) 가 `(packed_chunks, answer)` entailment 검사 — `[rag] nli_threshold > 0` (default 0.0 = disabled, production 권장 0.5) 일 때 활성. dogfood retest 측정 — S7 PR-8 baseline `grounded=true + Adam hallucination` → PR-9 `nli_verification_failed, nli_score 0.0035`. wire additive minor — `answer.v1.verification` field + `refusal_reason``nli_verification_failed` / `nli_model_unavailable` 추가, pre-v0.18 reader 무영향. 5 sub-PR 시퀀스 + cleanup PR (clippy::pedantic baseline + 의도적 30+ allow + H1 `[models.nli].model` config wiring + 9 new tests). post-refactor retest = PR-9d byte-identical (deterministic 확인). 자세한 내용: `tasks/HOTFIXES.md` (2026-05-25 fb-41 PR-9 closure entry + S3 follow-up).
- **2026-05-25 v0.17.2 post-v0.17.1 polish (PR #164 + #165)** — v0.17.1 의 두 follow-up closure. (1) `[image.ocr] request_timeout_secs` 별 노브 — `crates/kebab-parse-image/src/ocr.rs::REQUEST_TIMEOUT` hard 300s 제거, LLM 쪽 패턴 (PR #162) 을 OCR 어댑터에 동일 적용. 사용자 결정으로 별 노브 분리 (OCR vs LLM 의 cold start 패턴이 달라 독립 조절). v0.17.1 미진행 항목 closure. (2) `chunks_fts``heading_path` 컬럼이 JSON 표기 + path 세그먼트 까지 trigram 색인 → query false positive 가능 문제 closure. `lexical.rs::build_match_string` 가 non-raw 분기 결과를 `text : (<expr>)` 로 wrap — heading 색인 V007 verbatim 유지, 매칭만 text 한정. 사용자가 명시 heading 검색 하려면 raw mode `'heading_path : <token>'` escape hatch (SKILL.md 갱신). 둘 다 additive (옛 config 호환) / re-ingest 불필요. 자세한 내용: `tasks/HOTFIXES.md` (2026-05-25 v0.17.2 두 entry).
- **2026-05-25 v0.17.1 post-dogfood (PR #162 + #163)** — 확장 도그푸딩 (16 GB CPU only, gemma4:e4b 시도) 에서 발견된 두 follow-up 한 묶음. (1) `crates/kebab-llm-local/src/ollama.rs::REQUEST_TIMEOUT` hard 300s → `[models.llm] request_timeout_secs` config + env override (additive, default 300, `=0` 은 disable 아닌 "즉시 timeout" 이라 doc 명시). (2) README + SMOKE 에 sudo / systemd 없이 ollama 설치 + ≤4B Q4 권장 모델 + `kebab ask --stream` UX 권장 docs. additive only — 옛 config / wire 호환. 자세한 내용: `tasks/HOTFIXES.md` (2026-05-25).
- **2026-05-24 v0.17.0 PR-C `code_lang_chunk_breakdown` additive (closure of 2026-05-22 LOW)** — `schema.v1.stats` 에 chunk 수 집계 신규 키. 기존 `code_lang_breakdown` (doc count) 와 sister. 또 기존 두 필드 JSON schema description 의 "chunk count" 오기재 → "doc count" 로 정정. wire additive — schema_version bump 불필요. 자세한 내용: `tasks/HOTFIXES.md` (2026-05-24 PR-C).
- **2026-05-24 v0.17.0 PR-B C typedef alias unit (closure of 2026-05-21)** — `kebab-parse-code::c::extract_blocks``type_definition` 분기로 inner anonymous struct/enum/union → declarator 의 typedef alias 이름으로 synthetic unit 방출. `PARSER_VERSION code-c-v1``code-c-v2` bump + 같은-asset/다른-doc_id 케이스용 `purge_workspace_path_for_parser_bump` cascade (`stale_chunk_ids_for_workspace_path_except_doc_id` + `purge_document_at_workspace_path_except_doc_id` helper 신규). 사용자 작업 불필요 (다음 ingest 가 자동 재처리). 자세한 내용: `tasks/HOTFIXES.md` (2026-05-24 PR-B).
- **2026-05-24 v0.17.0 PR-A 한국어 trigram tokenizer 채택 (closure of 2026-05-22 한국어 lexical)** — `chunks_fts` 가 FTS5 `unicode61``trigram` 으로 V007 migration (자동 backfill, re-ingest 불필요). `lexical.rs::build_match_string` trigram-aware 재설계 — multi-token 한국어 query (`해시 충돌`) 가 whole-phrase 후보로 hit, 한영 혼합 (`Rust 충돌은`) 도 OR-combined. 2자 이하 query 는 0-hit + CLI/TUI/wire `hint` 안내. 영어 lexical 도 substring 매칭으로 바뀜 (recall ↑ / 단어 경계 ↓). `kebab.sqlite` 크기 ~2-5배 증가 (trigram index). 자세한 내용: `tasks/HOTFIXES.md` (2026-05-24).
- **2026-05-22 P10 종합 도그푸딩 round 2 (한국어 lexical 검색 한계)** — `kebab search --mode lexical` 의 한국어 query 가 FTS5 `unicode61` 토크나이저에서 거의 0 hit (어절 단위 토큰화 → 부분 매칭 불가). 기본 hybrid 모드는 `multilingual-e5-small` vector 가 carry 해 한국어 검색 정상. **closure**: 위 2026-05-24 v0.17.0 entry.
- **2026-05-20 P10-1B (Rust 1A symbol path 비일관 + expression-level 함수 미방출)** — (a) Rust `code-rust-ast-v1` 은 file-scope nesting 만 (workspace path prefix 없음), 1B 의 Python/TypeScript/JavaScript 는 workspace 경로 → module path prefix 사용 (비일관 수용, retrofit = chunker_version bump + reindex 필요, 사용자 명시 요청까지 보류); (b) TS/JS 의 `const foo = () => {...}` 같은 expression-level 함수는 `<top-level>` glue 로 처리됨 (declaration-level 단위만 1B 1차 범위). 자세한 내용: `tasks/HOTFIXES.md` (2026-05-20) 두 항목.
- **2026-05-19 P10-1A-2 (code_rust_ast_v1.rs + SourceType)** — `AST_CHUNK_MAX_LINES` 상수가 `IngestCodeCfg.ast_chunk_max_lines` 를 읽지 않고 모듈 상수 200 고정 (Chunker trait 이 per-medium config 미노출); `SourceType::Code` variant 부재로 code 파일이 `SourceType::Note` 로 분류됨 — 두 항목 모두 `tasks/HOTFIXES.md` (2026-05-19) 에 기록.
- **2026-05-07 fb-26 (progress.rs)** — `Aborted` unconditional writeln (TTY duplicate) + `Completed` TTY no summary fixed; `KEBAB_PROGRESS=plain` env + quiet suppression added
@@ -81,13 +98,66 @@ P0~P5 직렬. P6~P9 P5 이후 병렬 가능.
## 다음 task 후보
- **P9-2 TUI search** — `App.search` slot 채움. Library 의 `/` 가 enable 됨.
- **P9-3 TUI ask** — `App.ask` slot 채움. `?` enable.
- **P9-4 TUI inspect** — `App.inspect` slot 채움. `Enter` enable.
- **P9-5 desktop tauri** — 별도 분기. PDF citation rendering UI 가치 큼.
- **P8 audio brainstorm** — whisper-rs 시스템 dep 받을지 / 외부 transcription endpoint 사용할지 사용자 결정 필요. 사용자 패턴 (책+PDF 위주, audio 의향 없음) 상 후순위.
구조적으로 미완인 component 는 P9-5 하나뿐. 나머지는 도그푸딩 follow-up (아래 "P10 dogfooding 백로그") 또는 사용자 결정 대기.
P9-2/3/4 는 P9-1 의 parallel-safety contract (sub-state slot 패턴) 덕에 병렬 진행 가능 — 같은 `App` 손대지 않음.
- **P9-5 desktop tauri** — 마지막 남은 P9 component. `kebab-desktop` crate + Tauri 앱, 별도 분기. PDF citation rendering UI 가치 큼. 사용자 우선순위 (P9 우선 · 책/PDF 위주) 와 부합.
- **P10 도그푸딩 round 2 follow-up** — ✅ v0.17.0 cut (2026-05-24) 으로 세 항목 모두 closure (한국어 trigram PR-A + C typedef alias PR-B + code_lang_chunk_breakdown additive PR-C). 상세 cross-link: 아래 "P10 dogfooding 백로그" 절 + `tasks/HOTFIXES.md` (2026-05-24 PR-A/B/C).
- **P8 audio brainstorm** — whisper-rs 시스템 dep 받을지 / 외부 transcription endpoint 사용할지 사용자 결정 필요. 사용자 패턴 (책+PDF 위주, audio 의향 없음) 상 보류.
- **fb-41 multi-hop reasoning** — ⏳ 미구현, XL, eval 인프라 선행 + brainstorm 필요.
- **Rust symbol path retrofit** — Rust `code-rust-ast-v1` symbol 이 file-scope-only (1B+ 는 module prefix). `code-rust-ast-v2` bump + Rust corpus re-ingest 비용 → 사용자 명시 요청까지 보류. HOTFIXES `2026-05-20`.
### v0.20.0 sub-item 1 (PDF scanned OCR) 머지 후 priorities (2026-05-28, 사용자 결정)
PR #189 (2026-05-28 머지, commit `09333d0`) 으로 PDF scanned OCR (qwen2.5vl:3b vision LLM) + 4 round bugfix (#2/#3/#4/#6/#7/#9/#10/#11/#13/#14) + ingest log feature 가 main 으로 진입. 다음 작업 순서 = **C → B → A → G**.
- **C — 한국어 morphological tokenizer (Bug #8 follow-up)** ✅ **v0.20.1 머지 완료**.
- V007 trigram 의 ≥3 char query 제약 (HOTFIXES `2026-05-22`) — '한국' 같은 2-char 한국어 query 0 hit → V009 migration + lindera-ko-dic tokenizer + tokenized_korean_text column + first-boot eager backfill 으로 해소. branch `feat/korean-morphological-tokenizer` (8 commit + 5 follow-up).
- scope: search index 재빌드 cascade (corpus_revision bump) + V007 trigram 보존 (backward-compat).
- 사용자 surface: `kebab search` 의 한국어 2자 query ('한국', '서울') 매칭. README + SKILL + release notes 반영.
- **B — OCR dense page coverage** ⏳ C 다음.
- metro-korea.pdf page 8/13 timeout (180s, dense newspaper article). vision LLM 의 output token 과대 → 정상 timeout.
- 가능한 path: (a) per-page `max_pixels` 동적 조정 (high-resolution page 만 축소), (b) column-level sub-region OCR (newspaper layout 분할 후 OCR call 분리), (c) model upgrade (qwen2.5vl:7b — Ollama 모델 변경 + max_pixels trade-off), (d) OCR timeout 점진 축소 (180s → 120s → 90s) — round 마다 p90 측정 후.
- mojibake.pdf `pdf_ocr_pages: 0` (round 1 부터 동일) — text-detect path fallback 강화 검토.
- 별 sub-item.
- **A — v0.20 의 deferred sub-items (frozen design contract)** ⏳ B 다음.
- **sub-item 2** — Multi-region image dispatch (`OcrText.regions` bbox 분리) — image OCR + PDF column-aware OCR.
- **sub-item 3** — PDF normalize integration (`ParsedPdfPage` production caller + `build_canonical_document_from_pdf_pages` + cross-page reference graph).
- **TODO #4** — Per-page image / table extraction (PDF figure / table extract).
- **TODO #5** — Enricher trait 도입 — OCR + caption 의 `Extractor` trait 통합 (post-extract enrichment 의 generalization).
- 각 sub-item 별 spec/plan/executor cycle.
- **G — v0.20.1 patch release + release notes** ⏳ A 머지 후 (또는 C/B 시점에 따라 조기 cut).
- CLAUDE.md release 룰 — sub-item 1 base + bugfix1-4 + log feature + logging r2 누적 → minor surface 변경 다수 + wire schema additive minor + config 신규 → **v0.20.1 patch bump + release notes**.
- 핵심 surface (사용자 도그푸딩 가이드 형식):
- **한국어 2자 query 지원** (`kebab search` 에서 '한국', '서울' 같은 2자 단어 매칭 — V009 morphological tokenizer).
- OCR timeout default 180s (HOTFIXES 2026-05-28).
- `[logging]` config section (default enabled) + `{state_dir}/logs/ingest-{run_id}.ndjson` 자동 생성.
- `[logging] keep_recent_runs` (100) + `retention_days` (30) — OR-on-stale cleanup.
- `ingest_progress.v1.pdf_ocr_finished` 의 4 추가 field (image_byte_size, image_width, image_height, failure_reason) — image_w/h 가 round 2 (PR #190) 에서 실제 capture.
- `schema.v1.models``active_parsers` + `active_chunkers` (additive minor).
- V008 migration — `pdf_ocr_events` table (per-OCR-call historical record).
- 새 wire schemas — `ocr_stats.v1` + `ocr_failures.v1` (CLI inspect 의 emit).
- CLI `kebab inspect ocr-stats` + `kebab inspect ocr-failures` — sweet-spot 점진 분석.
- CLI `--media code` first-class, empty query → `invalid_input`, `--config` missing → `config_not_found` + exit 2.
- capabilities.streaming_ask + single_file_ingest 가 true (이전 false 거짓 정정).
- bump 작업: workspace `Cargo.toml` version → 0.20.1, tag, gitea-release.
### v0.20 후속 bug catalog (non-blocking known)
본 PR #189 dogfood 에서 **falsified** 또는 **design constraint** 로 분류 — fix 안 함:
- Bug #8 (V007 trigram 2-char query 한계) → 위 C 항목.
- Bug #12 (Code block wire `.code` field, `.text` 가 아닌 jq fallback artifact) — falsified.
- ask 한국어 query phrasing-sensitive refusal — RAG corner case / NLI gate behavior. 별도 brainstorm.
### Logging feature enhancements — ✅ closed (PR #190, 2026-05-28 merged commit `7bbdc89a`)
logging round 2 (PR #190) 으로 4 enhancement 모두 closed:
-`image_width` + `image_height` capture (raster JPEG decode).
- ✅ SQLite mirror (V008 `pdf_ocr_events` table + dual-write).
- ✅ CLI query (`kebab inspect ocr-stats` + `ocr-failures``ocr_stats.v1` + `ocr_failures.v1` wire schemas).
- ✅ log retention (`keep_recent_runs` + `retention_days` — file + SQLite cleanup).
### P9 dogfooding 백로그 (fb-26 ~ fb-42) — release 분할
@@ -96,11 +166,20 @@ P9-2/3/4 는 P9-1 의 parallel-safety contract (sub-state slot 패턴) 덕에
- **0.3.0 — agent foundation** ✅ cut 2026-05-07: fb-26 (log), fb-27 (introspection/error wire), fb-28 (readonly/quiet). ~~fb-29 (daemon)~~ → 🚫 **deferred** — fb-30 stdio MCP 가 동일 가치를 daemon 복잡도 없이 제공.
- **0.4.0 — agent integration (MCP)** ✅ cut: fb-30 (MCP stdio), fb-31 (single-file/stdin ingest).
- **0.5.0 — agent surface refinement (additive)** ✅ cut 2026-05-10: fb-32 (stale doc indicator), fb-33 (streaming ask), fb-34 (output budget controls), fb-35 (verbatim fetch), fb-36 (search filter args), fb-37 (trace + stats). 모두 wire schema additive minor.
- **0.6.0 — RAG quality** 🟡 진행: fb-38 (score semantics) ✅ 머지 (2026-05-10), fb-40 (fact-grounded answer / rag-v2 prompt) ✅ 머지 (2026-05-10), fb-39 (retrieval precision tuning, embedding_version cascade) — 미진행 (eval golden set 선행 필요).
- **0.7.0 또는 P+**: fb-41 (multi-hop reasoning, XL), fb-42 (bulk multi-query / rerank, Nice).
- **0.6.0 — RAG quality** ✅ 대부분 머지 (2026-05-10): fb-38 (score semantics) ✅, fb-39 (eval foundation — `precision_at_k_chunk` metric) ✅, fb-39b (embedding upgrade — multilingual-e5-large default) ✅, fb-40 (fact-grounded answer / rag-v2 prompt) ✅. 잔여 = fb-39 retrieval precision lever 실제 적용 (eval golden set 확장 선행 필요).
- **0.7.0 또는 P+**: fb-41 (multi-hop reasoning, XL) — ⏳ 미구현 · brainstorm 필요; fb-42 (bulk multi-query) ✅ 머지 (2026-05-10, bulk only — rerank hint 은 deferred).
각 fb spec frontmatter 의 `target_version` 필드가 source of truth. INDEX.md 의 release subheader 도 동일 grouping.
### P10 dogfooding 백로그 (2026-05-22 round 2)
P10 종합 도그푸딩 round 2 (`/build/cache/dogfood-p10b/`, OSS 8 repo + 한국어 위키 문서 10편) 에서 발견된 follow-up 후보. 자세한 내용 + 우선순위 근거는 `tasks/HOTFIXES.md` (2026-05-22).
- **한국어 lexical tokenizer** — ✅ v0.17.0 (2026-05-24) PR-A 머지 (#159). V007 trigram migration 자동 backfill + `build_match_string` 재설계 + CLI/TUI/wire hint. HOTFIXES `2026-05-24 PR-A` 참조.
- **code_lang_chunk_breakdown chunk 단위 집계 (LOW)** — ✅ v0.17.0 (2026-05-24) PR-C 머지 (#161). `schema.v1.stats` additive 필드. HOTFIXES `2026-05-24 PR-C` 참조.
- **C typedef-wrapped struct (LOW)** — ✅ v0.17.0 (2026-05-24) PR-B 머지 (#160). `type_definition` 분기 + `PARSER_VERSION code-c-v2` bump + orphan purge cascade. HOTFIXES `2026-05-24 PR-B` 참조.
- **ranking glue chunk 편향 (deferred)** — 자동 heuristic 은 user intent misalignment 위험. 사용자 명시 요청 전까지 surface 변경 0 유지. 1주+ 실사용 후 재 brainstorm.
## 검증된 운영 동작 (release binary, fastembed enabled)
P7-3 머지 직후 25 시나리오 smoke 통과 — markdown + image + PDF 5 자산 워크스페이스에서 doctor / ingest / list / inspect / search (lex/vec/hybrid) / re-ingest / byte-edit re-ingest / corrupt PDF / RAG ask + page citation 모두. 자세한 시나리오 표는 conversation 기록 참조; 워크스페이스에 직접 돌려보는 절차는 [docs/SMOKE.md](docs/SMOKE.md).

321
README.md
View File

@@ -1,121 +1,196 @@
# kebab — Local-first Knowledge Base
# kebab — Local-first Knowledge Base + RAG
`kebab` 는 개인용 로컬 knowledge base + RAG 도구다. Markdown / PDF / 이미지를 한 곳에 색인하고, 의미 검색 + page-단위 citation 포함 LLM 답변을 단일 binary 로 제공한다. 모든 추론은 로컬 (Ollama / fastembed) 에서 돌아간다. 대상 하드웨어: M4 48GB MacBook 1대, 사용자 1명.
## 사전 요구
- **Rust toolchain** ≥ 1.85 (workspace 가 edition 2024 + resolver 3 사용). [rustup](https://rustup.rs) 권장.
- **Ollama** — `kebab ask` 와 이미지 OCR/caption 가 사용. `https://ollama.com/download` 에서 설치 후 `ollama serve` 실행. 기본 LLM 은 gemma4 계열 (`ollama pull gemma4:e4b`) — OCR / caption 도 같은 family 라 모델 하나만 pull 하면 됨. 더 큰 variant 원하면 `gemma4:26b` 등으로 config override. config 의 `[models.llm].endpoint` 에 host:port 명시.
- **빌드 디스크** — 첫 빌드 시 `target/` 가 610 GB (Lance + DataFusion + fastembed). 여유 확인.
- **fastembed 모델** — 첫 `kebab ingest``multilingual-e5-large` (~1.3 GB, fb-39b) 자동 다운로드. `config.toml` 에서 `model = "multilingual-e5-small"` 로 명시하면 이전 모델 사용.
## 설치
표준 경로는 `cargo install``~/.cargo/bin/kebab` 가 PATH 에 있는지만 확인하면 끝.
```bash
# 1) repo clone
git clone https://gitea.altair823.xyz/altair823-org/kebab.git
cd kebab
# 2) binary 빌드 + 설치 (~/.cargo/bin/kebab)
cargo install --path crates/kebab-cli --locked
# 3) PATH 확인 (아직 추가 안 했으면 ~/.bashrc / ~/.zshrc 에 추가)
which kebab # → /Users/<you>/.cargo/bin/kebab 같은 경로
kebab --version # → kebab 0.1.0
```
git URL 직접 install 도 가능 (clone 없이):
```bash
cargo install --git https://gitea.altair823.xyz/altair823-org/kebab.git --bin kebab --locked
```
업데이트는 `git pull && cargo install --path crates/kebab-cli --locked --force` 또는 git URL 형식의 경우 `cargo install --git ... --force`.
제거는 `cargo uninstall kebab-cli`. 이 명령은 binary 만 지우고 워크스페이스 데이터는 그대로 남는다. 데이터까지 정리하려면 `kebab reset --all --yes` (config + data + cache + state 4 개 XDG 경로 모두 wipe — **irreversible**, 재시작 시 `kebab init` 다시 실행). 부분 wipe 는 `kebab reset --data-only` (config 보존), `kebab reset --vector-only` (Lance + `embedding_records` 만, 다음 ingest 가 re-embed) 등.
`kebab` 는 개인용 로컬 knowledge base + RAG 도구다. Markdown · PDF · 이미지 · 소스코드를 한 곳에 색인하고, 하이브리드 의미 검색과 근거 인용을 포함 LLM 답변을 **단일 binary** 로 제공한다. 모든 추론은 로컬 (Ollama + fastembed) 에서 돌아간다.
## Quick start
사전 요구는 두 가지뿐이다.
- **Rust toolchain** ≥ 1.85 (workspace 가 edition 2024 사용). [rustup](https://rustup.rs).
- **Ollama** — `kebab ask` 와 이미지/PDF OCR 가 사용. [공식 설치 안내](https://ollama.com/download) 참고 후 `ollama serve` 실행. 기본 LLM family 는 gemma4 (`ollama pull gemma4:e4b`) — OCR/caption 도 같은 family 라 모델 하나면 된다. CPU-only 환경이면 소형 모델 (예: `gemma3:4b`) 을 권장.
```bash
# 첫 실행 — XDG 경로에 데이터 디렉토리 + config.toml 생성
# 1) 빌드 + 설치 (~/.cargo/bin/kebab)
git clone https://gitea.altair823.xyz/altair823-org/kebab.git
cd kebab
cargo install --path crates/kebab-cli --locked
# 2) 데이터 디렉토리 + config.toml 생성 (XDG 경로)
kebab init
# config 손보 — workspace.root, 모델 endpoint 등 설정 (지원 형식: md / png / jpg / pdf / rs / py / ts / js)
# 3) config 최소 손보 — workspace.root (색인할 폴더) 와 LLM endpoint
${EDITOR:-vi} ~/.config/kebab/config.toml
# 색인 (Markdown / 이미지 / PDF 모두 한 번에)
# 4) 색인 (Markdown · PDF · 이미지 · 소스코드 한 번에)
kebab ingest
# 검색 (citation 의 source_span 이 매체별로 line / region / page)
kebab search "Markdown chunking 규칙" --mode hybrid
# 5) 검색 (hybrid = lexical + vector RRF, citation 포함)
kebab search "Markdown chunking 규칙"
# 질문 (Ollama 필요, PDF 인용 시 page 번호 surface)
# 6) 질문 (RAG 답변 + 근거 인용, Ollama 필요)
kebab ask "내 KB 설계에서 저장소 전략은?"
# Ratatui 셸 (Library + Search + Ask + Inspect 패널, desktop 진행 중)
kebab tui
# 헬스 체크 (config 경로 / 데이터 디렉토리 쓰기 가능 여부)
kebab doctor
```
격리된 임시 워크스페이스로 돌려보는 절차는 [docs/SMOKE.md](docs/SMOKE.md) — `--config <path>` 로 분리. 이미지 / PDF fixture 가 필요하면 두 example 바이너리 (`cargo run --release --example gen_smoke_pdf -p kebab-parse-pdf` / `gen_smoke_png -p kebab-parse-image`) 로 시스템 dep 없이 in-tree 생성 가능.
clone 없이 git URL 로 바로 설치할 수도 있다: `cargo install --git https://gitea.altair823.xyz/altair823-org/kebab.git --bin kebab --locked`. 업데이트는 동일 명령에 `--force`. 제거는 `cargo uninstall kebab-cli` (데이터는 보존 — 데이터까지 지우려면 `kebab reset --all --yes`).
설치 없이 dev 흐름으로 돌려볼 때는 `cargo run --release -p kebab-cli -- <subcommand>` 또는 `cargo build --release && ./target/release/kebab <subcommand>`.
설치 없이 dev 흐름으로 돌려볼 때는 `cargo run --release -p kebab-cli -- <subcommand>`. 격리된 임시 워크스페이스로 검증하는 절차는 [docs/SMOKE.md](docs/SMOKE.md) (`--config <path>` 로 분리).
## 핵심 기능
### 하이브리드 검색 + citation
lexical (FTS5 BM25) 과 vector (cosine) 두 채널을 **RRF fusion** 으로 합쳐 검색한다. 모든 hit 은 출처 위치를 매체별로 정확히 담는다 — Markdown/코드는 line, 이미지는 region, PDF 는 page. `--tag` · `--media` · `--lang` · `--path-glob` 등 다양한 필터와 `--max-tokens` · `--cursor` 같은 agent budget flag 를 지원한다.
### 파생물 캐시 (자동)
embedding 벡터를 청크 **내용 해시** 로 캐싱한다 (`derivation_cache`). 재색인·갱신 시 내용이 같은 청크는 재계산을 건너뛴다. 캐시 키에 모델·차원 버전이 포함돼 버전 변경 시 자동 무효화된다 (cascade 안전). 별도 설정 없이 투명하게 동작한다. (현재 TTL/LRU 자동 정리는 미구현 — 누적된 캐시는 `kebab reset` 으로만 정리.)
### 외부 계산 + 로컬 검색 워크플로
search/ask 는 원본 파일 없이 KB 산출물만으로 동작한다 (청크 본문이 SQLite 에 저장되고 문서 경로는 상대경로로 기록됨). 비싼 색인(임베딩·OCR)을 성능 좋은 머신에서 수행한 뒤(예: Apple Silicon 맥에서 candle Metal GPU), **두 산출물만** 다른 머신(예: NUMA 서버)으로 복사하면 그대로 검색·질문할 수 있다.
**무엇을 복사하나 — `[storage]` 에서 정의된 두 경로:**
| 복사 대상 | config 키 (`[storage]`) | 기본 경로 | 내용 |
|-----------|------------------------|-----------|------|
| `kebab.sqlite` | `sqlite = "{data_dir}/kebab.sqlite"` | `{data_dir}/kebab.sqlite` | 문서·청크·본문·FTS5·메타 |
| `lancedb/` | `vector_dir = "{data_dir}/lancedb"` | `{data_dir}/lancedb/` | 임베딩 벡터 |
`{data_dir}``[storage].data_dir` (예: `~/.local/share/kebab`). `models/`(`model_dir``assets/`(`asset_dir`)는 **복사 불필요** — 모델은 각 머신이 자기 캐시를 받고, asset 원본 바이트는 검색·질문에 쓰이지 않는다 (단일파일/`stdin` 색인의 원본 재읽기·재색인까지 보존하려면 `assets/` 도 함께 복사).
```bash
# ingest 가 끝난(쓰기 없는) 상태에서 복사
rsync -a <src-data_dir>/kebab.sqlite user@server:<dst-data_dir>/
rsync -a <src-data_dir>/lancedb/ user@server:<dst-data_dir>/lancedb/
```
조건: **양쪽 동일 `kebab` 버전 + 동일 임베딩 모델/차원** (`[models.embedding].model`·`dimensions`). provider 는 달라도 됨 (예: 맥 `candle`/Metal ↔ 서버 `candle`/CPU 또는 `fastembed` — 같은 모델이면 벡터 호환). 복사는 반드시 ingest 가 돌지 않을 때.
### 멀티미디어 색인
Markdown · PDF · 이미지(OCR + caption) · 소스코드(Rust/Python/TS/JS/Go/Java/Kotlin/C/C++ AST) · 리소스(YAML/Dockerfile/TOML/JSON/XML 등)를 확장자에 따라 자동으로 적절한 chunker 에 라우팅한다. embedded text 가 없는 scanned PDF 는 `[pdf.ocr]` 로 page-단위 OCR (opt-in). 전체 확장자→chunker 매핑은 [docs/ARCHITECTURE.md](docs/ARCHITECTURE.md).
### RAG (근거 인용 + 거절)
검색 결과를 근거로 LLM 답변을 생성하고 [#번호] 인용을 단다. 근거가 부족하면 답을 지어내지 않고 거절한다. compound 질문은 `--multi-hop` 으로 분해→synthesize. 답변의 groundedness 는 mDeBERTa XNLI 로 검증할 수 있다 (`[rag] nli_threshold`, default off).
### TUI
`kebab tui` 는 Ratatui 셸 — Library / Search / Ask / Inspect 패널을 vim-style 모드로 다룬다. 키 매핑은 앱 내 `F1` cheatsheet 가 권위 소스다.
## 명령
| 명령 | 동작 |
|------|------|
| `kebab init` | XDG 경로에 데이터 디렉토리 + config.toml 생성 |
| `kebab ingest [<path>]` | Markdown / 이미지 / PDF / Rust 소스코드 색인 (idempotent). TTY 에서는 stderr 진행 바, non-TTY (CI / pipe) 는 stderr 한 줄씩, `--json` 은 stdout 에 `ingest_progress.v1` 라인 streaming 후 마지막에 `ingest_report.v1`. Ctrl-C 한 번이면 현재 asset 마무리 후 abort (부분 commit 보존, idempotent re-run), 두 번째 Ctrl-C 는 hard exit. Markdown title 이 frontmatter 에 없어도 첫 H1 → H2 → 첫 paragraph 80 자 → 파일명 순으로 자동 채움 (parser_version `md-frontmatter-v2`) — 기존 색인된 doc 도 다음 ingest 에서 새 title 로 갱신. **Incremental** (p9-fb-23): 두 번째 이후의 ingest 는 변하지 않은 doc (blake3 + parser/chunker/embedder version 모두 동일) 의 parse/chunk/embed/vector upsert 를 자동 스킵. final summary 에 `N unchanged` 카운트 표시. `--force-reingest` 로 skip 무시 강제 재처리. **지원 형식** (extractor 자동 결정 — config 에 명시 불가): Markdown (`.md`), 이미지 (`.png` / `.jpg` / `.jpeg`, OCR + caption), PDF (`.pdf`), **소스코드** (`.rs``code-rust-ast-v1`, `.py``code-python-ast-v1`, `.ts`/`.tsx``code-ts-ast-v1`, `.js`/`.mjs`/`.cjs`/`.jsx``code-js-ast-v1` — 모두 tree-sitter AST chunker). 다른 확장자는 자동 skip — `IngestItem.warnings` 에 사유 (`"unsupported media type: .docx"` 등), `IngestReport.skipped_by_extension` 에 카운트 분류, CLI / TUI summary 에 breakdown 표시. 코드 chunk 는 `citation.kind = "code"``citation.lang = "<lang>"` + `symbol` + line range 를 담고, SearchHit top-level 에 `code_lang` + `repo` (`.git/` walk-up 의 디렉토리 이름) 가 backfill 됨. `--code-lang rust` / `--code-lang python` / `--code-lang typescript` / `--code-lang javascript` / `--media code` filter 로 언어별·코드 전용 검색 가능 (p10-1A-1 filter flags). Python symbol 은 workspace 경로 → dotted module path prefix (예: `kebab_eval.metrics.compute_mrr`), TS/JS symbol 은 slash-style module path prefix (예: `src/Foo.Foo.search`). |
| `kebab search --mode {lexical,vector,hybrid} "<query>" [--no-cache] [--max-tokens N] [--snippet-chars N] [--cursor <opaque>] [--tag T] [--lang L] [--path-glob G] [--trust-min LEVEL] [--media TYPE] [--ingested-after RFC3339] [--doc-id ID] [--trace] [--bulk] [--repo NAME ...] [--code-lang LIST]` | 검색. hybrid는 RRF fusion, citation 포함. 같은 process 안에서 동일 query (NFKC + trim + lowercase 정규화) 반복 시 in-process LRU 캐시 hit (capacity = `[search] cache_capacity`, default 256). `--no-cache` 로 강제 bypass — 디버깅용. ingest commit 발생 시 `kv['corpus_revision']` bump 으로 모든 entry 자동 stale. **`--max-tokens` / `--snippet-chars` / `--cursor` (p9-fb-34)** — agent budget controls. `--json` 출력은 `search_response.v1` wrapper (`{hits, next_cursor, truncated}`) — pre-fb-34 의 bare array 와 호환 안 됨. mismatched cursor → `error.v1.code = stale_cursor`. **filter flags (p9-fb-36):** `--tag` 는 반복 가능 flag (`--tag rust --tag async`) 로 OR 매칭, `--media``,` 구분 다중 값 OR 매칭, 나머지 flags 간은 AND 조합. `--trust-min``primary\|secondary\|generated` 중 하나 (해당 level 이상 포함). `--ingested-after` 는 RFC3339 UTC — 파싱 실패 시 `error.v1.code = config_invalid` (exit 2). `--media md``markdown` alias 로 정규화. 알 수 없는 `--media` 값은 무조건 empty hits (오류 아님). **`--trace` (p9-fb-37)** — `search_response.v1.trace` 에 lexical / vector pre-fusion 후보 + RRF union + per-stage timing (`lexical_ms` / `vector_ms` / `fusion_ms` / `total_ms`) 노출. trace 요청은 캐시 우회 (`--no-cache` 없이도 항상 cold). **`--bulk` (p9-fb-42)** — stdin ndjson 으로 N query 한 번에 실행. `--json` 면 stdout per-query ndjson (`bulk_search_item.v1`) + stderr summary (`bulk_summary: total=N succeeded=S failed=F`). Cap 100. agent 가 query decomposition 후 sub-query 일괄 실행 시 single round-trip — App instance 재사용으로 캐시 / embedder cold-start 비용 한 번만. Per-query failure 는 item 의 `error` (error.v1) 에 격리, 다른 query 계속 진행. **code corpus filters (p10-1A-1):** `--repo` 는 반복 가능 (`--repo kebab --repo other`) OR 매칭. `--code-lang` 는 반복 또는 comma 다중 값 (`--code-lang rust,python`), 알 수 없는 값은 빈 hits. `--media code` 는 Tier 1/2/3 모든 code chunk 포함. 1A-1 시점에서는 indexed 된 code chunk 가 없어 filter 가 항상 빈 결과 — 1A-2 (Rust AST chunker) 머지 이후 실효. |
| `kebab ingest [<path>]` | 워크스페이스 스캔 후 새/변경 문서 색인 (idempotent · incremental, `--force-reingest` 로 강제 재처리). 미지원 확장자는 자동 skip. 진행바는 현재 **파일명** · 느린 **phase(ocr/caption/embed)+모델명** · **경과초**`(Ns)` · 문서별 청크 수 · phase별 소요시간(parse/chunk/ocr/caption/embed/store)을 표시하고, 종료 시 **최장 소요 파일 top-5** 를 요약한다 (`--json``asset_phase`/`asset_chunked`/`asset_timings` 이벤트로, 사람용 요약은 미출력) |
| `kebab ingest-file <path>` | 단일 파일 ingest (workspace 외부 가능 — `_external/` 로 deterministic copy) |
| `kebab ingest-stdin --title <T>` | stdin 의 markdown 본문 ingest |
| `kebab search --mode {lexical,vector,hybrid} "<query>" [flags]` | 검색 (default hybrid = RRF fusion, citation 포함). 필터/budget flag 는 `--help` |
| `kebab ask "<query>" [flags]` | RAG 답변 + 근거 인용 (Ollama 필요). `--session` (multi-turn) · `--stream` · `--multi-hop` |
| `kebab list docs` | 색인된 문서 목록 |
| `kebab inspect doc <id>` / `kebab inspect chunk <id>` | raw record 보기 |
| `kebab fetch chunk <id> [--context N]` / `kebab fetch doc <id> [--max-tokens N]` / `kebab fetch span <doc_id> <ls> <le> [--max-tokens N]` | (p9-fb-35) verbatim text fetch from indexed corpus. wire = `fetch_result.v1` (kind discriminator). chunk: target + ±N ordinal-context chunks. doc: full normalized markdown. span: 1-based line range (PDF/audio rejected as `error.v1.code = span_not_supported`). chars/4 budget on doc/span. |
| `kebab ask "<query>" [--show-citations / --hide-citations] [--session <id>] [--stream]` | RAG 답변 + 근거 인용. 답변 후 `근거:` block 으로 full path / line range / score 한 줄씩 (default ON — `--hide-citations` 로 끄기, pipe 시 유용). 근거 부족 시 거절. Ollama 필요. `--session <id>` 로 multi-turn — 첫 호출에서 SQLite `chat_sessions` 에 자동 생성, 이후 호출은 prior turns 를 history 로 받아 follow-up. session id 는 사용자 지정 (e.g. `kb-rust-async-2026-05`) — `kebab reset --data-only` 로 모든 session wipe. **`--stream` (p9-fb-33)** 로 ndjson `answer_event.v1` event (retrieval_done → token* → final) 를 stderr 에 흘리고 stdout 마지막 줄에 기존 `answer.v1` — agent 가 token 즉시 소비 가능 |
| `kebab doctor` | 설정/모델/DB 헬스 체크 |
| `kebab tui` | Ratatui 셸 (Library + Search + Ask + Inspect 패널, desktop 진행 중). Library 에서 `r` 키로 background ingest 시작 — 화면 하단 status bar 가 진행 표시, 완료/abort 시 final 라인 잠시 유지 후 자동 hide. ingest 진행 중 `Esc` / `Ctrl-C` 가 cancel signal (그 외에는 quit). vim-style mode (header 우측 `-- NORMAL --` / `-- INSERT --`) — Library/Inspect 는 자동 NORMAL, Search/Ask 는 자동 INSERT. `i` 로 Normal→Insert (모든 pane — p9-fb-21), `Esc` 로 Insert→Normal 어디서나. mode-authoritative dispatch — Search 의 `j/k/o/g`, Ask 의 `e/j/k` 는 NORMAL 모드에서만 명령으로 동작, INSERT 에서는 입력 문자로 typing. (Search 의 chunk inspect 키는 `i``o` 로 rebind — `i` 가 universal Insert toggle.) **`F1` 로 cheatsheet popup** (현재 pane 의 키 매핑 + global 토글 표) — `Esc` / `F1` 로 닫기. Search 패널은 200ms debounce 후 background worker 가 검색 — 키 입력으로 UI freeze 안 됨, 사용자가 계속 타이핑하면 stale 결과 자동 폐기 (generation counter). Ask 패널은 multi-turn — 같은 conversation 안에서 Q1/A1, Q2/A2 transcript 누적, 다음 질문이 이전 턴을 history 로 받아 답변. 답변 본문은 markdown 렌더 (bold/italic/inline code/heading/list/code fence/table/blockquote, raw `**bold**` 가 실제 굵게 표시). `Ctrl-L` 로 새 conversation 시작. Search 의 `g` 키가 `$EDITOR` (기본 `vi`) 로 hit 의 citation 위치 열기 — 종료 후 TUI 화면이 자동으로 깨끗이 redraw. CLI `kebab ask` 는 raw markdown 그대로 (terminal 호환성 위해). Library 의 doc-list 가 한글 / 일본어 / 중국어 (CJK) 제목을 wide-char 정확한 column width 로 truncate — 한글 제목이 한 줄을 넘기지 않음 (CJK 1 자 = 2 col). Search/Ask/Filter 입력의 cursor 가 wide char 위에서 column 단위로 정렬 — 한글 입력 시 caret 이 글자 옆에 정확히 놓임. `← / →` 로 입력 문자열 중간 cursor 이동 (한글 한 글자 = 2 column 이라도 한 번에 이동), `Home / End` 로 양 끝 점프, `Delete` 로 cursor 위치 char 삭제 — 모든 input pane (Ask / Search / Library filter overlay) 동일 (p9-fb-22). Ask 트랜스크립트는 새 답변이 viewport 아래로 누적될 때 자동으로 tail 을 따라감 (auto-scroll); `j` / `k` 로 위로 스크롤하면 freeze, `Shift-G` 로 다시 bottom + auto-tail 재개. 화면 하단 hint line 은 한국어 동사구로 (`"위로"` / `"아래로"` / `"필터"` / `"타이핑 검색어"` / `"Esc 로 NORMAL 모드"` / `"i 입력모드"` 등) + 현재 (pane, mode) 조합에 맞춰 자동 분기, **첫 fragment 가 항상 `F1 도움말`** (cheatsheet 발견성 보장). 모든 모드에서 항상 떠 있는 상태바 — `kebab v<version> │ <pane> │ <docs> docs │ <state>` (state: streaming/searching/indexing/idle, ingest 진행 중에는 progress 가 같은 자리에 흡수됨). Ask 진입 시 conversation id 8 자 prefix 도 함께 표시. Ask 트랜스크립트와 Inspect 양쪽에서 `PgUp / PgDn` 으로 10 줄씩 페이지 스크롤. Library 의 doc list 위에는 `TITLE / TAGS / UPDATED / CHUNKS` 컬럼 헤더 행 표시 (display-width 정렬, Hangul / CJK 안전). |
| `kebab reset [--all / --data-only / --vector-only / --config-only] [--yes]` | XDG 데이터 wipe. **Irreversible.** TTY 면 confirm prompt, 아니면 `--yes` 필수. `--vector-only` 는 SQLite `embedding_records` 도 함께 truncate (orphan 방지) |
| `kebab eval run / compare` | golden query 회귀 측정 |
| `kebab schema [--json]` | introspection — wire schemas / capabilities / models / stats 한 번에. `--json``schema.v1` wire; 사람 모드는 서식 출력. **stats 에 (p9-fb-37) `media_breakdown` (5 keys: markdown / pdf / image / audio / other) + `lang_breakdown` (BCP-47 코드, NULL 은 literal `"null"`) + `index_bytes` (sqlite + lancedb on-disk 합계) + `stale_doc_count` (`config.search.stale_threshold_days` 초과 doc 수) 추가.** |
| `kebab ingest-file <path>` | 단일 파일 ingest (workspace 외부 가능). 바이트는 `<workspace.root>/_external/<hash12>.<ext>` 로 copy. `.kebabignore` 매치 시 stderr warn 후 진행 (explicit ingest 가 bypass intent). |
| `kebab ingest-stdin --title <T> [--source-uri <URI>]` | stdin 의 markdown 본문 ingest. frontmatter (title + source_uri) 자동 prepend. v1 markdown only. |
| `kebab mcp` | MCP (Model Context Protocol) stdio server. agent host (Claude Code / Cursor / OpenAI Agents) 가 spawn 하여 tool 호출 (`search` / `bulk_search` / `ask` / `fetch` / `schema` / `doctor` / `ingest_file` / `ingest_stdin`). `--config` honor. |
| `kebab inspect doc <id>` / `inspect chunk <id>` | raw record 보기 |
| `kebab fetch chunk\|doc\|span <id> [flags]` | indexed corpus 에서 verbatim text fetch |
| `kebab eval run \| aggregate \| compare \| variants` | golden query 회귀 측정 + 변형 일관성 진단 |
| `kebab schema [--json]` | introspection — wire schemas / capabilities / models / stats |
| `kebab doctor` | 설정 / 모델 / DB 헬스 체크 |
| `kebab tui` | Ratatui 셸 (Library / Search / Ask / Inspect) |
| `kebab mcp` | MCP stdio server (`search` / `bulk_search` / `ask` / `fetch` / `schema` / `doctor` / `ingest_file` / `ingest_stdin`) |
| `kebab reset [--all \| --data-only \| --vector-only \| --config-only \| --orphans-only] [--yes]` | XDG 데이터 wipe (**irreversible**) |
모든 명령에 `--json` 플래그. 출력은 frozen wire schema v1 (`schema_version` 항상 포함, 예: `ingest_report.v1`, `ingest_progress.v1`, `search_hit.v1`, `answer.v1`, `doctor.v1`, `reset_report.v1`, `schema.v1`). `--json` 모드에서 fatal error 는 stderr 에 `error.v1` ndjson 으로 emit (exit code 0/1/2/3 unchanged).
모든 명령에 `--json` 플래그가 있고, 출력은 frozen **wire schema v1** 을 따른다 (`schema_version` 항상 포함). `--json` 모드에서 fatal error 는 stderr 에 `error.v1` ndjson 으로 emit (exit code 0/1/2/3 불변). 글로벌 flag: `--readonly` (write-path 비활성화), `--quiet` (human stderr 억제), env `KEBAB_PROGRESS=plain`. 전체 flag·wire 의미는 `kebab <cmd> --help` 와 [docs/wire-schema/v1/](docs/wire-schema/v1/). 외부 agent 통합(Claude Code skill / MCP)은 [docs/mcp-usage.md](docs/mcp-usage.md) 와 [integrations/](integrations/).
글로벌 플래그: `--readonly` (또는 `KEBAB_READONLY=1`) — 모든 write-path 명령 (`ingest` / `ingest-file` / `ingest-stdin` / `reset`) 을 비활성화, exit 1. `--quiet` — 진행 바 / hint 등 human-readable stderr 억제 (exit code / stdout 출력은 그대로). `KEBAB_PROGRESS=plain` — TTY 가 없는 환경에서도 진행 상황을 plain-text 한 줄씩 stderr 로 출력 (spinner 대신).
## Configuration
### Score 해석 (fb-38)
`~/.config/kebab/config.toml``kebab init` 가 XDG 경로에 생성한다. 핵심 노브만 정리한다 (전체 절은 생성된 파일 주석 참고, 예시는 [docs/SMOKE.md](docs/SMOKE.md)).
`search_hit.v1.score`**ranking signal** 이지 confidence 가 아니다. `score_kind` 필드로 의미 선언:
```toml
[workspace]
root = "~/KnowledgeBase" # 색인할 폴더. 절대 / tilde / env / 상대 경로 가능.
# 상대 경로의 base 는 config.toml 위치 (cwd 무관).
| `score_kind` | 의미 | 범위 |
|--------------|------|------|
| `rrf` (hybrid) | RRF normalized | `[0, 1]`, ceiling = 1.0 (양 채널 rank=1) |
| `bm25` (lexical) | raw BM25 | unbounded (≥ 0) |
| `cosine` (vector) | cosine sim | `[-1, 1]` |
#### RRF 수식 (hybrid mode)
```
chunk c 의 raw RRF = Σ_m 1 / (k_rrf + rank_m(c))
여기서 m ∈ {lexical, vector}, k_rrf = config.search.rrf_k (default 60).
양 채널 모두 rank=1 일 때 raw RRF = 2 / (k_rrf + 1) ≈ 0.0328.
normalize: rrf_score = raw_rrf / (2 / (k_rrf + 1))
→ rrf_score ∈ [0, 1]. 양쪽 rank=1 → 1.0, 한 쪽만 등장 → ≈ 0.5 천장.
[models.embedding]
provider = "fastembed" # "fastembed"(기본, onnxruntime) / "candle"(순수 Rust)
# / "ollama"(원격 HTTP) / "none"(lexical-only).
# candle 는 같은 모델·같은 벡터를 순수 Rust 로 돌려
# NUMA 서버의 onnxruntime 48-스레드 double-free 를 피하는
# opt-in 백엔드 (e5 는 재색인 불필요).
model = "multilingual-e5-large" # 다국어 sentence embedding (1024-dim).
# 첫 ingest 시 ONNX (~1.3GB) 자동 다운로드.
# candle provider 는 safetensors (~2GB) 다운로드.
# candle/ollama 는 "snowflake-arctic-embed-l-v2.0"
# (설명형 query 의 recall 보강) 도 지원 — 아래 참고.
dimensions = 1024 # config 와 LanceDB stored dim 불일치 시 검색 0건.
num_threads = 0 # candle 전용 CPU 스레드 캡 (0=auto=#cores).
# env KEBAB_EMBED_THREADS 가 우선. NUMA 노드 바인딩은
# numactl 과 조합. fastembed provider 는 무시.
# endpoint = "http://127.0.0.1:11434" # provider="ollama" 전용 HTTP endpoint.
# 생략 시 [models.llm].endpoint 로 폴백.
# fastembed/candle provider 는 무시.
```
`rrf_score = 0.5` 의 의미: chunk 가 한 채널 (lexical 또는 vector) 에서만 rank 1 로 등장. confidence 50% 가 아님 — RRF 수식의 산술적 천장.
**arctic-embed-l-v2.0 (설명형 query recall 보강)**: 기본 e5-large 대신
Snowflake `arctic-embed-l-v2.0` 임베더를 쓸 수 있다 (1024-dim, opt-in). 측정에서
설명형/약어/영문 용어 query 의 recall@10 이 e5 대비 향상됐다. 두 경로:
agent 가 trust threshold 가 필요하면 top-level `score` 가 아닌 nested `retrieval.lexical_score` (BM25 raw) / `retrieval.vector_score` (cosine raw) 사용.
```toml
# (A) candle 백엔드 — 순수 Rust, in-process (NUMA 안전, Metal GPU 가능):
[models.embedding]
provider = "candle"
model = "snowflake-arctic-embed-l-v2.0" # CLS pooling, query 에 "query: " 접두어
# (문서는 무접두어). safetensors ~2GB 다운로드.
## 논리 아키텍처
# (B) ollama 백엔드 — 원격/로컬 Ollama 데몬에 위임 (POST /api/embed):
[models.embedding]
provider = "ollama"
model = "snowflake-arctic-embed2" # Ollama 모델 태그 (ollama pull 필요)
endpoint = "http://127.0.0.1:11434" # 생략 시 [models.llm].endpoint
```
> ⚠️ e5 → arctic 전환은 `embedding_version` cascade 를 트리거한다 (모델이 다르면
> 벡터도 다름). 기존 e5 KB 와 혼용 불가 — 전환 시 **재색인** 필요 (`kebab reset`
> 후 재 ingest). 기본값은 e5 라 기존 사용자는 영향 없음.
**Apple Silicon GPU 가속 (candle / macOS)**: M-시리즈 맥에서 candle 임베딩을
GPU(Metal)로 돌리면 CPU 대비 대용량 ingest 가 크게 빨라진다. 빌드 또는 설치 시
`embed_metal` feature 를 켠다:
```bash
# 빌드만:
cargo build --release --features embed_metal
# 전역 설치 (~/.cargo/bin/kebab):
cargo install --path crates/kebab-cli --features embed_metal --locked
```
벡터는 CPU candle 과 동일 모델이라 호환되므로, 맥에서 GPU 로 색인한
`kebab.sqlite` + `lancedb/` 를 그대로 Linux 서버(CPU candle)로 복사해 질의할 수
있다. 색인 로그에 `candle device = Metal (GPU)` 가 보이면 GPU 사용 중. metal
feature 는 macOS 전용 (Linux/서버는 기본 CPU 빌드).
```toml
[models.llm]
endpoint = "http://localhost:11434" # Ollama host:port
model = "gemma4:e4b"
# request_timeout_secs = 300 # 큰 모델은 늘림. 0 은 disable 이 아니라 "즉시 timeout".
[search]
stale_threshold_days = 30 # search hit / citation 의 stale 플래그 기준 (0 = off).
[rag]
prompt_template_version = "rag-v3" # 답변 언어 = 질문 언어. rag-v1/v2 는 legacy.
nli_threshold = 0.0 # >0 (예: 0.5) 면 mDeBERTa XNLI groundedness 검증.
```
- **파생물 캐시** — embedding 결과를 내용 해시로 자동 캐싱한다 (위 「핵심 기능」 참고). 설정 항목 없음.
- **`[ingest.code]`** — code ingest 의 skip 정책 (`skip_generated_header`, `max_file_bytes`, `extra_skip_globs`). `.gitignore` 자동 honor, `.kebabignore` 는 추가 layer.
- **`[pdf.ocr]`** — scanned PDF 의 page-단위 OCR (default off / opt-in, page 당 ~수십 초 cost). 활성화 후 v0.19 시절 색인분은 `kebab ingest --force-reingest` 로 재처리.
- **`--config <path>`** — 임시 워크스페이스 / 격리 테스트용 (CLI · TUI 모두 honor).
- **`kebab config migrate`** — 새 버전에서 추가된 config 섹션을 기존 `config.toml` 에 설명 주석과 함께 채워 넣는다 (사용자가 손본 값·주석·순서는 보존, 멱등, 변경 시 자동 `.bak` 백업). `--dry-run` 으로 변경 미리보기. `kebab doctor` 가 갱신 필요 시 안내한다. `kebab init` 으로 새로 생성되는 config.toml 도 섹션별 주석을 포함한다.
- **`KEBAB_*` env** — 일부 키 override (`KEBAB_RAG_SCORE_GATE`, `KEBAB_EVAL_GOLDEN` 등).
- **XDG layout**: `~/.config/kebab/`, `~/.local/share/kebab/`, `~/.cache/kebab/`, `~/.local/state/kebab/`.
## 아키텍처
```mermaid
flowchart TB
@@ -132,7 +207,7 @@ flowchart TB
subgraph Pipeline["도메인 + 파이프라인"]
parse["parse-md / parse-pdf / parse-image / parse-code"]
chunker["chunker (md-heading-v1, pdf-page-v1, code-rust-ast-v1, code-python-ast-v1, code-ts-ast-v1, code-js-ast-v1)"]
chunker["chunker (md / pdf / code-AST / manifest)"]
embedder["embedder (fastembed multilingual-e5-large)"]
retriever["retriever (lexical / vector / hybrid RRF)"]
rag["RAG pipeline"]
@@ -174,70 +249,22 @@ flowchart TB
rag --> ollama
```
`kebab-app` 가 facade — UI binary 가 store / parse / search / llm / rag 를 직접 참조하지 않는다 (frozen 설계 §8). 자세한 crate-level 의존성 + 디렉토리 + 핵심 기술 결정은 [docs/ARCHITECTURE.md](docs/ARCHITECTURE.md).
v0.21.0 기준 핵심 설계:
## Configuration
- **crate facade** — `kebab-app` 가 유일한 facade다. UI binary (`kebab-cli` / `kebab-tui`) 는 store / parse / search / llm / rag 를 직접 참조하지 않는다 (frozen 설계 §8). 각 user-facing 엔트리는 `*_with_config(cfg, …)` 동반 함수로 explicit config 를 thread 한다.
- **chunk_id 는 위치 기반** — chunk 의 정체성은 문서 내 위치(ordinal + span)다. 반면 파생물 캐시 키는 **내용 해시**라, 내용이 같으면 위치·문서가 달라도 동일 캐시를 재사용한다.
- **wire schema v1** — 모든 `--json` 출력은 `schema_version` 을 담는 frozen contract다. 깨는 변경은 `*.v2` major bump을 요구한다.
- **versioning cascade** — `parser_version` / `chunker_version` / `embedding_version` / `prompt_template_version` / `index_version` 변경은 downstream record(청크·임베딩·캐시·eval)를 무효화한다.
- `~/.config/kebab/config.toml``kebab init` 가 XDG 경로에 생성. `[workspace]` (root, exclude — include 필드는 제거됨, 지원 형식은 자동 결정), `[storage]`, `[chunking]`, `[models.embedding]`, `[models.llm]`, `[image.ocr]`, `[image.caption]`, `[search]`, `[rag]`, `[ui]` 절.
- `[models.embedding]`
- `model` (default `"multilingual-e5-large"`, fb-39b) — 다국어 sentence embedding 모델. 1024-dim. ONNX (~1.3 GB) 첫 실행 시 fastembed cache (`config.storage.model_dir/fastembed/`) 에 자동 다운로드. `"multilingual-e5-small"` (384 dim) 는 backwards-compat 으로 사용 가능 — TOML 에 명시.
- `dimensions` (default `1024`) — 모델의 embedding 차원. config 와 LanceDB stored dim 불일치 시 검색 결과 0 건 (orphan table). 모델 변경 시 `kebab reset --vector-only && kebab ingest` 로 vector index 재구축 권장.
- `[ui] theme = "dark" | "light"` 로 TUI 팔레트 선택 (default `"dark"`, 알 수 없는 값은 dark fallback).
- `[search] stale_threshold_days = 30` (p9-fb-32) — search hit / RAG citation 의 `stale` 플래그 기준 (default 30 일, `0` 으로 비활성화). 옛 config 의 `workspace.include = [...]` 은 silently 무시 + 단발 deprecation warning (p9-fb-25).
- `[ingest.code]` (p10-1A-1) — code ingest 의 skip 정책 + chunker 기본값.
- `skip_generated_header = true` — 첫 ~512 byte 의 generated marker (`@generated` / `DO NOT EDIT` 등) 감지 시 skip.
- `max_file_bytes = 262144` (256 KiB) / `max_file_lines = 5000` — 파일당 cap, 초과 시 skip.
- `extra_skip_globs = []` — 사용자 추가 skip 패턴 (`.gitignore` 문법).
- `.gitignore` honor: 자동 적용. `.kebabignore` 는 추가 layer. 우선순위: built-in safety net (`node_modules/` / `target/` / `__pycache__/` / `.venv/` / `venv/` / `env/`) > `.gitignore` > `.kebabignore`.
- `[rag] prompt_template_version` (default `"rag-v2"`) — RAG system prompt version. `"rag-v1"` 은 legacy backwards-compat (사용자 명시 시 유지). v2 강화 규칙: (1) fact 인용 시 [#번호] 앞에 chunk 속 원문 큰따옴표 표기, (2) 학습 지식 동원 금지, (3) 근거 모호 시 "확실하지 않다" 명시.
- `--config <path>` flag — 임시 워크스페이스 / 격리 테스트 시 사용. CLI / TUI 모두 honor.
- `KEBAB_*` env — 일부 키 override (`KEBAB_RAG_SCORE_GATE`, `KEBAB_EVAL_GOLDEN`, `KEBAB_COMMIT_HASH` 등).
- XDG layout: `~/.config/kebab/`, `~/.local/share/kebab/`, `~/.cache/kebab/`, `~/.local/state/kebab/`.
- `workspace.root` 경로 형식: 절대 (`/foo/bar`) / tilde (`~/KnowledgeBase`, default) / env (`${XDG_DATA_HOME}/kebab`) / 상대 (`./notes`, `notes`, `../shared/x`) 모두 가능. **상대 경로의 base 는 config.toml 자체가 위치한 디렉토리** — 사용자의 `cwd` 와 무관 (`--config /tmp/cfg.toml` + `root = "kb"``/tmp/kb`). p9-fb-05 정책.
config 예시는 [docs/SMOKE.md](docs/SMOKE.md) 의 `/tmp/kebab-smoke/config.toml` 블록 참조.
## 외부 AI 통합
`--json` 출력 + frozen wire schema v1 가 stable contract. 통합 옵션:
- **Claude Code skill** — repo 의 [`integrations/claude-code/`](integrations/claude-code/) 가 ship-ready skill. `cp -r integrations/claude-code/kebab ~/.claude/skills/` 한 번이면 새 Claude Code 세션부터 자동 trigger (내부 시스템 / 위키 lookup / 사내 runbook 질문). multi-turn 은 `kebab ask --session <id> --json` 으로 영속 — skill 이 conversation id 관리하면 외부 agent 도 `--repl` 없이 stateful 대화 가능 (p9-fb-18).
- **Codex / 기타 agent host** — `--json` + frozen wire schema v1 가 stable contract. 동일 패턴으로 ~50줄 wrapper 작성 가능. `integrations/<host>/` 에 추가 PR 환영.
- **MCP server** — stdio JSON-RPC 로 `kebab-app` facade 1:1 노출. `kebab mcp` 참조.
- **HTTP wrapper** — `kebab serve --bind 127.0.0.1:7711` (P+, local-only 가치 신중).
## MCP 사용
`kebab mcp` 가 stdio MCP server. 8 tool: `search` / `bulk_search` (p9-fb-42 — N query 한 번에) / `ask` / `fetch` (p9-fb-35) / `schema` / `doctor` / `ingest_file` / `ingest_stdin`.
Claude Code 빠른 등록 (`~/.claude/mcp.json` 또는 host 동등 위치):
```json
{
"mcpServers": {
"kebab": {
"command": "kebab",
"args": ["mcp"]
}
}
}
```
자세한 사용법 (Cursor / OpenAI Agents / Copilot CLI config, per-tool 입출력 예시, troubleshooting, multi-turn ask + session 관리, performance / security) — **[docs/mcp-usage.md](docs/mcp-usage.md)** 참조.
crate-level 의존성 그래프 · 디렉토리 트리 · 확장자→chunker 전체 매핑 · 핵심 기술 결정은 [docs/ARCHITECTURE.md](docs/ARCHITECTURE.md), 진척도는 [HANDOFF.md](HANDOFF.md).
## 비-목표
다중 사용자 SaaS / K8s / 원격 vector DB / enterprise RBAC / 실시간 협업 / 모든 파일 포맷의 완벽한 parsing / agent 임의 파일 수정 / multi-workspace / LLM-as-judge eval / CLIP 시각 embedding / `kebab://` protocol handler — frozen 설계 §11 / §0 참조.
다중 사용자 SaaS / K8s / 원격 vector DB / enterprise RBAC / 실시간 협업 / agent 임의 파일 수정 / multi-workspace / LLM-as-judge eval / CLIP 시각 embedding — frozen 설계 §0 / §11 참조.
## 라이선스
## 버전 / 라이선스 / 참고
`MIT OR Apache-2.0` (workspace `Cargo.toml``license` 필드).
## 참고
- 진척도: [HANDOFF.md](HANDOFF.md)
- 아키텍처: [docs/ARCHITECTURE.md](docs/ARCHITECTURE.md)
- Frozen 설계: [docs/superpowers/specs/2026-04-27-kebab-final-form-design.md](docs/superpowers/specs/2026-04-27-kebab-final-form-design.md)
- Task 인덱스: [tasks/INDEX.md](tasks/INDEX.md)
- 머지 후 hotfix 로그: [tasks/HOTFIXES.md](tasks/HOTFIXES.md)
- Smoke 절차: [docs/SMOKE.md](docs/SMOKE.md)
- **버전**: v0.21.0 (`kebab --version` 으로 확인).
- **라이선스**: `MIT OR Apache-2.0`.
- 진척도: [HANDOFF.md](HANDOFF.md) · 아키텍처: [docs/ARCHITECTURE.md](docs/ARCHITECTURE.md) · Frozen 설계: [docs/superpowers/specs/2026-04-27-kebab-final-form-design.md](docs/superpowers/specs/2026-04-27-kebab-final-form-design.md)
- Task 인덱스: [tasks/INDEX.md](tasks/INDEX.md) · Hotfix 로그: [tasks/HOTFIXES.md](tasks/HOTFIXES.md) · Smoke 절차: [docs/SMOKE.md](docs/SMOKE.md) · MCP 사용: [docs/mcp-usage.md](docs/mcp-usage.md)

View File

@@ -12,17 +12,22 @@ kebab-core = { path = "../kebab-core" }
kebab-config = { path = "../kebab-config" }
kebab-source-fs = { path = "../kebab-source-fs" }
kebab-parse-md = { path = "../kebab-parse-md" }
kebab-parse-types = { path = "../kebab-parse-types" }
kebab-normalize = { path = "../kebab-normalize" }
kebab-chunk = { path = "../kebab-chunk" }
kebab-store-sqlite = { path = "../kebab-store-sqlite" }
kebab-store-vector = { path = "../kebab-store-vector" }
kebab-search = { path = "../kebab-search" }
kebab-embed = { path = "../kebab-embed" }
kebab-embed-local = { path = "../kebab-embed-local" }
kebab-embed-candle = { path = "../kebab-embed-candle" }
kebab-embed-ollama = { path = "../kebab-embed-ollama" }
kebab-llm = { path = "../kebab-llm" }
kebab-llm-local = { path = "../kebab-llm-local" }
kebab-rag = { path = "../kebab-rag" }
# p9-fb-41 PR-9c-2: facade construction of OnnxNliVerifier when
# `[rag] nli_threshold > 0`. Trait-only consumption via kebab-rag's
# `with_verifier`; no kebab-nli internals leak into kebab-app code
# beyond the construction site in `open_with_config`.
kebab-nli = { path = "../kebab-nli" }
# P6-4: image extractor + OCR + caption adapters live here. App
# threads them into the per-asset dispatch (see `ingest_one_asset`
# image branch). Trait-only consumption — no `kebab-parse-image`
@@ -32,6 +37,11 @@ kebab-parse-image = { path = "../kebab-parse-image" }
# per-asset dispatch (see `ingest_one_asset` PDF branch) and runs the
# resulting `CanonicalDocument` through `kebab-chunk::PdfPageV1Chunker`.
kebab-parse-pdf = { path = "../kebab-parse-pdf" }
lopdf = { workspace = true }
# Enhancement 1 (v0.20.x r2): JPEG dimension decode in pdf_ocr_apply.rs.
# jpeg feature added explicitly (F3 closure-r1) rather than relying on
# feature unification via kebab-parse-image.
image = { version = "0.25", default-features = false, features = ["png", "jpeg"] }
# p10-1A-2: Rust AST extractor lives here. App threads it into the
# per-asset dispatch (see `ingest_one_asset` Code branch) and runs the
# resulting `CanonicalDocument` through `kebab-chunk::CodeRustAstV1Chunker`.
@@ -41,6 +51,7 @@ blake3 = { workspace = true }
serde = { workspace = true }
serde_json = { workspace = true }
time = { workspace = true }
uuid = { workspace = true }
tracing = { workspace = true }
tracing-subscriber = { version = "0.3", features = ["env-filter", "fmt", "json"] }
tracing-appender = "0.2"
@@ -58,21 +69,40 @@ unicode-normalization = "0.1"
ignore = "0.4"
# p9-fb-34: opaque pagination cursor encodes payload as base64.
base64 = { workspace = true }
# Enhancement 3 (v0.20.x r2): direct SQL queries for inspect_ocr_stats/failures.
rusqlite = { workspace = true }
[dev-dependencies]
kebab-config = { path = "../kebab-config" }
# doc-side expansion (Phase 2) Task 4: ExpansionGenerator unit tests build
# MockLanguageModel (gated behind kebab-llm's `mock` feature, default OFF in
# [dependencies]). Enabling it here turns it on for the test build only.
kebab-llm = { path = "../kebab-llm", features = ["mock"] }
rusqlite = { workspace = true }
filetime = "0.2"
tempfile = { workspace = true }
# Image-pipeline integration tests use wiremock to stub Ollama for OCR
# / caption HTTP calls. Async runtime to host the mock server only;
# the kb-app code under test stays sync.
wiremock = { workspace = true }
tokio = { workspace = true, features = ["rt-multi-thread"] }
image = { version = "0.25", default-features = false, features = ["png"] }
image = { version = "0.25", default-features = false, features = ["png", "jpeg"] }
# P7-3 PDF integration tests build in-memory PDF fixtures via the same
# lopdf builder pattern `kebab-parse-pdf::tests::common` uses; pinned
# to the same major (0.32) so byte output is identical between the two
# fixture surfaces.
lopdf = "0.32"
lopdf = { workspace = true }
# error_wire::tests::llm_unreachable_classifies_to_model_unreachable needs a real
# reqwest::Error (private constructor) — built from a connect-refused call.
reqwest = { version = "0.12", default-features = false, features = ["blocking", "rustls-tls"] }
[features]
# Marker feature — spec §6.3 Option A (단순): lindera 는 kebab-chunk 가 default dep 으로 소유.
# disable path 없음; 이 feature 는 spec §6.3 명시를 honor 하는 role 만.
default = ["fts_korean_morphological"]
fts_korean_morphological = []
# opt-in (macOS): candle embedder runs on the Apple Silicon GPU. See kebab-embed-candle.
embed_metal = ["kebab-embed-candle/metal"]
[lints]
workspace = true

View File

@@ -40,11 +40,19 @@ use anyhow::{Context, Result, anyhow};
use lru::LruCache;
use kebab_core::{
Answer, DocumentStore, Embedder, IndexVersion, LanguageModel, Retriever, SearchHit,
SearchMode, SearchOpts, SearchQuery, VectorStore,
Answer, DocumentStore, Embedder, ExtractContext, Extractor, IndexVersion, LanguageModel,
MediaType, Retriever, SearchHit, SearchMode, SearchOpts, SearchQuery, VectorStore,
};
use kebab_embed_candle::CandleEmbedder;
use kebab_embed_local::FastembedEmbedder;
use kebab_embed_ollama::OllamaEmbedder;
use kebab_llm_local::OllamaLanguageModel;
use kebab_parse_code::{
CAstExtractor, CppAstExtractor, GoAstExtractor, JavaAstExtractor, JavascriptAstExtractor,
KotlinAstExtractor, PythonAstExtractor, RustAstExtractor, TypescriptAstExtractor,
};
use kebab_parse_image::ImageExtractor;
use kebab_parse_pdf::PdfTextExtractor;
use kebab_rag::{AskOpts, RagPipeline};
use kebab_search::{HybridRetriever, LexicalRetriever, VectorRetriever};
use kebab_store_sqlite::SqliteStore;
@@ -73,6 +81,14 @@ pub struct SearchResponse {
/// p9-fb-37: present when caller passed `SearchOpts.trace = true`.
/// Consumers that ignore trace should leave this `None`.
pub trace: Option<kebab_core::SearchTrace>,
/// v0.17.0 A5 Step 4b: human / agent-readable advisory string set
/// when the empty hit list is likely due to a query shorter than the
/// FTS5 trigram tokenizer's 3-char minimum. `None` otherwise. CLI
/// surfaces it on stderr (text mode); MCP / `--json` consumers
/// surface it however they prefer. See
/// `docs/superpowers/specs/2026-05-22-korean-trigram-tokenizer-design.md`
/// §3.3.
pub hint: Option<String>,
}
/// Facade state — see module docs for lifetime rules.
@@ -84,6 +100,12 @@ pub struct SearchResponse {
pub struct App {
pub(crate) config: kebab_config::Config,
pub(crate) sqlite: Arc<SqliteStore>,
/// post-v0.18.0 extractor-dispatch-unification: polymorphic Extractor
/// registry. App init 시 1회 등록되어 `extract_for(...)` 가 lookup
/// 한다. 현재 11 entry (ImageExtractor + PdfTextExtractor + 9 AST).
/// MarkdownExtractor 는 별 PR 에서 추가 — markdown ingest path 는
/// 본 PR 에서 free-function 그대로 유지.
pub(crate) extractors: Vec<Box<dyn Extractor + Send + Sync>>,
/// Memoized embedder — built lazily on first `embedder()` call when
/// embeddings are enabled. `OnceLock` keeps the struct `Sync` and
/// the build path cold-only-once.
@@ -102,6 +124,17 @@ pub struct App {
/// `corpus_revision` snapshot embedded in `SearchCacheKey`
/// invalidates every entry the moment a new ingest commit lands.
search_cache: Option<Mutex<LruCache<SearchCacheKey, Vec<SearchHit>>>>,
/// p9-fb-41 PR-9c-2: NLI verifier built eagerly at
/// `open_with_config` time when `config.rag.nli_threshold > 0`,
/// consumed by `RagPipeline::with_verifier` on every `ask` /
/// `ask_with_session` call. `None` when the gate is disabled
/// (default, threshold = 0) — multi-hop skips step 8.5 entirely
/// and single-pass never touches the verifier.
///
/// Built eagerly (not lazy) so the `open_with_config` `?`
/// propagation surfaces NLI model construction errors at App
/// boot time, before any user query runs.
pipeline_verifier: Option<Arc<dyn kebab_nli::NliVerifier>>,
}
/// p9-fb-19: cache key for `App::search`. Includes every field that
@@ -158,20 +191,108 @@ impl App {
sqlite
.run_migrations()
.context("kb-app: run SqliteStore migrations")?;
// V009 의 tokenized_korean_text column 의 first-boot eager backfill.
// 신규 ingest 의 chunks_ai trigger 가 이미 채우므로 NULL row 가 없으면 즉시 0 반환 (idempotent).
// V007 → V009 업그레이드 시 KB 크기 비례 (~10000 chunk 당 ~30-60s).
let backfill_count = sqlite
.backfill_tokenized_korean_text(
|done, total| {
if total > 0 && done % 500 == 0 {
tracing::info!(
target: "kebab-app",
"korean tokenizer backfill: {done}/{total}"
);
}
},
kebab_chunk::tokenize_korean_morphological,
)
.unwrap_or_else(|e| {
tracing::warn!(
target: "kebab-app",
"korean tokenizer backfill failed: {e}"
);
0
});
if backfill_count > 0 {
tracing::info!(
target: "kebab-app",
"korean tokenizer backfill complete: {backfill_count} chunks updated"
);
}
// p9-fb-19: build the LRU cache from config. Capacity 0 →
// `None` (cache disabled — every search hits the retrievers).
let search_cache = NonZeroUsize::new(config.search.cache_capacity)
.map(|cap| Mutex::new(LruCache::new(cap)));
// post-v0.18.0 extractor-dispatch-unification: build the 11-entry
// Extractor registry. All entries are state-less unit structs with
// zero-cost `new()`, so init cost is effectively 0 and side effects
// are 0 — `pipeline_verifier` fallible `?` below may bail but the
// already-constructed `extractors` Vec drops without cost. Markdown
// is NOT registered (see field doc).
let extractors: Vec<Box<dyn Extractor + Send + Sync>> = vec![
Box::new(ImageExtractor::new()),
Box::new(PdfTextExtractor::new()),
Box::new(RustAstExtractor::new()),
Box::new(PythonAstExtractor::new()),
Box::new(TypescriptAstExtractor::new()),
Box::new(JavascriptAstExtractor::new()),
Box::new(GoAstExtractor::new()),
Box::new(JavaAstExtractor::new()),
Box::new(KotlinAstExtractor::new()),
Box::new(CAstExtractor::new()),
Box::new(CppAstExtractor::new()),
];
// p9-fb-41 PR-9c-2: build the NLI verifier when the gate is
// enabled. App carries it on `RagPipeline` via
// `with_verifier` so the rag crate doesn't have to know about
// kebab-nli construction. Failure (`?`) surfaces as a user-
// facing error at App boot — never a panic in the pipeline's
// `expect("verifier must be Some when nli_threshold > 0.0")`.
let pipeline_verifier: Option<Arc<dyn kebab_nli::NliVerifier>> = if config.rag.nli_threshold
> 0.0
{
let v = kebab_nli::OnnxNliVerifier::new(&config)
.context("kebab-app: construct OnnxNliVerifier (config.rag.nli_threshold > 0)")?;
Some(Arc::new(v))
} else {
None
};
Ok(Self {
config,
sqlite: Arc::new(sqlite),
extractors,
embedder: OnceLock::new(),
vector: OnceLock::new(),
llm: OnceLock::new(),
search_cache,
pipeline_verifier,
})
}
/// Polymorphic dispatcher for the [`Extractor`] trait. Looks up the
/// first Extractor whose `supports(media)` returns true and invokes
/// `extract(ctx, bytes)` on it.
///
/// Errors with `anyhow!("no Extractor for media_type {media:?}")`
/// when no matching Extractor is registered. Callers in
/// `ingest_one_*_asset` reach this only after the outer 4-arm
/// dispatch (`MediaType::Markdown` / `Image` / `Pdf` / `Code(lang)`)
/// has matched, so a miss is a programming error — NOT a user-
/// facing skip.
pub(crate) fn extract_for(
&self,
media: &MediaType,
ctx: &ExtractContext<'_>,
bytes: &[u8],
) -> Result<kebab_core::CanonicalDocument> {
let extractor = self
.extractors
.iter()
.find(|e| e.supports(media))
.ok_or_else(|| anyhow!("no Extractor for media_type {media:?}"))?;
extractor.extract(ctx, bytes)
}
/// Run a [`SearchQuery`] through the configured retriever stack and
/// return the top-k hits. p9-fb-19: result is served from the
/// in-process LRU cache when the same `(query_norm, mode, k,
@@ -235,7 +356,9 @@ impl App {
// so other in-flight searches can use the cache concurrently.
drop(guard);
let hits = self.search_uncached(query)?;
let mut guard = cache.lock().unwrap_or_else(|e| e.into_inner());
let mut guard = cache
.lock()
.unwrap_or_else(std::sync::PoisonError::into_inner);
guard.put(key, hits.clone());
Ok(hits)
}
@@ -315,11 +438,7 @@ impl App {
///
/// `SearchResponse.next_cursor` and `truncated` are independent
/// signals — see `SearchResponse` doc for details.
pub fn search_with_opts(
&self,
query: SearchQuery,
opts: SearchOpts,
) -> Result<SearchResponse> {
pub fn search_with_opts(&self, query: SearchQuery, opts: SearchOpts) -> Result<SearchResponse> {
use crate::cursor;
let corpus_revision = self.sqlite.corpus_revision().to_string();
@@ -404,12 +523,11 @@ impl App {
// Apply offset + k_effective truncation (mirrors non-trace path).
let drop_n = offset.min(traced_hits.len());
traced_hits.drain(..drop_n);
let mut hits: Vec<SearchHit> =
traced_hits.into_iter().take(k_effective).collect();
let mut hits: Vec<SearchHit> = traced_hits.into_iter().take(k_effective).collect();
// Snippet truncation if opts.snippet_chars set (mirror non-trace path).
if opts.snippet_chars.is_some() {
for h in hits.iter_mut() {
for h in &mut hits {
if h.snippet.chars().count() > snippet_chars {
h.snippet = trim_to_chars(&h.snippet, snippet_chars);
}
@@ -418,11 +536,13 @@ impl App {
// Trace path skips the budget loop. Caller will inspect
// `hits.len()` and `trace.timing` rather than paginate.
let hint: Option<String> = None;
return Ok(SearchResponse {
hits,
next_cursor: None,
truncated: false,
trace: Some(trace),
hint,
});
}
@@ -434,15 +554,14 @@ impl App {
// Skip offset.
let drop_n = offset.min(all_hits.len());
all_hits.drain(..drop_n);
let mut hits: Vec<SearchHit> =
all_hits.into_iter().take(k_effective).collect();
let mut hits: Vec<SearchHit> = all_hits.into_iter().take(k_effective).collect();
// Apply snippet_chars override if shorter than what the
// retriever returned (retriever already honored
// `config.search.snippet_chars`; this only kicks in when the
// caller asked for *less*).
if opts.snippet_chars.is_some() {
for h in hits.iter_mut() {
for h in &mut hits {
if h.snippet.chars().count() > snippet_chars {
h.snippet = trim_to_chars(&h.snippet, snippet_chars);
}
@@ -456,15 +575,11 @@ impl App {
// Step 1: shorten snippets progressively to a 60-char floor.
const SNIPPET_FLOOR: usize = 60;
let mut current_snippet_cap = snippet_chars;
while estimate_chars(&hits) > max_chars
&& current_snippet_cap > SNIPPET_FLOOR
{
current_snippet_cap =
(current_snippet_cap / 2).max(SNIPPET_FLOOR);
for h in hits.iter_mut() {
while estimate_chars(&hits) > max_chars && current_snippet_cap > SNIPPET_FLOOR {
current_snippet_cap = (current_snippet_cap / 2).max(SNIPPET_FLOOR);
for h in &mut hits {
if h.snippet.chars().count() > current_snippet_cap {
h.snippet =
trim_to_chars(&h.snippet, current_snippet_cap);
h.snippet = trim_to_chars(&h.snippet, current_snippet_cap);
truncated = true;
}
}
@@ -505,11 +620,13 @@ impl App {
None
};
let hint: Option<String> = None;
Ok(SearchResponse {
hits,
next_cursor,
truncated,
trace: None,
hint,
})
}
@@ -518,11 +635,27 @@ impl App {
pub fn ask(&self, query: &str, opts: AskOpts) -> Result<Answer> {
let retriever = self.build_retriever(opts.mode)?;
let llm = self.llm()?;
let pipeline =
RagPipeline::new(self.config.clone(), retriever, llm, self.sqlite.clone());
let pipeline = self.build_pipeline(retriever, llm);
pipeline.ask(query, opts)
}
/// p9-fb-41 PR-9c-2: shared pipeline builder used by [`Self::ask`]
/// and [`Self::ask_with_session`]. Attaches the App-built NLI
/// verifier (when `cfg.rag.nli_threshold > 0`) via
/// `RagPipeline::with_verifier`, keeping the construction site in
/// a single place so the two call paths can't drift.
fn build_pipeline(
&self,
retriever: Arc<dyn Retriever>,
llm: Arc<dyn LanguageModel>,
) -> RagPipeline {
let pipeline = RagPipeline::new(self.config.clone(), retriever, llm, self.sqlite.clone());
match &self.pipeline_verifier {
Some(v) => pipeline.with_verifier(v.clone()),
None => pipeline,
}
}
/// p9-fb-18: shared retriever-stack builder used by [`Self::ask`]
/// and [`Self::ask_with_session`]. Lexical mode uses the FTS5
/// retriever directly; vector / hybrid require embeddings (and
@@ -587,12 +720,7 @@ impl App {
/// returns; on persistence error, the answer is still returned
/// (don't lose the user's compute) but the error is logged so
/// the operator notices.
pub fn ask_with_session(
&self,
session_id: &str,
query: &str,
opts: AskOpts,
) -> Result<Answer> {
pub fn ask_with_session(&self, session_id: &str, query: &str, opts: AskOpts) -> Result<Answer> {
use kebab_core::traits::{ChatSessionRepo, ChatSessionRow, ChatTurnRow};
use std::time::{SystemTime, UNIX_EPOCH};
@@ -625,17 +753,13 @@ impl App {
// p9-fb-18 R1: shared retriever builder removes the prior
// copy of `ask`'s 35-line stack — see [`Self::build_retriever`].
// p9-fb-41 PR-9c-2: shared `build_pipeline` attaches the NLI
// verifier when the gate is enabled.
let retriever = self.build_retriever(opts.mode)?;
let llm = self.llm()?;
let pipeline =
RagPipeline::new(self.config.clone(), retriever, llm, self.sqlite.clone());
let answer = pipeline.ask_with_history(
query,
history,
session_id.to_string(),
next_index,
opts,
)?;
let pipeline = self.build_pipeline(retriever, llm);
let answer =
pipeline.ask_with_history(query, history, session_id.to_string(), next_index, opts)?;
// Auto-create the session header on first use. Title from
// the first question (≤40 chars after trim).
@@ -676,7 +800,8 @@ impl App {
turn_index: next_index,
question: query.to_string(),
answer: answer.answer.clone(),
citations_json: serde_json::to_string(&answer.citations).unwrap_or_else(|_| "[]".to_string()),
citations_json: serde_json::to_string(&answer.citations)
.unwrap_or_else(|_| "[]".to_string()),
created_at: now_unix,
};
if let Err(e) = self.sqlite.append_turn(&turn_row) {
@@ -710,10 +835,31 @@ impl App {
if let Some(e) = self.embedder.get() {
return Ok(Some(e.clone()));
}
let emb: Arc<dyn Embedder + Send + Sync> = Arc::new(
FastembedEmbedder::new(&self.config)
.context("kb-app: load FastembedEmbedder")?,
);
// Provider branch (Track 1 spec §3 + arctic-embedder spec). The
// `embeddings_disabled()` check above already handled `"none"`; here we
// route the live providers. `fastembed`/`onnx`/(empty) keep the default
// onnxruntime path (vectors unchanged — `embedding_version` is
// preserved); `candle` selects the pure-Rust NUMA-safe backend (e5 or
// arctic via its model registry); `ollama` offloads to a remote
// `/api/embed` daemon.
let provider = self.config.models.embedding.provider.as_str();
let emb: Arc<dyn Embedder + Send + Sync> = match provider {
"fastembed" | "onnx" | "" => Arc::new(
FastembedEmbedder::new(&self.config).context("kb-app: load FastembedEmbedder")?,
),
"candle" => Arc::new(
CandleEmbedder::new(&self.config).context("kb-app: load CandleEmbedder")?,
),
"ollama" => Arc::new(
OllamaEmbedder::new(&self.config).context("kb-app: load OllamaEmbedder")?,
),
other => {
return Err(anyhow!(
"kb-app: unknown embedding provider {other:?}; expected one of \
`fastembed` (default), `candle`, `ollama`, or `none` (lexical-only)"
));
}
};
// `set` returns Err if another thread won the race; in that case
// the loser still returns the (now-cached) winner via `get()`.
let _ = self.embedder.set(emb.clone());
@@ -788,7 +934,9 @@ impl App {
/// clear` admin command). No-op when the cache is disabled.
pub fn clear_search_cache(&self) {
if let Some(cache) = self.search_cache.as_ref() {
let mut guard = cache.lock().unwrap_or_else(|e| e.into_inner());
let mut guard = cache
.lock()
.unwrap_or_else(std::sync::PoisonError::into_inner);
guard.clear();
}
}
@@ -809,8 +957,8 @@ impl App {
/// git tree) correctly keep `repo: None` — `Metadata.repo` is already
/// `None` for those, so the assignment is a no-op.
fn backfill_repo(&self, hits: &mut [SearchHit]) {
use std::collections::HashMap;
use kebab_core::DocumentId;
use std::collections::HashMap;
// doc_id → Option<String> where None means "not found / no repo"
let mut cache: HashMap<DocumentId, Option<String>> = HashMap::new();
@@ -819,26 +967,24 @@ impl App {
if hit.repo.is_some() {
continue;
}
let repo_val = cache
.entry(hit.doc_id.clone())
.or_insert_with(|| {
// Deliberately non-aborting: a failed store lookup for
// one hit must not abort the whole search response. Log
// the error so it's observable rather than silently
// dropped (review #140 round 1).
match self.sqlite.get_document(&hit.doc_id) {
Ok(opt) => opt.and_then(|doc| doc.metadata.repo),
Err(e) => {
tracing::warn!(
target: "kebab-app",
doc_id = %hit.doc_id,
error = %e,
"backfill_repo: get_document failed; leaving hit.repo = None"
);
None
}
let repo_val = cache.entry(hit.doc_id.clone()).or_insert_with(|| {
// Deliberately non-aborting: a failed store lookup for
// one hit must not abort the whole search response. Log
// the error so it's observable rather than silently
// dropped (review #140 round 1).
match self.sqlite.get_document(&hit.doc_id) {
Ok(opt) => opt.and_then(|doc| doc.metadata.repo),
Err(e) => {
tracing::warn!(
target: "kebab-app",
doc_id = %hit.doc_id,
error = %e,
"backfill_repo: get_document failed; leaving hit.repo = None"
);
None
}
});
}
});
if let Some(r) = repo_val {
hit.repo = Some(r.clone());
}
@@ -849,10 +995,7 @@ impl App {
/// "switch to --mode lexical" error when embeddings are disabled.
fn require_embeddings(
&self,
) -> Result<(
Arc<dyn Embedder + Send + Sync>,
Arc<LanceVectorStore>,
)> {
) -> Result<(Arc<dyn Embedder + Send + Sync>, Arc<LanceVectorStore>)> {
let emb = self.embedder()?.ok_or_else(|| {
anyhow!(
"embeddings disabled (config.models.embedding.provider == \"none\" \
@@ -874,8 +1017,16 @@ impl App {
/// the active config. This token surfaces in `SearchHit.index_version`
/// and on snapshot tests; including the chunker version pins it to
/// the chunking policy in effect.
///
/// V009 (2026-05-28): FTS5 tokenizer 가 trigram → unicode61 + 한국어
/// 형태소 분해 column 로 갱신됨. `fts5-v009-korean-morphological`
/// suffix 가 V007 baseline 과 구별되어 eval runner 의 config
/// snapshot 및 search cache 무효화에 picks up 된다.
fn lexical_index_version(config: &kebab_config::Config) -> IndexVersion {
IndexVersion(format!("lex:{}", config.chunking.chunker_version))
IndexVersion(format!(
"lex:{}:fts5-v009-korean-morphological",
config.chunking.chunker_version
))
}
/// p9-fb-37: stand-in for the vector retriever in the trace path when
@@ -979,6 +1130,223 @@ fn backfill_code_lang(hits: &mut [SearchHit]) {
}
}
// ── v0.20.x r2 Enhancement 3: OCR stats + failures inspect ──────────────
/// Wire type for `kebab inspect ocr-stats --json` (`ocr_stats.v1`).
#[derive(serde::Serialize)]
pub struct OcrStatsV1 {
pub schema_version: &'static str,
pub total_events: u64,
pub total_runs: u64,
pub success_count: u64,
pub failure_count: u64,
pub success_rate: f64,
pub p50_ms: Option<u64>,
pub p90_ms: Option<u64>,
pub p99_ms: Option<u64>,
pub max_ms: Option<u64>,
pub by_engine: std::collections::BTreeMap<String, u64>,
pub by_doc: Vec<OcrStatsByDoc>,
}
/// Per-doc breakdown row inside `OcrStatsV1`.
#[derive(serde::Serialize)]
pub struct OcrStatsByDoc {
pub doc_id: String,
pub failure_count: u64,
pub success_count: u64,
pub p90_ms: Option<u64>,
}
/// Wire type for `kebab inspect ocr-failures --json` (`ocr_failures.v1`).
#[derive(serde::Serialize)]
pub struct OcrFailuresV1 {
pub schema_version: &'static str,
pub doc_id: Option<String>,
pub failure_count: u64,
pub failures: Vec<OcrFailureRow>,
}
/// Single failure row inside `OcrFailuresV1`.
#[derive(serde::Serialize)]
pub struct OcrFailureRow {
pub ts: String,
pub page: u32,
pub ms: u64,
pub reason: String,
pub image_byte_size: Option<u64>,
}
impl App {
/// Corpus-wide OCR statistics from the `pdf_ocr_events` SQLite mirror.
pub fn inspect_ocr_stats(&self) -> Result<OcrStatsV1> {
self.inspect_ocr_stats_with_config(&self.config)
}
#[doc(hidden)]
pub fn inspect_ocr_stats_with_config(&self, _cfg: &kebab_config::Config) -> Result<OcrStatsV1> {
use crate::ingest_log::percentiles;
let conn = self.sqlite.read_conn();
// 1. Aggregate counters
let (total_events, success_count, failure_count, total_runs): (u64, u64, u64, u64) = conn
.query_row(
"SELECT COUNT(*), \
SUM(CASE WHEN success=1 THEN 1 ELSE 0 END), \
SUM(CASE WHEN success=0 THEN 1 ELSE 0 END), \
COUNT(DISTINCT run_id) \
FROM pdf_ocr_events",
[],
|r| Ok((r.get(0)?, r.get(1)?, r.get(2)?, r.get(3)?)),
)
.unwrap_or((0, 0, 0, 0));
let success_rate = if total_events == 0 {
0.0
} else {
success_count as f64 / total_events as f64
};
// 2. Latency percentiles from successful events
let samples: Vec<u64> = {
let mut stmt = conn
.prepare("SELECT ms FROM pdf_ocr_events WHERE success=1 ORDER BY ms")
.context("prepare ms query")?;
stmt.query_map([], |r| r.get::<_, u64>(0))
.context("query ms")?
.filter_map(Result::ok)
.collect()
};
let (p50_ms, p90_ms, p99_ms, max_ms) = percentiles(&samples);
// 3. Engine breakdown
let mut by_engine = std::collections::BTreeMap::new();
{
let mut stmt = conn
.prepare("SELECT ocr_engine, COUNT(*) FROM pdf_ocr_events GROUP BY ocr_engine")
.context("prepare engine query")?;
let rows = stmt
.query_map([], |r| Ok((r.get::<_, String>(0)?, r.get::<_, u64>(1)?)))
.context("query engine")?;
for row in rows.filter_map(Result::ok) {
by_engine.insert(row.0, row.1);
}
}
// 4. Top-10 docs by failure count
let by_doc: Vec<OcrStatsByDoc> = {
let mut stmt = conn
.prepare(
"SELECT doc_id, \
SUM(CASE WHEN success=0 THEN 1 ELSE 0 END), \
SUM(CASE WHEN success=1 THEN 1 ELSE 0 END) \
FROM pdf_ocr_events \
WHERE doc_id IS NOT NULL \
GROUP BY doc_id \
ORDER BY 2 DESC \
LIMIT 10",
)
.context("prepare by_doc query")?;
stmt.query_map([], |r| {
Ok(OcrStatsByDoc {
doc_id: r.get(0)?,
failure_count: r.get(1)?,
success_count: r.get(2)?,
p90_ms: None, // per-doc p90 deferred (open question #3)
})
})
.context("query by_doc")?
.filter_map(Result::ok)
.collect()
};
Ok(OcrStatsV1 {
schema_version: "ocr_stats.v1",
total_events,
total_runs,
success_count,
failure_count,
success_rate,
p50_ms,
p90_ms,
p99_ms,
max_ms,
by_engine,
by_doc,
})
}
/// Recent OCR failure rows, optionally filtered by `doc_id`.
pub fn inspect_ocr_failures(
&self,
doc_id: Option<&str>,
limit: usize,
) -> Result<OcrFailuresV1> {
self.inspect_ocr_failures_with_config(&self.config, doc_id, limit)
}
#[doc(hidden)]
pub fn inspect_ocr_failures_with_config(
&self,
_cfg: &kebab_config::Config,
doc_id: Option<&str>,
limit: usize,
) -> Result<OcrFailuresV1> {
let conn = self.sqlite.read_conn();
let failures: Vec<OcrFailureRow> = if let Some(did) = doc_id {
let mut stmt = conn
.prepare(
"SELECT ts, page, ms, COALESCE(reason,'unknown'), image_byte_size \
FROM pdf_ocr_events \
WHERE success=0 AND doc_id=? \
ORDER BY ts DESC \
LIMIT ?",
)
.context("prepare failures by doc_id")?;
stmt.query_map(rusqlite::params![did, limit as i64], |r| {
Ok(OcrFailureRow {
ts: r.get(0)?,
page: r.get(1)?,
ms: r.get(2)?,
reason: r.get(3)?,
image_byte_size: r.get(4)?,
})
})
.context("query failures by doc_id")?
.filter_map(Result::ok)
.collect()
} else {
let mut stmt = conn
.prepare(
"SELECT ts, page, ms, COALESCE(reason,'unknown'), image_byte_size \
FROM pdf_ocr_events \
WHERE success=0 \
ORDER BY ts DESC \
LIMIT ?",
)
.context("prepare failures corpus-wide")?;
stmt.query_map(rusqlite::params![limit as i64], |r| {
Ok(OcrFailureRow {
ts: r.get(0)?,
page: r.get(1)?,
ms: r.get(2)?,
reason: r.get(3)?,
image_byte_size: r.get(4)?,
})
})
.context("query failures corpus-wide")?
.filter_map(Result::ok)
.collect()
};
Ok(OcrFailuresV1 {
schema_version: "ocr_failures.v1",
doc_id: doc_id.map(String::from),
failure_count: failures.len() as u64,
failures,
})
}
}
#[cfg(test)]
mod tests {
use super::*;
@@ -1077,3 +1445,128 @@ mod tests_trace {
assert!(resp.trace.is_some(), "trace populated when opts.trace=true");
}
}
/// post-v0.18.0 extractor-dispatch-unification: in-crate unit tests for
/// the `App.extractors` registry + `App::extract_for` polymorphic
/// dispatch. In-crate (not `tests/`) because `extractors` + `extract_for`
/// are `pub(crate)` — integration tests cannot reach them.
///
/// Spec §5.1 + plan §2 Step 10 — 3 test class:
/// 1. registry length = 11 (image + pdf + 9 AST).
/// 2. mutually-exclusive `supports()` grid over 16 sample MediaTypes.
/// 3. `extract_for` returns `Err("no Extractor ...")` for registry-NOT-cover
/// MediaType (Audio).
#[cfg(test)]
mod tests_extractor_dispatch {
use super::*;
use kebab_core::{AudioType, ExtractConfig, ImageType};
/// helper: tempdir-isolated App for tests (mirrors `tests_trace`'s
/// `open_app_with_temp_dir` pattern).
fn open_app_with_temp_dir() -> (tempfile::TempDir, App) {
let dir = tempfile::tempdir().unwrap();
let mut cfg = kebab_config::Config::defaults();
cfg.storage.data_dir = dir.path().to_string_lossy().into_owned();
// Bring up migrations.
let store = kebab_store_sqlite::SqliteStore::open(&cfg).unwrap();
store.run_migrations().unwrap();
drop(store);
let app = App::open_with_config(cfg).unwrap();
(dir, app)
}
/// Registry length invariant: 11 Extractor (image + pdf + 9 AST).
/// Markdown is NOT registered (free-function path — defer to a
/// separate PR per spec §3.4).
#[test]
fn registry_has_eleven_extractors() {
let (_dir, app) = open_app_with_temp_dir();
assert_eq!(
app.extractors.len(),
11,
"registry must hold 11 Extractors (image + pdf + 9 AST). \
markdown 은 별 PR."
);
}
/// 11 Extractor 의 `supports()` 가 16 sample MediaType 에 대해
/// mutually exclusive — 어떤 두 Extractor 도 동일 MediaType 에
/// 대해 true 반환 안 됨.
#[test]
fn supports_grid_is_mutually_exclusive() {
let (_dir, app) = open_app_with_temp_dir();
let samples = vec![
MediaType::Markdown,
MediaType::Pdf,
MediaType::Image(ImageType::Png),
MediaType::Image(ImageType::Jpeg),
MediaType::Code("rust".into()),
MediaType::Code("python".into()),
MediaType::Code("typescript".into()),
MediaType::Code("javascript".into()),
MediaType::Code("go".into()),
MediaType::Code("java".into()),
MediaType::Code("kotlin".into()),
MediaType::Code("c".into()),
MediaType::Code("cpp".into()),
MediaType::Code("yaml".into()), // registry NOT cover
MediaType::Code("shell".into()), // registry NOT cover
MediaType::Audio(AudioType::Wav), // registry NOT cover
];
for sample in &samples {
let hits: Vec<_> = app
.extractors
.iter()
.filter(|e| e.supports(sample))
.collect();
assert!(
hits.len() <= 1,
"mutually exclusive violated for {sample:?}: {} hits",
hits.len()
);
}
}
/// `extract_for` 가 registry NOT cover MediaType (Audio) 에 대해
/// `Err("no Extractor for media_type ...")` 반환. Audio MediaType
/// 사용으로 RawAsset 의 actual content 의존 회피 — registry NOT
/// cover → 즉시 Err.
#[test]
fn extract_for_unsupported_media_errors() {
let (_dir, app) = open_app_with_temp_dir();
// Minimal RawAsset. Actual content never read — Audio MediaType
// 는 registry NOT cover → `extract_for` 가 dispatch loop 안에서
// 바로 Err 반환. RawAsset field set 은 `crates/kebab-core/src/
// asset.rs:62-73` 와 정합 (8 field).
let asset = kebab_core::RawAsset {
asset_id: kebab_core::AssetId("00".repeat(16)),
source_uri: kebab_core::SourceUri::File("/tmp/dummy.wav".into()),
workspace_path: kebab_core::WorkspacePath("dummy.wav".to_string()),
media_type: MediaType::Audio(AudioType::Wav),
byte_len: 0,
checksum: kebab_core::Checksum("00".repeat(32)),
discovered_at: time::OffsetDateTime::now_utc(),
// AssetStorage::Inline 미존재 — actual variant `Copied { path }`
// 사용 (kebab-core/src/asset.rs:55-60).
stored: kebab_core::AssetStorage::Copied {
path: std::path::PathBuf::from("/tmp/dummy.wav"),
},
};
let workspace_root: std::path::PathBuf = std::path::PathBuf::from("/tmp");
let cfg = ExtractConfig::default();
let ctx = ExtractContext {
asset: &asset,
workspace_root: &workspace_root,
config: &cfg,
};
let result = app.extract_for(&MediaType::Audio(AudioType::Wav), &ctx, &[]);
assert!(result.is_err(), "Audio 는 registry 미포함 → Err 기대");
let err_msg = format!("{:#}", result.unwrap_err());
assert!(
err_msg.contains("no Extractor"),
"unexpected err: {err_msg}"
);
}
}

View File

@@ -96,6 +96,11 @@ fn serialize_search_response(r: &SearchResponse) -> Value {
None => Value::Null,
};
map.insert("trace".to_string(), trace_v);
// v0.17.0 A5 Step 4b: only emit `hint` when set — matches
// the CLI wire wrapper's additive emit pattern.
if let Some(hint) = &r.hint {
map.insert("hint".to_string(), Value::String(hint.clone()));
}
}
v
}
@@ -121,7 +126,10 @@ fn parse_one(raw: &Value) -> Result<(SearchQuery, SearchOpts), String> {
let text = obj
.get("query")
.and_then(|v| v.as_str())
.ok_or("missing required field: query")?
.ok_or(
"missing required field: query \
(expected {\"query\":\"<text>\",\"mode\":\"lexical|vector|hybrid\",\"k\":3,...})",
)?
.to_string();
let mode = match obj.get("mode").and_then(|v| v.as_str()) {
@@ -134,9 +142,8 @@ fn parse_one(raw: &Value) -> Result<(SearchQuery, SearchOpts), String> {
let k = obj
.get("k")
.and_then(|v| v.as_u64())
.map(|n| n as usize)
.unwrap_or(0); // 0 → use config default in app
.and_then(serde_json::Value::as_u64)
.map_or(0, |n| n as usize); // 0 → use config default in app
let trust_min = match obj.get("trust_min").and_then(|v| v.as_str()) {
None => None,
@@ -204,14 +211,17 @@ fn parse_one(raw: &Value) -> Result<(SearchQuery, SearchOpts), String> {
let opts = SearchOpts {
max_tokens: obj
.get("max_tokens")
.and_then(|v| v.as_u64())
.and_then(serde_json::Value::as_u64)
.map(|n| n as usize),
snippet_chars: obj
.get("snippet_chars")
.and_then(|v| v.as_u64())
.and_then(serde_json::Value::as_u64)
.map(|n| n as usize),
cursor: obj.get("cursor").and_then(|v| v.as_str()).map(String::from),
trace: obj.get("trace").and_then(|v| v.as_bool()).unwrap_or(false),
trace: obj
.get("trace")
.and_then(serde_json::Value::as_bool)
.unwrap_or(false),
};
Ok((
@@ -295,4 +305,17 @@ mod tests {
assert!(items[1].error.is_some());
assert_eq!(items[1].error.as_ref().unwrap()["code"], "invalid_input");
}
#[test]
fn missing_query_error_message_includes_shape_hint() {
let cfg = open_temp();
let raw = vec![serde_json::json!({"mode": "lexical"})];
let (items, _summary) = bulk_search_with_config(cfg, raw).unwrap();
let err = items[0].error.as_ref().unwrap();
let msg = err["message"].as_str().unwrap();
assert!(
msg.contains("query") && msg.contains("mode"),
"missing shape hint in error message: {msg}"
);
}
}

View File

@@ -0,0 +1,61 @@
//! Derivation-cache payload encoding helpers (design 2026-05-31 §3.3).
//!
//! - embedding: `dimensions × f32` little-endian bytes (1024×4 = 4096 B/chunk).
//! - alias / korean_tokens: UTF-8 as-is (handled inline by the caller — no
//! helper needed, `String::as_bytes` / `String::from_utf8`).
/// Encode an embedding vector as a little-endian `f32` byte string (§3.3).
pub fn encode_embedding(vector: &[f32]) -> Vec<u8> {
let mut out = Vec::with_capacity(vector.len() * 4);
for &v in vector {
out.extend_from_slice(&v.to_le_bytes());
}
out
}
/// Decode a little-endian `f32` byte string back into a vector (§3.3).
///
/// Returns `None` if the payload length is not a multiple of 4 (corrupt
/// entry) — the caller treats this as a cache miss and recomputes, so a bad
/// payload never produces a wrong vector.
pub fn decode_embedding(payload: &[u8]) -> Option<Vec<f32>> {
if payload.len() % 4 != 0 {
return None;
}
Some(
payload
.chunks_exact(4)
.map(|c| f32::from_le_bytes([c[0], c[1], c[2], c[3]]))
.collect(),
)
}
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn roundtrips_vector() {
let v = vec![0.0_f32, 1.5, -2.25, 3.125e10, f32::MIN, f32::MAX];
let bytes = encode_embedding(&v);
assert_eq!(bytes.len(), v.len() * 4);
assert_eq!(decode_embedding(&bytes), Some(v));
}
#[test]
fn empty_vector_roundtrips() {
assert_eq!(encode_embedding(&[]), Vec::<u8>::new());
assert_eq!(decode_embedding(&[]), Some(vec![]));
}
#[test]
fn misaligned_payload_is_none() {
assert_eq!(decode_embedding(&[1, 2, 3]), None);
}
#[test]
fn little_endian_layout_is_fixed() {
// 1.0_f32 == 0x3F800000, little-endian bytes [0x00,0x00,0x80,0x3F].
assert_eq!(encode_embedding(&[1.0]), vec![0x00, 0x00, 0x80, 0x3F]);
}
}

View File

@@ -10,6 +10,6 @@
pub use crate::doctor_signal::{DoctorUnhealthy, NoHitSignal, RefusalSignal};
pub use kebab_config::{ConfigInvalid, ConfigNotFound};
pub use kebab_llm_local::LlmError;
pub use kebab_config::ConfigInvalid;
pub use kebab_store_sqlite::NotIndexed;

View File

@@ -9,7 +9,7 @@
use serde::{Deserialize, Serialize};
use serde_json::{Value, json};
use crate::error_signal::{ConfigInvalid, LlmError, NotIndexed};
use crate::error_signal::{ConfigInvalid, ConfigNotFound, LlmError, NotIndexed};
// p9-fb-34: `stale_cursor` is constructed directly by `cursor::decode`
// and surfaced through `StructuredError` (an anyhow-friendly wrapper
@@ -65,6 +65,20 @@ pub fn classify(err: &anyhow::Error, verbose: bool) -> ErrorV1 {
hint: Some("check `--config <path>` and TOML syntax".to_string()),
};
}
if let Some(s) = err.downcast_ref::<ConfigNotFound>() {
return ErrorV1 {
schema_version: ERROR_V1_ID.to_string(),
code: "config_not_found".to_string(),
message: s.to_string(),
details: json!({
"path": s.path.to_string_lossy(),
}),
hint: Some(
"verify --config <path>; pass an existing toml file or omit --config to use XDG default"
.to_string(),
),
};
}
if let Some(s) = err.downcast_ref::<NotIndexed>() {
return ErrorV1 {
schema_version: ERROR_V1_ID.to_string(),
@@ -91,7 +105,7 @@ pub fn classify(err: &anyhow::Error, verbose: bool) -> ErrorV1 {
}
let mut details = json!({});
if verbose {
let chain: Vec<String> = err.chain().map(|c| c.to_string()).collect();
let chain: Vec<String> = err.chain().map(std::string::ToString::to_string).collect();
details = json!({"chain": chain});
}
ErrorV1 {
@@ -158,7 +172,10 @@ mod tests {
});
let v1 = classify(&err, false);
assert_eq!(v1.code, "config_invalid");
assert_eq!(v1.details.get("path").and_then(|p| p.as_str()), Some("/tmp/x.toml"));
assert_eq!(
v1.details.get("path").and_then(|p| p.as_str()),
Some("/tmp/x.toml")
);
assert!(v1.hint.is_some());
}
@@ -182,7 +199,8 @@ mod tests {
// the resulting LlmError::Unreachable maps to "model_unreachable".
let client = reqwest::blocking::Client::builder()
.timeout(std::time::Duration::from_millis(500))
.build().unwrap();
.build()
.unwrap();
let err = client.get("http://127.0.0.1:1").send().unwrap_err();
let llm = LlmError::Unreachable {
endpoint: "http://127.0.0.1:1".to_string(),
@@ -198,7 +216,10 @@ mod tests {
let llm = LlmError::ModelNotPulled("gemma4:e4b".to_string());
let v1 = classify(&anyhow::Error::new(llm), false);
assert_eq!(v1.code, "model_not_pulled");
assert_eq!(v1.details.get("model").and_then(|p| p.as_str()), Some("gemma4:e4b"));
assert_eq!(
v1.details.get("model").and_then(|p| p.as_str()),
Some("gemma4:e4b")
);
}
#[test]
@@ -235,7 +256,10 @@ mod tests {
// (single source of truth). classify must not pattern-match on
// anyhow string contents — that would create two sources of
// truth. The bare anyhow string falls through to "generic".
assert_ne!(v1.code, "stale_cursor", "classify must not produce stale_cursor from bare anyhow string");
assert_ne!(
v1.code, "stale_cursor",
"classify must not produce stale_cursor from bare anyhow string"
);
}
#[test]

View File

@@ -36,9 +36,7 @@ pub fn ensure_kebabignore_entry(workspace_root: &Path) -> Result<()> {
} else {
String::new()
};
let already = existing
.lines()
.any(|line| line.trim() == KEBABIGNORE_LINE);
let already = existing.lines().any(|line| line.trim() == KEBABIGNORE_LINE);
if already {
return Ok(());
}
@@ -50,18 +48,14 @@ pub fn ensure_kebabignore_entry(workspace_root: &Path) -> Result<()> {
if !existing.is_empty() && !existing.ends_with('\n') {
file.write_all(b"\n")?;
}
writeln!(file, "{}", KEBABIGNORE_LINE)?;
writeln!(file, "{KEBABIGNORE_LINE}")?;
Ok(())
}
/// Copy bytes to `<external_dir>/<blake3-12>.<ext>`. Idempotent — if the
/// destination file already exists with the expected hash, the existing
/// file is reused (no second write). Returns the destination path.
pub fn copy_to_external(
external_dir: &Path,
bytes: &[u8],
ext: &str,
) -> Result<PathBuf> {
pub fn copy_to_external(external_dir: &Path, bytes: &[u8], ext: &str) -> Result<PathBuf> {
let hash = blake3::hash(bytes);
let hex = hash.to_hex();
let prefix = &hex.as_str()[..12];
@@ -82,11 +76,7 @@ pub fn copy_to_external(
/// Internal `yaml_quote` always uses double-quoted YAML form with backslash
/// escapes for `"` / `\` / control chars — agent-supplied titles with
/// special characters are safe.
pub fn inject_frontmatter(
body: &str,
title: &str,
source_uri: Option<&str>,
) -> Result<String> {
pub fn inject_frontmatter(body: &str, title: &str, source_uri: Option<&str>) -> Result<String> {
let head = body.trim_start();
if head.starts_with("---\n") || head.starts_with("---\r\n") || head.starts_with("---\r") {
anyhow::bail!(

View File

@@ -50,14 +50,14 @@ impl App {
fn fetch_chunk(app: &App, id: ChunkId, opts: FetchOpts) -> Result<FetchResult> {
let target = <kebab_store_sqlite::SqliteStore as DocumentStore>::get_chunk(&app.sqlite, &id)?
.ok_or_else(|| {
anyhow::Error::new(StructuredError(ErrorV1 {
schema_version: ERROR_V1_ID.to_string(),
code: "chunk_not_found".to_string(),
message: format!("chunk_id '{}' not found", id.0),
details: serde_json::Value::Null,
hint: None,
}))
})?;
anyhow::Error::new(StructuredError(ErrorV1 {
schema_version: ERROR_V1_ID.to_string(),
code: "chunk_not_found".to_string(),
message: format!("chunk_id '{}' not found", id.0),
details: serde_json::Value::Null,
hint: None,
}))
})?;
let doc_id = target.doc_id.clone();
let doc =
@@ -107,14 +107,14 @@ fn fetch_chunk(app: &App, id: ChunkId, opts: FetchOpts) -> Result<FetchResult> {
fn fetch_doc(app: &App, id: DocumentId, opts: FetchOpts) -> Result<FetchResult> {
let doc = <kebab_store_sqlite::SqliteStore as DocumentStore>::get_document(&app.sqlite, &id)?
.ok_or_else(|| {
anyhow::Error::new(StructuredError(ErrorV1 {
schema_version: ERROR_V1_ID.to_string(),
code: "doc_not_found".to_string(),
message: format!("doc_id '{}' not found", id.0),
details: serde_json::Value::Null,
hint: None,
}))
})?;
anyhow::Error::new(StructuredError(ErrorV1 {
schema_version: ERROR_V1_ID.to_string(),
code: "doc_not_found".to_string(),
message: format!("doc_id '{}' not found", id.0),
details: serde_json::Value::Null,
hint: None,
}))
})?;
let mut text = fmt_canonical_to_markdown(&doc);
let mut truncated = false;
@@ -176,23 +176,25 @@ fn fetch_span(
) -> Result<FetchResult> {
let doc = <kebab_store_sqlite::SqliteStore as DocumentStore>::get_document(&app.sqlite, &id)?
.ok_or_else(|| {
anyhow::Error::new(StructuredError(ErrorV1 {
schema_version: ERROR_V1_ID.to_string(),
code: "doc_not_found".to_string(),
message: format!("doc_id '{}' not found", id.0),
details: serde_json::Value::Null,
hint: None,
}))
})?;
anyhow::Error::new(StructuredError(ErrorV1 {
schema_version: ERROR_V1_ID.to_string(),
code: "doc_not_found".to_string(),
message: format!("doc_id '{}' not found", id.0),
details: serde_json::Value::Null,
hint: None,
}))
})?;
// Reject line-incompatible media types (PDF / audio). `SourceType`
// (markdown / note / paper / reference / inbox) is the *user-facing*
// category, not the rendering format — the actual byte-level format
// lives on the source `RawAsset.media_type`. Look it up via
// workspace_path (unique key per asset).
if let Some(asset) = <kebab_store_sqlite::SqliteStore as DocumentStore>::get_asset_by_workspace_path(
// doc.source_asset_id (PRIMARY KEY) so twin files (identical content
// at different paths) always read *this* document's own asset row,
// not whichever twin last wrote `assets.workspace_path`.
if let Some(asset) = <kebab_store_sqlite::SqliteStore as DocumentStore>::get_asset(
&app.sqlite,
&doc.workspace_path,
&doc.source_asset_id,
)? {
if matches!(
asset.media_type,

View File

@@ -0,0 +1,446 @@
//! Per-ingest-run structured ndjson log writer (v0.20.x ingest log feature).
//!
//! Each `kebab ingest` run produces one `ingest-{run_id}.ndjson` file in
//! `config.logging.ingest_log_dir`. Records are appended line by line; the
//! last record is always `kind="summary"`. `IngestLogWriter::open` returns
//! `Ok(None)` when `ingest_log_enabled = false` so callers need not branch.
use std::fs::File;
use std::io::{BufWriter, Write};
use std::path::{Path, PathBuf};
use std::time::SystemTime;
use serde::{Deserialize, Serialize};
use time::format_description::well_known::Rfc3339;
pub struct IngestLogWriter {
file: BufWriter<File>,
path: PathBuf,
run_id: String,
started_at: SystemTime,
}
impl IngestLogWriter {
/// Open a new log file. Returns `Ok(None)` when `cfg.ingest_log_enabled == false` (AC-6).
pub fn open(cfg: &kebab_config::LoggingCfg) -> anyhow::Result<Option<Self>> {
if !cfg.ingest_log_enabled {
return Ok(None);
}
let run_id = generate_run_id();
let log_dir = expand_log_dir(&cfg.ingest_log_dir);
std::fs::create_dir_all(&log_dir)?;
// Cleanup before creating the new file (non-critical: warn on error).
if let Err(e) = cleanup_old_logs(&log_dir, cfg.keep_recent_runs, cfg.retention_days) {
tracing::warn!(target: "kebab-app", "ingest log cleanup failed: {e}");
}
let path = log_dir.join(format!("ingest-{run_id}.ndjson"));
let file = BufWriter::new(File::create(&path)?);
Ok(Some(Self {
file,
path,
run_id,
started_at: SystemTime::now(),
}))
}
pub fn write_event(&mut self, event: &LogEvent<'_>) -> anyhow::Result<()> {
serde_json::to_writer(&mut self.file, event)?;
writeln!(self.file)?;
Ok(())
}
pub fn write_summary(&mut self, summary: &IngestSummary) -> anyhow::Result<()> {
serde_json::to_writer(&mut self.file, summary)?;
writeln!(self.file)?;
Ok(())
}
pub fn flush(&mut self) -> anyhow::Result<()> {
self.file.flush()?;
Ok(())
}
pub fn run_id(&self) -> &str {
&self.run_id
}
pub fn path(&self) -> &Path {
&self.path
}
pub fn started_at(&self) -> SystemTime {
self.started_at
}
}
impl Drop for IngestLogWriter {
fn drop(&mut self) {
let _ = self.file.flush();
}
}
/// ISO 8601 compact timestamp + uuid v7 suffix: `20260528T013000Z-abc123de`.
/// uuid v7 is the workspace dep (Cargo.toml); `rand` is not added (spec §6 R-5).
fn generate_run_id() -> String {
use time::macros::format_description;
let now = time::OffsetDateTime::now_utc();
let ts = now
.format(format_description!(
"[year][month][day]T[hour][minute][second]Z"
))
.unwrap_or_else(|_| "19700101T000000Z".to_string());
let uid = uuid::Uuid::now_v7().simple().to_string();
let suffix = &uid[uid.len() - 8..];
format!("{ts}-{suffix}")
}
/// Expand `{state_dir}` placeholder → XDG state dir (spec §6 R-3).
/// Other tilde/env expansion is delegated to `kebab_config::expand_path`.
fn expand_log_dir(path: &Path) -> PathBuf {
let path_str = path.to_string_lossy();
if path_str.contains("{state_dir}") {
let state_dir = kebab_config::Config::xdg_state_dir();
PathBuf::from(path_str.replace("{state_dir}", &state_dir.to_string_lossy()))
} else {
path.to_path_buf()
}
}
/// RFC 3339 UTC timestamp for log records.
#[allow(dead_code)]
pub(crate) fn now_ts() -> String {
time::OffsetDateTime::now_utc()
.format(&Rfc3339)
.unwrap_or_else(|_| "1970-01-01T00:00:00Z".to_string())
}
/// Ingest event record (ndjson line). `kind` is the discriminator.
#[derive(Serialize, Deserialize)]
#[serde(tag = "kind", rename_all = "snake_case")]
pub enum LogEvent<'a> {
Ocr {
ts: String,
/// v0.20.x r2: additive field — doc_id for dual-write SQLite correlation.
/// Round 1 ndjson logs deserialize with doc_id=None (Serde Option default).
#[serde(skip_serializing_if = "Option::is_none")]
doc_id: Option<&'a str>,
doc_path: &'a str,
page: u32,
image_byte_size: Option<u64>,
image_width: Option<u32>,
image_height: Option<u32>,
ms: u64,
chars: u32,
success: bool,
reason: Option<&'a str>,
ocr_engine: &'a str,
},
ParseError {
ts: String,
doc_path: &'a str,
reason: &'a str,
message: &'a str,
},
Skip {
ts: String,
doc_path: &'a str,
reason: &'a str,
detail: Option<&'a str>,
},
Error {
ts: String,
code: &'a str,
message: &'a str,
},
}
/// Final summary record — always the last line of the log file.
/// Explicit `kind` field serializes to `"kind": "summary"`.
#[derive(Serialize, Deserialize)]
pub struct IngestSummary {
pub kind: String,
pub ts: String,
pub run_id: String,
pub scanned: u32,
pub new: u32,
pub errors: u32,
pub ocr_pages: u32,
pub ocr_failures: u32,
pub ocr_p50_ms: Option<u64>,
pub ocr_p90_ms: Option<u64>,
pub ocr_max_ms: Option<u64>,
pub duration_ms: u64,
}
impl IngestSummary {
#[allow(clippy::too_many_arguments)]
pub fn new(
ts: String,
run_id: String,
scanned: u32,
new: u32,
errors: u32,
ocr_pages: u32,
ocr_failures: u32,
ocr_ms_samples: &[u64],
duration_ms: u64,
) -> Self {
let (p50, p90, _p99, max) = percentiles(ocr_ms_samples);
Self {
kind: "summary".to_string(),
ts,
run_id,
scanned,
new,
errors,
ocr_pages,
ocr_failures,
ocr_p50_ms: p50,
ocr_p90_ms: p90,
ocr_max_ms: max,
duration_ms,
}
}
}
/// Simple percentile extraction on a sorted copy of `samples`.
/// Returns `(p50, p90, p99, max)`. All `None` when samples is empty.
/// p99 surfaces via `inspect ocr-stats`; `IngestSummary` uses p50/p90/max only.
pub(crate) fn percentiles(samples: &[u64]) -> (Option<u64>, Option<u64>, Option<u64>, Option<u64>) {
if samples.is_empty() {
return (None, None, None, None);
}
let mut sorted = samples.to_vec();
sorted.sort_unstable();
let n = sorted.len();
let p50 = sorted[(n.saturating_sub(1) * 50) / 100];
let p90 = sorted[(n.saturating_sub(1) * 90) / 100];
let p99 = sorted[(n.saturating_sub(1) * 99) / 100];
let max = *sorted.last().unwrap();
(Some(p50), Some(p90), Some(p99), Some(max))
}
/// Delete old ingest log files from `log_dir`.
///
/// **Retention rule (§3.4 OR-on-stale semantics):**
/// Keep a file iff BOTH conditions hold: (idx < keep_recent) AND (modified > cutoff).
/// Delete iff (idx >= keep_recent) OR (modified <= cutoff) — either stale condition
/// triggers deletion. Files are indexed newest-first so `idx=0` is the most recent.
pub(crate) fn cleanup_old_logs(
log_dir: &Path,
keep_recent: u32,
retention_days: u32,
) -> anyhow::Result<()> {
let mut entries: Vec<_> = std::fs::read_dir(log_dir)?
.filter_map(Result::ok)
.filter(|e| {
e.path()
.file_name()
.and_then(|n| n.to_str())
.is_some_and(|s| s.starts_with("ingest-") && s.ends_with(".ndjson"))
})
.collect();
// Sort newest-first by mtime (files without mtime go to the end).
entries.sort_by_key(|e| std::cmp::Reverse(e.metadata().ok().and_then(|m| m.modified().ok())));
let cutoff = SystemTime::now()
.checked_sub(std::time::Duration::from_secs(
u64::from(retention_days) * 86400,
))
.unwrap_or(SystemTime::UNIX_EPOCH);
for (idx, entry) in entries.into_iter().enumerate() {
let modified = entry
.metadata()
.ok()
.and_then(|m| m.modified().ok())
.unwrap_or(SystemTime::UNIX_EPOCH);
// Keep iff (idx < keep_recent) AND (modified > cutoff).
if (idx as u32) < keep_recent && modified > cutoff {
continue;
}
if let Err(e) = std::fs::remove_file(entry.path()) {
tracing::warn!(
target: "kebab-app",
"failed to remove old log {}: {e}",
entry.path().display()
);
}
}
Ok(())
}
#[cfg(test)]
mod tests {
use super::*;
use kebab_config::LoggingCfg;
use std::time::SystemTime;
use tempfile::TempDir;
#[test]
fn generate_run_id_has_iso_prefix_and_8_hex_suffix() {
let id = generate_run_id();
// Format: YYYYMMDDTHHmmssZ-xxxxxxxx (total len = 16+1+8 = 25)
assert_eq!(id.len(), 25, "run_id len should be 25: {id}");
let (prefix, suffix) = id.split_once('-').expect("run_id should contain '-'");
assert_eq!(prefix.len(), 16, "prefix should be 16 chars: {prefix}");
assert!(prefix.contains('T'), "prefix should contain T: {prefix}");
assert!(prefix.ends_with('Z'), "prefix should end with Z: {prefix}");
assert_eq!(suffix.len(), 8, "suffix should be 8 chars: {suffix}");
assert!(
suffix.chars().all(|c| c.is_ascii_hexdigit()),
"suffix should be hex: {suffix}"
);
}
#[test]
fn expand_log_dir_substitutes_state_dir_placeholder() {
let input = PathBuf::from("{state_dir}/logs");
let expanded = expand_log_dir(&input);
let expected = kebab_config::Config::xdg_state_dir().join("logs");
assert_eq!(expanded, expected);
assert!(!expanded.to_string_lossy().contains("{state_dir}"));
}
#[test]
fn writer_disabled_returns_none() {
let cfg = LoggingCfg {
ingest_log_enabled: false,
ingest_log_dir: PathBuf::from("/tmp/should-not-exist"),
..Default::default()
};
let result = IngestLogWriter::open(&cfg).expect("open should not error");
assert!(result.is_none(), "disabled writer should return None");
}
#[test]
fn writer_writes_one_event_per_line_with_kind_discriminator() {
let tmp = TempDir::new().unwrap();
let cfg = LoggingCfg {
ingest_log_enabled: true,
ingest_log_dir: tmp.path().to_path_buf(),
..Default::default()
};
let mut writer = IngestLogWriter::open(&cfg).unwrap().unwrap();
let path = writer.path().to_path_buf();
writer
.write_event(&LogEvent::Skip {
ts: now_ts(),
doc_path: "a.zip",
reason: "builtin_blacklist",
detail: Some(".zip extension"),
})
.unwrap();
writer
.write_event(&LogEvent::Error {
ts: now_ts(),
code: "ingest_fatal",
message: "something bad",
})
.unwrap();
writer
.write_event(&LogEvent::ParseError {
ts: now_ts(),
doc_path: "weird.pdf",
reason: "lopdf_error",
message: "unexpected EOF",
})
.unwrap();
writer.flush().unwrap();
let contents = std::fs::read_to_string(&path).unwrap();
let lines: Vec<&str> = contents.lines().collect();
assert_eq!(lines.len(), 3, "expected 3 lines, got: {}", lines.len());
for line in &lines {
assert!(
line.starts_with('{'),
"each line should be JSON object: {line}"
);
assert!(
line.contains("\"kind\""),
"each line should have 'kind': {line}"
);
}
}
#[test]
fn drop_flushes_pending_buffer() {
let tmp = TempDir::new().unwrap();
let cfg = LoggingCfg {
ingest_log_enabled: true,
ingest_log_dir: tmp.path().to_path_buf(),
..Default::default()
};
let mut writer = IngestLogWriter::open(&cfg).unwrap().unwrap();
let path = writer.path().to_path_buf();
writer
.write_event(&LogEvent::Error {
ts: now_ts(),
code: "test",
message: "drop flush test",
})
.unwrap();
// Drop without explicit flush — Drop impl should flush BufWriter.
drop(writer);
let contents = std::fs::read_to_string(&path).unwrap();
assert!(
contents.lines().count() >= 1,
"file should have at least 1 line after drop"
);
}
/// AC-7: keep_recent=3 with 5 files, oldest 2 should be deleted.
#[test]
fn cleanup_keeps_recent_n_drops_old() {
let tmp = TempDir::new().unwrap();
let dir = tmp.path();
// Create 5 files with mtime spread across 60 days
for i in 0..5u64 {
let path = dir.join(format!("ingest-file{i}.ndjson"));
std::fs::write(&path, b"x").unwrap();
// Set mtime: file 0 = newest, file 4 = 60 days old
let age_days = i * 15; // 0, 15, 30, 45, 60 days old
let mtime = SystemTime::now()
.checked_sub(std::time::Duration::from_secs(age_days * 86400))
.unwrap();
filetime::set_file_mtime(&path, filetime::FileTime::from_system_time(mtime)).unwrap();
}
// keep_recent=3, retention_days=90 (no time-based deletion)
cleanup_old_logs(dir, 3, 90).unwrap();
let remaining: Vec<_> = std::fs::read_dir(dir)
.unwrap()
.filter_map(Result::ok)
.collect();
assert_eq!(remaining.len(), 3, "expected 3 files after cleanup");
}
/// F5 OR-on-stale: files within keep_recent count but older than retention_days
/// must still be deleted.
#[test]
fn cleanup_drops_stale_even_within_count() {
let tmp = TempDir::new().unwrap();
let dir = tmp.path();
// 2 files, both 90 days old — well past retention_days=30
for i in 0..2u64 {
let path = dir.join(format!("ingest-old{i}.ndjson"));
std::fs::write(&path, b"x").unwrap();
let mtime = SystemTime::now()
.checked_sub(std::time::Duration::from_secs(90 * 86400))
.unwrap();
filetime::set_file_mtime(&path, filetime::FileTime::from_system_time(mtime)).unwrap();
}
// keep_recent=10 (both within count) but retention_days=30 → both stale
cleanup_old_logs(dir, 10, 30).unwrap();
let remaining: Vec<_> = std::fs::read_dir(dir)
.unwrap()
.filter_map(Result::ok)
.collect();
assert_eq!(
remaining.len(),
0,
"stale files must be deleted even within keep_recent"
);
}
}

View File

@@ -46,10 +46,21 @@ pub struct AggregateCounts {
/// Ordering invariant per design §2.4a:
///
/// ```text
/// ScanStarted < ScanCompleted < (AssetStarted < AssetFinished)*
/// < (Completed | Aborted)
/// ScanStarted < ScanCompleted
/// < ( AssetStarted
/// [< (PdfOcrStarted < PdfOcrFinished)*]
/// [< AssetChunked]
/// [< AssetTimings]
/// < AssetFinished )*
/// < (Completed | Aborted)
/// ```
///
/// `[]` = optional. `PdfOcr*` is per-PDF asset only (v0.20.0 sub-item 1).
/// `AssetChunked` / `AssetTimings` are the v0.24.0 asset-internal phase
/// events: `AssetChunked` fires once right after chunking (markdown /
/// image / PDF); `AssetTimings` reports per-phase wall-clock once
/// (markdown only).
///
/// Embed-batch events (`embed_batch_started` / `embed_batch_finished`
/// in §2.4a) are reserved for a future iteration and are not emitted
/// by this task; the spec calls them out as "임의 위치" (optional).
@@ -79,12 +90,82 @@ pub enum IngestEvent {
result: IngestItemKind,
chunks: u32,
},
/// v0.24.0 (additive): emitted right after an asset is chunked, before
/// expansion / embed / store. Surfaces "this document is N chunks"
/// immediately so a single large document no longer looks frozen at
/// `idx/total` while its per-chunk phases churn. `chunks` is the chunk
/// count for asset `idx`.
AssetChunked { idx: u32, total: u32, chunks: u32 },
/// v0.26.1 (additive): emitted when an asset enters a *slow* internal
/// phase, so the interactive progress bar can show **which** phase
/// (and which model) is currently running instead of looking frozen.
/// `phase` ∈ {`"ocr"`, `"caption"`, `"embed"`}; short phases
/// (parse / chunk / store) are intentionally *not* emitted to avoid
/// noise. `model` is the model performing the phase — the vision LLM
/// id for `ocr` / `caption`, the embedder `model_id` for `embed`
/// (`None` when the phase runs without a configured model, e.g. embed
/// with no embedder wired). Emitted once per (asset, phase); no
/// throttle needed (low frequency). Wire v1 consumers that predate
/// this variant simply ignore the unknown `asset_phase` kind.
AssetPhase {
idx: u32,
total: u32,
phase: String,
model: Option<String>,
},
/// v0.24.0 (additive): per-phase wall-clock (milliseconds) for asset
/// `idx`, emitted once the asset's pipeline finishes. Lets a user see
/// *where* the time went (parse / chunk / ocr / caption / embed /
/// store) without parsing logs. The markdown path leaves `ocr_ms` /
/// `caption_ms` at 0 (no image analysis); the image / PDF paths fill
/// them so the slowest-asset summary attributes vision-model time
/// correctly. `expansion_ms` is retained for wire compatibility but is
/// always 0 since doc-side expansion was removed (HOTFIXES 2026-06-03).
/// `ocr_ms` / `caption_ms` (v0.26.1) are additive with serde default 0
/// so pre-v0.26.1 consumers deserialize cleanly.
AssetTimings {
idx: u32,
total: u32,
parse_ms: u64,
chunk_ms: u64,
expansion_ms: u64,
embed_ms: u64,
store_ms: u64,
#[serde(default)]
ocr_ms: u64,
#[serde(default)]
caption_ms: u64,
},
/// Run finished normally. `counts` is the final aggregate.
Completed { counts: AggregateCounts },
/// Run finished by user cancellation. `counts` is the partial
/// aggregate at the cancel boundary. Emitted by `p9-fb-04`; this
/// task never produces `Aborted`.
Aborted { counts: AggregateCounts },
/// PDF page 별 OCR 시작 시 emit. v0.20.0 sub-item 1.
PdfOcrStarted { page: u32 },
/// PDF page 별 OCR 종료 시 emit. v0.20.0 sub-item 1.
/// `skipped` = `true` 일 시 OCR 미수행 (DCTDecode 부재 또는 engine 실패).
/// `chars = 0` 만으로는 "skip" 과 "0-char OCR result" 구분 불가, `skipped` field 가 명시적.
PdfOcrFinished {
page: u32,
ms: u64,
chars: u32,
ocr_engine: String,
skipped: bool,
/// v0.20.x ingest log: raster image byte size (additive minor, optional).
#[serde(skip_serializing_if = "Option::is_none")]
image_byte_size: Option<u64>,
/// v0.20.x ingest log: raster image width in pixels (additive minor, optional).
#[serde(skip_serializing_if = "Option::is_none")]
image_width: Option<u32>,
/// v0.20.x ingest log: raster image height in pixels (additive minor, optional).
#[serde(skip_serializing_if = "Option::is_none")]
image_height: Option<u32>,
/// v0.20.x ingest log: OCR failure reason (additive minor, optional).
#[serde(skip_serializing_if = "Option::is_none")]
failure_reason: Option<String>,
},
}
/// Map a `MediaType` to the short label used by `IngestEvent::AssetStarted`.
@@ -118,10 +199,7 @@ pub fn render_skipped_breakdown(map: &std::collections::BTreeMap<String, u32>) -
/// Best-effort send into an optional `mpsc::Sender`. A dropped receiver
/// is silently absorbed — the ingest hot path must not stall on a slow
/// consumer. Logged at `trace` for diagnostics.
pub(crate) fn emit(
progress: Option<&std::sync::mpsc::Sender<IngestEvent>>,
event: IngestEvent,
) {
pub(crate) fn emit(progress: Option<&std::sync::mpsc::Sender<IngestEvent>>, event: IngestEvent) {
if let Some(tx) = progress {
if tx.send(event).is_err() {
tracing::trace!(
@@ -165,13 +243,131 @@ mod tests {
media: "markdown".into(),
};
let v = serde_json::to_value(&ev).unwrap();
assert_eq!(v.get("kind").and_then(|s| s.as_str()), Some("asset_started"));
assert_eq!(v.get("idx").and_then(|n| n.as_u64()), Some(1));
assert_eq!(v.get("total").and_then(|n| n.as_u64()), Some(10));
assert_eq!(
v.get("kind").and_then(|s| s.as_str()),
Some("asset_started")
);
assert_eq!(v.get("idx").and_then(serde_json::Value::as_u64), Some(1));
assert_eq!(v.get("total").and_then(serde_json::Value::as_u64), Some(10));
assert_eq!(v.get("path").and_then(|s| s.as_str()), Some("notes/foo.md"));
assert_eq!(v.get("media").and_then(|s| s.as_str()), Some("markdown"));
}
#[test]
fn asset_chunked_serializes_with_discriminator() {
// v0.24.0 additive variant — `kind` must be snake_case
// `asset_chunked` so wire v1 consumers branch on it cleanly.
let ev = IngestEvent::AssetChunked {
idx: 3,
total: 10,
chunks: 142,
};
let v = serde_json::to_value(&ev).unwrap();
assert_eq!(
v.get("kind").and_then(|s| s.as_str()),
Some("asset_chunked")
);
assert_eq!(v.get("idx").and_then(serde_json::Value::as_u64), Some(3));
assert_eq!(
v.get("chunks").and_then(serde_json::Value::as_u64),
Some(142)
);
}
#[test]
fn asset_timings_serializes_all_phase_fields() {
let ev = IngestEvent::AssetTimings {
idx: 2,
total: 7,
parse_ms: 12,
chunk_ms: 3,
expansion_ms: 45_000,
embed_ms: 800,
store_ms: 20,
ocr_ms: 1_200,
caption_ms: 3_400,
};
let v = serde_json::to_value(&ev).unwrap();
assert_eq!(
v.get("kind").and_then(|s| s.as_str()),
Some("asset_timings")
);
// All phase fields are present (plain u64, always serialized).
for (field, want) in [
("parse_ms", 12u64),
("chunk_ms", 3),
("expansion_ms", 45_000),
("embed_ms", 800),
("store_ms", 20),
("ocr_ms", 1_200),
("caption_ms", 3_400),
] {
assert_eq!(
v.get(field).and_then(serde_json::Value::as_u64),
Some(want),
"field {field}"
);
}
}
#[test]
fn asset_timings_ocr_caption_default_to_zero_for_legacy_wire() {
// v0.26.1 additive: a pre-v0.26.1 wire payload omits ocr_ms /
// caption_ms; serde `default` must fill 0 so old producers stay
// compatible.
let legacy = serde_json::json!({
"kind": "asset_timings",
"idx": 1, "total": 1,
"parse_ms": 5, "chunk_ms": 2, "expansion_ms": 0,
"embed_ms": 10, "store_ms": 3
});
let ev: IngestEvent = serde_json::from_value(legacy).unwrap();
match ev {
IngestEvent::AssetTimings {
ocr_ms,
caption_ms,
embed_ms,
..
} => {
assert_eq!(ocr_ms, 0);
assert_eq!(caption_ms, 0);
assert_eq!(embed_ms, 10);
}
other => panic!("unexpected event: {other:?}"),
}
}
#[test]
fn asset_phase_serializes_with_discriminator() {
// v0.26.1 additive variant — `kind` must be snake_case
// `asset_phase`, `phase` is the slow-phase label, `model` the
// model id (nullable).
let ev = IngestEvent::AssetPhase {
idx: 4,
total: 12,
phase: "ocr".into(),
model: Some("gemma4:e4b".into()),
};
let v = serde_json::to_value(&ev).unwrap();
assert_eq!(v.get("kind").and_then(|s| s.as_str()), Some("asset_phase"));
assert_eq!(v.get("idx").and_then(serde_json::Value::as_u64), Some(4));
assert_eq!(v.get("phase").and_then(|s| s.as_str()), Some("ocr"));
assert_eq!(v.get("model").and_then(|s| s.as_str()), Some("gemma4:e4b"));
}
#[test]
fn asset_phase_model_none_serializes_as_null() {
let ev = IngestEvent::AssetPhase {
idx: 1,
total: 1,
phase: "embed".into(),
model: None,
};
let v = serde_json::to_value(&ev).unwrap();
assert_eq!(v.get("phase").and_then(|s| s.as_str()), Some("embed"));
assert!(v.get("model").is_some_and(serde_json::Value::is_null));
}
#[test]
fn ingest_event_completed_has_counts() {
let ev = IngestEvent::Completed {
@@ -184,8 +380,14 @@ mod tests {
let v = serde_json::to_value(&ev).unwrap();
assert_eq!(v.get("kind").and_then(|s| s.as_str()), Some("completed"));
let counts = v.get("counts").unwrap();
assert_eq!(counts.get("scanned").and_then(|n| n.as_u64()), Some(5));
assert_eq!(counts.get("new").and_then(|n| n.as_u64()), Some(2));
assert_eq!(
counts.get("scanned").and_then(serde_json::Value::as_u64),
Some(5)
);
assert_eq!(
counts.get("new").and_then(serde_json::Value::as_u64),
Some(2)
);
}
#[test]

File diff suppressed because it is too large Load Diff

View File

@@ -26,7 +26,9 @@ pub fn init(level: LogLevel) -> Result<WorkerGuard> {
let (nb, guard) = tracing_appender::non_blocking(file_appender);
let env_filter = match level {
LogLevel::Default => EnvFilter::try_from_default_env().unwrap_or_else(|_| EnvFilter::new("warn")),
LogLevel::Default => {
EnvFilter::try_from_default_env().unwrap_or_else(|_| EnvFilter::new("warn"))
}
LogLevel::Verbose => EnvFilter::new("info"),
LogLevel::Debug => EnvFilter::new("debug"),
};

View File

@@ -0,0 +1,362 @@
// crates/kebab-app/src/pdf_ocr_apply.rs
//
// PDF post-extract OCR enrichment. parser isolation 보존 — kebab-parse-pdf 가
// kebab-parse-image::OcrEngine 을 import 하지 않도록, helper 는 kebab-app 에 둠.
// image path 의 apply_ocr (kebab-parse-image::ocr::apply_ocr) 의
// PDF page 변형 — image 는 ImageRefBlock.ocr 를 mutate, PDF 는
// Block::Paragraph.text / inlines 를 in-place mutate (단일 OCR fallback) 또는
// 새 Block::Paragraph 를 push (always_on dual-block).
use std::sync::Arc;
use std::sync::atomic::AtomicBool;
use std::time::Instant;
use anyhow::{Context, Result};
use kebab_core::{
Block, CanonicalDocument, CommonBlock, Inline, Lang, ProvenanceEvent, ProvenanceKind,
SourceSpan, TextBlock, id_for_block,
};
use kebab_parse_image::OcrEngine;
use kebab_parse_pdf::{compute_valid_char_ratio, extract_dctdecode_page_image};
use lopdf::Document as LopdfDocument;
use time::OffsetDateTime;
use tracing::warn;
/// Extract width/height from a JPEG (or any image format) byte slice.
/// Returns `None` on corrupt / unsupported data — callers fall back to
/// `(None, None)` so OCR results remain valid (R-4 mitigation).
fn extract_image_dimensions(bytes: &[u8]) -> Option<(u32, u32)> {
use image::ImageReader;
ImageReader::new(std::io::Cursor::new(bytes))
.with_guessed_format()
.ok()?
.into_dimensions()
.ok()
}
/// Per-page OCR knobs threaded through [`apply_ocr_to_pdf_pages`].
/// Mirrors the `[pdf.ocr]` config block (spec §4.5); the facade
/// (`kebab_app::ingest_one_pdf_asset`) fills these from
/// `kebab_config::Config::pdf::ocr` plus runtime flags (CLI / SIGINT).
pub struct PdfOcrOpts {
/// Master switch. `false` short-circuits to
/// `PdfOcrSummary { pages_ocrd: 0, ms_total: 0 }` without lopdf reparse.
pub enabled: bool,
/// `true` → 모든 page OCR (dual-block path, new `Block::Paragraph` push).
/// `false` → text-detect block 의 `min_char_count` 또는
/// `valid_ratio_threshold` 미달인 page 만 OCR (in-place mutate).
pub always_on: bool,
/// 0.0..=1.0. text-detect block 의 `compute_valid_char_ratio` 가
/// 본 임계 미만이면 OCR fallback. Default `0.5`.
pub valid_ratio_threshold: f32,
/// text-detect block 의 char count 가 본 임계 미만이면 OCR fallback.
/// empty page (cover, blank separator) 자동 skip. Default `20`.
pub min_char_count: u32,
/// OCR engine 에 전달할 언어 힌트 (예: `Lang("kor".into())`).
/// `None` → no hint passed to engine.
pub lang_hint: Option<Lang>,
/// Optional per-page cancellation handle. checked at start of each page
/// loop iteration; set→true 시 `cancelled mid-PDF` error 반환. plan §6 E4
/// + verifier LOW L-1 resolution + spec §4.8 line 1159 명시.
pub cancel: Option<Arc<AtomicBool>>,
}
/// OCR run summary returned by [`apply_ocr_to_pdf_pages`] for the caller's
/// `IngestItem.pdf_ocr_pages` + `pdf_ocr_ms_total` wire fields (§4.6.2).
#[derive(Debug)]
pub struct PdfOcrSummary {
/// Number of pages 가 OCR pipeline 을 실제 통과 (skipped page 제외).
pub pages_ocrd: u32,
/// Cumulative wall-clock duration of successful OCR engine calls (ms).
/// `saturating_add` 사용 — 24-day cumulative 까지 overflow-safe.
pub ms_total: u64,
}
/// Post-extract OCR enrichment for PDF. Walks `canonical.blocks` page-by-page,
/// classifies each page via `text_quality::compute_valid_char_ratio` +
/// `min_char_count`, and either:
/// - skips (vector PDF + sufficient text + `always_on=false`),
/// - mutates the text-detect `Block::Paragraph` in-place with OCR output
/// (scanned/mojibake page), or
/// - pushes a new `Block::Paragraph` with dual ordinal (`always_on=true` +
/// vector page).
///
/// Errors:
/// - cancel handle (`opts.cancel = Some(true)`) → `Err("PDF OCR cancelled mid-PDF at page N")`.
/// - lopdf re-parse failure → `Err(...)`.
/// - per-page OCR engine failure 또는 DCTDecode 부재 → `ProvenanceKind::Warning`
/// event push + `emit_progress(Finished { skipped: true })` + continue
/// (no `Err` propagation).
///
/// See spec §4.1 + §4.4 for the full pipeline.
pub fn apply_ocr_to_pdf_pages<F>(
canonical: &mut CanonicalDocument,
engine: &dyn OcrEngine,
pdf_bytes: &[u8],
opts: &PdfOcrOpts,
mut emit_progress: F,
) -> Result<PdfOcrSummary>
where
F: FnMut(PdfOcrProgress),
{
if !opts.enabled {
return Ok(PdfOcrSummary {
pages_ocrd: 0,
ms_total: 0,
});
}
let pdf_doc = LopdfDocument::load_mem(pdf_bytes)
.context("kb-app::pdf_ocr_apply: re-parse PDF for image extract")?;
let page_count = pdf_doc.get_pages().len() as u32;
let mut new_events: Vec<ProvenanceEvent> = Vec::new();
let mut ocr_blocks: Vec<Block> = Vec::new();
let mut pages_ocrd: u32 = 0;
let mut ms_total: u64 = 0;
// canonical.blocks 의 page → block index map (text-detect block 의 in-place
// mutate 또는 dual-block push 결정용).
// PdfTextExtractor 가 page 마다 1 Block::Paragraph + SourceSpan::Page 를
// 생성 (§1.4) — 그 invariant 사용.
for page_num in 1..=page_count {
if let Some(cancel) = &opts.cancel {
if cancel.load(std::sync::atomic::Ordering::Relaxed) {
anyhow::bail!("PDF OCR cancelled mid-PDF at page {page_num}");
}
}
let text_block_idx = find_paragraph_block_idx(&canonical.blocks, page_num);
let text = match &canonical.blocks[text_block_idx] {
Block::Paragraph(tb) => tb.text.clone(),
_ => String::new(),
};
let chars = text.chars().count() as u32;
let valid_ratio = compute_valid_char_ratio(&text);
let needs_ocr = chars < opts.min_char_count || valid_ratio < opts.valid_ratio_threshold;
// 결정 matrix:
// always_on=true → 모든 page OCR (dual-block).
// always_on=false + needs_ocr → in-place OCR (text-detect block mutate).
// needs_ocr=false → skip.
let do_ocr = opts.always_on || needs_ocr;
if !do_ocr {
continue;
}
emit_progress(PdfOcrProgress::Started { page: page_num });
let page_image_bytes = if let Some(b) = extract_dctdecode_page_image(&pdf_doc, page_num)? {
b
} else {
let note = format!(
"page={page_num} skipped: no DCTDecode image XObject (vector PDF page or unsupported /Filter — v1 supports DCTDecode passthrough only; see release notes for normalization guidance)"
);
warn!(target: "kebab-app", "{}", note);
new_events.push(ProvenanceEvent {
at: OffsetDateTime::now_utc(),
agent: "kb-parse-pdf".to_string(),
kind: ProvenanceKind::Warning,
note: Some(note),
});
emit_progress(PdfOcrProgress::Finished {
page: page_num,
ms: 0,
chars: 0,
skipped: true,
image_byte_size: None,
image_width: None,
image_height: None,
failure_reason: None,
});
continue;
};
let start = Instant::now();
let ocr = match engine.recognize(&page_image_bytes, opts.lang_hint.as_ref()) {
Ok(t) => t,
Err(e) => {
// OCR failure: warning event + skip (text-detect block 그대로).
let note = format!(
"page={} OCR failed engine={} version={} err={}",
page_num,
engine.engine_name(),
engine.engine_version(),
e
);
warn!(target: "kebab-app", "{}", note);
new_events.push(ProvenanceEvent {
at: OffsetDateTime::now_utc(),
agent: "kb-parse-pdf".to_string(),
kind: ProvenanceKind::Warning,
note: Some(note),
});
let (image_width, image_height) = extract_image_dimensions(&page_image_bytes)
.map_or((None, None), |(w, h)| (Some(w), Some(h)));
emit_progress(PdfOcrProgress::Finished {
page: page_num,
ms: start.elapsed().as_millis() as u64,
chars: 0,
skipped: true,
image_byte_size: Some(page_image_bytes.len() as u64),
image_width,
image_height,
failure_reason: Some("ocr_error".to_string()),
});
continue;
}
};
let elapsed_ms = start.elapsed().as_millis() as u64;
let chars_ocr = ocr.joined.chars().count() as u32;
pages_ocrd = pages_ocrd.saturating_add(1);
ms_total = ms_total.saturating_add(elapsed_ms);
if opts.always_on && !needs_ocr {
// dual-block path: 새 Block::Paragraph push, ordinal = page-1 + page_count.
let ocr_ordinal = (page_num - 1) + page_count;
let span_ocr = SourceSpan::Page {
page: page_num,
char_start: Some(0),
char_end: Some(chars_ocr),
};
let block_id =
id_for_block(&canonical.doc_id, "paragraph", &[], ocr_ordinal, &span_ocr);
let common = CommonBlock {
block_id,
heading_path: Vec::new(),
source_span: span_ocr,
};
ocr_blocks.push(Block::Paragraph(TextBlock {
common,
text: ocr.joined.clone(),
inlines: if ocr.joined.is_empty() {
Vec::new()
} else {
vec![Inline::Text {
text: ocr.joined.clone(),
}]
},
}));
} else {
// in-place mutate: text-detect block (빈 또는 low-valid) 의 text/inlines 교체.
// block_id / ordinal 보존 — span 의 char_end 만 갱신.
if let Block::Paragraph(tb) = &mut canonical.blocks[text_block_idx] {
tb.text = ocr.joined.clone();
tb.inlines = if ocr.joined.is_empty() {
Vec::new()
} else {
vec![Inline::Text {
text: ocr.joined.clone(),
}]
};
if let SourceSpan::Page { char_end, .. } = &mut tb.common.source_span {
*char_end = Some(chars_ocr);
}
}
}
new_events.push(ProvenanceEvent {
at: OffsetDateTime::now_utc(),
agent: "kb-parse-pdf".to_string(),
kind: ProvenanceKind::OcrApplied,
note: Some(format!(
"page={} engine={} version={} regions={} ms={} chars={}",
page_num,
engine.engine_name(),
engine.engine_version(),
ocr.regions.len(),
elapsed_ms,
chars_ocr
)),
});
let (image_width, image_height) = extract_image_dimensions(&page_image_bytes)
.map_or((None, None), |(w, h)| (Some(w), Some(h)));
emit_progress(PdfOcrProgress::Finished {
page: page_num,
ms: elapsed_ms,
chars: chars_ocr,
skipped: false,
image_byte_size: Some(page_image_bytes.len() as u64),
image_width,
image_height,
failure_reason: None,
});
}
canonical.blocks.extend(ocr_blocks);
canonical.provenance.events.extend(new_events);
Ok(PdfOcrSummary {
pages_ocrd,
ms_total,
})
}
fn find_paragraph_block_idx(blocks: &[Block], page_num: u32) -> usize {
blocks
.iter()
.position(|b| match b {
Block::Paragraph(tb) => matches!(
tb.common.source_span,
SourceSpan::Page { page, .. } if page == page_num
),
_ => false,
})
.expect("PdfTextExtractor emits 1 Block::Paragraph per page (invariant)")
}
/// Per-page OCR progress event 가 caller 의 `emit_progress` closure 호출 시 emit.
/// Step 6 의 ingest_one_pdf_asset 가 IngestEvent::PdfOcrStarted / PdfOcrFinished
/// 로 carry (spec §4.6.1 wire schema).
pub enum PdfOcrProgress {
/// page 별 OCR 시작 시 emit. `engine.recognize` 호출 직전.
Started {
/// 1-based PDF page number.
page: u32,
},
/// page 별 OCR 종료 시 emit (성공 / skip / failure 모두).
Finished {
/// 1-based PDF page number.
page: u32,
/// `engine.recognize` wall-clock duration. skip path 의 의미는 mixed
/// (DCTDecode 부재 시 `0`, OCR engine 실패 시 actual latency before bail).
ms: u64,
/// OCR result text 의 char count. skip 시 `0`.
chars: u32,
/// `true` = DCTDecode 부재 또는 OCR engine 실패 로 skip.
/// `false` = 정상 OCR 완료.
skipped: bool,
/// v0.20.x ingest log: raster image byte size (additive, optional).
image_byte_size: Option<u64>,
/// v0.20.x ingest log: raster image width in pixels (additive, optional).
image_width: Option<u32>,
/// v0.20.x ingest log: raster image height in pixels (additive, optional).
image_height: Option<u32>,
/// v0.20.x ingest log: failure reason string when OCR failed (additive, optional).
/// Values: "timeout" | "ocr_error" | "network_error" | None (success).
failure_reason: Option<String>,
},
}
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn extract_image_dimensions_valid_jpeg() {
let img = image::RgbImage::new(16, 12);
let mut bytes = Vec::new();
image::DynamicImage::from(img)
.write_to(
&mut std::io::Cursor::new(&mut bytes),
image::ImageFormat::Jpeg,
)
.expect("encode jpeg");
assert_eq!(extract_image_dimensions(&bytes), Some((16, 12)));
}
#[test]
fn extract_image_dimensions_corrupt_returns_none() {
assert_eq!(extract_image_dimensions(b"not a jpeg"), None);
}
}

View File

@@ -9,13 +9,19 @@
//!
//! `--vector-only` additionally truncates `embedding_records` in SQLite
//! so the next `kebab ingest` re-embeds cleanly without orphan rows.
//!
//! `--orphans-only` purges stored docs that are outside the current walker
//! scope (config narrowing / removed sub-directory). No filesystem paths are
//! removed — this is purely a store-level reconciliation.
use std::collections::HashSet;
use std::path::PathBuf;
use anyhow::{Context, Result};
use serde::{Deserialize, Serialize};
use kebab_config::{Config, expand_path};
use kebab_core::WorkspacePath;
/// What the user asked to remove. Mutually exclusive — picked by the CLI
/// from a clap `ArgGroup`.
@@ -32,6 +38,13 @@ pub enum ResetScope {
VectorOnly,
/// Wipe only the config dir.
ConfigOnly,
/// Purge stored docs that are outside the current walker scope (no
/// filesystem paths are removed). Filesystem existence is NOT checked —
/// anything the current walker would not visit is considered an orphan.
/// The explicit complement to the conservative `sweep_deleted_files`
/// that runs during ingest (which leaves on-disk-but-out-of-scope docs
/// alone for data safety).
OrphansOnly,
}
/// Result of a successful wipe — emitted as `reset_report.v1` by the
@@ -41,6 +54,16 @@ pub struct ResetReport {
pub scope: ResetScope,
pub removed_paths: Vec<PathBuf>,
pub embedding_rows_truncated: u64,
/// Number of stored docs purged because they are outside the current
/// walker scope. Non-zero only when `scope == OrphansOnly`.
/// `#[serde(default)]` preserves back-compat with older callers that
/// do not include this field.
#[serde(default)]
pub orphans_purged: u32,
/// Paths of the orphaned docs that were purged. Sorted for deterministic
/// output. Non-empty only when `scope == OrphansOnly`.
#[serde(default)]
pub purged_paths: Vec<WorkspacePath>,
}
/// Compute the absolute on-disk paths a given scope will wipe, given a
@@ -62,11 +85,14 @@ pub fn enumerate_paths(scope: ResetScope, cfg: &Config) -> Vec<PathBuf> {
ResetScope::All => vec![cfg_dir, data_dir, cache_dir, state_dir],
ResetScope::DataOnly => vec![data_dir, cache_dir, state_dir],
ResetScope::VectorOnly => {
let vector_dir =
expand_path(&cfg.storage.vector_dir, &data_dir.to_string_lossy());
let vector_dir = expand_path(&cfg.storage.vector_dir, &data_dir.to_string_lossy());
vec![vector_dir]
}
ResetScope::ConfigOnly => vec![cfg_dir],
// OrphansOnly operates purely at the store level — no filesystem paths
// are removed. Return empty so `estimate_size_bytes` stays zero and
// the existing confirm UI path for directory wipes is skipped.
ResetScope::OrphansOnly => vec![],
}
}
@@ -96,16 +122,79 @@ pub fn estimate_size_bytes(paths: &[PathBuf]) -> u64 {
paths.iter().map(|p| walk(p)).sum()
}
/// Compute the workspace paths stored in SQLite that are NOT visited by
/// the current walker scope (i.e. they are "orphans" — on disk but
/// outside the configured include/exclude rules, or from a sub-directory
/// that has since been removed from the workspace).
///
/// Does NOT check filesystem existence — `OrphansOnly` is the explicit
/// "I know what I'm doing" variant; callers that want the conservative
/// fs-aware sweep should use `sweep_deleted_files` inside ingest.
///
/// Returns the list sorted for deterministic output. Called twice by the
/// CLI path (once for the confirm UI preview, once inside `execute`);
/// the double scan is acceptable for a rare destructive operation.
pub fn enumerate_orphans(cfg: &Config) -> Result<Vec<WorkspacePath>> {
use kebab_core::DocumentStore as _;
use kebab_core::SourceScope;
use kebab_source_fs::FsSourceConnector;
let store = kebab_store_sqlite::SqliteStore::open(cfg)
.context("enumerate_orphans: open SqliteStore")?;
let stored = store
.all_workspace_paths()
.context("enumerate_orphans: all_workspace_paths")?;
if stored.is_empty() {
return Ok(Vec::new());
}
// Build the same SourceScope the CLI's ingest path uses: root from
// config, exclude list from config, no include override (full scope).
let root = cfg.resolve_workspace_root();
let scope = SourceScope {
root: root.clone(),
exclude: cfg.workspace.exclude.clone(),
..Default::default()
};
let connector =
FsSourceConnector::new(cfg).context("enumerate_orphans: build FsSourceConnector")?;
let (assets, _skips) = connector
.scan_with_skips(&scope)
.context("enumerate_orphans: scan workspace")?;
let scanned: HashSet<WorkspacePath> = assets.into_iter().map(|a| a.workspace_path).collect();
let mut orphans: Vec<WorkspacePath> = stored
.into_iter()
.filter(|p| !scanned.contains(p))
.collect();
orphans.sort_by(|a, b| a.0.cmp(&b.0));
Ok(orphans)
}
/// Wipe every path from `enumerate_paths(scope, cfg)`. For
/// `ResetScope::VectorOnly`, also truncates the SQLite
/// `embedding_records` table so the store doesn't point at the Lance
/// rows we just removed off-disk.
///
/// For `ResetScope::OrphansOnly`, no filesystem directories are removed.
/// Instead the store is reconciled: stored docs outside the current walker
/// scope are purged from SQLite (+ vector store when configured). The
/// caller is expected to have already shown the confirm UI using
/// `enumerate_orphans`.
///
/// Idempotent: a missing path is treated as already-removed (success).
/// Returns a `ResetReport` listing exactly what was removed (paths that
/// existed before the call) so `--json` callers see the truth, not the
/// request.
pub fn execute(scope: ResetScope, cfg: &Config) -> Result<ResetReport> {
if matches!(scope, ResetScope::OrphansOnly) {
return execute_orphans_only(cfg);
}
let paths = enumerate_paths(scope, cfg);
let mut removed = Vec::new();
@@ -113,8 +202,7 @@ pub fn execute(scope: ResetScope, cfg: &Config) -> Result<ResetReport> {
if !p.exists() {
continue;
}
std::fs::remove_dir_all(p)
.with_context(|| format!("remove {}", p.display()))?;
std::fs::remove_dir_all(p).with_context(|| format!("remove {}", p.display()))?;
removed.push(p.clone());
}
@@ -128,9 +216,99 @@ pub fn execute(scope: ResetScope, cfg: &Config) -> Result<ResetReport> {
scope,
removed_paths: removed,
embedding_rows_truncated,
orphans_purged: 0,
purged_paths: Vec::new(),
})
}
/// Execute the `OrphansOnly` variant: reconcile stored docs against the
/// current walker scope without touching any filesystem directory.
fn execute_orphans_only(cfg: &Config) -> Result<ResetReport> {
let orphans = enumerate_orphans(cfg).context("execute_orphans_only: enumerate orphans")?;
if orphans.is_empty() {
return Ok(ResetReport {
scope: ResetScope::OrphansOnly,
removed_paths: Vec::new(),
embedding_rows_truncated: 0,
orphans_purged: 0,
purged_paths: Vec::new(),
});
}
let store = std::sync::Arc::new(
kebab_store_sqlite::SqliteStore::open(cfg)
.context("execute_orphans_only: open SqliteStore")?,
);
// Open vector store if configured. Mirror the same guard the ingest
// path uses: only construct when the provider is not "none" / dims > 0.
let vector_store: Option<kebab_store_vector::LanceVectorStore> =
open_vector_store_if_configured(cfg, store.clone())?;
let mut purged_paths: Vec<WorkspacePath> = Vec::new();
for path in &orphans {
let chunk_ids = kebab_store_sqlite::purge_deleted_workspace_path(&store, path)
.with_context(|| format!("execute_orphans_only: purge {}", path.0))?;
if let Some(ref vs) = vector_store {
if !chunk_ids.is_empty() {
use kebab_core::VectorStore as _;
if let Err(e) = vs.delete_by_chunk_ids(&chunk_ids) {
tracing::warn!(
target: "kebab-app",
path = %path.0,
count = chunk_ids.len(),
error = %e,
"reset --orphans-only: vector delete failed; SQLite side already cleaned"
);
}
}
}
tracing::info!(
target: "kebab-app",
path = %path.0,
"reset --orphans-only: purged orphan document"
);
purged_paths.push(path.clone());
}
let orphans_purged = u32::try_from(purged_paths.len()).unwrap_or(u32::MAX);
Ok(ResetReport {
scope: ResetScope::OrphansOnly,
removed_paths: Vec::new(),
embedding_rows_truncated: 0,
orphans_purged,
purged_paths,
})
}
/// Open the Lance vector store if the configured embedding provider is
/// active (non-"none", dimensions > 0). Returns `None` for lexical-only
/// configs. Mirrors the guard in `App::vector`.
fn open_vector_store_if_configured(
cfg: &Config,
store: std::sync::Arc<kebab_store_sqlite::SqliteStore>,
) -> Result<Option<kebab_store_vector::LanceVectorStore>> {
if cfg.models.embedding.provider == "none" || cfg.models.embedding.dimensions == 0 {
return Ok(None);
}
match kebab_store_vector::LanceVectorStore::new(cfg, store) {
Ok(vs) => Ok(Some(vs)),
Err(e) => {
tracing::warn!(
target: "kebab-app",
error = %e,
"reset --orphans-only: could not open vector store; skipping vector delete"
);
Ok(None)
}
}
}
/// Open the SQLite store at the configured path and run
/// `truncate_embedding_records`. Returns the count of truncated rows
/// (the helper itself reports `DELETE` rowcount). If the SQLite file
@@ -200,4 +378,14 @@ mod tests {
let bytes = estimate_size_bytes(&[dir.path().to_path_buf()]);
assert_eq!(bytes, 5 + 6);
}
#[test]
fn enumerate_orphans_only_returns_empty_paths() {
let cfg = Config::defaults();
let paths = enumerate_paths(ResetScope::OrphansOnly, &cfg);
assert!(
paths.is_empty(),
"OrphansOnly must return empty vec from enumerate_paths"
);
}
}

View File

@@ -39,6 +39,14 @@ pub struct Capabilities {
pub struct Models {
pub parser_version: String,
pub chunker_version: String,
/// v0.20.1+ (Bug #13). Corpus 안 활성 parser version 전체.
/// 빈 corpus → empty Vec. backward compat: `parser_version` field 보존.
#[serde(default)]
pub active_parsers: Vec<String>,
/// v0.20.1+ (Bug #13). Corpus 안 활성 chunker version 전체.
/// 빈 corpus → empty Vec.
#[serde(default)]
pub active_chunkers: Vec<String>,
pub embedding_version: String,
pub prompt_template_version: String,
pub index_version: String,
@@ -63,14 +71,26 @@ pub struct Stats {
/// p9-fb-37: docs whose `updated_at` exceeds the staleness threshold.
#[serde(default)]
pub stale_doc_count: u64,
/// p10-1A-1: code language breakdown (chunk counts by canonical lowercase
/// language identifier). Empty until 1A-2 produces code chunks.
/// p10-1A-1: code language breakdown (**doc** counts by canonical
/// lowercase language identifier). Empty until 1A-2 produces code
/// docs. v0.17.0 PR-C: doc-count semantics corrected here (the
/// previous "chunk counts" wording was a longstanding mis-label —
/// implementation has always been `COUNT(*) FROM documents
/// GROUP BY code_lang`). Use `code_lang_chunk_breakdown` for the
/// chunk-level companion.
#[serde(default)]
pub code_lang_breakdown: std::collections::BTreeMap<String, u32>,
/// p10-1A-1: repo breakdown (chunk counts by `metadata.repo` value).
/// Empty until 1A-2 produces code chunks.
/// p10-1A-1: repo breakdown (**doc** counts by `metadata.repo`
/// value). Empty until 1A-2 produces code docs. v0.17.0 PR-C:
/// doc-count wording corrected (mirror of code_lang_breakdown).
#[serde(default)]
pub repo_breakdown: std::collections::BTreeMap<String, u32>,
/// v0.17.0 PR-C: sister of [`Self::code_lang_breakdown`] returning
/// chunk counts instead of doc counts. Indexing-pressure metric —
/// one PDF spec → 200 chunks vs one Rust file → 5 chunks shows up
/// here in a way `code_lang_breakdown` (doc count) hides.
#[serde(default)]
pub code_lang_chunk_breakdown: std::collections::BTreeMap<String, u32>,
}
const KEBAB_VERSION: &str = env!("CARGO_PKG_VERSION");
@@ -88,6 +108,7 @@ const WIRE_SCHEMAS: &[&str] = &[
"doc_summary.v1",
"chunk_inspection.v1",
"doctor.v1",
"config_migration.v1",
"ingest_report.v1",
"ingest_progress.v1",
"reset_report.v1",
@@ -96,6 +117,9 @@ const WIRE_SCHEMAS: &[&str] = &[
"error.v1",
"bulk_search_item.v1",
"bulk_search_response.v1",
// v0.20.x r2 Enhancement 3: OCR statistics + failures introspection.
"ocr_stats.v1",
"ocr_failures.v1",
];
/// Build a [`SchemaV1`] introspection report for the given config.
@@ -130,10 +154,10 @@ fn capabilities_snapshot() -> Capabilities {
rag_multi_turn: true,
search_cache: true,
incremental_ingest: true,
streaming_ask: false,
streaming_ask: true,
http_daemon: false,
mcp_server: true,
single_file_ingest: false,
single_file_ingest: true,
bulk_search: true,
}
}
@@ -148,12 +172,8 @@ fn open_store_for_stats(cfg: &Config) -> anyhow::Result<kebab_store_sqlite::Sqli
kebab_store_sqlite::SqliteStore::open_existing(&db_path)
}
fn collect_stats(
cfg: &Config,
store: &kebab_store_sqlite::SqliteStore,
) -> anyhow::Result<Stats> {
let counts = store
.count_summary_with_threshold(cfg.search.stale_threshold_days as u64)?;
fn collect_stats(cfg: &Config, store: &kebab_store_sqlite::SqliteStore) -> anyhow::Result<Stats> {
let counts = store.count_summary_with_threshold(u64::from(cfg.search.stale_threshold_days))?;
let data_dir = kebab_config::expand_path(&cfg.storage.data_dir, "");
let index_bytes = kebab_store_sqlite::stats_ext::index_bytes(&data_dir)
.map_err(|e| anyhow::anyhow!("index_bytes: {e}"))?;
@@ -171,16 +191,23 @@ fn collect_stats(
// p10-1A-2 follow-up: dogfooding (2026-05-20) revealed this was a
// placeholder — mirror of code_lang_breakdown for the repo field.
repo_breakdown: store.repo_breakdown()?,
// v0.17.0 PR-C: chunk-level companion (closes HOTFIXES
// 2026-05-22 "code_lang_breakdown chunk granularity" LOW).
code_lang_chunk_breakdown: store.code_lang_chunk_breakdown()?,
})
}
fn collect_models(cfg: &Config, store: &kebab_store_sqlite::SqliteStore) -> Models {
let active_parsers = store.fetch_distinct_parser_versions().unwrap_or_default();
let active_chunkers = store.fetch_distinct_chunker_versions().unwrap_or_default();
Models {
// markdown parser only — pdf-page-v1 (P7) / image extractors (P6)
// maintain their own versions; surface those when SchemaV1.models
// becomes a multi-medium map (P+).
parser_version: kebab_parse_md::PARSER_VERSION.to_string(),
chunker_version: cfg.chunking.chunker_version.clone(),
active_parsers,
active_chunkers,
// EmbeddingModelCfg uses `.model` (not `.id`) — adapt from plan.
embedding_version: cfg.models.embedding.model.clone(),
prompt_template_version: cfg.rag.prompt_template_version.clone(),
@@ -210,6 +237,11 @@ mod tests_stats_ext {
v.get("repo_breakdown").is_some(),
"Stats JSON must include repo_breakdown: {v}"
);
// v0.17.0 PR-C: chunk-level companion field.
assert!(
v.get("code_lang_chunk_breakdown").is_some(),
"Stats JSON must include code_lang_chunk_breakdown (v0.17.0 PR-C): {v}"
);
// Empty BTreeMap serializes as `{}` — confirm it's an object, not null.
assert!(
v["code_lang_breakdown"].is_object(),
@@ -219,6 +251,10 @@ mod tests_stats_ext {
v["repo_breakdown"].is_object(),
"repo_breakdown must be an object: {v}"
);
assert!(
v["code_lang_chunk_breakdown"].is_object(),
"code_lang_chunk_breakdown must be an object: {v}"
);
}
#[test]
@@ -244,3 +280,27 @@ mod tests_stats_ext {
assert_eq!(s.stats.stale_doc_count, 0);
}
}
#[cfg(test)]
mod tests_capabilities {
use super::*;
#[test]
fn capabilities_streaming_ask_matches_cli_surface() {
// Bug #9: kebab ask --stream 가 answer_event.v1 ndjson 191 event 정상 emit →
// capabilities.streaming_ask 가 true 여야 함.
let caps = capabilities_snapshot();
assert!(caps.streaming_ask, "streaming_ask must be true (Bug #9)");
}
#[test]
fn capabilities_single_file_ingest_matches_cli_surface() {
// Bug #9: kebab ingest-file <path> + kebab ingest-stdin --title <T> 양쪽 모두
// ingest_report.v1 정상 emit → capabilities.single_file_ingest 가 true 여야 함.
let caps = capabilities_snapshot();
assert!(
caps.single_file_ingest,
"single_file_ingest must be true (Bug #9)"
);
}
}

View File

@@ -10,11 +10,7 @@ use kebab_core::SearchHit;
///
/// p9-fb-32: mirrored in `kebab_rag::pipeline::compute_stale` (dep-boundary
/// rule prevents `kebab-rag → kebab-app`). Update both together.
pub fn compute_stale(
indexed_at: OffsetDateTime,
now: OffsetDateTime,
threshold_days: u32,
) -> bool {
pub fn compute_stale(indexed_at: OffsetDateTime, now: OffsetDateTime, threshold_days: u32) -> bool {
if threshold_days == 0 {
return false;
}
@@ -23,11 +19,7 @@ pub fn compute_stale(
}
/// Sets `stale` on each hit in place using `compute_stale`.
pub fn mark_stale_in_place(
hits: &mut [SearchHit],
now: OffsetDateTime,
threshold_days: u32,
) {
pub fn mark_stale_in_place(hits: &mut [SearchHit], now: OffsetDateTime, threshold_days: u32) {
for h in hits {
h.stale = compute_stale(h.indexed_at, now, threshold_days);
}

View File

@@ -33,6 +33,7 @@ fn ask_lexical_smoke() {
history: Vec::new(),
conversation_id: None,
turn_index: None,
multi_hop: false,
};
// The fixture workspace contains "ownership" content; the model's
// citation behavior depends on its training, so we don't assert on

File diff suppressed because it is too large Load Diff

View File

@@ -0,0 +1,60 @@
use std::sync::Mutex;
use anyhow::Result;
use kebab_core::{Lang, OcrText};
use kebab_parse_image::OcrEngine;
pub struct MockOcrEngine {
expected_texts: Vec<String>,
call_index: Mutex<usize>,
fail: bool,
}
impl MockOcrEngine {
/// Single text (backward-compat ctor for pdf_ocr_apply.rs 10 sites).
pub fn single(text: impl Into<String>, fail: bool) -> Self {
Self {
expected_texts: vec![text.into()],
call_index: Mutex::new(0),
fail,
}
}
/// Per-page texts (cursor advances per recognize call).
pub fn per_page(texts: Vec<String>, fail: bool) -> Self {
Self {
expected_texts: texts,
call_index: Mutex::new(0),
fail,
}
}
}
impl OcrEngine for MockOcrEngine {
fn engine_name(&self) -> &'static str {
"mock-ocr"
}
fn engine_version(&self) -> String {
"mock-v1".to_string()
}
fn recognize(&self, _img: &[u8], _hint: Option<&Lang>) -> Result<OcrText> {
if self.fail {
anyhow::bail!("mock failure");
}
let mut idx = self.call_index.lock().unwrap();
let text = self
.expected_texts
.get(*idx)
.cloned()
.unwrap_or_else(|| self.expected_texts.last().cloned().unwrap_or_default());
*idx += 1;
Ok(OcrText {
joined: text,
regions: vec![],
engine: "mock-ocr".to_string(),
engine_version: "mock-v1".to_string(),
})
}
}

View File

@@ -93,8 +93,7 @@ impl TestEnv {
/// directly. Caller can invoke this multiple times to simulate
/// re-opening the binary after a corpus revision bump.
pub fn app(&self) -> kebab_app::App {
kebab_app::App::open_with_config(self.config.clone())
.expect("App::open_with_config")
kebab_app::App::open_with_config(self.config.clone()).expect("App::open_with_config")
}
}
@@ -169,3 +168,5 @@ fn copy_dir_recursive(src: &Path, dest: &Path) {
}
}
}
pub mod mock_ocr;

View File

@@ -0,0 +1,85 @@
use std::fs;
#[test]
fn migrate_writes_backup_and_atomic_with_dry_run_noop() {
let dir = tempfile::tempdir().unwrap();
let cfg = dir.path().join("config.toml");
fs::write(
&cfg,
"schema_version = 1\n\n[workspace]\nroot = \"/n\"\ninclude = [\"*.md\"]\n",
)
.unwrap();
// dry-run: 파일·백업 미변경.
let report = kebab_app::config_migrate_with_config_path(Some(&cfg), true).unwrap();
assert!(report.changed);
assert!(report.dry_run);
assert!(report.backup_path.is_none());
assert!(!dir.path().join("config.toml.bak").exists());
assert!(
fs::read_to_string(&cfg).unwrap().contains("include"),
"dry-run modified file"
);
// 실제 적용: 백업 생성 + 파일 갱신.
let report = kebab_app::config_migrate_with_config_path(Some(&cfg), false).unwrap();
assert!(report.changed);
assert!(!report.dry_run);
assert!(report.backup_path.is_some());
assert!(dir.path().join("config.toml.bak").exists());
let new = fs::read_to_string(&cfg).unwrap();
assert!(!new.contains("include"));
assert!(new.contains("[ingest.code]"));
// 멱등: 재실행 changed=false.
let report = kebab_app::config_migrate_with_config_path(Some(&cfg), false).unwrap();
assert!(!report.changed);
}
#[test]
fn migrate_missing_file_errors() {
let dir = tempfile::tempdir().unwrap();
let cfg = dir.path().join("nope.toml");
assert!(kebab_app::config_migrate_with_config_path(Some(&cfg), false).is_err());
}
#[test]
fn annotated_default_serialization_contains_section_comments() {
let doc = kebab_config::migrate::annotated_default_document();
let text = doc.to_string();
assert!(
text.contains("code ingest skip 정책"),
"section comment missing:\n{text}"
);
assert!(text.contains("[ingest.code]"));
}
#[test]
fn doctor_flags_outdated_config() {
let dir = tempfile::tempdir().unwrap();
let cfg = dir.path().join("config.toml");
fs::write(
&cfg,
"schema_version = 1\n\n[workspace]\nroot = \"/n\"\ninclude=[\"*.md\"]\n",
)
.unwrap();
let report = kebab_app::doctor_with_config_path(Some(&cfg)).unwrap();
let check = report
.checks
.iter()
.find(|c| c.name == "config_migration")
.unwrap();
assert!(!check.ok, "outdated config should fail check");
assert!(check.hint.as_deref().unwrap().contains("config migrate"));
assert!(!report.ok, "overall doctor should be false");
// migrate 후엔 통과.
kebab_app::config_migrate_with_config_path(Some(&cfg), false).unwrap();
let report = kebab_app::doctor_with_config_path(Some(&cfg)).unwrap();
let check = report
.checks
.iter()
.find(|c| c.name == "config_migration")
.unwrap();
assert!(check.ok, "after migrate should pass");
}

View File

@@ -12,7 +12,11 @@ fn open(env: &common::TestEnv) -> App {
#[test]
fn fetch_chunk_returns_target_only_when_no_context() {
let env = common::TestEnv::new();
common::ingest_md(&env, "a.md", "# Title\n\nFirst paragraph.\n\n## Section\n\nSecond.\n");
common::ingest_md(
&env,
"a.md",
"# Title\n\nFirst paragraph.\n\n## Section\n\nSecond.\n",
);
let app = open(&env);
// Find a chunk via search to obtain its id.
@@ -38,12 +42,17 @@ fn fetch_chunk_returns_target_only_when_no_context() {
#[test]
fn fetch_chunk_with_context_returns_neighbors() {
let env = common::TestEnv::new();
let body = "# H1\n\nA1\n\n# H2\n\nA2\n\n# H3\n\nA3\n\n# H4\n\nA4\n\n# H5\n\nA5\n";
// v0.17.0 trigram tokenizer: terms must be ≥3 Unicode chars to
// match. The earlier fixture used 2-char tokens like `A1`/`A3` for
// section bodies — those zero-hit under trigram. Use 5-char unique
// words per section so the query can pin one chunk deterministically.
let body =
"# H1\n\napples\n\n# H2\n\nbanana\n\n# H3\n\ncherry\n\n# H4\n\ndurian\n\n# H5\n\nelder\n";
common::ingest_md(&env, "multi.md", body);
let app = env.app();
let q = kebab_core::SearchQuery {
text: "A3".to_string(),
text: "cherry".to_string(),
mode: kebab_core::SearchMode::Lexical,
k: 1,
filters: kebab_core::SearchFilters::default(),
@@ -106,7 +115,10 @@ fn fetch_doc_returns_serialized_markdown() {
.unwrap();
assert_eq!(result.kind, FetchKind::Doc);
let text = result.text.expect("doc text");
assert!(text.contains("Heading One"), "doc text contains heading: {text:?}");
assert!(
text.contains("Heading One"),
"doc text contains heading: {text:?}"
);
assert!(text.contains("First paragraph"), "doc text contains body");
assert!(!result.truncated);
}
@@ -151,7 +163,11 @@ fn fetch_doc_with_max_tokens_truncates() {
.unwrap();
assert!(result.truncated);
let text = result.text.expect("doc text");
assert!(text.chars().count() <= 100, "trimmed text len {}", text.chars().count());
assert!(
text.chars().count() <= 100,
"trimmed text len {}",
text.chars().count()
);
}
#[test]
@@ -288,8 +304,7 @@ fn fetch_span_line_start_beyond_total_returns_empty_text() {
fn fetch_chunk_context_at_first_chunk_clamps_lower_bound() {
let env = common::TestEnv::new();
// Multi-chunk markdown so context ±N has neighbors.
let body =
"# H1\n\nFirst chunk text body.\n\n# H2\n\nSecond chunk.\n\n# H3\n\nThird chunk.\n";
let body = "# H1\n\nFirst chunk text body.\n\n# H2\n\nSecond chunk.\n\n# H3\n\nThird chunk.\n";
common::ingest_md(&env, "boundary.md", body);
let app = env.app();
let q = kebab_core::SearchQuery {

View File

@@ -16,8 +16,8 @@
mod common;
use common::TestEnv;
use kebab_app::ingest_with_config_opts;
use kebab_app::IngestOpts;
use kebab_app::ingest_with_config_opts;
use kebab_core::{DocFilter, DocumentStore, SearchMode, SearchQuery, SourceScope};
/// Helper: open the store via `TestEnv` and run `list_documents`.
@@ -125,17 +125,10 @@ fn include_scope_narrowing_does_not_purge() {
include: vec!["**/*.rs".to_string()],
exclude: env.config.workspace.exclude.clone(),
};
let first = ingest_with_config_opts(
env.config.clone(),
wide_scope,
false,
IngestOpts::default(),
)
.expect("first ingest (wide) must succeed");
assert!(
first.new >= 2,
"expected at least 2 new docs: {first:?}"
);
let first =
ingest_with_config_opts(env.config.clone(), wide_scope, false, IngestOpts::default())
.expect("first ingest (wide) must succeed");
assert!(first.new >= 2, "expected at least 2 new docs: {first:?}");
assert_eq!(
first.purged_deleted_files, 0,
"no purges on first ingest: {first:?}"

View File

@@ -24,8 +24,7 @@ use wiremock::{Mock, MockServer, ResponseTemplate};
/// inspectable in stored DB rows.
fn write_red_png(root: &Path, name: &str) -> std::path::PathBuf {
use image::{ImageBuffer, Rgb};
let img: ImageBuffer<Rgb<u8>, _> =
ImageBuffer::from_fn(100, 50, |_, _| Rgb([255, 0, 0]));
let img: ImageBuffer<Rgb<u8>, _> = ImageBuffer::from_fn(100, 50, |_, _| Rgb([255, 0, 0]));
let path = root.join(name);
img.save(&path).expect("write PNG fixture");
path
@@ -80,7 +79,12 @@ async fn ingest_image_with_ocr_produces_chunk_containing_ocr_text() {
// Counters: scanned should include the PNG; new ≥ 1 (markdown
// fixtures from the workspace tree may also count).
assert!(report.scanned >= 1, "scanned={}, items={:?}", report.scanned, report.items);
assert!(
report.scanned >= 1,
"scanned={}, items={:?}",
report.scanned,
report.items
);
assert_eq!(report.errors, 0, "no errors on lenient OCR path");
// Locate the image doc in the report items.
@@ -94,7 +98,11 @@ async fn ingest_image_with_ocr_produces_chunk_containing_ocr_text() {
kebab_core::IngestItemKind::New,
"image asset must be classified New on first ingest"
);
assert_eq!(img_item.chunk_count, Some(1), "image emits exactly one chunk");
assert_eq!(
img_item.chunk_count,
Some(1),
"image emits exactly one chunk"
);
// Inspect the stored chunk text via kb-app's inspect_chunk facade.
let doc_id = img_item.doc_id.clone().expect("image doc id");
@@ -117,10 +125,12 @@ async fn ingest_image_with_ocr_produces_chunk_containing_ocr_text() {
// Sanity: the doc was actually persisted into SQLite (kb-app's
// list_docs facade reads the same store the chunker writes to).
let summaries = kebab_app::list_docs_with_config(cfg, kebab_core::DocFilter::default())
.expect("list_docs");
let summaries =
kebab_app::list_docs_with_config(cfg, kebab_core::DocFilter::default()).expect("list_docs");
assert!(
summaries.iter().any(|s| s.doc_path.0.ends_with("diagram.png")),
summaries
.iter()
.any(|s| s.doc_path.0.ends_with("diagram.png")),
"image doc must appear in list_docs"
);
@@ -171,8 +181,7 @@ async fn ingest_image_with_ocr_and_caption_populates_both_fields() {
.iter()
.find(|i| i.doc_path.0.ends_with("diagram.png"))
.unwrap();
let doc = kebab_app::inspect_doc_with_config(cfg, img_item.doc_id.as_ref().unwrap())
.unwrap();
let doc = kebab_app::inspect_doc_with_config(cfg, img_item.doc_id.as_ref().unwrap()).unwrap();
let block = match &doc.blocks[0] {
kebab_core::Block::ImageRef(b) => b,
_ => unreachable!(),
@@ -267,8 +276,7 @@ async fn image_indexed_with_filename_when_ocr_and_caption_disabled() {
let cfg_clone = cfg.clone();
let scope = env.scope();
let report = spawn_blocking(move || {
kebab_app::ingest_with_config(cfg_clone, scope, false)
.expect("ingest with no OCR/caption")
kebab_app::ingest_with_config(cfg_clone, scope, false).expect("ingest with no OCR/caption")
})
.await
.expect("task");
@@ -282,8 +290,7 @@ async fn image_indexed_with_filename_when_ocr_and_caption_disabled() {
.find(|i| i.doc_path.0.ends_with("raw.png"))
.unwrap();
assert_eq!(img_item.chunk_count, Some(1), "image emits one chunk");
let doc = kebab_app::inspect_doc_with_config(cfg, img_item.doc_id.as_ref().unwrap())
.unwrap();
let doc = kebab_app::inspect_doc_with_config(cfg, img_item.doc_id.as_ref().unwrap()).unwrap();
let block = match &doc.blocks[0] {
kebab_core::Block::ImageRef(b) => b,
_ => unreachable!(),
@@ -392,16 +399,12 @@ async fn re_ingest_image_produces_unchanged_with_same_doc_id() {
let scope1 = scope.clone();
let scope2 = scope.clone();
let r1 = spawn_blocking(move || {
kebab_app::ingest_with_config(cfg1, scope1, false).unwrap()
})
.await
.unwrap();
let r2 = spawn_blocking(move || {
kebab_app::ingest_with_config(cfg2, scope2, false).unwrap()
})
.await
.unwrap();
let r1 = spawn_blocking(move || kebab_app::ingest_with_config(cfg1, scope1, false).unwrap())
.await
.unwrap();
let r2 = spawn_blocking(move || kebab_app::ingest_with_config(cfg2, scope2, false).unwrap())
.await
.unwrap();
let id1 = r1
.items

View File

@@ -21,11 +21,16 @@ fn second_ingest_of_unchanged_corpus_marks_all_unchanged() {
// First ingest — populates the DB. Use the legacy entry so the
// assertions cover the "previously ingested" set without needing
// IngestOpts::default() to behave identically.
let first =
ingest_with_config(env.config.clone(), env.scope(), false).unwrap();
let first = ingest_with_config(env.config.clone(), env.scope(), false).unwrap();
assert_eq!(first.errors, 0, "first ingest must not error: {first:?}");
assert!(first.new >= 1, "first ingest must create new docs: {first:?}");
assert_eq!(first.unchanged, 0, "first ingest cannot have unchanged: {first:?}");
assert!(
first.new >= 1,
"first ingest must create new docs: {first:?}"
);
assert_eq!(
first.unchanged, 0,
"first ingest cannot have unchanged: {first:?}"
);
let scanned = first.scanned;
@@ -38,9 +43,15 @@ fn second_ingest_of_unchanged_corpus_marks_all_unchanged() {
IngestOpts::default(),
)
.unwrap();
assert_eq!(second.scanned, scanned, "second scanned matches first: {second:?}");
assert_eq!(
second.scanned, scanned,
"second scanned matches first: {second:?}"
);
assert_eq!(second.new, 0, "no new docs on re-ingest: {second:?}");
assert_eq!(second.updated, 0, "nothing should be marked updated: {second:?}");
assert_eq!(
second.updated, 0,
"nothing should be marked updated: {second:?}"
);
assert_eq!(
second.unchanged, scanned,
"every doc must be Unchanged: {second:?}"
@@ -52,10 +63,12 @@ fn second_ingest_of_unchanged_corpus_marks_all_unchanged() {
fn force_reingest_bypasses_skip() {
let env = TestEnv::lexical_only();
let first =
ingest_with_config(env.config.clone(), env.scope(), false).unwrap();
let first = ingest_with_config(env.config.clone(), env.scope(), false).unwrap();
assert_eq!(first.errors, 0, "first ingest must not error: {first:?}");
assert!(first.new >= 1, "first ingest must create new docs: {first:?}");
assert!(
first.new >= 1,
"first ingest must create new docs: {first:?}"
);
let scanned = first.scanned;
let second = ingest_with_config_opts(

View File

@@ -107,13 +107,9 @@ fn cancel_none_is_uncancellable_default() {
// ingest_with_config_progress (no cancel) runs to completion.
let env = TestEnv::lexical_only();
let (tx, rx) = mpsc::channel::<IngestEvent>();
let report = kebab_app::ingest_with_config_progress(
env.config.clone(),
env.scope(),
true,
Some(tx),
)
.unwrap();
let report =
kebab_app::ingest_with_config_progress(env.config.clone(), env.scope(), true, Some(tx))
.unwrap();
assert_eq!(report.scanned, 3);
assert_eq!(report.new, 3);

View File

@@ -33,7 +33,7 @@ fn ingest_file_copies_external_md_and_reports_new() {
assert!(ext_dir.is_dir());
let entries: Vec<_> = fs::read_dir(&ext_dir)
.unwrap()
.filter_map(|e| e.ok())
.filter_map(std::result::Result::ok)
.collect();
assert_eq!(entries.len(), 1, "exactly one file in _external/");
let name = entries[0].file_name().to_string_lossy().into_owned();
@@ -107,5 +107,8 @@ fn ingest_file_errors_on_unsupported_extension() {
let err = kebab_app::ingest_file_with_config(cfg, &docx).unwrap_err();
assert!(err.to_string().contains("unsupported extension"), "{err}");
assert!(err.to_string().contains(".docx") || err.to_string().contains("docx"), "{err}");
assert!(
err.to_string().contains(".docx") || err.to_string().contains("docx"),
"{err}"
);
}

View File

@@ -8,8 +8,7 @@ use common::TestEnv;
#[test]
fn ingest_then_list_inspects_round_trip() {
let env = TestEnv::lexical_only();
let report =
kebab_app::ingest_with_config(env.config.clone(), env.scope(), false).unwrap();
let report = kebab_app::ingest_with_config(env.config.clone(), env.scope(), false).unwrap();
// The fixture has 3 markdown files; first ingest should label them
// all as New.
@@ -27,17 +26,14 @@ fn ingest_then_list_inspects_round_trip() {
}
// list_docs returns the 3 docs.
let docs = kebab_app::list_docs_with_config(
env.config.clone(),
kebab_core::DocFilter::default(),
)
.unwrap();
let docs =
kebab_app::list_docs_with_config(env.config.clone(), kebab_core::DocFilter::default())
.unwrap();
assert_eq!(docs.len(), 3, "docs: {docs:?}");
// inspect_doc round-trips one of them.
let any_doc_id = docs[0].doc_id.clone();
let canonical = kebab_app::inspect_doc_with_config(env.config.clone(), &any_doc_id)
.unwrap();
let canonical = kebab_app::inspect_doc_with_config(env.config.clone(), &any_doc_id).unwrap();
assert_eq!(canonical.doc_id, any_doc_id);
assert!(!canonical.blocks.is_empty(), "blocks empty");
}
@@ -46,12 +42,10 @@ fn ingest_then_list_inspects_round_trip() {
fn ingest_idempotent_on_second_run() {
let env = TestEnv::lexical_only();
let r1 =
kebab_app::ingest_with_config(env.config.clone(), env.scope(), false).unwrap();
let r1 = kebab_app::ingest_with_config(env.config.clone(), env.scope(), false).unwrap();
assert_eq!(r1.new, 3);
let r2 =
kebab_app::ingest_with_config(env.config.clone(), env.scope(), false).unwrap();
let r2 = kebab_app::ingest_with_config(env.config.clone(), env.scope(), false).unwrap();
// Same files re-ingested — p9-fb-23 task 7 introduced the early-skip
// path: when checksum + parser/chunker/embedding versions all match,
// the second run reports `Unchanged` rather than `Updated`. Pre-p9-fb-23
@@ -63,19 +57,16 @@ fn ingest_idempotent_on_second_run() {
assert_eq!(r2.unchanged, 3, "second run unchanged: {r2:?}");
// list_docs still has 3 docs (no duplicates).
let docs = kebab_app::list_docs_with_config(
env.config.clone(),
kebab_core::DocFilter::default(),
)
.unwrap();
let docs =
kebab_app::list_docs_with_config(env.config.clone(), kebab_core::DocFilter::default())
.unwrap();
assert_eq!(docs.len(), 3);
}
#[test]
fn ingest_summary_only_drops_items() {
let env = TestEnv::lexical_only();
let report =
kebab_app::ingest_with_config(env.config.clone(), env.scope(), true).unwrap();
let report = kebab_app::ingest_with_config(env.config.clone(), env.scope(), true).unwrap();
assert_eq!(report.scanned, 3);
assert!(report.items.is_none(), "summary-only should null items");
}
@@ -87,12 +78,10 @@ fn ingest_records_ingest_runs_row_with_aggregate_counts() {
// of every run. `summary_only=true` writes `items_json=NULL`; the
// counts MUST still be present.
let env = TestEnv::lexical_only();
let report = kebab_app::ingest_with_config(env.config.clone(), env.scope(), true)
.unwrap();
let report = kebab_app::ingest_with_config(env.config.clone(), env.scope(), true).unwrap();
assert_eq!(report.scanned, 3);
let db_path = std::path::PathBuf::from(&env.config.storage.data_dir)
.join("kebab.sqlite");
let db_path = std::path::PathBuf::from(&env.config.storage.data_dir).join("kebab.sqlite");
let conn = rusqlite::Connection::open(&db_path).expect("open kebab.sqlite");
let (scanned, new_c, updated, skipped, errors, items_json): (
i64,
@@ -141,25 +130,18 @@ fn ingest_provider_none_skips_lance() {
// tree shape (no `<data_dir>/lancedb` directory, or no `*.lance`
// tables under it).
let env = TestEnv::lexical_only();
let report =
kebab_app::ingest_with_config(env.config.clone(), env.scope(), false).unwrap();
let report = kebab_app::ingest_with_config(env.config.clone(), env.scope(), false).unwrap();
assert_eq!(report.errors, 0, "lexical-only run must not error");
assert_eq!(report.new, 3);
let lance_dir = std::path::PathBuf::from(&env.config.storage.data_dir)
.join("lancedb");
let lance_dir = std::path::PathBuf::from(&env.config.storage.data_dir).join("lancedb");
if lance_dir.exists() {
// If the dir was created (e.g., by an earlier consumer touching
// the path), it MUST contain no `.lance` tables.
let mut had_lance_table = false;
for entry in std::fs::read_dir(&lance_dir).expect("read lance_dir") {
let entry = entry.unwrap();
if entry
.path()
.extension()
.and_then(|s| s.to_str())
== Some("lance")
{
if entry.path().extension().and_then(|s| s.to_str()) == Some("lance") {
had_lance_table = true;
break;
}
@@ -189,8 +171,7 @@ fn list_docs_filters_by_tags_any() {
tags_any: vec!["rust".to_string()],
..Default::default()
};
let rust_docs =
kebab_app::list_docs_with_config(env.config.clone(), rust_filter).unwrap();
let rust_docs = kebab_app::list_docs_with_config(env.config.clone(), rust_filter).unwrap();
// intro.md and notes/cargo.md both tag "rust".
assert_eq!(rust_docs.len(), 2, "expected 2 rust docs: {rust_docs:?}");
}
@@ -198,8 +179,9 @@ fn list_docs_filters_by_tags_any() {
#[test]
fn inspect_doc_not_found_returns_actionable_error() {
let env = TestEnv::lexical_only();
let bogus =
kebab_core::DocumentId("0000000000000000000000000000000000000000000000000000000000000000".to_string());
let bogus = kebab_core::DocumentId(
"0000000000000000000000000000000000000000000000000000000000000000".to_string(),
);
let err = kebab_app::inspect_doc_with_config(env.config.clone(), &bogus).unwrap_err();
let msg = format!("{err:#}");
assert!(
@@ -218,8 +200,7 @@ fn inspect_chunk_not_found_returns_actionable_error() {
let bogus = kebab_core::ChunkId(
"0000000000000000000000000000000000000000000000000000000000000000".to_string(),
);
let err = kebab_app::inspect_chunk_with_config(env.config.clone(), &bogus)
.unwrap_err();
let err = kebab_app::inspect_chunk_with_config(env.config.clone(), &bogus).unwrap_err();
let msg = format!("{err:#}");
assert!(msg.contains("not found"), "got: {msg}");
}
@@ -251,22 +232,18 @@ fn ingest_with_config_opts_default_matches_legacy_behaviour() {
#[test]
fn ingest_stamps_chunker_version_on_document() {
let env = TestEnv::lexical_only();
let report =
kebab_app::ingest_with_config(env.config.clone(), env.scope(), false).unwrap();
let report = kebab_app::ingest_with_config(env.config.clone(), env.scope(), false).unwrap();
assert!(report.new >= 1, "expected at least one new doc: {report:?}");
assert_eq!(report.errors, 0, "no errors expected: {report:?}");
let docs = kebab_app::list_docs_with_config(
env.config.clone(),
kebab_core::DocFilter::default(),
)
.unwrap();
let docs =
kebab_app::list_docs_with_config(env.config.clone(), kebab_core::DocFilter::default())
.unwrap();
assert!(!docs.is_empty(), "no docs after ingest");
for doc_entry in &docs {
let canonical =
kebab_app::inspect_doc_with_config(env.config.clone(), &doc_entry.doc_id)
.unwrap();
kebab_app::inspect_doc_with_config(env.config.clone(), &doc_entry.doc_id).unwrap();
assert!(
canonical.last_chunker_version.is_some(),
"last_chunker_version must be stamped for doc {}: got {:?}",

View File

@@ -0,0 +1,171 @@
// crates/kebab-app/tests/ingest_log_smoke.rs
//
// Integration tests for ingest_log feature (v0.20.x). Spec §5 AC-9 + AC-6.
use std::path::PathBuf;
use kebab_app::{IngestOpts, ingest_with_config_opts};
use kebab_config::{Config, LoggingCfg};
use kebab_core::SourceScope;
use serde_json::Value;
use tempfile::TempDir;
fn minimal_config(workspace: &std::path::Path, log_dir: &std::path::Path) -> Config {
let data_dir = workspace.parent().unwrap().join("data");
std::fs::create_dir_all(&data_dir).unwrap();
let model_dir = workspace.parent().unwrap().join("models");
std::fs::create_dir_all(&model_dir).unwrap();
let mut cfg = Config::defaults();
cfg.workspace.root = workspace.to_string_lossy().into_owned();
cfg.workspace.exclude.clear();
cfg.storage.data_dir = data_dir.to_string_lossy().into_owned();
cfg.storage.model_dir = model_dir.to_string_lossy().into_owned();
cfg.models.embedding.provider = "none".to_string();
cfg.models.embedding.dimensions = 0;
cfg.chunking.target_tokens = 80;
cfg.chunking.overlap_tokens = 20;
cfg.logging = LoggingCfg {
ingest_log_enabled: true,
ingest_log_dir: log_dir.to_path_buf(),
..Default::default()
};
cfg
}
/// AC-9: ingest → log file exists + each line valid JSON + last line kind=summary + scanned>0.
#[test]
fn ingest_log_smoke() {
let tmp = TempDir::new().unwrap();
let workspace = tmp.path().join("kb");
std::fs::create_dir_all(&workspace).unwrap();
let log_dir = tmp.path().join("logs");
// 1. Minimal corpus: 1 markdown + 1 scanned PDF (OCR disabled — no Ollama needed).
std::fs::write(
workspace.join("hello.md"),
"# Hello\n\nThis is a smoke test.\n",
)
.unwrap();
let pdf_src = PathBuf::from("../kebab-parse-pdf/tests/fixtures/scanned_page1.pdf");
if pdf_src.exists() {
std::fs::copy(&pdf_src, workspace.join("scanned.pdf")).unwrap();
}
// 2. Config with logging enabled.
let cfg = minimal_config(&workspace, &log_dir);
let scope = SourceScope {
root: workspace.clone(),
exclude: vec![],
..Default::default()
};
// 3. Run ingest.
ingest_with_config_opts(cfg, scope, false, IngestOpts::default())
.expect("ingest should succeed");
// 4. Assert log file exists in log_dir.
let log_files: Vec<_> = std::fs::read_dir(&log_dir)
.unwrap()
.filter_map(Result::ok)
.filter(|e| {
e.file_name().to_string_lossy().starts_with("ingest-")
&& e.file_name().to_string_lossy().ends_with(".ndjson")
})
.collect();
assert_eq!(
log_files.len(),
1,
"expected exactly 1 ingest-*.ndjson file, found: {log_files:?}"
);
// 5. Parse each line as JSON — assert kind field present and valid.
let body = std::fs::read_to_string(log_files[0].path()).unwrap();
let lines: Vec<&str> = body.lines().collect();
assert!(!lines.is_empty(), "log file should not be empty");
let valid_kinds = ["ocr", "parse_error", "skip", "error", "summary"];
for line in &lines {
let v: Value = serde_json::from_str(line)
.unwrap_or_else(|e| panic!("line is not valid JSON: {e}\nline: {line}"));
let kind = v
.get("kind")
.and_then(|k| k.as_str())
.unwrap_or_else(|| panic!("line missing 'kind' field: {line}"));
assert!(
valid_kinds.contains(&kind),
"unexpected kind '{kind}' in line: {line}"
);
}
// 6. Last line must be kind=summary with scanned > 0.
let last = lines.last().unwrap();
let last_v: Value = serde_json::from_str(last).unwrap();
assert_eq!(
last_v.get("kind").and_then(|k| k.as_str()),
Some("summary"),
"last line must be kind=summary, got: {last}"
);
let scanned = last_v.get("scanned").and_then(Value::as_u64).unwrap_or(0);
assert!(scanned > 0, "summary.scanned should be > 0, got: {last}");
}
/// AC-6: ingest_log_enabled=false → no log file created.
#[test]
fn ingest_log_disabled_emits_no_file() {
let tmp = TempDir::new().unwrap();
let workspace = tmp.path().join("kb");
std::fs::create_dir_all(&workspace).unwrap();
let log_dir = tmp.path().join("logs");
std::fs::write(
workspace.join("hello.md"),
"# Hello\n\nDisabled log test.\n",
)
.unwrap();
let data_dir = tmp.path().join("data");
std::fs::create_dir_all(&data_dir).unwrap();
let model_dir = tmp.path().join("models");
std::fs::create_dir_all(&model_dir).unwrap();
let mut cfg = Config::defaults();
cfg.workspace.root = workspace.to_string_lossy().into_owned();
cfg.workspace.exclude.clear();
cfg.storage.data_dir = data_dir.to_string_lossy().into_owned();
cfg.storage.model_dir = model_dir.to_string_lossy().into_owned();
cfg.models.embedding.provider = "none".to_string();
cfg.models.embedding.dimensions = 0;
cfg.logging = LoggingCfg {
ingest_log_enabled: false,
ingest_log_dir: log_dir.clone(),
..Default::default()
};
let scope = SourceScope {
root: workspace.clone(),
exclude: vec![],
..Default::default()
};
ingest_with_config_opts(cfg, scope, false, IngestOpts::default())
.expect("ingest should succeed");
// log_dir should either not exist or contain 0 ingest-*.ndjson files.
let log_file_count = if log_dir.exists() {
std::fs::read_dir(&log_dir)
.unwrap()
.filter_map(Result::ok)
.filter(|e| {
e.file_name().to_string_lossy().starts_with("ingest-")
&& e.file_name().to_string_lossy().ends_with(".ndjson")
})
.count()
} else {
0
};
assert_eq!(
log_file_count, 0,
"no ingest-*.ndjson file should be created when disabled"
);
}

View File

@@ -0,0 +1,117 @@
//! Integration smoke tests for the PDF OCR pipeline (§ Acceptance §9 #1 + #2).
//!
//! Tests 1 and 2 require a live Ollama endpoint — `#[ignore]` by default.
//! Manual invoke:
//! KEBAB_PDF_OCR_ENDPOINT=http://192.168.0.47:11434 \
//! cargo test -p kebab-app --test ingest_pdf_ocr_smoke --ignored -j 4
//!
//! Test 3 (cancel) uses a dummy endpoint + pre-set cancel — runs by default
//! to verify the cancel wiring doesn't panic/deadlock.
mod common;
use std::path::PathBuf;
use std::sync::Arc;
use std::sync::atomic::AtomicBool;
use common::TestEnv;
fn ollama_endpoint() -> String {
std::env::var("KEBAB_PDF_OCR_ENDPOINT").unwrap_or_else(|_| "http://localhost:11434".to_string())
}
fn make_ocr_env_real() -> TestEnv {
let mut env = TestEnv::lexical_only();
env.config.pdf.ocr.enabled = true;
env.config.pdf.ocr.endpoint = Some(ollama_endpoint());
env.config.models.embedding.provider = "none".to_string();
let src = PathBuf::from(env!("CARGO_MANIFEST_DIR"))
.parent()
.unwrap()
.join("kebab-parse-pdf/tests/fixtures/scanned_page1.pdf");
let dest = env.workspace_root.join("scanned_page1.pdf");
std::fs::copy(&src, &dest).expect("copy scanned_page1.pdf to workspace");
env
}
/// § Acceptance §9 #1 — real Ollama OCR + IngestItem.pdf_ocr_pages = Some(1).
#[test]
#[ignore = "real Ollama qwen2.5vl:3b dependency"]
fn ingest_with_mock_ocr_yields_pdf_ocr_summary() {
let env = make_ocr_env_real();
let report =
kebab_app::ingest_with_config(env.config.clone(), env.scope(), false).expect("ingest");
assert!(report.new >= 1, "at least one PDF ingested: {report:?}");
let items = report.items.unwrap_or_default();
let pdf_item = items.iter().find(|i| i.doc_path.0.ends_with(".pdf"));
assert!(
pdf_item.is_some(),
"PDF item must appear in ingest report items: {items:?}"
);
let pdf_item = pdf_item.unwrap();
assert!(
pdf_item.pdf_ocr_pages.is_some(),
"pdf_ocr_pages must be set for scanned PDF: {pdf_item:?}"
);
assert_eq!(
pdf_item.pdf_ocr_pages.unwrap(),
1,
"scanned_page1.pdf has exactly 1 page"
);
}
/// § Acceptance §9 #2 — OCR text indexed and retrievable via lexical search.
#[test]
#[ignore = "real Ollama qwen2.5vl:3b dependency"]
fn ocr_text_indexed_and_searchable() {
let env = make_ocr_env_real();
kebab_app::ingest_with_config(env.config.clone(), env.scope(), false).expect("ingest");
// Search for a Korean morpheme expected to appear in qwen2.5vl:3b OCR
// output of the PoC ground-truth page. "다음" is a high-frequency token
// in page1.txt truth file.
let query = common::lexical_query("다음");
let hits = kebab_app::search_with_config(env.config.clone(), query).expect("search");
assert!(
!hits.is_empty(),
"OCR-indexed text must surface in lexical search results"
);
}
/// Production cancel wiring smoke — pre-set cancel exits before any OCR call.
/// Dummy endpoint (port 1 = connection-refused) means OCR HTTP calls would
/// fail, but cancel=true prevents the loop from reaching OCR at all.
/// Verifies no panic/deadlock regardless of Ok/Err outcome.
#[test]
fn ingest_with_cancel_aborts_mid_pdf() {
let mut env = TestEnv::lexical_only();
env.config.pdf.ocr.enabled = true;
env.config.pdf.ocr.endpoint = Some("http://127.0.0.1:1".to_string());
let src = PathBuf::from(env!("CARGO_MANIFEST_DIR"))
.parent()
.unwrap()
.join("kebab-parse-pdf/tests/fixtures/scanned_page1.pdf");
let dest = env.workspace_root.join("scanned_page1.pdf");
std::fs::copy(&src, &dest).expect("copy scanned_page1.pdf to workspace");
let cancel = Arc::new(AtomicBool::new(true)); // pre-set — abort immediately
let result = kebab_app::ingest_with_config_cancellable(
env.config.clone(),
env.scope(),
false,
None,
Some(cancel),
);
// Both Ok (pre-cancel exit) and Err (eager OCR engine fail) are acceptable —
// key assertion is no panic/deadlock.
let _ = result;
}

View File

@@ -13,13 +13,9 @@ use kebab_core::IngestItemKind;
fn run_with_progress() -> Vec<IngestEvent> {
let env = TestEnv::lexical_only();
let (tx, rx) = mpsc::channel::<IngestEvent>();
let report = kebab_app::ingest_with_config_progress(
env.config.clone(),
env.scope(),
false,
Some(tx),
)
.unwrap();
let report =
kebab_app::ingest_with_config_progress(env.config.clone(), env.scope(), false, Some(tx))
.unwrap();
assert_eq!(report.scanned, 3);
assert_eq!(report.new, 3);
@@ -73,40 +69,74 @@ fn progress_event_sequence_matches_design_section_2_4a() {
other => panic!("expected Completed last, got {other:?}"),
}
// Middle: 3 AssetStarted/AssetFinished pairs in monotonic idx order.
let asset_events: Vec<&IngestEvent> = events[2..events.len() - 1].iter().collect();
assert_eq!(
asset_events.len(),
6,
"expected 3 (Started + Finished) pairs, got {asset_events:?}"
);
for (chunk_idx, pair) in asset_events.chunks(2).enumerate() {
let expected_idx = chunk_idx as u32 + 1;
match (pair[0], pair[1]) {
(
IngestEvent::AssetStarted {
idx: si,
total: st,
media,
..
},
IngestEvent::AssetFinished {
idx: fi,
total: ft,
result,
chunks,
},
) => {
assert_eq!(*si, expected_idx, "Started idx mismatch: {pair:?}");
assert_eq!(*fi, expected_idx, "Finished idx mismatch: {pair:?}");
assert_eq!(*st, 3, "Started total mismatch");
assert_eq!(*ft, 3, "Finished total mismatch");
assert_eq!(media, "markdown", "fixture is markdown only");
assert_eq!(*result, IngestItemKind::New, "first ingest → New");
assert!(*chunks >= 1, "chunks: {pair:?}");
// Middle (v0.24.0 ordering invariant §2.4a): per asset the stream is
// AssetStarted < AssetChunked < [ExpansionProgress*] < AssetTimings
// < AssetFinished
// Expansion is disabled in the lexical fixture, so no ExpansionProgress
// frames appear here — but AssetChunked + AssetTimings are emitted for
// every markdown asset.
let middle = &events[2..events.len() - 1];
// 3 AssetStarted events, monotonic idx 1..=3, all markdown, total = 3.
let started: Vec<u32> = middle
.iter()
.filter_map(|e| match e {
IngestEvent::AssetStarted {
idx, total, media, ..
} => {
assert_eq!(*total, 3, "Started total mismatch: {e:?}");
assert_eq!(media, "markdown", "fixture is markdown only: {e:?}");
Some(*idx)
}
other => panic!("expected Started+Finished pair, got {other:?}"),
}
_ => None,
})
.collect();
assert_eq!(started, vec![1, 2, 3], "AssetStarted idx order: {middle:?}");
// 3 AssetFinished events, monotonic idx 1..=3, each New with ≥1 chunk.
let finished: Vec<u32> = middle
.iter()
.filter_map(|e| match e {
IngestEvent::AssetFinished {
idx,
total,
result,
chunks,
} => {
assert_eq!(*total, 3, "Finished total mismatch: {e:?}");
assert_eq!(*result, IngestItemKind::New, "first ingest → New: {e:?}");
assert!(*chunks >= 1, "chunks: {e:?}");
Some(*idx)
}
_ => None,
})
.collect();
assert_eq!(finished, vec![1, 2, 3], "AssetFinished idx order: {middle:?}");
// v0.24.0 additive events: exactly one AssetChunked + one AssetTimings
// per asset, each strictly bracketed by that asset's Started / Finished.
for target in 1u32..=3 {
let started_at = middle
.iter()
.position(|e| matches!(e, IngestEvent::AssetStarted { idx, .. } if *idx == target))
.unwrap_or_else(|| panic!("missing AssetStarted for idx {target}: {middle:?}"));
let finished_at = middle
.iter()
.position(|e| matches!(e, IngestEvent::AssetFinished { idx, .. } if *idx == target))
.unwrap_or_else(|| panic!("missing AssetFinished for idx {target}: {middle:?}"));
let chunked_at = middle
.iter()
.position(|e| matches!(e, IngestEvent::AssetChunked { idx, chunks, .. } if *idx == target && *chunks >= 1))
.unwrap_or_else(|| panic!("missing AssetChunked for idx {target}: {middle:?}"));
let timings_at = middle
.iter()
.position(|e| matches!(e, IngestEvent::AssetTimings { idx, .. } if *idx == target))
.unwrap_or_else(|| panic!("missing AssetTimings for idx {target}: {middle:?}"));
assert!(
started_at < chunked_at && chunked_at < timings_at && timings_at < finished_at,
"idx {target} ordering: started={started_at} chunked={chunked_at} \
timings={timings_at} finished={finished_at}: {middle:?}"
);
}
}
@@ -116,13 +146,9 @@ fn ingest_with_config_progress_none_matches_ingest_with_config() {
// `ingest_with_config_progress(..., None)` must produce identical
// reports modulo wall-clock duration.
let env = TestEnv::lexical_only();
let r_none = kebab_app::ingest_with_config_progress(
env.config.clone(),
env.scope(),
true,
None,
)
.unwrap();
let r_none =
kebab_app::ingest_with_config_progress(env.config.clone(), env.scope(), true, None)
.unwrap();
assert_eq!(r_none.scanned, 3);
assert_eq!(r_none.new, 3);
}
@@ -134,12 +160,77 @@ fn dropped_receiver_does_not_panic_or_fail_ingest() {
let env = TestEnv::lexical_only();
let (tx, rx) = mpsc::channel::<IngestEvent>();
drop(rx);
let report = kebab_app::ingest_with_config_progress(
env.config.clone(),
env.scope(),
true,
Some(tx),
)
.unwrap();
let report =
kebab_app::ingest_with_config_progress(env.config.clone(), env.scope(), true, Some(tx))
.unwrap();
assert_eq!(report.scanned, 3);
}
/// v0.20.0 sub-item 1: pdf_ocr_started + pdf_ocr_finished events 가 PDF asset 의
/// OCR-enabled ingest 시 emit 됨을 검증. real Ollama 의존 — `#[ignore]` default.
///
/// Manual invoke:
/// ```
/// KEBAB_PDF_OCR_ENABLED=true \
/// KEBAB_PDF_OCR_ENDPOINT=http://192.168.0.47:11434 \
/// cargo test -p kebab-app --test ingest_progress \
/// --ignored pdf_ocr_progress_emits_started_finished_events
/// ```
#[test]
#[ignore = "real Ollama dependency — manual invoke via KEBAB_PDF_OCR_ENABLED=true"]
fn pdf_ocr_progress_emits_started_finished_events() {
// F1 fixture (DCTDecode JPEG passthrough) 을 tmpdir 의 workspace 로 copy.
let tmpdir = tempfile::tempdir().expect("create tmpdir");
let workspace = tmpdir.path().join("workspace");
std::fs::create_dir_all(&workspace).expect("create workspace dir");
let f1_src = std::path::PathBuf::from(env!("CARGO_MANIFEST_DIR"))
.join("../kebab-parse-pdf/tests/fixtures/scanned_page1.pdf");
let f1 = std::fs::read(&f1_src).expect("F1 fixture present");
std::fs::write(workspace.join("page1.pdf"), &f1).expect("copy F1");
let data_dir = tmpdir.path().join("data");
std::fs::create_dir_all(&data_dir).expect("create data dir");
let mut config = kebab_config::Config::defaults();
config.workspace.root = workspace.to_string_lossy().into_owned();
config.storage.data_dir = data_dir.to_string_lossy().into_owned();
config.models.embedding.provider = "none".to_string();
config.models.embedding.dimensions = 0;
config.pdf.ocr.enabled = true;
if let Ok(endpoint) = std::env::var("KEBAB_PDF_OCR_ENDPOINT") {
config.pdf.ocr.endpoint = Some(endpoint);
}
let scope = kebab_core::SourceScope {
root: workspace.clone(),
..Default::default()
};
let (tx, rx) = mpsc::channel::<IngestEvent>();
let _report = kebab_app::ingest_with_config_progress(config, scope, false, Some(tx))
.expect("ingest_with_config_progress");
let events: Vec<_> = rx.iter().collect();
let started_count = events
.iter()
.filter(|e| matches!(e, IngestEvent::PdfOcrStarted { .. }))
.count();
let finished_count = events
.iter()
.filter(|e| matches!(e, IngestEvent::PdfOcrFinished { .. }))
.count();
assert!(
started_count >= 1,
"PdfOcrStarted 가 ≥ 1 emit 됨 (got {started_count})"
);
assert!(
finished_count >= 1,
"PdfOcrFinished 가 ≥ 1 emit 됨 (got {finished_count})"
);
assert_eq!(
started_count, finished_count,
"Started 와 Finished 의 count 일치"
);
}

View File

@@ -29,13 +29,15 @@ fn ingest_stdin_writes_frontmatter_and_reports_new() {
"## Body content\n\nMore.",
"Article X",
Some("https://example.com/x"),
).unwrap();
)
.unwrap();
assert_eq!(report.new, 1, "{report:?}");
// _external/ contains exactly one .md file with frontmatter.
let ext_dir = std::path::PathBuf::from(&cfg.workspace.root).join("_external");
let entries: Vec<_> = fs::read_dir(&ext_dir).unwrap()
.filter_map(|e| e.ok())
let entries: Vec<_> = fs::read_dir(&ext_dir)
.unwrap()
.filter_map(std::result::Result::ok)
.collect();
assert_eq!(entries.len(), 1);
let content = fs::read_to_string(entries[0].path()).unwrap();
@@ -50,17 +52,14 @@ fn ingest_stdin_without_source_uri() {
let dir = tempfile::tempdir().unwrap();
let cfg = fresh_cfg(dir.path());
let report = kebab_app::ingest_stdin_with_config(
cfg.clone(),
"## Body",
"Title",
None,
).unwrap();
let report =
kebab_app::ingest_stdin_with_config(cfg.clone(), "## Body", "Title", None).unwrap();
assert_eq!(report.new, 1);
let ext_dir = std::path::PathBuf::from(&cfg.workspace.root).join("_external");
let entries: Vec<_> = fs::read_dir(&ext_dir).unwrap()
.filter_map(|e| e.ok())
let entries: Vec<_> = fs::read_dir(&ext_dir)
.unwrap()
.filter_map(std::result::Result::ok)
.collect();
let content = fs::read_to_string(entries[0].path()).unwrap();
assert!(content.contains("title: \"Title\""));

View File

@@ -17,9 +17,8 @@ fn init_workspace_header_lists_supported_extensions() {
}
kebab_app::init_workspace(true).expect("init_workspace");
let cfg_path = kebab_config::Config::xdg_config_path();
let body = std::fs::read_to_string(&cfg_path).unwrap_or_else(|e| {
panic!("read config at {}: {e}", cfg_path.display())
});
let body = std::fs::read_to_string(&cfg_path)
.unwrap_or_else(|e| panic!("read config at {}: {e}", cfg_path.display()));
assert!(
body.contains("처리 가능한 형식"),
"header lists supported types section: body=\n{body}"

View File

@@ -0,0 +1,122 @@
//! Bug #3 regression: multi-scanned PDF ingest must produce globally unique chunk_ids.
//! v0.20.0 sub-item 1 bugfix.
//!
//! Strategy: helper-level chain test (apply_ocr_to_pdf_pages → PdfPageV1Chunker).
//! Facade mock injection is unavailable (kebab-app hardcodes OllamaVisionOcr), so
//! this test covers the full OCR→chunk pipeline with real PDF fixtures + MockOcrEngine,
//! adding value beyond kebab-chunk unit test B5 (which tests PdfPageV1Chunker alone).
mod common;
use std::collections::HashSet;
use std::path::{Path, PathBuf};
use common::mock_ocr::MockOcrEngine;
use kebab_app::pdf_ocr_apply::{PdfOcrOpts, apply_ocr_to_pdf_pages};
use kebab_chunk::PdfPageV1Chunker;
use kebab_core::{
AssetStorage, Checksum, ChunkPolicy, Chunker, ExtractConfig, ExtractContext, Extractor,
MediaType, RawAsset, SourceUri, WorkspacePath, id_for_asset,
};
use kebab_parse_image::OcrEngine;
use kebab_parse_pdf::PdfTextExtractor;
use time::OffsetDateTime;
fn make_pdf_asset(path: &str, hash_char: char, byte_len: u64) -> RawAsset {
let fake_hash: String = hash_char.to_string().repeat(64);
let asset_id = id_for_asset(&fake_hash);
RawAsset {
asset_id,
source_uri: SourceUri::File(PathBuf::from(path)),
workspace_path: WorkspacePath::new(path.to_string()).unwrap(),
media_type: MediaType::Pdf,
byte_len,
checksum: Checksum(fake_hash),
discovered_at: OffsetDateTime::UNIX_EPOCH,
stored: AssetStorage::Copied {
path: PathBuf::from(path),
},
}
}
fn extract_and_ocr(
bytes: &[u8],
path: &str,
hash_char: char,
engine: &dyn OcrEngine,
) -> kebab_core::CanonicalDocument {
let asset = make_pdf_asset(path, hash_char, bytes.len() as u64);
let workspace_root = Path::new("/");
let config = ExtractConfig::default();
let ctx = ExtractContext {
asset: &asset,
workspace_root,
config: &config,
};
let mut canonical = PdfTextExtractor::new().extract(&ctx, bytes).unwrap();
let opts = PdfOcrOpts {
enabled: true,
always_on: false,
valid_ratio_threshold: 0.5,
min_char_count: 20,
lang_hint: None,
cancel: None,
};
apply_ocr_to_pdf_pages(&mut canonical, engine, bytes, &opts, |_| {}).unwrap();
canonical
}
#[test]
fn multi_scanned_pdf_ingest_no_chunk_id_collision() {
let f1_bytes = std::fs::read("../kebab-parse-pdf/tests/fixtures/scanned_page1.pdf")
.expect("F1 fixture missing");
let f2_bytes = std::fs::read("../kebab-parse-pdf/tests/fixtures/scanned_page2.pdf")
.expect("F2 fixture missing");
// Bug #3 trigger shape: 10-char early segment + ". " + 500-char tail.
// byte_len = 10*3 + 2 + 500*3 = 1532 > target_bytes=1500 → multi-chunk.
// overlap_bytes = min(240, 750) = 240 / chars=80 → second chunk's actual_start
// collapses to prev_min=0 without the fix → same #c0 suffix → chunk_id collision.
let trigger_text = format!("{}. {}", "".repeat(10), "".repeat(500));
let f1_engine = MockOcrEngine::single("F1 mock OCR page text", false);
let f2_engine = MockOcrEngine::single(&trigger_text, false);
let f1_canonical = extract_and_ocr(&f1_bytes, "page1.pdf", '1', &f1_engine);
let f2_canonical = extract_and_ocr(&f2_bytes, "page2.pdf", '2', &f2_engine);
let chunk_policy = ChunkPolicy {
target_tokens: 500,
overlap_tokens: 80,
respect_markdown_headings: false,
chunker_version: PdfPageV1Chunker.chunker_version(),
};
let f1_chunks = PdfPageV1Chunker
.chunk(&f1_canonical, &chunk_policy)
.unwrap();
let f2_chunks = PdfPageV1Chunker
.chunk(&f2_canonical, &chunk_policy)
.unwrap();
assert!(
f2_chunks.len() >= 2,
"F2 trigger text must produce ≥2 chunks for the collision to be possible; got {}",
f2_chunks.len()
);
let all_ids: Vec<&str> = f1_chunks
.iter()
.chain(f2_chunks.iter())
.map(|c| c.chunk_id.0.as_str())
.collect();
let total = all_ids.len();
let unique: HashSet<&str> = all_ids.iter().copied().collect();
assert_eq!(
unique.len(),
total,
"all chunk_ids must be globally unique across F1 + F2 ({} unique vs {} total — collision detected)",
unique.len(),
total,
);
}

View File

@@ -0,0 +1,156 @@
//! Integration smoke tests for `kebab inspect ocr-stats / ocr-failures`.
//! AC-4, AC-5, AC-6, AC-11 (ocr_inspect_smoke binary), AC-13.
mod common;
use common::TestEnv;
use kebab_app::App;
use kebab_store_sqlite::SqliteStore;
/// Insert synthetic pdf_ocr_events rows directly so the test runs without
/// a live Ollama endpoint.
fn seed_ocr_events(env: &TestEnv, store: &SqliteStore) {
// Success rows
for i in 0..3u32 {
store
.record_pdf_ocr_event(
"run-aaa",
&format!("2026-05-28T0{i}:00:00Z"),
Some("doc-abc"),
"path/scanned.pdf",
i + 1,
Some(50_000),
Some(200),
Some(150),
100 + u64::from(i) * 20,
42,
true,
None,
"qwen2.5vl",
)
.expect("seed success row");
}
// Failure row
store
.record_pdf_ocr_event(
"run-bbb",
"2026-05-28T10:00:00Z",
Some("doc-abc"),
"path/scanned.pdf",
4,
Some(30_000),
Some(200),
Some(150),
9999,
0,
false,
Some("ocr_error"),
"qwen2.5vl",
)
.expect("seed failure row");
// Row for different doc
store
.record_pdf_ocr_event(
"run-ccc",
"2026-05-28T11:00:00Z",
Some("doc-xyz"),
"path/other.pdf",
1,
None,
None,
None,
200,
10,
true,
None,
"qwen2.5vl",
)
.expect("seed doc-xyz row");
// Trigger migration (no-op if already done via App::open_with_config)
let _ = env;
}
fn open_app_with_seeded_events(env: &TestEnv) -> App {
let app = env.app();
let store = SqliteStore::open(&env.config).expect("open store for seed");
store.run_migrations().expect("run migrations for seed");
seed_ocr_events(env, &store);
app
}
/// AC-4: `inspect_ocr_stats` returns `schema_version = "ocr_stats.v1"`,
/// `total_events >= 1`, `0 ≤ success_rate ≤ 1`.
#[test]
fn ocr_stats_after_seeded_events() {
let env = TestEnv::lexical_only();
let app = open_app_with_seeded_events(&env);
let stats = app.inspect_ocr_stats().expect("inspect_ocr_stats");
assert_eq!(stats.schema_version, "ocr_stats.v1");
assert!(stats.total_events >= 1, "total_events should be >= 1");
assert!(
(0.0..=1.0).contains(&stats.success_rate),
"success_rate must be in [0, 1]: {}",
stats.success_rate
);
assert!(stats.total_runs >= 1, "total_runs should be >= 1");
// by_engine should have at least one entry
assert!(!stats.by_engine.is_empty(), "by_engine must be non-empty");
}
/// AC-6: `inspect_ocr_failures` (no doc_id, corpus-wide) returns failures list.
#[test]
fn ocr_failures_corpus_wide() {
let env = TestEnv::lexical_only();
let app = open_app_with_seeded_events(&env);
let result = app
.inspect_ocr_failures(None, 10)
.expect("inspect_ocr_failures");
assert_eq!(result.schema_version, "ocr_failures.v1");
assert!(result.failure_count >= 1, "expected at least 1 failure");
assert!(
!result.failures.is_empty(),
"failures list must be non-empty"
);
}
/// AC-5: `inspect_ocr_failures` with doc_id filter returns matching rows.
#[test]
fn ocr_failures_filter_by_doc_id() {
let env = TestEnv::lexical_only();
let app = open_app_with_seeded_events(&env);
let result = app
.inspect_ocr_failures(Some("doc-abc"), 10)
.expect("inspect_ocr_failures by doc_id");
assert_eq!(result.schema_version, "ocr_failures.v1");
assert_eq!(
result.doc_id.as_deref(),
Some("doc-abc"),
"doc_id must be echoed back"
);
// All rows must belong to doc-abc (no cross-doc leak)
for row in &result.failures {
// rows are failure rows for doc-abc only (reason = ocr_error)
assert_eq!(row.reason, "ocr_error");
}
}
/// AC-13: SKILL.md lists both new wire schemas.
#[test]
fn skill_md_lists_new_schemas() {
let skill_md = std::fs::read_to_string("../../integrations/claude-code/kebab/SKILL.md")
.expect("read SKILL.md");
assert!(
skill_md.contains("ocr_stats.v1"),
"SKILL.md must mention ocr_stats.v1"
);
assert!(
skill_md.contains("ocr_failures.v1"),
"SKILL.md must mention ocr_failures.v1"
);
}

View File

@@ -0,0 +1,81 @@
//! Tests for `App::open_with_config`'s NLI verifier construction path.
//!
//! Coverage:
//! 1. `open_with_config_nli_fails_when_model_dir_unwritable_and_threshold_positive` —
//! when `rag.nli_threshold > 0` and `storage.model_dir` is unwritable,
//! `open_with_config` returns `Err` with "OnnxNliVerifier" in the
//! error chain.
//! 2. `open_with_config_nli_skipped_when_threshold_zero` —
//! same bad `model_dir`, but `rag.nli_threshold = 0.0` (gate disabled),
//! so `OnnxNliVerifier::new` is never called and `open_with_config`
//! succeeds.
//!
//! `/proc/1/root` is the init process's filesystem root; on Linux it is
//! owned by root and not traversable by unprivileged users, making
//! `create_dir_all` fail with `EACCES` — a reliable "unwritable path"
//! that requires no test setup beyond the path literal.
use kebab_config::Config;
/// Return a `Config` whose `data_dir` lives in a fresh `TempDir`
/// (so `SqliteStore::open` succeeds) and whose `model_dir` is set to
/// `/proc/1/root` (unwritable by non-root processes on Linux).
///
/// The `TempDir` is returned alongside the config so the caller keeps
/// it alive until the test completes — dropping it early would delete
/// the data directory before any assertions run.
fn config_with_unwritable_model_dir() -> (tempfile::TempDir, Config) {
let tmp = tempfile::tempdir().expect("tempdir");
let mut cfg = Config::defaults();
// Valid data_dir → SqliteStore::open + run_migrations succeed.
cfg.storage.data_dir = tmp.path().to_string_lossy().into_owned();
// /proc/1/root is only accessible to root; create_dir_all will
// return EACCES for any unprivileged user, which is exactly the
// failure mode we want to exercise.
cfg.storage.model_dir = "/proc/1/root".to_string();
(tmp, cfg)
}
// ── 1. Failure path: threshold > 0 + unwritable model_dir ─────────────────
#[test]
fn open_with_config_nli_fails_when_model_dir_unwritable_and_threshold_positive() {
let (_tmp, mut cfg) = config_with_unwritable_model_dir();
cfg.rag.nli_threshold = 0.5; // gate enabled → OnnxNliVerifier::new runs
let result = kebab_app::App::open_with_config(cfg);
let Err(err) = result else {
panic!(
"App::open_with_config must fail when model_dir is unwritable and nli_threshold > 0"
);
};
// The error chain must identify the OnnxNliVerifier as the source so
// an operator reading logs can trace the failure to the NLI config.
let err_chain = format!("{err:?}");
assert!(
err_chain.contains("OnnxNliVerifier"),
"error chain must mention OnnxNliVerifier; full chain: {err_chain}"
);
}
// ── 2. Success path: threshold = 0.0 → NLI verifier never constructed ──────
#[test]
fn open_with_config_nli_skipped_when_threshold_zero() {
let (_tmp, cfg) = config_with_unwritable_model_dir();
// Default nli_threshold is 0.0 — gate disabled, verifier skipped.
assert!(
(cfg.rag.nli_threshold - 0.0).abs() < f32::EPSILON,
"precondition: default nli_threshold must be 0.0 (gate disabled)"
);
// A bad model_dir must NOT cause a failure when the NLI gate is off.
let result = kebab_app::App::open_with_config(cfg);
assert!(
result.is_ok(),
"App::open_with_config must succeed when nli_threshold = 0.0 \
(OnnxNliVerifier is never constructed); err: {:?}",
result.err()
);
}

View File

@@ -0,0 +1,358 @@
//! Integration tests for pdf_ocr_apply helper. spec §5.5 MockOcrEngine pattern.
mod common;
use std::path::{Path, PathBuf};
use std::sync::Arc;
use std::sync::atomic::AtomicBool;
use common::mock_ocr::MockOcrEngine;
use kebab_app::pdf_ocr_apply::{PdfOcrOpts, apply_ocr_to_pdf_pages};
use kebab_core::{
AssetStorage, Block, CanonicalDocument, Checksum, ExtractConfig, ExtractContext, Extractor,
Inline, Lang, MediaType, RawAsset, SourceSpan, SourceUri, WorkspacePath, id_for_asset,
};
use kebab_parse_pdf::PdfTextExtractor;
use time::OffsetDateTime;
// ── Fixture helpers ───────────────────────────────────────────────────────
fn f1_pdf_bytes() -> Vec<u8> {
std::fs::read("../kebab-parse-pdf/tests/fixtures/scanned_page1.pdf")
.expect("F1 fixture missing")
}
fn make_raw_asset(path: &str, media_type: MediaType, byte_len: u64) -> RawAsset {
let fake_hash = "0".repeat(64);
let asset_id = id_for_asset(&fake_hash);
RawAsset {
asset_id,
source_uri: SourceUri::File(PathBuf::from(path)),
workspace_path: WorkspacePath::new(path.to_string()).unwrap(),
media_type,
byte_len,
checksum: Checksum(fake_hash.clone()),
discovered_at: OffsetDateTime::UNIX_EPOCH,
stored: AssetStorage::Copied {
path: PathBuf::from(path),
},
}
}
/// Build a CanonicalDocument from raw PDF bytes using PdfTextExtractor.
/// F1 (scanned) returns an empty-text Block::Paragraph per page.
fn extract_canonical_from_bytes(bytes: &[u8]) -> CanonicalDocument {
let asset = make_raw_asset("test.pdf", MediaType::Pdf, bytes.len() as u64);
let workspace_root = Path::new("/");
let config = ExtractConfig::default();
let ctx = ExtractContext {
asset: &asset,
workspace_root,
config: &config,
};
PdfTextExtractor::new().extract(&ctx, bytes).unwrap()
}
/// F1 bytes → canonical with 1 empty Block::Paragraph for page 1.
fn canonical_with_empty_block() -> CanonicalDocument {
extract_canonical_from_bytes(&f1_pdf_bytes())
}
/// F1-based canonical with block text replaced by `text` (high valid_ratio, chars≥20).
fn canonical_with_filled_block(text: &str) -> CanonicalDocument {
let mut canonical = extract_canonical_from_bytes(&f1_pdf_bytes());
if let Some(Block::Paragraph(tb)) = canonical.blocks.first_mut() {
let char_count = text.chars().count() as u32;
tb.text = text.to_string();
tb.inlines = vec![Inline::Text {
text: text.to_string(),
}];
if let SourceSpan::Page { char_end, .. } = &mut tb.common.source_span {
*char_end = Some(char_count);
}
}
canonical
}
/// F1-based canonical with block text replaced by PUA codepoints (low valid_ratio).
fn canonical_with_mojibake_block() -> CanonicalDocument {
let mut canonical = extract_canonical_from_bytes(&f1_pdf_bytes());
if let Some(Block::Paragraph(tb)) = canonical.blocks.first_mut() {
let pua = "\u{E000}".repeat(25); // 25 PUA codepoints → valid_ratio ≈ 0
let char_count = pua.chars().count() as u32;
tb.text = pua.clone();
tb.inlines = vec![Inline::Text { text: pua }];
if let SourceSpan::Page { char_end, .. } = &mut tb.common.source_span {
*char_end = Some(char_count);
}
}
canonical
}
fn default_opts(enabled: bool) -> PdfOcrOpts {
PdfOcrOpts {
enabled,
always_on: false,
valid_ratio_threshold: 0.5,
min_char_count: 20,
lang_hint: None,
cancel: None,
}
}
// ── Tests ─────────────────────────────────────────────────────────────────
// Test 1: F1 + enabled=true → in-place mutate
#[test]
fn f1_input_with_ocr_enabled_replaces_empty_block() {
let bytes = f1_pdf_bytes();
let mut canonical = canonical_with_empty_block();
let engine = MockOcrEngine::single("MOCK_OCR_TEXT", false);
let opts = PdfOcrOpts {
enabled: true,
always_on: false,
valid_ratio_threshold: 0.5,
min_char_count: 20,
lang_hint: Some(Lang("kor".into())),
cancel: None,
};
let summary = apply_ocr_to_pdf_pages(&mut canonical, &engine, &bytes, &opts, |_| {}).unwrap();
assert_eq!(summary.pages_ocrd, 1);
let first_para = canonical.blocks.iter().find_map(|b| match b {
Block::Paragraph(tb) => Some(tb),
_ => None,
});
assert!(first_para.is_some());
assert_eq!(first_para.unwrap().text, "MOCK_OCR_TEXT");
}
// Test 2: F3 vector (mock filled canonical) + enabled=true → OCR skip (needs_ocr=false)
#[test]
fn f3_input_with_ocr_enabled_keeps_text_detect_blocks() {
let bytes = f1_pdf_bytes(); // reuse F1 bytes; decision is based on canonical text
let text = "충분한 한국어 텍스트 컨텐츠입니다. This has more than twenty characters.";
let mut canonical = canonical_with_filled_block(text);
let engine = MockOcrEngine::single("SHOULD_NOT_BE_CALLED", false);
let opts = default_opts(true);
let summary = apply_ocr_to_pdf_pages(&mut canonical, &engine, &bytes, &opts, |_| {}).unwrap();
assert_eq!(summary.pages_ocrd, 0, "vector PDF 의 OCR 호출 0");
let first_para = canonical.blocks.iter().find_map(|b| match b {
Block::Paragraph(tb) => Some(tb),
_ => None,
});
if let Some(tb) = first_para {
assert!(tb.text.starts_with("충분한"), "원본 text 보존");
}
}
// Test 3: F1 + enabled=false → no-op
#[test]
fn f1_input_with_ocr_disabled_keeps_empty_block() {
let bytes = f1_pdf_bytes();
let mut canonical = canonical_with_empty_block();
let engine = MockOcrEngine::single("IGNORED", false);
let opts = default_opts(false);
let summary = apply_ocr_to_pdf_pages(&mut canonical, &engine, &bytes, &opts, |_| {}).unwrap();
assert_eq!(summary.pages_ocrd, 0);
assert_eq!(summary.ms_total, 0);
}
// Test 4: mojibake canonical (PUA chars) + enabled=true → in-place mutate
#[test]
fn f4_input_with_ocr_enabled_replaces_mojibake_block() {
let bytes = f1_pdf_bytes(); // F1 bytes carry DCTDecode image
let mut canonical = canonical_with_mojibake_block();
let engine = MockOcrEngine::single("OCR_MOJIBAKE_REPLACEMENT", false);
let opts = PdfOcrOpts {
enabled: true,
always_on: false,
valid_ratio_threshold: 0.5,
min_char_count: 20,
lang_hint: None,
cancel: None,
};
let summary = apply_ocr_to_pdf_pages(&mut canonical, &engine, &bytes, &opts, |_| {}).unwrap();
assert_eq!(summary.pages_ocrd, 1, "mojibake page 의 OCR 호출");
let first_para = canonical.blocks.iter().find_map(|b| match b {
Block::Paragraph(tb) => Some(tb),
_ => None,
});
if let Some(tb) = first_para {
assert_eq!(tb.text, "OCR_MOJIBAKE_REPLACEMENT");
}
}
// Test 5: filled canonical + always_on=true → dual-block (+1 OCR block)
#[test]
fn f3_input_with_always_on_pushes_dual_blocks() {
let bytes = f1_pdf_bytes();
let text = "vector PDF 충분한 텍스트 컨텐츠입니다. This has enough characters for valid ratio.";
let mut canonical = canonical_with_filled_block(text);
let original_block_count = canonical.blocks.len();
let engine = MockOcrEngine::single("OCR_DUAL", false);
let opts = PdfOcrOpts {
enabled: true,
always_on: true,
valid_ratio_threshold: 0.5,
min_char_count: 20,
lang_hint: None,
cancel: None,
};
let summary = apply_ocr_to_pdf_pages(&mut canonical, &engine, &bytes, &opts, |_| {}).unwrap();
assert_eq!(summary.pages_ocrd, 1);
assert_eq!(
canonical.blocks.len(),
original_block_count + 1,
"always_on 시 새 Block::Paragraph push"
);
let texts: Vec<&str> = canonical
.blocks
.iter()
.filter_map(|b| match b {
Block::Paragraph(tb) => Some(tb.text.as_str()),
_ => None,
})
.collect();
assert!(texts.contains(&"OCR_DUAL"), "OCR block 포함");
assert!(
texts.iter().any(|t| t.starts_with("vector")),
"원본 text-detect block 보존"
);
}
// Test 6: F6 FlateDecode → extract_dctdecode_page_image=None → skip + warning
#[test]
fn f6_flatedecode_skipped_with_warning() {
let bytes = std::fs::read("../kebab-parse-pdf/tests/fixtures/flate_raw.pdf")
.expect("F6 fixture missing");
let mut canonical = canonical_with_empty_block(); // page-1 block from F1
let engine = MockOcrEngine::single("SHOULD_NOT_BE_CALLED", false);
let opts = default_opts(true);
let summary = apply_ocr_to_pdf_pages(&mut canonical, &engine, &bytes, &opts, |_| {}).unwrap();
assert_eq!(
summary.pages_ocrd, 0,
"FlateDecode page 는 skip (DCTDecode-only v1 invariant)"
);
let warning_count = canonical
.provenance
.events
.iter()
.filter(|e| e.kind == kebab_core::ProvenanceKind::Warning)
.count();
assert!(warning_count >= 1, "FlateDecode skip 시 Warning event 발행");
}
// Test 7: F7 CCITTFax → skip + warning (verifier M-4 split)
#[test]
fn f7_ccittfax_skipped_with_warning() {
let bytes =
std::fs::read("../kebab-parse-pdf/tests/fixtures/ccitt.pdf").expect("F7 fixture missing");
let mut canonical = canonical_with_empty_block(); // page-1 block from F1
let engine = MockOcrEngine::single("SHOULD_NOT_BE_CALLED", false);
let opts = default_opts(true);
let summary = apply_ocr_to_pdf_pages(&mut canonical, &engine, &bytes, &opts, |_| {}).unwrap();
assert_eq!(summary.pages_ocrd, 0, "CCITTFax page 는 skip");
let warning_count = canonical
.provenance
.events
.iter()
.filter(|e| e.kind == kebab_core::ProvenanceKind::Warning)
.count();
assert!(warning_count >= 1, "CCITTFax skip 시 Warning event 발행");
}
// Test 8: OCR engine failure → warning event + skip
#[test]
fn ocr_engine_failure_surfaces_as_warning() {
let bytes = f1_pdf_bytes();
let mut canonical = canonical_with_empty_block();
let engine = MockOcrEngine::single("", true);
let opts = default_opts(true);
let summary = apply_ocr_to_pdf_pages(&mut canonical, &engine, &bytes, &opts, |_| {}).unwrap();
assert_eq!(summary.pages_ocrd, 0, "OCR failure 시 pages_ocrd=0");
let warning_with_failure = canonical.provenance.events.iter().any(|e| {
e.kind == kebab_core::ProvenanceKind::Warning
&& e.note.as_deref().unwrap_or("").contains("mock failure")
});
assert!(
warning_with_failure,
"OCR failure 의 error message 가 warning event 의 note 안"
);
}
// Test 9: dual-block ordinals are deterministic and unique
#[test]
fn dual_block_ordinals_are_deterministic_and_unique() {
let bytes = f1_pdf_bytes(); // 1-page PDF → page_count=1
let text = "vector 충분한 텍스트. This text has more than twenty characters total.";
let mut canonical = canonical_with_filled_block(text);
let engine = MockOcrEngine::single("DUAL", false);
let opts = PdfOcrOpts {
enabled: true,
always_on: true,
valid_ratio_threshold: 0.5,
min_char_count: 20,
lang_hint: None,
cancel: None,
};
apply_ocr_to_pdf_pages(&mut canonical, &engine, &bytes, &opts, |_| {}).unwrap();
// page_count=1 → text-detect ordinal=0, ocr ordinal=1 (page_num-1 + page_count = 0+1=1)
let para_count = canonical
.blocks
.iter()
.filter(|b| matches!(b, Block::Paragraph(_)))
.count();
assert_eq!(para_count, 2, "dual-block: text-detect + OCR");
let all_page_1 = canonical
.blocks
.iter()
.filter_map(|b| match b {
Block::Paragraph(tb) => Some(&tb.common.source_span),
_ => None,
})
.all(|s| matches!(s, SourceSpan::Page { page: 1, .. }));
assert!(all_page_1, "두 block 모두 page=1");
}
// Test 10: cancel handle aborts mid-PDF
#[test]
fn cancel_handle_aborts_mid_pdf() {
let bytes = f1_pdf_bytes();
let mut canonical = canonical_with_empty_block();
let cancel = Arc::new(AtomicBool::new(true)); // pre-cancel
let engine = MockOcrEngine::single("IGNORED", false);
let opts = PdfOcrOpts {
enabled: true,
always_on: false,
valid_ratio_threshold: 0.5,
min_char_count: 20,
lang_hint: None,
cancel: Some(cancel.clone()),
};
let result = apply_ocr_to_pdf_pages(&mut canonical, &engine, &bytes, &opts, |_| {});
let err = result.expect_err("cancel=true 시 error 반환");
assert!(
format!("{err}").contains("cancelled mid-PDF"),
"error message 가 'cancelled mid-PDF' 포함: {err}"
);
}

View File

@@ -0,0 +1,139 @@
//! Integration smoke test: dual-write (ndjson + SQLite) for PDF OCR events.
//! AC-3: SQLite row count and doc_id matches ndjson LogEvent::Ocr.
//!
//! Uses wiremock to stub the Ollama `/api/generate` endpoint so the test
//! runs without a live Ollama instance.
mod common;
use std::path::PathBuf;
use common::TestEnv;
use kebab_config::LoggingCfg;
use serde_json::Value;
use tokio::task::spawn_blocking;
use wiremock::matchers::{method, path};
use wiremock::{Mock, MockServer, ResponseTemplate};
fn scanned_pdf_src() -> PathBuf {
PathBuf::from(env!("CARGO_MANIFEST_DIR"))
.parent()
.unwrap()
.join("kebab-parse-pdf/tests/fixtures/scanned_page1.pdf")
}
/// AC-3: ndjson OCR line count == pdf_ocr_events row count, and doc_id matches.
#[tokio::test]
async fn ingest_dual_write_doc_id_matches_ndjson() {
let src = scanned_pdf_src();
if !src.exists() {
eprintln!("skipping test: scanned_page1.pdf fixture not found");
return;
}
let server = MockServer::start().await;
// Stub Ollama /api/generate to return a minimal OCR response.
Mock::given(method("POST"))
.and(path("/api/generate"))
.respond_with(ResponseTemplate::new(200).set_body_json(serde_json::json!({
"model": "qwen2.5vl:3b",
"response": "test ocr output",
"done": true,
"done_reason": "stop"
})))
.mount(&server)
.await;
let mock_url = server.uri();
let result = spawn_blocking(move || {
let mut env = TestEnv::lexical_only();
// Enable PDF OCR + set up mock endpoint
env.config.pdf.ocr.enabled = true;
env.config.pdf.ocr.endpoint = Some(mock_url.clone());
env.config.pdf.ocr.model = "qwen2.5vl:3b".to_string();
// Enable ingest log
let log_dir = env.temp.path().join("logs");
std::fs::create_dir_all(&log_dir).unwrap();
env.config.logging = LoggingCfg {
ingest_log_enabled: true,
ingest_log_dir: log_dir.clone(),
..Default::default()
};
// Copy scanned PDF into workspace
let dest = env.workspace_root.join("scanned.pdf");
std::fs::copy(scanned_pdf_src(), &dest).expect("copy scanned PDF");
// Run ingest
kebab_app::ingest_with_config(env.config.clone(), env.scope(), false).expect("ingest");
// Read ndjson log
let log_files: Vec<_> = std::fs::read_dir(&log_dir)
.unwrap()
.filter_map(Result::ok)
.filter(|e| {
let name = e.file_name().to_string_lossy().to_string();
name.starts_with("ingest-") && name.ends_with(".ndjson")
})
.collect();
assert_eq!(log_files.len(), 1, "expected 1 ndjson log file");
let body = std::fs::read_to_string(log_files[0].path()).unwrap();
let ocr_lines: Vec<Value> = body
.lines()
.filter_map(|l| serde_json::from_str(l).ok())
.filter(|v: &Value| v.get("kind").and_then(Value::as_str) == Some("ocr"))
.collect();
// Read pdf_ocr_events from SQLite
let db_path = PathBuf::from(&env.config.storage.data_dir).join("kebab.sqlite");
let conn = rusqlite::Connection::open(&db_path).expect("open db");
let rows: Vec<(Option<String>, String)> = {
let mut stmt = conn
.prepare("SELECT doc_id, doc_path FROM pdf_ocr_events ORDER BY id")
.expect("prepare");
stmt.query_map([], |r| Ok((r.get(0)?, r.get(1)?)))
.expect("query")
.map(|r| r.expect("row"))
.collect()
};
(ocr_lines, rows)
})
.await
.expect("spawn_blocking");
let (ocr_lines, rows) = result;
// At least one OCR event must be produced
assert!(!ocr_lines.is_empty(), "expected ≥1 ndjson ocr line");
assert!(!rows.is_empty(), "expected ≥1 pdf_ocr_events row");
// Row counts must match
assert_eq!(
ocr_lines.len(),
rows.len(),
"ndjson ocr lines ({}) must equal pdf_ocr_events rows ({})",
ocr_lines.len(),
rows.len()
);
// doc_id in both sources must be non-null and consistent
for (line, (sql_doc_id, _sql_doc_path)) in ocr_lines.iter().zip(rows.iter()) {
let json_doc_id = line.get("doc_id").and_then(Value::as_str);
assert!(
json_doc_id.is_some(),
"ndjson ocr line should have doc_id: {line}"
);
assert!(
sql_doc_id.is_some(),
"pdf_ocr_events row should have doc_id"
);
assert_eq!(
json_doc_id,
sql_doc_id.as_deref(),
"ndjson doc_id must equal SQLite doc_id"
);
}
}

View File

@@ -46,17 +46,13 @@ fn build_text_pdf(pages: &[Option<&str>]) -> Vec<u8> {
operations: vec![
Operation::new("BT", vec![]),
Operation::new("Tf", vec!["F1".into(), 24.into()]),
Operation::new(
"Td",
vec![Object::Integer(100), Object::Integer(700)],
),
Operation::new("Td", vec![Object::Integer(100), Object::Integer(700)]),
Operation::new("Tj", vec![Object::string_literal(*text)]),
Operation::new("ET", vec![]),
],
};
let stream_data = content.encode().expect("content encode");
let content_id =
doc.add_object(Stream::new(dictionary! {}, stream_data));
let content_id = doc.add_object(Stream::new(dictionary! {}, stream_data));
page_dict.set("Contents", content_id);
}
let page_id = doc.add_object(page_dict);
@@ -76,8 +72,7 @@ fn build_text_pdf(pages: &[Option<&str>]) -> Vec<u8> {
Object::Integer(842),
],
};
doc.objects
.insert(pages_id, Object::Dictionary(pages_dict));
doc.objects.insert(pages_id, Object::Dictionary(pages_dict));
let catalog_id = doc.add_object(dictionary! {
"Type" => "Catalog",
@@ -146,9 +141,8 @@ fn ingest_3_page_pdf_produces_one_doc_and_per_page_chunks() {
write_pdf(&env.workspace_root, "three.pdf", &bytes);
let cfg = cfg_with_pdf(&env);
let report =
kebab_app::ingest_with_config(cfg.clone(), env.scope(), false)
.expect("PDF ingest must succeed");
let report = kebab_app::ingest_with_config(cfg.clone(), env.scope(), false)
.expect("PDF ingest must succeed");
assert_eq!(report.errors, 0);
let items = report.items.as_ref().expect("items present");
@@ -157,23 +151,28 @@ fn ingest_3_page_pdf_produces_one_doc_and_per_page_chunks() {
.find(|i| i.doc_path.0.ends_with("three.pdf"))
.expect("PDF item present");
assert_eq!(pdf_item.kind, IngestItemKind::New);
assert_eq!(pdf_item.block_count, Some(3), "one Block::Paragraph per page");
assert_eq!(pdf_item.chunk_count, Some(3), "one chunk per non-empty page");
assert_eq!(
pdf_item.block_count,
Some(3),
"one Block::Paragraph per page"
);
assert_eq!(
pdf_item.chunk_count,
Some(3),
"one chunk per non-empty page"
);
assert_eq!(
pdf_item.parser_version.as_ref().map(|p| p.0.as_str()),
Some("pdf-text-v1")
);
assert_eq!(
pdf_item.chunker_version.as_ref().map(|c| c.0.as_str()),
Some("pdf-page-v1")
Some("pdf-page-v1.1")
);
// Inspect the stored doc to confirm SourceSpan::Page round-trip.
let doc = kebab_app::inspect_doc_with_config(
cfg,
pdf_item.doc_id.as_ref().unwrap(),
)
.expect("inspect_doc returns the PDF document");
let doc = kebab_app::inspect_doc_with_config(cfg, pdf_item.doc_id.as_ref().unwrap())
.expect("inspect_doc returns the PDF document");
assert_eq!(doc.blocks.len(), 3);
for (i, block) in doc.blocks.iter().enumerate() {
let want_page = (i as u32) + 1;
@@ -202,8 +201,7 @@ fn re_ingest_identical_pdf_produces_unchanged_with_same_doc_id() {
write_pdf(&env.workspace_root, "stable.pdf", &bytes);
let cfg = cfg_with_pdf(&env);
let report1 =
kebab_app::ingest_with_config(cfg.clone(), env.scope(), false).unwrap();
let report1 = kebab_app::ingest_with_config(cfg.clone(), env.scope(), false).unwrap();
let item1 = report1
.items
.as_ref()
@@ -214,8 +212,7 @@ fn re_ingest_identical_pdf_produces_unchanged_with_same_doc_id() {
.unwrap();
assert_eq!(item1.kind, IngestItemKind::New);
let report2 =
kebab_app::ingest_with_config(cfg.clone(), env.scope(), false).unwrap();
let report2 = kebab_app::ingest_with_config(cfg.clone(), env.scope(), false).unwrap();
let item2 = report2
.items
.unwrap()
@@ -239,8 +236,7 @@ fn re_ingest_edited_pdf_produces_new_doc_id() {
std::fs::write(&path, &bytes_v1).unwrap();
let cfg = cfg_with_pdf(&env);
let report_v1 =
kebab_app::ingest_with_config(cfg.clone(), env.scope(), false).unwrap();
let report_v1 = kebab_app::ingest_with_config(cfg.clone(), env.scope(), false).unwrap();
let id_v1 = report_v1
.items
.as_ref()
@@ -252,12 +248,10 @@ fn re_ingest_edited_pdf_produces_new_doc_id() {
.clone()
.unwrap();
let bytes_v2 =
build_text_pdf(&[Some("VERSION TWO entirely different body content.")]);
let bytes_v2 = build_text_pdf(&[Some("VERSION TWO entirely different body content.")]);
std::fs::write(&path, &bytes_v2).unwrap();
let report_v2 =
kebab_app::ingest_with_config(cfg.clone(), env.scope(), false).unwrap();
let report_v2 = kebab_app::ingest_with_config(cfg.clone(), env.scope(), false).unwrap();
let item_v2 = report_v2
.items
.as_ref()
@@ -282,9 +276,11 @@ fn encrypted_pdf_fails_with_qpdf_hint() {
write_pdf(&env.workspace_root, "secret.pdf", &bytes);
let cfg = cfg_with_pdf(&env);
let report =
kebab_app::ingest_with_config(cfg, env.scope(), false).unwrap();
assert_eq!(report.errors, 1, "encrypted PDF must increment errors exactly once");
let report = kebab_app::ingest_with_config(cfg, env.scope(), false).unwrap();
assert_eq!(
report.errors, 1,
"encrypted PDF must increment errors exactly once"
);
let items = report.items.as_ref().unwrap();
let pdf_item = items
.iter()
@@ -310,9 +306,11 @@ fn corrupt_pdf_fails_without_storing() {
write_pdf(&env.workspace_root, "corrupt.pdf", &bytes);
let cfg = cfg_with_pdf(&env);
let report =
kebab_app::ingest_with_config(cfg.clone(), env.scope(), false).unwrap();
assert_eq!(report.errors, 1, "corrupt PDF must increment errors exactly once");
let report = kebab_app::ingest_with_config(cfg.clone(), env.scope(), false).unwrap();
assert_eq!(
report.errors, 1,
"corrupt PDF must increment errors exactly once"
);
let items = report.items.as_ref().unwrap();
let pdf_item = items
.iter()
@@ -322,11 +320,8 @@ fn corrupt_pdf_fails_without_storing() {
// Confirm the doc was NOT stored — list_docs returns nothing for
// this path.
let summaries = kebab_app::list_docs_with_config(
cfg,
kebab_core::DocFilter::default(),
)
.unwrap();
let summaries =
kebab_app::list_docs_with_config(cfg, kebab_core::DocFilter::default()).unwrap();
assert!(
!summaries
.iter()
@@ -341,14 +336,15 @@ fn corrupt_pdf_fails_without_storing() {
#[test]
fn mixed_page_pdf_stores_asset_with_scanned_candidate_warning() {
let env = TestEnv::lexical_only();
let bytes =
build_text_pdf(&[Some("first page"), None, Some("third page")]);
let bytes = build_text_pdf(&[Some("first page"), None, Some("third page")]);
write_pdf(&env.workspace_root, "mixed.pdf", &bytes);
let cfg = cfg_with_pdf(&env);
let report =
kebab_app::ingest_with_config(cfg.clone(), env.scope(), false).unwrap();
assert_eq!(report.errors, 0, "scanned candidate is a Warning, not Error");
let report = kebab_app::ingest_with_config(cfg.clone(), env.scope(), false).unwrap();
assert_eq!(
report.errors, 0,
"scanned candidate is a Warning, not Error"
);
let pdf_item = report
.items
.as_ref()
@@ -365,14 +361,10 @@ fn mixed_page_pdf_stores_asset_with_scanned_candidate_warning() {
assert_eq!(
pdf_item.chunk_count,
Some(2),
"pdf-page-v1 emits 0 chunks for the empty page; total = 2"
"pdf-page-v1.1 emits 0 chunks for the empty page; total = 2"
);
let doc = kebab_app::inspect_doc_with_config(
cfg,
pdf_item.doc_id.as_ref().unwrap(),
)
.unwrap();
let doc = kebab_app::inspect_doc_with_config(cfg, pdf_item.doc_id.as_ref().unwrap()).unwrap();
let warnings: Vec<_> = doc
.provenance
.events
@@ -419,8 +411,7 @@ fn ingest_report_arithmetic_invariant_holds_with_corrupt_pdf() {
write_pdf(&env.workspace_root, "broken.pdf", &corrupt_pdf());
let cfg = cfg_with_pdf(&env);
let report =
kebab_app::ingest_with_config(cfg, env.scope(), false).unwrap();
let report = kebab_app::ingest_with_config(cfg, env.scope(), false).unwrap();
let total = report.new + report.updated + report.skipped + report.errors;
assert_eq!(
report.scanned, total,
@@ -441,14 +432,12 @@ fn long_pdf_round_trips_through_lexical_pipeline() {
let pages: Vec<String> = (1..=50)
.map(|i| format!("Page {i} body — lorem ipsum dolor sit amet."))
.collect();
let page_refs: Vec<Option<&str>> =
pages.iter().map(|s| Some(s.as_str())).collect();
let page_refs: Vec<Option<&str>> = pages.iter().map(|s| Some(s.as_str())).collect();
let bytes = build_text_pdf(&page_refs);
write_pdf(&env.workspace_root, "long.pdf", &bytes);
let cfg = cfg_with_pdf(&env);
let report =
kebab_app::ingest_with_config(cfg.clone(), env.scope(), false).unwrap();
let report = kebab_app::ingest_with_config(cfg.clone(), env.scope(), false).unwrap();
assert_eq!(report.errors, 0);
let pdf_item = report
.items
@@ -466,8 +455,7 @@ fn long_pdf_round_trips_through_lexical_pipeline() {
// Round-trip: list_docs sees the long PDF.
let summaries =
kebab_app::list_docs_with_config(cfg, kebab_core::DocFilter::default())
.unwrap();
kebab_app::list_docs_with_config(cfg, kebab_core::DocFilter::default()).unwrap();
assert!(summaries.iter().any(|s| s.doc_path.0.ends_with("long.pdf")));
}
@@ -476,13 +464,11 @@ fn long_pdf_round_trips_through_lexical_pipeline() {
#[test]
fn inspect_doc_surfaces_page_spans() {
let env = TestEnv::lexical_only();
let bytes =
build_text_pdf(&[Some("alpha body"), Some("beta body"), Some("gamma body")]);
let bytes = build_text_pdf(&[Some("alpha body"), Some("beta body"), Some("gamma body")]);
write_pdf(&env.workspace_root, "inspect.pdf", &bytes);
let cfg = cfg_with_pdf(&env);
let report =
kebab_app::ingest_with_config(cfg.clone(), env.scope(), false).unwrap();
let report = kebab_app::ingest_with_config(cfg.clone(), env.scope(), false).unwrap();
let pdf_item = report
.items
.as_ref()
@@ -490,19 +476,12 @@ fn inspect_doc_surfaces_page_spans() {
.iter()
.find(|i| i.doc_path.0.ends_with("inspect.pdf"))
.unwrap();
let doc = kebab_app::inspect_doc_with_config(
cfg,
pdf_item.doc_id.as_ref().unwrap(),
)
.unwrap();
let doc = kebab_app::inspect_doc_with_config(cfg, pdf_item.doc_id.as_ref().unwrap()).unwrap();
assert_eq!(doc.parser_version.0, "pdf-text-v1");
assert_eq!(doc.blocks.len(), 3);
for block in &doc.blocks {
match block {
Block::Paragraph(p) => assert!(matches!(
p.common.source_span,
SourceSpan::Page { .. }
)),
Block::Paragraph(p) => assert!(matches!(p.common.source_span, SourceSpan::Page { .. })),
other => panic!("expected Paragraph, got {other:?}"),
}
}

View File

@@ -0,0 +1,137 @@
//! Integration test for `kebab reset --orphans-only`.
//!
//! Verifies that stored docs outside the current walker scope are purged
//! from the store without removing any files from the filesystem.
//!
//! Test outline:
//! 1. Ingest 3 .rs files (a.rs, b.rs, c.rs) — all New.
//! 2. Narrow the config `include` to `["a.rs"]` only; b.rs and c.rs are
//! still on disk but outside the walker scope.
//! 3. Run `execute(ResetScope::OrphansOnly, &cfg)` — report must show
//! `orphans_purged == 2` and `purged_paths` contains b.rs + c.rs.
//! 4. `list docs` must show only a.rs.
//! 5. b.rs and c.rs must still exist on disk (no filesystem removal).
//! 6. Second reset → `orphans_purged == 0` (idempotent).
mod common;
use common::TestEnv;
use kebab_app::IngestOpts;
use kebab_app::reset::{ResetScope, execute};
use kebab_core::{DocFilter, DocumentStore, SourceScope};
/// Open the SqliteStore and list all `workspace_path` values.
fn list_doc_paths(env: &TestEnv) -> Vec<String> {
use kebab_store_sqlite::SqliteStore;
let store = SqliteStore::open(&env.config).unwrap();
store.run_migrations().unwrap();
store
.list_documents(&DocFilter::default())
.unwrap()
.into_iter()
.map(|d| d.doc_path.0)
.collect()
}
#[test]
fn reset_orphans_only_purges_out_of_scope_docs() {
let env = TestEnv::lexical_only();
// Write three .rs files into the workspace.
let a_path = env.workspace_root.join("a.rs");
let b_path = env.workspace_root.join("b.rs");
let c_path = env.workspace_root.join("c.rs");
std::fs::write(&a_path, "// file a\nfn alpha() {}\n").unwrap();
std::fs::write(&b_path, "// file b\nfn bravo() {}\n").unwrap();
std::fs::write(&c_path, "// file c\nfn charlie() {}\n").unwrap();
// Ingest all three with a wide scope.
let wide_scope = SourceScope {
root: env.workspace_root.clone(),
include: vec!["**/*.rs".to_string()],
exclude: env.config.workspace.exclude.clone(),
};
let first = kebab_app::ingest_with_config_opts(
env.config.clone(),
wide_scope,
false,
IngestOpts::default(),
)
.expect("first ingest must succeed");
// The fixture workspace may contain other .rs files — just assert we
// got at least 3 new docs (our a.rs, b.rs, c.rs).
assert!(first.new >= 3, "expected at least 3 new docs: {first:?}");
assert_eq!(first.errors, 0, "no errors on first ingest");
// Narrow config to include only a.rs; b.rs + c.rs are still on disk.
let mut narrow_cfg = env.config.clone();
narrow_cfg.workspace.exclude.clear();
// Re-point workspace root (already correct) and restrict include via
// the SourceScope in the connector. The config's `workspace.root` is
// used by `enumerate_orphans` to build its scope — we keep that
// pointing at the workspace root. We simulate narrowing by setting a
// glob that only matches a.rs.
//
// NOTE: `kebab_config::WorkspaceCfg` does not have an `include` field
// (it was removed in p9-fb-25). We narrow the scope via the walker
// exclude list: exclude b.rs and c.rs explicitly.
narrow_cfg.workspace.exclude = vec!["b.rs".to_string(), "c.rs".to_string()];
// Run orphans-only reset.
let report =
execute(ResetScope::OrphansOnly, &narrow_cfg).expect("orphans-only reset must succeed");
assert_eq!(
report.orphans_purged, 2,
"expected 2 orphans purged (b.rs + c.rs): {report:?}"
);
let mut purged: Vec<String> = report.purged_paths.iter().map(|p| p.0.clone()).collect();
purged.sort();
assert_eq!(
purged,
vec!["b.rs".to_string(), "c.rs".to_string()],
"purged_paths must list b.rs and c.rs in sorted order: {purged:?}"
);
// list docs must show only a.rs (and any pre-existing fixture files
// that are not excluded by the narrow config).
let doc_paths = list_doc_paths(&env);
// The narrow_cfg excludes b.rs + c.rs — they must no longer be in store.
assert!(
!doc_paths.iter().any(|p| p == "b.rs"),
"b.rs must be gone from store after orphans-only reset; got: {doc_paths:?}"
);
assert!(
!doc_paths.iter().any(|p| p == "c.rs"),
"c.rs must be gone from store after orphans-only reset; got: {doc_paths:?}"
);
assert!(
doc_paths.iter().any(|p| p == "a.rs"),
"a.rs must still be in store; got: {doc_paths:?}"
);
// Both b.rs and c.rs must still exist on the filesystem — no file
// removal is performed by orphans-only.
assert!(
b_path.exists(),
"b.rs must still be on disk after orphans-only reset"
);
assert!(
c_path.exists(),
"c.rs must still be on disk after orphans-only reset"
);
// Second reset must be idempotent: nothing left to purge.
let second = execute(ResetScope::OrphansOnly, &narrow_cfg)
.expect("second orphans-only reset must succeed");
assert_eq!(
second.orphans_purged, 0,
"second reset must be idempotent (orphans_purged == 0): {second:?}"
);
assert!(
second.purged_paths.is_empty(),
"second reset purged_paths must be empty: {:?}",
second.purged_paths
);
}

View File

@@ -0,0 +1,79 @@
//! Integration tests for Bug #13: schema.v1.models.active_parsers + active_chunkers.
use kebab_app::schema_with_config;
use kebab_config::Config;
use kebab_core::SourceScope;
fn minimal_config(data_dir: &std::path::Path, workspace_root: &std::path::Path) -> Config {
let mut cfg = Config::defaults();
cfg.workspace.root = workspace_root.to_string_lossy().into_owned();
cfg.workspace.exclude.clear();
cfg.storage.data_dir = data_dir.to_string_lossy().into_owned();
cfg.storage.model_dir = data_dir.join("models").to_string_lossy().into_owned();
cfg.models.embedding.provider = "none".to_string();
cfg.models.embedding.dimensions = 0;
cfg.chunking.target_tokens = 80;
cfg.chunking.overlap_tokens = 20;
cfg
}
fn minimal_scope(workspace_root: &std::path::Path) -> SourceScope {
SourceScope {
root: workspace_root.to_path_buf(),
include: vec![],
exclude: vec![],
}
}
#[test]
fn schema_models_active_arrays_empty_on_empty_corpus() {
let dir = tempfile::tempdir().unwrap();
let workspace = dir.path().join("kb");
std::fs::create_dir_all(&workspace).unwrap();
let cfg = minimal_config(dir.path(), &workspace);
let store = kebab_store_sqlite::SqliteStore::open(&cfg).unwrap();
store.run_migrations().unwrap();
drop(store);
let s = schema_with_config(&cfg).unwrap();
assert!(
s.models.active_parsers.is_empty(),
"empty corpus → no parsers"
);
assert!(
s.models.active_chunkers.is_empty(),
"empty corpus → no chunkers"
);
// backward compat: 기존 단일 field 는 markdown default 보존.
assert_eq!(s.models.parser_version, kebab_parse_md::PARSER_VERSION);
}
#[test]
fn schema_emits_active_parsers_and_chunkers_array_after_ingest() {
let dir = tempfile::tempdir().unwrap();
let workspace = dir.path().join("kb");
std::fs::create_dir_all(&workspace).unwrap();
std::fs::write(workspace.join("a.md"), "# A\nhello world\n").unwrap();
let cfg = minimal_config(dir.path(), &workspace);
let scope = minimal_scope(&workspace);
kebab_app::ingest_with_config(cfg.clone(), scope, false).unwrap();
let s = schema_with_config(&cfg).unwrap();
assert!(
!s.models.active_parsers.is_empty(),
"active_parsers populated after ingest"
);
assert!(
!s.models.active_chunkers.is_empty(),
"active_chunkers populated after ingest"
);
// active arrays must be sorted (ORDER BY in SQL).
let mut sorted = s.models.active_parsers.clone();
sorted.sort();
assert_eq!(
s.models.active_parsers, sorted,
"active_parsers must be sorted"
);
}

View File

@@ -57,7 +57,7 @@ fn schema_report_reflects_freshly_ingested_kb() {
schema.wire.schemas
);
assert!(schema.capabilities.json_mode);
assert!(!schema.capabilities.streaming_ask);
assert!(schema.capabilities.streaming_ask); // Bug #9: streaming_ask is now true
assert!(
schema.capabilities.mcp_server,
"mcp_server should be true after fb-30",

View File

@@ -27,7 +27,10 @@ fn search_with_opts_no_budget_matches_search() {
assert_eq!(resp.hits.len(), baseline.len());
assert!(!resp.truncated);
assert!(resp.next_cursor.is_none(), "k=5 against 1 doc → no next page");
assert!(
resp.next_cursor.is_none(),
"k=5 against 1 doc → no next page"
);
}
#[test]
@@ -62,7 +65,11 @@ fn budget_truncates_snippets_when_below_threshold() {
fn cursor_paginates_to_next_page() {
let env = common::TestEnv::new();
for i in 0..6 {
common::ingest_md(&env, &format!("d{i}.md"), &format!("# T{i}\n\nrust topic {i}\n"));
common::ingest_md(
&env,
&format!("d{i}.md"),
&format!("# T{i}\n\nrust topic {i}\n"),
);
}
let app = env.app();
@@ -88,7 +95,10 @@ fn cursor_paginates_to_next_page() {
page1.hits.iter().map(|h| h.chunk_id.0.clone()).collect();
let p2_ids: std::collections::HashSet<_> =
page2.hits.iter().map(|h| h.chunk_id.0.clone()).collect();
assert!(p1_ids.is_disjoint(&p2_ids), "page 2 must not repeat page 1 hits");
assert!(
p1_ids.is_disjoint(&p2_ids),
"page 2 must not repeat page 1 hits"
);
}
#[test]

View File

@@ -46,3 +46,152 @@ fn korean_lexical_query_returns_korean_document() {
hits.iter().map(|h| &h.doc_path.0).collect::<Vec<_>>()
);
}
/// A4 Step 1c — multi-token Korean query (`해시 충돌`) must hit when
/// the lexical builder routes it through a whole-phrase MATCH candidate.
///
/// Expected: FAIL until A5 (`build_match_string` redesign) lands — the
/// current builder emits `"해시" "충돌"` AND, but FTS5 trigram tokenizer
/// has no 2-char terms so each side is 0-hit. A5 introduces a whole-
/// phrase candidate (`"해시 충돌"`) OR'd with the token AND, restoring
/// hits for the dominant Korean usage pattern.
#[test]
fn lexical_multi_token_korean_query_hits() {
let env = TestEnv::lexical_only();
// Copy the synthetic Korean fixture (introduced in A4 Step 0) into
// the test workspace. The fixture contains the exact phrase
// "해시 충돌" multiple times.
let dest = env.workspace_root.join("hash-table.md");
let src = std::path::PathBuf::from(env!("CARGO_MANIFEST_DIR"))
.join("..")
.join("..")
.join("fixtures")
.join("search")
.join("korean")
.join("hash-table.md");
std::fs::copy(&src, &dest).expect("copy korean fixture");
kebab_app::ingest_with_config(env.config.clone(), env.scope(), true)
.expect("ingest must succeed");
let hits =
kebab_app::search_with_config(env.config.clone(), common::lexical_query("해시 충돌"))
.expect("search must succeed");
assert!(
!hits.is_empty(),
"multi-token Korean query '해시 충돌' must hit the hash-table fixture; got {:?}",
hits.iter().map(|h| &h.doc_path.0).collect::<Vec<_>>()
);
let any_hash_table = hits.iter().any(|h| h.doc_path.0.contains("hash-table"));
assert!(
any_hash_table,
"expected at least one hit on the hash-table fixture, got: {:?}",
hits.iter().map(|h| &h.doc_path.0).collect::<Vec<_>>()
);
}
/// A4 Step 1c — mixed Korean+English multi-token query (`Rust 충돌은`).
/// Both tokens are ≥3 chars, so the redesigned builder (A5) emits
/// `("Rust 충돌은") OR ("Rust" AND "충돌은")`. With trigram tokenizer
/// each side has substring coverage in the document, so the AND branch
/// alone is enough. Expected: FAIL pre-A5, PASS post-A5.
#[test]
fn lexical_mixed_korean_english_multi_token_query_hits() {
let env = TestEnv::lexical_only();
let doc_path = env.workspace_root.join("rust-hash.md");
std::fs::write(
&doc_path,
"# Rust 해시 테이블\n\nRust 의 std::collections::HashMap 에서 \
해시 충돌은 SipHash 로 완화한다.\n",
)
.expect("write rust-hash fixture");
kebab_app::ingest_with_config(env.config.clone(), env.scope(), true)
.expect("ingest must succeed");
let hits =
kebab_app::search_with_config(env.config.clone(), common::lexical_query("Rust 충돌은"))
.expect("search must succeed");
assert!(
!hits.is_empty(),
"mixed Korean+English multi-token query 'Rust 충돌은' must hit the rust-hash fixture; got {:?}",
hits.iter().map(|h| &h.doc_path.0).collect::<Vec<_>>()
);
let any_rust_hash = hits.iter().any(|h| h.doc_path.0.contains("rust-hash"));
assert!(
any_rust_hash,
"expected at least one hit on the rust-hash fixture, got: {:?}",
hits.iter().map(|h| &h.doc_path.0).collect::<Vec<_>>()
);
}
// ── S7 V009 morphological tokenizer end-to-end tests ─────────────────
/// S7 — V009 morphological tokenizer: 한국어 2자 query 가 end-to-end
/// lexical 경로에서 hit. lindera ko-dic 이 '한국어를' → '한국어' 형태소로
/// 분해, '서울은' → '서울' 로 분해하여 tokenized_korean_text column 에
/// 기록 → FTS5 매칭.
#[test]
fn korean_morphological_2char_query_lexical_mode() {
let env = TestEnv::lexical_only();
let doc_path = env.workspace_root.join("korean-wiki.md");
std::fs::write(
&doc_path,
"# 한국어 위키\n\n한국어를 공부합니다.\n서울은 한국의 수도입니다.\n",
)
.expect("write korean-wiki fixture");
kebab_app::ingest_with_config(env.config.clone(), env.scope(), true)
.expect("ingest must succeed");
let hits = kebab_app::search_with_config(env.config.clone(), common::lexical_query("한국"))
.expect("search 한국");
assert!(
!hits.is_empty(),
"'한국' 2-char Korean query must return at least one hit (V009 morphological); got {:?}",
hits.iter().map(|h| &h.doc_path.0).collect::<Vec<_>>()
);
let hits = kebab_app::search_with_config(env.config.clone(), common::lexical_query("서울"))
.expect("search 서울");
assert!(
!hits.is_empty(),
"'서울' 2-char Korean query must return at least one hit; got {:?}",
hits.iter().map(|h| &h.doc_path.0).collect::<Vec<_>>()
);
}
/// S7 — V009 morphological tokenizer: 한-영 혼합 query lexical hit.
/// 'Rust' (English whole-token) + '최적화' (Korean morpheme) 각각 hit.
#[test]
fn korean_morphological_mixed_english_korean_query() {
let env = TestEnv::lexical_only();
let doc_path = env.workspace_root.join("rust-optimization.md");
std::fs::write(
&doc_path,
"# Rust 최적화 노트\n\nRust 최적화는 zero-cost abstraction 을 강조한다.\n",
)
.expect("write rust-optimization fixture");
kebab_app::ingest_with_config(env.config.clone(), env.scope(), true)
.expect("ingest must succeed");
let hits = kebab_app::search_with_config(env.config.clone(), common::lexical_query("Rust"))
.expect("search Rust");
assert!(
!hits.is_empty(),
"'Rust' English whole-token must hit; got {:?}",
hits.iter().map(|h| &h.doc_path.0).collect::<Vec<_>>()
);
let hits = kebab_app::search_with_config(env.config.clone(), common::lexical_query("최적화"))
.expect("search 최적화");
assert!(
!hits.is_empty(),
"'최적화' Korean morpheme must hit; got {:?}",
hits.iter().map(|h| &h.doc_path.0).collect::<Vec<_>>()
);
}

View File

@@ -35,8 +35,8 @@ fn lexical_search_returns_hits_after_ingest() {
fn lexical_search_empty_query_returns_empty() {
let env = TestEnv::lexical_only();
kebab_app::ingest_with_config(env.config.clone(), env.scope(), true).unwrap();
let hits = kebab_app::search_with_config(env.config.clone(), common::lexical_query(" "))
.unwrap();
let hits =
kebab_app::search_with_config(env.config.clone(), common::lexical_query(" ")).unwrap();
assert!(hits.is_empty(), "blank query must short-circuit empty");
}
@@ -107,20 +107,25 @@ fn search_uncached_returns_same_hits_as_cached() {
#[test]
fn first_ingest_bumps_corpus_revision() {
let env = TestEnv::lexical_only();
let store_before =
kebab_store_sqlite::SqliteStore::open(&env.config).unwrap();
let store_before = kebab_store_sqlite::SqliteStore::open(&env.config).unwrap();
store_before.run_migrations().unwrap();
assert_eq!(store_before.corpus_revision(), 0, "fresh store seeds 0");
// V004 seeds 0; V009 + V010 + V011 migrations each bump by 1 to
// invalidate stale LRU caches (spec §5.2). Baseline before ingest = 3.
// (V012 derivation_cache + V013 drop-chunk-aliases are structural/additive
// — neither bumps corpus_revision.)
let baseline = store_before.corpus_revision();
assert_eq!(baseline, 3, "fresh store post-V011 baseline = 3");
let report =
kebab_app::ingest_with_config(env.config.clone(), env.scope(), true).unwrap();
assert!(report.new + report.updated > 0, "first ingest must commit ≥1 doc");
let store_after =
kebab_store_sqlite::SqliteStore::open(&env.config).unwrap();
let report = kebab_app::ingest_with_config(env.config.clone(), env.scope(), true).unwrap();
assert!(
store_after.corpus_revision() >= 1,
"ingest commit must bump corpus_revision (got {})",
report.new + report.updated > 0,
"first ingest must commit ≥1 doc"
);
let store_after = kebab_store_sqlite::SqliteStore::open(&env.config).unwrap();
assert!(
store_after.corpus_revision() > baseline,
"ingest commit must bump corpus_revision past baseline {baseline} (got {})",
store_after.corpus_revision(),
);
}

View File

@@ -29,7 +29,9 @@ fn fresh_doc_is_not_stale_with_default_threshold() {
assert!(
hits.iter().all(|h| !h.stale),
"freshly-ingested doc must not be stale at default 30d threshold: {:?}",
hits.iter().map(|h| (h.doc_path.0.clone(), h.stale)).collect::<Vec<_>>()
hits.iter()
.map(|h| (h.doc_path.0.clone(), h.stale))
.collect::<Vec<_>>()
);
}
@@ -50,7 +52,9 @@ fn threshold_zero_disables_staleness() {
assert!(
hits.iter().all(|h| !h.stale),
"threshold=0 disables staleness even for year-old docs: {:?}",
hits.iter().map(|h| (h.doc_path.0.clone(), h.stale)).collect::<Vec<_>>()
hits.iter()
.map(|h| (h.doc_path.0.clone(), h.stale))
.collect::<Vec<_>>()
);
}

View File

@@ -14,12 +14,11 @@ use common::TestEnv;
fn require_avx_or_panic() {
#[cfg(target_arch = "x86_64")]
{
if !std::is_x86_feature_detected!("avx") {
panic!(
"kb-app vector integration test requires AVX-capable hardware; \
host CPU lacks AVX. Run on an AVX-capable machine."
);
}
assert!(
std::is_x86_feature_detected!("avx"),
"kb-app vector integration test requires AVX-capable hardware; \
host CPU lacks AVX. Run on an AVX-capable machine."
);
}
}
@@ -30,8 +29,7 @@ fn ingest_then_hybrid_search_returns_hits() {
require_avx_or_panic();
let env = TestEnv::with_embeddings();
let report =
kebab_app::ingest_with_config(env.config.clone(), env.scope(), true).unwrap();
let report = kebab_app::ingest_with_config(env.config.clone(), env.scope(), true).unwrap();
assert_eq!(report.errors, 0, "no per-file errors: {report:?}");
assert_eq!(report.new, 3);
@@ -57,8 +55,7 @@ fn ingest_then_vector_search_carries_embedding_model() {
require_avx_or_panic();
let env = TestEnv::with_embeddings();
let report =
kebab_app::ingest_with_config(env.config.clone(), env.scope(), true).unwrap();
let report = kebab_app::ingest_with_config(env.config.clone(), env.scope(), true).unwrap();
assert_eq!(report.errors, 0, "no per-file errors: {report:?}");
assert_eq!(report.new, 3);

View File

@@ -13,11 +13,7 @@ fn unsupported_extension_skip_carries_warning_and_is_aggregated() {
std::fs::write(workspace_root.join("legacy.docx"), b"unsupported").unwrap();
std::fs::write(workspace_root.join("Makefile"), b"unsupported").unwrap();
let report = kebab_app::ingest_with_config(
env.config.clone(),
env.scope(),
false,
).unwrap();
let report = kebab_app::ingest_with_config(env.config.clone(), env.scope(), false).unwrap();
let items = report.items.as_ref().expect("items array populated");
let docx_item = items
@@ -39,5 +35,8 @@ fn unsupported_extension_skip_carries_warning_and_is_aggregated() {
vec!["unsupported media type: <no-ext>".to_string()],
);
assert_eq!(report.skipped_by_extension.get("docx").copied(), Some(1));
assert_eq!(report.skipped_by_extension.get("<no-ext>").copied(), Some(1));
assert_eq!(
report.skipped_by_extension.get("<no-ext>").copied(),
Some(1)
);
}

View File

@@ -0,0 +1,178 @@
//! Regression test for the twin-file fetch_span media-type lookup bug.
//!
//! Twin files (identical content at different workspace paths) share one
//! `assets` row whose PRIMARY KEY is the blake3 content hash. The old
//! `fetch_span` implementation called
//! `get_asset_by_workspace_path(&doc.workspace_path)` to check whether the
//! media type was PDF/audio (and therefore reject span fetch). For a twin
//! file that lookup could silently return the *other* twin's asset row if
//! `assets.workspace_path` had been overwritten on the most recent ingest of
//! the sibling — making the media-type branch decision incorrect.
//!
//! Fix: `fetch_span` now uses the 2-step lookup
//! `get_document_by_workspace_path` → `doc.source_asset_id` → `get_asset`
//! so the result is always anchored to the requesting document, not
//! whichever twin last updated `assets.workspace_path`.
//!
//! This test builds a twin-file scenario (two .md files at different paths
//! with identical content), ingests both, then calls `fetch_span` on each
//! twin's `doc_id` and asserts it succeeds. Before the fix, if the asset
//! row's workspace_path happened to point at the wrong twin the span could
//! return an incorrect `span_not_supported` for a non-PDF/audio file, or
//! conversely allow span on a PDF twin by accident. After the fix, the
//! lookup is always doc-specific.
mod common;
use common::TestEnv;
use kebab_app::ingest_with_config;
use kebab_core::{DocumentStore, FetchKind, FetchOpts, FetchQuery, IngestItemKind};
#[test]
fn twin_files_fetch_span_uses_correct_asset() {
let env = TestEnv::lexical_only();
// Write two markdown files with identical content at different paths.
let dir_a = env.workspace_root.join("src_a");
let dir_b = env.workspace_root.join("src_b");
std::fs::create_dir_all(&dir_a).unwrap();
std::fs::create_dir_all(&dir_b).unwrap();
// The content must produce at least 1 line so span fetch is non-trivial.
let content = "# Twin\n\nLine one.\n\nLine two.\n\nLine three.\n";
std::fs::write(dir_a.join("note.md"), content).unwrap();
std::fs::write(dir_b.join("note.md"), content).unwrap();
// Ingest all files (fixture workspace + our two new twins).
let report =
ingest_with_config(env.config.clone(), env.scope(), false).expect("ingest must succeed");
assert_eq!(report.errors, 0, "no ingest errors; report={report:?}");
// Both twin paths must appear as New in the report.
let items = report.items.as_ref().expect("items must be present");
let twin_items: Vec<_> = items
.iter()
.filter(|i| {
i.doc_path.0.ends_with("src_a/note.md") || i.doc_path.0.ends_with("src_b/note.md")
})
.collect();
assert_eq!(
twin_items.len(),
2,
"exactly 2 twin items expected; items={items:?}"
);
for item in &twin_items {
assert_eq!(
item.kind,
IngestItemKind::New,
"each twin must be New; item={item:?}"
);
}
// Resolve doc_ids for both workspace paths.
// The ingest layer normalises workspace_path to the path relative to
// workspace_root (e.g. "src_a/note.md"), so we look up by that form.
let store = kebab_store_sqlite::SqliteStore::open(&env.config).unwrap();
store.run_migrations().unwrap();
// Find the twin items by matching on suffix so the test is robust to
// however the workspace root is represented.
let items = report.items.as_ref().expect("items must be present");
let path_a_str = items
.iter()
.find(|i| i.doc_path.0.ends_with("src_a/note.md"))
.map(|i| i.doc_path.0.clone())
.expect("src_a/note.md must appear in ingest report");
let path_b_str = items
.iter()
.find(|i| i.doc_path.0.ends_with("src_b/note.md"))
.map(|i| i.doc_path.0.clone())
.expect("src_b/note.md must appear in ingest report");
let path_a = kebab_core::WorkspacePath(path_a_str);
let path_b = kebab_core::WorkspacePath(path_b_str);
let doc_a = store
.get_document_by_workspace_path(&path_a)
.expect("get_document_by_workspace_path path_a")
.expect("doc_a must exist after ingest");
let doc_b = store
.get_document_by_workspace_path(&path_b)
.expect("get_document_by_workspace_path path_b")
.expect("doc_b must exist after ingest");
// Both twins share one asset_id (same content hash).
assert_eq!(
doc_a.source_asset_id, doc_b.source_asset_id,
"twin files must share one asset_id"
);
// Open App and issue span fetch on each twin's doc_id.
let app = env.app();
let result_a = app
.fetch(
FetchQuery::Span {
doc_id: doc_a.doc_id.clone(),
line_start: 1,
line_end: 2,
},
FetchOpts::default(),
)
.expect("fetch_span on twin A must succeed for a markdown file");
assert_eq!(result_a.kind, FetchKind::Span);
assert!(
result_a.text.as_deref().is_some_and(|t| !t.is_empty()),
"span text for twin A must not be empty"
);
let result_b = app
.fetch(
FetchQuery::Span {
doc_id: doc_b.doc_id.clone(),
line_start: 1,
line_end: 2,
},
FetchOpts::default(),
)
.expect("fetch_span on twin B must succeed for a markdown file");
assert_eq!(result_b.kind, FetchKind::Span);
assert!(
result_b.text.as_deref().is_some_and(|t| !t.is_empty()),
"span text for twin B must not be empty"
);
// Ingest again to force the asset.workspace_path flip-flop, then
// re-check. Pre-fix this was the scenario that triggered the bug:
// after the second ingest the asset row's workspace_path could point
// at either twin, making one twin's span fetch behave incorrectly.
let report2 = ingest_with_config(env.config.clone(), env.scope(), false)
.expect("second ingest must succeed");
assert_eq!(
report2.errors, 0,
"no ingest errors on second run; report={report2:?}"
);
// Re-open app after second ingest and verify span still works on both.
let app2 = env.app();
app2.fetch(
FetchQuery::Span {
doc_id: doc_a.doc_id.clone(),
line_start: 1,
line_end: 3,
},
FetchOpts::default(),
)
.expect("fetch_span on twin A after flip-flop must still succeed");
app2.fetch(
FetchQuery::Span {
doc_id: doc_b.doc_id.clone(),
line_start: 1,
line_end: 3,
},
FetchOpts::default(),
)
.expect("fetch_span on twin B after flip-flop must still succeed");
}

View File

@@ -43,9 +43,7 @@ fn twin_files_second_ingest_is_unchanged() {
let items = first.items.as_ref().expect("items must be present");
let twin_items: Vec<_> = items
.iter()
.filter(|i| {
i.doc_path.0.ends_with("__init__.py")
})
.filter(|i| i.doc_path.0.ends_with("__init__.py"))
.collect();
assert_eq!(
twin_items.len(),
@@ -63,8 +61,14 @@ fn twin_files_second_ingest_is_unchanged() {
// Second ingest — same files, same content → both must be Unchanged.
let second = ingest_with_config(env.config.clone(), env.scope(), false)
.expect("second ingest must succeed");
assert_eq!(second.errors, 0, "second ingest: no errors; report={second:?}");
assert_eq!(second.new, 0, "second ingest: no new docs; report={second:?}");
assert_eq!(
second.errors, 0,
"second ingest: no errors; report={second:?}"
);
assert_eq!(
second.new, 0,
"second ingest: no new docs; report={second:?}"
);
assert_eq!(
second.updated, 0,
"second ingest: no updated docs (twin-file bug would set this to 2); report={second:?}"

View File

@@ -13,14 +13,21 @@ serde_json_canonicalizer = "0.3"
blake3 = { workspace = true }
anyhow = { workspace = true }
tracing = { workspace = true }
serde_yaml = { workspace = true }
lindera = { workspace = true, features = ["embed-ko-dic"] }
lindera-ko-dic = { workspace = true, features = ["embed-ko-dic"] }
[dev-dependencies]
# kb-parse-md / kb-normalize are dev-only — used by the snapshot integration
# test to build a CanonicalDocument from a fixture Markdown file. Forbidden as
# regular deps per design §8 (chunker consumes CanonicalDocument from kb-core
# only); `cargo tree -p kb-chunk --depth 1` (default scope, excludes dev-deps)
# kb-parse-md / kb-parse-code are dev-only — used by the snapshot integration
# tests to build a CanonicalDocument from fixture files. kb-parse-md absorbed
# kb-normalize in v0.19.0 (HOTFIXES.md 2026-05-26). Forbidden as regular deps
# per design §8 (chunker consumes CanonicalDocument from kb-core only);
# `cargo tree -p kb-chunk --depth 1` (default scope, excludes dev-deps)
# confirms this.
kebab-parse-md = { path = "../kebab-parse-md" }
kebab-normalize = { path = "../kebab-normalize" }
serde_json = { workspace = true }
time = { workspace = true }
kebab-parse-md = { path = "../kebab-parse-md" }
kebab-parse-code = { path = "../kebab-parse-code" }
serde_json = { workspace = true }
time = { workspace = true }
[lints]
workspace = true

View File

@@ -0,0 +1,375 @@
//! `code-c-ast-v1` — maps a tree-sitter-derived C AST
//! `CanonicalDocument` (one `Block::Code` per semantic unit, each with
//! `SourceSpan::Code`) to chunks 1:1. A unit longer than
//! `AST_CHUNK_MAX_LINES` is split into `<symbol> [part i/N]` sub-chunks
//! at blank-line paragraph boundaries (design §9.1 oversize fallback).
//!
//! tree-sitter is intentionally NOT a dependency here: AST work is
//! parser-side (`kebab-parse-code`, design §6.3). This chunker only
//! consumes the `CanonicalDocument`.
//!
//! `AST_CHUNK_MAX_LINES` is a constant matching
//! `IngestCodeCfg::default().ast_chunk_max_lines` (200). Per-medium
//! config threading needs a chunker registry (P+); same deviation
//! pattern as `pdf-page-v1`'s pinned `chunker_version`
//! (`tasks/HOTFIXES.md`).
use kebab_core::{
Block, BlockId, CanonicalDocument, Chunk, ChunkPolicy, Chunker, ChunkerVersion, DocumentId,
SourceSpan, id_for_chunk,
};
const VERSION_LABEL: &str = "code-c-ast-v1";
const BYTES_PER_TOKEN: usize = 3;
const POLICY_HASH_HEX_LEN: usize = 16;
const AST_CHUNK_MAX_LINES: u32 = 200;
#[derive(Clone, Copy, Debug, Default)]
pub struct CodeCAstV1Chunker;
impl Chunker for CodeCAstV1Chunker {
fn chunker_version(&self) -> ChunkerVersion {
ChunkerVersion(VERSION_LABEL.to_string())
}
fn policy_hash(&self, policy: &ChunkPolicy) -> String {
let bytes = serde_json_canonicalizer::to_vec(policy)
.expect("canonical JSON serialization of ChunkPolicy must not fail");
let hex = blake3::hash(&bytes).to_hex().to_string();
hex[..POLICY_HASH_HEX_LEN].to_string()
}
fn chunk(&self, doc: &CanonicalDocument, policy: &ChunkPolicy) -> anyhow::Result<Vec<Chunk>> {
for b in &doc.blocks {
let c = match b {
Block::Code(c) => c,
_ => anyhow::bail!("CodeCAstV1Chunker only handles code docs (got non-Code block)"),
};
if !matches!(c.common.source_span, SourceSpan::Code { .. }) {
anyhow::bail!(
"CodeCAstV1Chunker only handles code docs (got non-Code source_span)"
);
}
}
let base_policy_hash = self.policy_hash(policy);
let chunker_version = self.chunker_version();
let mut out: Vec<Chunk> = Vec::new();
for b in &doc.blocks {
let cb = match b {
Block::Code(c) => c,
_ => unreachable!("validated above"),
};
let (ls, le, symbol, lang) = match &cb.common.source_span {
SourceSpan::Code {
line_start,
line_end,
symbol,
lang,
} => (*line_start, *line_end, symbol.clone(), lang.clone()),
_ => unreachable!("validated above"),
};
let block_ids: Vec<BlockId> = vec![cb.common.block_id.clone()];
let span_lines = le.saturating_sub(ls) + 1;
if span_lines <= AST_CHUNK_MAX_LINES {
let span = SourceSpan::Code {
line_start: ls,
line_end: le,
symbol: symbol.clone(),
lang: lang.clone(),
};
out.push(make_chunk(
doc,
&chunker_version,
&block_ids,
&base_policy_hash,
None,
span,
cb.code.clone(),
));
} else {
let parts = split_oversize(&cb.code);
let n = parts.len();
for (i, (off_start, off_end, text)) in parts.into_iter().enumerate() {
let part_ls = ls + off_start;
let part_le = ls + off_end;
let part_sym = symbol.as_ref().map(|s| format!("{s} [part {}/{n}]", i + 1));
let span = SourceSpan::Code {
line_start: part_ls,
line_end: part_le,
symbol: part_sym,
lang: lang.clone(),
};
out.push(make_chunk(
doc,
&chunker_version,
&block_ids,
&base_policy_hash,
Some(part_ls),
span,
text,
));
}
}
}
tracing::debug!(
target: "kebab-chunk",
doc_id = %doc.doc_id,
chunks = out.len(),
"code-c-ast-v1 chunked",
);
Ok(out)
}
}
#[allow(clippy::too_many_arguments)]
fn make_chunk(
doc: &CanonicalDocument,
chunker_version: &ChunkerVersion,
block_ids: &[BlockId],
base_policy_hash: &str,
split_key: Option<u32>,
span: SourceSpan,
text: String,
) -> Chunk {
let id_hash = match split_key {
Some(k) => format!("{base_policy_hash}#L{k}"),
None => base_policy_hash.to_string(),
};
let chunk_id = id_for_chunk(&doc.doc_id, chunker_version, block_ids, &id_hash);
let token_estimate = text.len().div_ceil(BYTES_PER_TOKEN);
Chunk {
chunk_id,
doc_id: DocumentId(doc.doc_id.0.clone()),
block_ids: block_ids.to_vec(),
tokenized_korean_text: crate::tokenize_korean_morphological(&text),
text,
heading_path: Vec::new(),
source_spans: vec![span],
token_estimate,
chunker_version: chunker_version.clone(),
policy_hash: base_policy_hash.to_string(),
}
}
/// Split an oversize unit at blank-line paragraph boundaries, greedily
/// gluing paragraphs until ~`AST_CHUNK_MAX_LINES` lines accumulate.
/// Returns `(line_offset_start, line_offset_end, text)` where offsets are
/// 0-based within the unit (caller adds the unit's absolute `line_start`).
fn split_oversize(code: &str) -> Vec<(u32, u32, String)> {
let lines: Vec<&str> = code.split('\n').collect();
let total = lines.len() as u32;
let mut out: Vec<(u32, u32, String)> = Vec::new();
let mut start: u32 = 0;
while start < total {
let mut end = (start + AST_CHUNK_MAX_LINES).min(total);
let floor = start + (AST_CHUNK_MAX_LINES * 4 / 5);
if end < total {
if let Some(b) = (floor.min(end)..end)
.rev()
.find(|&i| lines[i as usize].trim().is_empty())
{
end = b + 1;
}
}
let text = lines[start as usize..end as usize].join("\n");
out.push((start, end.saturating_sub(1), text));
start = end;
}
if out.is_empty() {
out.push((0, total.saturating_sub(1), code.to_string()));
}
out
}
#[cfg(test)]
mod tests {
use super::*;
use kebab_core::{
AssetId, Block, CanonicalDocument, ChunkPolicy, Chunker, ChunkerVersion, CodeBlock,
CommonBlock, Lang, Metadata, ParserVersion, Provenance, SourceSpan, SourceType, TrustLevel,
WorkspacePath, id_for_block, id_for_doc,
};
use time::OffsetDateTime;
fn code_doc(units: &[(&str, u32, u32, &str)]) -> CanonicalDocument {
let wp = WorkspacePath("crates/x/src/a.c".into());
let aid = AssetId("a".repeat(64));
let pv = ParserVersion("code-c-v1".into());
let doc_id = id_for_doc(&wp, &aid, &pv);
let blocks = units
.iter()
.enumerate()
.map(|(i, (sym, ls, le, code))| {
let span = SourceSpan::Code {
line_start: *ls,
line_end: *le,
symbol: Some((*sym).to_string()),
lang: Some("c".into()),
};
let bid = id_for_block(&doc_id, "code", &[], i as u32, &span);
Block::Code(CodeBlock {
common: CommonBlock {
block_id: bid,
heading_path: vec![],
source_span: span,
},
lang: Some("c".into()),
code: (*code).to_string(),
})
})
.collect();
CanonicalDocument {
doc_id,
source_asset_id: aid,
workspace_path: wp,
title: "a".into(),
lang: Lang("und".into()),
blocks,
metadata: Metadata {
aliases: vec![],
tags: vec![],
created_at: OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(),
updated_at: OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(),
source_type: SourceType::Note,
trust_level: TrustLevel::Primary,
user_id_alias: None,
user: Default::default(),
repo: Some("kebab".into()),
git_branch: Some("main".into()),
git_commit: Some("0".repeat(40)),
code_lang: Some("c".into()),
},
provenance: Provenance { events: vec![] },
parser_version: pv,
schema_version: 1,
doc_version: 1,
last_chunker_version: None,
last_embedding_version: None,
}
}
fn policy() -> ChunkPolicy {
ChunkPolicy {
target_tokens: 500,
overlap_tokens: 80,
respect_markdown_headings: false,
chunker_version: ChunkerVersion(VERSION_LABEL.into()),
}
}
#[test]
fn chunker_version_is_code_c_ast_v1() {
assert_eq!(
CodeCAstV1Chunker.chunker_version(),
ChunkerVersion("code-c-ast-v1".into())
);
}
#[test]
fn one_chunk_per_unit_preserves_code_span() {
let doc = code_doc(&[
("parse", 1, 3, "int parse() {\n\t// x\n}"),
("print", 5, 7, "void print() {\n\t//\n\treturn;\n}"),
]);
let chunks = CodeCAstV1Chunker.chunk(&doc, &policy()).unwrap();
assert_eq!(chunks.len(), 2);
for c in &chunks {
assert_eq!(c.source_spans.len(), 1);
assert!(matches!(c.source_spans[0], SourceSpan::Code { .. }));
assert_eq!(c.heading_path, Vec::<String>::new());
assert_eq!(c.chunker_version.0, "code-c-ast-v1");
}
match &chunks[0].source_spans[0] {
SourceSpan::Code {
symbol,
line_start,
line_end,
..
} => {
assert_eq!(symbol.as_deref(), Some("parse"));
assert_eq!((*line_start, *line_end), (1, 3));
}
_ => unreachable!(),
}
}
#[test]
fn oversize_unit_splits_into_parts_with_unique_ids() {
let body = (0..500)
.map(|i| format!("\tx{i} = {i};\n"))
.collect::<String>();
let code = format!("int big() {{\n{body}\n}}");
let doc = code_doc(&[("big", 1, 502, &code)]);
let chunks = CodeCAstV1Chunker.chunk(&doc, &policy()).unwrap();
assert!(
chunks.len() >= 2,
"oversize unit must split, got {}",
chunks.len()
);
for c in &chunks {
match &c.source_spans[0] {
SourceSpan::Code { symbol, .. } => {
assert!(
symbol.as_deref().unwrap().starts_with("big [part "),
"part-numbered symbol, got {symbol:?}"
);
}
_ => unreachable!(),
}
}
let mut ids: Vec<&str> = chunks.iter().map(|c| c.chunk_id.0.as_str()).collect();
let n = ids.len();
ids.sort_unstable();
ids.dedup();
assert_eq!(ids.len(), n, "chunk_ids unique across split parts");
}
#[test]
fn non_code_doc_errors() {
use kebab_core::TextBlock;
let mut doc = code_doc(&[("parse", 1, 1, "int parse() {}")]);
doc.blocks = vec![Block::Paragraph(TextBlock {
common: CommonBlock {
block_id: kebab_core::BlockId("b".into()),
heading_path: vec![],
source_span: SourceSpan::Line { start: 1, end: 1 },
},
text: "x".into(),
inlines: vec![],
})];
let err = CodeCAstV1Chunker.chunk(&doc, &policy()).unwrap_err();
assert!(err.to_string().contains("CodeCAstV1Chunker"));
}
#[test]
fn deterministic_chunk_ids_1000() {
let doc = code_doc(&[("parse", 1, 2, "int parse() {}\n")]);
let base: Vec<String> = CodeCAstV1Chunker
.chunk(&doc, &policy())
.unwrap()
.into_iter()
.map(|c| c.chunk_id.0)
.collect();
for _ in 0..1000 {
let again: Vec<String> = CodeCAstV1Chunker
.chunk(&doc, &policy())
.unwrap()
.into_iter()
.map(|c| c.chunk_id.0)
.collect();
assert_eq!(again, base);
}
}
#[test]
fn policy_hash_matches_md_heading_v1() {
let p = policy();
assert_eq!(
CodeCAstV1Chunker.policy_hash(&p),
crate::MdHeadingV1Chunker.policy_hash(&p)
);
}
}

View File

@@ -0,0 +1,377 @@
//! `code-cpp-ast-v1` — maps a tree-sitter-derived C++ AST
//! `CanonicalDocument` (one `Block::Code` per semantic unit, each with
//! `SourceSpan::Code`) to chunks 1:1. A unit longer than
//! `AST_CHUNK_MAX_LINES` is split into `<symbol> [part i/N]` sub-chunks
//! at blank-line paragraph boundaries (design §9.1 oversize fallback).
//!
//! tree-sitter is intentionally NOT a dependency here: AST work is
//! parser-side (`kebab-parse-code`, design §6.3). This chunker only
//! consumes the `CanonicalDocument`.
//!
//! `AST_CHUNK_MAX_LINES` is a constant matching
//! `IngestCodeCfg::default().ast_chunk_max_lines` (200). Per-medium
//! config threading needs a chunker registry (P+); same deviation
//! pattern as `pdf-page-v1`'s pinned `chunker_version`
//! (`tasks/HOTFIXES.md`).
use kebab_core::{
Block, BlockId, CanonicalDocument, Chunk, ChunkPolicy, Chunker, ChunkerVersion, DocumentId,
SourceSpan, id_for_chunk,
};
const VERSION_LABEL: &str = "code-cpp-ast-v1";
const BYTES_PER_TOKEN: usize = 3;
const POLICY_HASH_HEX_LEN: usize = 16;
const AST_CHUNK_MAX_LINES: u32 = 200;
#[derive(Clone, Copy, Debug, Default)]
pub struct CodeCppAstV1Chunker;
impl Chunker for CodeCppAstV1Chunker {
fn chunker_version(&self) -> ChunkerVersion {
ChunkerVersion(VERSION_LABEL.to_string())
}
fn policy_hash(&self, policy: &ChunkPolicy) -> String {
let bytes = serde_json_canonicalizer::to_vec(policy)
.expect("canonical JSON serialization of ChunkPolicy must not fail");
let hex = blake3::hash(&bytes).to_hex().to_string();
hex[..POLICY_HASH_HEX_LEN].to_string()
}
fn chunk(&self, doc: &CanonicalDocument, policy: &ChunkPolicy) -> anyhow::Result<Vec<Chunk>> {
for b in &doc.blocks {
let c = match b {
Block::Code(c) => c,
_ => {
anyhow::bail!("CodeCppAstV1Chunker only handles code docs (got non-Code block)")
}
};
if !matches!(c.common.source_span, SourceSpan::Code { .. }) {
anyhow::bail!(
"CodeCppAstV1Chunker only handles code docs (got non-Code source_span)"
);
}
}
let base_policy_hash = self.policy_hash(policy);
let chunker_version = self.chunker_version();
let mut out: Vec<Chunk> = Vec::new();
for b in &doc.blocks {
let cb = match b {
Block::Code(c) => c,
_ => unreachable!("validated above"),
};
let (ls, le, symbol, lang) = match &cb.common.source_span {
SourceSpan::Code {
line_start,
line_end,
symbol,
lang,
} => (*line_start, *line_end, symbol.clone(), lang.clone()),
_ => unreachable!("validated above"),
};
let block_ids: Vec<BlockId> = vec![cb.common.block_id.clone()];
let span_lines = le.saturating_sub(ls) + 1;
if span_lines <= AST_CHUNK_MAX_LINES {
let span = SourceSpan::Code {
line_start: ls,
line_end: le,
symbol: symbol.clone(),
lang: lang.clone(),
};
out.push(make_chunk(
doc,
&chunker_version,
&block_ids,
&base_policy_hash,
None,
span,
cb.code.clone(),
));
} else {
let parts = split_oversize(&cb.code);
let n = parts.len();
for (i, (off_start, off_end, text)) in parts.into_iter().enumerate() {
let part_ls = ls + off_start;
let part_le = ls + off_end;
let part_sym = symbol.as_ref().map(|s| format!("{s} [part {}/{n}]", i + 1));
let span = SourceSpan::Code {
line_start: part_ls,
line_end: part_le,
symbol: part_sym,
lang: lang.clone(),
};
out.push(make_chunk(
doc,
&chunker_version,
&block_ids,
&base_policy_hash,
Some(part_ls),
span,
text,
));
}
}
}
tracing::debug!(
target: "kebab-chunk",
doc_id = %doc.doc_id,
chunks = out.len(),
"code-cpp-ast-v1 chunked",
);
Ok(out)
}
}
#[allow(clippy::too_many_arguments)]
fn make_chunk(
doc: &CanonicalDocument,
chunker_version: &ChunkerVersion,
block_ids: &[BlockId],
base_policy_hash: &str,
split_key: Option<u32>,
span: SourceSpan,
text: String,
) -> Chunk {
let id_hash = match split_key {
Some(k) => format!("{base_policy_hash}#L{k}"),
None => base_policy_hash.to_string(),
};
let chunk_id = id_for_chunk(&doc.doc_id, chunker_version, block_ids, &id_hash);
let token_estimate = text.len().div_ceil(BYTES_PER_TOKEN);
Chunk {
chunk_id,
doc_id: DocumentId(doc.doc_id.0.clone()),
block_ids: block_ids.to_vec(),
tokenized_korean_text: crate::tokenize_korean_morphological(&text),
text,
heading_path: Vec::new(),
source_spans: vec![span],
token_estimate,
chunker_version: chunker_version.clone(),
policy_hash: base_policy_hash.to_string(),
}
}
/// Split an oversize unit at blank-line paragraph boundaries, greedily
/// gluing paragraphs until ~`AST_CHUNK_MAX_LINES` lines accumulate.
/// Returns `(line_offset_start, line_offset_end, text)` where offsets are
/// 0-based within the unit (caller adds the unit's absolute `line_start`).
fn split_oversize(code: &str) -> Vec<(u32, u32, String)> {
let lines: Vec<&str> = code.split('\n').collect();
let total = lines.len() as u32;
let mut out: Vec<(u32, u32, String)> = Vec::new();
let mut start: u32 = 0;
while start < total {
let mut end = (start + AST_CHUNK_MAX_LINES).min(total);
let floor = start + (AST_CHUNK_MAX_LINES * 4 / 5);
if end < total {
if let Some(b) = (floor.min(end)..end)
.rev()
.find(|&i| lines[i as usize].trim().is_empty())
{
end = b + 1;
}
}
let text = lines[start as usize..end as usize].join("\n");
out.push((start, end.saturating_sub(1), text));
start = end;
}
if out.is_empty() {
out.push((0, total.saturating_sub(1), code.to_string()));
}
out
}
#[cfg(test)]
mod tests {
use super::*;
use kebab_core::{
AssetId, Block, CanonicalDocument, ChunkPolicy, Chunker, ChunkerVersion, CodeBlock,
CommonBlock, Lang, Metadata, ParserVersion, Provenance, SourceSpan, SourceType, TrustLevel,
WorkspacePath, id_for_block, id_for_doc,
};
use time::OffsetDateTime;
fn code_doc(units: &[(&str, u32, u32, &str)]) -> CanonicalDocument {
let wp = WorkspacePath("crates/x/src/a.cpp".into());
let aid = AssetId("a".repeat(64));
let pv = ParserVersion("code-cpp-v1".into());
let doc_id = id_for_doc(&wp, &aid, &pv);
let blocks = units
.iter()
.enumerate()
.map(|(i, (sym, ls, le, code))| {
let span = SourceSpan::Code {
line_start: *ls,
line_end: *le,
symbol: Some((*sym).to_string()),
lang: Some("cpp".into()),
};
let bid = id_for_block(&doc_id, "code", &[], i as u32, &span);
Block::Code(CodeBlock {
common: CommonBlock {
block_id: bid,
heading_path: vec![],
source_span: span,
},
lang: Some("cpp".into()),
code: (*code).to_string(),
})
})
.collect();
CanonicalDocument {
doc_id,
source_asset_id: aid,
workspace_path: wp,
title: "a".into(),
lang: Lang("und".into()),
blocks,
metadata: Metadata {
aliases: vec![],
tags: vec![],
created_at: OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(),
updated_at: OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(),
source_type: SourceType::Note,
trust_level: TrustLevel::Primary,
user_id_alias: None,
user: Default::default(),
repo: Some("kebab".into()),
git_branch: Some("main".into()),
git_commit: Some("0".repeat(40)),
code_lang: Some("cpp".into()),
},
provenance: Provenance { events: vec![] },
parser_version: pv,
schema_version: 1,
doc_version: 1,
last_chunker_version: None,
last_embedding_version: None,
}
}
fn policy() -> ChunkPolicy {
ChunkPolicy {
target_tokens: 500,
overlap_tokens: 80,
respect_markdown_headings: false,
chunker_version: ChunkerVersion(VERSION_LABEL.into()),
}
}
#[test]
fn chunker_version_is_code_cpp_ast_v1() {
assert_eq!(
CodeCppAstV1Chunker.chunker_version(),
ChunkerVersion("code-cpp-ast-v1".into())
);
}
#[test]
fn one_chunk_per_unit_preserves_code_span() {
let doc = code_doc(&[
("parse", 1, 3, "int parse() {\n\t// x\n}"),
("print", 5, 7, "void print() {\n\t//\n\treturn;\n}"),
]);
let chunks = CodeCppAstV1Chunker.chunk(&doc, &policy()).unwrap();
assert_eq!(chunks.len(), 2);
for c in &chunks {
assert_eq!(c.source_spans.len(), 1);
assert!(matches!(c.source_spans[0], SourceSpan::Code { .. }));
assert_eq!(c.heading_path, Vec::<String>::new());
assert_eq!(c.chunker_version.0, "code-cpp-ast-v1");
}
match &chunks[0].source_spans[0] {
SourceSpan::Code {
symbol,
line_start,
line_end,
..
} => {
assert_eq!(symbol.as_deref(), Some("parse"));
assert_eq!((*line_start, *line_end), (1, 3));
}
_ => unreachable!(),
}
}
#[test]
fn oversize_unit_splits_into_parts_with_unique_ids() {
let body = (0..500)
.map(|i| format!("\tx{i} = {i};\n"))
.collect::<String>();
let code = format!("int big() {{\n{body}\n}}");
let doc = code_doc(&[("big", 1, 502, &code)]);
let chunks = CodeCppAstV1Chunker.chunk(&doc, &policy()).unwrap();
assert!(
chunks.len() >= 2,
"oversize unit must split, got {}",
chunks.len()
);
for c in &chunks {
match &c.source_spans[0] {
SourceSpan::Code { symbol, .. } => {
assert!(
symbol.as_deref().unwrap().starts_with("big [part "),
"part-numbered symbol, got {symbol:?}"
);
}
_ => unreachable!(),
}
}
let mut ids: Vec<&str> = chunks.iter().map(|c| c.chunk_id.0.as_str()).collect();
let n = ids.len();
ids.sort_unstable();
ids.dedup();
assert_eq!(ids.len(), n, "chunk_ids unique across split parts");
}
#[test]
fn non_code_doc_errors() {
use kebab_core::TextBlock;
let mut doc = code_doc(&[("parse", 1, 1, "int parse() {}")]);
doc.blocks = vec![Block::Paragraph(TextBlock {
common: CommonBlock {
block_id: kebab_core::BlockId("b".into()),
heading_path: vec![],
source_span: SourceSpan::Line { start: 1, end: 1 },
},
text: "x".into(),
inlines: vec![],
})];
let err = CodeCppAstV1Chunker.chunk(&doc, &policy()).unwrap_err();
assert!(err.to_string().contains("CodeCppAstV1Chunker"));
}
#[test]
fn deterministic_chunk_ids_1000() {
let doc = code_doc(&[("parse", 1, 2, "int parse() {}\n")]);
let base: Vec<String> = CodeCppAstV1Chunker
.chunk(&doc, &policy())
.unwrap()
.into_iter()
.map(|c| c.chunk_id.0)
.collect();
for _ in 0..1000 {
let again: Vec<String> = CodeCppAstV1Chunker
.chunk(&doc, &policy())
.unwrap()
.into_iter()
.map(|c| c.chunk_id.0)
.collect();
assert_eq!(again, base);
}
}
#[test]
fn policy_hash_matches_md_heading_v1() {
let p = policy();
assert_eq!(
CodeCppAstV1Chunker.policy_hash(&p),
crate::MdHeadingV1Chunker.policy_hash(&p)
);
}
}

View File

@@ -0,0 +1,383 @@
//! `code-go-ast-v1` — maps a tree-sitter-derived Go AST
//! `CanonicalDocument` (one `Block::Code` per semantic unit, each with
//! `SourceSpan::Code`) to chunks 1:1. A unit longer than
//! `AST_CHUNK_MAX_LINES` is split into `<symbol> [part i/N]` sub-chunks
//! at blank-line paragraph boundaries (design §9.1 oversize fallback).
//!
//! tree-sitter is intentionally NOT a dependency here: AST work is
//! parser-side (`kebab-parse-code`, design §6.3). This chunker only
//! consumes the `CanonicalDocument`.
//!
//! `AST_CHUNK_MAX_LINES` is a constant matching
//! `IngestCodeCfg::default().ast_chunk_max_lines` (200). Per-medium
//! config threading needs a chunker registry (P+); same deviation
//! pattern as `pdf-page-v1`'s pinned `chunker_version`
//! (`tasks/HOTFIXES.md`).
use kebab_core::{
Block, BlockId, CanonicalDocument, Chunk, ChunkPolicy, Chunker, ChunkerVersion, DocumentId,
SourceSpan, id_for_chunk,
};
const VERSION_LABEL: &str = "code-go-ast-v1";
const BYTES_PER_TOKEN: usize = 3;
const POLICY_HASH_HEX_LEN: usize = 16;
const AST_CHUNK_MAX_LINES: u32 = 200;
#[derive(Clone, Copy, Debug, Default)]
pub struct CodeGoAstV1Chunker;
impl Chunker for CodeGoAstV1Chunker {
fn chunker_version(&self) -> ChunkerVersion {
ChunkerVersion(VERSION_LABEL.to_string())
}
fn policy_hash(&self, policy: &ChunkPolicy) -> String {
let bytes = serde_json_canonicalizer::to_vec(policy)
.expect("canonical JSON serialization of ChunkPolicy must not fail");
let hex = blake3::hash(&bytes).to_hex().to_string();
hex[..POLICY_HASH_HEX_LEN].to_string()
}
fn chunk(&self, doc: &CanonicalDocument, policy: &ChunkPolicy) -> anyhow::Result<Vec<Chunk>> {
for b in &doc.blocks {
let c = match b {
Block::Code(c) => c,
_ => {
anyhow::bail!("CodeGoAstV1Chunker only handles code docs (got non-Code block)")
}
};
if !matches!(c.common.source_span, SourceSpan::Code { .. }) {
anyhow::bail!(
"CodeGoAstV1Chunker only handles code docs (got non-Code source_span)"
);
}
}
let base_policy_hash = self.policy_hash(policy);
let chunker_version = self.chunker_version();
let mut out: Vec<Chunk> = Vec::new();
for b in &doc.blocks {
let cb = match b {
Block::Code(c) => c,
_ => unreachable!("validated above"),
};
let (ls, le, symbol, lang) = match &cb.common.source_span {
SourceSpan::Code {
line_start,
line_end,
symbol,
lang,
} => (*line_start, *line_end, symbol.clone(), lang.clone()),
_ => unreachable!("validated above"),
};
let block_ids: Vec<BlockId> = vec![cb.common.block_id.clone()];
let span_lines = le.saturating_sub(ls) + 1;
if span_lines <= AST_CHUNK_MAX_LINES {
let span = SourceSpan::Code {
line_start: ls,
line_end: le,
symbol: symbol.clone(),
lang: lang.clone(),
};
out.push(make_chunk(
doc,
&chunker_version,
&block_ids,
&base_policy_hash,
None,
span,
cb.code.clone(),
));
} else {
let parts = split_oversize(&cb.code);
let n = parts.len();
for (i, (off_start, off_end, text)) in parts.into_iter().enumerate() {
let part_ls = ls + off_start;
let part_le = ls + off_end;
let part_sym = symbol.as_ref().map(|s| format!("{s} [part {}/{n}]", i + 1));
let span = SourceSpan::Code {
line_start: part_ls,
line_end: part_le,
symbol: part_sym,
lang: lang.clone(),
};
out.push(make_chunk(
doc,
&chunker_version,
&block_ids,
&base_policy_hash,
Some(part_ls),
span,
text,
));
}
}
}
tracing::debug!(
target: "kebab-chunk",
doc_id = %doc.doc_id,
chunks = out.len(),
"code-go-ast-v1 chunked",
);
Ok(out)
}
}
#[allow(clippy::too_many_arguments)]
fn make_chunk(
doc: &CanonicalDocument,
chunker_version: &ChunkerVersion,
block_ids: &[BlockId],
base_policy_hash: &str,
split_key: Option<u32>,
span: SourceSpan,
text: String,
) -> Chunk {
let id_hash = match split_key {
Some(k) => format!("{base_policy_hash}#L{k}"),
None => base_policy_hash.to_string(),
};
let chunk_id = id_for_chunk(&doc.doc_id, chunker_version, block_ids, &id_hash);
let token_estimate = text.len().div_ceil(BYTES_PER_TOKEN);
Chunk {
chunk_id,
doc_id: DocumentId(doc.doc_id.0.clone()),
block_ids: block_ids.to_vec(),
tokenized_korean_text: crate::tokenize_korean_morphological(&text),
text,
heading_path: Vec::new(),
source_spans: vec![span],
token_estimate,
chunker_version: chunker_version.clone(),
policy_hash: base_policy_hash.to_string(),
}
}
/// Split an oversize unit at blank-line paragraph boundaries, greedily
/// gluing paragraphs until ~`AST_CHUNK_MAX_LINES` lines accumulate.
/// Returns `(line_offset_start, line_offset_end, text)` where offsets are
/// 0-based within the unit (caller adds the unit's absolute `line_start`).
fn split_oversize(code: &str) -> Vec<(u32, u32, String)> {
let lines: Vec<&str> = code.split('\n').collect();
let total = lines.len() as u32;
let mut out: Vec<(u32, u32, String)> = Vec::new();
let mut start: u32 = 0;
while start < total {
let mut end = (start + AST_CHUNK_MAX_LINES).min(total);
let floor = start + (AST_CHUNK_MAX_LINES * 4 / 5);
if end < total {
if let Some(b) = (floor.min(end)..end)
.rev()
.find(|&i| lines[i as usize].trim().is_empty())
{
end = b + 1;
}
}
let text = lines[start as usize..end as usize].join("\n");
out.push((start, end.saturating_sub(1), text));
start = end;
}
if out.is_empty() {
out.push((0, total.saturating_sub(1), code.to_string()));
}
out
}
#[cfg(test)]
mod tests {
use super::*;
use kebab_core::{
AssetId, Block, CanonicalDocument, ChunkPolicy, Chunker, ChunkerVersion, CodeBlock,
CommonBlock, Lang, Metadata, ParserVersion, Provenance, SourceSpan, SourceType, TrustLevel,
WorkspacePath, id_for_block, id_for_doc,
};
use time::OffsetDateTime;
fn code_doc(units: &[(&str, u32, u32, &str)]) -> CanonicalDocument {
let wp = WorkspacePath("crates/x/src/a.go".into());
let aid = AssetId("a".repeat(64));
let pv = ParserVersion("code-go-v1".into());
let doc_id = id_for_doc(&wp, &aid, &pv);
let blocks = units
.iter()
.enumerate()
.map(|(i, (sym, ls, le, code))| {
let span = SourceSpan::Code {
line_start: *ls,
line_end: *le,
symbol: Some((*sym).to_string()),
lang: Some("go".into()),
};
let bid = id_for_block(&doc_id, "code", &[], i as u32, &span);
Block::Code(CodeBlock {
common: CommonBlock {
block_id: bid,
heading_path: vec![],
source_span: span,
},
lang: Some("go".into()),
code: (*code).to_string(),
})
})
.collect();
CanonicalDocument {
doc_id,
source_asset_id: aid,
workspace_path: wp,
title: "a".into(),
lang: Lang("und".into()),
blocks,
metadata: Metadata {
aliases: vec![],
tags: vec![],
created_at: OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(),
updated_at: OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(),
source_type: SourceType::Note,
trust_level: TrustLevel::Primary,
user_id_alias: None,
user: Default::default(),
repo: Some("kebab".into()),
git_branch: Some("main".into()),
git_commit: Some("0".repeat(40)),
code_lang: Some("go".into()),
},
provenance: Provenance { events: vec![] },
parser_version: pv,
schema_version: 1,
doc_version: 1,
last_chunker_version: None,
last_embedding_version: None,
}
}
fn policy() -> ChunkPolicy {
ChunkPolicy {
target_tokens: 500,
overlap_tokens: 80,
respect_markdown_headings: false,
chunker_version: ChunkerVersion(VERSION_LABEL.into()),
}
}
#[test]
fn chunker_version_is_code_go_ast_v1() {
assert_eq!(
CodeGoAstV1Chunker.chunker_version(),
ChunkerVersion("code-go-ast-v1".into())
);
}
#[test]
fn one_chunk_per_unit_preserves_code_span() {
let doc = code_doc(&[
("parse", 1, 3, "func parse() {\n\t// x\n}"),
(
"Foo.double",
5,
7,
"func double() int {\n\t//\n\treturn 0\n}",
),
]);
let chunks = CodeGoAstV1Chunker.chunk(&doc, &policy()).unwrap();
assert_eq!(chunks.len(), 2);
for c in &chunks {
assert_eq!(c.source_spans.len(), 1);
assert!(matches!(c.source_spans[0], SourceSpan::Code { .. }));
assert_eq!(c.heading_path, Vec::<String>::new());
assert_eq!(c.chunker_version.0, "code-go-ast-v1");
}
match &chunks[0].source_spans[0] {
SourceSpan::Code {
symbol,
line_start,
line_end,
..
} => {
assert_eq!(symbol.as_deref(), Some("parse"));
assert_eq!((*line_start, *line_end), (1, 3));
}
_ => unreachable!(),
}
}
#[test]
fn oversize_unit_splits_into_parts_with_unique_ids() {
let body = (0..500)
.map(|i| format!("\tx{i} := {i}"))
.collect::<Vec<_>>()
.join("\n");
let code = format!("func big() {{\n{body}\n}}");
let doc = code_doc(&[("big", 1, 502, &code)]);
let chunks = CodeGoAstV1Chunker.chunk(&doc, &policy()).unwrap();
assert!(
chunks.len() >= 2,
"oversize unit must split, got {}",
chunks.len()
);
for c in &chunks {
match &c.source_spans[0] {
SourceSpan::Code { symbol, .. } => {
assert!(
symbol.as_deref().unwrap().starts_with("big [part "),
"part-numbered symbol, got {symbol:?}"
);
}
_ => unreachable!(),
}
}
let mut ids: Vec<&str> = chunks.iter().map(|c| c.chunk_id.0.as_str()).collect();
let n = ids.len();
ids.sort_unstable();
ids.dedup();
assert_eq!(ids.len(), n, "chunk_ids unique across split parts");
}
#[test]
fn non_code_doc_errors() {
use kebab_core::TextBlock;
let mut doc = code_doc(&[("parse", 1, 1, "func parse() {}")]);
doc.blocks = vec![Block::Paragraph(TextBlock {
common: CommonBlock {
block_id: kebab_core::BlockId("b".into()),
heading_path: vec![],
source_span: SourceSpan::Line { start: 1, end: 1 },
},
text: "x".into(),
inlines: vec![],
})];
let err = CodeGoAstV1Chunker.chunk(&doc, &policy()).unwrap_err();
assert!(err.to_string().contains("CodeGoAstV1Chunker"));
}
#[test]
fn deterministic_chunk_ids_1000() {
let doc = code_doc(&[("parse", 1, 2, "func parse() {}\n")]);
let base: Vec<String> = CodeGoAstV1Chunker
.chunk(&doc, &policy())
.unwrap()
.into_iter()
.map(|c| c.chunk_id.0)
.collect();
for _ in 0..1000 {
let again: Vec<String> = CodeGoAstV1Chunker
.chunk(&doc, &policy())
.unwrap()
.into_iter()
.map(|c| c.chunk_id.0)
.collect();
assert_eq!(again, base);
}
}
#[test]
fn policy_hash_matches_md_heading_v1() {
let p = policy();
assert_eq!(
CodeGoAstV1Chunker.policy_hash(&p),
crate::MdHeadingV1Chunker.policy_hash(&p)
);
}
}

View File

@@ -0,0 +1,378 @@
//! `code-java-ast-v1` — maps a tree-sitter-derived Java AST
//! `CanonicalDocument` (one `Block::Code` per semantic unit, each with
//! `SourceSpan::Code`) to chunks 1:1. A unit longer than
//! `AST_CHUNK_MAX_LINES` is split into `<symbol> [part i/N]` sub-chunks
//! at blank-line paragraph boundaries (design §9.1 oversize fallback).
//!
//! tree-sitter is intentionally NOT a dependency here: AST work is
//! parser-side (`kebab-parse-code`, design §6.3). This chunker only
//! consumes the `CanonicalDocument`.
//!
//! `AST_CHUNK_MAX_LINES` is a constant matching
//! `IngestCodeCfg::default().ast_chunk_max_lines` (200). Per-medium
//! config threading needs a chunker registry (P+); same deviation
//! pattern as `pdf-page-v1`'s pinned `chunker_version`
//! (`tasks/HOTFIXES.md`).
use kebab_core::{
Block, BlockId, CanonicalDocument, Chunk, ChunkPolicy, Chunker, ChunkerVersion, DocumentId,
SourceSpan, id_for_chunk,
};
const VERSION_LABEL: &str = "code-java-ast-v1";
const BYTES_PER_TOKEN: usize = 3;
const POLICY_HASH_HEX_LEN: usize = 16;
const AST_CHUNK_MAX_LINES: u32 = 200;
#[derive(Clone, Copy, Debug, Default)]
pub struct CodeJavaAstV1Chunker;
impl Chunker for CodeJavaAstV1Chunker {
fn chunker_version(&self) -> ChunkerVersion {
ChunkerVersion(VERSION_LABEL.to_string())
}
fn policy_hash(&self, policy: &ChunkPolicy) -> String {
let bytes = serde_json_canonicalizer::to_vec(policy)
.expect("canonical JSON serialization of ChunkPolicy must not fail");
let hex = blake3::hash(&bytes).to_hex().to_string();
hex[..POLICY_HASH_HEX_LEN].to_string()
}
fn chunk(&self, doc: &CanonicalDocument, policy: &ChunkPolicy) -> anyhow::Result<Vec<Chunk>> {
for b in &doc.blocks {
let c = match b {
Block::Code(c) => c,
_ => anyhow::bail!(
"CodeJavaAstV1Chunker only handles code docs (got non-Code block)"
),
};
if !matches!(c.common.source_span, SourceSpan::Code { .. }) {
anyhow::bail!(
"CodeJavaAstV1Chunker only handles code docs (got non-Code source_span)"
);
}
}
let base_policy_hash = self.policy_hash(policy);
let chunker_version = self.chunker_version();
let mut out: Vec<Chunk> = Vec::new();
for b in &doc.blocks {
let cb = match b {
Block::Code(c) => c,
_ => unreachable!("validated above"),
};
let (ls, le, symbol, lang) = match &cb.common.source_span {
SourceSpan::Code {
line_start,
line_end,
symbol,
lang,
} => (*line_start, *line_end, symbol.clone(), lang.clone()),
_ => unreachable!("validated above"),
};
let block_ids: Vec<BlockId> = vec![cb.common.block_id.clone()];
let span_lines = le.saturating_sub(ls) + 1;
if span_lines <= AST_CHUNK_MAX_LINES {
let span = SourceSpan::Code {
line_start: ls,
line_end: le,
symbol: symbol.clone(),
lang: lang.clone(),
};
out.push(make_chunk(
doc,
&chunker_version,
&block_ids,
&base_policy_hash,
None,
span,
cb.code.clone(),
));
} else {
let parts = split_oversize(&cb.code);
let n = parts.len();
for (i, (off_start, off_end, text)) in parts.into_iter().enumerate() {
let part_ls = ls + off_start;
let part_le = ls + off_end;
let part_sym = symbol.as_ref().map(|s| format!("{s} [part {}/{n}]", i + 1));
let span = SourceSpan::Code {
line_start: part_ls,
line_end: part_le,
symbol: part_sym,
lang: lang.clone(),
};
out.push(make_chunk(
doc,
&chunker_version,
&block_ids,
&base_policy_hash,
Some(part_ls),
span,
text,
));
}
}
}
tracing::debug!(
target: "kebab-chunk",
doc_id = %doc.doc_id,
chunks = out.len(),
"code-java-ast-v1 chunked",
);
Ok(out)
}
}
#[allow(clippy::too_many_arguments)]
fn make_chunk(
doc: &CanonicalDocument,
chunker_version: &ChunkerVersion,
block_ids: &[BlockId],
base_policy_hash: &str,
split_key: Option<u32>,
span: SourceSpan,
text: String,
) -> Chunk {
let id_hash = match split_key {
Some(k) => format!("{base_policy_hash}#L{k}"),
None => base_policy_hash.to_string(),
};
let chunk_id = id_for_chunk(&doc.doc_id, chunker_version, block_ids, &id_hash);
let token_estimate = text.len().div_ceil(BYTES_PER_TOKEN);
Chunk {
chunk_id,
doc_id: DocumentId(doc.doc_id.0.clone()),
block_ids: block_ids.to_vec(),
tokenized_korean_text: crate::tokenize_korean_morphological(&text),
text,
heading_path: Vec::new(),
source_spans: vec![span],
token_estimate,
chunker_version: chunker_version.clone(),
policy_hash: base_policy_hash.to_string(),
}
}
/// Split an oversize unit at blank-line paragraph boundaries, greedily
/// gluing paragraphs until ~`AST_CHUNK_MAX_LINES` lines accumulate.
/// Returns `(line_offset_start, line_offset_end, text)` where offsets are
/// 0-based within the unit (caller adds the unit's absolute `line_start`).
fn split_oversize(code: &str) -> Vec<(u32, u32, String)> {
let lines: Vec<&str> = code.split('\n').collect();
let total = lines.len() as u32;
let mut out: Vec<(u32, u32, String)> = Vec::new();
let mut start: u32 = 0;
while start < total {
let mut end = (start + AST_CHUNK_MAX_LINES).min(total);
let floor = start + (AST_CHUNK_MAX_LINES * 4 / 5);
if end < total {
if let Some(b) = (floor.min(end)..end)
.rev()
.find(|&i| lines[i as usize].trim().is_empty())
{
end = b + 1;
}
}
let text = lines[start as usize..end as usize].join("\n");
out.push((start, end.saturating_sub(1), text));
start = end;
}
if out.is_empty() {
out.push((0, total.saturating_sub(1), code.to_string()));
}
out
}
#[cfg(test)]
mod tests {
use super::*;
use kebab_core::{
AssetId, Block, CanonicalDocument, ChunkPolicy, Chunker, ChunkerVersion, CodeBlock,
CommonBlock, Lang, Metadata, ParserVersion, Provenance, SourceSpan, SourceType, TrustLevel,
WorkspacePath, id_for_block, id_for_doc,
};
use time::OffsetDateTime;
fn code_doc(units: &[(&str, u32, u32, &str)]) -> CanonicalDocument {
let wp = WorkspacePath("crates/x/src/Main.java".into());
let aid = AssetId("a".repeat(64));
let pv = ParserVersion("code-java-v1".into());
let doc_id = id_for_doc(&wp, &aid, &pv);
let blocks = units
.iter()
.enumerate()
.map(|(i, (sym, ls, le, code))| {
let span = SourceSpan::Code {
line_start: *ls,
line_end: *le,
symbol: Some((*sym).to_string()),
lang: Some("java".into()),
};
let bid = id_for_block(&doc_id, "code", &[], i as u32, &span);
Block::Code(CodeBlock {
common: CommonBlock {
block_id: bid,
heading_path: vec![],
source_span: span,
},
lang: Some("java".into()),
code: (*code).to_string(),
})
})
.collect();
CanonicalDocument {
doc_id,
source_asset_id: aid,
workspace_path: wp,
title: "a".into(),
lang: Lang("und".into()),
blocks,
metadata: Metadata {
aliases: vec![],
tags: vec![],
created_at: OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(),
updated_at: OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(),
source_type: SourceType::Note,
trust_level: TrustLevel::Primary,
user_id_alias: None,
user: Default::default(),
repo: Some("kebab".into()),
git_branch: Some("main".into()),
git_commit: Some("0".repeat(40)),
code_lang: Some("java".into()),
},
provenance: Provenance { events: vec![] },
parser_version: pv,
schema_version: 1,
doc_version: 1,
last_chunker_version: None,
last_embedding_version: None,
}
}
fn policy() -> ChunkPolicy {
ChunkPolicy {
target_tokens: 500,
overlap_tokens: 80,
respect_markdown_headings: false,
chunker_version: ChunkerVersion(VERSION_LABEL.into()),
}
}
#[test]
fn chunker_version_is_code_java_ast_v1() {
assert_eq!(
CodeJavaAstV1Chunker.chunker_version(),
ChunkerVersion("code-java-ast-v1".into())
);
}
#[test]
fn one_chunk_per_unit_preserves_code_span() {
let doc = code_doc(&[
("parse", 1, 3, "void parse() {\n\t// x\n}"),
("Foo.double", 5, 7, "int double() {\n\t//\n\treturn 0;\n}"),
]);
let chunks = CodeJavaAstV1Chunker.chunk(&doc, &policy()).unwrap();
assert_eq!(chunks.len(), 2);
for c in &chunks {
assert_eq!(c.source_spans.len(), 1);
assert!(matches!(c.source_spans[0], SourceSpan::Code { .. }));
assert_eq!(c.heading_path, Vec::<String>::new());
assert_eq!(c.chunker_version.0, "code-java-ast-v1");
}
match &chunks[0].source_spans[0] {
SourceSpan::Code {
symbol,
line_start,
line_end,
..
} => {
assert_eq!(symbol.as_deref(), Some("parse"));
assert_eq!((*line_start, *line_end), (1, 3));
}
_ => unreachable!(),
}
}
#[test]
fn oversize_unit_splits_into_parts_with_unique_ids() {
let body = (0..500)
.map(|i| format!("\tint x{i} = {i};"))
.collect::<Vec<_>>()
.join("\n");
let code = format!("void big() {{\n{body}\n}}");
let doc = code_doc(&[("big", 1, 502, &code)]);
let chunks = CodeJavaAstV1Chunker.chunk(&doc, &policy()).unwrap();
assert!(
chunks.len() >= 2,
"oversize unit must split, got {}",
chunks.len()
);
for c in &chunks {
match &c.source_spans[0] {
SourceSpan::Code { symbol, .. } => {
assert!(
symbol.as_deref().unwrap().starts_with("big [part "),
"part-numbered symbol, got {symbol:?}"
);
}
_ => unreachable!(),
}
}
let mut ids: Vec<&str> = chunks.iter().map(|c| c.chunk_id.0.as_str()).collect();
let n = ids.len();
ids.sort_unstable();
ids.dedup();
assert_eq!(ids.len(), n, "chunk_ids unique across split parts");
}
#[test]
fn non_code_doc_errors() {
use kebab_core::TextBlock;
let mut doc = code_doc(&[("parse", 1, 1, "void parse() {}")]);
doc.blocks = vec![Block::Paragraph(TextBlock {
common: CommonBlock {
block_id: kebab_core::BlockId("b".into()),
heading_path: vec![],
source_span: SourceSpan::Line { start: 1, end: 1 },
},
text: "x".into(),
inlines: vec![],
})];
let err = CodeJavaAstV1Chunker.chunk(&doc, &policy()).unwrap_err();
assert!(err.to_string().contains("CodeJavaAstV1Chunker"));
}
#[test]
fn deterministic_chunk_ids_1000() {
let doc = code_doc(&[("parse", 1, 2, "void parse() {}\n")]);
let base: Vec<String> = CodeJavaAstV1Chunker
.chunk(&doc, &policy())
.unwrap()
.into_iter()
.map(|c| c.chunk_id.0)
.collect();
for _ in 0..1000 {
let again: Vec<String> = CodeJavaAstV1Chunker
.chunk(&doc, &policy())
.unwrap()
.into_iter()
.map(|c| c.chunk_id.0)
.collect();
assert_eq!(again, base);
}
}
#[test]
fn policy_hash_matches_md_heading_v1() {
let p = policy();
assert_eq!(
CodeJavaAstV1Chunker.policy_hash(&p),
crate::MdHeadingV1Chunker.policy_hash(&p)
);
}
}

View File

@@ -39,17 +39,13 @@ impl Chunker for CodeJsAstV1Chunker {
hex[..POLICY_HASH_HEX_LEN].to_string()
}
fn chunk(
&self,
doc: &CanonicalDocument,
policy: &ChunkPolicy,
) -> anyhow::Result<Vec<Chunk>> {
fn chunk(&self, doc: &CanonicalDocument, policy: &ChunkPolicy) -> anyhow::Result<Vec<Chunk>> {
for b in &doc.blocks {
let c = match b {
Block::Code(c) => c,
_ => anyhow::bail!(
"CodeJsAstV1Chunker only handles code docs (got non-Code block)"
),
_ => {
anyhow::bail!("CodeJsAstV1Chunker only handles code docs (got non-Code block)")
}
};
if !matches!(c.common.source_span, SourceSpan::Code { .. }) {
anyhow::bail!(
@@ -68,9 +64,12 @@ impl Chunker for CodeJsAstV1Chunker {
_ => unreachable!("validated above"),
};
let (ls, le, symbol, lang) = match &cb.common.source_span {
SourceSpan::Code { line_start, line_end, symbol, lang } => {
(*line_start, *line_end, symbol.clone(), lang.clone())
}
SourceSpan::Code {
line_start,
line_end,
symbol,
lang,
} => (*line_start, *line_end, symbol.clone(), lang.clone()),
_ => unreachable!("validated above"),
};
let block_ids: Vec<BlockId> = vec![cb.common.block_id.clone()];
@@ -84,8 +83,13 @@ impl Chunker for CodeJsAstV1Chunker {
lang: lang.clone(),
};
out.push(make_chunk(
doc, &chunker_version, &block_ids, &base_policy_hash,
None, span, cb.code.clone(),
doc,
&chunker_version,
&block_ids,
&base_policy_hash,
None,
span,
cb.code.clone(),
));
} else {
let parts = split_oversize(&cb.code);
@@ -93,9 +97,7 @@ impl Chunker for CodeJsAstV1Chunker {
for (i, (off_start, off_end, text)) in parts.into_iter().enumerate() {
let part_ls = ls + off_start;
let part_le = ls + off_end;
let part_sym = symbol
.as_ref()
.map(|s| format!("{s} [part {}/{n}]", i + 1));
let part_sym = symbol.as_ref().map(|s| format!("{s} [part {}/{n}]", i + 1));
let span = SourceSpan::Code {
line_start: part_ls,
line_end: part_le,
@@ -103,8 +105,13 @@ impl Chunker for CodeJsAstV1Chunker {
lang: lang.clone(),
};
out.push(make_chunk(
doc, &chunker_version, &block_ids, &base_policy_hash,
Some(part_ls), span, text,
doc,
&chunker_version,
&block_ids,
&base_policy_hash,
Some(part_ls),
span,
text,
));
}
}
@@ -140,6 +147,7 @@ fn make_chunk(
chunk_id,
doc_id: DocumentId(doc.doc_id.0.clone()),
block_ids: block_ids.to_vec(),
tokenized_korean_text: crate::tokenize_korean_morphological(&text),
text,
heading_path: Vec::new(),
source_spans: vec![span],
@@ -183,9 +191,9 @@ fn split_oversize(code: &str) -> Vec<(u32, u32, String)> {
mod tests {
use super::*;
use kebab_core::{
Block, CanonicalDocument, ChunkPolicy, Chunker, ChunkerVersion, CodeBlock, CommonBlock,
SourceSpan, id_for_block, id_for_doc, AssetId, Lang, Metadata, ParserVersion, Provenance,
SourceType, TrustLevel, WorkspacePath,
AssetId, Block, CanonicalDocument, ChunkPolicy, Chunker, ChunkerVersion, CodeBlock,
CommonBlock, Lang, Metadata, ParserVersion, Provenance, SourceSpan, SourceType, TrustLevel,
WorkspacePath, id_for_block, id_for_doc,
};
use time::OffsetDateTime;
@@ -206,46 +214,72 @@ mod tests {
};
let bid = id_for_block(&doc_id, "code", &[], i as u32, &span);
Block::Code(CodeBlock {
common: CommonBlock { block_id: bid, heading_path: vec![], source_span: span },
common: CommonBlock {
block_id: bid,
heading_path: vec![],
source_span: span,
},
lang: Some("javascript".into()),
code: (*code).to_string(),
})
})
.collect();
CanonicalDocument {
doc_id, source_asset_id: aid, workspace_path: wp, title: "a".into(),
lang: Lang("und".into()), blocks,
doc_id,
source_asset_id: aid,
workspace_path: wp,
title: "a".into(),
lang: Lang("und".into()),
blocks,
metadata: Metadata {
aliases: vec![], tags: vec![],
aliases: vec![],
tags: vec![],
created_at: OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(),
updated_at: OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(),
source_type: SourceType::Note, trust_level: TrustLevel::Primary,
user_id_alias: None, user: Default::default(),
repo: Some("kebab".into()), git_branch: Some("main".into()),
git_commit: Some("0".repeat(40)), code_lang: Some("javascript".into()),
source_type: SourceType::Note,
trust_level: TrustLevel::Primary,
user_id_alias: None,
user: Default::default(),
repo: Some("kebab".into()),
git_branch: Some("main".into()),
git_commit: Some("0".repeat(40)),
code_lang: Some("javascript".into()),
},
provenance: Provenance { events: vec![] },
parser_version: pv, schema_version: 1, doc_version: 1,
last_chunker_version: None, last_embedding_version: None,
parser_version: pv,
schema_version: 1,
doc_version: 1,
last_chunker_version: None,
last_embedding_version: None,
}
}
fn policy() -> ChunkPolicy {
ChunkPolicy { target_tokens: 500, overlap_tokens: 80,
ChunkPolicy {
target_tokens: 500,
overlap_tokens: 80,
respect_markdown_headings: false,
chunker_version: ChunkerVersion(VERSION_LABEL.into()) }
chunker_version: ChunkerVersion(VERSION_LABEL.into()),
}
}
#[test]
fn chunker_version_is_code_js_ast_v1() {
assert_eq!(CodeJsAstV1Chunker.chunker_version(),
ChunkerVersion("code-js-ast-v1".into()));
assert_eq!(
CodeJsAstV1Chunker.chunker_version(),
ChunkerVersion("code-js-ast-v1".into())
);
}
#[test]
fn one_chunk_per_unit_preserves_code_span() {
let doc = code_doc(&[
("parse", 1, 3, "function parse() {\n // x\n}"),
("Foo.double", 5, 7, "function double() {\n //\n return 0;\n}"),
(
"Foo.double",
5,
7,
"function double() {\n //\n return 0;\n}",
),
]);
let chunks = CodeJsAstV1Chunker.chunk(&doc, &policy()).unwrap();
assert_eq!(chunks.len(), 2);
@@ -256,7 +290,12 @@ mod tests {
assert_eq!(c.chunker_version.0, "code-js-ast-v1");
}
match &chunks[0].source_spans[0] {
SourceSpan::Code { symbol, line_start, line_end, .. } => {
SourceSpan::Code {
symbol,
line_start,
line_end,
..
} => {
assert_eq!(symbol.as_deref(), Some("parse"));
assert_eq!((*line_start, *line_end), (1, 3));
}
@@ -266,22 +305,33 @@ mod tests {
#[test]
fn oversize_unit_splits_into_parts_with_unique_ids() {
let body = (0..500).map(|i| format!(" const x{i} = {i};")).collect::<Vec<_>>().join("\n");
let body = (0..500)
.map(|i| format!(" const x{i} = {i};"))
.collect::<Vec<_>>()
.join("\n");
let code = format!("function big() {{\n{body}\n}}");
let doc = code_doc(&[("big", 1, 502, &code)]);
let chunks = CodeJsAstV1Chunker.chunk(&doc, &policy()).unwrap();
assert!(chunks.len() >= 2, "oversize unit must split, got {}", chunks.len());
assert!(
chunks.len() >= 2,
"oversize unit must split, got {}",
chunks.len()
);
for c in &chunks {
match &c.source_spans[0] {
SourceSpan::Code { symbol, .. } => {
assert!(symbol.as_deref().unwrap().starts_with("big [part "),
"part-numbered symbol, got {symbol:?}");
assert!(
symbol.as_deref().unwrap().starts_with("big [part "),
"part-numbered symbol, got {symbol:?}"
);
}
_ => unreachable!(),
}
}
let mut ids: Vec<&str> = chunks.iter().map(|c| c.chunk_id.0.as_str()).collect();
let n = ids.len(); ids.sort(); ids.dedup();
let n = ids.len();
ids.sort_unstable();
ids.dedup();
assert_eq!(ids.len(), n, "chunk_ids unique across split parts");
}
@@ -295,7 +345,8 @@ mod tests {
heading_path: vec![],
source_span: SourceSpan::Line { start: 1, end: 1 },
},
text: "x".into(), inlines: vec![],
text: "x".into(),
inlines: vec![],
})];
let err = CodeJsAstV1Chunker.chunk(&doc, &policy()).unwrap_err();
assert!(err.to_string().contains("CodeJsAstV1Chunker"));
@@ -304,11 +355,19 @@ mod tests {
#[test]
fn deterministic_chunk_ids_1000() {
let doc = code_doc(&[("parse", 1, 2, "function parse() {}\n")]);
let base: Vec<String> = CodeJsAstV1Chunker.chunk(&doc, &policy())
.unwrap().into_iter().map(|c| c.chunk_id.0).collect();
let base: Vec<String> = CodeJsAstV1Chunker
.chunk(&doc, &policy())
.unwrap()
.into_iter()
.map(|c| c.chunk_id.0)
.collect();
for _ in 0..1000 {
let again: Vec<String> = CodeJsAstV1Chunker.chunk(&doc, &policy())
.unwrap().into_iter().map(|c| c.chunk_id.0).collect();
let again: Vec<String> = CodeJsAstV1Chunker
.chunk(&doc, &policy())
.unwrap()
.into_iter()
.map(|c| c.chunk_id.0)
.collect();
assert_eq!(again, base);
}
}
@@ -316,7 +375,9 @@ mod tests {
#[test]
fn policy_hash_matches_md_heading_v1() {
let p = policy();
assert_eq!(CodeJsAstV1Chunker.policy_hash(&p),
crate::MdHeadingV1Chunker.policy_hash(&p));
assert_eq!(
CodeJsAstV1Chunker.policy_hash(&p),
crate::MdHeadingV1Chunker.policy_hash(&p)
);
}
}

View File

@@ -0,0 +1,383 @@
//! `code-kotlin-ast-v1` — maps a tree-sitter-derived Kotlin AST
//! `CanonicalDocument` (one `Block::Code` per semantic unit, each with
//! `SourceSpan::Code`) to chunks 1:1. A unit longer than
//! `AST_CHUNK_MAX_LINES` is split into `<symbol> [part i/N]` sub-chunks
//! at blank-line paragraph boundaries (design §9.1 oversize fallback).
//!
//! tree-sitter is intentionally NOT a dependency here: AST work is
//! parser-side (`kebab-parse-code`, design §6.3). This chunker only
//! consumes the `CanonicalDocument`.
//!
//! `AST_CHUNK_MAX_LINES` is a constant matching
//! `IngestCodeCfg::default().ast_chunk_max_lines` (200). Per-medium
//! config threading needs a chunker registry (P+); same deviation
//! pattern as `pdf-page-v1`'s pinned `chunker_version`
//! (`tasks/HOTFIXES.md`).
use kebab_core::{
Block, BlockId, CanonicalDocument, Chunk, ChunkPolicy, Chunker, ChunkerVersion, DocumentId,
SourceSpan, id_for_chunk,
};
const VERSION_LABEL: &str = "code-kotlin-ast-v1";
const BYTES_PER_TOKEN: usize = 3;
const POLICY_HASH_HEX_LEN: usize = 16;
const AST_CHUNK_MAX_LINES: u32 = 200;
#[derive(Clone, Copy, Debug, Default)]
pub struct CodeKotlinAstV1Chunker;
impl Chunker for CodeKotlinAstV1Chunker {
fn chunker_version(&self) -> ChunkerVersion {
ChunkerVersion(VERSION_LABEL.to_string())
}
fn policy_hash(&self, policy: &ChunkPolicy) -> String {
let bytes = serde_json_canonicalizer::to_vec(policy)
.expect("canonical JSON serialization of ChunkPolicy must not fail");
let hex = blake3::hash(&bytes).to_hex().to_string();
hex[..POLICY_HASH_HEX_LEN].to_string()
}
fn chunk(&self, doc: &CanonicalDocument, policy: &ChunkPolicy) -> anyhow::Result<Vec<Chunk>> {
for b in &doc.blocks {
let c = match b {
Block::Code(c) => c,
_ => anyhow::bail!(
"CodeKotlinAstV1Chunker only handles code docs (got non-Code block)"
),
};
if !matches!(c.common.source_span, SourceSpan::Code { .. }) {
anyhow::bail!(
"CodeKotlinAstV1Chunker only handles code docs (got non-Code source_span)"
);
}
}
let base_policy_hash = self.policy_hash(policy);
let chunker_version = self.chunker_version();
let mut out: Vec<Chunk> = Vec::new();
for b in &doc.blocks {
let cb = match b {
Block::Code(c) => c,
_ => unreachable!("validated above"),
};
let (ls, le, symbol, lang) = match &cb.common.source_span {
SourceSpan::Code {
line_start,
line_end,
symbol,
lang,
} => (*line_start, *line_end, symbol.clone(), lang.clone()),
_ => unreachable!("validated above"),
};
let block_ids: Vec<BlockId> = vec![cb.common.block_id.clone()];
let span_lines = le.saturating_sub(ls) + 1;
if span_lines <= AST_CHUNK_MAX_LINES {
let span = SourceSpan::Code {
line_start: ls,
line_end: le,
symbol: symbol.clone(),
lang: lang.clone(),
};
out.push(make_chunk(
doc,
&chunker_version,
&block_ids,
&base_policy_hash,
None,
span,
cb.code.clone(),
));
} else {
let parts = split_oversize(&cb.code);
let n = parts.len();
for (i, (off_start, off_end, text)) in parts.into_iter().enumerate() {
let part_ls = ls + off_start;
let part_le = ls + off_end;
let part_sym = symbol.as_ref().map(|s| format!("{s} [part {}/{n}]", i + 1));
let span = SourceSpan::Code {
line_start: part_ls,
line_end: part_le,
symbol: part_sym,
lang: lang.clone(),
};
out.push(make_chunk(
doc,
&chunker_version,
&block_ids,
&base_policy_hash,
Some(part_ls),
span,
text,
));
}
}
}
tracing::debug!(
target: "kebab-chunk",
doc_id = %doc.doc_id,
chunks = out.len(),
"code-kotlin-ast-v1 chunked",
);
Ok(out)
}
}
#[allow(clippy::too_many_arguments)]
fn make_chunk(
doc: &CanonicalDocument,
chunker_version: &ChunkerVersion,
block_ids: &[BlockId],
base_policy_hash: &str,
split_key: Option<u32>,
span: SourceSpan,
text: String,
) -> Chunk {
let id_hash = match split_key {
Some(k) => format!("{base_policy_hash}#L{k}"),
None => base_policy_hash.to_string(),
};
let chunk_id = id_for_chunk(&doc.doc_id, chunker_version, block_ids, &id_hash);
let token_estimate = text.len().div_ceil(BYTES_PER_TOKEN);
Chunk {
chunk_id,
doc_id: DocumentId(doc.doc_id.0.clone()),
block_ids: block_ids.to_vec(),
tokenized_korean_text: crate::tokenize_korean_morphological(&text),
text,
heading_path: Vec::new(),
source_spans: vec![span],
token_estimate,
chunker_version: chunker_version.clone(),
policy_hash: base_policy_hash.to_string(),
}
}
/// Split an oversize unit at blank-line paragraph boundaries, greedily
/// gluing paragraphs until ~`AST_CHUNK_MAX_LINES` lines accumulate.
/// Returns `(line_offset_start, line_offset_end, text)` where offsets are
/// 0-based within the unit (caller adds the unit's absolute `line_start`).
fn split_oversize(code: &str) -> Vec<(u32, u32, String)> {
let lines: Vec<&str> = code.split('\n').collect();
let total = lines.len() as u32;
let mut out: Vec<(u32, u32, String)> = Vec::new();
let mut start: u32 = 0;
while start < total {
let mut end = (start + AST_CHUNK_MAX_LINES).min(total);
let floor = start + (AST_CHUNK_MAX_LINES * 4 / 5);
if end < total {
if let Some(b) = (floor.min(end)..end)
.rev()
.find(|&i| lines[i as usize].trim().is_empty())
{
end = b + 1;
}
}
let text = lines[start as usize..end as usize].join("\n");
out.push((start, end.saturating_sub(1), text));
start = end;
}
if out.is_empty() {
out.push((0, total.saturating_sub(1), code.to_string()));
}
out
}
#[cfg(test)]
mod tests {
use super::*;
use kebab_core::{
AssetId, Block, CanonicalDocument, ChunkPolicy, Chunker, ChunkerVersion, CodeBlock,
CommonBlock, Lang, Metadata, ParserVersion, Provenance, SourceSpan, SourceType, TrustLevel,
WorkspacePath, id_for_block, id_for_doc,
};
use time::OffsetDateTime;
fn code_doc(units: &[(&str, u32, u32, &str)]) -> CanonicalDocument {
let wp = WorkspacePath("crates/x/src/Main.kt".into());
let aid = AssetId("a".repeat(64));
let pv = ParserVersion("code-kotlin-v1".into());
let doc_id = id_for_doc(&wp, &aid, &pv);
let blocks = units
.iter()
.enumerate()
.map(|(i, (sym, ls, le, code))| {
let span = SourceSpan::Code {
line_start: *ls,
line_end: *le,
symbol: Some((*sym).to_string()),
lang: Some("kotlin".into()),
};
let bid = id_for_block(&doc_id, "code", &[], i as u32, &span);
Block::Code(CodeBlock {
common: CommonBlock {
block_id: bid,
heading_path: vec![],
source_span: span,
},
lang: Some("kotlin".into()),
code: (*code).to_string(),
})
})
.collect();
CanonicalDocument {
doc_id,
source_asset_id: aid,
workspace_path: wp,
title: "a".into(),
lang: Lang("und".into()),
blocks,
metadata: Metadata {
aliases: vec![],
tags: vec![],
created_at: OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(),
updated_at: OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(),
source_type: SourceType::Note,
trust_level: TrustLevel::Primary,
user_id_alias: None,
user: Default::default(),
repo: Some("kebab".into()),
git_branch: Some("main".into()),
git_commit: Some("0".repeat(40)),
code_lang: Some("kotlin".into()),
},
provenance: Provenance { events: vec![] },
parser_version: pv,
schema_version: 1,
doc_version: 1,
last_chunker_version: None,
last_embedding_version: None,
}
}
fn policy() -> ChunkPolicy {
ChunkPolicy {
target_tokens: 500,
overlap_tokens: 80,
respect_markdown_headings: false,
chunker_version: ChunkerVersion(VERSION_LABEL.into()),
}
}
#[test]
fn chunker_version_is_code_kotlin_ast_v1() {
assert_eq!(
CodeKotlinAstV1Chunker.chunker_version(),
ChunkerVersion("code-kotlin-ast-v1".into())
);
}
#[test]
fn one_chunk_per_unit_preserves_code_span() {
let doc = code_doc(&[
("parse", 1, 3, "fun parse() {\n\t// x\n}"),
(
"Foo.double",
5,
7,
"fun double(): Int {\n\t//\n\treturn 0\n}",
),
]);
let chunks = CodeKotlinAstV1Chunker.chunk(&doc, &policy()).unwrap();
assert_eq!(chunks.len(), 2);
for c in &chunks {
assert_eq!(c.source_spans.len(), 1);
assert!(matches!(c.source_spans[0], SourceSpan::Code { .. }));
assert_eq!(c.heading_path, Vec::<String>::new());
assert_eq!(c.chunker_version.0, "code-kotlin-ast-v1");
}
match &chunks[0].source_spans[0] {
SourceSpan::Code {
symbol,
line_start,
line_end,
..
} => {
assert_eq!(symbol.as_deref(), Some("parse"));
assert_eq!((*line_start, *line_end), (1, 3));
}
_ => unreachable!(),
}
}
#[test]
fn oversize_unit_splits_into_parts_with_unique_ids() {
let body = (0..500)
.map(|i| format!("\tval x{i} = {i}"))
.collect::<Vec<_>>()
.join("\n");
let code = format!("fun big() {{\n{body}\n}}");
let doc = code_doc(&[("big", 1, 502, &code)]);
let chunks = CodeKotlinAstV1Chunker.chunk(&doc, &policy()).unwrap();
assert!(
chunks.len() >= 2,
"oversize unit must split, got {}",
chunks.len()
);
for c in &chunks {
match &c.source_spans[0] {
SourceSpan::Code { symbol, .. } => {
assert!(
symbol.as_deref().unwrap().starts_with("big [part "),
"part-numbered symbol, got {symbol:?}"
);
}
_ => unreachable!(),
}
}
let mut ids: Vec<&str> = chunks.iter().map(|c| c.chunk_id.0.as_str()).collect();
let n = ids.len();
ids.sort_unstable();
ids.dedup();
assert_eq!(ids.len(), n, "chunk_ids unique across split parts");
}
#[test]
fn non_code_doc_errors() {
use kebab_core::TextBlock;
let mut doc = code_doc(&[("parse", 1, 1, "fun parse() {}")]);
doc.blocks = vec![Block::Paragraph(TextBlock {
common: CommonBlock {
block_id: kebab_core::BlockId("b".into()),
heading_path: vec![],
source_span: SourceSpan::Line { start: 1, end: 1 },
},
text: "x".into(),
inlines: vec![],
})];
let err = CodeKotlinAstV1Chunker.chunk(&doc, &policy()).unwrap_err();
assert!(err.to_string().contains("CodeKotlinAstV1Chunker"));
}
#[test]
fn deterministic_chunk_ids_1000() {
let doc = code_doc(&[("parse", 1, 2, "fun parse() {}\n")]);
let base: Vec<String> = CodeKotlinAstV1Chunker
.chunk(&doc, &policy())
.unwrap()
.into_iter()
.map(|c| c.chunk_id.0)
.collect();
for _ in 0..1000 {
let again: Vec<String> = CodeKotlinAstV1Chunker
.chunk(&doc, &policy())
.unwrap()
.into_iter()
.map(|c| c.chunk_id.0)
.collect();
assert_eq!(again, base);
}
}
#[test]
fn policy_hash_matches_md_heading_v1() {
let p = policy();
assert_eq!(
CodeKotlinAstV1Chunker.policy_hash(&p),
crate::MdHeadingV1Chunker.policy_hash(&p)
);
}
}

View File

@@ -39,11 +39,7 @@ impl Chunker for CodePythonAstV1Chunker {
hex[..POLICY_HASH_HEX_LEN].to_string()
}
fn chunk(
&self,
doc: &CanonicalDocument,
policy: &ChunkPolicy,
) -> anyhow::Result<Vec<Chunk>> {
fn chunk(&self, doc: &CanonicalDocument, policy: &ChunkPolicy) -> anyhow::Result<Vec<Chunk>> {
for b in &doc.blocks {
let c = match b {
Block::Code(c) => c,
@@ -68,9 +64,12 @@ impl Chunker for CodePythonAstV1Chunker {
_ => unreachable!("validated above"),
};
let (ls, le, symbol, lang) = match &cb.common.source_span {
SourceSpan::Code { line_start, line_end, symbol, lang } => {
(*line_start, *line_end, symbol.clone(), lang.clone())
}
SourceSpan::Code {
line_start,
line_end,
symbol,
lang,
} => (*line_start, *line_end, symbol.clone(), lang.clone()),
_ => unreachable!("validated above"),
};
let block_ids: Vec<BlockId> = vec![cb.common.block_id.clone()];
@@ -84,8 +83,13 @@ impl Chunker for CodePythonAstV1Chunker {
lang: lang.clone(),
};
out.push(make_chunk(
doc, &chunker_version, &block_ids, &base_policy_hash,
None, span, cb.code.clone(),
doc,
&chunker_version,
&block_ids,
&base_policy_hash,
None,
span,
cb.code.clone(),
));
} else {
let parts = split_oversize(&cb.code);
@@ -93,9 +97,7 @@ impl Chunker for CodePythonAstV1Chunker {
for (i, (off_start, off_end, text)) in parts.into_iter().enumerate() {
let part_ls = ls + off_start;
let part_le = ls + off_end;
let part_sym = symbol
.as_ref()
.map(|s| format!("{s} [part {}/{n}]", i + 1));
let part_sym = symbol.as_ref().map(|s| format!("{s} [part {}/{n}]", i + 1));
let span = SourceSpan::Code {
line_start: part_ls,
line_end: part_le,
@@ -103,8 +105,13 @@ impl Chunker for CodePythonAstV1Chunker {
lang: lang.clone(),
};
out.push(make_chunk(
doc, &chunker_version, &block_ids, &base_policy_hash,
Some(part_ls), span, text,
doc,
&chunker_version,
&block_ids,
&base_policy_hash,
Some(part_ls),
span,
text,
));
}
}
@@ -140,6 +147,7 @@ fn make_chunk(
chunk_id,
doc_id: DocumentId(doc.doc_id.0.clone()),
block_ids: block_ids.to_vec(),
tokenized_korean_text: crate::tokenize_korean_morphological(&text),
text,
heading_path: Vec::new(),
source_spans: vec![span],
@@ -183,9 +191,9 @@ fn split_oversize(code: &str) -> Vec<(u32, u32, String)> {
mod tests {
use super::*;
use kebab_core::{
Block, CanonicalDocument, ChunkPolicy, Chunker, ChunkerVersion, CodeBlock, CommonBlock,
SourceSpan, id_for_block, id_for_doc, AssetId, Lang, Metadata, ParserVersion, Provenance,
SourceType, TrustLevel, WorkspacePath,
AssetId, Block, CanonicalDocument, ChunkPolicy, Chunker, ChunkerVersion, CodeBlock,
CommonBlock, Lang, Metadata, ParserVersion, Provenance, SourceSpan, SourceType, TrustLevel,
WorkspacePath, id_for_block, id_for_doc,
};
use time::OffsetDateTime;
@@ -206,39 +214,60 @@ mod tests {
};
let bid = id_for_block(&doc_id, "code", &[], i as u32, &span);
Block::Code(CodeBlock {
common: CommonBlock { block_id: bid, heading_path: vec![], source_span: span },
common: CommonBlock {
block_id: bid,
heading_path: vec![],
source_span: span,
},
lang: Some("python".into()),
code: (*code).to_string(),
})
})
.collect();
CanonicalDocument {
doc_id, source_asset_id: aid, workspace_path: wp, title: "a".into(),
lang: Lang("und".into()), blocks,
doc_id,
source_asset_id: aid,
workspace_path: wp,
title: "a".into(),
lang: Lang("und".into()),
blocks,
metadata: Metadata {
aliases: vec![], tags: vec![],
aliases: vec![],
tags: vec![],
created_at: OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(),
updated_at: OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(),
source_type: SourceType::Note, trust_level: TrustLevel::Primary,
user_id_alias: None, user: Default::default(),
repo: Some("kebab".into()), git_branch: Some("main".into()),
git_commit: Some("0".repeat(40)), code_lang: Some("python".into()),
source_type: SourceType::Note,
trust_level: TrustLevel::Primary,
user_id_alias: None,
user: Default::default(),
repo: Some("kebab".into()),
git_branch: Some("main".into()),
git_commit: Some("0".repeat(40)),
code_lang: Some("python".into()),
},
provenance: Provenance { events: vec![] },
parser_version: pv, schema_version: 1, doc_version: 1,
last_chunker_version: None, last_embedding_version: None,
parser_version: pv,
schema_version: 1,
doc_version: 1,
last_chunker_version: None,
last_embedding_version: None,
}
}
fn policy() -> ChunkPolicy {
ChunkPolicy { target_tokens: 500, overlap_tokens: 80,
ChunkPolicy {
target_tokens: 500,
overlap_tokens: 80,
respect_markdown_headings: false,
chunker_version: ChunkerVersion(VERSION_LABEL.into()) }
chunker_version: ChunkerVersion(VERSION_LABEL.into()),
}
}
#[test]
fn chunker_version_is_code_python_ast_v1() {
assert_eq!(CodePythonAstV1Chunker.chunker_version(),
ChunkerVersion("code-python-ast-v1".into()));
assert_eq!(
CodePythonAstV1Chunker.chunker_version(),
ChunkerVersion("code-python-ast-v1".into())
);
}
#[test]
@@ -256,7 +285,12 @@ mod tests {
assert_eq!(c.chunker_version.0, "code-python-ast-v1");
}
match &chunks[0].source_spans[0] {
SourceSpan::Code { symbol, line_start, line_end, .. } => {
SourceSpan::Code {
symbol,
line_start,
line_end,
..
} => {
assert_eq!(symbol.as_deref(), Some("parse"));
assert_eq!((*line_start, *line_end), (1, 3));
}
@@ -266,22 +300,33 @@ mod tests {
#[test]
fn oversize_unit_splits_into_parts_with_unique_ids() {
let body = (0..500).map(|i| format!(" x{i} = {i}")).collect::<Vec<_>>().join("\n");
let body = (0..500)
.map(|i| format!(" x{i} = {i}"))
.collect::<Vec<_>>()
.join("\n");
let code = format!("def big():\n{body}\n");
let doc = code_doc(&[("big", 1, 502, &code)]);
let chunks = CodePythonAstV1Chunker.chunk(&doc, &policy()).unwrap();
assert!(chunks.len() >= 2, "oversize unit must split, got {}", chunks.len());
assert!(
chunks.len() >= 2,
"oversize unit must split, got {}",
chunks.len()
);
for c in &chunks {
match &c.source_spans[0] {
SourceSpan::Code { symbol, .. } => {
assert!(symbol.as_deref().unwrap().starts_with("big [part "),
"part-numbered symbol, got {symbol:?}");
assert!(
symbol.as_deref().unwrap().starts_with("big [part "),
"part-numbered symbol, got {symbol:?}"
);
}
_ => unreachable!(),
}
}
let mut ids: Vec<&str> = chunks.iter().map(|c| c.chunk_id.0.as_str()).collect();
let n = ids.len(); ids.sort(); ids.dedup();
let n = ids.len();
ids.sort_unstable();
ids.dedup();
assert_eq!(ids.len(), n, "chunk_ids unique across split parts");
}
@@ -295,7 +340,8 @@ mod tests {
heading_path: vec![],
source_span: SourceSpan::Line { start: 1, end: 1 },
},
text: "x".into(), inlines: vec![],
text: "x".into(),
inlines: vec![],
})];
let err = CodePythonAstV1Chunker.chunk(&doc, &policy()).unwrap_err();
assert!(err.to_string().contains("CodePythonAstV1Chunker"));
@@ -304,11 +350,19 @@ mod tests {
#[test]
fn deterministic_chunk_ids_1000() {
let doc = code_doc(&[("parse", 1, 2, "def parse(): pass\n")]);
let base: Vec<String> = CodePythonAstV1Chunker.chunk(&doc, &policy())
.unwrap().into_iter().map(|c| c.chunk_id.0).collect();
let base: Vec<String> = CodePythonAstV1Chunker
.chunk(&doc, &policy())
.unwrap()
.into_iter()
.map(|c| c.chunk_id.0)
.collect();
for _ in 0..1000 {
let again: Vec<String> = CodePythonAstV1Chunker.chunk(&doc, &policy())
.unwrap().into_iter().map(|c| c.chunk_id.0).collect();
let again: Vec<String> = CodePythonAstV1Chunker
.chunk(&doc, &policy())
.unwrap()
.into_iter()
.map(|c| c.chunk_id.0)
.collect();
assert_eq!(again, base);
}
}
@@ -316,7 +370,9 @@ mod tests {
#[test]
fn policy_hash_matches_md_heading_v1() {
let p = policy();
assert_eq!(CodePythonAstV1Chunker.policy_hash(&p),
crate::MdHeadingV1Chunker.policy_hash(&p));
assert_eq!(
CodePythonAstV1Chunker.policy_hash(&p),
crate::MdHeadingV1Chunker.policy_hash(&p)
);
}
}

View File

@@ -39,11 +39,7 @@ impl Chunker for CodeRustAstV1Chunker {
hex[..POLICY_HASH_HEX_LEN].to_string()
}
fn chunk(
&self,
doc: &CanonicalDocument,
policy: &ChunkPolicy,
) -> anyhow::Result<Vec<Chunk>> {
fn chunk(&self, doc: &CanonicalDocument, policy: &ChunkPolicy) -> anyhow::Result<Vec<Chunk>> {
for b in &doc.blocks {
let c = match b {
Block::Code(c) => c,
@@ -68,9 +64,12 @@ impl Chunker for CodeRustAstV1Chunker {
_ => unreachable!("validated above"),
};
let (ls, le, symbol, lang) = match &cb.common.source_span {
SourceSpan::Code { line_start, line_end, symbol, lang } => {
(*line_start, *line_end, symbol.clone(), lang.clone())
}
SourceSpan::Code {
line_start,
line_end,
symbol,
lang,
} => (*line_start, *line_end, symbol.clone(), lang.clone()),
_ => unreachable!("validated above"),
};
let block_ids: Vec<BlockId> = vec![cb.common.block_id.clone()];
@@ -84,8 +83,13 @@ impl Chunker for CodeRustAstV1Chunker {
lang: lang.clone(),
};
out.push(make_chunk(
doc, &chunker_version, &block_ids, &base_policy_hash,
None, span, cb.code.clone(),
doc,
&chunker_version,
&block_ids,
&base_policy_hash,
None,
span,
cb.code.clone(),
));
} else {
let parts = split_oversize(&cb.code);
@@ -93,9 +97,7 @@ impl Chunker for CodeRustAstV1Chunker {
for (i, (off_start, off_end, text)) in parts.into_iter().enumerate() {
let part_ls = ls + off_start;
let part_le = ls + off_end;
let part_sym = symbol
.as_ref()
.map(|s| format!("{s} [part {}/{n}]", i + 1));
let part_sym = symbol.as_ref().map(|s| format!("{s} [part {}/{n}]", i + 1));
let span = SourceSpan::Code {
line_start: part_ls,
line_end: part_le,
@@ -103,8 +105,13 @@ impl Chunker for CodeRustAstV1Chunker {
lang: lang.clone(),
};
out.push(make_chunk(
doc, &chunker_version, &block_ids, &base_policy_hash,
Some(part_ls), span, text,
doc,
&chunker_version,
&block_ids,
&base_policy_hash,
Some(part_ls),
span,
text,
));
}
}
@@ -140,6 +147,7 @@ fn make_chunk(
chunk_id,
doc_id: DocumentId(doc.doc_id.0.clone()),
block_ids: block_ids.to_vec(),
tokenized_korean_text: crate::tokenize_korean_morphological(&text),
text,
heading_path: Vec::new(),
source_spans: vec![span],
@@ -183,9 +191,9 @@ fn split_oversize(code: &str) -> Vec<(u32, u32, String)> {
mod tests {
use super::*;
use kebab_core::{
Block, CanonicalDocument, ChunkPolicy, Chunker, ChunkerVersion, CodeBlock, CommonBlock,
SourceSpan, id_for_block, id_for_doc, AssetId, Lang, Metadata, ParserVersion, Provenance,
SourceType, TrustLevel, WorkspacePath,
AssetId, Block, CanonicalDocument, ChunkPolicy, Chunker, ChunkerVersion, CodeBlock,
CommonBlock, Lang, Metadata, ParserVersion, Provenance, SourceSpan, SourceType, TrustLevel,
WorkspacePath, id_for_block, id_for_doc,
};
use time::OffsetDateTime;
@@ -206,39 +214,60 @@ mod tests {
};
let bid = id_for_block(&doc_id, "code", &[], i as u32, &span);
Block::Code(CodeBlock {
common: CommonBlock { block_id: bid, heading_path: vec![], source_span: span },
common: CommonBlock {
block_id: bid,
heading_path: vec![],
source_span: span,
},
lang: Some("rust".into()),
code: (*code).to_string(),
})
})
.collect();
CanonicalDocument {
doc_id, source_asset_id: aid, workspace_path: wp, title: "a".into(),
lang: Lang("und".into()), blocks,
doc_id,
source_asset_id: aid,
workspace_path: wp,
title: "a".into(),
lang: Lang("und".into()),
blocks,
metadata: Metadata {
aliases: vec![], tags: vec![],
aliases: vec![],
tags: vec![],
created_at: OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(),
updated_at: OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(),
source_type: SourceType::Note, trust_level: TrustLevel::Primary,
user_id_alias: None, user: Default::default(),
repo: Some("kebab".into()), git_branch: Some("main".into()),
git_commit: Some("0".repeat(40)), code_lang: Some("rust".into()),
source_type: SourceType::Note,
trust_level: TrustLevel::Primary,
user_id_alias: None,
user: Default::default(),
repo: Some("kebab".into()),
git_branch: Some("main".into()),
git_commit: Some("0".repeat(40)),
code_lang: Some("rust".into()),
},
provenance: Provenance { events: vec![] },
parser_version: pv, schema_version: 1, doc_version: 1,
last_chunker_version: None, last_embedding_version: None,
parser_version: pv,
schema_version: 1,
doc_version: 1,
last_chunker_version: None,
last_embedding_version: None,
}
}
fn policy() -> ChunkPolicy {
ChunkPolicy { target_tokens: 500, overlap_tokens: 80,
ChunkPolicy {
target_tokens: 500,
overlap_tokens: 80,
respect_markdown_headings: false,
chunker_version: ChunkerVersion(VERSION_LABEL.into()) }
chunker_version: ChunkerVersion(VERSION_LABEL.into()),
}
}
#[test]
fn chunker_version_is_code_rust_ast_v1() {
assert_eq!(CodeRustAstV1Chunker.chunker_version(),
ChunkerVersion("code-rust-ast-v1".into()));
assert_eq!(
CodeRustAstV1Chunker.chunker_version(),
ChunkerVersion("code-rust-ast-v1".into())
);
}
#[test]
@@ -256,7 +285,12 @@ mod tests {
assert_eq!(c.chunker_version.0, "code-rust-ast-v1");
}
match &chunks[0].source_spans[0] {
SourceSpan::Code { symbol, line_start, line_end, .. } => {
SourceSpan::Code {
symbol,
line_start,
line_end,
..
} => {
assert_eq!(symbol.as_deref(), Some("parse"));
assert_eq!((*line_start, *line_end), (1, 3));
}
@@ -266,22 +300,33 @@ mod tests {
#[test]
fn oversize_unit_splits_into_parts_with_unique_ids() {
let body = (0..500).map(|i| format!(" let x{i} = {i};")).collect::<Vec<_>>().join("\n");
let body = (0..500)
.map(|i| format!(" let x{i} = {i};"))
.collect::<Vec<_>>()
.join("\n");
let code = format!("pub fn big() {{\n{body}\n}}");
let doc = code_doc(&[("big", 1, 502, &code)]);
let chunks = CodeRustAstV1Chunker.chunk(&doc, &policy()).unwrap();
assert!(chunks.len() >= 2, "oversize unit must split, got {}", chunks.len());
assert!(
chunks.len() >= 2,
"oversize unit must split, got {}",
chunks.len()
);
for c in &chunks {
match &c.source_spans[0] {
SourceSpan::Code { symbol, .. } => {
assert!(symbol.as_deref().unwrap().starts_with("big [part "),
"part-numbered symbol, got {symbol:?}");
assert!(
symbol.as_deref().unwrap().starts_with("big [part "),
"part-numbered symbol, got {symbol:?}"
);
}
_ => unreachable!(),
}
}
let mut ids: Vec<&str> = chunks.iter().map(|c| c.chunk_id.0.as_str()).collect();
let n = ids.len(); ids.sort(); ids.dedup();
let n = ids.len();
ids.sort_unstable();
ids.dedup();
assert_eq!(ids.len(), n, "chunk_ids unique across split parts");
}
@@ -295,7 +340,8 @@ mod tests {
heading_path: vec![],
source_span: SourceSpan::Line { start: 1, end: 1 },
},
text: "x".into(), inlines: vec![],
text: "x".into(),
inlines: vec![],
})];
let err = CodeRustAstV1Chunker.chunk(&doc, &policy()).unwrap_err();
assert!(err.to_string().contains("CodeRustAstV1Chunker"));
@@ -304,11 +350,19 @@ mod tests {
#[test]
fn deterministic_chunk_ids_1000() {
let doc = code_doc(&[("parse", 1, 2, "fn parse(){}\n}")]);
let base: Vec<String> = CodeRustAstV1Chunker.chunk(&doc, &policy())
.unwrap().into_iter().map(|c| c.chunk_id.0).collect();
let base: Vec<String> = CodeRustAstV1Chunker
.chunk(&doc, &policy())
.unwrap()
.into_iter()
.map(|c| c.chunk_id.0)
.collect();
for _ in 0..1000 {
let again: Vec<String> = CodeRustAstV1Chunker.chunk(&doc, &policy())
.unwrap().into_iter().map(|c| c.chunk_id.0).collect();
let again: Vec<String> = CodeRustAstV1Chunker
.chunk(&doc, &policy())
.unwrap()
.into_iter()
.map(|c| c.chunk_id.0)
.collect();
assert_eq!(again, base);
}
}
@@ -316,7 +370,9 @@ mod tests {
#[test]
fn policy_hash_matches_md_heading_v1() {
let p = policy();
assert_eq!(CodeRustAstV1Chunker.policy_hash(&p),
crate::MdHeadingV1Chunker.policy_hash(&p));
assert_eq!(
CodeRustAstV1Chunker.policy_hash(&p),
crate::MdHeadingV1Chunker.policy_hash(&p)
);
}
}

View File

@@ -0,0 +1,170 @@
//! p10-3: Tier 3 paragraph + line-window fallback chunker.
//!
//! Splits code/text files on blank-line paragraph boundaries. Paragraphs
//! with more than 80 lines are further split into 80-line windows with a
//! 20-line overlap (stride 60) — the same oversize pattern used by Tier 1/2
//! chunkers but without AST structure, hence no symbol.
//!
//! Per spec §9.3: all emitted chunks carry `symbol: None`.
use crate::tier2_shared::{build_chunk_no_symbol, policy_hash};
use anyhow::Result;
use kebab_core::{Block, CanonicalDocument, Chunk, ChunkPolicy, Chunker, ChunkerVersion};
pub const VERSION_LABEL: &str = "code-text-paragraph-v1";
/// Lines-per-window for the oversize fallback (Tier 3).
const FALLBACK_LINES_PER_CHUNK: usize = 80;
/// Overlap between consecutive windows.
const FALLBACK_LINES_OVERLAP: usize = 20;
// stride = FALLBACK_LINES_PER_CHUNK - FALLBACK_LINES_OVERLAP = 60.
#[derive(Clone, Copy, Debug, Default)]
pub struct CodeTextParagraphV1Chunker;
impl Chunker for CodeTextParagraphV1Chunker {
fn chunker_version(&self) -> ChunkerVersion {
ChunkerVersion(VERSION_LABEL.to_string())
}
fn policy_hash(&self, policy: &ChunkPolicy) -> String {
policy_hash(policy)
}
fn chunk(&self, doc: &CanonicalDocument, policy: &ChunkPolicy) -> Result<Vec<Chunk>> {
// Expect a single Block::Code carrying the full source text.
let (text, lang_str) = match doc.blocks.first() {
Some(Block::Code(cb)) => (cb.code.as_str(), cb.lang.as_deref().unwrap_or("")),
_ => return Ok(vec![]),
};
let mut chunks = Vec::new();
for para in split_paragraphs(text) {
push_paragraph(&mut chunks, doc, policy, &para, lang_str)?;
}
tracing::debug!(
target: "kebab-chunk",
doc_id = %doc.doc_id,
chunks = chunks.len(),
"code-text-paragraph-v1 chunked",
);
Ok(chunks)
}
}
/// A contiguous run of non-blank lines from the source text.
struct Paragraph {
/// Lines joined with `\n` (no trailing newline).
text: String,
/// 1-indexed line number of the first line in the source file.
line_start: u32,
/// 1-indexed line number of the last line in the source file.
line_end: u32,
}
/// Split `text` into `Paragraph`s separated by blank (all-whitespace) lines.
///
/// Blank lines are treated as boundaries and are NOT included in any
/// paragraph's line range. Paragraphs that would consist entirely of blank
/// lines are skipped.
fn split_paragraphs(text: &str) -> Vec<Paragraph> {
let mut paragraphs = Vec::new();
let mut current: Vec<&str> = Vec::new();
let mut current_start: Option<u32> = None;
for (idx, line) in text.lines().enumerate() {
let line_no = (idx + 1) as u32;
let is_blank = line.trim().is_empty();
if is_blank {
if let Some(start) = current_start.take() {
let end = start + current.len() as u32 - 1;
paragraphs.push(Paragraph {
text: current.join("\n"),
line_start: start,
line_end: end,
});
current.clear();
}
} else {
if current_start.is_none() {
current_start = Some(line_no);
}
current.push(line);
}
}
// Flush any trailing paragraph not terminated by a blank line.
if let Some(start) = current_start {
let end = start + current.len() as u32 - 1;
paragraphs.push(Paragraph {
text: current.join("\n"),
line_start: start,
line_end: end,
});
}
paragraphs
}
/// Emit one or more chunks for a single paragraph.
///
/// Paragraphs with ≤ `FALLBACK_LINES_PER_CHUNK` lines become a single chunk.
/// Larger paragraphs are split into overlapping windows of
/// `FALLBACK_LINES_PER_CHUNK` lines with stride `FALLBACK_LINES_PER_CHUNK -
/// FALLBACK_LINES_OVERLAP`. The last window may be shorter. Window starts
/// are passed as `split_key` so `id_for_chunk` can produce distinct ids
/// across windows.
fn push_paragraph(
out: &mut Vec<Chunk>,
doc: &CanonicalDocument,
policy: &ChunkPolicy,
para: &Paragraph,
lang: &str,
) -> Result<()> {
let n_lines = (para.line_end - para.line_start + 1) as usize;
if n_lines <= FALLBACK_LINES_PER_CHUNK {
// Use line_start as split_key so each paragraph gets a distinct
// chunk_id even when block_ids is empty (no symbol, no AST structure).
// Without this, all short paragraphs from the same doc share the same
// base_policy_hash and therefore the same id_for_chunk result.
out.push(build_chunk_no_symbol(
doc,
policy,
&para.text,
para.line_start,
para.line_end,
lang,
VERSION_LABEL,
Some(para.line_start),
));
return Ok(());
}
// Oversize: line-window split with overlap.
let stride = FALLBACK_LINES_PER_CHUNK - FALLBACK_LINES_OVERLAP;
let lines: Vec<&str> = para.text.lines().collect();
let mut i = 0usize;
loop {
let end = (i + FALLBACK_LINES_PER_CHUNK).min(lines.len());
let window_text = lines[i..end].join("\n");
let window_start = para.line_start + i as u32;
let window_end = para.line_start + (end as u32) - 1;
// Use window_start as split_key so chunk_ids are unique across windows.
out.push(build_chunk_no_symbol(
doc,
policy,
&window_text,
window_start,
window_end,
lang,
VERSION_LABEL,
Some(window_start),
));
if end == lines.len() {
break;
}
i += stride;
}
Ok(())
}

View File

@@ -39,17 +39,13 @@ impl Chunker for CodeTsAstV1Chunker {
hex[..POLICY_HASH_HEX_LEN].to_string()
}
fn chunk(
&self,
doc: &CanonicalDocument,
policy: &ChunkPolicy,
) -> anyhow::Result<Vec<Chunk>> {
fn chunk(&self, doc: &CanonicalDocument, policy: &ChunkPolicy) -> anyhow::Result<Vec<Chunk>> {
for b in &doc.blocks {
let c = match b {
Block::Code(c) => c,
_ => anyhow::bail!(
"CodeTsAstV1Chunker only handles code docs (got non-Code block)"
),
_ => {
anyhow::bail!("CodeTsAstV1Chunker only handles code docs (got non-Code block)")
}
};
if !matches!(c.common.source_span, SourceSpan::Code { .. }) {
anyhow::bail!(
@@ -68,9 +64,12 @@ impl Chunker for CodeTsAstV1Chunker {
_ => unreachable!("validated above"),
};
let (ls, le, symbol, lang) = match &cb.common.source_span {
SourceSpan::Code { line_start, line_end, symbol, lang } => {
(*line_start, *line_end, symbol.clone(), lang.clone())
}
SourceSpan::Code {
line_start,
line_end,
symbol,
lang,
} => (*line_start, *line_end, symbol.clone(), lang.clone()),
_ => unreachable!("validated above"),
};
let block_ids: Vec<BlockId> = vec![cb.common.block_id.clone()];
@@ -84,8 +83,13 @@ impl Chunker for CodeTsAstV1Chunker {
lang: lang.clone(),
};
out.push(make_chunk(
doc, &chunker_version, &block_ids, &base_policy_hash,
None, span, cb.code.clone(),
doc,
&chunker_version,
&block_ids,
&base_policy_hash,
None,
span,
cb.code.clone(),
));
} else {
let parts = split_oversize(&cb.code);
@@ -93,9 +97,7 @@ impl Chunker for CodeTsAstV1Chunker {
for (i, (off_start, off_end, text)) in parts.into_iter().enumerate() {
let part_ls = ls + off_start;
let part_le = ls + off_end;
let part_sym = symbol
.as_ref()
.map(|s| format!("{s} [part {}/{n}]", i + 1));
let part_sym = symbol.as_ref().map(|s| format!("{s} [part {}/{n}]", i + 1));
let span = SourceSpan::Code {
line_start: part_ls,
line_end: part_le,
@@ -103,8 +105,13 @@ impl Chunker for CodeTsAstV1Chunker {
lang: lang.clone(),
};
out.push(make_chunk(
doc, &chunker_version, &block_ids, &base_policy_hash,
Some(part_ls), span, text,
doc,
&chunker_version,
&block_ids,
&base_policy_hash,
Some(part_ls),
span,
text,
));
}
}
@@ -140,6 +147,7 @@ fn make_chunk(
chunk_id,
doc_id: DocumentId(doc.doc_id.0.clone()),
block_ids: block_ids.to_vec(),
tokenized_korean_text: crate::tokenize_korean_morphological(&text),
text,
heading_path: Vec::new(),
source_spans: vec![span],
@@ -183,9 +191,9 @@ fn split_oversize(code: &str) -> Vec<(u32, u32, String)> {
mod tests {
use super::*;
use kebab_core::{
Block, CanonicalDocument, ChunkPolicy, Chunker, ChunkerVersion, CodeBlock, CommonBlock,
SourceSpan, id_for_block, id_for_doc, AssetId, Lang, Metadata, ParserVersion, Provenance,
SourceType, TrustLevel, WorkspacePath,
AssetId, Block, CanonicalDocument, ChunkPolicy, Chunker, ChunkerVersion, CodeBlock,
CommonBlock, Lang, Metadata, ParserVersion, Provenance, SourceSpan, SourceType, TrustLevel,
WorkspacePath, id_for_block, id_for_doc,
};
use time::OffsetDateTime;
@@ -206,46 +214,72 @@ mod tests {
};
let bid = id_for_block(&doc_id, "code", &[], i as u32, &span);
Block::Code(CodeBlock {
common: CommonBlock { block_id: bid, heading_path: vec![], source_span: span },
common: CommonBlock {
block_id: bid,
heading_path: vec![],
source_span: span,
},
lang: Some("typescript".into()),
code: (*code).to_string(),
})
})
.collect();
CanonicalDocument {
doc_id, source_asset_id: aid, workspace_path: wp, title: "a".into(),
lang: Lang("und".into()), blocks,
doc_id,
source_asset_id: aid,
workspace_path: wp,
title: "a".into(),
lang: Lang("und".into()),
blocks,
metadata: Metadata {
aliases: vec![], tags: vec![],
aliases: vec![],
tags: vec![],
created_at: OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(),
updated_at: OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(),
source_type: SourceType::Note, trust_level: TrustLevel::Primary,
user_id_alias: None, user: Default::default(),
repo: Some("kebab".into()), git_branch: Some("main".into()),
git_commit: Some("0".repeat(40)), code_lang: Some("typescript".into()),
source_type: SourceType::Note,
trust_level: TrustLevel::Primary,
user_id_alias: None,
user: Default::default(),
repo: Some("kebab".into()),
git_branch: Some("main".into()),
git_commit: Some("0".repeat(40)),
code_lang: Some("typescript".into()),
},
provenance: Provenance { events: vec![] },
parser_version: pv, schema_version: 1, doc_version: 1,
last_chunker_version: None, last_embedding_version: None,
parser_version: pv,
schema_version: 1,
doc_version: 1,
last_chunker_version: None,
last_embedding_version: None,
}
}
fn policy() -> ChunkPolicy {
ChunkPolicy { target_tokens: 500, overlap_tokens: 80,
ChunkPolicy {
target_tokens: 500,
overlap_tokens: 80,
respect_markdown_headings: false,
chunker_version: ChunkerVersion(VERSION_LABEL.into()) }
chunker_version: ChunkerVersion(VERSION_LABEL.into()),
}
}
#[test]
fn chunker_version_is_code_ts_ast_v1() {
assert_eq!(CodeTsAstV1Chunker.chunker_version(),
ChunkerVersion("code-ts-ast-v1".into()));
assert_eq!(
CodeTsAstV1Chunker.chunker_version(),
ChunkerVersion("code-ts-ast-v1".into())
);
}
#[test]
fn one_chunk_per_unit_preserves_code_span() {
let doc = code_doc(&[
("parse", 1, 3, "function parse(): void {\n // x\n}"),
("Foo.double", 5, 7, "function double(): number {\n //\n return 0;\n}"),
(
"Foo.double",
5,
7,
"function double(): number {\n //\n return 0;\n}",
),
]);
let chunks = CodeTsAstV1Chunker.chunk(&doc, &policy()).unwrap();
assert_eq!(chunks.len(), 2);
@@ -256,7 +290,12 @@ mod tests {
assert_eq!(c.chunker_version.0, "code-ts-ast-v1");
}
match &chunks[0].source_spans[0] {
SourceSpan::Code { symbol, line_start, line_end, .. } => {
SourceSpan::Code {
symbol,
line_start,
line_end,
..
} => {
assert_eq!(symbol.as_deref(), Some("parse"));
assert_eq!((*line_start, *line_end), (1, 3));
}
@@ -266,22 +305,33 @@ mod tests {
#[test]
fn oversize_unit_splits_into_parts_with_unique_ids() {
let body = (0..500).map(|i| format!(" const x{i} = {i};")).collect::<Vec<_>>().join("\n");
let body = (0..500)
.map(|i| format!(" const x{i} = {i};"))
.collect::<Vec<_>>()
.join("\n");
let code = format!("function big(): void {{\n{body}\n}}");
let doc = code_doc(&[("big", 1, 502, &code)]);
let chunks = CodeTsAstV1Chunker.chunk(&doc, &policy()).unwrap();
assert!(chunks.len() >= 2, "oversize unit must split, got {}", chunks.len());
assert!(
chunks.len() >= 2,
"oversize unit must split, got {}",
chunks.len()
);
for c in &chunks {
match &c.source_spans[0] {
SourceSpan::Code { symbol, .. } => {
assert!(symbol.as_deref().unwrap().starts_with("big [part "),
"part-numbered symbol, got {symbol:?}");
assert!(
symbol.as_deref().unwrap().starts_with("big [part "),
"part-numbered symbol, got {symbol:?}"
);
}
_ => unreachable!(),
}
}
let mut ids: Vec<&str> = chunks.iter().map(|c| c.chunk_id.0.as_str()).collect();
let n = ids.len(); ids.sort(); ids.dedup();
let n = ids.len();
ids.sort_unstable();
ids.dedup();
assert_eq!(ids.len(), n, "chunk_ids unique across split parts");
}
@@ -295,7 +345,8 @@ mod tests {
heading_path: vec![],
source_span: SourceSpan::Line { start: 1, end: 1 },
},
text: "x".into(), inlines: vec![],
text: "x".into(),
inlines: vec![],
})];
let err = CodeTsAstV1Chunker.chunk(&doc, &policy()).unwrap_err();
assert!(err.to_string().contains("CodeTsAstV1Chunker"));
@@ -304,11 +355,19 @@ mod tests {
#[test]
fn deterministic_chunk_ids_1000() {
let doc = code_doc(&[("parse", 1, 2, "function parse(): void {}\n")]);
let base: Vec<String> = CodeTsAstV1Chunker.chunk(&doc, &policy())
.unwrap().into_iter().map(|c| c.chunk_id.0).collect();
let base: Vec<String> = CodeTsAstV1Chunker
.chunk(&doc, &policy())
.unwrap()
.into_iter()
.map(|c| c.chunk_id.0)
.collect();
for _ in 0..1000 {
let again: Vec<String> = CodeTsAstV1Chunker.chunk(&doc, &policy())
.unwrap().into_iter().map(|c| c.chunk_id.0).collect();
let again: Vec<String> = CodeTsAstV1Chunker
.chunk(&doc, &policy())
.unwrap()
.into_iter()
.map(|c| c.chunk_id.0)
.collect();
assert_eq!(again, base);
}
}
@@ -316,7 +375,9 @@ mod tests {
#[test]
fn policy_hash_matches_md_heading_v1() {
let p = policy();
assert_eq!(CodeTsAstV1Chunker.policy_hash(&p),
crate::MdHeadingV1Chunker.policy_hash(&p));
assert_eq!(
CodeTsAstV1Chunker.policy_hash(&p),
crate::MdHeadingV1Chunker.policy_hash(&p)
);
}
}

View File

@@ -0,0 +1,58 @@
//! p10-2: dockerfile whole-file chunker (Tier 2).
//!
//! Reads entire Dockerfile content and emits a single Chunk with symbol
//! "<dockerfile>", code_lang "dockerfile", line range 1..EOF.
//! Oversize >200 lines splits into line-windows sharing the symbol via
//! tier2_shared::push_chunks_with_oversize.
use crate::tier2_shared::{policy_hash, push_chunks_with_oversize};
use anyhow::Result;
use kebab_core::{Block, CanonicalDocument, Chunk, ChunkPolicy, Chunker, ChunkerVersion};
pub const VERSION_LABEL: &str = "dockerfile-file-v1";
#[derive(Clone, Copy, Debug, Default)]
pub struct DockerfileFileV1Chunker;
impl Chunker for DockerfileFileV1Chunker {
fn chunker_version(&self) -> ChunkerVersion {
ChunkerVersion(VERSION_LABEL.to_string())
}
fn policy_hash(&self, policy: &ChunkPolicy) -> String {
policy_hash(policy)
}
fn chunk(&self, doc: &CanonicalDocument, policy: &ChunkPolicy) -> Result<Vec<Chunk>> {
// Expect a single Block::Code carrying the full Dockerfile text.
let text = match doc.blocks.first() {
Some(Block::Code(cb)) => cb.code.as_str(),
_ => return Ok(vec![]),
};
let total_lines = text.lines().count().max(1) as u32;
let mut chunks = Vec::new();
push_chunks_with_oversize(
&mut chunks,
doc,
policy,
text,
1,
total_lines,
"<dockerfile>",
"dockerfile",
VERSION_LABEL,
None,
)?;
tracing::debug!(
target: "kebab-chunk",
doc_id = %doc.doc_id,
chunks = chunks.len(),
"dockerfile-file-v1 chunked",
);
Ok(chunks)
}
}

View File

@@ -0,0 +1,162 @@
//! p10-2: k8s manifest resource-aware chunker.
//!
//! Splits a multi-document YAML file on `^---\s*$` boundaries, recognises
//! documents that have both `apiVersion` and `kind` string fields as k8s
//! resources, and emits one `Chunk` per resource (with oversize >200-line
//! fallback). Non-k8s documents are skipped; invalid YAML yields 0 chunks
//! for the entire file.
use crate::tier2_shared::{policy_hash, push_chunks_with_oversize};
use anyhow::Result;
use kebab_core::{Block, CanonicalDocument, Chunk, ChunkPolicy, Chunker, ChunkerVersion};
pub const VERSION_LABEL: &str = "k8s-manifest-resource-v1";
#[derive(Clone, Copy, Debug, Default)]
pub struct K8sManifestResourceV1Chunker;
impl Chunker for K8sManifestResourceV1Chunker {
fn chunker_version(&self) -> ChunkerVersion {
ChunkerVersion(VERSION_LABEL.to_string())
}
fn policy_hash(&self, policy: &ChunkPolicy) -> String {
policy_hash(policy)
}
fn chunk(&self, doc: &CanonicalDocument, policy: &ChunkPolicy) -> Result<Vec<Chunk>> {
// Expect a single Block::Code carrying the full YAML text.
let text = match doc.blocks.first() {
Some(Block::Code(cb)) => cb.code.as_str(),
_ => return Ok(vec![]),
};
let slices = split_yaml_documents(text);
let mut chunks: Vec<Chunk> = Vec::new();
for slice in slices {
// Invalid YAML in any document → return 0 chunks for the file.
let value: serde_yaml::Value = match serde_yaml::from_str(slice.text) {
Ok(v) => v,
Err(_) => return Ok(vec![]),
};
let Some(mapping) = value.as_mapping() else {
continue;
};
let api = mapping
.get("apiVersion")
.and_then(|v| v.as_str())
.unwrap_or("");
let kind = mapping.get("kind").and_then(|v| v.as_str()).unwrap_or("");
// Skip non-k8s documents.
if api.is_empty() || kind.is_empty() {
continue;
}
let metadata = mapping.get("metadata").and_then(|v| v.as_mapping());
let name = metadata
.and_then(|m| m.get("name"))
.and_then(|v| v.as_str())
.unwrap_or("<unnamed>");
let namespace = metadata
.and_then(|m| m.get("namespace"))
.and_then(|v| v.as_str());
let symbol = match namespace {
Some(ns) if !ns.is_empty() => format!("{kind}/{ns}/{name}"),
_ => format!("{kind}/{name}"),
};
push_chunks_with_oversize(
&mut chunks,
doc,
policy,
slice.text,
slice.line_start,
slice.line_end,
&symbol,
"yaml",
VERSION_LABEL,
Some(slice.line_start),
)?;
}
tracing::debug!(
target: "kebab-chunk",
doc_id = %doc.doc_id,
chunks = chunks.len(),
"k8s-manifest-resource-v1 chunked",
);
Ok(chunks)
}
}
struct YamlSlice<'a> {
text: &'a str,
line_start: u32,
line_end: u32,
}
/// Split raw YAML text into per-document slices on `---` separator lines.
/// Line numbers are 1-indexed.
fn split_yaml_documents(text: &str) -> Vec<YamlSlice<'_>> {
let lines: Vec<&str> = text.lines().collect();
// Collect indices of separator lines (0-based), then append a sentinel at
// the end so the last slice is always terminated.
let mut separators: Vec<usize> = lines
.iter()
.enumerate()
.filter_map(|(i, l)| {
let trimmed = l.trim_end();
if trimmed == "---" || trimmed.starts_with("--- ") || trimmed.starts_with("---\t") {
Some(i)
} else {
None
}
})
.collect();
separators.push(lines.len());
let mut slices: Vec<YamlSlice<'_>> = Vec::new();
let mut doc_start_line: usize = 0; // 0-based index of current doc start
for sep_line in separators {
if sep_line > doc_start_line {
let start_byte = byte_offset_of_line(text, doc_start_line);
let end_byte = byte_offset_of_line(text, sep_line);
let slice_text = &text[start_byte..end_byte];
if !slice_text.trim().is_empty() {
slices.push(YamlSlice {
text: slice_text,
line_start: (doc_start_line + 1) as u32,
line_end: sep_line as u32,
});
}
}
doc_start_line = sep_line + 1;
}
slices
}
/// Return the byte offset of the start of `line_idx` (0-based line index).
fn byte_offset_of_line(text: &str, line_idx: usize) -> usize {
if line_idx == 0 {
return 0;
}
let mut count = 0usize;
for (i, c) in text.char_indices() {
if c == '\n' {
count += 1;
if count == line_idx {
return i + 1;
}
}
}
text.len()
}

View File

@@ -15,16 +15,107 @@
//! embedder, the retriever, the LLM, the RAG layer, or the UI layers.
//! It consumes `CanonicalDocument` purely through `kb-core` types.
mod code_c_ast_v1;
mod code_cpp_ast_v1;
mod code_go_ast_v1;
mod code_java_ast_v1;
mod code_js_ast_v1;
mod code_kotlin_ast_v1;
mod code_python_ast_v1;
mod code_rust_ast_v1;
pub mod code_text_paragraph_v1;
mod code_ts_ast_v1;
pub mod dockerfile_file_v1;
pub mod k8s_manifest_resource_v1;
pub mod manifest_file_v1;
mod md_heading_v1;
mod pdf_page_v1;
mod tier2_shared;
pub use code_c_ast_v1::CodeCAstV1Chunker;
pub use code_cpp_ast_v1::CodeCppAstV1Chunker;
pub use code_go_ast_v1::CodeGoAstV1Chunker;
pub use code_java_ast_v1::CodeJavaAstV1Chunker;
pub use code_js_ast_v1::CodeJsAstV1Chunker;
pub use code_kotlin_ast_v1::CodeKotlinAstV1Chunker;
pub use code_python_ast_v1::CodePythonAstV1Chunker;
pub use code_rust_ast_v1::CodeRustAstV1Chunker;
pub use code_text_paragraph_v1::CodeTextParagraphV1Chunker;
pub use code_ts_ast_v1::CodeTsAstV1Chunker;
pub use dockerfile_file_v1::DockerfileFileV1Chunker;
pub use k8s_manifest_resource_v1::K8sManifestResourceV1Chunker;
pub use manifest_file_v1::ManifestFileV1Chunker;
pub use md_heading_v1::MdHeadingV1Chunker;
pub use pdf_page_v1::PdfPageV1Chunker;
// ── Korean morphological tokenizer ───────────────────────────────────────────
use lindera::dictionary::{DictionaryKind, load_embedded_dictionary};
use lindera::mode::Mode;
use lindera::segmenter::Segmenter;
use lindera::tokenizer::Tokenizer;
static KOREAN_TOKENIZER: std::sync::OnceLock<Option<Tokenizer>> = std::sync::OnceLock::new();
/// 한 codepoint 가 한글 음절 또는 자모인지 판정 — N-gram supplement 의 emit 대상 필터링.
fn is_hangul(c: char) -> bool {
matches!(
c,
'\u{AC00}'..='\u{D7A3}' // 한글 음절 (precomposed)
| '\u{1100}'..='\u{11FF}' // 한글 자모
| '\u{3130}'..='\u{318F}' // 한글 호환 자모
)
}
/// 한국어 chunk text 를 lindera ko-dic 으로 형태소 분해해 공백 join 한 결과를 반환.
/// chunker 들이 `Chunk.tokenized_korean_text` pre-fill 에 사용.
/// 분석 실패 시 None — 호출자는 NULL fallback 처리.
/// Tokenizer 는 OnceLock 으로 1회 초기화; dict load 실패 시 영구 None.
///
/// v0.21.0 — N-gram supplement (Option β, post-v0.20.1 enhancement).
/// ko-dic 가 compound noun (`한국정부`, `서울특별시` 등) 을 단일 token 으로
/// 저장하는 정책 의 한계 해소 — morpheme 길이 ≥ 3 인 한글 token 에 대해
/// 2-char sliding window n-gram 도 추가 emit. `'한국정부'` morpheme →
/// `[한국정부, 한국, 국정, 정부]` 의 4 token 으로 expand. 사용자 의 2-char
/// query (`'한국'`) 가 compound chunk 에서도 hit. 영어/숫자 token 은 영향
/// 없음 (is_hangul filter). DB size + ingest latency 의 trade-off 는
/// HOTFIXES 2026-05-28 의 "N-gram supplement (Option β)" 보강 entry.
pub fn tokenize_korean_morphological(text: &str) -> Option<String> {
if text.trim().is_empty() {
return None;
}
let tokenizer = KOREAN_TOKENIZER.get_or_init(|| {
let dict = match load_embedded_dictionary(DictionaryKind::KoDic) {
Ok(d) => d,
Err(e) => {
tracing::warn!(target: "kebab-chunk", "tokenize_korean_morphological: dict load failed: {e}");
return None;
}
};
let segmenter = Segmenter::new(Mode::Normal, dict, None);
Some(Tokenizer::new(segmenter))
});
let tokenizer = tokenizer.as_ref()?;
let tokens = tokenizer.tokenize(text).ok()?;
let mut out_tokens: Vec<String> = Vec::with_capacity(tokens.len() * 2);
for tok in tokens.iter() {
let surface = tok.surface.as_ref();
out_tokens.push(surface.to_string());
// N-gram supplement: 한글 morpheme 의 2-char sliding window.
let chars: Vec<char> = surface.chars().collect();
if chars.len() >= 3 && chars.iter().all(|c| is_hangul(*c)) {
for window in chars.windows(2) {
out_tokens.push(window.iter().collect());
}
}
}
let joined = out_tokens.join(" ");
if joined.is_empty() {
None
} else {
Some(joined)
}
}

View File

@@ -0,0 +1,59 @@
//! p10-2: manifest whole-file chunker (Tier 2).
//!
//! Reads entire manifest file (Cargo.toml / package.json / pom.xml / go.mod /
//! build.gradle / pyproject.toml / tsconfig.json) and emits a single Chunk
//! with symbol "<manifest>", code_lang read from Block::Code.lang, line range
//! 1..EOF. Oversize >200 lines splits into line-windows sharing the symbol via
//! tier2_shared::push_chunks_with_oversize.
use crate::tier2_shared::{policy_hash, push_chunks_with_oversize};
use anyhow::Result;
use kebab_core::{Block, CanonicalDocument, Chunk, ChunkPolicy, Chunker, ChunkerVersion};
pub const VERSION_LABEL: &str = "manifest-file-v1";
#[derive(Clone, Copy, Debug, Default)]
pub struct ManifestFileV1Chunker;
impl Chunker for ManifestFileV1Chunker {
fn chunker_version(&self) -> ChunkerVersion {
ChunkerVersion(VERSION_LABEL.to_string())
}
fn policy_hash(&self, policy: &ChunkPolicy) -> String {
policy_hash(policy)
}
fn chunk(&self, doc: &CanonicalDocument, policy: &ChunkPolicy) -> Result<Vec<Chunk>> {
// Expect a single Block::Code carrying the full manifest text.
let (text, lang) = match doc.blocks.first() {
Some(Block::Code(cb)) => (cb.code.as_str(), cb.lang.as_deref().unwrap_or("")),
_ => return Ok(vec![]),
};
let total_lines = text.lines().count().max(1) as u32;
let mut chunks = Vec::new();
push_chunks_with_oversize(
&mut chunks,
doc,
policy,
text,
1,
total_lines,
"<manifest>",
lang,
VERSION_LABEL,
None,
)?;
tracing::debug!(
target: "kebab-chunk",
doc_id = %doc.doc_id,
chunks = chunks.len(),
"manifest-file-v1 chunked",
);
Ok(chunks)
}
}

View File

@@ -1,8 +1,8 @@
//! `md-heading-v1` — heading-aware Markdown chunker.
use kebab_core::{
Block, BlockId, CanonicalDocument, Chunk, ChunkPolicy, Chunker,
ChunkerVersion, DocumentId, SourceSpan, id_for_chunk,
Block, BlockId, CanonicalDocument, Chunk, ChunkPolicy, Chunker, ChunkerVersion, DocumentId,
SourceSpan, id_for_chunk,
};
/// Version label emitted by [`MdHeadingV1Chunker`]. Bumping this label
@@ -99,11 +99,7 @@ impl Chunker for MdHeadingV1Chunker {
hex[..POLICY_HASH_HEX_LEN].to_string()
}
fn chunk(
&self,
doc: &CanonicalDocument,
policy: &ChunkPolicy,
) -> anyhow::Result<Vec<Chunk>> {
fn chunk(&self, doc: &CanonicalDocument, policy: &ChunkPolicy) -> anyhow::Result<Vec<Chunk>> {
let policy_hash = self.policy_hash(policy);
let chunker_version = self.chunker_version();
let mut out: Vec<Chunk> = Vec::new();
@@ -152,22 +148,12 @@ impl Chunker for MdHeadingV1Chunker {
// `collect_overlap_seed` keeps seed ≤ target/2, so
// a flush here never produces a chunk smaller than
// the seed budget.
let would_exceed = acc.text_tokens + next_tokens
> policy.target_tokens
let would_exceed = acc.text_tokens + next_tokens > policy.target_tokens
&& acc.has_non_heading_content();
if would_exceed {
let overlap_seed = collect_overlap_seed(
&acc,
policy.overlap_tokens,
policy.target_tokens,
);
flush(
&mut acc,
doc,
&chunker_version,
&policy_hash,
&mut out,
);
let overlap_seed =
collect_overlap_seed(&acc, policy.overlap_tokens, policy.target_tokens);
flush(&mut acc, doc, &chunker_version, &policy_hash, &mut out);
// Seed next accumulator with the prior chunk's
// tail blocks (paragraph-level overlap). The
// heading is *not* re-included here — it lives
@@ -292,10 +278,11 @@ fn build_chunk(
) -> Chunk {
debug_assert!(!blocks.is_empty(), "build_chunk requires ≥1 block");
let block_ids: Vec<BlockId> =
blocks.iter().map(|b| common(b).block_id.clone()).collect();
let source_spans: Vec<SourceSpan> =
blocks.iter().map(|b| common(b).source_span.clone()).collect();
let block_ids: Vec<BlockId> = blocks.iter().map(|b| common(b).block_id.clone()).collect();
let source_spans: Vec<SourceSpan> = blocks
.iter()
.map(|b| common(b).source_span.clone())
.collect();
// heading_path: pick the first non-Heading block's heading_path
// (which already includes every parent heading per kb-normalize).
@@ -339,17 +326,13 @@ fn build_chunk(
text.len().div_ceil(BYTES_PER_TOKEN)
};
let chunk_id = id_for_chunk(
&doc.doc_id,
chunker_version,
&block_ids,
policy_hash,
);
let chunk_id = id_for_chunk(&doc.doc_id, chunker_version, &block_ids, policy_hash);
Chunk {
chunk_id,
doc_id: DocumentId(doc.doc_id.0.clone()),
block_ids,
tokenized_korean_text: crate::tokenize_korean_morphological(&text),
text,
heading_path,
source_spans,
@@ -387,9 +370,7 @@ fn render_block_text(b: &Block) -> String {
// alt keeps lexical search hits on filenames working even when
// P6-1's filename auto-fill is bypassed.
Block::ImageRef(i) => {
let alt = if !i.alt.is_empty() {
i.alt.clone()
} else {
let alt = if i.alt.is_empty() {
// P6-1 falls back to filename so this branch is
// defensive — keep it lest a future test fixture or
// synthetic block path skip the auto-fill.
@@ -399,17 +380,11 @@ fn render_block_text(b: &Block) -> String {
.filter(|s| !s.is_empty())
.unwrap_or("[image]")
.to_string()
} else {
i.alt.clone()
};
let ocr = i
.ocr
.as_ref()
.map(|o| o.joined.as_str())
.unwrap_or("");
let cap = i
.caption
.as_ref()
.map(|c| c.text.as_str())
.unwrap_or("");
let ocr = i.ocr.as_ref().map_or("", |o| o.joined.as_str());
let cap = i.caption.as_ref().map_or("", |c| c.text.as_str());
[alt.as_str(), ocr, cap]
.iter()
.filter(|s| !s.is_empty())
@@ -449,9 +424,8 @@ fn common(b: &Block) -> &kebab_core::CommonBlock {
mod tests {
use super::*;
use kebab_core::{
AssetId, CodeBlock, CommonBlock, HeadingBlock, ImageRefBlock, Lang,
Metadata, Provenance, SourceType, TableBlock, TextBlock, TrustLevel,
WorkspacePath, id_for_block,
AssetId, CodeBlock, CommonBlock, HeadingBlock, ImageRefBlock, Lang, Metadata, Provenance,
SourceType, TableBlock, TextBlock, TrustLevel, WorkspacePath, id_for_block,
};
use time::OffsetDateTime;
@@ -494,12 +468,7 @@ mod tests {
SourceSpan::Line { start, end }
}
fn common_for(
kind: &str,
heading_path: &[String],
ordinal: u32,
s: SourceSpan,
) -> CommonBlock {
fn common_for(kind: &str, heading_path: &[String], ordinal: u32, s: SourceSpan) -> CommonBlock {
CommonBlock {
block_id: id_for_block(&doc_id(), kind, heading_path, ordinal, &s),
heading_path: heading_path.to_vec(),
@@ -534,12 +503,7 @@ mod tests {
})
}
fn paragraph(
text: &str,
heading_path: &[&str],
ordinal: u32,
line: u32,
) -> Block {
fn paragraph(text: &str, heading_path: &[&str], ordinal: u32, line: u32) -> Block {
let hp: Vec<String> = heading_path.iter().map(|s| (*s).into()).collect();
Block::Paragraph(TextBlock {
common: common_for("paragraph", &hp, ordinal, span(line, line)),
@@ -548,12 +512,7 @@ mod tests {
})
}
fn code_block(
code: &str,
heading_path: &[&str],
ordinal: u32,
s: SourceSpan,
) -> Block {
fn code_block(code: &str, heading_path: &[&str], ordinal: u32, s: SourceSpan) -> Block {
let hp: Vec<String> = heading_path.iter().map(|s| (*s).into()).collect();
Block::Code(CodeBlock {
common: common_for("code", &hp, ordinal, s),
@@ -580,12 +539,7 @@ mod tests {
})
}
fn image_ref(
alt: &str,
heading_path: &[&str],
ordinal: u32,
line: u32,
) -> Block {
fn image_ref(alt: &str, heading_path: &[&str], ordinal: u32, line: u32) -> Block {
let hp: Vec<String> = heading_path.iter().map(|s| (*s).into()).collect();
Block::ImageRef(ImageRefBlock {
common: common_for("imageref", &hp, ordinal, span(line, line)),

View File

@@ -53,18 +53,21 @@
//! one chunk per atomic block. PdfPageV1 cannot.
//!
//! Workaround that doesn't change the §4.2 recipe: feed a per-chunk
//! variant `format!("{base_policy_hash}#c{char_start}")` into the
//! recipe's `policy_hash` slot (so distinct chunks distinguish via
//! different policy_hash inputs), while storing the unmodified
//! `base_policy_hash` in `Chunk.policy_hash` so the field still answers
//! "what policy was active". Logged in `tasks/HOTFIXES.md`.
//! variant `format!("{base_policy_hash}#c{segment_start}")` into the
//! recipe's `policy_hash` slot. `segment_start` is the pre-overlap
//! segment boundary, strictly increasing across the returned chunks
//! even when the overlap walk collapses `actual_start` to a previous
//! chunk's `prev_min`. Unmodified `base_policy_hash` is stored in
//! `Chunk.policy_hash` so the field still answers "what policy was
//! active". v1.1 second-iteration patch — logged in
//! `tasks/HOTFIXES.md` (2026-05-27).
use kebab_core::{
Block, BlockId, CanonicalDocument, Chunk, ChunkPolicy, Chunker, ChunkerVersion, DocumentId,
SourceSpan, id_for_chunk,
};
const VERSION_LABEL: &str = "pdf-page-v1";
const VERSION_LABEL: &str = "pdf-page-v1.1";
const BYTES_PER_TOKEN: usize = 3;
const POLICY_HASH_HEX_LEN: usize = 16;
@@ -89,11 +92,7 @@ impl Chunker for PdfPageV1Chunker {
hex[..POLICY_HASH_HEX_LEN].to_string()
}
fn chunk(
&self,
doc: &CanonicalDocument,
policy: &ChunkPolicy,
) -> anyhow::Result<Vec<Chunk>> {
fn chunk(&self, doc: &CanonicalDocument, policy: &ChunkPolicy) -> anyhow::Result<Vec<Chunk>> {
// Validate up front — every block must be a Paragraph carrying
// SourceSpan::Page. A mixed document signals a routing bug in
// the caller (e.g. running this chunker on Markdown) and is
@@ -106,18 +105,13 @@ impl Chunker for PdfPageV1Chunker {
),
};
if !matches!(common.source_span, SourceSpan::Page { .. }) {
anyhow::bail!(
"PdfPageV1Chunker only handles PDF docs (got non-Page source_span)"
);
anyhow::bail!("PdfPageV1Chunker only handles PDF docs (got non-Page source_span)");
}
}
let base_policy_hash = self.policy_hash(policy);
let chunker_version = self.chunker_version();
let target_bytes = policy
.target_tokens
.saturating_mul(BYTES_PER_TOKEN)
.max(1);
let target_bytes = policy.target_tokens.saturating_mul(BYTES_PER_TOKEN).max(1);
// Clamp the overlap to half the target. Without this, a policy
// with `overlap_tokens >= target_tokens` would make every chunk
// fully re-emit the previous chunk's text — mirrors
@@ -146,7 +140,7 @@ impl Chunker for PdfPageV1Chunker {
continue;
}
for (char_start, char_end, slice) in
for (segment_start, char_start, char_end, slice) in
chunk_page(&p.text, target_bytes, overlap_bytes)
{
// PDF chars-per-page comfortably fits in u32 (a single
@@ -154,20 +148,20 @@ impl Chunker for PdfPageV1Chunker {
// typography); silent `as u32` truncation would only
// surface on corrupted input, where an explicit panic
// is preferable to an off-by-2^32 span.
let char_start_u32 = u32::try_from(char_start)
.expect("page chars fit in u32");
let char_end_u32 =
u32::try_from(char_end).expect("page chars fit in u32");
let char_start_u32 = u32::try_from(char_start).expect("page chars fit in u32");
let char_end_u32 = u32::try_from(char_end).expect("page chars fit in u32");
let span = SourceSpan::Page {
page: page_num,
char_start: Some(char_start_u32),
char_end: Some(char_end_u32),
};
let block_ids: Vec<BlockId> = vec![p.common.block_id.clone()];
// Per-chunk policy_hash variant prevents chunk_id
// collision when a page produces multiple chunks. See
// module docs for rationale.
let per_chunk_hash = format!("{base_policy_hash}#c{char_start}");
// v0.20.0 sub-item 1 bugfix (#3): per-chunk policy_hash
// variant uses `segment_start` (pre-overlap boundary,
// strictly increasing) instead of `char_start` (post-
// overlap, may collapse to prev_min). See module docs +
// spec §4.1 root cause + HOTFIXES.md 2026-05-27.
let per_chunk_hash = format!("{base_policy_hash}#c{segment_start}");
let chunk_id =
id_for_chunk(&doc.doc_id, &chunker_version, &block_ids, &per_chunk_hash);
let token_estimate = slice.len().div_ceil(BYTES_PER_TOKEN);
@@ -176,6 +170,7 @@ impl Chunker for PdfPageV1Chunker {
chunk_id,
doc_id: DocumentId(doc.doc_id.0.clone()),
block_ids,
tokenized_korean_text: crate::tokenize_korean_morphological(&slice),
text: slice,
heading_path: Vec::new(),
source_spans: vec![span],
@@ -198,18 +193,28 @@ impl Chunker for PdfPageV1Chunker {
}
/// Split a single page's text into ordered chunks, each represented as
/// `(char_start, char_end, text_slice)`. Char positions are within the
/// page text, suitable for `SourceSpan::Page::char_start` / `char_end`.
/// `(segment_start, actual_start, chunk_end, text_slice)`.
///
/// - `segment_start` = pre-overlap segment boundary. Strictly increasing
/// across the returned vec. Use this for chunk_id uniqueness suffixes.
/// - `actual_start` = post-overlap start char index. May collapse to a
/// previous chunk's `actual_start` under aggressive overlap policy.
/// Use this for `SourceSpan::Page::char_start`.
/// - `chunk_end` = chunk's end char index (exclusive).
///
/// Returns an empty vector when `text` is empty or whitespace-only.
fn chunk_page(text: &str, target_bytes: usize, overlap_bytes: usize) -> Vec<(usize, usize, String)> {
fn chunk_page(
text: &str,
target_bytes: usize,
overlap_bytes: usize,
) -> Vec<(usize, usize, usize, String)> {
let chars: Vec<char> = text.chars().collect();
let n = chars.len();
if n == 0 {
return Vec::new();
}
if text.len() <= target_bytes {
return vec![(0, n, text.to_string())];
return vec![(0, 0, n, text.to_string())];
}
// Build candidate boundary positions (char indices where a chunk
@@ -222,8 +227,7 @@ fn chunk_page(text: &str, target_bytes: usize, overlap_bytes: usize) -> Vec<(usi
let c = chars[k];
let nx = chars[k + 1];
let is_paragraph_break = c == '\n' && nx == '\n';
let is_sentence_end =
matches!(c, '.' | '?' | '!') && nx.is_whitespace();
let is_sentence_end = matches!(c, '.' | '?' | '!') && nx.is_whitespace();
if (is_paragraph_break || is_sentence_end) && k + 2 <= n {
bounds.push(k + 2);
}
@@ -235,11 +239,9 @@ fn chunk_page(text: &str, target_bytes: usize, overlap_bytes: usize) -> Vec<(usi
bounds.dedup();
// UTF-8 byte length of the slice between two char indices.
let byte_len = |a: usize, b: usize| -> usize {
chars[a..b].iter().map(|c| c.len_utf8()).sum()
};
let byte_len = |a: usize, b: usize| -> usize { chars[a..b].iter().map(|c| c.len_utf8()).sum() };
let mut chunks: Vec<(usize, usize, String)> = Vec::new();
let mut chunks: Vec<(usize, usize, usize, String)> = Vec::new();
let mut seg_idx: usize = 0;
while seg_idx + 1 < bounds.len() {
let start = bounds[seg_idx];
@@ -264,7 +266,9 @@ fn chunk_page(text: &str, target_bytes: usize, overlap_bytes: usize) -> Vec<(usi
// have absorbed up to `overlap_bytes` of bytes, but never past
// the previous chunk's start (no full re-emission).
let actual_start = if let Some(prev) = chunks.last() {
let prev_min = prev.0;
// prev tuple shape = (segment_start, actual_start, chunk_end, slice).
// overlap walk floor = previous chunk's actual_start (prev.1).
let prev_min = prev.1;
let mut a = start;
let mut acc_o: usize = 0;
while a > prev_min {
@@ -281,7 +285,7 @@ fn chunk_page(text: &str, target_bytes: usize, overlap_bytes: usize) -> Vec<(usi
};
let slice: String = chars[actual_start..chunk_end].iter().collect();
chunks.push((actual_start, chunk_end, slice));
chunks.push((start, actual_start, chunk_end, slice));
seg_idx = end_idx;
}
@@ -390,7 +394,11 @@ mod tests {
assert_eq!(c.heading_path, Vec::<String>::new());
assert_eq!(c.source_spans.len(), 1);
match c.source_spans[0] {
SourceSpan::Page { page, char_start, char_end } => {
SourceSpan::Page {
page,
char_start,
char_end,
} => {
assert_eq!(page, (i as u32) + 1);
assert_eq!(char_start, Some(0));
assert!(char_end.unwrap() > 0);
@@ -435,11 +443,16 @@ mod tests {
// N-1's char_end).
for w in chunks.windows(2) {
let prev_end = match w[0].source_spans[0] {
SourceSpan::Page { char_end: Some(e), .. } => e,
SourceSpan::Page {
char_end: Some(e), ..
} => e,
_ => panic!("missing char_end"),
};
let next_start = match w[1].source_spans[0] {
SourceSpan::Page { char_start: Some(s), .. } => s,
SourceSpan::Page {
char_start: Some(s),
..
} => s,
_ => panic!("missing char_start"),
};
assert!(
@@ -450,7 +463,7 @@ mod tests {
// chunk_ids stay distinct despite identical block_ids — the
// per-chunk policy_hash variant is doing its job.
let mut ids: Vec<&str> = chunks.iter().map(|c| c.chunk_id.0.as_str()).collect();
ids.sort();
ids.sort_unstable();
let total = ids.len();
ids.dedup();
assert_eq!(ids.len(), total, "all chunk_ids must be unique");
@@ -653,11 +666,17 @@ mod tests {
// overlap) is the failure mode.
for w in chunks.windows(2) {
let prev_start = match w[0].source_spans[0] {
SourceSpan::Page { char_start: Some(s), .. } => s,
SourceSpan::Page {
char_start: Some(s),
..
} => s,
_ => panic!("missing char_start"),
};
let next_start = match w[1].source_spans[0] {
SourceSpan::Page { char_start: Some(s), .. } => s,
SourceSpan::Page {
char_start: Some(s),
..
} => s,
_ => panic!("missing char_start"),
};
assert!(
@@ -668,12 +687,49 @@ mod tests {
// chunk_ids stay distinct (the per-chunk hash variant keys off
// char_start which is now strictly increasing).
let mut ids: Vec<&str> = chunks.iter().map(|c| c.chunk_id.0.as_str()).collect();
ids.sort();
ids.sort_unstable();
let total = ids.len();
ids.dedup();
assert_eq!(ids.len(), total, "chunk_ids must remain unique");
}
#[test]
fn multi_chunk_page_with_aggressive_overlap_produces_unique_chunk_ids() {
// 한국어 OCR text 의 trigger shape: 10 char "가" + ". " + 500 char "나".
// → first segment [0, 12), second segment [12, n).
// page_text byte_len = 10*3 + 2 + 500*3 = 1532 > target_bytes=1500
// → multi-chunk. overlap_bytes = min(240, 750) = 240 chars=80
// → second chunk 의 actual_start 가 prev_min=0 collapse → same `#c0`.
//
// default_policy(500, 80) — target_tokens=500 → target_bytes=500*3=1500
// (한국어 3byte/char 환산), overlap_tokens=80 → overlap_bytes=min(240, 750)=240.
// verifier round 1 L-3 보강.
let early_seg = "".repeat(10);
let tail = "".repeat(500);
let page_text = format!("{early_seg}. {tail}");
let doc = make_pdf_doc(&[&page_text]);
let policy = default_policy(500, 80); // target=1500 byte, overlap=240 byte
let chunks = PdfPageV1Chunker.chunk(&doc, &policy).unwrap();
assert!(
chunks.len() >= 2,
"expected ≥2 chunks for {} byte page; got {}",
page_text.len(),
chunks.len()
);
let mut ids: Vec<&str> = chunks.iter().map(|c| c.chunk_id.0.as_str()).collect();
ids.sort_unstable();
let total = ids.len();
ids.dedup();
assert_eq!(
ids.len(),
total,
"all chunk_ids must be unique even when overlap walks actual_start back to prev_min"
);
}
#[test]
fn policy_hash_matches_md_heading_v1_for_identical_policy() {
// Cross-chunker policy fingerprint identity — important so a

View File

@@ -0,0 +1,200 @@
//! p10-2: Tier 2 chunker shared helpers (oversize fallback + Chunk build).
//!
//! Mirrors `code_rust_ast_v1`'s Chunk-construction pattern exactly so that
//! id / hashes / token-count / ChunkPolicy semantics stay identical across
//! Tier 1 (AST) and Tier 2 (resource-aware) chunkers.
use anyhow::Result;
use kebab_core::{
BlockId, CanonicalDocument, Chunk, ChunkPolicy, ChunkerVersion, DocumentId, SourceSpan,
id_for_chunk,
};
pub(crate) const AST_CHUNK_MAX_LINES: u32 = 200;
const BYTES_PER_TOKEN: usize = 3;
const POLICY_HASH_HEX_LEN: usize = 16;
/// Compute the policy hash the same way `code_rust_ast_v1` does.
pub(crate) fn policy_hash(policy: &ChunkPolicy) -> String {
let bytes = serde_json_canonicalizer::to_vec(policy)
.expect("canonical JSON serialization of ChunkPolicy must not fail");
let hex = blake3::hash(&bytes).to_hex().to_string();
hex[..POLICY_HASH_HEX_LEN].to_string()
}
/// Emit one chunk for `(text, line_start..=line_end, symbol, lang)`, splitting
/// into line-windows of at most `AST_CHUNK_MAX_LINES` if the slice is oversize.
/// Mirrors the oversize path in `code_rust_ast_v1`'s `chunk` impl.
///
/// `base_split_key` is used as the `split_key` for the non-oversize single-chunk
/// case. Callers that emit multiple chunks from the same document (e.g.
/// `K8sManifestResourceV1Chunker` — one call per k8s resource) MUST pass
/// `Some(line_start)` so that each call produces a distinct `chunk_id`.
/// Single-chunk callers (dockerfile-file-v1, manifest-file-v1) pass `None` to
/// keep chunk_ids stable (no sibling can collide when there's only one chunk).
#[allow(clippy::too_many_arguments)]
pub(crate) fn push_chunks_with_oversize(
out: &mut Vec<Chunk>,
doc: &CanonicalDocument,
policy: &ChunkPolicy,
text: &str,
line_start: u32,
line_end: u32,
symbol: &str,
lang: &str,
chunker_version: &str,
base_split_key: Option<u32>,
) -> Result<()> {
let n_lines = (line_end - line_start + 1).max(1);
let cv = ChunkerVersion(chunker_version.to_string());
let base_policy_hash = policy_hash(policy);
if n_lines <= AST_CHUNK_MAX_LINES {
out.push(build_chunk(
doc,
&cv,
&base_policy_hash,
text,
line_start,
line_end,
symbol,
lang,
base_split_key,
));
return Ok(());
}
let lines: Vec<&str> = text.lines().collect();
let total = lines.len();
let mut window_start = line_start;
let mut i = 0usize;
while i < total {
let take = (AST_CHUNK_MAX_LINES as usize).min(total - i);
let window_text = lines[i..i + take].join("\n");
let window_end = window_start + take as u32 - 1;
out.push(build_chunk(
doc,
&cv,
&base_policy_hash,
&window_text,
window_start,
window_end,
symbol,
lang,
Some(window_start),
));
i += take;
window_start = window_end + 1;
}
Ok(())
}
/// Build a single `Chunk`, mirroring `make_chunk` in `code_rust_ast_v1.rs`
/// exactly (same id recipe, same token estimate, same field set).
///
/// `split_key` is `Some(line_start_of_window)` for oversize splits, `None`
/// for normal single-chunk emission. Mirrors the `Some(part_ls)` / `None`
/// split_key pattern in 1A-2.
#[allow(clippy::too_many_arguments)]
pub(crate) fn build_chunk(
doc: &CanonicalDocument,
chunker_version: &ChunkerVersion,
base_policy_hash: &str,
text: &str,
line_start: u32,
line_end: u32,
symbol: &str,
lang: &str,
split_key: Option<u32>,
) -> Chunk {
let span = SourceSpan::Code {
line_start,
line_end,
symbol: Some(symbol.to_string()),
lang: Some(lang.to_string()),
};
build_chunk_from_span(
doc,
chunker_version,
base_policy_hash,
text,
span,
split_key,
)
}
/// Like `build_chunk` but emits `symbol: None`. Used by Tier 3 (per spec §9.3).
///
/// Accepts `policy: &ChunkPolicy` and `chunker_version: &str` (string slice)
/// so callers don't need to pre-compute the hash and version wrapper.
/// `split_key` is `Some(window_start)` for oversize line-window splits.
#[allow(clippy::too_many_arguments)]
pub(crate) fn build_chunk_no_symbol(
doc: &CanonicalDocument,
policy: &ChunkPolicy,
text: &str,
line_start: u32,
line_end: u32,
lang: &str,
chunker_version: &str,
split_key: Option<u32>,
) -> Chunk {
let cv = ChunkerVersion(chunker_version.to_string());
let base_policy_hash = policy_hash(policy);
let span = SourceSpan::Code {
line_start,
line_end,
symbol: None,
lang: Some(lang.to_string()),
};
build_chunk_from_span(doc, &cv, &base_policy_hash, text, span, split_key)
}
/// Core chunk-building logic shared by `build_chunk` and `build_chunk_no_symbol`.
///
/// Takes a pre-built `SourceSpan` so the only difference between the two
/// public helpers is whether `symbol` is `Some` or `None`. All id/hash/
/// token mechanics are identical.
fn build_chunk_from_span(
doc: &CanonicalDocument,
chunker_version: &ChunkerVersion,
base_policy_hash: &str,
text: &str,
span: SourceSpan,
split_key: Option<u32>,
) -> Chunk {
// id_hash mirrors code_rust_ast_v1's make_chunk logic:
// split_key Some(k) => "{base_policy_hash}#L{k}"
// split_key None => base_policy_hash
let id_hash = match split_key {
Some(k) => format!("{base_policy_hash}#L{k}"),
None => base_policy_hash.to_string(),
};
// block_ids: Tier 2/3 chunkers have no per-block structure (the whole file
// is one Block::Code), so we pass an empty slice — same as using the doc-
// level slice without explicit block granularity.
let block_ids: Vec<BlockId> = vec![];
let chunk_id = id_for_chunk(
&DocumentId(doc.doc_id.0.clone()),
chunker_version,
&block_ids,
&id_hash,
);
let token_estimate = text.len().div_ceil(BYTES_PER_TOKEN);
Chunk {
chunk_id,
doc_id: DocumentId(doc.doc_id.0.clone()),
block_ids,
tokenized_korean_text: crate::tokenize_korean_morphological(text),
text: text.to_string(),
heading_path: Vec::new(),
source_spans: vec![span],
token_estimate,
chunker_version: chunker_version.clone(),
policy_hash: base_policy_hash.to_string(),
}
}

View File

@@ -0,0 +1,196 @@
//! Snapshot test pinning the `Vec<Chunk>` JSON for a
//! representative C code `CanonicalDocument`.
//!
//! This is an integration test. `kebab-parse-code` is intentionally NOT
//! a dev-dep (design §6.3 / §8 boundary: AST extraction is parser-side).
//! The `CanonicalDocument` is built inline from hand-crafted `Block::Code`
//! units, which is the same pattern used in `code_go_ast_v1.rs`'s
//! internal `code_doc` test helper.
//!
//! Set `UPDATE_SNAPSHOTS=1` to re-bake the baseline.
use std::path::PathBuf;
use kebab_chunk::CodeCAstV1Chunker;
use kebab_core::{
AssetId, Block, CanonicalDocument, ChunkPolicy, Chunker, ChunkerVersion, CodeBlock,
CommonBlock, Lang, Metadata, ParserVersion, Provenance, SourceSpan, SourceType, TrustLevel,
WorkspacePath, id_for_block, id_for_doc,
};
use serde_json::Value;
use time::OffsetDateTime;
fn fixtures_dir() -> PathBuf {
PathBuf::from(env!("CARGO_MANIFEST_DIR"))
.join("tests")
.join("fixtures")
}
fn fixed_doc() -> CanonicalDocument {
let wp = WorkspacePath("projects/record.c".into());
let aid = AssetId("c".repeat(64));
// Pin parser_version so doc_id / block_ids are reproducible.
let pv = ParserVersion("code-c-v1".into());
let doc_id = id_for_doc(&wp, &aid, &pv);
// Representative units:
// 0. imports + defines (lines 14, ≤200)
// 1. status_t enum typedef (lines 69, ≤200)
// 2. record_t struct typedef (lines 1116, ≤200)
// 3. static counter decl glue (line 18, ≤200)
// 4. parse_record fn (lines 2023, ≤200)
// 5. print_record fn (lines 2527, ≤200)
// 6. main fn (lines 2933, ≤200)
let raw_units: Vec<(&str, u32, u32, String)> = vec![
(
"<top-level>",
1,
18,
"#include <stdio.h>\n#include <stdlib.h>\n\n#define MAX_BUF 4096\n\ntypedef enum {\n OK = 0,\n ERR_PARSE,\n ERR_IO,\n} status_t;\n\ntypedef struct {\n int id;\n char name[64];\n status_t status;\n} record_t;\n\nstatic int counter = 0;".to_string(),
),
(
"parse_record",
20,
23,
"int parse_record(const char *line, record_t *out) {\n if (line == NULL || out == NULL) return ERR_PARSE;\n return OK;\n}".to_string(),
),
(
"print_record",
25,
27,
"void print_record(const record_t *r) {\n printf(\"[%d] %s (status=%d)\\n\", r->id, r->name, r->status);\n}".to_string(),
),
(
"main",
29,
33,
"int main(void) {\n record_t r = { .id = 1, .name = \"foo\", .status = OK };\n print_record(&r);\n return 0;\n}".to_string(),
),
];
let blocks: Vec<Block> = raw_units
.iter()
.enumerate()
.map(|(i, (sym, ls, le, code))| {
let span = SourceSpan::Code {
line_start: *ls,
line_end: *le,
symbol: Some((*sym).to_string()),
lang: Some("c".into()),
};
let bid = id_for_block(&doc_id, "code", &[], i as u32, &span);
Block::Code(CodeBlock {
common: CommonBlock {
block_id: bid,
heading_path: vec![],
source_span: span,
},
lang: Some("c".into()),
code: code.clone(),
})
})
.collect();
CanonicalDocument {
doc_id,
source_asset_id: aid,
workspace_path: wp,
title: "record.c".into(),
lang: Lang("und".into()),
blocks,
metadata: Metadata {
aliases: vec![],
tags: vec![],
created_at: OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(),
updated_at: OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(),
source_type: SourceType::Note,
trust_level: TrustLevel::Primary,
user_id_alias: None,
user: Default::default(),
repo: Some("kebab".into()),
git_branch: Some("main".into()),
git_commit: Some("0".repeat(40)),
code_lang: Some("c".into()),
},
provenance: Provenance { events: vec![] },
parser_version: pv,
schema_version: 1,
doc_version: 1,
last_chunker_version: None,
last_embedding_version: None,
}
}
fn fixed_policy() -> ChunkPolicy {
ChunkPolicy {
target_tokens: 500,
overlap_tokens: 80,
respect_markdown_headings: false,
chunker_version: ChunkerVersion("code-c-ast-v1".into()),
}
}
#[test]
fn code_c_ast_chunks_snapshot() {
let doc = fixed_doc();
let policy = fixed_policy();
let chunks = CodeCAstV1Chunker.chunk(&doc, &policy).expect("chunk");
let actual = serde_json::to_value(&chunks).unwrap();
let dir = fixtures_dir();
let baseline_path = dir.join("code-sample.c.chunks.snapshot.json");
let baseline_text = match std::fs::read_to_string(&baseline_path) {
Ok(s) => s,
Err(_) if std::env::var("UPDATE_SNAPSHOTS").is_ok() => {
std::fs::create_dir_all(&dir).unwrap();
let pretty = serde_json::to_string_pretty(&actual).unwrap();
std::fs::write(&baseline_path, format!("{pretty}\n")).unwrap();
return;
}
Err(e) => panic!(
"missing baseline {}; run with UPDATE_SNAPSHOTS=1 to create: {e}",
baseline_path.display()
),
};
let expected: Value = serde_json::from_str(&baseline_text).expect("baseline parses as json");
if actual != expected {
if std::env::var("UPDATE_SNAPSHOTS").is_ok() {
let pretty = serde_json::to_string_pretty(&actual).unwrap();
std::fs::write(&baseline_path, format!("{pretty}\n")).unwrap();
eprintln!("updated baseline {}", baseline_path.display());
return;
}
let pretty = serde_json::to_string_pretty(&actual).unwrap();
panic!(
"code-c-ast-v1 chunks snapshot drift\n\
--- expected ({}) ---\n{baseline_text}\n\
--- actual ---\n{pretty}\n\
If intentional, re-run with UPDATE_SNAPSHOTS=1.",
baseline_path.display()
);
}
}
/// Determinism cross-check: re-running the same pipeline yields the same
/// chunk_ids byte-for-byte.
#[test]
fn code_c_ast_chunks_are_deterministic() {
let policy = fixed_policy();
let baseline: Vec<String> = CodeCAstV1Chunker
.chunk(&fixed_doc(), &policy)
.unwrap()
.into_iter()
.map(|c| c.chunk_id.0)
.collect();
for _ in 0..5 {
let again: Vec<String> = CodeCAstV1Chunker
.chunk(&fixed_doc(), &policy)
.unwrap()
.into_iter()
.map(|c| c.chunk_id.0)
.collect();
assert_eq!(again, baseline);
}
}

View File

@@ -0,0 +1,354 @@
//! Snapshot test pinning the `Vec<Chunk>` JSON for a
//! representative C++ code `CanonicalDocument`.
//!
//! Two complementary tests:
//! 1. `code_cpp_ast_chunks_snapshot` — hand-built `fixed_doc()` validates the
//! chunker's 1:1 mapping (design §6.3 / §8 boundary: no parse-code dep needed).
//! 2. `code_cpp_ast_extractor_snapshot` — invokes `CppAstExtractor` against the
//! real `tests/fixtures/sample.cpp` fixture, validating the extractor → chunker
//! end-to-end pipeline. `kebab-parse-code` is a dev-dep (same pattern as
//! `kebab-parse-md` in Markdown snapshot tests).
//!
//! Set `UPDATE_SNAPSHOTS=1` to re-bake the baseline.
use std::path::PathBuf;
use kebab_chunk::CodeCppAstV1Chunker;
use kebab_core::{
AssetId, Block, CanonicalDocument, ChunkPolicy, Chunker, ChunkerVersion, CodeBlock,
CommonBlock, Lang, Metadata, ParserVersion, Provenance, SourceSpan, SourceType, TrustLevel,
WorkspacePath, id_for_block, id_for_doc,
};
use kebab_parse_code::CppAstExtractor;
use serde_json::Value;
use time::OffsetDateTime;
fn fixtures_dir() -> PathBuf {
PathBuf::from(env!("CARGO_MANIFEST_DIR"))
.join("tests")
.join("fixtures")
}
fn fixed_doc() -> CanonicalDocument {
let wp = WorkspacePath("projects/record.cpp".into());
let aid = AssetId("c".repeat(64));
// Pin parser_version so doc_id / block_ids are reproducible.
let pv = ParserVersion("code-cpp-v1".into());
let doc_id = id_for_doc(&wp, &aid, &pv);
// Representative units (C++ specific):
// 0. includes + namespace opening (lines 14, ≤200)
// 1. class definition (lines 620, ≤200)
// 2. template function (lines 2225, ≤200)
// 3. namespace closing + free fn (lines 2729, ≤200)
// 4. main fn (lines 3134, ≤200)
let raw_units: Vec<(&str, u32, u32, String)> = vec![
(
"<top-level>",
1,
4,
"#include <string>\n#include <vector>\n\nnamespace kebab {".to_string(),
),
(
"kebab::chunk::MdHeadingV1Chunker",
6,
20,
"class MdHeadingV1Chunker {\npublic:\n MdHeadingV1Chunker() = default;\n ~MdHeadingV1Chunker() = default;\n\n std::string chunk_doc(const std::string& doc) {\n return doc;\n }\n\n int operator()(int x) const {\n return x * 2;\n }\n\nprivate:\n int counter_ = 0;\n};".to_string(),
),
(
"kebab::identity",
22,
25,
"template <typename T>\nT identity(T value) {\n return value;\n}".to_string(),
),
(
"kebab::global_helper",
27,
29,
"void global_helper() {\n // free function in kebab namespace\n}".to_string(),
),
(
"main",
31,
34,
"int main() {\n kebab::chunk::MdHeadingV1Chunker c;\n return 0;\n}".to_string(),
),
];
let blocks: Vec<Block> = raw_units
.iter()
.enumerate()
.map(|(i, (sym, ls, le, code))| {
let span = SourceSpan::Code {
line_start: *ls,
line_end: *le,
symbol: Some((*sym).to_string()),
lang: Some("cpp".into()),
};
let bid = id_for_block(&doc_id, "code", &[], i as u32, &span);
Block::Code(CodeBlock {
common: CommonBlock {
block_id: bid,
heading_path: vec![],
source_span: span,
},
lang: Some("cpp".into()),
code: code.clone(),
})
})
.collect();
CanonicalDocument {
doc_id,
source_asset_id: aid,
workspace_path: wp,
title: "record.cpp".into(),
lang: Lang("und".into()),
blocks,
metadata: Metadata {
aliases: vec![],
tags: vec![],
created_at: OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(),
updated_at: OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(),
source_type: SourceType::Note,
trust_level: TrustLevel::Primary,
user_id_alias: None,
user: Default::default(),
repo: Some("kebab".into()),
git_branch: Some("main".into()),
git_commit: Some("0".repeat(40)),
code_lang: Some("cpp".into()),
},
provenance: Provenance { events: vec![] },
parser_version: pv,
schema_version: 1,
doc_version: 1,
last_chunker_version: None,
last_embedding_version: None,
}
}
fn fixed_policy() -> ChunkPolicy {
ChunkPolicy {
target_tokens: 500,
overlap_tokens: 80,
respect_markdown_headings: false,
chunker_version: ChunkerVersion("code-cpp-ast-v1".into()),
}
}
// ---------------------------------------------------------------------------
// Helper: run the real CppAstExtractor against tests/fixtures/sample.cpp
// ---------------------------------------------------------------------------
fn extract_cpp_fixture() -> CanonicalDocument {
use kebab_core::{
AssetId, AssetStorage, Checksum, ExtractConfig, ExtractContext, Extractor, RawAsset,
SourceUri, WorkspacePath,
};
use std::path::PathBuf;
let bytes = std::fs::read(fixtures_dir().join("sample.cpp")).expect("read sample.cpp fixture");
let src = String::from_utf8(bytes).expect("fixture is valid UTF-8");
let wp = WorkspacePath("tests/fixtures/sample.cpp".to_string());
let asset = RawAsset {
asset_id: AssetId("e".repeat(64)),
source_uri: SourceUri::File(PathBuf::from("tests/fixtures/sample.cpp")),
workspace_path: wp,
media_type: kebab_core::MediaType::Code("cpp".to_string()),
byte_len: src.len() as u64,
checksum: Checksum("f".repeat(64)),
discovered_at: time::OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(),
stored: AssetStorage::Reference {
path: PathBuf::from("tests/fixtures/sample.cpp"),
sha: Checksum("f".repeat(64)),
},
};
let cfg = ExtractConfig::default();
let root = PathBuf::from("/tmp");
let ctx = ExtractContext {
asset: &asset,
workspace_root: &root,
config: &cfg,
};
CppAstExtractor::new()
.extract(&ctx, src.as_bytes())
.unwrap()
}
// ---------------------------------------------------------------------------
// Test 1 (hand-built): chunker-only 1:1 mapping validation
// ---------------------------------------------------------------------------
#[test]
fn code_cpp_ast_chunks_snapshot() {
let doc = fixed_doc();
let policy = fixed_policy();
let chunks = CodeCppAstV1Chunker.chunk(&doc, &policy).expect("chunk");
let actual = serde_json::to_value(&chunks).unwrap();
let dir = fixtures_dir();
let baseline_path = dir.join("code-sample.cpp.chunks.snapshot.json");
let baseline_text = match std::fs::read_to_string(&baseline_path) {
Ok(s) => s,
Err(_) if std::env::var("UPDATE_SNAPSHOTS").is_ok() => {
std::fs::create_dir_all(&dir).unwrap();
let pretty = serde_json::to_string_pretty(&actual).unwrap();
std::fs::write(&baseline_path, format!("{pretty}\n")).unwrap();
return;
}
Err(e) => panic!(
"missing baseline {}; run with UPDATE_SNAPSHOTS=1 to create: {e}",
baseline_path.display()
),
};
let expected: Value = serde_json::from_str(&baseline_text).expect("baseline parses as json");
if actual != expected {
if std::env::var("UPDATE_SNAPSHOTS").is_ok() {
let pretty = serde_json::to_string_pretty(&actual).unwrap();
std::fs::write(&baseline_path, format!("{pretty}\n")).unwrap();
eprintln!("updated baseline {}", baseline_path.display());
return;
}
let pretty = serde_json::to_string_pretty(&actual).unwrap();
panic!(
"code-cpp-ast-v1 chunks snapshot drift\n\
--- expected ({}) ---\n{baseline_text}\n\
--- actual ---\n{pretty}\n\
If intentional, re-run with UPDATE_SNAPSHOTS=1.",
baseline_path.display()
);
}
}
/// Determinism cross-check: re-running the same pipeline yields the same
/// chunk_ids byte-for-byte.
#[test]
fn code_cpp_ast_chunks_are_deterministic() {
let policy = fixed_policy();
let baseline: Vec<String> = CodeCppAstV1Chunker
.chunk(&fixed_doc(), &policy)
.unwrap()
.into_iter()
.map(|c| c.chunk_id.0)
.collect();
for _ in 0..5 {
let again: Vec<String> = CodeCppAstV1Chunker
.chunk(&fixed_doc(), &policy)
.unwrap()
.into_iter()
.map(|c| c.chunk_id.0)
.collect();
assert_eq!(again, baseline);
}
}
// ---------------------------------------------------------------------------
// Test 2 (real extractor): end-to-end extractor → chunker pipeline
// ---------------------------------------------------------------------------
/// Validates that the real `CppAstExtractor` processes `sample.cpp` and
/// emits the expected set of symbols through the full chunker pipeline.
///
/// `sample.cpp` contains:
/// - `#include` directives + nested namespace `kebab::chunk` → glue + struct unit
/// - `class MdHeadingV1Chunker` with methods (ctor, dtor, chunk_doc, operator())
/// - `template <typename T> T identity(T value)` (template fn)
/// - `void kebab::global_helper()` (free fn in namespace)
/// - `int main()` (global free fn)
#[test]
fn code_cpp_ast_extractor_snapshot() {
let doc = extract_cpp_fixture();
// Verify the extractor emits all expected named units.
let block_syms: Vec<Option<String>> = doc
.blocks
.iter()
.filter_map(|b| match b {
Block::Code(c) => match &c.common.source_span {
SourceSpan::Code { symbol, .. } => Some(symbol.clone()),
_ => None,
},
_ => None,
})
.collect();
// Must include namespace-qualified class and its methods
assert!(
block_syms
.iter()
.any(|s| s.as_deref() == Some("kebab::chunk::MdHeadingV1Chunker")),
"class unit missing: {block_syms:?}"
);
assert!(
block_syms
.iter()
.any(|s| s.as_deref() == Some("kebab::chunk::MdHeadingV1Chunker::MdHeadingV1Chunker")),
"ctor unit missing: {block_syms:?}"
);
assert!(
block_syms
.iter()
.any(|s| s.as_deref() == Some("kebab::chunk::MdHeadingV1Chunker::~MdHeadingV1Chunker")),
"dtor unit missing: {block_syms:?}"
);
assert!(
block_syms
.iter()
.any(|s| s.as_deref() == Some("kebab::chunk::MdHeadingV1Chunker::chunk_doc")),
"chunk_doc unit missing: {block_syms:?}"
);
assert!(
block_syms
.iter()
.any(|s| s.as_deref() == Some("kebab::chunk::MdHeadingV1Chunker::operator()")),
"operator() unit missing: {block_syms:?}"
);
// Template function (inside kebab::chunk namespace in the fixture)
assert!(
block_syms
.iter()
.any(|s| s.as_deref() == Some("kebab::chunk::identity")),
"identity template fn unit missing: {block_syms:?}"
);
// Free function in outer namespace
assert!(
block_syms
.iter()
.any(|s| s.as_deref() == Some("kebab::global_helper")),
"global_helper unit missing: {block_syms:?}"
);
// Global main
assert!(
block_syms.iter().any(|s| s.as_deref() == Some("main")),
"main unit missing: {block_syms:?}"
);
}
/// End-to-end chunker output from real extractor is deterministic.
#[test]
fn code_cpp_ast_extractor_chunks_deterministic() {
let doc1 = extract_cpp_fixture();
let doc2 = extract_cpp_fixture();
assert_eq!(
doc1.blocks, doc2.blocks,
"extractor output non-deterministic"
);
let policy = fixed_policy();
let chunks1 = CodeCppAstV1Chunker.chunk(&doc1, &policy).unwrap();
let chunks2 = CodeCppAstV1Chunker.chunk(&doc2, &policy).unwrap();
assert_eq!(
chunks1
.iter()
.map(|c| c.chunk_id.0.clone())
.collect::<Vec<_>>(),
chunks2
.iter()
.map(|c| c.chunk_id.0.clone())
.collect::<Vec<_>>(),
"chunker output non-deterministic"
);
}

View File

@@ -0,0 +1,221 @@
//! Snapshot test pinning the `Vec<Chunk>` JSON for a
//! representative Go code `CanonicalDocument`.
//!
//! This is an integration test. `kebab-parse-code` is intentionally NOT
//! a dev-dep (design §6.3 / §8 boundary: AST extraction is parser-side).
//! The `CanonicalDocument` is built inline from hand-crafted `Block::Code`
//! units, which is the same pattern used in `code_rust_ast_v1.rs`'s
//! internal `code_doc` test helper.
//!
//! Set `UPDATE_SNAPSHOTS=1` to re-bake the baseline.
use std::path::PathBuf;
use kebab_chunk::CodeGoAstV1Chunker;
use kebab_core::{
AssetId, Block, CanonicalDocument, ChunkPolicy, Chunker, ChunkerVersion, CodeBlock,
CommonBlock, Lang, Metadata, ParserVersion, Provenance, SourceSpan, SourceType, TrustLevel,
WorkspacePath, id_for_block, id_for_doc,
};
use serde_json::Value;
use time::OffsetDateTime;
fn fixtures_dir() -> PathBuf {
PathBuf::from(env!("CARGO_MANIFEST_DIR"))
.join("tests")
.join("fixtures")
}
fn fixed_doc() -> CanonicalDocument {
let wp = WorkspacePath("kebab_eval/metrics.go".into());
let aid = AssetId("b".repeat(64));
// Pin parser_version so doc_id / block_ids are reproducible.
let pv = ParserVersion("code-go-v1".into());
let doc_id = id_for_doc(&wp, &aid, &pv);
// Build a >200-line function body to force split_oversize.
let big_body: String = {
let header = "func BigCompute(data []int) int {\n";
let body: String = (0..210u32)
.map(|i| format!("\tv{i} := 0\n\tif {i} < len(data) {{\n\t\tv{i} = data[{i}]\n\t}}\n"))
.collect();
let footer = "\treturn len(data)\n}";
format!("{header}{body}{footer}")
};
let big_line_count = big_body.lines().count() as u32;
let big_line_end = 48 + big_line_count - 1;
// Representative units:
// 0. import block (lines 15, ≤200)
// 1. free fn `ComputeMRR` (lines 712, ≤200)
// 2. struct `MetricsCollector` (lines 1420, ≤200)
// 3. struct `BaseEvaluator` (lines 2230, ≤200)
// 4. method `Run` (lines 3238, ≤200)
// 5. method `Report` (lines 4046, ≤200)
// 6. BigCompute (>200 lines) to force split_oversize
let raw_units: Vec<(&str, u32, u32, String)> = vec![
(
"imports",
1,
5,
"import (\n\t\"fmt\"\n\t\"os\"\n\t\"strings\"\n)".to_string(),
),
(
"ComputeMRR",
7,
12,
"func ComputeMRR(scores []float64) float64 {\n\tif len(scores) == 0 {\n\t\treturn 0.0\n\t}\n\t_ = fmt.Sprintf(\"%v\", scores)\n\treturn 1.0 / float64(len(scores))\n}".to_string(),
),
(
"MetricsCollector",
14,
20,
"type MetricsCollector struct {\n\tScores []float64\n\tLabels []string\n\tCounts map[string]int\n\tTotals map[string]float64\n\tTags []string\n}".to_string(),
),
(
"BaseEvaluator",
22,
30,
"type BaseEvaluator struct {\n\tName string\n}\n\nfunc (e *BaseEvaluator) Evaluate(data []string) error {\n\t_ = os.Stderr\n\t_ = strings.Join(data, \",\")\n\treturn nil\n}".to_string(),
),
(
"MetricsCollector.Run",
32,
38,
"func (m *MetricsCollector) Run(inputs []float64) {\n\tfor _, inp := range inputs {\n\t\tm.Scores = append(\n\t\t\tm.Scores,\n\t\t\tinp,\n\t\t)\n\t}\n}".to_string(),
),
(
"MetricsCollector.Report",
40,
46,
"func (m *MetricsCollector) Report() map[string]interface{} {\n\treturn map[string]interface{}{\n\t\t\"mean\": 0.0,\n\t\t\"count\": len(m.Scores),\n\t\t\"tags\": m.Tags,\n\t}\n}".to_string(),
),
("BigCompute", 48, big_line_end, big_body),
];
let blocks: Vec<Block> = raw_units
.iter()
.enumerate()
.map(|(i, (sym, ls, le, code))| {
let span = SourceSpan::Code {
line_start: *ls,
line_end: *le,
symbol: Some((*sym).to_string()),
lang: Some("go".into()),
};
let bid = id_for_block(&doc_id, "code", &[], i as u32, &span);
Block::Code(CodeBlock {
common: CommonBlock {
block_id: bid,
heading_path: vec![],
source_span: span,
},
lang: Some("go".into()),
code: code.clone(),
})
})
.collect();
CanonicalDocument {
doc_id,
source_asset_id: aid,
workspace_path: wp,
title: "metrics.go".into(),
lang: Lang("und".into()),
blocks,
metadata: Metadata {
aliases: vec![],
tags: vec![],
created_at: OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(),
updated_at: OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(),
source_type: SourceType::Note,
trust_level: TrustLevel::Primary,
user_id_alias: None,
user: Default::default(),
repo: Some("kebab".into()),
git_branch: Some("main".into()),
git_commit: Some("0".repeat(40)),
code_lang: Some("go".into()),
},
provenance: Provenance { events: vec![] },
parser_version: pv,
schema_version: 1,
doc_version: 1,
last_chunker_version: None,
last_embedding_version: None,
}
}
fn fixed_policy() -> ChunkPolicy {
ChunkPolicy {
target_tokens: 500,
overlap_tokens: 80,
respect_markdown_headings: false,
chunker_version: ChunkerVersion("code-go-ast-v1".into()),
}
}
#[test]
fn code_go_ast_chunks_snapshot() {
let doc = fixed_doc();
let policy = fixed_policy();
let chunks = CodeGoAstV1Chunker.chunk(&doc, &policy).expect("chunk");
let actual = serde_json::to_value(&chunks).unwrap();
let dir = fixtures_dir();
let baseline_path = dir.join("code-sample.go.chunks.snapshot.json");
let baseline_text = match std::fs::read_to_string(&baseline_path) {
Ok(s) => s,
Err(_) if std::env::var("UPDATE_SNAPSHOTS").is_ok() => {
std::fs::create_dir_all(&dir).unwrap();
let pretty = serde_json::to_string_pretty(&actual).unwrap();
std::fs::write(&baseline_path, format!("{pretty}\n")).unwrap();
return;
}
Err(e) => panic!(
"missing baseline {}; run with UPDATE_SNAPSHOTS=1 to create: {e}",
baseline_path.display()
),
};
let expected: Value = serde_json::from_str(&baseline_text).expect("baseline parses as json");
if actual != expected {
if std::env::var("UPDATE_SNAPSHOTS").is_ok() {
let pretty = serde_json::to_string_pretty(&actual).unwrap();
std::fs::write(&baseline_path, format!("{pretty}\n")).unwrap();
eprintln!("updated baseline {}", baseline_path.display());
return;
}
let pretty = serde_json::to_string_pretty(&actual).unwrap();
panic!(
"code-go-ast-v1 chunks snapshot drift\n\
--- expected ({}) ---\n{baseline_text}\n\
--- actual ---\n{pretty}\n\
If intentional, re-run with UPDATE_SNAPSHOTS=1.",
baseline_path.display()
);
}
}
/// Determinism cross-check: re-running the same pipeline yields the same
/// chunk_ids byte-for-byte.
#[test]
fn code_go_ast_chunks_are_deterministic() {
let policy = fixed_policy();
let baseline: Vec<String> = CodeGoAstV1Chunker
.chunk(&fixed_doc(), &policy)
.unwrap()
.into_iter()
.map(|c| c.chunk_id.0)
.collect();
for _ in 0..5 {
let again: Vec<String> = CodeGoAstV1Chunker
.chunk(&fixed_doc(), &policy)
.unwrap()
.into_iter()
.map(|c| c.chunk_id.0)
.collect();
assert_eq!(again, baseline);
}
}

View File

@@ -0,0 +1,221 @@
//! Snapshot test pinning the `Vec<Chunk>` JSON for a
//! representative Java code `CanonicalDocument`.
//!
//! This is an integration test. `kebab-parse-code` is intentionally NOT
//! a dev-dep (design §6.3 / §8 boundary: AST extraction is parser-side).
//! The `CanonicalDocument` is built inline from hand-crafted `Block::Code`
//! units, which is the same pattern used in `code_rust_ast_v1.rs`'s
//! internal `code_doc` test helper.
//!
//! Set `UPDATE_SNAPSHOTS=1` to re-bake the baseline.
use std::path::PathBuf;
use kebab_chunk::CodeJavaAstV1Chunker;
use kebab_core::{
AssetId, Block, CanonicalDocument, ChunkPolicy, Chunker, ChunkerVersion, CodeBlock,
CommonBlock, Lang, Metadata, ParserVersion, Provenance, SourceSpan, SourceType, TrustLevel,
WorkspacePath, id_for_block, id_for_doc,
};
use serde_json::Value;
use time::OffsetDateTime;
fn fixtures_dir() -> PathBuf {
PathBuf::from(env!("CARGO_MANIFEST_DIR"))
.join("tests")
.join("fixtures")
}
fn fixed_doc() -> CanonicalDocument {
let wp = WorkspacePath("src/main/java/com/example/Metrics.java".into());
let aid = AssetId("b".repeat(64));
// Pin parser_version so doc_id / block_ids are reproducible.
let pv = ParserVersion("code-java-v1".into());
let doc_id = id_for_doc(&wp, &aid, &pv);
// Build a >200-line method body to force split_oversize.
let big_body: String = {
let header = "public class BigCompute {\n public int compute(int[] data) {\n";
let body: String = (0..210u32)
.map(|i| format!(" int v{i} = {i} < data.length ? data[{i}] : 0;\n"))
.collect();
let footer = " return data.length;\n }\n}";
format!("{header}{body}{footer}")
};
let big_line_count = big_body.lines().count() as u32;
let big_line_end = 48 + big_line_count - 1;
// Representative units:
// 0. import block (lines 15, ≤200)
// 1. free method `computeMRR` (lines 712, ≤200)
// 2. class `MetricsCollector` (lines 1420, ≤200)
// 3. class `BaseEvaluator` (lines 2230, ≤200)
// 4. method `MetricsCollector.run` (lines 3238, ≤200)
// 5. method `MetricsCollector.report` (lines 4046, ≤200)
// 6. BigCompute (>200 lines) to force split_oversize
let raw_units: Vec<(&str, u32, u32, String)> = vec![
(
"imports",
1,
5,
"import java.util.List;\nimport java.util.Map;\nimport java.util.ArrayList;\nimport java.util.HashMap;\nimport java.util.stream.Collectors;".to_string(),
),
(
"computeMRR",
7,
12,
"public static double computeMRR(List<Double> scores) {\n if (scores.isEmpty()) {\n return 0.0;\n }\n return 1.0 / scores.size();\n}".to_string(),
),
(
"MetricsCollector",
14,
20,
"public class MetricsCollector {\n private List<Double> scores;\n private List<String> labels;\n private Map<String, Integer> counts;\n private Map<String, Double> totals;\n private List<String> tags;\n}".to_string(),
),
(
"BaseEvaluator",
22,
30,
"public class BaseEvaluator {\n private String name;\n\n public BaseEvaluator(String name) {\n this.name = name;\n }\n\n public void evaluate(List<String> data) throws Exception {\n String joined = String.join(\",\", data);\n }\n}".to_string(),
),
(
"MetricsCollector.run",
32,
38,
"public void run(List<Double> inputs) {\n for (Double inp : inputs) {\n scores.add(\n inp\n );\n }\n}".to_string(),
),
(
"MetricsCollector.report",
40,
46,
"public Map<String, Object> report() {\n Map<String, Object> result = new HashMap<>();\n result.put(\"mean\", 0.0);\n result.put(\"count\", scores.size());\n result.put(\"tags\", tags);\n return result;\n}".to_string(),
),
("BigCompute", 48, big_line_end, big_body),
];
let blocks: Vec<Block> = raw_units
.iter()
.enumerate()
.map(|(i, (sym, ls, le, code))| {
let span = SourceSpan::Code {
line_start: *ls,
line_end: *le,
symbol: Some((*sym).to_string()),
lang: Some("java".into()),
};
let bid = id_for_block(&doc_id, "code", &[], i as u32, &span);
Block::Code(CodeBlock {
common: CommonBlock {
block_id: bid,
heading_path: vec![],
source_span: span,
},
lang: Some("java".into()),
code: code.clone(),
})
})
.collect();
CanonicalDocument {
doc_id,
source_asset_id: aid,
workspace_path: wp,
title: "Metrics.java".into(),
lang: Lang("und".into()),
blocks,
metadata: Metadata {
aliases: vec![],
tags: vec![],
created_at: OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(),
updated_at: OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(),
source_type: SourceType::Note,
trust_level: TrustLevel::Primary,
user_id_alias: None,
user: Default::default(),
repo: Some("kebab".into()),
git_branch: Some("main".into()),
git_commit: Some("0".repeat(40)),
code_lang: Some("java".into()),
},
provenance: Provenance { events: vec![] },
parser_version: pv,
schema_version: 1,
doc_version: 1,
last_chunker_version: None,
last_embedding_version: None,
}
}
fn fixed_policy() -> ChunkPolicy {
ChunkPolicy {
target_tokens: 500,
overlap_tokens: 80,
respect_markdown_headings: false,
chunker_version: ChunkerVersion("code-java-ast-v1".into()),
}
}
#[test]
fn code_java_ast_chunks_snapshot() {
let doc = fixed_doc();
let policy = fixed_policy();
let chunks = CodeJavaAstV1Chunker.chunk(&doc, &policy).expect("chunk");
let actual = serde_json::to_value(&chunks).unwrap();
let dir = fixtures_dir();
let baseline_path = dir.join("code-sample.java.chunks.snapshot.json");
let baseline_text = match std::fs::read_to_string(&baseline_path) {
Ok(s) => s,
Err(_) if std::env::var("UPDATE_SNAPSHOTS").is_ok() => {
std::fs::create_dir_all(&dir).unwrap();
let pretty = serde_json::to_string_pretty(&actual).unwrap();
std::fs::write(&baseline_path, format!("{pretty}\n")).unwrap();
return;
}
Err(e) => panic!(
"missing baseline {}; run with UPDATE_SNAPSHOTS=1 to create: {e}",
baseline_path.display()
),
};
let expected: Value = serde_json::from_str(&baseline_text).expect("baseline parses as json");
if actual != expected {
if std::env::var("UPDATE_SNAPSHOTS").is_ok() {
let pretty = serde_json::to_string_pretty(&actual).unwrap();
std::fs::write(&baseline_path, format!("{pretty}\n")).unwrap();
eprintln!("updated baseline {}", baseline_path.display());
return;
}
let pretty = serde_json::to_string_pretty(&actual).unwrap();
panic!(
"code-java-ast-v1 chunks snapshot drift\n\
--- expected ({}) ---\n{baseline_text}\n\
--- actual ---\n{pretty}\n\
If intentional, re-run with UPDATE_SNAPSHOTS=1.",
baseline_path.display()
);
}
}
/// Determinism cross-check: re-running the same pipeline yields the same
/// chunk_ids byte-for-byte.
#[test]
fn code_java_ast_chunks_are_deterministic() {
let policy = fixed_policy();
let baseline: Vec<String> = CodeJavaAstV1Chunker
.chunk(&fixed_doc(), &policy)
.unwrap()
.into_iter()
.map(|c| c.chunk_id.0)
.collect();
for _ in 0..5 {
let again: Vec<String> = CodeJavaAstV1Chunker
.chunk(&fixed_doc(), &policy)
.unwrap()
.into_iter()
.map(|c| c.chunk_id.0)
.collect();
assert_eq!(again, baseline);
}
}

View File

@@ -13,9 +13,9 @@ use std::path::PathBuf;
use kebab_chunk::CodeJsAstV1Chunker;
use kebab_core::{
AssetId, Block, CanonicalDocument, ChunkPolicy, Chunker, ChunkerVersion, CodeBlock, CommonBlock,
Lang, Metadata, ParserVersion, Provenance, SourceSpan, SourceType, TrustLevel, WorkspacePath,
id_for_block, id_for_doc,
AssetId, Block, CanonicalDocument, ChunkPolicy, Chunker, ChunkerVersion, CodeBlock,
CommonBlock, Lang, Metadata, ParserVersion, Provenance, SourceSpan, SourceType, TrustLevel,
WorkspacePath, id_for_block, id_for_doc,
};
use serde_json::Value;
use time::OffsetDateTime;

View File

@@ -0,0 +1,221 @@
//! Snapshot test pinning the `Vec<Chunk>` JSON for a
//! representative Kotlin code `CanonicalDocument`.
//!
//! This is an integration test. `kebab-parse-code` is intentionally NOT
//! a dev-dep (design §6.3 / §8 boundary: AST extraction is parser-side).
//! The `CanonicalDocument` is built inline from hand-crafted `Block::Code`
//! units, which is the same pattern used in `code_rust_ast_v1.rs`'s
//! internal `code_doc` test helper.
//!
//! Set `UPDATE_SNAPSHOTS=1` to re-bake the baseline.
use std::path::PathBuf;
use kebab_chunk::CodeKotlinAstV1Chunker;
use kebab_core::{
AssetId, Block, CanonicalDocument, ChunkPolicy, Chunker, ChunkerVersion, CodeBlock,
CommonBlock, Lang, Metadata, ParserVersion, Provenance, SourceSpan, SourceType, TrustLevel,
WorkspacePath, id_for_block, id_for_doc,
};
use serde_json::Value;
use time::OffsetDateTime;
fn fixtures_dir() -> PathBuf {
PathBuf::from(env!("CARGO_MANIFEST_DIR"))
.join("tests")
.join("fixtures")
}
fn fixed_doc() -> CanonicalDocument {
let wp = WorkspacePath("src/main/kotlin/com/example/Metrics.kt".into());
let aid = AssetId("b".repeat(64));
// Pin parser_version so doc_id / block_ids are reproducible.
let pv = ParserVersion("code-kotlin-v1".into());
let doc_id = id_for_doc(&wp, &aid, &pv);
// Build a >200-line function body to force split_oversize.
let big_body: String = {
let header = "class BigCompute {\n fun compute(data: IntArray): Int {\n";
let body: String = (0..210u32)
.map(|i| format!(" val v{i} = if ({i} < data.size) data[{i}] else 0\n"))
.collect();
let footer = " return data.size\n }\n}";
format!("{header}{body}{footer}")
};
let big_line_count = big_body.lines().count() as u32;
let big_line_end = 48 + big_line_count - 1;
// Representative units:
// 0. import block (lines 15, ≤200)
// 1. top-level fn `computeMRR` (lines 712, ≤200)
// 2. data class `MetricsCollector` (lines 1420, ≤200)
// 3. class `BaseEvaluator` (lines 2230, ≤200)
// 4. method `MetricsCollector.run` (lines 3238, ≤200)
// 5. method `MetricsCollector.report` (lines 4046, ≤200)
// 6. BigCompute (>200 lines) to force split_oversize
let raw_units: Vec<(&str, u32, u32, String)> = vec![
(
"imports",
1,
5,
"import kotlin.collections.List\nimport kotlin.collections.Map\nimport kotlin.collections.MutableList\nimport kotlin.collections.MutableMap\nimport kotlin.collections.mutableListOf".to_string(),
),
(
"computeMRR",
7,
12,
"fun computeMRR(scores: List<Double>): Double {\n if (scores.isEmpty()) {\n return 0.0\n }\n return 1.0 / scores.size\n}".to_string(),
),
(
"MetricsCollector",
14,
20,
"data class MetricsCollector(\n val scores: MutableList<Double> = mutableListOf(),\n val labels: MutableList<String> = mutableListOf(),\n val counts: MutableMap<String, Int> = mutableMapOf(),\n val totals: MutableMap<String, Double> = mutableMapOf(),\n val tags: MutableList<String> = mutableListOf(),\n)".to_string(),
),
(
"BaseEvaluator",
22,
30,
"open class BaseEvaluator(val name: String) {\n\n fun evaluate(data: List<String>) {\n val joined = data.joinToString(\",\")\n println(joined)\n }\n\n open fun describe(): String = name\n}".to_string(),
),
(
"MetricsCollector.run",
32,
38,
"fun MetricsCollector.run(inputs: List<Double>) {\n for (inp in inputs) {\n scores.add(\n inp\n )\n }\n}".to_string(),
),
(
"MetricsCollector.report",
40,
46,
"fun MetricsCollector.report(): Map<String, Any> {\n return mapOf(\n \"mean\" to 0.0,\n \"count\" to scores.size,\n \"tags\" to tags,\n )\n}".to_string(),
),
("BigCompute", 48, big_line_end, big_body),
];
let blocks: Vec<Block> = raw_units
.iter()
.enumerate()
.map(|(i, (sym, ls, le, code))| {
let span = SourceSpan::Code {
line_start: *ls,
line_end: *le,
symbol: Some((*sym).to_string()),
lang: Some("kotlin".into()),
};
let bid = id_for_block(&doc_id, "code", &[], i as u32, &span);
Block::Code(CodeBlock {
common: CommonBlock {
block_id: bid,
heading_path: vec![],
source_span: span,
},
lang: Some("kotlin".into()),
code: code.clone(),
})
})
.collect();
CanonicalDocument {
doc_id,
source_asset_id: aid,
workspace_path: wp,
title: "Metrics.kt".into(),
lang: Lang("und".into()),
blocks,
metadata: Metadata {
aliases: vec![],
tags: vec![],
created_at: OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(),
updated_at: OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(),
source_type: SourceType::Note,
trust_level: TrustLevel::Primary,
user_id_alias: None,
user: Default::default(),
repo: Some("kebab".into()),
git_branch: Some("main".into()),
git_commit: Some("0".repeat(40)),
code_lang: Some("kotlin".into()),
},
provenance: Provenance { events: vec![] },
parser_version: pv,
schema_version: 1,
doc_version: 1,
last_chunker_version: None,
last_embedding_version: None,
}
}
fn fixed_policy() -> ChunkPolicy {
ChunkPolicy {
target_tokens: 500,
overlap_tokens: 80,
respect_markdown_headings: false,
chunker_version: ChunkerVersion("code-kotlin-ast-v1".into()),
}
}
#[test]
fn code_kotlin_ast_chunks_snapshot() {
let doc = fixed_doc();
let policy = fixed_policy();
let chunks = CodeKotlinAstV1Chunker.chunk(&doc, &policy).expect("chunk");
let actual = serde_json::to_value(&chunks).unwrap();
let dir = fixtures_dir();
let baseline_path = dir.join("code-sample.kt.chunks.snapshot.json");
let baseline_text = match std::fs::read_to_string(&baseline_path) {
Ok(s) => s,
Err(_) if std::env::var("UPDATE_SNAPSHOTS").is_ok() => {
std::fs::create_dir_all(&dir).unwrap();
let pretty = serde_json::to_string_pretty(&actual).unwrap();
std::fs::write(&baseline_path, format!("{pretty}\n")).unwrap();
return;
}
Err(e) => panic!(
"missing baseline {}; run with UPDATE_SNAPSHOTS=1 to create: {e}",
baseline_path.display()
),
};
let expected: Value = serde_json::from_str(&baseline_text).expect("baseline parses as json");
if actual != expected {
if std::env::var("UPDATE_SNAPSHOTS").is_ok() {
let pretty = serde_json::to_string_pretty(&actual).unwrap();
std::fs::write(&baseline_path, format!("{pretty}\n")).unwrap();
eprintln!("updated baseline {}", baseline_path.display());
return;
}
let pretty = serde_json::to_string_pretty(&actual).unwrap();
panic!(
"code-kotlin-ast-v1 chunks snapshot drift\n\
--- expected ({}) ---\n{baseline_text}\n\
--- actual ---\n{pretty}\n\
If intentional, re-run with UPDATE_SNAPSHOTS=1.",
baseline_path.display()
);
}
}
/// Determinism cross-check: re-running the same pipeline yields the same
/// chunk_ids byte-for-byte.
#[test]
fn code_kotlin_ast_chunks_are_deterministic() {
let policy = fixed_policy();
let baseline: Vec<String> = CodeKotlinAstV1Chunker
.chunk(&fixed_doc(), &policy)
.unwrap()
.into_iter()
.map(|c| c.chunk_id.0)
.collect();
for _ in 0..5 {
let again: Vec<String> = CodeKotlinAstV1Chunker
.chunk(&fixed_doc(), &policy)
.unwrap()
.into_iter()
.map(|c| c.chunk_id.0)
.collect();
assert_eq!(again, baseline);
}
}

View File

@@ -13,9 +13,9 @@ use std::path::PathBuf;
use kebab_chunk::CodePythonAstV1Chunker;
use kebab_core::{
AssetId, Block, CanonicalDocument, ChunkPolicy, Chunker, ChunkerVersion, CodeBlock, CommonBlock,
Lang, Metadata, ParserVersion, Provenance, SourceSpan, SourceType, TrustLevel, WorkspacePath,
id_for_block, id_for_doc,
AssetId, Block, CanonicalDocument, ChunkPolicy, Chunker, ChunkerVersion, CodeBlock,
CommonBlock, Lang, Metadata, ParserVersion, Provenance, SourceSpan, SourceType, TrustLevel,
WorkspacePath, id_for_block, id_for_doc,
};
use serde_json::Value;
use time::OffsetDateTime;

View File

@@ -13,9 +13,9 @@ use std::path::PathBuf;
use kebab_chunk::CodeRustAstV1Chunker;
use kebab_core::{
AssetId, Block, CanonicalDocument, ChunkPolicy, Chunker, ChunkerVersion, CodeBlock, CommonBlock,
Lang, Metadata, ParserVersion, Provenance, SourceSpan, SourceType, TrustLevel, WorkspacePath,
id_for_block, id_for_doc,
AssetId, Block, CanonicalDocument, ChunkPolicy, Chunker, ChunkerVersion, CodeBlock,
CommonBlock, Lang, Metadata, ParserVersion, Provenance, SourceSpan, SourceType, TrustLevel,
WorkspacePath, id_for_block, id_for_doc,
};
use serde_json::Value;
use time::OffsetDateTime;

View File

@@ -0,0 +1,270 @@
//! Behavioural tests for `CodeTextParagraphV1Chunker`.
//!
//! Documents are constructed manually (no kebab-parse-code dependency) by
//! placing raw text into a single `Block::Code`, mirroring the pattern used
//! in `k8s_manifest_resource_v1.rs`.
use std::path::PathBuf;
use kebab_chunk::CodeTextParagraphV1Chunker;
use kebab_core::{
AssetId, Block, CanonicalDocument, ChunkPolicy, Chunker, ChunkerVersion, CodeBlock,
CommonBlock, Lang, Metadata, ParserVersion, Provenance, SourceSpan, SourceType, TrustLevel,
WorkspacePath, id_for_block, id_for_doc,
};
use time::OffsetDateTime;
// ── helpers ──────────────────────────────────────────────────────────────────
fn fixtures_dir() -> PathBuf {
PathBuf::from(env!("CARGO_MANIFEST_DIR"))
.join("tests")
.join("fixtures")
}
/// Build a `CanonicalDocument` with a single `Block::Code` containing `text`
/// and the supplied `lang` label.
fn text_doc(lang: &str, text: &str) -> CanonicalDocument {
let wp = WorkspacePath("scripts/sample.sh".into());
let aid = AssetId("d".repeat(64));
let pv = ParserVersion("code-text-paragraph-v1".into());
let doc_id = id_for_doc(&wp, &aid, &pv);
let line_count = text.lines().count() as u32;
let span = SourceSpan::Code {
line_start: 1,
line_end: line_count.max(1),
symbol: None,
lang: Some(lang.into()),
};
let bid = id_for_block(&doc_id, "code", &[], 0, &span);
let block = Block::Code(CodeBlock {
common: CommonBlock {
block_id: bid,
heading_path: vec![],
source_span: span,
},
lang: Some(lang.into()),
code: text.to_string(),
});
CanonicalDocument {
doc_id,
source_asset_id: aid,
workspace_path: wp,
title: "sample.sh".into(),
lang: Lang("und".into()),
blocks: vec![block],
metadata: Metadata {
aliases: vec![],
tags: vec![],
created_at: OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(),
updated_at: OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(),
source_type: SourceType::Note,
trust_level: TrustLevel::Primary,
user_id_alias: None,
user: Default::default(),
repo: Some("kebab".into()),
git_branch: Some("main".into()),
git_commit: Some("0".repeat(40)),
code_lang: Some(lang.into()),
},
provenance: Provenance { events: vec![] },
parser_version: pv,
schema_version: 1,
doc_version: 1,
last_chunker_version: None,
last_embedding_version: None,
}
}
fn policy() -> ChunkPolicy {
ChunkPolicy {
target_tokens: 500,
overlap_tokens: 80,
respect_markdown_headings: false,
chunker_version: ChunkerVersion("code-text-paragraph-v1".into()),
}
}
// ── tests ─────────────────────────────────────────────────────────────────────
/// `sample_shell.sh` has 4 paragraphs separated by 3 blank lines:
/// - paragraph 1: lines 1-2 (shebang + set -euo pipefail)
/// - paragraph 2: lines 4-7 (env setup block)
/// - paragraph 3: lines 9-11 (ingest block)
/// - paragraph 4: lines 13-15 (report block)
///
/// We assert:
/// - exactly 4 chunks (one per paragraph)
/// - all symbols are None (Tier 3 spec §9.3)
/// - all langs are "shell"
/// - line ranges are strictly ascending and do NOT include the blank lines
/// (lines 3, 8, 12 must not appear in any range)
#[test]
fn shell_multi_paragraph_splits_on_blank_lines() {
let fixture_path = fixtures_dir().join("sample_shell.sh");
let text = std::fs::read_to_string(&fixture_path)
.unwrap_or_else(|e| panic!("cannot read fixture {}: {e}", fixture_path.display()));
let doc = text_doc("shell", &text);
let chunks = CodeTextParagraphV1Chunker
.chunk(&doc, &policy())
.expect("chunk");
assert_eq!(
chunks.len(),
4,
"expected 4 chunks (one per paragraph), got {}: {chunks:#?}",
chunks.len()
);
// All symbols must be None (Tier 3 requirement).
for (i, chunk) in chunks.iter().enumerate() {
match &chunk.source_spans[0] {
SourceSpan::Code { symbol, .. } => {
assert!(
symbol.is_none(),
"chunk[{i}] symbol must be None for Tier 3 chunker, got {symbol:?}"
);
}
other => panic!("chunk[{i}]: expected Code span, got {other:?}"),
}
}
// All langs must be "shell".
for (i, chunk) in chunks.iter().enumerate() {
match &chunk.source_spans[0] {
SourceSpan::Code { lang, .. } => {
assert_eq!(
lang.as_deref(),
Some("shell"),
"chunk[{i}] lang must be 'shell', got {lang:?}"
);
}
other => panic!("chunk[{i}]: expected Code span, got {other:?}"),
}
}
// Line ranges must be strictly ascending with no overlap,
// and blank lines (3, 8, 12) must not be included in any range.
let expected_ranges: &[(u32, u32)] = &[(1, 2), (4, 7), (9, 11), (13, 15)];
let actual_ranges: Vec<(u32, u32)> = chunks
.iter()
.map(|c| match &c.source_spans[0] {
SourceSpan::Code {
line_start,
line_end,
..
} => (*line_start, *line_end),
other => panic!("expected Code span, got {other:?}"),
})
.collect();
assert_eq!(
actual_ranges, expected_ranges,
"line ranges mismatch: got {actual_ranges:?}, expected {expected_ranges:?}"
);
}
/// `sample_long_paragraph.txt` has exactly 200 non-blank lines and no blank
/// lines, so the entire file is one paragraph. 200 > 80 (FALLBACK_LINES_PER_CHUNK),
/// so the oversize window split fires with stride 60:
/// - window 1: lines 1-80
/// - window 2: lines 61-140
/// - window 3: lines 121-200
///
/// All chunk_ids must be distinct (the #L{window_start} split_key suffix).
#[test]
fn single_long_paragraph_line_window_split() {
let fixture_path = fixtures_dir().join("sample_long_paragraph.txt");
let text = std::fs::read_to_string(&fixture_path)
.unwrap_or_else(|e| panic!("cannot read fixture {}: {e}", fixture_path.display()));
assert_eq!(
text.lines().count(),
200,
"fixture must have exactly 200 lines"
);
let doc = text_doc("shell", &text);
let chunks = CodeTextParagraphV1Chunker
.chunk(&doc, &policy())
.expect("chunk");
assert_eq!(
chunks.len(),
3,
"expected 3 window chunks for 200-line paragraph, got {}: {chunks:#?}",
chunks.len()
);
let expected_ranges: &[(u32, u32)] = &[(1, 80), (61, 140), (121, 200)];
let actual_ranges: Vec<(u32, u32)> = chunks
.iter()
.map(|c| match &c.source_spans[0] {
SourceSpan::Code {
line_start,
line_end,
..
} => (*line_start, *line_end),
other => panic!("expected Code span, got {other:?}"),
})
.collect();
assert_eq!(
actual_ranges, expected_ranges,
"window ranges mismatch: got {actual_ranges:?}, expected {expected_ranges:?}"
);
// All chunk_ids must be distinct (#L{window_start} suffix differentiates them).
let ids: std::collections::HashSet<_> = chunks.iter().map(|c| c.chunk_id.clone()).collect();
assert_eq!(
ids.len(),
chunks.len(),
"oversize window chunks must have distinct chunk_ids"
);
}
/// An empty source file (no non-blank lines) must yield zero chunks.
#[test]
fn empty_file_emits_zero_chunks() {
let doc = text_doc("shell", "");
let chunks = CodeTextParagraphV1Chunker
.chunk(&doc, &policy())
.expect("chunk");
assert_eq!(
chunks.len(),
0,
"empty file must yield 0 chunks, got {}: {chunks:#?}",
chunks.len()
);
}
/// The `lang` field on each emitted chunk must match the `lang` passed to
/// `text_doc`, regardless of content. `symbol` must be `None` (Tier 3 spec).
#[test]
fn lang_field_preserved_from_input_doc() {
let doc = text_doc("yaml", "key1: value1\nkey2: value2\n");
let chunks = CodeTextParagraphV1Chunker
.chunk(&doc, &policy())
.expect("chunk");
assert!(!chunks.is_empty(), "expected at least one chunk");
match &chunks[0].source_spans[0] {
SourceSpan::Code { lang, symbol, .. } => {
assert_eq!(
lang.as_deref(),
Some("yaml"),
"lang must be 'yaml', got {lang:?}"
);
assert!(
symbol.is_none(),
"symbol must be None for Tier 3 chunker, got {symbol:?}"
);
}
other => panic!("expected Code span, got {other:?}"),
}
}

View File

@@ -13,9 +13,9 @@ use std::path::PathBuf;
use kebab_chunk::CodeTsAstV1Chunker;
use kebab_core::{
AssetId, Block, CanonicalDocument, ChunkPolicy, Chunker, ChunkerVersion, CodeBlock, CommonBlock,
Lang, Metadata, ParserVersion, Provenance, SourceSpan, SourceType, TrustLevel, WorkspacePath,
id_for_block, id_for_doc,
AssetId, Block, CanonicalDocument, ChunkPolicy, Chunker, ChunkerVersion, CodeBlock,
CommonBlock, Lang, Metadata, ParserVersion, Provenance, SourceSpan, SourceType, TrustLevel,
WorkspacePath, id_for_block, id_for_doc,
};
use serde_json::Value;
use time::OffsetDateTime;

View File

@@ -0,0 +1,138 @@
//! Behavioural tests for `DockerfileFileV1Chunker`.
//!
//! Documents are constructed manually (no kebab-parse-code dependency) by
//! placing the raw Dockerfile text into a single `Block::Code`, mirroring the
//! pattern used in `k8s_manifest_resource_v1.rs`.
use std::path::PathBuf;
use kebab_chunk::DockerfileFileV1Chunker;
use kebab_core::{
AssetId, Block, CanonicalDocument, ChunkPolicy, Chunker, ChunkerVersion, CodeBlock,
CommonBlock, Lang, Metadata, ParserVersion, Provenance, SourceSpan, SourceType, TrustLevel,
WorkspacePath, id_for_block, id_for_doc,
};
use time::OffsetDateTime;
// ── helpers ──────────────────────────────────────────────────────────────────
fn fixtures_dir() -> PathBuf {
PathBuf::from(env!("CARGO_MANIFEST_DIR"))
.join("tests")
.join("fixtures")
}
/// Build a `CanonicalDocument` with a single `Block::Code` containing `dockerfile_text`.
fn dockerfile_doc(dockerfile_text: &str) -> CanonicalDocument {
let wp = WorkspacePath("build/Dockerfile".into());
let aid = AssetId("d".repeat(64));
let pv = ParserVersion("code-dockerfile-v1".into());
let doc_id = id_for_doc(&wp, &aid, &pv);
let line_count = dockerfile_text.lines().count() as u32;
let span = SourceSpan::Code {
line_start: 1,
line_end: line_count.max(1),
symbol: None,
lang: Some("dockerfile".into()),
};
let bid = id_for_block(&doc_id, "code", &[], 0, &span);
let block = Block::Code(CodeBlock {
common: CommonBlock {
block_id: bid,
heading_path: vec![],
source_span: span,
},
lang: Some("dockerfile".into()),
code: dockerfile_text.to_string(),
});
CanonicalDocument {
doc_id,
source_asset_id: aid,
workspace_path: wp,
title: "Dockerfile".into(),
lang: Lang("und".into()),
blocks: vec![block],
metadata: Metadata {
aliases: vec![],
tags: vec![],
created_at: OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(),
updated_at: OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(),
source_type: SourceType::Note,
trust_level: TrustLevel::Primary,
user_id_alias: None,
user: Default::default(),
repo: Some("kebab".into()),
git_branch: Some("main".into()),
git_commit: Some("0".repeat(40)),
code_lang: Some("dockerfile".into()),
},
provenance: Provenance { events: vec![] },
parser_version: pv,
schema_version: 1,
doc_version: 1,
last_chunker_version: None,
last_embedding_version: None,
}
}
fn policy() -> ChunkPolicy {
ChunkPolicy {
target_tokens: 500,
overlap_tokens: 80,
respect_markdown_headings: false,
chunker_version: ChunkerVersion("dockerfile-file-v1".into()),
}
}
// ── tests ─────────────────────────────────────────────────────────────────────
/// A simple 5-line Dockerfile fixture must emit exactly 1 chunk with the
/// correct symbol, lang, and line range.
#[test]
fn dockerfile_emits_single_chunk() {
let fixture_path = fixtures_dir().join("sample.dockerfile");
let text = std::fs::read_to_string(&fixture_path)
.unwrap_or_else(|e| panic!("cannot read fixture {}: {e}", fixture_path.display()));
let doc = dockerfile_doc(&text);
let chunks = DockerfileFileV1Chunker
.chunk(&doc, &policy())
.expect("chunk");
assert_eq!(
chunks.len(),
1,
"expected 1 chunk, got {}: {chunks:#?}",
chunks.len()
);
// Inspect the Chunk's source_spans for symbol / lang / line range.
let span = chunks[0].source_spans.first().expect("at least one span");
match span {
SourceSpan::Code {
line_start,
line_end,
symbol,
lang,
} => {
assert_eq!(*line_start, 1, "line_start must be 1");
assert_eq!(*line_end, 5, "line_end must be 5 (5-line fixture)");
assert_eq!(
symbol.as_deref(),
Some("<dockerfile>"),
"symbol must be '<dockerfile>'"
);
assert_eq!(
lang.as_deref(),
Some("dockerfile"),
"lang must be 'dockerfile'"
);
}
other => panic!("expected SourceSpan::Code, got {other:?}"),
}
// Verify chunker_version label.
assert_eq!(chunks[0].chunker_version.0, "dockerfile-file-v1");
}

View File

@@ -0,0 +1,90 @@
[
{
"block_ids": [
"8149e12ca002489acb4a0f74c97a061a"
],
"chunk_id": "ec3cf06ae56c8e9796bbc9196438b7c5",
"chunker_version": "code-c-ast-v1",
"doc_id": "6bec42dd593920a060541db16c4e8e45",
"heading_path": [],
"policy_hash": "ecfad2ec1223662d",
"source_spans": [
{
"kind": "code",
"lang": "c",
"line_end": 18,
"line_start": 1,
"symbol": "<top-level>"
}
],
"text": "#include <stdio.h>\n#include <stdlib.h>\n\n#define MAX_BUF 4096\n\ntypedef enum {\n OK = 0,\n ERR_PARSE,\n ERR_IO,\n} status_t;\n\ntypedef struct {\n int id;\n char name[64];\n status_t status;\n} record_t;\n\nstatic int counter = 0;",
"token_estimate": 78,
"tokenized_korean_text": "# include < stdio . h > # include < stdlib . h > # define MAX _ BUF 4096 typedef enum { OK = 0 , ERR _ PARSE , ERR _ IO , } status _ t ; typedef struct { int id ; char name [ 64 ]; status _ t status ; } record _ t ; static int counter = 0 ;"
},
{
"block_ids": [
"1baaa89f21a47b2f32d6396a24a85454"
],
"chunk_id": "c2d7a81c898106733ef2e703774a6a4a",
"chunker_version": "code-c-ast-v1",
"doc_id": "6bec42dd593920a060541db16c4e8e45",
"heading_path": [],
"policy_hash": "ecfad2ec1223662d",
"source_spans": [
{
"kind": "code",
"lang": "c",
"line_end": 23,
"line_start": 20,
"symbol": "parse_record"
}
],
"text": "int parse_record(const char *line, record_t *out) {\n if (line == NULL || out == NULL) return ERR_PARSE;\n return OK;\n}",
"token_estimate": 41,
"tokenized_korean_text": "int parse _ record ( const char * line , record _ t * out ) { if ( line == NULL || out == NULL ) return ERR _ PARSE ; return OK ; }"
},
{
"block_ids": [
"8d0e14cbcc6d1e92d7878ab796ea68b8"
],
"chunk_id": "0e4d7b131ab64eba03b51903b5d8f96d",
"chunker_version": "code-c-ast-v1",
"doc_id": "6bec42dd593920a060541db16c4e8e45",
"heading_path": [],
"policy_hash": "ecfad2ec1223662d",
"source_spans": [
{
"kind": "code",
"lang": "c",
"line_end": 27,
"line_start": 25,
"symbol": "print_record"
}
],
"text": "void print_record(const record_t *r) {\n printf(\"[%d] %s (status=%d)\\n\", r->id, r->name, r->status);\n}",
"token_estimate": 35,
"tokenized_korean_text": "void print _ record ( const record _ t * r ) { printf (\"[% d ] % s ( status =% d )\\ n \", r -> id , r -> name , r -> status ); }"
},
{
"block_ids": [
"9c2ede84423871b615d48c38fefb1853"
],
"chunk_id": "e076f8edb2ff141d7e99b4106bb95157",
"chunker_version": "code-c-ast-v1",
"doc_id": "6bec42dd593920a060541db16c4e8e45",
"heading_path": [],
"policy_hash": "ecfad2ec1223662d",
"source_spans": [
{
"kind": "code",
"lang": "c",
"line_end": 33,
"line_start": 29,
"symbol": "main"
}
],
"text": "int main(void) {\n record_t r = { .id = 1, .name = \"foo\", .status = OK };\n print_record(&r);\n return 0;\n}",
"token_estimate": 38,
"tokenized_korean_text": "int main ( void ) { record _ t r = { . id = 1 , . name = \" foo \", . status = OK }; print _ record (& r ); return 0 ; }"
}
]

File diff suppressed because one or more lines are too long

View File

@@ -0,0 +1,112 @@
[
{
"block_ids": [
"53292605459065d170cd36c118e20546"
],
"chunk_id": "50a5b324300d9082eac4ce2a422810e1",
"chunker_version": "code-cpp-ast-v1",
"doc_id": "fff1e1f0a7ff70ef682937470e5d1d28",
"heading_path": [],
"policy_hash": "71f3c07bb9ec1d09",
"source_spans": [
{
"kind": "code",
"lang": "cpp",
"line_end": 4,
"line_start": 1,
"symbol": "<top-level>"
}
],
"text": "#include <string>\n#include <vector>\n\nnamespace kebab {",
"token_estimate": 18,
"tokenized_korean_text": "# include < string > # include < vector > namespace kebab {"
},
{
"block_ids": [
"f349acad94c9fa4cf9ad1c0a93e83610"
],
"chunk_id": "0e6bc7c522665af8a4b0f66afb9d29c8",
"chunker_version": "code-cpp-ast-v1",
"doc_id": "fff1e1f0a7ff70ef682937470e5d1d28",
"heading_path": [],
"policy_hash": "71f3c07bb9ec1d09",
"source_spans": [
{
"kind": "code",
"lang": "cpp",
"line_end": 20,
"line_start": 6,
"symbol": "kebab::chunk::MdHeadingV1Chunker"
}
],
"text": "class MdHeadingV1Chunker {\npublic:\n MdHeadingV1Chunker() = default;\n ~MdHeadingV1Chunker() = default;\n\n std::string chunk_doc(const std::string& doc) {\n return doc;\n }\n\n int operator()(int x) const {\n return x * 2;\n }\n\nprivate:\n int counter_ = 0;\n};",
"token_estimate": 95,
"tokenized_korean_text": "class MdHeadingV 1 Chunker { public : MdHeadingV 1 Chunker ( ) = default ; ~ MdHeadingV 1 Chunker ( ) = default ; std : : string chunk _ doc ( const std : : string & doc ) { return doc ; } int operator ( ) ( int x ) const { return x * 2 ; } private : int counter _ = 0 ; };"
},
{
"block_ids": [
"8b9811387717d0bd4abf84abcc35b8b1"
],
"chunk_id": "d9326d252905b665b2adb9a416c20451",
"chunker_version": "code-cpp-ast-v1",
"doc_id": "fff1e1f0a7ff70ef682937470e5d1d28",
"heading_path": [],
"policy_hash": "71f3c07bb9ec1d09",
"source_spans": [
{
"kind": "code",
"lang": "cpp",
"line_end": 25,
"line_start": 22,
"symbol": "kebab::identity"
}
],
"text": "template <typename T>\nT identity(T value) {\n return value;\n}",
"token_estimate": 21,
"tokenized_korean_text": "template < typename T > T identity ( T value ) { return value ; }"
},
{
"block_ids": [
"1754cb6b971f6a4cb292f144a4f0570b"
],
"chunk_id": "56ee5f991de4a413c016da8dc4acfc35",
"chunker_version": "code-cpp-ast-v1",
"doc_id": "fff1e1f0a7ff70ef682937470e5d1d28",
"heading_path": [],
"policy_hash": "71f3c07bb9ec1d09",
"source_spans": [
{
"kind": "code",
"lang": "cpp",
"line_end": 29,
"line_start": 27,
"symbol": "kebab::global_helper"
}
],
"text": "void global_helper() {\n // free function in kebab namespace\n}",
"token_estimate": 22,
"tokenized_korean_text": "void global _ helper ( ) { / / free function in kebab namespace }"
},
{
"block_ids": [
"14b5f3393d6d25f822f5b70763d24acd"
],
"chunk_id": "c0d7c043cdd575c530db3909b54cc906",
"chunker_version": "code-cpp-ast-v1",
"doc_id": "fff1e1f0a7ff70ef682937470e5d1d28",
"heading_path": [],
"policy_hash": "71f3c07bb9ec1d09",
"source_spans": [
{
"kind": "code",
"lang": "cpp",
"line_end": 34,
"line_start": 31,
"symbol": "main"
}
],
"text": "int main() {\n kebab::chunk::MdHeadingV1Chunker c;\n return 0;\n}",
"token_estimate": 23,
"tokenized_korean_text": "int main ( ) { kebab : : chunk : : MdHeadingV 1 Chunker c ; return 0 ; }"
}
]

View File

@@ -0,0 +1,244 @@
[
{
"block_ids": [
"c182bf37e32c7fc1b868bd617f8eaf66"
],
"chunk_id": "43de518d946dc18ec040ae20d74e0cff",
"chunker_version": "code-go-ast-v1",
"doc_id": "83daba5fbb026e7a400d68a1c4bd36db",
"heading_path": [],
"policy_hash": "6cfe77abe2b0e5c3",
"source_spans": [
{
"kind": "code",
"lang": "go",
"line_end": 5,
"line_start": 1,
"symbol": "imports"
}
],
"text": "import (\n\t\"fmt\"\n\t\"os\"\n\t\"strings\"\n)",
"token_estimate": 12,
"tokenized_korean_text": "import ( \" fmt \" \" os \" \" strings \" )"
},
{
"block_ids": [
"c9992cdcfdf3c2a7700a4abc4782a8a4"
],
"chunk_id": "af4c382a83f1e8cdea495d8b33c11abc",
"chunker_version": "code-go-ast-v1",
"doc_id": "83daba5fbb026e7a400d68a1c4bd36db",
"heading_path": [],
"policy_hash": "6cfe77abe2b0e5c3",
"source_spans": [
{
"kind": "code",
"lang": "go",
"line_end": 12,
"line_start": 7,
"symbol": "ComputeMRR"
}
],
"text": "func ComputeMRR(scores []float64) float64 {\n\tif len(scores) == 0 {\n\t\treturn 0.0\n\t}\n\t_ = fmt.Sprintf(\"%v\", scores)\n\treturn 1.0 / float64(len(scores))\n}",
"token_estimate": 50,
"tokenized_korean_text": "func ComputeMRR ( scores [ ] float 64 ) float 64 { if len ( scores ) == 0 { return 0 . 0 } _ = fmt . Sprintf (\"% v \", scores ) return 1 . 0 / float 64 ( len ( scores ) ) }"
},
{
"block_ids": [
"5f18dc3e79fe946ba05d32c3bfc00684"
],
"chunk_id": "4be6d8f180bc19b8651877e5264852ac",
"chunker_version": "code-go-ast-v1",
"doc_id": "83daba5fbb026e7a400d68a1c4bd36db",
"heading_path": [],
"policy_hash": "6cfe77abe2b0e5c3",
"source_spans": [
{
"kind": "code",
"lang": "go",
"line_end": 20,
"line_start": 14,
"symbol": "MetricsCollector"
}
],
"text": "type MetricsCollector struct {\n\tScores []float64\n\tLabels []string\n\tCounts map[string]int\n\tTotals map[string]float64\n\tTags []string\n}",
"token_estimate": 45,
"tokenized_korean_text": "type MetricsCollector struct { Scores [ ] float 64 Labels [ ] string Counts map [ string ] int Totals map [ string ] float 64 Tags [ ] string }"
},
{
"block_ids": [
"3009cc022ca832c323393e4f9bcdb388"
],
"chunk_id": "3ae182f4c6d304ee7f0aaf447142f948",
"chunker_version": "code-go-ast-v1",
"doc_id": "83daba5fbb026e7a400d68a1c4bd36db",
"heading_path": [],
"policy_hash": "6cfe77abe2b0e5c3",
"source_spans": [
{
"kind": "code",
"lang": "go",
"line_end": 30,
"line_start": 22,
"symbol": "BaseEvaluator"
}
],
"text": "type BaseEvaluator struct {\n\tName string\n}\n\nfunc (e *BaseEvaluator) Evaluate(data []string) error {\n\t_ = os.Stderr\n\t_ = strings.Join(data, \",\")\n\treturn nil\n}",
"token_estimate": 53,
"tokenized_korean_text": "type BaseEvaluator struct { Name string } func ( e * BaseEvaluator ) Evaluate ( data [ ] string ) error { _ = os . Stderr _ = strings . Join ( data , \",\") return nil }"
},
{
"block_ids": [
"e0e83d1d7f9327a1902ae9a8f67c1f1c"
],
"chunk_id": "b962f14980e756bb8ba514e2282756cd",
"chunker_version": "code-go-ast-v1",
"doc_id": "83daba5fbb026e7a400d68a1c4bd36db",
"heading_path": [],
"policy_hash": "6cfe77abe2b0e5c3",
"source_spans": [
{
"kind": "code",
"lang": "go",
"line_end": 38,
"line_start": 32,
"symbol": "MetricsCollector.Run"
}
],
"text": "func (m *MetricsCollector) Run(inputs []float64) {\n\tfor _, inp := range inputs {\n\t\tm.Scores = append(\n\t\t\tm.Scores,\n\t\t\tinp,\n\t\t)\n\t}\n}",
"token_estimate": 44,
"tokenized_korean_text": "func ( m * MetricsCollector ) Run ( inputs [ ] float 64 ) { for _, inp := range inputs { m . Scores = append ( m . Scores , inp , ) } }"
},
{
"block_ids": [
"0e6a572bc3fe2bd6d173fe614bd1b763"
],
"chunk_id": "441c695e990e7f49188068433e313e87",
"chunker_version": "code-go-ast-v1",
"doc_id": "83daba5fbb026e7a400d68a1c4bd36db",
"heading_path": [],
"policy_hash": "6cfe77abe2b0e5c3",
"source_spans": [
{
"kind": "code",
"lang": "go",
"line_end": 46,
"line_start": 40,
"symbol": "MetricsCollector.Report"
}
],
"text": "func (m *MetricsCollector) Report() map[string]interface{} {\n\treturn map[string]interface{}{\n\t\t\"mean\": 0.0,\n\t\t\"count\": len(m.Scores),\n\t\t\"tags\": m.Tags,\n\t}\n}",
"token_estimate": 53,
"tokenized_korean_text": "func ( m * MetricsCollector ) Report ( ) map [ string ] interface {} { return map [ string ] interface {}{ \" mean \": 0 . 0 , \" count \": len ( m . Scores ) , \" tags \": m . Tags , } }"
},
{
"block_ids": [
"5d269745b2e5dbdcbef0c09ba54b0bd6"
],
"chunk_id": "7a942d871c588ec69426290561f05179",
"chunker_version": "code-go-ast-v1",
"doc_id": "83daba5fbb026e7a400d68a1c4bd36db",
"heading_path": [],
"policy_hash": "6cfe77abe2b0e5c3",
"source_spans": [
{
"kind": "code",
"lang": "go",
"line_end": 247,
"line_start": 48,
"symbol": "BigCompute [part 1/5]"
}
],
"text": "func BigCompute(data []int) int {\n\tv0 := 0\n\tif 0 < len(data) {\n\t\tv0 = data[0]\n\t}\n\tv1 := 0\n\tif 1 < len(data) {\n\t\tv1 = data[1]\n\t}\n\tv2 := 0\n\tif 2 < len(data) {\n\t\tv2 = data[2]\n\t}\n\tv3 := 0\n\tif 3 < len(data) {\n\t\tv3 = data[3]\n\t}\n\tv4 := 0\n\tif 4 < len(data) {\n\t\tv4 = data[4]\n\t}\n\tv5 := 0\n\tif 5 < len(data) {\n\t\tv5 = data[5]\n\t}\n\tv6 := 0\n\tif 6 < len(data) {\n\t\tv6 = data[6]\n\t}\n\tv7 := 0\n\tif 7 < len(data) {\n\t\tv7 = data[7]\n\t}\n\tv8 := 0\n\tif 8 < len(data) {\n\t\tv8 = data[8]\n\t}\n\tv9 := 0\n\tif 9 < len(data) {\n\t\tv9 = data[9]\n\t}\n\tv10 := 0\n\tif 10 < len(data) {\n\t\tv10 = data[10]\n\t}\n\tv11 := 0\n\tif 11 < len(data) {\n\t\tv11 = data[11]\n\t}\n\tv12 := 0\n\tif 12 < len(data) {\n\t\tv12 = data[12]\n\t}\n\tv13 := 0\n\tif 13 < len(data) {\n\t\tv13 = data[13]\n\t}\n\tv14 := 0\n\tif 14 < len(data) {\n\t\tv14 = data[14]\n\t}\n\tv15 := 0\n\tif 15 < len(data) {\n\t\tv15 = data[15]\n\t}\n\tv16 := 0\n\tif 16 < len(data) {\n\t\tv16 = data[16]\n\t}\n\tv17 := 0\n\tif 17 < len(data) {\n\t\tv17 = data[17]\n\t}\n\tv18 := 0\n\tif 18 < len(data) {\n\t\tv18 = data[18]\n\t}\n\tv19 := 0\n\tif 19 < len(data) {\n\t\tv19 = data[19]\n\t}\n\tv20 := 0\n\tif 20 < len(data) {\n\t\tv20 = data[20]\n\t}\n\tv21 := 0\n\tif 21 < len(data) {\n\t\tv21 = data[21]\n\t}\n\tv22 := 0\n\tif 22 < len(data) {\n\t\tv22 = data[22]\n\t}\n\tv23 := 0\n\tif 23 < len(data) {\n\t\tv23 = data[23]\n\t}\n\tv24 := 0\n\tif 24 < len(data) {\n\t\tv24 = data[24]\n\t}\n\tv25 := 0\n\tif 25 < len(data) {\n\t\tv25 = data[25]\n\t}\n\tv26 := 0\n\tif 26 < len(data) {\n\t\tv26 = data[26]\n\t}\n\tv27 := 0\n\tif 27 < len(data) {\n\t\tv27 = data[27]\n\t}\n\tv28 := 0\n\tif 28 < len(data) {\n\t\tv28 = data[28]\n\t}\n\tv29 := 0\n\tif 29 < len(data) {\n\t\tv29 = data[29]\n\t}\n\tv30 := 0\n\tif 30 < len(data) {\n\t\tv30 = data[30]\n\t}\n\tv31 := 0\n\tif 31 < len(data) {\n\t\tv31 = data[31]\n\t}\n\tv32 := 0\n\tif 32 < len(data) {\n\t\tv32 = data[32]\n\t}\n\tv33 := 0\n\tif 33 < len(data) {\n\t\tv33 = data[33]\n\t}\n\tv34 := 0\n\tif 34 < len(data) {\n\t\tv34 = data[34]\n\t}\n\tv35 := 0\n\tif 35 < len(data) {\n\t\tv35 = data[35]\n\t}\n\tv36 := 0\n\tif 36 < len(data) {\n\t\tv36 = data[36]\n\t}\n\tv37 := 0\n\tif 37 < len(data) {\n\t\tv37 = data[37]\n\t}\n\tv38 := 0\n\tif 38 < len(data) {\n\t\tv38 = data[38]\n\t}\n\tv39 := 0\n\tif 39 < len(data) {\n\t\tv39 = data[39]\n\t}\n\tv40 := 0\n\tif 40 < len(data) {\n\t\tv40 = data[40]\n\t}\n\tv41 := 0\n\tif 41 < len(data) {\n\t\tv41 = data[41]\n\t}\n\tv42 := 0\n\tif 42 < len(data) {\n\t\tv42 = data[42]\n\t}\n\tv43 := 0\n\tif 43 < len(data) {\n\t\tv43 = data[43]\n\t}\n\tv44 := 0\n\tif 44 < len(data) {\n\t\tv44 = data[44]\n\t}\n\tv45 := 0\n\tif 45 < len(data) {\n\t\tv45 = data[45]\n\t}\n\tv46 := 0\n\tif 46 < len(data) {\n\t\tv46 = data[46]\n\t}\n\tv47 := 0\n\tif 47 < len(data) {\n\t\tv47 = data[47]\n\t}\n\tv48 := 0\n\tif 48 < len(data) {\n\t\tv48 = data[48]\n\t}\n\tv49 := 0\n\tif 49 < len(data) {\n\t\tv49 = data[49]",
"token_estimate": 847,
"tokenized_korean_text": "func BigCompute ( data [ ] int ) int { v 0 := 0 if 0 < len ( data ) { v 0 = data [ 0 ] } v 1 := 0 if 1 < len ( data ) { v 1 = data [ 1 ] } v 2 := 0 if 2 < len ( data ) { v 2 = data [ 2 ] } v 3 := 0 if 3 < len ( data ) { v 3 = data [ 3 ] } v 4 := 0 if 4 < len ( data ) { v 4 = data [ 4 ] } v 5 := 0 if 5 < len ( data ) { v 5 = data [ 5 ] } v 6 := 0 if 6 < len ( data ) { v 6 = data [ 6 ] } v 7 := 0 if 7 < len ( data ) { v 7 = data [ 7 ] } v 8 := 0 if 8 < len ( data ) { v 8 = data [ 8 ] } v 9 := 0 if 9 < len ( data ) { v 9 = data [ 9 ] } v 10 := 0 if 10 < len ( data ) { v 10 = data [ 10 ] } v 11 := 0 if 11 < len ( data ) { v 11 = data [ 11 ] } v 12 := 0 if 12 < len ( data ) { v 12 = data [ 12 ] } v 13 := 0 if 13 < len ( data ) { v 13 = data [ 13 ] } v 14 := 0 if 14 < len ( data ) { v 14 = data [ 14 ] } v 15 := 0 if 15 < len ( data ) { v 15 = data [ 15 ] } v 16 := 0 if 16 < len ( data ) { v 16 = data [ 16 ] } v 17 := 0 if 17 < len ( data ) { v 17 = data [ 17 ] } v 18 := 0 if 18 < len ( data ) { v 18 = data [ 18 ] } v 19 := 0 if 19 < len ( data ) { v 19 = data [ 19 ] } v 20 := 0 if 20 < len ( data ) { v 20 = data [ 20 ] } v 21 := 0 if 21 < len ( data ) { v 21 = data [ 21 ] } v 22 := 0 if 22 < len ( data ) { v 22 = data [ 22 ] } v 23 := 0 if 23 < len ( data ) { v 23 = data [ 23 ] } v 24 := 0 if 24 < len ( data ) { v 24 = data [ 24 ] } v 25 := 0 if 25 < len ( data ) { v 25 = data [ 25 ] } v 26 := 0 if 26 < len ( data ) { v 26 = data [ 26 ] } v 27 := 0 if 27 < len ( data ) { v 27 = data [ 27 ] } v 28 := 0 if 28 < len ( data ) { v 28 = data [ 28 ] } v 29 := 0 if 29 < len ( data ) { v 29 = data [ 29 ] } v 30 := 0 if 30 < len ( data ) { v 30 = data [ 30 ] } v 31 := 0 if 31 < len ( data ) { v 31 = data [ 31 ] } v 32 := 0 if 32 < len ( data ) { v 32 = data [ 32 ] } v 33 := 0 if 33 < len ( data ) { v 33 = data [ 33 ] } v 34 := 0 if 34 < len ( data ) { v 34 = data [ 34 ] } v 35 := 0 if 35 < len ( data ) { v 35 = data [ 35 ] } v 36 := 0 if 36 < len ( data ) { v 36 = data [ 36 ] } v 37 := 0 if 37 < len ( data ) { v 37 = data [ 37 ] } v 38 := 0 if 38 < len ( data ) { v 38 = data [ 38 ] } v 39 := 0 if 39 < len ( data ) { v 39 = data [ 39 ] } v 40 := 0 if 40 < len ( data ) { v 40 = data [ 40 ] } v 41 := 0 if 41 < len ( data ) { v 41 = data [ 41 ] } v 42 := 0 if 42 < len ( data ) { v 42 = data [ 42 ] } v 43 := 0 if 43 < len ( data ) { v 43 = data [ 43 ] } v 44 := 0 if 44 < len ( data ) { v 44 = data [ 44 ] } v 45 := 0 if 45 < len ( data ) { v 45 = data [ 45 ] } v 46 := 0 if 46 < len ( data ) { v 46 = data [ 46 ] } v 47 := 0 if 47 < len ( data ) { v 47 = data [ 47 ] } v 48 := 0 if 48 < len ( data ) { v 48 = data [ 48 ] } v 49 := 0 if 49 < len ( data ) { v 49 = data [ 49 ]"
},
{
"block_ids": [
"5d269745b2e5dbdcbef0c09ba54b0bd6"
],
"chunk_id": "3f44ba43c9415652e2705bb667776e76",
"chunker_version": "code-go-ast-v1",
"doc_id": "83daba5fbb026e7a400d68a1c4bd36db",
"heading_path": [],
"policy_hash": "6cfe77abe2b0e5c3",
"source_spans": [
{
"kind": "code",
"lang": "go",
"line_end": 447,
"line_start": 248,
"symbol": "BigCompute [part 2/5]"
}
],
"text": "\t}\n\tv50 := 0\n\tif 50 < len(data) {\n\t\tv50 = data[50]\n\t}\n\tv51 := 0\n\tif 51 < len(data) {\n\t\tv51 = data[51]\n\t}\n\tv52 := 0\n\tif 52 < len(data) {\n\t\tv52 = data[52]\n\t}\n\tv53 := 0\n\tif 53 < len(data) {\n\t\tv53 = data[53]\n\t}\n\tv54 := 0\n\tif 54 < len(data) {\n\t\tv54 = data[54]\n\t}\n\tv55 := 0\n\tif 55 < len(data) {\n\t\tv55 = data[55]\n\t}\n\tv56 := 0\n\tif 56 < len(data) {\n\t\tv56 = data[56]\n\t}\n\tv57 := 0\n\tif 57 < len(data) {\n\t\tv57 = data[57]\n\t}\n\tv58 := 0\n\tif 58 < len(data) {\n\t\tv58 = data[58]\n\t}\n\tv59 := 0\n\tif 59 < len(data) {\n\t\tv59 = data[59]\n\t}\n\tv60 := 0\n\tif 60 < len(data) {\n\t\tv60 = data[60]\n\t}\n\tv61 := 0\n\tif 61 < len(data) {\n\t\tv61 = data[61]\n\t}\n\tv62 := 0\n\tif 62 < len(data) {\n\t\tv62 = data[62]\n\t}\n\tv63 := 0\n\tif 63 < len(data) {\n\t\tv63 = data[63]\n\t}\n\tv64 := 0\n\tif 64 < len(data) {\n\t\tv64 = data[64]\n\t}\n\tv65 := 0\n\tif 65 < len(data) {\n\t\tv65 = data[65]\n\t}\n\tv66 := 0\n\tif 66 < len(data) {\n\t\tv66 = data[66]\n\t}\n\tv67 := 0\n\tif 67 < len(data) {\n\t\tv67 = data[67]\n\t}\n\tv68 := 0\n\tif 68 < len(data) {\n\t\tv68 = data[68]\n\t}\n\tv69 := 0\n\tif 69 < len(data) {\n\t\tv69 = data[69]\n\t}\n\tv70 := 0\n\tif 70 < len(data) {\n\t\tv70 = data[70]\n\t}\n\tv71 := 0\n\tif 71 < len(data) {\n\t\tv71 = data[71]\n\t}\n\tv72 := 0\n\tif 72 < len(data) {\n\t\tv72 = data[72]\n\t}\n\tv73 := 0\n\tif 73 < len(data) {\n\t\tv73 = data[73]\n\t}\n\tv74 := 0\n\tif 74 < len(data) {\n\t\tv74 = data[74]\n\t}\n\tv75 := 0\n\tif 75 < len(data) {\n\t\tv75 = data[75]\n\t}\n\tv76 := 0\n\tif 76 < len(data) {\n\t\tv76 = data[76]\n\t}\n\tv77 := 0\n\tif 77 < len(data) {\n\t\tv77 = data[77]\n\t}\n\tv78 := 0\n\tif 78 < len(data) {\n\t\tv78 = data[78]\n\t}\n\tv79 := 0\n\tif 79 < len(data) {\n\t\tv79 = data[79]\n\t}\n\tv80 := 0\n\tif 80 < len(data) {\n\t\tv80 = data[80]\n\t}\n\tv81 := 0\n\tif 81 < len(data) {\n\t\tv81 = data[81]\n\t}\n\tv82 := 0\n\tif 82 < len(data) {\n\t\tv82 = data[82]\n\t}\n\tv83 := 0\n\tif 83 < len(data) {\n\t\tv83 = data[83]\n\t}\n\tv84 := 0\n\tif 84 < len(data) {\n\t\tv84 = data[84]\n\t}\n\tv85 := 0\n\tif 85 < len(data) {\n\t\tv85 = data[85]\n\t}\n\tv86 := 0\n\tif 86 < len(data) {\n\t\tv86 = data[86]\n\t}\n\tv87 := 0\n\tif 87 < len(data) {\n\t\tv87 = data[87]\n\t}\n\tv88 := 0\n\tif 88 < len(data) {\n\t\tv88 = data[88]\n\t}\n\tv89 := 0\n\tif 89 < len(data) {\n\t\tv89 = data[89]\n\t}\n\tv90 := 0\n\tif 90 < len(data) {\n\t\tv90 = data[90]\n\t}\n\tv91 := 0\n\tif 91 < len(data) {\n\t\tv91 = data[91]\n\t}\n\tv92 := 0\n\tif 92 < len(data) {\n\t\tv92 = data[92]\n\t}\n\tv93 := 0\n\tif 93 < len(data) {\n\t\tv93 = data[93]\n\t}\n\tv94 := 0\n\tif 94 < len(data) {\n\t\tv94 = data[94]\n\t}\n\tv95 := 0\n\tif 95 < len(data) {\n\t\tv95 = data[95]\n\t}\n\tv96 := 0\n\tif 96 < len(data) {\n\t\tv96 = data[96]\n\t}\n\tv97 := 0\n\tif 97 < len(data) {\n\t\tv97 = data[97]\n\t}\n\tv98 := 0\n\tif 98 < len(data) {\n\t\tv98 = data[98]\n\t}\n\tv99 := 0\n\tif 99 < len(data) {\n\t\tv99 = data[99]",
"token_estimate": 850,
"tokenized_korean_text": "} v 50 := 0 if 50 < len ( data ) { v 50 = data [ 50 ] } v 51 := 0 if 51 < len ( data ) { v 51 = data [ 51 ] } v 52 := 0 if 52 < len ( data ) { v 52 = data [ 52 ] } v 53 := 0 if 53 < len ( data ) { v 53 = data [ 53 ] } v 54 := 0 if 54 < len ( data ) { v 54 = data [ 54 ] } v 55 := 0 if 55 < len ( data ) { v 55 = data [ 55 ] } v 56 := 0 if 56 < len ( data ) { v 56 = data [ 56 ] } v 57 := 0 if 57 < len ( data ) { v 57 = data [ 57 ] } v 58 := 0 if 58 < len ( data ) { v 58 = data [ 58 ] } v 59 := 0 if 59 < len ( data ) { v 59 = data [ 59 ] } v 60 := 0 if 60 < len ( data ) { v 60 = data [ 60 ] } v 61 := 0 if 61 < len ( data ) { v 61 = data [ 61 ] } v 62 := 0 if 62 < len ( data ) { v 62 = data [ 62 ] } v 63 := 0 if 63 < len ( data ) { v 63 = data [ 63 ] } v 64 := 0 if 64 < len ( data ) { v 64 = data [ 64 ] } v 65 := 0 if 65 < len ( data ) { v 65 = data [ 65 ] } v 66 := 0 if 66 < len ( data ) { v 66 = data [ 66 ] } v 67 := 0 if 67 < len ( data ) { v 67 = data [ 67 ] } v 68 := 0 if 68 < len ( data ) { v 68 = data [ 68 ] } v 69 := 0 if 69 < len ( data ) { v 69 = data [ 69 ] } v 70 := 0 if 70 < len ( data ) { v 70 = data [ 70 ] } v 71 := 0 if 71 < len ( data ) { v 71 = data [ 71 ] } v 72 := 0 if 72 < len ( data ) { v 72 = data [ 72 ] } v 73 := 0 if 73 < len ( data ) { v 73 = data [ 73 ] } v 74 := 0 if 74 < len ( data ) { v 74 = data [ 74 ] } v 75 := 0 if 75 < len ( data ) { v 75 = data [ 75 ] } v 76 := 0 if 76 < len ( data ) { v 76 = data [ 76 ] } v 77 := 0 if 77 < len ( data ) { v 77 = data [ 77 ] } v 78 := 0 if 78 < len ( data ) { v 78 = data [ 78 ] } v 79 := 0 if 79 < len ( data ) { v 79 = data [ 79 ] } v 80 := 0 if 80 < len ( data ) { v 80 = data [ 80 ] } v 81 := 0 if 81 < len ( data ) { v 81 = data [ 81 ] } v 82 := 0 if 82 < len ( data ) { v 82 = data [ 82 ] } v 83 := 0 if 83 < len ( data ) { v 83 = data [ 83 ] } v 84 := 0 if 84 < len ( data ) { v 84 = data [ 84 ] } v 85 := 0 if 85 < len ( data ) { v 85 = data [ 85 ] } v 86 := 0 if 86 < len ( data ) { v 86 = data [ 86 ] } v 87 := 0 if 87 < len ( data ) { v 87 = data [ 87 ] } v 88 := 0 if 88 < len ( data ) { v 88 = data [ 88 ] } v 89 := 0 if 89 < len ( data ) { v 89 = data [ 89 ] } v 90 := 0 if 90 < len ( data ) { v 90 = data [ 90 ] } v 91 := 0 if 91 < len ( data ) { v 91 = data [ 91 ] } v 92 := 0 if 92 < len ( data ) { v 92 = data [ 92 ] } v 93 := 0 if 93 < len ( data ) { v 93 = data [ 93 ] } v 94 := 0 if 94 < len ( data ) { v 94 = data [ 94 ] } v 95 := 0 if 95 < len ( data ) { v 95 = data [ 95 ] } v 96 := 0 if 96 < len ( data ) { v 96 = data [ 96 ] } v 97 := 0 if 97 < len ( data ) { v 97 = data [ 97 ] } v 98 := 0 if 98 < len ( data ) { v 98 = data [ 98 ] } v 99 := 0 if 99 < len ( data ) { v 99 = data [ 99 ]"
},
{
"block_ids": [
"5d269745b2e5dbdcbef0c09ba54b0bd6"
],
"chunk_id": "e4763e10f059d97f40c2932761b56c3e",
"chunker_version": "code-go-ast-v1",
"doc_id": "83daba5fbb026e7a400d68a1c4bd36db",
"heading_path": [],
"policy_hash": "6cfe77abe2b0e5c3",
"source_spans": [
{
"kind": "code",
"lang": "go",
"line_end": 647,
"line_start": 448,
"symbol": "BigCompute [part 3/5]"
}
],
"text": "\t}\n\tv100 := 0\n\tif 100 < len(data) {\n\t\tv100 = data[100]\n\t}\n\tv101 := 0\n\tif 101 < len(data) {\n\t\tv101 = data[101]\n\t}\n\tv102 := 0\n\tif 102 < len(data) {\n\t\tv102 = data[102]\n\t}\n\tv103 := 0\n\tif 103 < len(data) {\n\t\tv103 = data[103]\n\t}\n\tv104 := 0\n\tif 104 < len(data) {\n\t\tv104 = data[104]\n\t}\n\tv105 := 0\n\tif 105 < len(data) {\n\t\tv105 = data[105]\n\t}\n\tv106 := 0\n\tif 106 < len(data) {\n\t\tv106 = data[106]\n\t}\n\tv107 := 0\n\tif 107 < len(data) {\n\t\tv107 = data[107]\n\t}\n\tv108 := 0\n\tif 108 < len(data) {\n\t\tv108 = data[108]\n\t}\n\tv109 := 0\n\tif 109 < len(data) {\n\t\tv109 = data[109]\n\t}\n\tv110 := 0\n\tif 110 < len(data) {\n\t\tv110 = data[110]\n\t}\n\tv111 := 0\n\tif 111 < len(data) {\n\t\tv111 = data[111]\n\t}\n\tv112 := 0\n\tif 112 < len(data) {\n\t\tv112 = data[112]\n\t}\n\tv113 := 0\n\tif 113 < len(data) {\n\t\tv113 = data[113]\n\t}\n\tv114 := 0\n\tif 114 < len(data) {\n\t\tv114 = data[114]\n\t}\n\tv115 := 0\n\tif 115 < len(data) {\n\t\tv115 = data[115]\n\t}\n\tv116 := 0\n\tif 116 < len(data) {\n\t\tv116 = data[116]\n\t}\n\tv117 := 0\n\tif 117 < len(data) {\n\t\tv117 = data[117]\n\t}\n\tv118 := 0\n\tif 118 < len(data) {\n\t\tv118 = data[118]\n\t}\n\tv119 := 0\n\tif 119 < len(data) {\n\t\tv119 = data[119]\n\t}\n\tv120 := 0\n\tif 120 < len(data) {\n\t\tv120 = data[120]\n\t}\n\tv121 := 0\n\tif 121 < len(data) {\n\t\tv121 = data[121]\n\t}\n\tv122 := 0\n\tif 122 < len(data) {\n\t\tv122 = data[122]\n\t}\n\tv123 := 0\n\tif 123 < len(data) {\n\t\tv123 = data[123]\n\t}\n\tv124 := 0\n\tif 124 < len(data) {\n\t\tv124 = data[124]\n\t}\n\tv125 := 0\n\tif 125 < len(data) {\n\t\tv125 = data[125]\n\t}\n\tv126 := 0\n\tif 126 < len(data) {\n\t\tv126 = data[126]\n\t}\n\tv127 := 0\n\tif 127 < len(data) {\n\t\tv127 = data[127]\n\t}\n\tv128 := 0\n\tif 128 < len(data) {\n\t\tv128 = data[128]\n\t}\n\tv129 := 0\n\tif 129 < len(data) {\n\t\tv129 = data[129]\n\t}\n\tv130 := 0\n\tif 130 < len(data) {\n\t\tv130 = data[130]\n\t}\n\tv131 := 0\n\tif 131 < len(data) {\n\t\tv131 = data[131]\n\t}\n\tv132 := 0\n\tif 132 < len(data) {\n\t\tv132 = data[132]\n\t}\n\tv133 := 0\n\tif 133 < len(data) {\n\t\tv133 = data[133]\n\t}\n\tv134 := 0\n\tif 134 < len(data) {\n\t\tv134 = data[134]\n\t}\n\tv135 := 0\n\tif 135 < len(data) {\n\t\tv135 = data[135]\n\t}\n\tv136 := 0\n\tif 136 < len(data) {\n\t\tv136 = data[136]\n\t}\n\tv137 := 0\n\tif 137 < len(data) {\n\t\tv137 = data[137]\n\t}\n\tv138 := 0\n\tif 138 < len(data) {\n\t\tv138 = data[138]\n\t}\n\tv139 := 0\n\tif 139 < len(data) {\n\t\tv139 = data[139]\n\t}\n\tv140 := 0\n\tif 140 < len(data) {\n\t\tv140 = data[140]\n\t}\n\tv141 := 0\n\tif 141 < len(data) {\n\t\tv141 = data[141]\n\t}\n\tv142 := 0\n\tif 142 < len(data) {\n\t\tv142 = data[142]\n\t}\n\tv143 := 0\n\tif 143 < len(data) {\n\t\tv143 = data[143]\n\t}\n\tv144 := 0\n\tif 144 < len(data) {\n\t\tv144 = data[144]\n\t}\n\tv145 := 0\n\tif 145 < len(data) {\n\t\tv145 = data[145]\n\t}\n\tv146 := 0\n\tif 146 < len(data) {\n\t\tv146 = data[146]\n\t}\n\tv147 := 0\n\tif 147 < len(data) {\n\t\tv147 = data[147]\n\t}\n\tv148 := 0\n\tif 148 < len(data) {\n\t\tv148 = data[148]\n\t}\n\tv149 := 0\n\tif 149 < len(data) {\n\t\tv149 = data[149]",
"token_estimate": 917,
"tokenized_korean_text": "} v 100 := 0 if 100 < len ( data ) { v 100 = data [ 100 ] } v 101 := 0 if 101 < len ( data ) { v 101 = data [ 101 ] } v 102 := 0 if 102 < len ( data ) { v 102 = data [ 102 ] } v 103 := 0 if 103 < len ( data ) { v 103 = data [ 103 ] } v 104 := 0 if 104 < len ( data ) { v 104 = data [ 104 ] } v 105 := 0 if 105 < len ( data ) { v 105 = data [ 105 ] } v 106 := 0 if 106 < len ( data ) { v 106 = data [ 106 ] } v 107 := 0 if 107 < len ( data ) { v 107 = data [ 107 ] } v 108 := 0 if 108 < len ( data ) { v 108 = data [ 108 ] } v 109 := 0 if 109 < len ( data ) { v 109 = data [ 109 ] } v 110 := 0 if 110 < len ( data ) { v 110 = data [ 110 ] } v 111 := 0 if 111 < len ( data ) { v 111 = data [ 111 ] } v 112 := 0 if 112 < len ( data ) { v 112 = data [ 112 ] } v 113 := 0 if 113 < len ( data ) { v 113 = data [ 113 ] } v 114 := 0 if 114 < len ( data ) { v 114 = data [ 114 ] } v 115 := 0 if 115 < len ( data ) { v 115 = data [ 115 ] } v 116 := 0 if 116 < len ( data ) { v 116 = data [ 116 ] } v 117 := 0 if 117 < len ( data ) { v 117 = data [ 117 ] } v 118 := 0 if 118 < len ( data ) { v 118 = data [ 118 ] } v 119 := 0 if 119 < len ( data ) { v 119 = data [ 119 ] } v 120 := 0 if 120 < len ( data ) { v 120 = data [ 120 ] } v 121 := 0 if 121 < len ( data ) { v 121 = data [ 121 ] } v 122 := 0 if 122 < len ( data ) { v 122 = data [ 122 ] } v 123 := 0 if 123 < len ( data ) { v 123 = data [ 123 ] } v 124 := 0 if 124 < len ( data ) { v 124 = data [ 124 ] } v 125 := 0 if 125 < len ( data ) { v 125 = data [ 125 ] } v 126 := 0 if 126 < len ( data ) { v 126 = data [ 126 ] } v 127 := 0 if 127 < len ( data ) { v 127 = data [ 127 ] } v 128 := 0 if 128 < len ( data ) { v 128 = data [ 128 ] } v 129 := 0 if 129 < len ( data ) { v 129 = data [ 129 ] } v 130 := 0 if 130 < len ( data ) { v 130 = data [ 130 ] } v 131 := 0 if 131 < len ( data ) { v 131 = data [ 131 ] } v 132 := 0 if 132 < len ( data ) { v 132 = data [ 132 ] } v 133 := 0 if 133 < len ( data ) { v 133 = data [ 133 ] } v 134 := 0 if 134 < len ( data ) { v 134 = data [ 134 ] } v 135 := 0 if 135 < len ( data ) { v 135 = data [ 135 ] } v 136 := 0 if 136 < len ( data ) { v 136 = data [ 136 ] } v 137 := 0 if 137 < len ( data ) { v 137 = data [ 137 ] } v 138 := 0 if 138 < len ( data ) { v 138 = data [ 138 ] } v 139 := 0 if 139 < len ( data ) { v 139 = data [ 139 ] } v 140 := 0 if 140 < len ( data ) { v 140 = data [ 140 ] } v 141 := 0 if 141 < len ( data ) { v 141 = data [ 141 ] } v 142 := 0 if 142 < len ( data ) { v 142 = data [ 142 ] } v 143 := 0 if 143 < len ( data ) { v 143 = data [ 143 ] } v 144 := 0 if 144 < len ( data ) { v 144 = data [ 144 ] } v 145 := 0 if 145 < len ( data ) { v 145 = data [ 145 ] } v 146 := 0 if 146 < len ( data ) { v 146 = data [ 146 ] } v 147 := 0 if 147 < len ( data ) { v 147 = data [ 147 ] } v 148 := 0 if 148 < len ( data ) { v 148 = data [ 148 ] } v 149 := 0 if 149 < len ( data ) { v 149 = data [ 149 ]"
},
{
"block_ids": [
"5d269745b2e5dbdcbef0c09ba54b0bd6"
],
"chunk_id": "24176c911d0bacf9a29fa7f8251f5036",
"chunker_version": "code-go-ast-v1",
"doc_id": "83daba5fbb026e7a400d68a1c4bd36db",
"heading_path": [],
"policy_hash": "6cfe77abe2b0e5c3",
"source_spans": [
{
"kind": "code",
"lang": "go",
"line_end": 847,
"line_start": 648,
"symbol": "BigCompute [part 4/5]"
}
],
"text": "\t}\n\tv150 := 0\n\tif 150 < len(data) {\n\t\tv150 = data[150]\n\t}\n\tv151 := 0\n\tif 151 < len(data) {\n\t\tv151 = data[151]\n\t}\n\tv152 := 0\n\tif 152 < len(data) {\n\t\tv152 = data[152]\n\t}\n\tv153 := 0\n\tif 153 < len(data) {\n\t\tv153 = data[153]\n\t}\n\tv154 := 0\n\tif 154 < len(data) {\n\t\tv154 = data[154]\n\t}\n\tv155 := 0\n\tif 155 < len(data) {\n\t\tv155 = data[155]\n\t}\n\tv156 := 0\n\tif 156 < len(data) {\n\t\tv156 = data[156]\n\t}\n\tv157 := 0\n\tif 157 < len(data) {\n\t\tv157 = data[157]\n\t}\n\tv158 := 0\n\tif 158 < len(data) {\n\t\tv158 = data[158]\n\t}\n\tv159 := 0\n\tif 159 < len(data) {\n\t\tv159 = data[159]\n\t}\n\tv160 := 0\n\tif 160 < len(data) {\n\t\tv160 = data[160]\n\t}\n\tv161 := 0\n\tif 161 < len(data) {\n\t\tv161 = data[161]\n\t}\n\tv162 := 0\n\tif 162 < len(data) {\n\t\tv162 = data[162]\n\t}\n\tv163 := 0\n\tif 163 < len(data) {\n\t\tv163 = data[163]\n\t}\n\tv164 := 0\n\tif 164 < len(data) {\n\t\tv164 = data[164]\n\t}\n\tv165 := 0\n\tif 165 < len(data) {\n\t\tv165 = data[165]\n\t}\n\tv166 := 0\n\tif 166 < len(data) {\n\t\tv166 = data[166]\n\t}\n\tv167 := 0\n\tif 167 < len(data) {\n\t\tv167 = data[167]\n\t}\n\tv168 := 0\n\tif 168 < len(data) {\n\t\tv168 = data[168]\n\t}\n\tv169 := 0\n\tif 169 < len(data) {\n\t\tv169 = data[169]\n\t}\n\tv170 := 0\n\tif 170 < len(data) {\n\t\tv170 = data[170]\n\t}\n\tv171 := 0\n\tif 171 < len(data) {\n\t\tv171 = data[171]\n\t}\n\tv172 := 0\n\tif 172 < len(data) {\n\t\tv172 = data[172]\n\t}\n\tv173 := 0\n\tif 173 < len(data) {\n\t\tv173 = data[173]\n\t}\n\tv174 := 0\n\tif 174 < len(data) {\n\t\tv174 = data[174]\n\t}\n\tv175 := 0\n\tif 175 < len(data) {\n\t\tv175 = data[175]\n\t}\n\tv176 := 0\n\tif 176 < len(data) {\n\t\tv176 = data[176]\n\t}\n\tv177 := 0\n\tif 177 < len(data) {\n\t\tv177 = data[177]\n\t}\n\tv178 := 0\n\tif 178 < len(data) {\n\t\tv178 = data[178]\n\t}\n\tv179 := 0\n\tif 179 < len(data) {\n\t\tv179 = data[179]\n\t}\n\tv180 := 0\n\tif 180 < len(data) {\n\t\tv180 = data[180]\n\t}\n\tv181 := 0\n\tif 181 < len(data) {\n\t\tv181 = data[181]\n\t}\n\tv182 := 0\n\tif 182 < len(data) {\n\t\tv182 = data[182]\n\t}\n\tv183 := 0\n\tif 183 < len(data) {\n\t\tv183 = data[183]\n\t}\n\tv184 := 0\n\tif 184 < len(data) {\n\t\tv184 = data[184]\n\t}\n\tv185 := 0\n\tif 185 < len(data) {\n\t\tv185 = data[185]\n\t}\n\tv186 := 0\n\tif 186 < len(data) {\n\t\tv186 = data[186]\n\t}\n\tv187 := 0\n\tif 187 < len(data) {\n\t\tv187 = data[187]\n\t}\n\tv188 := 0\n\tif 188 < len(data) {\n\t\tv188 = data[188]\n\t}\n\tv189 := 0\n\tif 189 < len(data) {\n\t\tv189 = data[189]\n\t}\n\tv190 := 0\n\tif 190 < len(data) {\n\t\tv190 = data[190]\n\t}\n\tv191 := 0\n\tif 191 < len(data) {\n\t\tv191 = data[191]\n\t}\n\tv192 := 0\n\tif 192 < len(data) {\n\t\tv192 = data[192]\n\t}\n\tv193 := 0\n\tif 193 < len(data) {\n\t\tv193 = data[193]\n\t}\n\tv194 := 0\n\tif 194 < len(data) {\n\t\tv194 = data[194]\n\t}\n\tv195 := 0\n\tif 195 < len(data) {\n\t\tv195 = data[195]\n\t}\n\tv196 := 0\n\tif 196 < len(data) {\n\t\tv196 = data[196]\n\t}\n\tv197 := 0\n\tif 197 < len(data) {\n\t\tv197 = data[197]\n\t}\n\tv198 := 0\n\tif 198 < len(data) {\n\t\tv198 = data[198]\n\t}\n\tv199 := 0\n\tif 199 < len(data) {\n\t\tv199 = data[199]",
"token_estimate": 917,
"tokenized_korean_text": "} v 150 := 0 if 150 < len ( data ) { v 150 = data [ 150 ] } v 151 := 0 if 151 < len ( data ) { v 151 = data [ 151 ] } v 152 := 0 if 152 < len ( data ) { v 152 = data [ 152 ] } v 153 := 0 if 153 < len ( data ) { v 153 = data [ 153 ] } v 154 := 0 if 154 < len ( data ) { v 154 = data [ 154 ] } v 155 := 0 if 155 < len ( data ) { v 155 = data [ 155 ] } v 156 := 0 if 156 < len ( data ) { v 156 = data [ 156 ] } v 157 := 0 if 157 < len ( data ) { v 157 = data [ 157 ] } v 158 := 0 if 158 < len ( data ) { v 158 = data [ 158 ] } v 159 := 0 if 159 < len ( data ) { v 159 = data [ 159 ] } v 160 := 0 if 160 < len ( data ) { v 160 = data [ 160 ] } v 161 := 0 if 161 < len ( data ) { v 161 = data [ 161 ] } v 162 := 0 if 162 < len ( data ) { v 162 = data [ 162 ] } v 163 := 0 if 163 < len ( data ) { v 163 = data [ 163 ] } v 164 := 0 if 164 < len ( data ) { v 164 = data [ 164 ] } v 165 := 0 if 165 < len ( data ) { v 165 = data [ 165 ] } v 166 := 0 if 166 < len ( data ) { v 166 = data [ 166 ] } v 167 := 0 if 167 < len ( data ) { v 167 = data [ 167 ] } v 168 := 0 if 168 < len ( data ) { v 168 = data [ 168 ] } v 169 := 0 if 169 < len ( data ) { v 169 = data [ 169 ] } v 170 := 0 if 170 < len ( data ) { v 170 = data [ 170 ] } v 171 := 0 if 171 < len ( data ) { v 171 = data [ 171 ] } v 172 := 0 if 172 < len ( data ) { v 172 = data [ 172 ] } v 173 := 0 if 173 < len ( data ) { v 173 = data [ 173 ] } v 174 := 0 if 174 < len ( data ) { v 174 = data [ 174 ] } v 175 := 0 if 175 < len ( data ) { v 175 = data [ 175 ] } v 176 := 0 if 176 < len ( data ) { v 176 = data [ 176 ] } v 177 := 0 if 177 < len ( data ) { v 177 = data [ 177 ] } v 178 := 0 if 178 < len ( data ) { v 178 = data [ 178 ] } v 179 := 0 if 179 < len ( data ) { v 179 = data [ 179 ] } v 180 := 0 if 180 < len ( data ) { v 180 = data [ 180 ] } v 181 := 0 if 181 < len ( data ) { v 181 = data [ 181 ] } v 182 := 0 if 182 < len ( data ) { v 182 = data [ 182 ] } v 183 := 0 if 183 < len ( data ) { v 183 = data [ 183 ] } v 184 := 0 if 184 < len ( data ) { v 184 = data [ 184 ] } v 185 := 0 if 185 < len ( data ) { v 185 = data [ 185 ] } v 186 := 0 if 186 < len ( data ) { v 186 = data [ 186 ] } v 187 := 0 if 187 < len ( data ) { v 187 = data [ 187 ] } v 188 := 0 if 188 < len ( data ) { v 188 = data [ 188 ] } v 189 := 0 if 189 < len ( data ) { v 189 = data [ 189 ] } v 190 := 0 if 190 < len ( data ) { v 190 = data [ 190 ] } v 191 := 0 if 191 < len ( data ) { v 191 = data [ 191 ] } v 192 := 0 if 192 < len ( data ) { v 192 = data [ 192 ] } v 193 := 0 if 193 < len ( data ) { v 193 = data [ 193 ] } v 194 := 0 if 194 < len ( data ) { v 194 = data [ 194 ] } v 195 := 0 if 195 < len ( data ) { v 195 = data [ 195 ] } v 196 := 0 if 196 < len ( data ) { v 196 = data [ 196 ] } v 197 := 0 if 197 < len ( data ) { v 197 = data [ 197 ] } v 198 := 0 if 198 < len ( data ) { v 198 = data [ 198 ] } v 199 := 0 if 199 < len ( data ) { v 199 = data [ 199 ]"
},
{
"block_ids": [
"5d269745b2e5dbdcbef0c09ba54b0bd6"
],
"chunk_id": "438127626378632c03780d10603de32c",
"chunker_version": "code-go-ast-v1",
"doc_id": "83daba5fbb026e7a400d68a1c4bd36db",
"heading_path": [],
"policy_hash": "6cfe77abe2b0e5c3",
"source_spans": [
{
"kind": "code",
"lang": "go",
"line_end": 890,
"line_start": 848,
"symbol": "BigCompute [part 5/5]"
}
],
"text": "\t}\n\tv200 := 0\n\tif 200 < len(data) {\n\t\tv200 = data[200]\n\t}\n\tv201 := 0\n\tif 201 < len(data) {\n\t\tv201 = data[201]\n\t}\n\tv202 := 0\n\tif 202 < len(data) {\n\t\tv202 = data[202]\n\t}\n\tv203 := 0\n\tif 203 < len(data) {\n\t\tv203 = data[203]\n\t}\n\tv204 := 0\n\tif 204 < len(data) {\n\t\tv204 = data[204]\n\t}\n\tv205 := 0\n\tif 205 < len(data) {\n\t\tv205 = data[205]\n\t}\n\tv206 := 0\n\tif 206 < len(data) {\n\t\tv206 = data[206]\n\t}\n\tv207 := 0\n\tif 207 < len(data) {\n\t\tv207 = data[207]\n\t}\n\tv208 := 0\n\tif 208 < len(data) {\n\t\tv208 = data[208]\n\t}\n\tv209 := 0\n\tif 209 < len(data) {\n\t\tv209 = data[209]\n\t}\n\treturn len(data)\n}",
"token_estimate": 191,
"tokenized_korean_text": "} v 200 := 0 if 200 < len ( data ) { v 200 = data [ 200 ] } v 201 := 0 if 201 < len ( data ) { v 201 = data [ 201 ] } v 202 := 0 if 202 < len ( data ) { v 202 = data [ 202 ] } v 203 := 0 if 203 < len ( data ) { v 203 = data [ 203 ] } v 204 := 0 if 204 < len ( data ) { v 204 = data [ 204 ] } v 205 := 0 if 205 < len ( data ) { v 205 = data [ 205 ] } v 206 := 0 if 206 < len ( data ) { v 206 = data [ 206 ] } v 207 := 0 if 207 < len ( data ) { v 207 = data [ 207 ] } v 208 := 0 if 208 < len ( data ) { v 208 = data [ 208 ] } v 209 := 0 if 209 < len ( data ) { v 209 = data [ 209 ] } return len ( data ) }"
}
]

File diff suppressed because one or more lines are too long

File diff suppressed because one or more lines are too long

File diff suppressed because one or more lines are too long

File diff suppressed because one or more lines are too long

File diff suppressed because one or more lines are too long

View File

@@ -0,0 +1,33 @@
#include <stdio.h>
#include <stdlib.h>
#define MAX_BUF 4096
typedef enum {
OK = 0,
ERR_PARSE,
ERR_IO,
} status_t;
typedef struct {
int id;
char name[64];
status_t status;
} record_t;
static int counter = 0;
int parse_record(const char *line, record_t *out) {
if (line == NULL || out == NULL) return ERR_PARSE;
return OK;
}
void print_record(const record_t *r) {
printf("[%d] %s (status=%d)\n", r->id, r->name, r->status);
}
int main(void) {
record_t r = { .id = 1, .name = "foo", .status = OK };
print_record(&r);
return 0;
}

View File

@@ -0,0 +1,40 @@
#include <string>
#include <vector>
namespace kebab {
namespace chunk {
class MdHeadingV1Chunker {
public:
MdHeadingV1Chunker() = default;
~MdHeadingV1Chunker() = default;
std::string chunk_doc(const std::string& doc) {
return doc;
}
int operator()(int x) const {
return x * 2;
}
private:
int counter_ = 0;
};
template <typename T>
T identity(T value) {
return value;
}
} // namespace chunk
void global_helper() {
// free function in kebab namespace
}
} // namespace kebab
int main() {
kebab::chunk::MdHeadingV1Chunker c;
return 0;
}

View File

@@ -0,0 +1,5 @@
FROM rust:1.94-slim AS builder
WORKDIR /app
COPY . .
RUN cargo build --release
CMD ["/app/target/release/kebab"]

View File

@@ -0,0 +1,7 @@
[package]
name = "demo"
version = "0.1.0"
edition = "2021"
[dependencies]
serde = "1"

Some files were not shown because too many files have changed in this diff Show More