Compare commits

201 Commits

Author SHA1 Message Date
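8dee610a97 docs(hotfixes): arctic end-to-end dogfooding evidence (recall@10 130/132)
Re-indexed namu with the real kebab v0.26.0 pipeline (ollama arctic) → the expanded golden eval
reproduces recall@10 130/132, recall@50 132/132, and fully_consistent 22/24 end to end. Triple
confirmation: measurement → implementation → real pipeline. Satisfies the pre-release dogfooding
trigger (embedder model change).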

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-03 07:19:19 +00:00
d71ed2516b Merge pull request 'feat(embed): arctic-embed-l-v2.0 embedder (candle+ollama)' (#203) from feat/arctic-embedder into main
Reviewed-on: #203
2026-06-03 06:27:55 +00:00
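095c9f37a2 docs(smoke): sync embedding config block to v0.26.0
The [models.embedding] example comments in SMOKE.md were stale: ollama was missing from the
provider list, "candle supports only e5-large" was no longer true once arctic was added, and
endpoint/arctic were undocumented. Fixed per the CLAUDE.md rule §"README Configuration + SMOKE
config 블록 동시 갱신" (update the README Configuration and SMOKE config blocks together):
4 providers, the arctic model (candle/ollama tags), endpoint (ollama only, falls back to the
llm endpoint), and an e5↔arctic cascade note.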

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-03 05:11:03 +00:00
16ddb1dfc3 docs: sync arctic embedder docs (README/ARCHITECTURE/HANDOFF/HOTFIXES)
README Configuration: provider candle/ollama + arctic model (candle CLS / ollama tag)
+ endpoint + e5→arctic cascade warning. ARCHITECTURE: backend graph node (embedollama)
+ embedding backend decision table (adoption rationale: measured recall@10 130) + directory tree.
HANDOFF: one line. HOTFIXES: 2026-06-03 arctic dated entry (registry/pooling/prefix/cascade +
manually measured cosine 0.999984).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-03 04:59:23 +00:00
72c99c452c feat(config,app): wire up embedding provider=ollama + endpoint, version 0.26.0
kebab-config: EmbeddingModelCfg.endpoint: Option<String> (serde default, for ollama,
None → falls back to models.llm.endpoint) + ollama added to the provider docs + env
KEBAB_MODELS_EMBEDDING_ENDPOINT. kebab-app embedder(): ollama branch added to the provider
match (via the facade). workspace member += kebab-embed-ollama, app dep added.
version 0.25.0 → 0.26.0 (minor, +Cargo.lock): a new embedding backend/model is a
surface-change trigger under CLAUDE.md §Release.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-03 04:59:23 +00:00
cbcae69abf feat(embed): candle model registry + arctic-embed-l-v2.0 (CLS pooling)
Hard-coded e5 (HF_MODEL/SUPPORTED_MODEL/mean/query:+passage:) → model registry
EmbedModelSpec{name,hf_repo,pooling,query_prefix,doc_prefix,dim,version_tag}.
e5 (mean, query:/passage:) + arctic (CLS, query:/no document prefix). Pooling branches per model
(mean=attention-mask-weighted / CLS=hidden[:,0,:]); tokenize/forward/L2 are shared.
arctic pooling=CLS was confirmed from HF 1_Pooling/config.json (pooling_mode_cls_token:true).
model_version gets a +arctic-cls tag for arctic (triggers the embedding_version cascade);
e5 keeps the plain config.version for fastembed-e5 compatibility (NUMA drop-in).

correctness gate: tests/arctic_ollama_parity.rs (#[ignore], live Ollama):
candle arctic vs Ollama snowflake-arctic-embed2, per-sentence cosine > 0.99.
Manually measured cosine_min=0.999984 (guarantees reproduction of recall@10 130).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-03 04:59:11 +00:00
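The per-model pooling split described in cbcae69abf can be illustrated with a minimal sketch. It assumes hidden states are plain `Vec<f32>` rows rather than candle tensors, and the `Pooling` enum, `pool`, and `l2_normalize` names are hypothetical stand-ins, not the crate's actual API.

```rust
/// Pooling strategy selected per model (hypothetical mirror of the `pooling` registry field).
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum Pooling {
    Mean,
    Cls,
}

/// Pool one sequence of token hidden states into a single vector.
/// `hidden` is [tokens][dim]; `mask` holds one 0/1 attention entry per token.
pub fn pool(hidden: &[Vec<f32>], mask: &[u32], pooling: Pooling) -> Vec<f32> {
    let dim = hidden.first().map_or(0, |row| row.len());
    match pooling {
        // CLS: take the first token's hidden state, i.e. hidden[:, 0, :].
        Pooling::Cls => hidden.first().cloned().unwrap_or_default(),
        // Mean: average only the tokens whose attention mask is non-zero.
        Pooling::Mean => {
            let mut sum = vec![0.0f32; dim];
            let mut count = 0.0f32;
            for (row, &m) in hidden.iter().zip(mask) {
                if m != 0 {
                    for (s, v) in sum.iter_mut().zip(row) {
                        *s += *v;
                    }
                    count += 1.0;
                }
            }
            if count > 0.0 {
                for s in &mut sum {
                    *s /= count;
                }
            }
            sum
        }
    }
}

/// L2-normalize in place (the step shared by both models); a zero vector is left unchanged.
pub fn l2_normalize(v: &mut [f32]) {
    let norm = v.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm > 0.0 {
        for x in v.iter_mut() {
            *x /= norm;
        }
    }
}
```

Padded positions contribute nothing under mean pooling, while CLS pooling ignores the mask entirely, which is why the two models cannot share a pooled output even though tokenize/forward/L2 are common.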
7505645008 feat(embed): new kebab-embed-ollama crate: Ollama /api/embed Embedder
Fallback backend for arctic-embed-l-v2.0 (the same path used for the measurements). reqwest::blocking
POST {endpoint}/api/embed {model,input:[...]} → embeddings. Batch size 48 +
fail-soft with 3 retries, results L2-normalized (Ollama returns raw vectors → consistency), dim validated.
Query/doc prefixes are inferred from the model tag (arctic-embed→query:/no prefix, e5→query:/passage:).
model_version=ollama:{model}. endpoint=models.embedding.endpoint ?? models.llm.endpoint.
3 wiremock tests (L2 normalization/dim mismatch/empty no-op) + 5 unit tests.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-03 04:59:11 +00:00
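The prefix inference and endpoint fallback from 7505645008 might look roughly like this. It is a sketch under assumptions: the function names are hypothetical, matching by substring is a guess at how "inferred from the model tag" works, and the exact prefix strings (including the trailing space) follow the common e5 convention rather than anything stated in the commit.

```rust
/// Infer (query_prefix, doc_prefix) from a model tag.
/// Returns None for tags this sketch does not recognize.
/// Assumption: substring matching and the trailing space in each prefix.
pub fn infer_prefixes(model_tag: &str) -> Option<(&'static str, &'static str)> {
    let tag = model_tag.to_ascii_lowercase();
    if tag.contains("arctic-embed") {
        // arctic: queries are prefixed, documents are embedded as-is.
        Some(("query: ", ""))
    } else if tag.contains("e5") {
        Some(("query: ", "passage: "))
    } else {
        None
    }
}

/// Resolve the endpoint: models.embedding.endpoint if set, else models.llm.endpoint.
pub fn resolve_endpoint<'a>(embedding: Option<&'a str>, llm: &'a str) -> &'a str {
    embedding.unwrap_or(llm)
}
```

Returning `None` for an unknown tag leaves the caller to decide between failing and embedding without prefixes; the commit does not say which the crate does.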
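e2ae9a4589 docs(spec): arctic-embed-l-v2.0 embedder integration spec + plan
Best option for shoring up descriptive-query recall after alias removal (measured arctic
recall@10 130/132, no loss on term queries). candle multi-model support (CLS pooling + query:
prefix) first, with the Ollama embed provider as fallback. correctness gate = candle≈Ollama cosine > 0.99.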

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-03 04:14:26 +00:00
1dfab6dfc5 Merge pull request 'refactor(app): remove doc-side expansion (alias) feature' (#202) from refactor/remove-doc-expansion into main
Reviewed-on: #202
2026-06-03 00:39:26 +00:00
fc5103642e docs: sync docs for alias removal + version 0.25.0
HOTFIXES 2026-06-03 dated entry, removal banner on the 2026-05-30 design spec,
HANDOFF one line, README (alias section/config/command table cleanup), ARCHITECTURE (decision
table + directory tree), SMOKE/DOGFOOD config-migrate example fixes. workspace version
0.24.0 → 0.25.0 (+ Cargo.lock).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-02 21:37:58 +00:00
e03d03cb26 test: delete alias-only tests + update affected tests/fixtures
Removed the alias-channel test and the insert_chunk_with_aliases helper from
kebab-search/tests/lexical.rs (replaced by a body-retrieval regression test). Removed
aliases: None from Chunk literals (embedding_records_fk/idempotency/inspect). Removed the
aliases key from chunk snapshot fixtures. config_migrate now anchors on ingest.code, and the
corpus_revision/search_lexical comments now state that V013 does not bump.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-02 21:37:58 +00:00
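16aadea222 feat(store): V013 migration: DROP chunk_aliases_fts + chunks.aliases
Forward-only migration that removes the chunk_aliases_fts table (+ triggers) created by V010
and the chunks.aliases column. The old V010 stays frozen and unmodified. Purely structural
change: no corpus_revision bump (spec §결정 (Decision): body/embeddings are unchanged, and the
in-process LRU is per-process and runs before any query, so a bump would be meaningless).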

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-02 21:37:44 +00:00
a48c405826 refactor(wire): remove ExpansionProgress event + rendering
Removed the IngestEvent::ExpansionProgress variant + serialization test (AssetChunked/
AssetTimings kept). Removed expansion rendering from the CLI/TUI and the expand segment
from the AssetTimings line. Removed the expansion_progress kind from the ingest_progress.v1
schema and updated the expansion_ms description to "stays 0".

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-02 21:37:44 +00:00
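21e02d8a93 refactor(app): remove ingest alias generation, cache, and sentinel vector loop
Removed from ingest_one_asset the per-chunk alias LLM generation, the derivation_cache
lookup/store, and the embed_aliases sentinel vector (`{orig}#alias#N`) upsert loop.
expansion_ms is pinned to 0 for wire compatibility. Simplified alias_sentinel_ids_to_delete
and the 3 orphan purge call sites to delete the body chunk_id directly.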

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-02 21:37:43 +00:00
a64c31ee94 refactor(search): remove alias lexical arm
Removed run_alias_query / merge_body_alias; search now uses body_rows directly.
Inlined the column parameter of build_match_string_for_column (fixed to text).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-02 21:36:56 +00:00
ec96648956 refactor(config): remove ExpansionCfg + [ingest.expansion]
Removed the IngestExpansionCfg struct + IngestCfg.expansion field + Default +
KEBAB_INGEST_EXPANSION_* env parsing + tests. Removed the ingest.expansion section
comments from migrate.rs; the config migrate test now anchors on ingest.code
(forward-compat: serde ignores an existing [ingest.expansion] section).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-02 21:36:56 +00:00
ecaf224381 refactor(chunk): remove aliases literal at Chunk construction sites + store column
Deleted the aliases: None literal from the kebab-chunk/* AST, md, tier2, and pdf chunkers;
removed aliases from the chunks INSERT columns/bindings and the get_chunk mapping in
store-sqlite documents.rs.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-02 21:36:44 +00:00
b1c5feb3f3 refactor(core): remove Chunk.aliases field
Doc-side expansion (alias) removal: removed the Chunk aliases: Option<String> field and
its serde default test. Metadata.aliases (Vec, document metadata) is kept.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-02 21:36:44 +00:00
ca8c0645fb docs(spec): doc-side expansion (alias) removal spec + plan
Design for removing the alias feature, following the conclusion of the research document
(2026-06-03). Boundary between kept (Metadata.aliases, AssetChunked/AssetTimings, embedding
cache) and removed (Chunk.aliases, expansion.rs, ExpansionCfg, chunk_aliases_fts,
run_alias_query, ExpansionProgress) + new DROP migration + decision to remove wire
expansion_progress. Implementation plan: 8 tasks.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-02 20:13:12 +00:00
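c7af6612b7 docs(research): rethinking expansion cost + deep research on alias alternatives
Investigated alternatives after aliases (doc-side per-chunk LLM expansion) were confirmed as
the bottleneck on the ingest critical path. Concurrency (OLLAMA_NUM_PARALLEL, at most 1.28×),
model swap (qwen3.5: Chinese/degeneration), and background processing (total work and treadmill
unchanged) were all exhausted by measurement. Step 0 measurement: even without aliases,
cross-lingual recall@10 is perfect (en/ko/syn/abbr); the only weakness is descriptive queries
→ alias ROI is negative. Step 1: bge-m3 dense is lateral (descriptive +3 / term -3, net 0).
4-agent deep research: what remains is a reverse-dictionary task; measure-first ladder (heading
enrichment → arctic-ko embedder → bge-reranker-v2-m3 → near-tie-gated expansion).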

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-02 20:07:42 +00:00
acb4fa6c65 Merge pull request 'feat(ingest): intra-asset phase progress logging (asset_chunked/expansion_progress/asset_timings)' (#201) from feat/ingest-progress-detail into main
Reviewed-on: #201
2026-06-02 17:26:47 +00:00
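8bfa4ba76e fix(ingest-progress): address review: correct store_ms boundary + guard against duplicate expansion frame
- Removed the stale-vector orphan purge (LanceDB I/O) from store_ms → moved to the embed/vector
  phase (embed_ms). store_ms now covers only SQLite put_* (diagnostic accuracy; removes the
  920ms misattribution on edit re-index). The purge is still unconditional and runs before upsert.
- Guarded the final expansion_progress frame with done != last_done (removes the duplicate
  frame when done is a multiple of the throttle, and the 0/0 frame when chunks==0).
- schema/HOTFIXES: corrected the store_ms/embed_ms descriptions + removed dangling IMPL_REPORT references.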

clippy -D warnings 0, test 312 passed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-02 14:49:02 +00:00
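The `done != last_done` guard from 8bfa4ba76e, combined with the "25 chunks or 1s" throttle, can be sketched as below. `ProgressThrottle` and its methods are hypothetical names for illustration; the actual ingest loop is not shown in this log.

```rust
use std::time::{Duration, Instant};

/// Emits a progress frame at most every `every_chunks` chunks or every `every` elapsed,
/// plus one final frame that is skipped when it would repeat the last one.
pub struct ProgressThrottle {
    every_chunks: usize,
    every: Duration,
    last_done: usize,
    last_emit: Instant,
}

impl ProgressThrottle {
    pub fn new(every_chunks: usize, every: Duration) -> Self {
        Self { every_chunks, every, last_done: 0, last_emit: Instant::now() }
    }

    /// Call after each chunk; returns Some(done) when a frame should be emitted.
    pub fn tick(&mut self, done: usize) -> Option<usize> {
        let due = done.saturating_sub(self.last_done) >= self.every_chunks
            || self.last_emit.elapsed() >= self.every;
        if due {
            self.last_done = done;
            self.last_emit = Instant::now();
            Some(done)
        } else {
            None
        }
    }

    /// Final frame, guarded so that neither a duplicate nor a 0/0 frame is emitted.
    pub fn finish(&mut self, done: usize) -> Option<usize> {
        if done != self.last_done {
            self.last_done = done;
            Some(done)
        } else {
            None
        }
    }
}
```

Because `last_done` starts at 0, an asset with zero chunks produces no final frame, and a chunk count that is an exact multiple of the throttle does not repeat its last frame.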
ad0ccf4ccf chore(ingest-progress): remove process artifacts before PR
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-02 14:13:36 +00:00
b351523e51 docs(worktree): IMPL_BRIEF + IMPL_REPORT for ingest-progress-detail
Work input (brief) and output evidence (report: changes/events/exit-code verification/smoke
samples/residual risks). Worktree meta-artifacts that the main session can drop when tidying the PR.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-02 13:58:33 +00:00
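a48b055358 feat(ingest): intra-asset phase progress logging (asset_chunked/expansion_progress/asset_timings) + v0.24.0
Adds visibility into intra-document phases to ingest progress events, which previously existed
only per asset (document). Fixes the progress bar appearing stuck at 1/N when a large document
spends tens of minutes in expansion (alias LLM, sequential per chunk).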

wire ingest_progress.v1 additive (backward-compat):
- asset_chunked {idx,total,chunks}: right after chunking, on all markdown/image/pdf paths
- expansion_progress {idx,total,done,chunks}: expansion loop throttle
  (every 25 chunks or 1s; done==chunks at the end). Cache hits also count toward done
- asset_timings {idx,total,parse_ms,chunk_ms,expansion_ms,embed_ms,store_ms}:
  per-phase wall-clock on the markdown path

Design: to avoid changing kebab_core::IngestItem (wire-stable), timing is emitted directly by
ingest_one_asset as a new AssetTimings event (AssetFinished unchanged).

CLI (progress.rs): progress bar sub-message (→ N chunks / alias expansion done/chunks) +
one line of phase timings at asset end (fmt_ms). TUI reducer: no-op arm.

Verification: clippy -D warnings exit 0; cargo test -p kebab-app -p kebab-cli
312 passed/0 failed. Rewrote the ordering-invariant test + added a new serialization test.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-02 13:58:27 +00:00
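581e1d5d55 feat(cli): show embedding backend/device in one line on ingest + README KB migration docs (v0.23.1)
- kebab-cli ingest: at startup, prints `임베딩 백엔드: <provider> (Metal/GPU 빌드|CPU) · 모델 …`
  (embedding backend, build type, model) to stderr (suppressed by --json/--quiet). The Metal label
  is based on cfg!(feature=embed_metal); the confirmed runtime device is in kb.log (`candle device = …`).
- README: the '외부 계산 + 로컬 검색' (external compute + local search) section gains what to copy
  (kebab.sqlite/sqlite, lancedb/vector_dir) + [storage] config keys + a note that models/assets need
  not be copied + the same version/model requirement + an rsync example.
- version 0.23.0 → 0.23.1 (CLI output + docs only; behavior/schema unchanged).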

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-02 12:25:45 +00:00
c17d6e67a8 Merge pull request 'feat(embed): candle Metal (Apple Silicon GPU) opt-in build feature' (#200) from feat/embed-candle-metal into main
Reviewed-on: #200
2026-06-02 11:40:52 +00:00
af8fd34716 docs(embed): add cargo install --features embed_metal instructions to README
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-02 11:38:28 +00:00
369aeb3d24 feat(embed): candle Metal (Apple Silicon GPU) opt-in build feature + v0.23.0
- kebab-embed-candle: `metal` feature → candle metal backend; select_device()
  picks Device::new_metal(0) (CPU fallback) under the feature, else Device::Cpu.
  .contiguous() before to_vec2 (Metal rejects strided views; CPU tolerates).
- feature passthrough: kebab-app/embed_metal → kebab-cli/embed_metal.
  Build on macOS: cargo build --release --features embed_metal.
- default (non-metal) path unchanged: clippy 0, candle units + thread_cap + parity pass.
- README + HOTFIXES: Mac-GPU-ingest → copy sqlite+lancedb → server CPU-query workflow.
- version 0.22.0 → 0.23.0 (opt-in build surface).

macOS-only compile; Metal execution/speed/parity validated by user on M4 Pro
(not buildable on the Linux CI/dev machine).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-02 11:37:08 +00:00
99f8cfa691 Merge pull request 'feat(embed): candle embedding provider (NUMA-safe, opt-in)' (#199) from feat/embed-candle into main
Reviewed-on: #199
2026-06-02 10:14:39 +00:00
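d85d7348a5 docs(embed-candle): record dogfooding + A1 refutation + negative MKL result as evidence
- HOTFIXES + release-notes: full candle dogfooding run, 997 docs/23,151 chunks/0 errors (9.5h)
- A1 (taskset -c 0-3) refuted on the real server: onnxruntime segfaults even when limited to 4 cores → candle is the only real fix
- Negative MKL acceleration result: uses more cores but is 38~50% slower → not adopted, staying pure Rust
- Parity 2.01e-7 reconfirmed, performance trade-off stated explicitly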

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-02 09:08:12 +00:00
edac3ae737 chore(embed-candle): address PR #199 round-1 review: note the candle model caveat in SMOKE.md
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-01 17:01:35 +00:00
6ec4e6809f fix(embed-candle): address round-1 review
- commit track-spec + meta-spec/plan into branch (HIGH: dangling `amends:` ref)
- inline parity evidence (cosine 1.0, max_abs_diff 2.01e-7) into HOTFIXES +
  release notes; drop refs to deleted IMPL_REPORT/SPIKE_REPORT (MEDIUM)
- model guard: reject non-e5-large `model` before the 2GB download so
  model_id() can't mislabel vectors (MEDIUM) + unit test
- parity test now covers BOTH query: and passage: prefixes (MEDIUM)
- guard encodings.first() index; document zero-attention/pooling invariant;
  clarify embed_batch prefixing doc (LOW)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-01 16:54:20 +00:00
1011c75fff chore(embed-candle): remove spike/impl process artifacts before PR
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-01 16:37:46 +00:00
8f7b6ee538 feat(embed): candle embedding provider (NUMA-safe, opt-in) + v0.22.0
On a dual-socket NUMA server, fastembed (onnxruntime) hard-codes 48 intra-op threads, which
corrupts the NUMA heap and kills ingest with a double-free. To work around this, add an opt-in
embedding provider that runs the same multilingual-e5-large model in pure Rust (candle).

- New crate kebab-embed-candle: CandleEmbedder (kebab_core::Embedder).
  hf-hub safetensors → XLMRobertaModel forward → mask mean-pool → L2 → e5
  prefix. The candle dependency tree is isolated in this crate (0 kebab-* dependencies other than core/config).
- Thread cap: [models.embedding].num_threads + env KEBAB_EMBED_THREADS →
  caps the global rayon pool once (NUMA-safe lever).
- kebab-app::embedder() branches on provider (fastembed/onnx/"" → existing path unchanged,
  candle → CandleEmbedder, unknown value → error).
- Removed the Phase 0 spike crate (absorbed into production).
- version 0.21.1 → 0.22.0 (new config surface, pre-1.0 minor bump).

Parity: cosine_min=1.000000, max abs diff=2.01e-7 (< 1e-5) → embedding_version
kept, 0 re-indexing. fastembed default behavior/vectors unchanged. No wire schema change.

Verification (file + exit code): clippy -D warnings EXIT=0 (0 warnings), test EXIT=0
(candle unit 5 + thread_cap rayon=4 + config 68), parity #[ignore] EXIT=0.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-01 14:52:25 +00:00
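76841af7d3 spike(embed-candle): candle e5-large feasibility check: VERDICT PASS
Track 1 / Phase 0 isolated spike. Runs intfloat/multilingual-e5-large on candle (pure Rust)
and compares it with the existing onnxruntime FastembedEmbedder.

Results:
- Parity: 10 Korean/English sentences, cosine min=mean=1.000000 (exact match)
- padding_idx: follows the XLM-R convention (double-checked via source + parity)
- Thread control: RAYON_NUM_THREADS=4 confirmed to cap compute threads 12→4
  (the fastembed 4.9.1 problem of a hard-coded 48 with no override is structurally absent)
- latency: batch=32 candle 2.161s vs fastembed 0.536s (~4×, 4 vs 12 threads)

→ Recommend proceeding with the full candle implementation (GREEN). Details in SPIKE_REPORT.md.

candle dependencies are isolated to crates/spike-embed-candle only. No behavior change in
production crates. Definitive NUMA verification requires the user to run it on that
dual-socket server (meta-spec §4.3).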

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-01 14:23:51 +00:00
980e20fd8d docs: add config migrate playbook to SMOKE/DOGFOOD
SMOKE gets config migrate smoke steps (dry-run/apply/idempotency/--json); DOGFOOD §9 gets a
schema migration scenario (.bak byte-identical, values preserved, visibility, idempotency, doctor).
Tag moved so this is included in v0.21.1.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-31 13:58:08 +00:00
cd79ed326c chore: bump version 0.21.0 → 0.21.1
Config migration (kebab config migrate, PR #198): new CLI subcommand +
doctor check + init section comments + wire config_migration.v1 + schema_version 1→2.
Patch bump because the change is additive (no data invalidation).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-31 13:51:56 +00:00
9dbf9d781d Merge pull request 'feat(config): config.toml migration (kebab config migrate)' (#198) from feat/config-migration into main
Reviewed-on: #198
2026-05-31 13:48:10 +00:00
9501edd82b docs: sync config migrate surface (README/HOTFIXES/HANDOFF)
A kebab config migrate bullet in README Configuration, a dated entry in HOTFIXES
(mechanism + dogfooding evidence table + limitations), one line in HANDOFF. The lib.rs backup
path keeps with_extension (review nit: works correctly for .toml configs, avoids regression risk).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-31 13:25:42 +00:00
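4b4a4c0b32 fix(config): keep the detailed list of supported extensions in the init header
Restored the HEADER of annotated_default_document so that it preserves the detailed
'처리 가능한 형식' (supported formats) list (.md / .png .jpg .jpeg / .pdf) from the previous
init header. Keeps the init_template contract from p9-fb-25 (guidance on supported extensions).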

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-31 12:46:45 +00:00
f2cc325cf3 feat(cli): kebab config migrate subcommand + wire config_migration.v1
- Cmd::Config { Migrate { --dry-run } }; emits config_migration.v1 with --json.
- wire_config_migration (ConfigMigrationReport carries schema_version itself).
- Registered config_migration.v1 in schema.rs WIRE_SCHEMAS + added the JSON schema file.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-31 12:09:31 +00:00
b7e022a5e3 feat(app): config migrate facade + shared init comments + doctor check
- config_migrate_with_config_path: backup (.bak) + atomic write (tmp→rename) + dry-run,
  with round-trip verification that preserves the original on failure. Returns ConfigMigrationReport.
- init_workspace uses annotated_default_document() (section comments included).
- Added a config_migration check to doctor (ok=false + hint when out of sync).
- tests/config_migrate.rs: 4 tests (backup/atomic/dry-run/idempotency/doctor) pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-31 12:09:31 +00:00
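The backup + atomic write + dry-run combination named in b7e022a5e3 can be sketched with the standard library alone. `write_with_backup` is a hypothetical helper, and the `.bak`/`.tmp` names produced by `with_extension` are assumptions; the round-trip verification step is omitted.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Replace `path` with `new_contents`, keeping a `.bak` copy of the original.
/// With `dry_run` set, nothing is written. Returns whether the content would change.
pub fn write_with_backup(path: &Path, new_contents: &str, dry_run: bool) -> io::Result<bool> {
    let original = fs::read_to_string(path)?;
    if original == new_contents {
        return Ok(false); // idempotent: already up to date, no backup churn
    }
    if dry_run {
        return Ok(true); // report the pending change without touching disk
    }
    // Back up first so the original survives any later failure.
    fs::copy(path, path.with_extension("bak"))?;
    // Write to a sibling temp file, then rename it over the target.
    let tmp = path.with_extension("tmp");
    fs::write(&tmp, new_contents)?;
    fs::rename(&tmp, path)?;
    Ok(true)
}
```

Keeping the temp file in the same directory matters: a rename within one filesystem replaces the target in a single step on POSIX systems, so a crash leaves either the old or the new file rather than a partial one.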
bd7c4fd7ef feat(config): config migration engine (reconcile + step chain)
- Added toml_edit 0.22 dependency
- migrate.rs: CURRENT_SCHEMA_VERSION=2, annotated_default_document (shared source of the
  comment catalog), reconcile (adds missing sections/keys along with comments, never touches values),
  step_1_to_2 (removes workspace.include), migrate_document (step+reconcile+stamp)
- schema_version default 1 → 2
- 56 tests green, clippy -D warnings clean

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-31 12:09:31 +00:00
4dcb4a45d6 feat(config): migrate module scaffolding + toml_edit dependency
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-31 11:41:32 +00:00
6d86214060 docs(plan): config migration implementation plan (TDD, 13 tasks)
Separates reconcile (additive) from the step chain (non-additive); annotated_default_document
shared by init/migrate; app facade with backup + atomic write; doctor check;
CLI config migrate; wire config_migration.v1. Bite-sized TDD steps.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-31 11:39:31 +00:00
6bbb8f854b docs(spec): config migration design contract
Spec that finalizes the brainstorm results from the kickoff handoff (#197). Trigger = explicit
command `kebab config migrate` + doctor hint; comment preservation = partial edits via toml_edit;
mechanism = hybrid of reconciliation (additive) + step chain (non-additive).
init/migrate share the annotated default document. Three safety axes (idempotency, backup,
dry-run) + atomic write. New wire schema config_migration.v1.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-31 11:34:19 +00:00
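2a4df4d48d Merge pull request 'docs: config migration work handoff kickoff' (#197) from docs/config-migration-kickoff into main 2026-05-31 11:11:17 +00:00
16f3d6eef2 docs: config migration work handoff kickoff
Handoff document for a separate session on automatically migrating existing user files when
the config.toml schema evolves. Covers the current state (serde default forward-compat exists /
no file migration / schema_version is decorative), the core difficulty (comment preservation),
3 design options (full rewrite / toml_edit append / backup), triggers (command vs automatic),
and methodology (the v0.21.0 PR #195/#196 pattern).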

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-31 11:11:08 +00:00
fa89c7b561 Merge pull request 'docs(readme): v0.21.0 full restructure' (#196) from feat/doc-side-expansion into main 2026-05-31 10:44:38 +00:00
a4c81fed86 docs(readme): v0.21.0 full restructure
Reordered: Quick start first (fast path to use), core features in the middle, architecture and
design last. Removed content unrelated to kebab (ollama sudo-less tarball install, CPU model
troubleshooting), outdated version tags (fb-XX, p9-fb, V009, scattered v0.17~v0.20.x), and stale
version wording. Written against v0.21.0 (doc-side expansion aliases, derivation cache,
external-compute workflow). 302→206 lines.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-31 10:44:29 +00:00
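5b7c02fe13 Merge pull request 'feat(expansion): doc-side expansion per-alias dense vectors + derivation cache (V012)' (#195) from feat/doc-side-expansion into main 2026-05-31 10:25:45 +00:00
88c5b83dea docs: flesh out derivation-cache spec/handoff from the reader's perspective
Fills in details that were missing relative to the PR #195 implementation (e9b5202):
- callout distinguishing chunk_id (position-based vector identifier) from cache_key (content-hash lookup key)
- new §7 compatibility/migration: no body re-index needed; V012 is additive but requires a binary swap;
  impact on existing KBs of the alias sentinel change from bundled to per-alias (legacy compatible)
- kind token ("doc|") reflected in version_key; orphan sentinel cleanup (LIKE prefix) spelled out
- embed_with_cache order-preservation invariant; rationale for per-alias vectors (dilution 13/18→16/18)
- correction: derivation_cache_gc exists only as a method and is not wired up (the cache currently grows without bound; follow-up)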

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-31 10:25:00 +00:00
2619b7bff7 test(chunk): reflect aliases:null field in AST snapshot fixtures
After the aliases field was added to the Chunk struct (alias infrastructure), the chunk-*-ast-v1
snapshot fixtures were left un-updated and failed with drift. chunk_id, text, policy_hash, and
tokenized are all unchanged; only the single field "aliases": null was added to the
serialization (no change to chunk generation logic, not a regression). Re-baked 10 fixtures
(code c/cpp/go/java/js/kotlin/python/rust/ts + long_section) with UPDATE_SNAPSHOTS=1.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-31 09:57:16 +00:00
e9b520216e fix(expansion): per-alias sentinel orphan cleanup + cache robustness (PR #195 review)
MAJOR: the chunk_id of alias dense vectors changed from the legacy single `{id}#alias` to
per-line `{id}#alias#0`, `#alias#1`, …, but orphan cleanup deleted only the single sentinel,
so `#alias#N` vectors leaked in LanceDB / embedding_records.

- kebab-app: added the `alias_sentinel_ids_to_delete` helper (approach A): the delete-set includes
  the body + legacy `{id}#alias` + `{id}#alias#0`..`{id}#alias#{max-1}`. max matches
  expansion.max_aliases_per_chunk (= the hard cap in parse_aliases). All three LanceDB cleanup
  paths (parser-bump / edited-asset / deleted-file) use this helper.
- kebab-store-sqlite: switched the 4 explicit embedding_records DELETE paths (put_chunks /
  purge_*_except_doc_id / purge_orphan_at_workspace_path /
  purge_deleted_workspace_path) from exact match (`|| '#alias'`) to a `{id}#alias%`
  prefix LIKE. Body chunk_ids are 32-char hex, so they contain no LIKE wildcards.

MINOR 1: on an alias cache hit, a non-UTF8 payload is downgraded to a miss (falls into the
regeneration branch), matching the decode-failure → miss downgrade on the embedding path.
MINOR 2: added a kind token ("doc") at the front of the embedding version_key. The embedder adds
per-kind prefixes, so this prevents collisions even if query embeddings use the same cache later.

Regression tests:
- kebab-app: 2 unit tests for alias_sentinel_ids_to_delete.
- kebab-store-sqlite: 3 integration tests pinning that per-alias sentinel embedding_records
  disappear on all three cleanup paths.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-31 09:14:34 +00:00
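The delete-set described in e9b520216e can be written down directly. The helper name comes from the commit, but this signature and return type are assumptions made for illustration, not the kebab-app implementation.

```rust
/// Build every vector id to delete for one body chunk:
/// the body id, the legacy single sentinel, and the per-alias sentinels 0..max_aliases.
/// Signature is hypothetical; only the helper's name appears in the commit message.
pub fn alias_sentinel_ids_to_delete(chunk_id: &str, max_aliases: usize) -> Vec<String> {
    let mut ids = Vec::with_capacity(max_aliases + 2);
    ids.push(chunk_id.to_string()); // body vector
    ids.push(format!("{chunk_id}#alias")); // legacy single sentinel
    for n in 0..max_aliases {
        ids.push(format!("{chunk_id}#alias#{n}")); // per-alias sentinels
    }
    ids
}
```

Enumerating up to the hard cap, rather than looking up how many aliases were actually stored, means cleanup stays correct even when the stored count is unknown, at the cost of a few no-op deletes.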
a8fd76499c feat(expansion): doc-side expansion per-alias dense vectors + derivation cache (V012)
Indexes aliases as per-line individual dense vectors (sentinel `{chunk}#alias#N`) and skips
alias generation for boilerplate chunks. Drops the bundled single-vector approach, which
actually regressed (13/18) because averaging diluted specific phrasings. Variant consistency
14/18 → 16/18, mean_spread@10 0.222 → 0.111 (Namuwiki ~1000-document CS corpus).
`kebab-core::strip_alias_suffix` handles both the suffix form and the per-alias form.

Derivation cache (V012): caches embedding vectors + alias LLM results keyed by chunk content
hash, so re-indexing skips recomputation for chunks whose content is unchanged. cache_key =
blake3(kind ‖ text_blake3 ‖ version_key)[:32]; version_key includes
model/prompt/dimensions → consistent with the §9 cascade (automatic miss on version bump).
Measured: 3 gold answers, cold 1879s → warm 13s ≈ 145×. Purely additive, so no
corpus_revision bump. search/ask work with kebab.sqlite+lancedb alone → enables a portable
workflow of indexing on an external server and copying only the DB.

V012 schema migration + new surface → workspace version 0.20.2 →
0.21.0 (minor) bump. README/HANDOFF/ARCHITECTURE/HOTFIXES sync.
Known limitation: 2 descriptive queries (stack, svm) remain + the grounded check misclassifies
partial citations as grounded (follow-up candidate).

Measurement details: docs/superpowers/handoffs/2026-05-31-namu-wiki-alias-cache-study.md

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-31 08:24:04 +00:00
0282a81c67 fix(store): 4th CASCADE-replacement path + restore V011 CHECK (Task 4.5 review)
Review MAJOR: purge_document_at_workspace_path_except_doc_id (parser-bump path) was missing
the explicit DELETE of original + sentinel embedding_records → tombstones accumulated. Added it +
a regression test. MINOR: restored the V011 status CHECK (pending/committed/tombstone).
NIT: comment that the foreign_keys PRAGMA is a no-op.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-30 14:02:46 +00:00
f3587b7143 feat(store): filter_chunks strips sentinel alias candidates (passes the committed check)
A LanceDB candidate's sentinel chunk_id ({orig}#alias) drops out at the chunks JOIN and so
disappears before the VectorRetriever strip. Candidates are stripped to the original chunk_id
with kebab_core::strip_alias_suffix before going into the IN-list/JOIN (the committed check is
made against the original body chunk) so they pass, but the return value keeps the input
candidate form (sentinel preserved); VectorRetriever receives that sentinel and does
strip+dedup. Chose (b) Rust strip over SQL replace (clearer).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-30 13:41:28 +00:00
483b1ec06b feat(store): V011 remove embedding_records FK + explicit DELETE replacing CASCADE (sentinel alias vectors)
Indexing alias dense vectors under a sentinel chunk_id ({orig}#alias) requires embedding_records
to hold chunk_ids that do not exist in chunks. V001's chunk_id REFERENCES chunks
ON DELETE CASCADE FK blocks this with SQLite 787, so the table is recreated without the FK.
status/vector_committed (V003) + 3 indexes are preserved; the chunks_bd_tombstone_embeddings
trigger is unmodified. legacy_alter_table=ON avoids re-parsing a dangling trigger during DROP→RENAME.

The lost CASCADE is replaced by explicit DELETEs in put_chunks + the two purge paths
(purge_orphan_at_workspace_path, purge_deleted_workspace_path): right before deleting chunks, the
original + {id}#alias sentinel embedding_records are cleaned up together. corpus_revision baseline 2→3.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-30 13:41:20 +00:00
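d279f343e7 docs(spec,plan): separate-vector infrastructure: FK removal (V011) + CASCADE replacement + filter_chunks
PoC: pure alias vectors put English descriptive queries at rank 7~30 (not recovered under concat
because of body dilution) → justification for separate vectors. 3 blockers: embedding_records FK
(787, V011 recreation), CASCADE replacement (explicit DELETE), filter_chunks sentinel strip. plan Task 4.5/4.6.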

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-30 13:25:45 +00:00
b56469f010 fix(core): clippy uninlined_format_args in strip_alias tests (review MAJOR-1)
Passes the workspace clippy --all-targets -D warnings gate. Inlined format! arguments.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-30 11:24:04 +00:00
6ba8cb2c88 feat(search): VectorRetriever sentinel alias strip + dedup
Alias dense vector ({orig}#alias) hits are stripped to the original chunk_id and hydrated;
for body+alias duplicates only the first (higher-score) one is kept. overfetch 2→3 (to secure k
after dedup). wire/RetrievalDetail unchanged. 0 vector/hybrid regressions, clippy green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-30 11:09:32 +00:00
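The strip + dedup step in 6ba8cb2c88 might be sketched as follows. `strip_alias_suffix` and `ALIAS_SUFFIX` are names from this log, but the bodies here are assumptions: the sketch cuts at the first `#alias` marker (which also covers the later `{orig}#alias#N` form), and `strip_and_dedup` is a hypothetical helper operating on `(id, score)` pairs instead of the real hit type.

```rust
use std::collections::HashSet;

const ALIAS_SUFFIX: &str = "#alias";

/// Map a sentinel id (`{orig}#alias` or `{orig}#alias#N`) back to `{orig}`.
/// Ids without the marker are returned unchanged.
pub fn strip_alias_suffix(id: &str) -> &str {
    match id.find(ALIAS_SUFFIX) {
        Some(pos) => &id[..pos],
        None => id,
    }
}

/// Strip sentinels and keep only the first hit per original chunk id, up to `k` results.
/// `hits` must already be sorted by descending score, so "first" means "highest score".
pub fn strip_and_dedup(hits: &[(String, f32)], k: usize) -> Vec<(String, f32)> {
    let mut seen = HashSet::new();
    let mut out = Vec::new();
    for (id, score) in hits {
        if out.len() >= k {
            break;
        }
        let orig = strip_alias_suffix(id);
        // `insert` returns false when this original id was already emitted.
        if seen.insert(orig.to_string()) {
            out.push((orig.to_string(), *score));
        }
    }
    out
}
```

Dedup can shrink the candidate list, which is the reason the commit raises the overfetch factor: fetching more than `k` raw hits leaves room to still return `k` distinct chunks.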
afa8af0f88 feat(app): index alias dense vectors separately + purge (sentinel) 2026-05-30 10:48:58 +00:00
b9d20d23d1 feat(config): ingest.expansion.embed_aliases flag (default off) 2026-05-30 10:31:07 +00:00
86b4e1ebd0 feat(core): ALIAS_SUFFIX + strip_alias_suffix (dense alias vectors) 2026-05-30 10:31:03 +00:00
825543549d docs(plan): implementation plan for separate alias dense vectors
ALIAS_SUFFIX (core) → embed_aliases flag → ingest sentinel vectors + purge →
VectorRetriever strip+dedup → measurement. TDD, complete code. doc-side expansion PR.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-30 10:28:43 +00:00
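bcb8b93751 docs(spec): design spec for separate alias dense vectors
PoC (concat) measurement: dense aliases score 6/0/2/0.25 (demonstrating that descriptive queries
are dense retrieval's home turf), but 2 English descriptive queries are not recovered because
concat dilutes them with the body. Prescription: index aliases as separate vectors under a
sentinel chunk_id (body vectors unchanged = regression-safe, pure alias signal).
Flag ingest.expansion.embed_aliases, default off. Lexical relaxation is dropped.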

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-30 10:26:24 +00:00
116b3e6377 fix(app): clippy unused_self: make build_request an associated fn
Passes the CI gate (clippy --workspace --all-targets -D warnings). Behavior unchanged.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-30 03:47:06 +00:00
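69b53d1c97 docs(spec): correct the doc-side expansion search mechanism to match the shipped implementation
Task 6 review MINOR-1: the spec body described a single UNION ALL+GROUP BY, but what shipped is
2 queries (run_query+run_alias_query) + a Rust merge_body_alias (body first). Comparing absolute
bm25 values across different FTS tables is meaningless, so a body-first merge is cleaner.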

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-30 03:20:13 +00:00
a271352e33 feat(search): lexical body+alias merged search (pool-rescue)
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 03:12:14 +00:00
cde4d75f6b feat(app): ingest alias generation hook (flag off by default, fail-soft) 2026-05-30 03:03:09 +00:00
bddcd53688 fix(app): parse_aliases prefix stripping corrupted aliases starting with a digit/hyphen (Task 4 review MAJOR-1)
Greedy trim_start_matches → explicit strip_list_marker (only the marker+space pattern, once).
Preserves "3D 렌더링"/"2단계"/"-fast"; strips only the "- "/"1. " markers. 2 regression tests.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-30 02:49:25 +00:00
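The fix in bddcd53688 replaces greedy character trimming with a one-shot marker match. A minimal sketch of that idea follows; the function name comes from the commit, while the body is an assumption that handles only the two markers the message mentions ("- " and "N. ").

```rust
/// Strip at most one leading list marker ("- " or "N. ") and return the rest.
/// Text that merely starts with a digit or hyphen is left intact.
pub fn strip_list_marker(line: &str) -> &str {
    let s = line.trim_start();
    // Bullet marker: the hyphen must be followed by a space, so "-fast" is preserved.
    if let Some(rest) = s.strip_prefix("- ") {
        return rest.trim_start();
    }
    // Numbered marker: ASCII digits followed by ". ", so "3D ..." and "2단계" are preserved.
    let digits = s.bytes().take_while(|b| b.is_ascii_digit()).count();
    if digits > 0 {
        if let Some(rest) = s[digits..].strip_prefix(". ") {
            return rest.trim_start();
        }
    }
    s
}
```

A greedy `trim_start_matches` over digits, dots, and hyphens would turn "3D 렌더링" into "D 렌더링"; requiring the full marker-plus-space pattern, applied once, avoids that.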
2a207f9868 feat(app): ExpansionGenerator: per-chunk alias generation (fail-soft)
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 02:36:20 +00:00
cc31868d24 feat(config): [ingest.expansion] flag (default off) 2026-05-30 02:26:41 +00:00
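0df47febf0 test(store): doc-side expansion Task 2 review follow-ups (M1/M2/N1)
- M1: AND aliases <> '' added to the chunk_aliases trigger guard (empty strings are not indexed)
- M2: re-index idempotency test (1 alias row after re-put)
- N1: negative assertion for body isolation (alias terms do not leak into chunks_fts)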

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-30 02:24:24 +00:00
b12a616ab2 feat(store): V010 chunk_aliases_fts + alias persistence in put_chunks
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 02:15:27 +00:00
848b75c069 feat(core): Chunk.aliases field (doc-side expansion)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-30 02:09:39 +00:00
467a974901 docs(plan): doc-side expansion implementation plan + spec refinement (separate FTS table)
spec: to avoid the chunks_fts §5.5 verbatim conflict → refined to a separate chunk_aliases_fts
table + body+alias merge inside lexical (RetrievalDetail/wire schema unchanged).
plan: 7-task TDD (Chunk field → V010 → config → ExpansionGenerator →
ingest hook → lexical merge → measurement/docs). Complete code + build conventions.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-30 02:04:58 +00:00
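098413922b docs(spec): index-time doc-side expansion design spec (Phase 2)
Brainstorm outcome: per-chunk alias generation (same-language + Korean↔English translation),
additive + manual re-index, simple first-pass quality control. Separate FTS5 aliases channel →
RRF 3-channel fusion. Flag off by default; measure on/off with kebab eval variants.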

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-30 01:54:46 +00:00
695010ea7a Merge pull request 'docs: Phase 2 doc-side expansion kickoff + implementation methodology handoff' (#194) from docs/phase2-doc-expansion-kickoff into main
Reviewed-on: #194
2026-05-30 01:23:48 +00:00
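8bb7c276d0 docs: Phase 2 doc-side expansion kickoff + implementation methodology handoff
Context document that lets a new session pick up Phase 2 (index-time doc-side expansion) on its own.
Covers the background (rerank refuted → redefinition → Phase 1 diagnosis, B dominant → deep research → PoC),
the design direction (KO↔EN translation aliases + separate FTS5 field + RRF, flag off), the measurement
tools already built (kebab eval variants + dogfood golden), and the same implementation methodology used
so far (brainstorm→spec→plan→OMC teammate sequential implementation+review+independent verification,
model routing, build redirect+exit, measurement = variant eval with no proxies, gitea-pr review loop).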

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-30 01:19:14 +00:00
01a03463a6 Merge pull request 'feat(eval): variant consistency (query-paraphrase robustness) evaluation framework' (#193) from feat/paraphrase-robustness-eval into main
Reviewed-on: #193
2026-05-30 01:12:26 +00:00
b6ad947378 docs: slim README command table + move details to ARCHITECTURE and sync
Shrank the README's monster cells (ingest 2891→544, search 2952→687, ask 1244→415, tui 2300→453 chars)
to "what + key flags + pointer". The structural details that were dropped move to ARCHITECTURE:
- added Go/Java/Kotlin/C/C++ to the symbol path format + code chunk provenance (citation.kind/code_lang/repo)
- Markdown title auto-fill order (md-frontmatter-v2)
- new decision row for RAG groundedness verification (mDeBERTa-v3 XNLI, nli_threshold gate)
- TUI row updated to P9-1~4 complete + F1 cheatsheet (removed the stale "planned" wording)
The exhaustive flag list is delegated to --help and TUI keys to the in-app F1 cheatsheet (the authoritative runtime source) to prevent staleness.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-30 01:10:39 +00:00
1529e6d991 docs(readme): address PR #193 round-1 review: add aggregate/variants to the eval command table
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-30 01:00:32 +00:00
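5ad1f98227 docs(handoff): doc-side expansion deep research + PoC results (Phase 2 direction settled)
Deep research (104 agents): the best fix for vocabulary-gap pool misses is index-time doc-side expansion.
PoC (dogfood KB): 3 queries that had recall@50=0 came back at rank 1~2 once aliases were added (hybrid+vector;
not verbatim from the golden set, so the result generalizes). The key unverified link is now confirmed quantitatively on a real corpus.
Phase 2 = index-time doc-side expansion (KO↔EN translation aliases) → separate FTS5 field → RRF, flag off.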

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-30 00:53:24 +00:00
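a58cae2ff3 docs(research): deep-research reference on fixing vocabulary-gap pool misses
deep-research workflow (104 agents, 5 angles, 22 sources, 25 claims verified by 3-vote, 22 confirmed/3 killed).
Conclusion: index-time doc-side expansion (doc2query) is the best fix for pool misses. It grows the pool itself,
adds ~0 per-query latency (done once at index time), and preserves exact matching (appended to a separate field). However, vanilla mt5 stays within the same language, so
the KO/EN gap needs KO↔EN alternative queries generated at index time. Query-side approaches (HyDE = the rejected per-query LLM,
Vector-PRF = recall claim rejected) are unsuitable. Verification is possible with the existing variant eval.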

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-30 00:53:24 +00:00
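7a1dff1684 docs(handoff): query-paraphrase robustness Phase 1 complete + (A)/(B) diagnosis
Measured 8 groups × 4 variants (dogfood): groups=8 A_dominant=2 B_dominant=4 spread@10=0.750.
Diagnosis: the problem is vocabulary distance, not KO vs EN (paraphrased English sentences also miss, and some Korean queries are fine).
B (vocabulary gap, recall@50=0, rerank cannot help) dominates → a signal to prescribe query expansion/translation. A (rank fluctuation) covers
only 2 groups, cap_theorem/vector_database. The "measure first" thesis is verified quantitatively (rerank alone is a partial fix).
Phase 2 prescription decision pending.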

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-30 00:53:24 +00:00
0988f66331 feat(cli): kebab eval variants <run_id>: variant-consistency diagnostic report
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-30 00:53:24 +00:00
82e02aa4fe fix(eval): variant-consistency review H1/M1: guard against pool truncation + align answer judgment
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-30 00:53:24 +00:00
db4af0cc72 feat(eval): variant-consistency metrics + A/B (rank fluctuation/vocabulary gap) classification
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 00:53:24 +00:00
ab20202241 test(eval): Task1 review nits: tests for 3+ member groups and group=None + divergent query id in the error message
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-30 00:53:24 +00:00
a51e6395c0 feat(eval): GoldenQuery.group + group consistency validation (basis for variant consistency) 2026-05-30 00:53:24 +00:00
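fe4c854673 docs(plan): query-paraphrase robustness Phase 1 implementation plan
5 tasks: (1) GoldenQuery.group + group consistency validation, (2) variant-consistency metrics
module + A/B (rank fluctuation/vocabulary gap) classification, (3) kebab eval variants CLI, (4) curation of dogfood
golden variant groups, (5) measurement + diagnostic report. TDD, bite-sized, complete code.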

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-30 00:53:24 +00:00
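1de3f4ffca docs(spec): query-paraphrase robustness evaluation framework design (measure first)
Goal redefined: from KO/EN overlap to consistent answer quality across different phrasings of the same meaning (synonyms, different vocabulary, paraphrased
sentences, KO/EN). Reflects the lesson that the previous reranker experiment went astray by optimizing an overlap proxy:
before prescribing anything, start with an evaluation that measures the real metric (variant consistency) directly.

Phase 1 (implemented by this spec): add variant groups (intent groups) to the kebab-eval golden suite +
variant-consistency metrics (recall_spread, answer_consistency) + automatic discrimination of (A) rank fluctuation vs
(B) vocabulary gap using recall@pool vs recall@k. Phase 2 (prescription) is conditional, gated on the measurement results.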
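
The A/B discrimination above can be illustrated with a small sketch. The log does not show kebab-eval's actual definitions, so everything here is an assumption: each variant is reduced to the 1-based rank of the expected chunk in the candidate pool (`None` if it never entered the pool), the names `recall_spread`, `diagnose`, and `Diagnosis` are hypothetical, and ties are resolved toward the vocabulary-gap case arbitrarily.

```rust
#[derive(Debug, PartialEq)]
enum Diagnosis {
    Consistent,
    RankFluctuation, // (A): in the pool, but ranked below k
    VocabularyGap,   // (B): never entered the pool, so reranking cannot help
}

/// Spread of per-variant recall@k within one intent group (max minus min).
/// With a single expected chunk, each variant's recall is 0.0 or 1.0.
fn recall_spread(ranks: &[Option<usize>], k: usize) -> f64 {
    if ranks.is_empty() {
        return 0.0;
    }
    let recalls: Vec<f64> = ranks
        .iter()
        .map(|r| match r {
            Some(rank) if *rank <= k => 1.0,
            _ => 0.0,
        })
        .collect();
    let max = recalls.iter().cloned().fold(0.0, f64::max);
    let min = recalls.iter().cloned().fold(1.0, f64::min);
    max - min
}

/// Labels a group by which kind of miss dominates among its variants.
fn diagnose(ranks: &[Option<usize>], k: usize) -> Diagnosis {
    let mut pool_misses = 0;
    let mut rank_misses = 0;
    for rank in ranks {
        match rank {
            None => pool_misses += 1,
            Some(r) if *r > k => rank_misses += 1,
            Some(_) => {}
        }
    }
    if pool_misses == 0 && rank_misses == 0 {
        Diagnosis::Consistent
    } else if pool_misses >= rank_misses {
        Diagnosis::VocabularyGap
    } else {
        Diagnosis::RankFluctuation
    }
}
```

The distinction matters for the prescription: a group dominated by `RankFluctuation` could benefit from reranking, while a `VocabularyGap` group needs the pool itself to change.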

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-30 00:53:24 +00:00
7fbfec647d Merge pull request 'feat: v0.20.2: Ask response language rag-v3 + 8 dogfood findings + search-quality eval baseline' (#192) from fix/v0-20-2-dogfood-findings into main
Reviewed-on: #192
2026-05-29 05:27:52 +00:00
ca8c83b1ba chore(hotfixes): address PR #192 round-1 review: correct the refusal marker description
`<REFUSE>` marker → based on the presence of a citation marker (`[#번호]`, 번호 = number) (pipeline.rs:463-486).
Consistent with the release-notes correction.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-29 05:22:57 +00:00
6c611990d8 chore: bump version 0.20.1 -> 0.20.2
v0.20.2: automatic Ask response-language matching (rag-v3) + 8 dogfood findings
+ eval --config facade patch + search-quality golden baseline
(hybrid hit@3=1.0 / MRR=0.833, lexical hit@3=1.0).
evidence: tasks/HOTFIXES.md 2026-05-29, docs/release-notes/v0.20.2-draft.md.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-29 05:10:42 +00:00
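166b1404e4 docs(release-notes): correct refusal determination mechanism + O-2 phrasing
leader review of writer draft: refusal is determined by the presence of a citation marker (`[#번호]`, 번호 = number),
not by a special `<REFUSE>` marker. The O-2 example wording is also corrected to match the actual rag-v3
rules.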
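
A citation-marker-based refusal check of the kind described above could look roughly like this. It is only a sketch: the real logic lives in pipeline.rs:463-486 and is not shown in this log, the concrete marker shape `[#<digits>]` is inferred from `[#번호]`, and the function names are hypothetical.

```rust
/// Returns true if `answer` contains at least one marker of the form `[#<digits>]`.
/// Scanning bytes is safe here: the marker is pure ASCII, and UTF-8
/// continuation bytes never collide with ASCII bytes.
fn has_citation_marker(answer: &str) -> bool {
    let bytes = answer.as_bytes();
    let mut i = 0;
    while i + 1 < bytes.len() {
        if bytes[i] == b'[' && bytes[i + 1] == b'#' {
            let mut j = i + 2;
            while j < bytes.len() && bytes[j].is_ascii_digit() {
                j += 1;
            }
            // Require at least one digit and a closing bracket.
            if j > i + 2 && j < bytes.len() && bytes[j] == b']' {
                return true;
            }
        }
        i += 1;
    }
    false
}

/// Assumed rule for this sketch: an answer with no citation marker counts as a refusal.
fn is_refusal(answer: &str) -> bool {
    !has_citation_marker(answer)
}
```

Because the check looks only at markers, the refusal sentence itself can be written in any language without affecting detection.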

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-29 04:58:08 +00:00
2d0168b7ab docs(sync): reflect the v0.20.2 surface in README + HANDOFF
Syncs user-visible surface changes per the CLAUDE.md docs-split rule.

README:
- [rag] prompt_template_version default rag-v2 → rag-v3 (v0.20.2)
- explanation of the v3 rule (answer language = question language)
- O-2 known limitation (small models may refuse in a mismatched language)

HANDOFF:
- add a one-line v0.20.2 summary under post-merge bugs/decisions
- mention the search-quality baseline (hybrid MRR=0.833) + O-2 known limitation

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-29 04:54:26 +00:00
4afcaf96d2 docs(release-notes): add v0.20.2-draft (rag-v3 response language + search-quality eval infrastructure)
Drafts the v0.20.2 release notes. Each finding is described in the four-paragraph user-impact structure.

- Finding #1/O-2: rag-v3 automatic response-language matching + language-neutral refusal
- Finding #2: bulk search input schema finalized (15 fields)
- Finding #3: list docs human-readable path improved
- Finding #7: the two index_version fields distinguished (vector vs FTS5)
- eval --config facade + search-quality baseline (hybrid hit@3=1.0 / MRR=0.833)
- Finding #4/#5/#6/#8: docs/schema cleanup
- version cascade caveat (rag-v3 → eval compare)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-29 04:54:19 +00:00
16c4579399 docs(hotfixes): add 2026-05-29 v0.20.2 dogfood findings + search-quality baseline
Records the 8-finding dogfooding round and the search-quality baseline results in HOTFIXES.

- summary table of the 8 findings (rag-v3, bulk schema, list docs, index_version, etc.)
- Finding O-2 known limitation (small models may refuse in a mismatched language)
- search-quality baseline table (hybrid MRR=0.833, lexical MRR=0.7)
- golden curation lesson (correcting the dispatch.py expected answer → hit@3 0.9→1.0)
- eval logs cross-link

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-29 04:54:11 +00:00
40d7faee71 docs(dogfood): add §10.2 search quality baseline scenario (v0.20.2 golden suite)
Now that the eval --config facade patch allows evaluating the dogfood KB directly,
adds a §10.2 search-quality baseline section to §10 Eval.

- golden suite run commands (hybrid + lexical eval run → aggregate)
- v0.20.2 metric baseline table (hybrid hit@3=1.0 / MRR=0.833)
- qualitative checklist (Korean 2-character hit@3, empty=0, MRR threshold)
- golden curation procedure + lesson from the dispatch.py error
- the existing basic eval run reorganized as §10.1

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-29 04:54:03 +00:00
a3bb2580bf test(rag): add rag-v3 dispatch integration test + refresh stale rag-v2 docs (code-review)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-29 04:46:27 +00:00
2429189447 docs(spec): reflect search-quality critic round-1 (eval --config, lang-filter non-goal, curation)
Incorporates all critic (opus) round-1 findings into the dogfood
search-quality eval design spec:

BLOCKER-1: §4.4 execution commands now use --config /build/dogfood/config.toml
  (Task A facade-rule patch makes this the canonical path).  §5.1 re-titled
  from "(후속 패치)" (follow-up patch) to "Task A로 적용됨 — 권장 운영 경로" (applied by Task A; recommended operating path); XDG workarounds
  demoted to "패치 전 fallback" (pre-patch fallback).  Intro paragraph updated accordingly.

MAJOR-1: §3 Non-Goals gains an explicit bullet: lang/media/code_lang
  SearchFilters validation is out of scope for this harness (runner uses
  SearchFilters::default(), runner.rs:151).  §4.1 "code 검색" (code search) row no longer
  claims code_lang filter coverage.

MINOR-1: §4.3 step 3 now names kebab inspect doc <id> as the primary
  chunk-selection path (breaks chunk-level curation loop); search hits
  demoted to "보조 확인용" (secondary check only).

MINOR-2: §4.1 golden category table gains two new rows: a Korean N-gram
  fallback query (compound-word/neologism coverage) and an English whole-token exact query
  (separates substring artefacts).

MINOR-3: §4.1 YAML header note added: record corpus_revision in golden
  file so stale-bail root cause is immediately traceable.

NIT: §9 References line numbers corrected (runner.rs:31, metrics.rs:116/144);
  runner.rs:151 SearchFilters::default() reference added.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-29 03:43:00 +00:00
d93b757cf1 fix(cli): thread --config through kebab eval run/aggregate/compare (facade-rule)
Cmd::Eval now loads Config via cli.config (same pattern as all other
subcommands) before dispatching to the inner match.  Each arm now calls
the *_with_config variant:

  run_eval(&opts)             → run_eval_with_config(&cfg, &opts)
  compute_aggregate(run_id)   → compute_aggregate_with_config(&cfg, run_id)
  store_aggregate(run_id, ..) → store_aggregate_with_config(&cfg, run_id, ..)
  Compare already called compare_runs_with_config but sourced cfg from
  Config::load(None) — that redundant load is removed; cfg comes from
  the shared binding above.

Fixes the same facade-rule regression pattern as P3-5 / P4-3: previously
`kebab --config /build/dogfood/config.toml eval run` silently evaluated
the XDG-default (empty) KB instead of the dogfood KB.

Also fixes runner.rs test that hardcoded rag-v2 after commit 5719969
bumped the default prompt_template_version to rag-v3.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-29 03:42:40 +00:00
571996938c docs(contract): bump default prompt_template_version to rag-v3 (Todo #1)
line 899: only V1 legacy → both V1 and V2 legacy; declares rag-v3 the default from v0.20.2.
line 1349 (★): config example default rag-v2 → rag-v3.
line 1533 (★): §9 cascade table code constant rag-v2 → rag-v3.
after line 287: add a historical-snapshot comment to the answer.v1 example block (n1: model+ptv stale, values unchanged).
task spec grep decision: the 2 rag-v2 mentions in tasks/p9/p9-fb-15 describe the point when rag-v2 was introduced (historical) → kept frozen.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-29 02:45:13 +00:00
be79bdb83d docs(#7): distinguish vector-store vs FTS5 index_version (Todo #7)
schema.schema.json models.index_version: states that it is the vector store (LanceDB) version.
search_hit.schema.json index_version: states that it is the lexical (FTS5) version.
search_hit.schema.json retrieval: adds the list of inner fields + a note that fusion is hybrid-only (shared hunk).
README kebab schema row: adds a caution that the two index_version fields mean different things.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-29 02:45:04 +00:00
4e76f103c1 docs(#5,#6): clarify retrieval.* nesting + single-mode score relation (Todo #5/#6)
Adds an explanation of the score ↔ retrieval.* structure to the README score-interpretation section:
- fusion_score/lexical_score/vector_score/lexical_rank/vector_rank live inside retrieval (not top-level).
- in single mode it is normal for score, fusion_score, and lexical/vector_score to hold the same value (Finding X).
search_hit.schema.json score field: adds the relation to score_kind + the reason for identical values in single mode.
The search_hit.schema.json retrieval/index_version descriptions are included in the Task 12 commit (same hunk).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-29 02:44:56 +00:00
4fd672193f docs(#4): clarify lang vs code_lang semantic and und=code (Todo #4)
lang_breakdown description: adds that natural-language detection is not run on code documents (lang="und" is expected).
README: adds a new section explaining lang vs code_lang.
task spec grep: the rag-v2 mentions in tasks/p9/p9-fb-15 are historical → kept frozen.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-29 02:44:34 +00:00
1454321b12 fix(rag): rag-v3 refusal/hedge phrasing follows answer language (Finding O-2)
SYSTEM_PROMPT_RAG_V3: replaces the Korean literal refusal/hedge phrases with language-neutral wording.
- 근거가 부족하면 "근거가 부족하다"고 답한다. → 답변 언어로 근거가 부족함을 밝히고 [#번호] 인용 없이 답한다.
- 근거가 모호하면 "확실하지 않다" 라고 명시한다. → 근거가 모호하면 답변 언어로 불확실함을 명시한다.

MULTI_HOP_SYNTHESIZE_SYSTEM_PROMPT: the same pattern replaced in two places.
- 근거가 부족하면 "근거가 부족하다"고 답한다. → 답변 언어로 근거가 부족함을 밝히고 [#번호] 인용 없이 답한다.
- self-check 의 즉시 "근거가 부족하다" 라고만 답한다. → 즉시 답변 언어로 근거가 부족하다고만 답한다.

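The refusal-determination logic (citation-marker based) is unchanged; only the wording is made language-neutral.
test rag_v3_contains_v2_rules_plus_language_rule: the "확실하지 않다" assert is updated to a "불확실함" assert.
task spec grep: the rag-v2 mentions in tasks/p9/p9-fb-15 describe the point of introduction (historical) → kept frozen.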

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-29 02:43:28 +00:00
649ec35108 feat(cli): add Ollama remote-endpoint hint to kebab init (Todo #8)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-29 02:19:34 +00:00
dece5e89fc feat(bulk): document bulk search input schema + error shape hint (Todo #2)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-29 02:16:21 +00:00
3cb49f1f9b feat(cli): show title + doc_path in list docs human output (Todo #3)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-29 02:03:10 +00:00
f5ff823984 docs(dogfood): add RAG response-language scenario (Todo #1 verification) 2026-05-28 21:28:22 +00:00
b82eaec21a feat(config): default prompt_template_version rag-v2 -> rag-v3 (Todo #1) 2026-05-28 21:20:59 +00:00
6daa43375b feat(rag): add rag-v3 system prompt with response-language matching (Todo #1) 2026-05-28 21:20:18 +00:00
85efeeca3e docs(plan): v0.20.2 dogfood findings implementation plan (15 tasks)
Written by planner (opus) → critic review attempted → coordinates verified by leader.
8 todos → 15 tasks: 4 code (rag-v3 / list docs / bulk / init) + 4 full-dogfooding verification tasks, one after each finding,
+ 3 docs-only + contract + HOTFIXES/release-notes + version bump.

Plan critic round-1 misdiagnosed coordinate blockers (B-1/B-2/M-1/M-2) because its environment tooling was broken →
the leader grep-verified pipeline.rs/config/cli/bulk/Cargo.toml directly and confirmed the plan coordinates are accurate,
then added an "anchor grep re-check" for the executor + a header warning about the binary path.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-28 21:09:39 +00:00
2b4ba8e104 docs(spec): v0.20.2 dogfood findings design + round-1 critic feedback
Designs the 8 todos found in the full v0.20.1 dogfooding (Ask response language rag-v3 / doc.lang
docs / bulk input / list title / fusion_score·score_kind / schema
index_version / Ollama hint) as a single patch release.

writer worker draft → opus critic round-1 review applied:
- B1: top-level score placeholder → finalized (score_kind declares the meaning, search.rs:95-99)
- M1: forcing the image caption language explicitly marked out-of-scope
- M2: notes that the config default test (lib.rs:1316) needs updating
- M3: all bulk input fields (query/mode/k/trust_min/ingested_after/media/tag/lang)
- M4: cascade impact of rag-v3 on eval_runs.config_snapshot_json

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-28 20:37:42 +00:00
b08941d6ab docs(handoff): mark brainstorming-needed items in v0.20.1 findings todo
Adds a fix path classification to each todo:
- 🧠 required: a design / user expectation decision is needed; brainstorming skill first
- 🔧 mechanical: spec drift or an obvious fix; no separate brainstorming
- 📝 mild discussion: the spec drafter can settle the trade-off in self-review

Classification:
- 🧠 required (2): #1 Ask English→Korean response policy, #4 doc.lang semantic
- 🔧 mechanical (4): #2 bulk schema, #5/#6 docs sync, #7 schema rename
- 📝 mild (3): #3 list title, #8 Ollama default

Additional brainstorming candidates (beyond the direct findings):
- BS-A: HTML corpus support (1415 files skipped)
- BS-B: Tier 1/2/3 chunker UX visibility
- BS-C: kebab dogfood subcommand (automation)
- BS-D: efficiency of tokenized_korean_text for English code chunks
- BS-E: expose the builtin_blacklist specification

Recommended workflow:
1. brainstorming stage first (#1, #4 + review each BS-x)
2. mechanical batch (#2, #5, #6, #7) in one PR
3. mild discussion batch (#3, #8)
4. dogfood retest → v0.20.2 patch release
2026-05-28 19:45:53 +00:00
6bf4e82e62 docs(handoff): v0.20.1 full dogfood findings todo for next session
Organizes the findings from the post-merge full dogfood of v0.20.1 (the user's real corpus of 6293 files, a 3.5-hour
ingest, scenarios §1~§11) into a self-contained
todo handoff for a new session.

P0 (bugs / behavior that differs from intent):
- #1 Ask English query → Korean response (forced by the rag-v2 prompt template)
- #2 bulk search input format unclear (wire schema not specified)
- #3 duplicate titles in list docs (heading-based; needs doc_path as a secondary field)
- #4 doc.lang = und for 53% (lang detection fails on code files)

P1 (docs drift):
- #5 location of fusion_score (.retrieval.fusion_score)
- #6 meaning of score_kind="bm25" (fusion_score in lexical mode)
- #7 confusion between schema index_version and lexical_index_version

P2 (setup):
- #8 Ollama endpoint defaults to localhost (the user's environment is remote)

Each todo lists severity, scenario, suspected location, and action item.
Includes the command to start a new session + branch recommendation + dogfooding re-run procedure + cumulative finding
table.

Repo state: main HEAD=a0c7fa3, clean. v0.20.1 binary OK. /build/dogfood/
KB (3940 docs, 34896 chunks) preserved for regression test.
2026-05-28 19:40:10 +00:00
a0c7fa3d1a docs(claude): consolidate the dogfood store under /build/dogfood/ + state the classification policy
Updates the dogfood data layout following the user's correction:

1. Location: /build/cache/dogfood/ → /build/dogfood/. /build/cache semantically means
   cache (regeneratable downloads/models), not test data.
   /build/dogfood was newly created with sudo + chown.

2. Classification policy: no kebab version / creation time / scenario name prefixes
   (no new directories like v0.20.1-dogfood/ or dogfood-v018/). All classification
   is by document meaning / kind / format only. The detailed layout is in
   /build/dogfood/README.md.

3. Single-directory policy: source documents + KB state + logs all live inside the one
   /build/dogfood/ directory. Each dogfooding run resets only kb/; no separate directories
   are created.

4. Forbidden locations: no new use of /tmp/kebab-*, /build/cache/dogfood*,
   /home/altair823/KnowledgeBase, or XDG paths.

Source dirs cleanup (done as separate work outside this commit):
- /build/cache/dogfood{,-p10b,-v017,-v018,-v0.19.0} all deleted (after the move).
- /home/altair823/KnowledgeBase and kebab-dogfooding also moved to /build/dogfood/.
- XDG paths snapshotted to /build/dogfood/_archive/xdg-state/.

Final corpus: 6293 files (markdown/code/html/manifest/resources), 554M.
2026-05-28 14:44:23 +00:00
ebc6bf45c4 Merge pull request 'feat(v0.20.1): Korean morphological tokenizer (V009) + N-gram supplement + eager backfill' (#191) from feat/korean-morphological-tokenizer into main
v0.20.1: V009 Korean morphological tokenizer + N-gram supplement (Bug #8 closure).

22 implementation commit + 2 docs polish (dogfood data consolidation + CLAUDE.md trigger policy).

Dogfood evidence: KnowledgeBase 1781 docs / 9050 chunks backfilled in 26.6 s; '한국' query 0 → 10 hits.

PR-level review verdict: APPROVED WITH NOTES (4 minor docs polish, all addressed).
2026-05-28 14:17:13 +00:00
d8fdc815be docs(claude): dogfood trigger + consolidated data store policy
User request: after the user collected the accumulated ad-hoc dogfooding data in one place,
/build/cache/dogfood/, the points at which dogfooding is needed were inferred and
added to CLAUDE.md as a policy section.

New section `## Dogfood trigger` (between Release and Naming):
- When dogfooding is needed (6 trigger categories: schema/migration, wire
  schema/CLI, search/RAG, performance, language/locale, file/asset).
- Release-level: evidence must be stated before the bump commit.
- Dogfooding data store: directory structure of /build/cache/dogfood/ +
  README.md cross-link + no new use of /tmp/kebab-*.
- Recording dogfooding results: HOTFIXES dated entry + the four-paragraph
  write-up in the release notes draft + DOGFOOD.md scenario catalog cascade.

Actual work:
- The 5 /build/cache/tmp/v0.20.1-* directories, the 2 /tmp/dogfood-* directories,
  and the related log files were all moved (mv) to /build/cache/dogfood/. Hard-coded
  paths in config.toml were rewritten automatically with sed.
- New /build/cache/dogfood/README.md: directory structure + procedure for starting a new
  scenario + V007 simulation pattern + cleanup policy.

Expected effect: beyond the git-tracked HOTFIXES and draft release
notes, the raw dogfooding data can be freely reused from one place. Dogfooding for a new
release can be checked incrementally on top of the previous KB.

Spec: docs/superpowers/specs/2026-05-28-v0.20.x-korean-morphological-tokenizer-spec.md §9 (dogfooding evidence cascade)
Plan: post-implementation infrastructure
2026-05-28 14:06:24 +00:00
9f2a56d091 docs(hotfixes): large-scale KnowledgeBase dogfood evidence (N-gram supplement)
Re-ran the backfill on the user's real /home/altair823/KnowledgeBase/ (1781 markdown / 9050 chunks)
with a v0.20.1 binary that includes the N-gram supplement:

- Backfill duration: 26.6 s (9050 chunks, OnceLock cache + 1000-row
  batch transaction). ~3 ms/chunk amortized.
- '한국' query: 0 hits on V007 → 10 hits on V009 + N-gram (functional closure of Bug #8
  verified by measurement).
- '한국어' query: 5 → 10 hits (morpheme and N-gram match together).
- English whole-token: 'token'/'pipeline'/'config' = 10 hits each
  (the V009 regression side behaves as expected).

Snippet evidence: the "문서를 한국어로 다시 정리하기" pattern in the KB's testdata/coding-md-corpus/*/...md
matches the '한국' query through ko-dic decomposition + the N-gram
window.

Other Korean queries (서울, 지하철, 대한민국, etc.) get 0 hits because the words themselves are absent
from the KB corpus: a data limitation, not a V009 implementation limitation.

Test data locations:
- /home/altair823/KnowledgeBase/ (the user's real KB, 1781 markdown)
- /build/cache/tmp/v0.20.1-dogfood/kb/ (ingested SQLite + LanceDB)
- /build/cache/tmp/v0.20.1-dogfood2/corpus/ (Korean wiki fixture)
- /build/cache/tmp/v0.20.1-v007strict/corpus/no-space.md (whitespace-less)
- /build/cache/tmp/v0.20.1-ngram/corpus/extra.md (대한민국, 한국정부, 주민등록번호)

Spec: docs/superpowers/specs/2026-05-28-v0.20.x-korean-morphological-tokenizer-spec.md §9 + Appendix B
Plan: docs/superpowers/plans/2026-05-28-v0.20.x-korean-morphological-tokenizer-plan.md (dogfood evidence final)
2026-05-28 14:02:02 +00:00
fe20be8195 feat(chunk): N-gram supplement (Option β): sub-token emit for Korean compounds
#4 (user request): promotes Option β (emit additional sub-tokens) from spec §6.2,
originally a v0.21.x P9 follow-up, into the v0.20.1 implementation.
Resolves the ko-dic compound-noun limitation seen in dogfooding (single-token policy for `대한민국`, `한국정부`,
`주민등록번호`, etc.).

Implementation (`crates/kebab-chunk/src/lib.rs::tokenize_korean_morphological`):
- New helper `is_hangul()`: tests for Hangul syllables (U+AC00..D7A3) + jamo
  (U+1100..11FF, U+3130..318F).
- For each morpheme in the lindera output that is Hangul-only and has length ≥ 3,
  additionally emits sliding-window 2-grams. The token list expands to the form
  `[한국정부, 한국, 국정, 정부]`.
- English / numeric / mixed tokens get no supplement (avoids false positives).
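
The supplement rule described above can be sketched in a few lines of standalone Rust. This is an illustration under assumptions rather than the code in kebab-chunk: the morphemes are passed in as plain strings instead of coming from lindera, the function name `supplement_2grams` is hypothetical, and the output order (each morpheme followed by its own 2-grams) is a guess.

```rust
/// Hangul syllables (U+AC00..D7A3) plus jamo (U+1100..11FF, U+3130..318F),
/// the ranges listed in the commit message.
fn is_hangul(c: char) -> bool {
    matches!(
        c,
        '\u{AC00}'..='\u{D7A3}' | '\u{1100}'..='\u{11FF}' | '\u{3130}'..='\u{318F}'
    )
}

/// Keeps every morpheme and, for Hangul-only morphemes of 3+ characters,
/// also emits each sliding-window 2-gram.
fn supplement_2grams(morphemes: &[&str]) -> Vec<String> {
    let mut out = Vec::new();
    for m in morphemes {
        out.push((*m).to_string());
        // Count characters, not bytes: a Hangul syllable is 3 bytes in UTF-8.
        let chars: Vec<char> = m.chars().collect();
        if chars.len() >= 3 && chars.iter().all(|c| is_hangul(*c)) {
            for window in chars.windows(2) {
                out.push(window.iter().collect());
            }
        }
    }
    out
}
```

For example, `supplement_2grams(&["한국정부"])` yields `[한국정부, 한국, 국정, 정부]`, while an English token such as `Rust` passes through unchanged, so no substrings like `Rus` are emitted.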

Tests (`crates/kebab-chunk/tests/tokenize_korean.rs`):
- `tokenize_korean_morphological_emits_2gram_for_long_morpheme`: checks the cases among the 5 probe
  fixtures where the supplement fires (measured: `서울특별시` →
  `[서울, 특별시, 특별, 별시]`, `대한민국` → `[대한민국, 대한,
  한민, 민국]`).
- `tokenize_korean_morphological_no_2gram_for_english`: guarantees that no English substrings
  (`Rus`, `ust`, `imi`) are emitted for the Rust optimization fixture.

Dogfood evidence (added to the `tasks/HOTFIXES.md` 2026-05-28 entry):
- The '대한', '한민', '민국' queries all hit (sliding window over 대한민국).
- Sub-token queries such as '특별', '주민', '등록' hit.
- The English 'tokenizer' query gets 0 hits because it is absent from the corpus (no supplement).
- Trade-off: DB size +20-30% (Korean-heavy), small risk of false positives.

Spec: docs/superpowers/specs/2026-05-28-v0.20.x-korean-morphological-tokenizer-spec.md §6.2 (Option β promote)
Plan: docs/superpowers/plans/2026-05-28-v0.20.x-korean-morphological-tokenizer-plan.md (post-implementation enhancement)
2026-05-28 13:48:05 +00:00
028d9ad4ea docs(release): v0.20.1 release notes draft + spec/plan dogfood cross-link
#1 (user request): write the release notes draft + strengthen the dogfood
evidence cross-links in the spec/plan.

docs/release-notes/v0.20.1-draft.md (new):
- Four-paragraph body (Korean 2-character query support + English substring regression + automatic V007→V009
  backfill + ingest performance impact).
- Migration cascade table (lexical_index_version, corpus_revision,
  wire schema shape preservation).
- API + dependency changes (lindera v3, lindera-ko-dic v3, retired
  short_query_hint helper, new facade APIs).
- Breaking changes stated (English substring regression, first-boot latency, larger DB/
  binary size).
- Upgrade procedure + Known limitation + 14 dogfood scenario reference.

spec Appendix B (segmentation evidence):
- "Empirical verification (2026-05-28 dogfood — post-merge update)"
  subsection added. Table of prior-knowledge assumptions vs measured results. Scenarios
  1-4 all marked verified. Records the evidence that ko-dic decomposes
  '서울특별시' → '[서울, 특별시]'.

plan Changelog:
- post-implementation entry: 22 commit on branch, S3 blockers, S7
  cascade, S11 sanity regression updates, opus PR review 4 finding
  fixes.
- dogfood evidence entry: 14 scenario verify pass, ko-dic decomposition
  evidence, HOTFIXES + spec Appendix B cross-link.

Spec: …spec…md Appendix B
Plan: …plan…md (post-implementation + dogfood evidence Changelog)
Release notes: docs/release-notes/v0.20.1-draft.md
2026-05-28 13:34:33 +00:00
a3513c9110 docs(hotfixes): V009 dogfood verification evidence (2026-05-28)
Adds the dogfood verification results for the V009 Korean morphological tokenizer to the HOTFIXES
2026-05-28 entry: hit counts for the 14 scenarios + evidence of ko-dic
compound-noun decomposition (서울특별시 → [서울, 특별시]) + the known limitation
accepted under Option α.

Reference corpus: korea-overview.md +
korea-compound.md from DOGFOOD.md §2.1bis (10 KB total, 2 markdown files). KB ingest + all 14 query
checks came out as expected.

Separates the reference-fixture evidence from the fact that 0 Korean lexical hits is
normal on an English/code-centric KB such as the user's KnowledgeBase, so that users
do not misread it.

Spec: docs/superpowers/specs/2026-05-28-v0.20.x-korean-morphological-tokenizer-spec.md §9
Plan: docs/superpowers/plans/2026-05-28-v0.20.x-korean-morphological-tokenizer-plan.md (S11 + dogfood evidence)
2026-05-28 13:24:29 +00:00
f2a76cfe94 docs(dogfood): V009 morphological tokenizer scenarios + fixture evidence
Reflects the fixtures + scenarios from the v0.20.1 dogfood verification in DOGFOOD.md /
SMOKE.md. States that 0 Korean hits is normal (the tokens are absent) on an English/code-centric KB
such as the user's KnowledgeBase, and provides inline a reference fixture
(korea-overview.md + korea-compound.md) for verifying ko-dic morpheme
decomposition.

DOGFOOD.md §2.1 updates:
- description: trigram → unicode61 + morpheme column.
- scenarios: expanded to 7 cases, including Korean 2-char (한국, 서울) + compound noun (서울특별시) +
  English whole-token regression + 1-char filter.
- new §2.1bis: V009 dogfood evidence reference corpus + verification commands +
  expected snippets (evidence of lindera decomposition) + known limitation (ko-dic
  single-token policy for compounds, Option α acceptance).

SMOKE.md 'V009 morphological 검색' (search) updates:
- removes the trigram-era hint advisory + the scenario recommending 3-character keywords.
- replaces them with the v0.20.1 scenarios: 2-char Korean / compound noun / 1-char filter / English
  whole-token regression.

Reference fixture (measured, verify pass):
- korea-overview.md: '한국' / '서울' / '지하철' all hit.
- korea-compound.md: '한국어' / '한국문화' / '서울특별시' compound hit.

Spec: docs/superpowers/specs/2026-05-28-v0.20.x-korean-morphological-tokenizer-spec.md §9
Plan: docs/superpowers/plans/2026-05-28-v0.20.x-korean-morphological-tokenizer-plan.md (dogfood evidence)
2026-05-28 13:22:43 +00:00
8c56ef3010 docs(superpowers): v0.20.x C Korean morphological tokenizer spec + plan artifacts
This commit archives the 5 artifact files from the 4-stage workflow for the v0.20.x C task
(Bug #8: Korean 2-character queries get 0 hits):

- spec.md (668 lines, status=accepted): Option A/B/C comparison + decision for lindera
  Path A (accepting the English substring regression) + 12 sections + 4 Appendices
  (B segmentation evidence, C cost evidence, D license evidence).
- spec-critic-r1.md: 3 critical + 6 major findings (NEEDS_REWRITE).
- spec-critic-r2.md: traceability matrix after the r1c rewrite (ACCEPT).
- plan.md (750 lines, status=accepted): 11 steps + dependencies +
  cost optimization routing + 9 closure micro-patches applied.
- plan-closure-r1.md: traceability matrix + origin of the 9 MPs.

These artifacts are a frozen reference once the implementation merges. For later
deviations, tasks/HOTFIXES.md is the source of truth.

Workflow stage:
1. spec drafter (omc team writer, opus)
2. spec critic R1 (omc team critic, opus) — NEEDS_REWRITE
3. spec rewriter r1c (omc team writer, opus) — 7 item fix
4. spec closure R2 (in-process verifier, sonnet) — ACCEPT
5. plan drafter (omc team planner, opus)
6. plan closure (in-process verifier, sonnet) — ACCEPT + 9 MP
7. subagent-driven-development implementation (11 step + 5 follow-up
   + 1 docs polish = 17 commit)
8. PR-level final code review (in-process code-reviewer, opus) —
   Approved with notes (4 minor docs finding, merge as-is)

Branch: feat/korean-morphological-tokenizer
Version: 0.20.1
2026-05-28 12:53:31 +00:00
5d9ea588ed docs(v0.20.1): polish PR-review findings (README/HOTFIXES/schema/SKILL)
Mechanical fixes for the 4 minor findings from the opus PR-level final review (Approved with notes):

1. README.md: the English substring matching wording in the `kebab search` row was
   still from the V007 era. States the V009 whole-token regression (substring → V002
   behavior) honestly + recommends vector/hybrid mode.
2. tasks/HOTFIXES.md: corrects file paths in the 2026-05-28 entry. lexical.rs
   does not call lindera; it only updates MIN_QUERY_CHARS 3→2 in build_match_string.
   The actual owner of the lindera helper is kebab-chunk/src/lib.rs.
   ingest.rs is outside this PR's scope; the eager backfill hook lives in
   kebab-app/src/app.rs::App::open_with_config.
3. docs/wire-schema/v1/search_response.schema.json: the `hint` field
   description still carried the advisory wording from the V007 trigram 3-char minimum era.
   States that in v0.20.1 the helper is retired and the field is always omitted
   (the field stays in the schema only for forward compatibility).
4. integrations/claude-code/kebab/SKILL.md: resolves the self-contradiction in the
   `hint` field description ("present only with trigram in edge cases" vs
   "Korean 2-char now supported"). States that it is retired and can be reused.

PR-level reviewer recommendation: "Merge as-is: not a blocking issue (all
findings minor)". This commit adopts the reviewer's option 1 (a separate docs hotfix
commit).

Spec: docs/superpowers/specs/2026-05-28-v0.20.x-korean-morphological-tokenizer-spec.md
Plan: docs/superpowers/plans/2026-05-28-v0.20.x-korean-morphological-tokenizer-plan.md (PR-level finding follow-up)
2026-05-28 12:53:00 +00:00
53ec9b4dc5 test(chunk): regenerate AST + long-section snapshots for V009 chunk field
The Chunk struct update in S3 (adding the tokenized_korean_text:
Option<String> field in kebab-core) changes the serde
serialization of every chunk snapshot JSON. Regenerates the baselines of 10 snapshot fixtures
(9 AST chunker + markdown long-section) in V009 form.

The change in each snapshot = a `"tokenized_korean_text":
null` field added to every chunk JSON (most fixtures are English code, so lindera falls back
to None). No behavior change; only a cascade in the serde representation.

Spec: docs/superpowers/specs/2026-05-28-v0.20.x-korean-morphological-tokenizer-spec.md §6.2
Plan: docs/superpowers/plans/2026-05-28-v0.20.x-korean-morphological-tokenizer-plan.md (S3 follow-up via S11 sanity)
2026-05-28 12:27:37 +00:00
21b52bc285 style: cargo fmt --all (S3+S4+S5+S7 follow-up)
Formatting cleanup for the V009 morphological tokenizer work (S3 chunk + S4 backfill + S5
short_query_hint removal + new S7 tests). No behavior change.

Spec: docs/superpowers/specs/2026-05-28-v0.20.x-korean-morphological-tokenizer-spec.md
Plan: docs/superpowers/plans/2026-05-28-v0.20.x-korean-morphological-tokenizer-plan.md (S11)
2026-05-28 12:06:01 +00:00
97fd895a10 chore(release): bump version 0.20.0 → 0.20.1
Both triggers in CLAUDE.md §Release / binary version bump are hit:
- User dogfooding needed (resolving Bug #8, Korean 2-character queries: verifying searches for
  '한국', '서울', '지하철').
- Frozen design contract changed (§5.5 chunks_fts: unicode61 + CASE
  expression triggers + tokenized_korean_text column).

Besides the V009 + lindera ko-dic morphological analyzer integration, the v0.20.x logging
round 2 enhancement (PR #190) belongs to the same v0.20.x series and is
cut together in the v0.20.1 patch release.

Build verification: ./target/release/kebab --version → kebab 0.20.1.

Spec: docs/superpowers/specs/2026-05-28-v0.20.x-korean-morphological-tokenizer-spec.md §12.1
Plan: docs/superpowers/plans/2026-05-28-v0.20.x-korean-morphological-tokenizer-plan.md (S10)
2026-05-28 12:05:31 +00:00
d13eb87401 docs(v0.20.x): sync README + HANDOFF + ARCH + SKILL + HOTFIXES for V009
Cascades the user-visible surface changes of the V009 Korean morphological tokenizer +
release notes scope into 5 docs.

- README.md: states Korean 2-character query support in the kebab search command row.
- integrations/claude-code/kebab/SKILL.md: removes the V007 3-char hint + adds
  one line on V009 2-character Korean query support.
- HANDOFF.md: flips the C task status to done + adds this change to the v0.20.1 release notes
  scope + a summary row under post-merge findings.
- docs/ARCHITECTURE.md: adds embedding upgrade (e5-small → e5-large),
  lindera-ko-dic FTS5 Korean support, and version notes.
- tasks/HOTFIXES.md: 2026-05-28 entry covering Bug #8 resolved by V009, lindera-ko-dic as the
  actual crate name (spec deviation), cargo-deny deferred, and the Path A
  English substring regression.

Spec: tasks/p9/p9-9-v0.20.x-korean-morphological-tokenizer-spec.md §7.4
Plan: docs/superpowers/plans/2026-05-28-v0.20.x-korean-morphological-tokenizer-plan.md

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2026-05-28 11:55:25 +00:00
26f3a7756c test(eval): regenerate runner_per_query_snapshot for V009 baseline
Intended cascade of the BM25 distribution + hit ordering changes caused by the V009 FTS5
tokenizer (trigram → unicode61 + morphemes). Regenerates the eval runner's
per-query snapshot (run-1.json) as the V009 baseline.

Regeneration procedure: UPDATE_SNAPSHOTS=1 cargo test -p kebab-eval
--test runner runner_per_query_snapshot_matches_fixture -j 4.
0 regressions: every test in the kebab-eval workspace (runner / loader /
metrics_and_compare) passes.

Spec: docs/superpowers/specs/2026-05-28-v0.20.x-korean-morphological-tokenizer-spec.md §11.3
Plan: docs/superpowers/plans/2026-05-28-v0.20.x-korean-morphological-tokenizer-plan.md (S8)
2026-05-28 11:51:49 +00:00
881f949fcb test(search): regenerate lexical_snapshot_run_1 for V009 BM25 distribution
With the V009 FTS5 tokenizer changing from trigram to unicode61 + the tokenized_korean_text
column, the BM25 IDF computation changes → the floating-point values of fusion_score
differ slightly (hit order, snippets, and structure are identical). Regenerates the snapshot,
whose update was missed right after the S1 merge, as the V009 baseline.

No regression: `cargo test -p kebab-search --test lexical lexical_snapshot_run_1 -j 4`
exit 0.

Spec: docs/superpowers/specs/2026-05-28-v0.20.x-korean-morphological-tokenizer-spec.md §11.3
Plan: docs/superpowers/plans/2026-05-28-v0.20.x-korean-morphological-tokenizer-plan.md (S1 follow-up via S7 reviewer)
2026-05-28 11:49:15 +00:00
c5de5f812b test(fts,app): V009 morphological tokenizer integration tests
Add 4 new tests:

- crates/kebab-store-sqlite/tests/fts.rs:
  - fts_v009_korean_morphological_2char_query_hits: the 2-char query
    '한국' hits a chunk whose tokenized_korean_text column is populated.
  - fts_v009_english_whole_token_only: V007 trigram substring matching
    regresses (Path A); the query 'token' gets 0 hits on a 'tokenizer' chunk.
- crates/kebab-app/tests/search_korean.rs:
  - korean_morphological_2char_query_lexical_mode: end-to-end; ingest a
    Korean wiki fixture, then the queries '한국' / '서울' hit.
  - korean_morphological_mixed_english_korean_query: 'Rust' English
    whole-token + '최적화' Korean morpheme hit.

crates/kebab-search/src/lexical.rs:
  - build_match_string(): MIN_TRIGRAM_CHARS(3) → MIN_QUERY_CHARS(2).
    V009 unicode61 has no minimum token length, so 2-char Korean
    morpheme queries must pass through. A lone 1-char query is still
    filtered out.
  - Update 2 related unit tests to V009 behavior.
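
The MIN_QUERY_CHARS filter described above can be sketched as follows. This
is a minimal illustration, not the actual build_match_string(): the
whitespace split, the quoting, and the OR-combine are assumptions, and only
the constant name and its value (2) come from this commit message.

```rust
/// Minimum query-token length in characters (value named in the commit).
const MIN_QUERY_CHARS: usize = 2;

/// Hypothetical simplification of build_match_string(): keep
/// whitespace-separated tokens with at least MIN_QUERY_CHARS characters,
/// quote them, and OR-combine. Returns None when every token is too short.
fn build_match_string(query: &str) -> Option<String> {
    let tokens: Vec<String> = query
        .split_whitespace()
        // Count Unicode scalar values, not bytes: "한국" is 2 chars, 6 bytes.
        .filter(|t| t.chars().count() >= MIN_QUERY_CHARS)
        // Double embedded quotes so each token is a valid quoted string.
        .map(|t| format!("\"{}\"", t.replace('"', "\"\"")))
        .collect();
    if tokens.is_empty() {
        None
    } else {
        Some(tokens.join(" OR "))
    }
}
```

Counting with chars() rather than len() is the point of the sketch: a
byte-length check would treat a single Hangul syllable (3 bytes in UTF-8) as
long enough.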

The fixture text depends on lindera ko-dic's actual segmentation behavior
(spec Appendix B predicts it from prior knowledge). The fixture may be
adjusted after measurement.

Spec: docs/superpowers/specs/2026-05-28-v0.20.x-korean-morphological-tokenizer-spec.md §9.1, §9.2
Plan: docs/superpowers/plans/2026-05-28-v0.20.x-korean-morphological-tokenizer-plan.md (S7)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-28 11:38:52 +00:00
f94e0c4a9b feat(app): bump lexical_index_version to V009 (fts5-v009-korean-morphological)
The V009 FTS5 tokenizer changed from trigram to unicode61 plus a Korean
morphological decomposition column. Append the
`fts5-v009-korean-morphological` suffix to the lexical_index_version
format to distinguish it from the V007 baseline. The eval runner's
config_snapshot and search cache invalidation pick this up automatically.

Old format: lex:{chunker_version}
New format: lex:{chunker_version}:fts5-v009-korean-morphological
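
As a small illustration of the new format string, assuming a hypothetical
helper name (lexical_index_version as a free function) and a placeholder
chunker version:

```rust
/// Suffix that marks the V009 FTS5 layout (taken from the commit message).
const FTS_SUFFIX: &str = "fts5-v009-korean-morphological";

/// Hypothetical helper: build the lexical index version string.
fn lexical_index_version(chunker_version: &str) -> String {
    format!("lex:{}:{}", chunker_version, FTS_SUFFIX)
}
```

Any cache keyed on this string is invalidated by the suffix change alone,
which matches the "picked up automatically" behavior described above.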

No change to the wire schema shape (only the string content of
SearchHit.index_version changes). The
lexical_index_version_is_returned_unchanged test uses an arbitrary
IndexVersion string, so it is unchanged.

Spec: docs/superpowers/specs/2026-05-28-v0.20.x-korean-morphological-tokenizer-spec.md §11.1, §11.3
Plan: docs/superpowers/plans/2026-05-28-v0.20.x-korean-morphological-tokenizer-plan.md (S6)
2026-05-28 11:23:13 +00:00
923b959610 refactor(app): retire short_query_hint helper, keep wire field as None
With the V009 unicode61 + morphological tokenizer, 2-char Korean queries
can now hit, so the V007-era "3 or more characters recommended" hint is
obsolete. The SearchResponse.hint field stays on the struct to preserve
the wire schema and is always None.

- kebab-app/src/app.rs: delete the short_query_hint function and its
  doc-comment. The 2 call sites now set hint = None.
- kebab-app/src/lib.rs: remove short_query_hint from the re-exports.
- kebab-tui/{app.rs,search.rs,run.rs}: remove the short_query_hint field
  and the 4 cascading call sites.
- kebab-cli/tests/wire_search_response.rs:
  delete the search_plain_emits_short_query_hint_to_stderr test.
  Replace search_json_emits_hint_field_for_short_query with
  search_json_hint_absent_for_short_query_v009
  (verifies hint is always None).
- kebab-search/src/lexical.rs::build_match_string: the V007 trigram
  multi-token OR-combine branch is redundant under V009 but kept
  (future extensibility); add a 1-line doc-comment.

No change to the wire schema shape (the hint field at
search_response.schema.json:33 is preserved and always set to None on
the struct).

Spec: docs/superpowers/specs/2026-05-28-v0.20.x-korean-morphological-tokenizer-spec.md §7.2, §7.3, §11.3
Plan: docs/superpowers/plans/2026-05-28-v0.20.x-korean-morphological-tokenizer-plan.md (S5)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-28 11:13:45 +00:00
b63af20b72 feat(app): first-boot eager backfill for tokenized_korean_text
On a V007 → V009 upgrade, tokenized_korean_text is NULL for existing
chunks. The first App::open_with_config call automatically decomposes
them with lindera ko-dic and runs UPDATE. The chunks_au trigger
re-indexes chunks_fts automatically. Users do not need to re-ingest.

- crates/kebab-store-sqlite/src/store.rs:
  backfill_tokenized_korean_text(progress_cb, tokenize) API. Commits and
  invokes the progress callback every 1000 rows. Idempotent (the IS NULL
  filter makes re-running after partial completion safe). Takes the
  tokenizer as a parameter to keep the §8 dep boundary.
- crates/kebab-app/src/app.rs::open_with_config: call backfill right
  after run_migrations. On failure, only log a warning (App open still
  succeeds; vector/hybrid modes keep working). Info-log progress every
  500 rows.
- crates/kebab-store-sqlite/tests/fts.rs:
  backfill_tokenized_korean_text_populates_nullable_rows unit test
  (including idempotency).
- Fix pre-existing clippy errors (redundant_closure, map_unwrap_or,
  cast_lossless, uninlined_format_args: kebab-app/ingest_log.rs,
  pdf_ocr_apply.rs, app.rs, tests/ocr_inspect_smoke.rs).
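
The batched, idempotent backfill described above can be sketched against an
in-memory stand-in for the chunks table. This is an illustration only: the
Row struct, the function signature, and the choice to store an empty string
when the tokenizer returns nothing are all assumptions; the real API works
on SQLite rows inside transactions.

```rust
/// In-memory stand-in for a chunks row (hypothetical; the real store is SQLite).
struct Row {
    text: String,
    tokenized_korean_text: Option<String>,
}

/// Fill missing tokenized_korean_text values in batches of `batch_size`.
/// Only rows that are still None are touched, so re-running after a partial
/// run is safe. Returns the number of rows updated.
fn backfill<T, P>(rows: &mut [Row], batch_size: usize, tokenize: T, mut progress: P) -> usize
where
    T: Fn(&str) -> Option<String>,
    P: FnMut(usize),
{
    let mut done = 0;
    let mut in_batch = 0;
    for row in rows.iter_mut().filter(|r| r.tokenized_korean_text.is_none()) {
        // Assumption: an empty string marks "processed, nothing to index".
        row.tokenized_korean_text = Some(tokenize(&row.text).unwrap_or_default());
        done += 1;
        in_batch += 1;
        if in_batch == batch_size {
            // The real store would COMMIT here before reporting progress.
            progress(done);
            in_batch = 0;
        }
    }
    if in_batch > 0 {
        progress(done); // final partial batch
    }
    done
}
```

Passing the tokenizer in as a closure mirrors the dependency boundary noted
above: the storage layer never needs to link the tokenizer crate.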

Spec: docs/superpowers/specs/2026-05-28-v0.20.x-korean-morphological-tokenizer-spec.md §8.1, §8.2
Plan: docs/superpowers/plans/2026-05-28-v0.20.x-korean-morphological-tokenizer-plan.md (S4)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-28 11:01:00 +00:00
e8f44a57e3 fix(test): update first_ingest_bumps_corpus_revision baseline for V009
V004 seeds corpus_revision=0, V009 migration bumps to 1 (spec §5.2 —
LRU cache invalidation). Test previously asserted fresh store = 0;
now reads post-migration baseline dynamically and verifies that the
ingest commit increments past it.

Spec: docs/superpowers/specs/2026-05-28-v0.20.x-korean-morphological-tokenizer-spec.md §5.2
Plan: docs/superpowers/plans/2026-05-28-v0.20.x-korean-morphological-tokenizer-plan.md (S3 follow-up)
2026-05-28 10:47:09 +00:00
4b4a8cbb3a fix(test): retire V007 trigram-only fts tests, add V009 unicode61 sanity
The 3 tests that verified V007 trigram tokenizer substring matching are
obsolete, because V009 unicode61 introduces an intended regression
(spec §3 Non-Goals, Path A):

- fts_trigram_korean_3char_substring_hits: the '발생한' → '발생한다' hit
  relied on trigram substring matching, so it fails under V009
  whole-token matching.
- fts_trigram_korean_short_query_zero_hit_pinned: the 0-hit behavior of
  2-char Korean queries is resolved by the V009 morpheme column, so the
  pin itself is obsolete (S7 replaces it with a new 2-char hit test).
- fts_trigram_english_substring_hits: the 'token' → 'tokenizer' hit
  fails under V009 unicode61 whole-token-only matching.

Newly added:
- fts_v009_unicode61_space_separated_korean_token_hits: sanity check for
  V009 unicode61 whole-token matching (token '충돌은' hits, substring
  '발생한' gets 0 hits). A baseline separate from the morphological
  verification tests that S7 will add.

S7 (plan §2 Step 7) will add v009_korean_morphological_2char_query_hits +
v009_english_whole_token_only to pin both the regression and the new
behavior.

Spec: docs/superpowers/specs/2026-05-28-v0.20.x-korean-morphological-tokenizer-spec.md §3, §9.2
Plan: docs/superpowers/plans/2026-05-28-v0.20.x-korean-morphological-tokenizer-plan.md (S3 follow-up)
2026-05-28 10:36:32 +00:00
4dc1c10be1 fix(test): update corpus_revision baseline to post-V009 migration
V009 migration bumps corpus_revision by 1 at apply time (spec §5.2 —
invalidates pre-V009 LRU search cache). Existing tests assumed V004
seed (0) was the final baseline; updated to expect 1 after migration:

- fresh_store_starts_at_zero → fresh_store_starts_at_post_migration_baseline
- bump_increments_monotonically: expected 1,2,3 → 2,3,4 (post-baseline)
- revision_persists_across_reopen: 2 → 3 (V009 +1, +bump,+bump)

Spec: docs/superpowers/specs/2026-05-28-v0.20.x-korean-morphological-tokenizer-spec.md §5.2
Plan: docs/superpowers/plans/2026-05-28-v0.20.x-korean-morphological-tokenizer-plan.md (S3 follow-up)
2026-05-28 10:33:50 +00:00
bd86f61c9c fix(chunk): close S3 reviewer blockers — get_chunk read + AST chunker cascade
The S3 spec compliance reviewer (sonnet) found 2 blockers:

1. crates/kebab-store-sqlite/src/documents.rs: the get_chunk SELECT did
   not query the tokenized_korean_text column, so the DB value was lost
   on read. Add it to the SELECT column list and add the row.get index
   in the row → Chunk conversion. Cascades through the ChunkRow struct,
   chunk_row_from_sql, and Chunk construction in get_chunk.

2. crates/kebab-chunk/src/code_*_ast_v1.rs (9 files): make_chunk
   hardcoded tokenized_korean_text: None, so code files with Korean
   comments got no FTS hits. Call tokenize_korean_morphological(text),
   following the same pattern as tier2_shared.

This commit is a rework of S3: a separate commit rather than an amend
(keeps the S3 boundary). Satisfies the spec §6.2 invariant ("every
chunker calls tokenize right before emitting a chunk").

Spec: docs/superpowers/specs/2026-05-28-v0.20.x-korean-morphological-tokenizer-spec.md §6.2
Plan: docs/superpowers/plans/2026-05-28-v0.20.x-korean-morphological-tokenizer-plan.md (S3 rework)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-28 10:30:53 +00:00
b134ae9dd5 feat(chunk): integrate lindera korean morphological tokenizer
Decompose text with lindera ko-dic into the morpheme sequence stored in
the V009 tokenized_korean_text column. The chunk builder pipeline calls
it right after chunk creation and pre-fills the field on the chunk
struct, and the store's put_chunks INSERTs within a single transaction.

- crates/kebab-core/src/chunk.rs: add a
  tokenized_korean_text: Option<String> field to the Chunk struct
  (#[serde(default)]).
- crates/kebab-chunk/src/lib.rs: tokenize_korean_morphological()
  helper + OnceLock caching + fallback (None) policy.
- crates/kebab-chunk/Cargo.toml: add lindera features = ["embed-ko-dic"]
  (required to enable DictionaryKind::KoDic).
- All chunkers (tier2_shared, md_heading_v1, pdf_page_v1, the 9 code
  AST v1 chunkers): pre-fill tokenized_korean_text in the Chunk literal.
- crates/kebab-store-sqlite/src/documents.rs::put_chunks: update the
  INSERT SQL column list, placeholders, and bindings (12th column).
- crates/kebab-chunk/tests/tokenize_korean.rs: 2 unit tests.

lindera 3.0.7 API corrections: load_dictionary_from_kind →
load_embedded_dictionary, Token.text → Token.surface.
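
The "OnceLock caching + fallback (None)" policy mentioned above might look
roughly like this. The Tokenizer type here is a hypothetical whitespace
splitter standing in for lindera so that the sketch stays dependency-free;
it does not reproduce lindera's API or ko-dic segmentation.

```rust
use std::sync::OnceLock;

/// Hypothetical stand-in for the lindera tokenizer; it only splits on
/// whitespace, which is NOT how ko-dic segments Korean text.
struct Tokenizer;

impl Tokenizer {
    /// Stand-in for dictionary loading, which can fail in a real helper.
    fn load() -> Option<Tokenizer> {
        Some(Tokenizer)
    }

    fn tokenize(&self, text: &str) -> Vec<String> {
        text.split_whitespace().map(str::to_owned).collect()
    }
}

/// Process-wide tokenizer, initialized at most once.
static TOKENIZER: OnceLock<Option<Tokenizer>> = OnceLock::new();

fn tokenize_korean_morphological(text: &str) -> Option<String> {
    // A failed load is cached as None, so later calls fall back to None
    // without retrying the expensive initialization.
    let tokenizer = TOKENIZER.get_or_init(Tokenizer::load).as_ref()?;
    let tokens = tokenizer.tokenize(text);
    if tokens.is_empty() {
        None
    } else {
        // Space-joined tokens are what a unicode61-style tokenizer can split again.
        Some(tokens.join(" "))
    }
}
```

Caching the Option rather than the tokenizer itself means every chunker can
call the helper unconditionally and simply store whatever it returns.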

Spec: docs/superpowers/specs/2026-05-28-v0.20.x-korean-morphological-tokenizer-spec.md §6.2
Plan: docs/superpowers/plans/2026-05-28-v0.20.x-korean-morphological-tokenizer-plan.md (S3)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-28 10:22:15 +00:00
597d8b70ad feat(deps): add lindera + lindera-ko-dic for korean morphological tokenizer
Add workspace dependencies only; the actual use is the
tokenize_korean_morphological() helper in kebab-chunk (S3).

- Cargo.toml (workspace): add lindera = "3", lindera-ko-dic = "3".
- crates/kebab-chunk/Cargo.toml: per-crate dep (the embed-ko-dic feature
  on lindera-ko-dic enables the embedded KO-DIC dictionary blob).
- crates/kebab-app/Cargo.toml: add fts_korean_morphological to [features]
  (spec §6.3 Option A: marker role only, no disable path).

License: lindera = MIT, lindera-ko-dic = MIT
(confirmed with cargo info). Introducing cargo deny is a P9 follow-up.

Spec: docs/superpowers/specs/2026-05-28-v0.20.x-korean-morphological-tokenizer-spec.md §6.1, §10.1
Plan: docs/superpowers/plans/2026-05-28-v0.20.x-korean-morphological-tokenizer-plan.md (S2)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-28 10:03:58 +00:00
b106120e93 feat(fts): add V009 korean morphological tokenizer migration
Add the V009 migration to resolve the V007 trigram tokenizer's 0-hit
limitation for 2-char Korean queries (Bug #8). It reverts to the
unicode61 tokenizer and pre-fills the Korean morphological decomposition
into a separate column, `tokenized_korean_text`.

- migrations/V009__fts_korean_morphological.sql (new): column ADD,
  chunks_fts DROP + redefinition, 3 triggers with CASE expressions,
  backfill INSERT, corpus_revision bump.
- design §5.5 updated: trigram → unicode61 + morpheme column. Trigger
  bodies with CASE expressions.
- crates/kebab-store-sqlite/tests/fts.rs: rename the V007 verbatim test
  to use V009 as the source of truth. Add the v009_bumps_corpus_revision
  unit test.
- store.rs: fix existing clippy bool_to_int_with_if + cast_lossless
  warnings (pdf_ocr_events-related code, found during S1 work).

English substring matching regresses to V002 behavior (whole-token
only); this is stated candidly in spec §3 Non-Goals and in the upcoming
release notes (v0.20.1).

Spec: docs/superpowers/specs/2026-05-28-v0.20.x-korean-morphological-tokenizer-spec.md
Plan: docs/superpowers/plans/2026-05-28-v0.20.x-korean-morphological-tokenizer-plan.md (S1)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-28 09:48:46 +00:00
43366b1b15 docs(handoff): new-session handoff for C, Korean morphological tokenizer (Bug #8)
After merging v0.20.0 sub-item 1 + bugfix 1~4 + ingest log r1+r2, this is
the self-contained context for the next priority, C (Korean
morphological tokenizer).

Covers the new session's first step, workflow patterns, environment/memory
references, cascade risk, and possible fix paths (Option A jieba-rs /
B bi-gram supplement / C query-side expansion). Same spec/plan/executor
cycle pattern.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-28 08:18:04 +00:00
70507e94ca docs(handoff): logging round 2 closed (PR #190 merged) + update v0.20.1 release notes scope
After PR #190 (2026-05-28 commit 7bbdc89a) merged:
- "Logging feature future enhancements" section → closed (all 4 enhancements shipped in PR #190).
- Update the release notes scope in the G (v0.20.1 release) section: add V008 migration + pdf_ocr_events + CLI inspect + retention.

Next priorities unchanged: C → B → A → G.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-28 08:15:07 +00:00
7bbdc89ae3 Merge pull request 'feat: ingest log round 2 — image_w/h + V008 SQLite mirror + CLI inspect + retention' (#190) from feat/ingest-log-round2-enhancements into main
Reviewed-on: #190
2026-05-28 08:12:25 +00:00
7c24734cc7 docs(superpowers): v0.20.x logging r2 spec + plan artifacts
Round artifacts for logging round 2 (4 enhancements: image_w/h + V008 SQLite mirror + CLI inspect + retention) after spec/plan ACCEPT.

- spec: 751 lines (ACCEPT, 7/7 critic round 1 findings + 7/7 closure r2 traceability)
- plan: 576 lines (ACCEPT, 6/6 steps + 13/13 AC + G1/G2/G3 resolved at plan level)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-28 08:04:32 +00:00
9a36a06f97 style: cargo fmt --all (v0.20.x logging r2 feature follow-up)
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-28 06:34:01 +00:00
35c987df1c feat(app): log retention — keep_recent_runs + retention_days (Enhancement 4)
LoggingCfg gains two fields with serde defaults: keep_recent_runs
(default 100, top-N file retention) and retention_days (default 30,
time-based retention for both ndjson files and the SQLite mirror).

IngestLogWriter::open now runs cleanup_old_logs before creating a new
ingest-*.ndjson — delete iff (idx >= keep_recent) OR (modified <=
cutoff). ingest_with_config_opts also calls
SqliteStore::prune_pdf_ocr_events(retention_days) at ingest start so
the SQLite mirror tracks the same retention window.
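
The deletion rule above, delete iff (idx >= keep_recent) OR (modified <=
cutoff), can be sketched as a pure function. The LogFile struct and the
function name are hypothetical, and the sketch assumes the caller has already
sorted the files newest first so that the index is the recency rank.

```rust
/// One log file: name plus modification time in seconds since the Unix epoch.
struct LogFile {
    name: String,
    modified_secs: u64,
}

/// Return the names to delete. `files` must be sorted newest first, so the
/// index doubles as the recency rank.
fn files_to_delete(files: &[LogFile], keep_recent: usize, cutoff_secs: u64) -> Vec<&str> {
    files
        .iter()
        .enumerate()
        // Delete iff beyond the top-N window OR at/older than the cutoff.
        .filter(|(idx, f)| *idx >= keep_recent || f.modified_secs <= cutoff_secs)
        .map(|(_, f)| f.name.as_str())
        .collect()
}
```

Because the two conditions are OR-ed, a file inside the top-N window is
still removed once it is older than the cutoff, and a fresh file is still
removed once it falls outside the window.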

Backward compat (AC-9): both new fields use #[serde(default = ...)],
so a pre-v0.20.x config with only [logging] ingest_log_enabled +
ingest_log_dir parses unchanged. kebab init writes the new defaults
automatically via Config::default() -> toml::to_string_pretty (AC-12).

docs/SMOKE.md config example synced.

Closure r1 F5: explicit OR-on-stale comment inside cleanup_old_logs.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-28 06:17:47 +00:00
d9ec7b8dc3 feat(cli): kebab inspect ocr-stats + ocr-failures (Enhancement 3 + wire schema additive minor)
Two new wire schemas land as additive minor: ocr_stats.v1 (corpus-wide
aggregate — total_events, success_rate, p50/p90/p99/max_ms, by_engine,
top-10 by_doc by failure count) and ocr_failures.v1 (per-doc or
corpus-wide recent failures, with --doc-id + --limit). Both ship via
new CLI subcommands `kebab inspect ocr-stats` / `inspect ocr-failures`.

App gains four facade methods: inspect_ocr_stats /
inspect_ocr_failures plus their *_with_config companions — required by
CLAUDE.md "the facade rule" so `--config <path>` is honored. The CLI
dispatch arms thread cfg explicitly into the _with_config form.

Runtime introspection emit (WIRE_SCHEMAS in schema.rs) gains two
entries; the meta JSON Schema (schema.schema.json) is untouched
because its wire.schemas is pattern-based, not enum-based.

ingest_log::percentiles extended to (p50, p90, p99, max). p99 surfaces
only via inspect ocr-stats; IngestSummary (round 1) stays 3-percentile.
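
A four-value percentile helper of the shape described above might look like
this. The commit does not say which percentile definition
ingest_log::percentiles uses; this sketch assumes the nearest-rank method,
and the function names are illustrative.

```rust
/// Nearest-rank percentile over an ascending-sorted, non-empty slice
/// (p in 1..=100). Assumption: the real helper may interpolate instead.
fn nearest_rank(sorted: &[u64], p: usize) -> u64 {
    // rank = ceil(p / 100 * n), 1-based, computed with integer arithmetic.
    let rank = (p * sorted.len() + 99) / 100;
    sorted[rank.max(1) - 1]
}

/// Returns (p50, p90, p99, max), or None when there are no samples.
fn percentiles(samples: &[u64]) -> Option<(u64, u64, u64, u64)> {
    if samples.is_empty() {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort_unstable();
    Some((
        nearest_rank(&sorted, 50),
        nearest_rank(&sorted, 90),
        nearest_rank(&sorted, 99),
        *sorted.last()?,
    ))
}
```

With few samples, p90 and p99 collapse onto the maximum under nearest-rank,
which is worth keeping in mind when reading stats from small ingest runs.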

SKILL.md synced with the two new schemas (AC-13).

Closure r2 G2 (facade *_with_config pair) + G3 (runtime emit, not
meta schema file) + closure r1 F4 (p99) resolved.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-28 06:13:08 +00:00
4e451c9f7c feat(app): dual-write PDF OCR events to SQLite + ndjson (Enhancement 2 wiring)
Pre-capture canonical.doc_id and Arc<SqliteStore> before the OCR
emit_progress closure so both the ndjson file and the SQLite mirror
carry the same doc_id for every event. File write is durable
(errors propagate); SQLite insert is non-critical (tracing::warn on
failure, ingest does not abort) per spec R-1.

LogEvent::Ocr gains a doc_id: Option<&str> field as an additive
Serde change — round 1 ndjson logs deserialize with doc_id=None.

Closure r1 F1: doc_id NULL in dual-write resolved via
let doc_id_for_log = canonical.doc_id.0.clone() pre-capture.
Closure r2 G1: Arc::clone(&app.sqlite) reused instead of opening a
second SqliteStore — eliminates double-open lock contention and
duplicate migration runs.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-28 06:06:03 +00:00
6482bf1321 feat(store): V008 pdf_ocr_events migration + record/prune API (Enhancement 2)
Add migrations/V008__pdf_ocr_events.sql with the events table + 3
indices (doc_id, run_id, ts). SqliteStore gains two pub fn:
record_pdf_ocr_event (insert one OCR sample) and prune_pdf_ocr_events
(delete rows older than retention_days; returns the affected row
count). Both follow the existing Mutex<Connection> lock pattern.

Wiring into ingest path lands in the next commit.

Closure r1 F2: explicit lock acquisition in both methods.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-28 05:56:54 +00:00
5977c8cdf1 feat(app): capture image_width/height in PDF OCR raster decode (Enhancement 1)
Add extract_image_dimensions(bytes) helper using image::ImageReader
and fill the 2 PdfOcrProgress::Finished emit points in pdf_ocr_apply.rs
where page_image_bytes is in scope (OCR error path + success path).
The no-DCTDecode skip path leaves None as page_image_bytes is absent.
Result: LogEvent::Ocr carries non-null image_width/image_height on
successful raster decode, enabling future size-conditioned timeout tuning.

Closure r1 F3: kebab-app/Cargo.toml image features += "jpeg" added as
direct [dependencies] entry (not relying on feature unification via
kebab-parse-image).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-28 05:54:55 +00:00
89d334a92b docs(handoff): next priorities after v0.20.0 sub-item 1 merge (C/B/A/G order)
Record the next work priorities after PR #189 (2026-05-28 commit 09333d0) merged:

- C (next up): Korean morphological tokenizer (follow-up to Bug #8, the V007 trigram 2-char limitation)
- B: OCR dense page coverage (metro-korea.pdf page 8/13 timeout: dynamic max_pixels / column-level OCR / gradual timeout reduction)
- A: deferred v0.20 sub-items (sub-item 2 multi-region image / sub-item 3 PDF normalize integration / TODO #4 figure-table / TODO #5 Enricher trait)
- G: v0.20.1 patch release + release notes (180s timeout + [logging] section + wire schema additive + CLI fix)

Non-blocking known: Bug #12 falsified, ask phrasing-sensitive refusal.

Logging feature follow-up enhancements (low priority): image_width/height capture, SQLite mirror, CLI inspect query, log retention.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-28 04:48:44 +00:00
09333d0b05 Merge pull request 'feat(pdf): scanned PDF OCR via qwen2.5vl:3b vision LLM (v0.20.0 sub-item 1)' (#189) from feat/pdf-scanned-ocr into main
Reviewed-on: #189
2026-05-28 04:37:41 +00:00
685007789a style: cargo fmt --all (round 4 ingest log feature follow-up)
The last `fix(test): clippy + fmt fixes` commit from the Phase C4
executor applied fmt only to the test files. Workspace-wide fmt was
found to be missing → applied cargo fmt --all. All imports reordered
alphabetically and line wrapping made consistent.

Untracked artifacts committed at the same time:
- docs/superpowers/specs/2026-05-28-v0.20-ingest-log-spec.md (491 lines, ACCEPT)
- docs/superpowers/plans/2026-05-28-v0.20-ingest-log-plan.md (616 lines, ACCEPT)

workspace test: 1370 passed / 0 failed / 50 ignored, ingest_log_smoke green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-28 04:18:40 +00:00
445b096215 fix(test): clippy + fmt fixes for logging_roundtrip and ingest_log_smoke
  * kebab-config/tests/logging_roundtrip.rs: r#"..."# → plain string
    (clippy::unnecessary_hashes).
  * kebab-app/tests/ingest_log_smoke.rs: |e| e.ok() → Result::ok,
    |s| s.as_u64() → Value::as_u64 (clippy::redundant_closure).
  * cargo fmt --all applied to pre-existing formatting drift.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-28 03:26:42 +00:00
415227bf76 test(app): ingest_log_smoke integration test (AC-9)
crates/kebab-app/tests/ingest_log_smoke.rs (new):

  * ingest_log_smoke (AC-9): tempdir + 1 md + 1 scanned PDF →
    ingest → assert the log file exists + each line is valid JSON +
    each kind ∈ {ocr,parse_error,skip,error,summary} + the last
    line has kind=summary + scanned>0.

  * ingest_log_disabled_emits_no_file (AC-6): when enabled=false,
    verify there are 0 ingest-*.ndjson files in log_dir.

fixture: reuses ../kebab-parse-pdf/tests/fixtures/scanned_page1.pdf
(OCR disabled, so the smoke test runs without Ollama).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-28 03:06:43 +00:00
f9dc0f749f feat(app): wire IngestLogWriter into 5 ingest emit hooks (Arc<Mutex> sync)
Ingest pipeline wiring for the v0.20.x ingest log feature. 5 emit hooks:

  Hook 1: ingest_with_config_opts entry/exit (writer init + summary write + flush)
  Hook 2: apply_ocr_to_pdf_pages closure (PdfOcrProgress::Finished → LogEvent::Ocr)
  Hook 3: ingest_one_*_asset Err arm (LogEvent::Error)
  Hook 4: enumerate fs_skips.events right after scan (LogEvent::Skip)
  Hook 5: (merged into Hook 3) per-asset fatal error → LogEvent::Error

To carry Hook 4's skip events, add an events: Vec<FsSkipEvent> field to
FsScanSkips in kebab-source-fs (kebab-source-fs does not call back into
kebab-app, which avoids a dependency cycle).

Ownership: a single Option<Arc<Mutex<IngestLogWriter>>> binding; the 5
hooks clone+lock+write. ocr_ms_samples (Vec<u64>, success-only) is
shared via Arc<Mutex>, and the summary stage sorts it and computes
p50/p90/max. The per-asset loop is single-threaded, so there is no
deadlock/contention risk.

A writer failure does not fail the ingest itself (tracing::warn, then
continue).
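
The ownership pattern above (one optional Arc<Mutex<...>> binding that every
hook clones, locks, and writes through, with write failures downgraded to
warnings) can be sketched as follows. LogWriter, SharedLog, and emit are
hypothetical names; the real IngestLogWriter writes ndjson files and logs
through tracing.

```rust
use std::io::Write;
use std::sync::{Arc, Mutex};

/// Hypothetical minimal writer: one line per event into any `Write` sink.
struct LogWriter<W: Write> {
    sink: W,
}

impl<W: Write> LogWriter<W> {
    fn write_event(&mut self, line: &str) -> std::io::Result<()> {
        writeln!(self.sink, "{line}")
    }
}

/// Shared handle; None means logging is disabled.
type SharedLog<W> = Option<Arc<Mutex<LogWriter<W>>>>;

/// What each hook does: lock, write, and swallow failures so that a broken
/// log never aborts the ingest.
fn emit<W: Write>(log: &SharedLog<W>, line: &str) {
    if let Some(writer) = log {
        match writer.lock() {
            Ok(mut guard) => {
                if let Err(err) = guard.write_event(line) {
                    // The commit uses tracing::warn; eprintln keeps this std-only.
                    eprintln!("ingest log write failed: {err}");
                }
            }
            Err(_) => eprintln!("ingest log mutex poisoned"),
        }
    }
}
```

Cloning the Option<Arc<...>> is cheap (a reference-count bump), so each hook
can hold its own handle without threading a borrow through the pipeline.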

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-28 03:05:07 +00:00
bef0c98867 feat(wire): PdfOcrProgress.Finished + ingest_progress.v1 additive 4 fields
Wire side of the v0.20.x ingest log feature. Additive minor cascade:

  * 4 fields on PdfOcrProgress::Finished + IngestEvent::PdfOcrFinished:
      - image_byte_size: Option<u64>
      - image_width:     Option<u32>
      - image_height:    Option<u32>
      - failure_reason:  Option<String>
  * docs/wire-schema/v1/ingest_progress.schema.json: 4 added properties
    (all optional, no change to required = additive minor)
  * integrations/claude-code/kebab/SKILL.md: wire schema description synced

Existing ingest_progress.v1 consumers (CLI wire dump, integration test
fixtures, kebab-cli wire_search/wire_ask) stay backward-compatible
because the 4 added fields are Option::None. No version bump (additive
minor = not a binary-version cascade trigger per CLAUDE.md §Versioning
cascade).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-28 02:57:59 +00:00
f8a4c79727 feat(app): IngestLogWriter + LogEvent enum (per-ingest-run ndjson log)
Module side of the v0.20.x ingest log surface.
crates/kebab-app/src/ingest_log.rs (new):
  * IngestLogWriter: open/write_event/write_summary/flush + flush on Drop
  * LogEvent enum, 4 variants (ocr / parse_error / skip / error)
  * IngestSummary struct (kind="summary" literal + 11 stat fields)
  * generate_run_id (ISO 8601 prefix + last 8 hex digits of a uuid v7)
  * expand_log_dir (hand-rolled expansion of the {state_dir} placeholder)
  * now_ts (Rfc3339 UTC helper)
  * percentiles helper (sorted Vec p50/p90/max)

uuid v7 = workspace dep; avoids adding rand as a new dependency (spec §6 R-5).

This step is the self-contained writer + 5 unit tests. Wiring the 5 emit
hooks into the ingest pipeline is step 4.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-28 02:53:09 +00:00
f60304beb4 feat(config): add [logging] section (ingest_log_enabled + ingest_log_dir)
Config side of the v0.20.x ingest log surface. New LoggingCfg struct:
  * ingest_log_enabled (bool, default true)
  * ingest_log_dir (PathBuf, default "{state_dir}/logs")

With the #[serde(default)] tag, a pre-v0.20 config that lacks the
[logging] section initializes to LoggingCfg::default() automatically
(AC-10 backward compat).

The actual expansion of the {state_dir} placeholder is handled by the
expand_log_dir helper in step 2 (IngestLogWriter); kebab-config's
expand_path_with_base does not support {state_dir} (spec §6 R-3).
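
A hand-rolled {state_dir} expansion of the kind described above can be as
small as a string replace. The signature below is an assumption; only the
helper name and the placeholder come from the commit message.

```rust
use std::path::{Path, PathBuf};

/// Hypothetical hand-rolled expansion of the `{state_dir}` placeholder.
/// Templates without the placeholder are returned unchanged.
fn expand_log_dir(template: &str, state_dir: &Path) -> PathBuf {
    PathBuf::from(template.replace("{state_dir}", &state_dir.to_string_lossy()))
}
```

For example, the default template "{state_dir}/logs" with a state directory
of /var/lib/kebab would resolve to /var/lib/kebab/logs, while an absolute
override such as /tmp/logs passes through untouched.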

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-28 02:44:21 +00:00
6a9551e0fa fix(config): pdf.ocr.request_timeout_secs default 60 → 180 (Bug #11 follow-up)
In the round 3 final dogfood (2026-05-28), the 60s default forced OCR
timeouts on dense Korean pages (metro-korea.pdf page 8/9/13), so one more
page was lost from the index compared with round 2. From the user's
perspective, 60s cuts too far into coverage in the cost vs coverage
trade-off.

Adopt a gradual-reduction policy toward the sweet spot: start from a
conservative 180s and reduce gradually based on dogfood evidence (the
distribution of average OCR ms). Do not jump straight to a short default
like 60s.

- crates/kebab-config/src/lib.rs::default_pdf_ocr_request_timeout_secs() = 180
- unit test rename (_is_60s → _is_180s) + assertion 180
- crates/kebab-config/tests/pdf_ocr.rs assert_eq 180
- tasks/HOTFIXES.md: add a 2026-05-28 follow-up entry

The user override path is preserved: users can tune the value directly
via config.toml [pdf.ocr] request_timeout_secs = N.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-28 01:40:23 +00:00
46e99470eb docs(superpowers): v0.20 sub-item 1 bugfix1/2/3 specs + plans + DOGFOOD.md
Artifacts of the 3-round dogfood-driven fix cycle:

- bugfix1 (Bug #2/#3/#4): spec 964 lines + plan 848 lines
- bugfix2 (Bug #6/#7, #8 falsified): spec 308 lines + plan 388 lines
- bugfix3 (Bug #9/#10/#11/#13/#14, #12 falsified): spec 410 lines + plan 1043 lines
- docs/DOGFOOD.md: the full comprehensive dogfood checklist (§0 environment ~ §13 reference corpus)

Each round's spec/plan was frozen after critic + verifier round 2 closure ACCEPT. Based on dogfood-driven evidence.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-28 01:21:34 +00:00
9b44e27dfe test(app): update schema_report assertion for streaming_ask=true (Bug #9 follow-up)
schema_report_reflects_freshly_ingested_kb asserted `!streaming_ask`, but
the Bug #9 fix (760eee8) corrected streaming_ask to true. Invert the assertion.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 23:58:10 +00:00
854a180365 fix(cli): add active_parsers + active_chunkers to Models test fixture in wire.rs (Bug #13)
The Models struct extension in Step 4 (adding active_parsers / active_chunkers)
missed the test fixture initialization in crates/kebab-cli/src/wire.rs → E0063 compile error.
#[serde(default)] applies only to serde deserialization; a struct literal must initialize every field.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 23:18:27 +00:00
5bba95fd71 docs(spec): HOTFIXES entry + parent spec cross-link for Bug #11 timeout deviation
Frozen-spec deviation handoff for Bug #11 (previous commit
`fix(config): pdf.ocr.request_timeout_secs default 600 → 60`).

- tasks/HOTFIXES.md: 2026-05-27 dated subsection in the 5-field format
  Discovered / Symptom / Root cause / Fix / Amends (matches existing entries).
- docs/superpowers/specs/2026-05-27-pdf-scanned-ocr-spec.md: 2 inline HTML comment
  lines as cross-links at line 1000 of the PDF OCR config block (default value) and
  at OQ-1 line 1628. No prose changes: the parent spec's frozen contract is
  preserved, and HTML comments are invisible when markdown is rendered.

The HOTFIXES entry is the live source of truth (CLAUDE.md "Spec contract" rule).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 23:16:18 +00:00
2c7fa7142a fix(cli): empty query emits error.v1 invalid_input for search + ask (Bug #14)
Before: `kebab search "" --json` / `kebab search "  " --json` / `kebab ask "" --json`
all exited with 0 and either silently returned 0 hits (search) or made an
LLM round-trip with an empty prompt (ask). User mistakes (typos, shell
expansion slips) stayed silent → debugging cost.

After: in both arms, `query.trim().is_empty()` → kebab_app::StructuredError
(ErrorV1, code=invalid_input, with a hint). exit=2 (StructuredError → the
generic non-zero path of the existing exit_code()).

--bulk mode is unaffected (the bulk arm ignores query).
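
The empty-query guard described above reduces to a trim-and-check before any
search or LLM work starts. The error struct here is a hypothetical,
pared-down shape; the real ErrorV1 carries more fields, and the hint text is
made up for illustration.

```rust
/// Hypothetical minimal shape of a structured error.
#[derive(Debug, PartialEq)]
struct StructuredError {
    code: &'static str,
    hint: &'static str,
}

/// Reject empty or whitespace-only queries before doing any work.
fn validate_query(query: &str) -> Result<&str, StructuredError> {
    if query.trim().is_empty() {
        return Err(StructuredError {
            code: "invalid_input",
            hint: "query must not be empty", // illustrative wording only
        });
    }
    Ok(query)
}
```

Checking the trimmed string covers both the empty argument and the
whitespace-only argument that an unquoted shell variable can produce.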

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 23:16:08 +00:00
d9c7aabce1 feat(schema): add active_parsers + active_chunkers arrays to schema.v1.models (Bug #13)
Before: schema.v1.models reported only single parser_version /
chunker_version values → risk of missing entries in version cascade audits
of a multi-medium corpus (md + pdf + code Rust/Python + dockerfile + k8s +
manifest).

After: additive minor. Add active_parsers + active_chunkers Vec<String>
to the Models struct. Backward compat: the existing single fields are
preserved (markdown default), and the new arrays are optional
(#[serde(default)] + not in the JSON schema's required list).

source:
- kebab_store_sqlite::fetch_distinct_parser_versions() returns
  documents.parser_version with DISTINCT + ORDER BY.
- fetch_distinct_chunker_versions() follows the same pattern for chunks.chunker_version.
- collect_models recomputes on every schema call (no cache, so R-3 resolves automatically).

The wire schema change is additive only; no major bump is needed. A
v0.20.1 minor is enough.
integrations/claude-code/kebab/SKILL.md updated in sync.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 23:15:58 +00:00
10b0e2f4f2 fix(config): pdf.ocr.request_timeout_secs default 600 → 60 per dogfood evidence (Bug #11)
metro-korea.pdf v0.20 final-dogfood (2026-05-27):
- page 8 and page 13 both timed out completely at the 600s default
  (`ms: 600000, chars: 0, skipped: true`)
- result: body text not indexed + 20 minutes of cost wasted per page

The measured per-page throughput of cloud GPU Ollama is 6-32s (much faster
than the 105s assumed by the parent spec). 60s is a production-friendly
upper bound. For dense/high-resolution pages, users can raise it via a
config.toml override (`[pdf.ocr] request_timeout_secs = N`). The HOTFIXES
entry + parent spec cross-link follow in Step 6.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 23:15:21 +00:00
28f513795e fix(config): emit error.v1 code=config_not_found for missing --config path (Bug #10)
Before: `kebab search "rust" --config /tmp/nonexistent.toml --json` exited with 0
and silently fell back to the XDG default with `{"hits":[]}`. A typo or wrong
path surfaced only as 0 hits: a debugging nightmare.

After: add kebab_config::ConfigNotFound (thiserror::Error); the
`Some(p) if !p.exists()` arm of Config::load returns
anyhow::Error::new(ConfigNotFound { path }). kebab_app::error_wire::classify
downcasts it → ErrorV1 with code=config_not_found, hint, and details.path
filled in, emitted to stderr as ndjson.

R-1 (relative path): std::path::Path::exists() is cwd-relative, so both
absolute and relative paths are covered without extra work. Verified by two
integration tests.
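
The typed-error-plus-downcast flow above can be sketched with the standard
library alone. The real code uses thiserror and anyhow; this sketch swaps in
a hand-written Error impl and Box<dyn Error>, and load_config, classify, and
the "internal" fallback code are hypothetical stand-ins.

```rust
use std::error::Error;
use std::fmt;
use std::path::{Path, PathBuf};

/// Typed error for an explicit config path that does not exist.
#[derive(Debug)]
struct ConfigNotFound {
    path: PathBuf,
}

impl fmt::Display for ConfigNotFound {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "config file not found: {}", self.path.display())
    }
}

impl Error for ConfigNotFound {}

/// Stand-in for Config::load: an explicit but missing path is an error
/// instead of a silent fallback to the default location.
fn load_config(explicit: Option<&Path>) -> Result<String, Box<dyn Error>> {
    match explicit {
        Some(p) if !p.exists() => Err(Box::new(ConfigNotFound { path: p.to_path_buf() })),
        Some(p) => Ok(std::fs::read_to_string(p)?),
        None => Ok(String::new()), // default-location lookup elided
    }
}

/// Stand-in for classify: map a type-erased error to a wire error code.
fn classify(err: &(dyn Error + 'static)) -> &'static str {
    if err.downcast_ref::<ConfigNotFound>().is_some() {
        "config_not_found"
    } else {
        "internal" // hypothetical fallback code
    }
}
```

Keeping the typed error in the config layer and the wire mapping in the app
layer means the config crate never has to know about the wire schema.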

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 23:14:54 +00:00
760eee89c8 fix(app): flip streaming_ask + single_file_ingest capabilities to actual surface (Bug #9)
capabilities_snapshot() reported streaming_ask + single_file_ingest as hardcoded
false, but the actual implementations were production-grade in the v0.20 final-dogfood:
- kebab ask --stream → emits 191 answer_event.v1 ndjson events correctly
- kebab ingest-file <path> / kebab ingest-stdin --title <T> → ingest_report.v1 correct

When agents such as the MCP host and the Claude Code skill decide routing from
schema.capabilities, the false negative makes users believe that working
features are unavailable.

http_daemon stays false (not implemented; it belongs to a separate sub-item).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 23:13:57 +00:00
f763049923 test(cli): assert 'code' in search --help output (Bug #7 regression pin)
Why: the Step 2 doc-comment edit risks silently disappearing if someone
later reorders the value list or splits it into an alias section. clap
renders --help from the doc-comment's free-form text, so a grep-only
smoke test is the only means of detection.

Change: new test file (follows the kebab-cli `cli_*` prefix convention).
Runs a fresh binary via CARGO_BIN_EXE_kebab and asserts the `code`
substring in stdout. Maps 1:1 to the acceptance row in spec §4.4.

Refs: docs/superpowers/specs/2026-05-27-v0.20-sub1-bugfix2-spec.md §4.4
/ §5 (acceptance row 4).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 15:47:55 +00:00
8cf73d1f43 docs(cli): list 'code' in --media help string + SKILL.md (Bug #7)
Why: kebab search --media code has been functionally supported since
v0.18.0 (handled first-class through a path outside MEDIA_KINDS;
schema.v1.media_breakdown.code exists). However, the value lists in the
SearchArgs clap doc-comment and SKILL.md line 57 were stale: `code` was
missing. A user reading only --help could conclude that code is
unsupported.

Change: sync 2 surfaces: the multi-line clap doc-comment at main.rs
lines 158-160 + integrations/claude-code/kebab/SKILL.md line 57.
No change to the Rust binary surface / wire schema.

Out of scope (follow-up): crates/kebab-mcp/tools/search.rs:44,
crates/kebab-core/src/search.rs:32+52,
crates/kebab-app/src/ingest_progress.rs:69, and
crates/kebab-cli/tests/wire_schema_breakdowns.rs:35 hold the same stale
list. They are outside the grep boundary of the spec ACCEPT (round 1c),
so this round does not include them.

Refs: docs/superpowers/specs/2026-05-27-v0.20-sub1-bugfix2-spec.md
§4.3 / §4.3a.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 15:47:16 +00:00
a58ee10dfb fix(parse-pdf): strip Identity-H Unimplemented marker + dominance heuristic in compute_valid_char_ratio (Bug #6)
Why: ingest of metro-korea.pdf (Identity-H CID font without ToUnicode CMap)
wrongly finished with pdf_ocr_pages=0. The `?Identity-H Unimplemented?`
marker emitted by lopdf 0.32.0 (28 ASCII chars) passes the
0x0020..=0x007E range in is_valid_text_char() → ratio=1.0 → bypasses the
0.5 OCR fallback threshold.

Change: MOJIBAKE_MARKERS const + a 4-stage compute_valid_char_ratio()
(strip → zero if empty after trim → dominance cap at 0.3 → existing
ratio). The marker list is extensible. is_valid_text_char() body unchanged.
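
The four stages can be sketched as follows. This is a minimal illustration,
not the code in text_quality.rs: the simplified is_valid_text_char(), the
"more than half of the characters" definition of dominance, and the
DOMINANCE_CAP name are assumptions.

```rust
/// Markers an extractor may emit in place of undecodable text.
/// Assumption: the real MOJIBAKE_MARKERS list may contain other entries.
const MOJIBAKE_MARKERS: &[&str] = &["?Identity-H Unimplemented?"];

/// Cap applied when markers dominate the extracted text (hypothetical name).
const DOMINANCE_CAP: f32 = 0.3;

/// Simplified stand-in: printable ASCII plus Hangul syllables only.
fn is_valid_text_char(c: char) -> bool {
    matches!(c, '\u{0020}'..='\u{007E}' | '\u{AC00}'..='\u{D7A3}')
}

fn compute_valid_char_ratio(text: &str) -> f32 {
    // Stage 1: strip every known marker.
    let mut stripped = text.to_string();
    for marker in MOJIBAKE_MARKERS {
        stripped = stripped.replace(marker, "");
    }
    // Stage 2: only markers and whitespace means no usable text.
    let trimmed = stripped.trim();
    if trimmed.is_empty() {
        return 0.0;
    }
    // Stage 4 input: the plain valid/total ratio of what is left.
    let total = trimmed.chars().count();
    let valid = trimmed.chars().filter(|c| is_valid_text_char(*c)).count();
    let ratio = valid as f32 / total as f32;
    // Stage 3: cap the ratio when markers made up more than half of the
    // original characters (assumed definition of "dominance").
    let original = text.chars().count();
    let removed = original - stripped.chars().count();
    if removed * 2 > original {
        ratio.min(DOMINANCE_CAP)
    } else {
        ratio
    }
}
```

With a cap below the 0.5 threshold, a page made mostly of markers falls
back to OCR even when the few remaining characters are valid.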

Tests: +2 unit tests (dominance + minority) on top of the existing 8.
No change to parser_version or the wire schema.

Refs: docs/superpowers/specs/2026-05-27-v0.20-sub1-bugfix2-spec.md
§4.1 / §4.2 / §6 R-1.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 15:42:59 +00:00
e674ff474b fix(parse-pdf): F4 mojibake.pdf via pikepdf surgery; preserve 1-page invariant (Bug #4)
Bug #4 from the v0.20.0 sub-item 1 dogfood report: lopdf `get_pages()`
returns count = 0 for F4 mojibake.pdf (Pages tree broken). Root cause: the
previous byte-level `re.sub` + manual startxref edit passes lopdf's strict
load but breaks the `/Kids` reference in the Pages dict.

- `tests/fixtures/_synth/mojibake.py`: full rewrite. Replace byte-level
  `re.sub` + manual startxref with pikepdf open+inject-dummy-ToUnicode+
  del+save (auto xref regen). HYSMyeongJo-Medium CID font: the CID font
  does not generate ToUnicode on its own, so a dummy stream is injected
  and then stripped (removed=1 invariant). Exit codes 2/3/4 for invariant fail.
- `crates/kebab-parse-pdf/tests/fixtures/mojibake.pdf`: regenerate via
  pikepdf: 1 valid page, no /ToUnicode marker, byte-identical and reproducible.
- `crates/kebab-parse-pdf/tests/snapshots/vector_pdf_canonical.json`:
  regen via 2-run cargo test pattern (hand-rolled unwrap_or_else baseline
  bootstrap, no insta crate).
- `crates/kebab-parse-pdf/tests/text_extractor_regression.rs`: append 3
  invariant tests: (1) lopdf 1-page, (2) /ToUnicode marker absent,
  (3) PdfTextExtractor 1-block invariant.
- `crates/kebab-parse-pdf/src/text_quality.rs`: f4_fixture_ratio_under_threshold
  threshold 0.3 → 0.5 (the production valid_ratio_threshold default). The old
  broken fixture (pages=0) gave extract_text="" → ratio=0.0; the new fixed
  fixture goes through CID 2-byte fallback decode → ratio≈0.375, which still
  satisfies the OCR trigger condition.

spec: docs/superpowers/specs/2026-05-27-v0.20-sub1-bugfix-spec.md (§5)
plan: docs/superpowers/plans/2026-05-27-v0.20-sub1-bugfix-plan.md (Step 4)
prior: 241ded5 (Step 3 integration test)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-27 14:02:17 +00:00
241ded59df test(app): multi-scanned PDF chunk_id collision-free integration test (Bug #3 regression)
v0.20.0 sub-item 1 bugfix Step 3 (Group C) — integration-level regression
for Bug #3 (intra-doc chunk_id collision under aggressive overlap).

- `crates/kebab-app/tests/common/mod.rs`: `pub mod mock_ocr;` 1 line append.
- `crates/kebab-app/tests/common/mock_ocr.rs` (new): MockOcrEngine lift +
  `single` / `per_page` ctor (backward-compat single + per-page cursor).
- `crates/kebab-app/tests/pdf_ocr_apply.rs`: remove the inline MockOcrEngine +
  `mod common; use common::mock_ocr::MockOcrEngine;` import. 10 ctor call
  site migration (`MockOcrEngine { .. }` → `MockOcrEngine::single(...)`).
- `crates/kebab-app/tests/multi_scanned_pdf_ingest_no_chunk_id_collision.rs`
  (new): F1 + F2 scanned PDF + Bug #3 trigger shape (10 char "가" + ". " +
  500 char "나") via mock OCR. assertion: chunk_id global uniqueness (HashSet
  dedup) across F1 + F2; F2 trigger text produces ≥2 chunks (collision shape).
- C1 decision: Option A (share via tests/common/mock_ocr.rs). Facade mock
  injection unavailable (OllamaVisionOcr hardcoded) — helper-level chain test
  (apply_ocr_to_pdf_pages → PdfPageV1Chunker) adds value beyond unit B5.

spec: docs/superpowers/specs/2026-05-27-v0.20-sub1-bugfix-spec.md (§4.5)
plan: docs/superpowers/plans/2026-05-27-v0.20-sub1-bugfix-plan.md (Step 3)
prior: 436fd01 (Step 2 Bug #3 chunk_id fix)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-27 13:45:38 +00:00
436fd015a2 fix(chunk): chunk_id collision under aggressive overlap; bump pdf-page-v1 → pdf-page-v1.1 (Bug #3)
Bug #3 (Critical) from the v0.20.0 sub-item 1 dogfood report. Ingesting
scanned_page2.pdf (1580 chars of OCR text) violated the `chunks.chunk_id`
PRIMARY KEY: `per_chunk_hash = #c{char_start}` used the post-overlap
`actual_start`, and the overlap walk floor collapsed to `prev_min` →
segments 1 and 2 both got `#c0`.

- `crates/kebab-chunk/src/pdf_page_v1.rs`: `chunk_page` returns 4-tuple
  (segment_start, actual_start, chunk_end, slice); caller `per_chunk_hash`
  suffix uses `segment_start` (pre-overlap boundary, strictly increasing)
  instead of `char_start` (post-overlap, may collapse to prev_min).
- VERSION_LABEL `"pdf-page-v1"` → `"pdf-page-v1.1"` (design §9 cascade,
  explicit user-facing audit trail). The hardcoded literals at
  `crates/kebab-app/tests/pdf_pipeline.rs:168, 368` are also updated to v1.1.
- module docs (`pdf_page_v1.rs:47-60`): update the `#c{char_start}`
  reference in the workaround description to `#c{segment_start}`, state the
  segment_start invariant explicitly, and cross-reference HOTFIXES.md.
- `pdf_page_v1.rs::tests`: `multi_chunk_page_with_aggressive_overlap_produces_unique_chunk_ids`
  regression pin (10 char "가" + ". " + 500 char "나" — multi-chunk +
  overlap walk collapse trigger).
- `tasks/HOTFIXES.md`: 2026-05-27 entry (symptom F2 1580 char OCR,
  intra-doc collision root cause, second-iteration patch rationale).
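
The collision and its fix can be illustrated with a small model. This is a
sketch under assumptions: `plan_segments` and its "back up by `overlap`
chars, never below 0" rule are hypothetical stand-ins for the real overlap
walk in pdf_page_v1.rs, and only the suffix part of the chunk id is shown.

```rust
use std::collections::HashSet;

/// One chunk candidate. `segment_start` is the pre-overlap boundary;
/// `actual_start` is where the slice begins after walking back for overlap.
struct Segment {
    segment_start: usize,
    actual_start: usize,
}

/// Hypothetical overlap walk: back up by `overlap` chars, never below 0.
fn plan_segments(boundaries: &[usize], overlap: usize) -> Vec<Segment> {
    boundaries
        .iter()
        .map(|&segment_start| Segment {
            segment_start,
            actual_start: segment_start.saturating_sub(overlap),
        })
        .collect()
}

/// Old suffix: derived from the post-overlap start, which can collapse.
fn suffix_from_actual(s: &Segment) -> String {
    format!("#c{}", s.actual_start)
}

/// New suffix: derived from the strictly increasing pre-overlap boundary.
fn suffix_from_segment(s: &Segment) -> String {
    format!("#c{}", s.segment_start)
}

fn all_unique(ids: &[String]) -> bool {
    let set: HashSet<&String> = ids.iter().collect();
    set.len() == ids.len()
}
```

For boundaries `[0, 12, 400]` and an overlap of 50, the first two segments
both walk back to 0, so the old suffixes repeat `#c0`, while the
segment_start suffixes stay distinct.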

spec:  docs/superpowers/specs/2026-05-27-v0.20-sub1-bugfix-spec.md (§4)
plan:  docs/superpowers/plans/2026-05-27-v0.20-sub1-bugfix-plan.md (Step 2)
prior: d9acda5 (Step 1 Bug #2 walker fix)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 13:32:09 +00:00
d9acda517a fix(source-fs): apply size limit only to code files; PDF/image/markdown bypass walker cap (Bug #2)
Bug #2 from the v0.20.0 sub-item 1 dogfood report: `[ingest.code].max_file_bytes`
was applied uniformly to every file at the walker stage → most PDF/image/markdown
files (256 KB+) were skipped by the walker before extraction. Fix:

- `crates/kebab-source-fs/src/code_meta.rs`: add the `pub(crate) fn is_code_file(path)
  -> bool` helper (= `code_lang_for_path(path).is_some()`).
- `crates/kebab-source-fs/src/connector.rs:168-190`: the walker size-cap check
  now short-circuits as `is_code_file(&abs_path) && is_oversized(...)`. PDF/image/
  markdown bypass the walker cap; the parsers' own size controls (lopdf
  load_mem, image OCR max_pixels) cover them.
- `crates/kebab-source-fs/src/connector.rs`, added inside the existing mod tests:
  `size_cap_skips_only_code_files`, which verifies the walker result for a
  300 KB PDF + MD + .rs. No regression in the existing sibling tests
  (huge.rs / longfile.rs, fixtures named `.rs`).
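
A minimal sketch of the gating logic described above. The extension table
in `code_lang_for_path`, the `is_oversized` signature, and the
`walker_skips` wrapper are assumptions for illustration; only the
`is_code_file(..) && is_oversized(..)` shape comes from this commit.

```rust
use std::path::Path;

/// Hypothetical extension table; the real function likely covers more languages.
fn code_lang_for_path(path: &Path) -> Option<&'static str> {
    match path.extension()?.to_str()? {
        "rs" => Some("rust"),
        "py" => Some("python"),
        "go" => Some("go"),
        _ => None,
    }
}

fn is_code_file(path: &Path) -> bool {
    code_lang_for_path(path).is_some()
}

/// Assumed signature: compares a file length against the configured cap.
fn is_oversized(byte_len: u64, max_file_bytes: u64) -> bool {
    byte_len > max_file_bytes
}

/// True when the walker should skip the file before extraction.
/// Non-code files never hit the cap because `&&` short-circuits.
fn walker_skips(path: &Path, byte_len: u64, max_file_bytes: u64) -> bool {
    is_code_file(path) && is_oversized(byte_len, max_file_bytes)
}
```

With a 256 KB cap, a 300 KB `.rs` file is skipped while a 300 KB PDF or
Markdown file passes through to its parser.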

spec:  docs/superpowers/specs/2026-05-27-v0.20-sub1-bugfix-spec.md (§3)
plan:  docs/superpowers/plans/2026-05-27-v0.20-sub1-bugfix-plan.md (Step 1)
prior: b4d9e60 (PR #189)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 13:20:38 +00:00
b4d9e60816 chore(release): bump version 0.19.0 → 0.20.0 — v0.20.0 sub-item 1 scanned PDF OCR
# v0.20.0 — scanned PDF OCR via Ollama vision LLM

The core change in v0.20.0 is OCR ingest of scanned PDFs that have no
embedded text (book scans, receipts, camera-captured pages). In the PoC's
five-engine comparison (Tesseract / EasyOCR / PaddleOCR / gemma4:e4b /
qwen2.5vl:3b), qwen2.5vl:3b's alnum accuracy of 94.79% (page1) / 81.56%
(batchim page) beat every other engine, so it is this release's default
vision OCR.

## 1. OCR opt-in usage

Enable with `enabled = true` in the `[pdf.ocr]` config or with the
`KEBAB_PDF_OCR_ENABLED=true` env var. Off by default: OCR costs 45-100s per
page (qwen2.5vl:3b on CPU, remote Ollama), which is unsuitable for KBs other
than book archives that do not need OCR.

```toml
[pdf.ocr]
enabled = true
model = "qwen2.5vl:3b"
# see the README for the other defaults
```

Pull qwen2.5vl:3b via Ollama:

```bash
ollama pull qwen2.5vl:3b   # 3GB Ollama image
```

## 2. Force-reingest of scanned PDFs indexed under v0.19

A KB whose scanned PDFs were ingested with the v0.19 binary does not enter
the OCR path automatically, because parser_version "pdf-text-v1" is
preserved (decision H-4, made to avoid triggering the CLAUDE.md §Versioning
cascade). So if you only upgrade to the v0.20 binary and set
`pdf.ocr.enabled = true`, the Unchanged path in try_skip_unchanged skips
OCR. Reprocess explicitly:

```bash
kebab ingest --root /path/to/kb --force
```

## 3. DCTDecode-only v1 scope (handling of FlateDecode / CCITTFax pages)

PDF page image extraction in v0.20.0 covers only lopdf image XObjects whose
/Filter == DCTDecode (JPEG passthrough). Other encodings (FlateDecode raw
pixels, CCITTFaxDecode bilevel, JPXDecode JPEG2000) emit a warning event and
the page is skipped.

If some pages of a scanned PDF are encoded with FlateDecode or CCITTFax:

```bash
qpdf --object-streams=disable --recompress-flate input.pdf normalized.pdf
```

The intent of v1 is the single-binary principle (no image crate introduced).
Multi-filter support will be considered in v1.1+ or in a separate sub-item.

## 4. Family asymmetry (image OCR gemma4:e4b vs PDF OCR qwen2.5vl:3b)

The image OCR (P6) default stays gemma4:e4b (unchanged). Only PDF OCR (v0.20)
uses qwen2.5vl:3b. Users can unify the two by setting [image.ocr]
model = "qwen2.5vl:3b", but the defaults keep the family asymmetry.

## Dogfood + test results

- workspace test: 178 result lines, 0 failure.
- workspace clippy (-D warnings): exit 0.
- alnum e2e (real Ollama, manual invoke):
  - F1 (Korean page1): 94.79% (≥ 0.85 threshold).
  - F2 (batchim-intensive): 81.56% (≥ 0.70 threshold).
- integration smoke + vector PDF regression: pass.

## Changed surface

- new config: [pdf.ocr] (11 field) + 11 env override KEBAB_PDF_OCR_*.
- new wire: IngestEvent::PdfOcrStarted/Finished (additive minor).
- new wire: IngestItem.pdf_ocr_pages/ms_total (additive minor).
- new CLI line: "📷 OCR page N..." / "✓ OCR page N ({chars} chars, {ms}ms via ollama-vision)".
- new module: kebab-parse-pdf::{page_image, text_quality} + kebab-app::pdf_ocr_apply.
- dep: workspace lopdf = "0.32" 통합.
- fixture: 5 PDF (F1/F2/F4/F6/F7) under crates/kebab-parse-pdf/tests/fixtures/.

## Unchanged surface (invariants)

- Extractor::extract trait body byte-identical (PR #187).
- PdfTextExtractor body unchanged; OCR is split out as a post-extract enrichment pattern.
- parser_version "pdf-text-v1" preserved.
- chunker_version "pdf-page-v1" preserved.
- No change to the production dep graph in workspace.dependencies (-e normal baseline preserved).

## The sub-item's 11-commit history

9d7faab Step 1: foundation + cargo tree baselines
aeeff36 Step 2: lopdf /Filter probe + 5 fixture commit (F1/F2/F4/F6/F7)
fb3952d Step 2 fix: F7 conversion engine record correction
c2cd3a7 Step 3: page_image + text_quality modules (10 test)
8d81bc1 Step 3 fix: clippy pedantic in page_image
9f003ef Step 4: pdf_ocr_apply helper (10 test, F7 split + cancel)
fd918a6 Step 5: [pdf.ocr] config section + PdfOcrOpts doc
4672cba Step 5 fix: clippy::bool_assert_comparison in pdf_ocr tests
b9ee09f Step 6: wire PDF OCR enrichment + cancel propagation
4c5ccd5 Step 7: wire schema additive — IngestEvent + IngestItem + skipped
c9e0594 Step 8: CLI printer activation + ingest_progress test + spec literal
4819768 Step 9: integration smoke + vector regression + alnum e2e
1d4e301 Step 9 follow-up: Cargo.lock for dev-dep additions
90726ab Step 10: docs sync (README + HANDOFF + ARCHITECTURE + SMOKE)

## § Acceptance §9 verifier evidence

All 15 rows of the K5 scriptable verifier are green (manual real-Ollama rows report their results instead):
- Row #4 (vector PDF byte-identical): pass.
- Row #5 (Extractor::extract trait byte-identical): 0 line diff.
- Row #6 (wire schema additive): jq + diff exit 0.
- Row #7-#8 (clippy / workspace test): exit 0.
- Row #9-#10 (dep graph baseline -e normal): empty diff.
- Row #11 (docs sync): grep evidence.
- Row #12 (version bump): "0.20.0" + Cargo.lock cascade ≥ 22.
- Row #14 (PR #187 invariant): extract_for(&asset.media_type) ≥ 1.
- Row #15 (DCTDecode-only v1, F6/F7 skip): test green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 11:03:44 +00:00
90726ab283 docs(v0.20): sync README + HANDOFF + ARCHITECTURE + SMOKE for scanned PDF OCR (post-extract enrichment, qwen2.5vl:3b, DCTDecode-only v1)
Step 10 (Group J) of v0.20.0 sub-item 1 (scanned PDF OCR) plan.

J0 - release notes path decision: commit body (RELEASE_NOTES.md /
docs/RELEASE_NOTES_*.md do not exist; follows the commit-body release notes
format used in v0.17.x/v0.18.0). Inlined in the Step 11 K1 commit body.

J1 — README.md:
- Add `[pdf.ocr]` to the toml table list in the Configuration section.
- New sub-section `### [pdf.ocr] — scanned PDF OCR (v0.20.0+)`: 11-field
  toml example + 11 `KEBAB_PDF_OCR_*` env overrides + force-reingest UX
  ("scanned PDFs indexed under v0.19 do not get OCR automatically after the
  v0.20 upgrade; `kebab ingest --force` is required").

J2 — HANDOFF.md:
- Extend the phase status P7 row: 3/3 component + post-extract OCR enrichment
  (v0.20.0 sub-item 1, qwen2.5vl:3b vision LLM).
- "Decisions discovered after merge" ("머지 후 발견된 결정") entry: the design + scope
  of v0.20 sub-item 1 (H-1 post-extract pattern + DCTDecode-only v1 +
  parser_version preserved + H-4 UX).

J3 — docs/ARCHITECTURE.md:
- Split the OCR row: `OCR (image)` (gemma4:e4b, unchanged) + `OCR (PDF, v0.20.0+)`
  (qwen2.5vl:3b, post-extract enrichment via kebab-app::pdf_ocr_apply,
  DCTDecode-only v1, family asymmetry: PoC alnum 94.79% vs gemma4 27%).
- Extend the PDF parser row: page_image::extract_dctdecode_page_image (v0.20.0) +
  parser_version "pdf-text-v1" preserved + differentiated provenance events.

J3 — docs/SMOKE.md:
- Isolated `[pdf.ocr]` config example (enabled=true, model=qwen2.5vl:3b).
- New dogfood section `### v0.20 force-reingest (scanned PDF OCR)`:
  explicit `kebab ingest --force` invocation on the v0.19 → v0.20 upgrade path.

J4 - release notes draft (source for the Step 11 K1 commit body):
- Recorded in the result file (4 topics: opt-in + force-reingest + DCTDecode-only +
  family asymmetry, plus dogfood/test results).

spec:  docs/superpowers/specs/2026-05-27-pdf-scanned-ocr-spec.md (§6.4)
plan:  docs/superpowers/plans/2026-05-27-pdf-scanned-ocr-plan.md (Step 10 J0-J4)
prior: 1d4e301 (Step 9 + Cargo.lock follow-up)
contract: §9

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 10:34:24 +00:00
1d4e301e5e chore(deps): Cargo.lock for Step 9 dev-dep additions (strsim + kebab-parse-image)
Cargo.lock cascade for the dev-deps added in Step 9 (commit 4819768):
strsim 0.11 + the kebab-parse-image path dep. The worker did not include the
lockfile in that commit, so this follow-up commit syncs it.

No impact on the dep graph baseline (-e normal); only dev-deps were added.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 10:12:38 +00:00
48197687b7 test(pdf): integration smoke (w/ search + cancel) + vector regression + alnum e2e (#[ignore]) for v0.20 sub-item 1
Step 9 (Group I) of v0.20.0 sub-item 1 (scanned PDF OCR) plan.

I3 - crates/kebab-app/tests/ingest_pdf_ocr_smoke.rs (new):
- ingest_with_mock_ocr_yields_pdf_ocr_summary: `#[ignore]`, real Ollama;
  verifies the ingest_with_config production path + IngestItem.pdf_ocr_pages.
- ocr_text_indexed_and_searchable: `#[ignore]`, real Ollama; verifies via
  app.search that the OCR text is indexed (§ Acceptance #2).
- ingest_with_cancel_aborts_mid_pdf: production cancel chain (pre-set
  cancel=true + dummy endpoint; verifies no panic/deadlock).

I4 - crates/kebab-parse-pdf/tests/text_extractor_regression.rs (new):
- vector_pdf_extract_byte_identical_to_baseline: the canonical output of the
  vector PDF path for F4 mojibake.pdf stays byte-identical (invariant before
  and after all Step 1-8 changes).
- New baseline: tests/snapshots/vector_pdf_canonical.json (created on first run).
- normalize_provenance_timestamps inline helper (R-3 mitigation; no such helper
  existed anywhere in the workspace, so a new 12-line one was added).

I5 - crates/kebab-parse-pdf/tests/ocr_e2e.rs (new):
- f1_alnum_accuracy_ge_85 / f2_alnum_accuracy_ge_70: `#[ignore]`, real
  Ollama qwen2.5vl:3b; implements § Acceptance §9 #3.
- alnum metric = strsim::levenshtein (dev-dep added).
- truth file copy from PoC scratch (page1.txt + page2-batchim.txt) →
  scanned_page1_truth.txt + scanned_page2_truth.txt.
- Add kebab-parse-image as a dev-dep (to call OllamaVisionOcr::from_parts).
  This is a dev-dep exception to the parser isolation invariant (spec §3.1;
  the -e normal dep graph baseline is preserved).

spec:  docs/superpowers/specs/2026-05-27-pdf-scanned-ocr-spec.md
plan:  docs/superpowers/plans/2026-05-27-pdf-scanned-ocr-plan.md (Step 9 I3+I4+I5)
prior: c9e0594 (Step 8 CLI printer)
contract: §9

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-27 10:10:58 +00:00
c9e05941c5 feat(cli): activate per-page PDF OCR progress printer + test(app): ingest_progress emit verify + spec(pdf-ocr): align §4.6.1 literal with option_A (ms/chars)
Step 8 (Group H) of v0.20.0 sub-item 1 (scanned PDF OCR) plan +
Step 7 reviewer concern fix (spec literal deviation).

H1 — kebab-cli/src/progress.rs printer activation:
- Replace the old no-op stub `IngestEvent::PdfOcr* { .. } => {}` (Step 6
  placeholder) with a human-friendly stderr line printer.
- Wording taken verbatim from spec §4.6.1 lines 1085-1086:
  - PdfOcrStarted → `  📷 OCR page {page}...`
  - PdfOcrFinished (skipped=false) → `  ✓ OCR page {page} ({chars} chars, {ms}ms via {ocr_engine})`
  - PdfOcrFinished (skipped=true)  → `  ⊘ OCR page {page} skipped (no DCTDecode or engine fail, {ms}ms)` (uses the skipped field carried per M-4)
- Consistent with the `!quiet` gate (mirrors the AssetStarted/Finished pattern).

H2 - new test in crates/kebab-app/tests/ingest_progress.rs:
- pdf_ocr_progress_emits_started_finished_events (depends on real Ollama, `#[ignore]`).
- Verifies that pdf_ocr_started + pdf_ocr_finished events are emitted when
  ingesting the F1 fixture (scanned_page1.pdf). Invariant: Started count ==
  Finished count.
- Manual invoke: `KEBAB_PDF_OCR_ENABLED=true cargo test -p kebab-app --test
  ingest_progress -- --ignored`.
- No injection path exists for a mock OcrEngine (Step 6 builds the engine
  eagerly), so this follows the same pattern as the Step 9 I5 ocr_e2e tests
  (real Ollama + `#[ignore]`).

Step 7 reviewer concern fix — spec §4.6.1 literal:
- Update the `ocr_ms` / `ocr_chars` literals at lines 1076-1077 to the actual
  wire schema field names `ms` / `chars` (option_A, consistent with Rust serde).
- Printer wording at line 1087 likewise: `{ocr_chars}` / `{ocr_ms}` → `{chars}` / `{ms}`.
- Rationale reference at line 1556: `pdf_ocr_finished.ocr_ms` → `.ms`.
- Also document the `skipped` field (outcome of Step 6 reviewer M-4).

spec:  docs/superpowers/specs/2026-05-27-pdf-scanned-ocr-spec.md (§4.6.1)
plan:  docs/superpowers/plans/2026-05-27-pdf-scanned-ocr-plan.md (Step 8 H1+H2)
prior: 4c5ccd5 (Step 7 wire schema); fixes Step 7 reviewer concern 1
contract: §9 (additive minor wire bump, completed in the Step 7 commit)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-27 09:18:49 +00:00
4c5ccd5447 feat(wire): additive minor: pdf_ocr_* IngestEvent kinds + pdf_ocr_pages/ms_total in ingest_report.items[] + skipped field carry (Step 6 M-4/M-2)
Step 7 (Group G) of v0.20.0 sub-item 1 (scanned PDF OCR) plan +
Step 6 code reviewer Important M-4 (skipped field carry) + Minor M-2
(ordering invariant doc) fix.

G3 - JSON Schema sync (additive minor, schema_version preserved):

ingest_progress.schema.json:
- 2 additions to the kind enum: pdf_ocr_started + pdf_ocr_finished.
- New fields: page (1-based PDF page), ocr_engine (engine_name), skipped (bool).
- Update the descriptions of the existing ms / chars fields (now also carried by pdf_ocr_finished).

ingest_report.schema.json:
- items.items.properties newly defined (previously only a ["array", "null"] stub).
- pdf_ocr_pages + pdf_ocr_ms_total (nullable integer).
- All existing IngestItem fields are now spelled out as well (kind, doc_path, byte_len, ...).

Step 6 reviewer M-4 (Important) - skipped field carry:
- Add skipped: bool to IngestEvent::PdfOcrFinished.
- The emit closure in ingest_one_pdf_asset (lib.rs:~1864) now propagates
  skipped from the source PdfOcrProgress::Finished { skipped } instead of
  discarding it.

Step 6 reviewer M-2 (Minor) — ordering invariant doc:
- Update the ordering text in crates/kebab-app/src/ingest_progress.rs:
  ScanStarted < ScanCompleted < (AssetStarted [< (PdfOcrStarted <
  PdfOcrFinished)*] < AssetFinished)* < (Completed | Aborted).
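
The ordering can be read as a small grammar and checked with a state
machine. The following is an illustrative sketch only: the `Ev` and `State`
enums and `follows_ordering` are hypothetical names (the real IngestEvent
variants carry fields), and it treats the invariant as a strict sequence,
which may be stricter than what the production stream guarantees.

```rust
#[derive(Clone, Copy, PartialEq)]
enum Ev {
    ScanStarted,
    ScanCompleted,
    AssetStarted,
    PdfOcrStarted,
    PdfOcrFinished,
    AssetFinished,
    Completed,
    Aborted,
}

#[derive(Clone, Copy, PartialEq)]
enum State {
    Init,
    Scanning,
    BetweenAssets,
    InAsset,
    InOcrPage,
    Done,
}

/// Returns true when `events` matches
/// ScanStarted ScanCompleted (AssetStarted (PdfOcrStarted PdfOcrFinished)*
/// AssetFinished)* (Completed | Aborted).
fn follows_ordering(events: &[Ev]) -> bool {
    let mut state = State::Init;
    for &ev in events {
        state = match (state, ev) {
            (State::Init, Ev::ScanStarted) => State::Scanning,
            (State::Scanning, Ev::ScanCompleted) => State::BetweenAssets,
            (State::BetweenAssets, Ev::AssetStarted) => State::InAsset,
            (State::InAsset, Ev::PdfOcrStarted) => State::InOcrPage,
            (State::InOcrPage, Ev::PdfOcrFinished) => State::InAsset,
            (State::InAsset, Ev::AssetFinished) => State::BetweenAssets,
            (State::BetweenAssets, Ev::Completed | Ev::Aborted) => State::Done,
            // Any other transition violates the ordering.
            _ => return false,
        };
    }
    state == State::Done
}
```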

No .md docs exist (docs/wire-schema/v1/*.md), so the .md deliverable of
plan §3 Step 7 G3 is retroactively N/A (0 such files).

spec:  docs/superpowers/specs/2026-05-27-pdf-scanned-ocr-spec.md
plan:  docs/superpowers/plans/2026-05-27-pdf-scanned-ocr-plan.md (Step 7 G3)
prior: b9ee09f (Step 6 wiring) + Step 6 reviewer M-4/M-2 recommendations
contract: §9 (additive minor wire bump, schema_version preserved)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-27 08:51:51 +00:00
b9ee09f176 feat(app): wire PDF OCR enrichment + cancel propagation into ingest_one_pdf_asset (H-5 eager init + post-extract hook + per-page cancel) + workspace lopdf dep (Step 4 M-4)
Step 6 (Group E) of v0.20.0 sub-item 1 (scanned PDF OCR) plan +
Step 7 spillover (IngestEvent variant + IngestItem field for compile
boundary) + Step 4 reviewer Minor M-4 fix.

E1 - eager PDF OCR engine build at `ingest_with_config_opts` entry,
mirror of image OCR pattern (lib.rs:338-347). When `pdf.ocr.enabled ||
always_on`, call `OllamaVisionOcr::from_parts(endpoint, model, ...)` and
fail fast with `?`. No App field added (local var only, consistent with
spec L-1 / the Step 1 A1 cosmetic fix).

E2 — `ingest_one_pdf_asset` signature extension: +3 param
(`pdf_ocr_engine: Option<&OllamaVisionOcr>`, `progress: Option<&
mpsc::Sender<IngestEvent>>`, `cancel: Option<&Arc<AtomicBool>>`).
`ingest_one_asset` dispatch wrapper + caller (dispatch loop) update.

E3 - post-extract enrichment block right after `extract_for` (line 1779).
When `pdf.ocr.enabled || always_on`, call `apply_ocr_to_pdf_pages`, map
PdfOcrProgress → IngestEvent emits (PdfOcrStarted / PdfOcrFinished
with ocr_engine), and carry the summary's pages_ocrd/ms_total into the
IngestItem fields. The PR #187 registry dispatch invariant is preserved
(`extract_for(&asset.media_type, ...)` unchanged).

E4 — cancel handle propagation: ingest_with_config_cancellable →
IngestOpts.cancel → ingest_with_config_opts → ingest_one_asset →
ingest_one_pdf_asset (new `cancel` param) → PdfOcrOpts.cancel chain.
spec §4.8 line 1159 production wiring.

Step 7 spillover (compile boundary):
- `kebab_app::ingest_progress::IngestEvent`: PdfOcrStarted { page } +
  PdfOcrFinished { page, ms, chars, ocr_engine }. serde discriminant
  `pdf_ocr_started` / `pdf_ocr_finished` (Step 7 G3 wire schema 와 일치).
- `kebab_core::IngestItem`: pdf_ocr_pages: Option<u32> +
  pdf_ocr_ms_total: Option<u64> (between warnings and error). The 11
  non-PDF IngestItem construction sites fill in `None`.
- `kebab-cli/src/progress.rs` + `kebab-tui/src/ingest_progress.rs`:
  no-op handlers for the new variants (v1 does not expose per-page
  progress; can be activated in a future refinement).
- `kebab-store-sqlite/tests/ingest_report_snapshot.rs` + snapshot
  `ingest_report.snapshot.json`: add the new fields to the 2 IngestItem fixtures.
- The Step 7 JSON Schema update + CLI printer activation + snapshot
  regeneration go in separate commits (G3/H1/H2 deliverables).

M-4 (Step 4 reviewer Minor) - consolidate lopdf as a workspace dep:
- workspace `Cargo.toml [workspace.dependencies] lopdf = "0.32"`.
- direct deps in kebab-app + kebab-parse-pdf → `{ workspace = true }`.

Verifier evidence:
- workspace test (`cargo test --workspace --no-fail-fast -j 1`):
  175 test result summary lines, 0 failures, 0 FAILED.
- workspace clippy (`-D warnings`): exit 0, 0 warning.
- dep graph baseline (`.omc/state/pdf-ocr-{parse-pdf,app-parse}-deps.baseline.txt`):
  empty diff for both.

spec:  docs/superpowers/specs/2026-05-27-pdf-scanned-ocr-spec.md (§4.4 + §4.6 + §4.8)
plan:  docs/superpowers/plans/2026-05-27-pdf-scanned-ocr-plan.md (Step 6 E1-E4 + Step 7 partial G1+G2)
prior: 4672cba (Step 5 fix) + fd918a6 (Step 5) + 9f003ef (Step 4 helper)
contract: §9 (additive minor wire bump, once the Step 7 JSON Schema is complete)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 08:18:34 +00:00
4672cba6c6 fix(config): satisfy clippy::bool_assert_comparison in pdf_ocr tests
Four lines of `assert_eq!(bool_field, true|false)` in the F2 test file from
fd918a6 (crates/kebab-config/tests/pdf_ocr.rs) violate the workspace clippy
pedantic lint `bool_assert_comparison` → CI gate
`cargo clippy --workspace --all-targets -- -D warnings` exits 1.

Apply the canonical form to each assertion:
- assert_eq!(x, false) → assert!(!x)
- assert_eq!(x, true)  → assert!(x)

Semantics and behavior are identical; 4-line edit, no logic change.

review trail:
- impl result: .omc/reviews/2026-05-27-pdf-ocr-step-05-impl-result.md
- spec review: .omc/reviews/2026-05-27-pdf-ocr-step-05-spec-review-result.md (I-1)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 07:17:46 +00:00
fd918a60ce feat(config): add [pdf.ocr] section — qwen2.5vl:3b default, opt-in + env overrides + doc(app): PdfOcrOpts field doc (Step 4 I-1)
Step 5 (Group F) of v0.20.0 sub-item 1 (scanned PDF OCR) plan, bundled with
the Step 4 reviewer Important I-1 fix (PdfOcrOpts field doc).

F1 — `kebab-config::PdfCfg` + `PdfOcrCfg` + 4 default fn:
- PdfCfg { ocr: PdfOcrCfg }.
- PdfOcrCfg with 11 field (enabled/always_on/engine/model/endpoint/
  languages/max_pixels/request_timeout_secs/valid_ratio_threshold/
  min_char_count/lang_hint).
- defaults: opt-in (enabled=false), qwen2.5vl:3b, 0.5 threshold, 20 char.
- mirror of image OCR cfg pattern (spec §4.5).

Config struct extension:
- `pdf: PdfCfg` field with `#[serde(default = "PdfCfg::defaults")]`.

11 env var override (parallel to KEBAB_IMAGE_OCR_*):
KEBAB_PDF_OCR_{ENABLED,ALWAYS_ON,ENGINE,MODEL,ENDPOINT,LANGUAGES,
MAX_PIXELS,REQUEST_TIMEOUT_SECS,VALID_RATIO_THRESHOLD,MIN_CHAR_COUNT,
LANG_HINT}.

F2 - `crates/kebab-config/tests/pdf_ocr.rs` (new):
- toml roundtrip (11 field).
- defaults (opt-in + qwen2.5vl:3b).
- env override (4 key sample + default preservation).

F3 (Step 4 I-1) - doc comments for 4 public items in `pdf_ocr_apply.rs`:
- PdfOcrOpts struct + 6 field.
- PdfOcrSummary struct + 2 field.
- apply_ocr_to_pdf_pages fn (including an Errors block).
- PdfOcrProgress enum + 2 variant + 5 field.
No body changes; doc-only.

spec:  docs/superpowers/specs/2026-05-27-pdf-scanned-ocr-spec.md (§4.5)
plan:  docs/superpowers/plans/2026-05-27-pdf-scanned-ocr-plan.md (Step 5 F1+F2)
prior: 9f003ef (Step 4) — code reviewer Important I-1 resolution
contract: §9 (additive minor wire bump — Step 7)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 07:07:18 +00:00
9f003ef1cd feat(app): add pdf_ocr_apply helper (10 test, F7 split + cancel) — post-extract OCR enrichment for PDF (H-1 resolution)
Step 4 (Group D) of v0.20.0 sub-item 1 (scanned PDF OCR) plan.

D1 — `apply_ocr_to_pdf_pages(&mut canonical, &dyn OcrEngine, &bytes, &opts, emit_progress)`
in `kebab-app::pdf_ocr_apply`. Body taken as-is from spec §4.1 line 381-599, plus
the PdfOcrOpts.cancel field and a per-page cancel check (verifier LOW L-1).

post-extract enrichment pattern (H-1 resolution): kebab-parse-pdf does not
import kebab-parse-image::OcrEngine (parser isolation preserved). The helper
lives inside the kebab-app facade, avoiding cross-imports between the two
parser crates.

Per-page decision matrix (spec §4.1 line 459-464):
- always_on=true → OCR every page (dual-block, ordinal = page-1 + page_count).
- always_on=false + needs_ocr → in-place OCR (text-detect block mutate).
- needs_ocr=false → skip.
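
The matrix above, written as a pure function. This is a sketch, not the
helper's actual code: the `PageDecision` enum and `decide` are hypothetical
names, and `needs_ocr` is taken as an input because how the threshold and
minimum character count combine is not shown in this commit.

```rust
#[derive(Debug, PartialEq)]
enum PageDecision {
    /// OCR the page and push a second block beside the text-detect block.
    DualBlock { ordinal: usize },
    /// OCR the page and overwrite the text-detect block in place.
    InPlace,
    /// Leave the page untouched.
    Skip,
}

/// `page` is 1-based, matching the `page-1 + page_count` ordinal formula.
fn decide(always_on: bool, needs_ocr: bool, page: usize, page_count: usize) -> PageDecision {
    if always_on {
        PageDecision::DualBlock { ordinal: page - 1 + page_count }
    } else if needs_ocr {
        PageDecision::InPlace
    } else {
        PageDecision::Skip
    }
}
```

Offsetting the OCR ordinal by `page_count` keeps it disjoint from the
text-detect ordinals `0..page_count`, which is what makes dual-block
ordinals unique.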

DCTDecode-only v1 (H-3): for FlateDecode / CCITTFaxDecode pages,
extract_dctdecode_page_image returns None → Warning event + skip + emit_progress(skipped=true).

OcrEngine.recognize failure → Warning event + skip + emit_progress(skipped=true).

D3 — per-page cancel handle (verifier LOW L-1 + spec §4.8 line 1159):
PdfOcrOpts.cancel: Option<Arc<AtomicBool>>. When the flag is set to true,
`anyhow::bail!("PDF OCR cancelled mid-PDF at page N")`.
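
A minimal sketch of the per-page cancel check. Assumptions: the
`ocr_pages_cancellable` function and its `ocr_page` callback are
hypothetical, and a plain `String` error replaces `anyhow` so the example
needs only the standard library.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

/// Walks pages 1..=page_count and checks the cancel flag before each page.
/// Returns the number of pages processed, or an error naming the page at
/// which cancellation was observed.
fn ocr_pages_cancellable(
    page_count: u32,
    cancel: Option<&Arc<AtomicBool>>,
    mut ocr_page: impl FnMut(u32),
) -> Result<u32, String> {
    let mut done = 0;
    for page in 1..=page_count {
        // A missing handle means the caller never cancels.
        if cancel.is_some_and(|flag| flag.load(Ordering::Relaxed)) {
            return Err(format!("PDF OCR cancelled mid-PDF at page {page}"));
        }
        ocr_page(page);
        done += 1;
    }
    Ok(done)
}
```

Checking before each page bounds the cancel latency to one page of OCR
work, which matters when a single page can take tens of seconds.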

lopdf = "0.32" added to [dependencies] (already transitive via kebab-parse-pdf;
no new crate introduced — dep graph kebab-parse-* baseline unchanged).

Integration test (`tests/pdf_ocr_apply.rs`, 10 test):
- f1_input_with_ocr_enabled_replaces_empty_block — in-place mutate.
- f3_input_with_ocr_enabled_keeps_text_detect_blocks — vector PDF skip.
- f1_input_with_ocr_disabled_keeps_empty_block — disabled no-op.
- f4_input_with_ocr_enabled_replaces_mojibake_block — mojibake → in-place mutate.
- f3_input_with_always_on_pushes_dual_blocks — always_on dual-block.
- f6_flatedecode_skipped_with_warning — FlateDecode skip + Warning event.
- f7_ccittfax_skipped_with_warning — CCITTFax skip + Warning event (verifier M-4 split).
- ocr_engine_failure_surfaces_as_warning — OCR failure → Warning event.
- dual_block_ordinals_are_deterministic_and_unique — ordinal invariant.
- cancel_handle_aborts_mid_pdf: production source of the cancel handle (D3).

MockOcrEngine fixture: spec §5.5 line 1284-1299. No F3 fixture exists →
mock CanonicalDocument construction + F1 bytes reuse pattern (Option B:
generate the canonical document through the real production path via
PdfTextExtractor::extract).

spec:  docs/superpowers/specs/2026-05-27-pdf-scanned-ocr-spec.md (§4.1 + §5.5)
plan:  docs/superpowers/plans/2026-05-27-pdf-scanned-ocr-plan.md (Step 4 D1+D2+D3)
prior: c2cd3a7 (Step 3) + 8d81bc1 (Step 3 clippy fix)
contract: §9 (additive minor wire bump, in a later step)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-27 06:42:01 +00:00
8d81bc1071 style(parse-pdf): satisfy clippy pedantic in page_image (uninlined_format_args + map_unwrap_or)
Two workspace clippy pedantic violations remained in `extract_dctdecode_page_image`
from c2cd3a7 → CI gate (cargo clippy --workspace --all-targets -- -D warnings) failed.
Both lints are 1-line edits with identical semantics; no logic change.

- L20  uninlined_format_args: format!("page {} not in get_pages()", page_num)
                              → format!("page {page_num} not in get_pages()")
- L48-52 map_unwrap_or:        .map(|n| n == b"Image").unwrap_or(false)
                              → .is_some_and(|n| n == b"Image")
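
The map_unwrap_or rewrite is behavior-preserving, which a side-by-side
sketch makes easy to check. The function names and the "/Subtype" framing
are assumptions for illustration; only the two expressions come from this
commit.

```rust
/// Old form: flagged by clippy's `map_unwrap_or` lint.
fn is_image_subtype_old(subtype: Option<&[u8]>) -> bool {
    subtype.map(|n| n == b"Image").unwrap_or(false)
}

/// New form: `Option::is_some_and` returns false for `None` and otherwise
/// the predicate's result, so it matches the old form on every input.
fn is_image_subtype_new(subtype: Option<&[u8]>) -> bool {
    subtype.is_some_and(|n| n == b"Image")
}
```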

cargo clippy --workspace --all-targets -j 4 -- -D warnings → exit 0.
cargo test -p kebab-parse-pdf -j 4 → 21 passed (regression 0).

review trail:
- spec review: .omc/reviews/2026-05-27-pdf-ocr-step-03-spec-review-result.md (SPEC_COMPLIANT)
- code review: .omc/reviews/2026-05-27-pdf-ocr-step-03-code-review-result.md (Critical C-1)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 06:14:00 +00:00
c2cd3a7ab7 feat(parse-pdf): add page_image (DCTDecode passthrough, 2 test) + text_quality (valid char ratio, 8 unit test) modules
Step 3 (Group C) of v0.20.0 sub-item 1 (scanned PDF OCR) plan.

C1 - `page_image::extract_dctdecode_page_image(pdf_doc, page_num)` ->
Result<Option<Vec<u8>>>. Traverses lopdf Resources/XObject, inspects the
/Filter of the first image XObject (covers both the single Name and the
Array form, spec §4.1 line 642-664), and returns the raw bytes when the
DCTDecode + JPEG magic checks pass. Returns Ok(None) for any other
encoding or when no image XObject exists. v1 scope = DCTDecode
passthrough only (H-3 invariant, no image crate introduced).
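
The filter and magic checks can be sketched without lopdf. Assumptions:
the `Filter` enum is a simplified stand-in for a PDF /Filter entry, the
function names are hypothetical, and treating only a one-element array as
"pure" DCTDecode is this sketch's reading of the v1 scope.

```rust
/// Simplified model of a PDF /Filter entry: a single name or an array of names.
enum Filter<'a> {
    Name(&'a str),
    Array(Vec<&'a str>),
}

/// True when the filter chain is exactly DCTDecode, in either form.
fn is_pure_dctdecode(filter: &Filter) -> bool {
    match filter {
        Filter::Name(n) => *n == "DCTDecode",
        Filter::Array(names) => names.len() == 1 && names[0] == "DCTDecode",
    }
}

/// JPEG streams begin with the SOI marker FF D8 followed by another FF byte.
fn has_jpeg_magic(bytes: &[u8]) -> bool {
    bytes.starts_with(&[0xFF, 0xD8, 0xFF])
}

/// Returns the stream unchanged when it can be handed to OCR as a JPEG,
/// and None for every other encoding.
fn passthrough_jpeg<'b>(filter: &Filter, stream: &'b [u8]) -> Option<&'b [u8]> {
    if is_pure_dctdecode(filter) && has_jpeg_magic(stream) {
        Some(stream)
    } else {
        None
    }
}
```

A chained filter such as `[ /ASCII85Decode /DCTDecode ]` is rejected here
because the raw stream bytes are not yet a JPEG until the outer filter is
decoded.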

Integration test (`tests/page_image.rs`, 2 test):
- f1_fixture_yields_dctdecode_jpeg_bytes — F1 fixture happy path.
- flate_raw_fixture_yields_none — F6 fixture negative path.

C2 - `text_quality::compute_valid_char_ratio(s) -> f32`. valid char =
ASCII printable + Hangul (Jamo/Compatibility/Syllables) + CJK + Latin
Extended + common Korean punctuation. Empty string → 0.0. The caller
(`kebab-app::pdf_ocr_apply`) compares the result against the threshold
(default 0.5).

Unit test (`mod tests`, 7 + F4 conditional):
- empty / pure ASCII / pure Hangul / pure PUA / mixed half / CJK / Hangul Jamo.
- f4_fixture_ratio_under_threshold: active (case A: lopdf extract_text returns
  an empty string when the ToUnicode CMap is absent → valid_ratio = 0.0000 < 0.3).

Also: update the Cargo.toml description ("Text PDF extractor + scanned-page
image extract helpers ...", deferred from Step 1 A2).

fixture fix: mojibake.pdf startxref 22130 → 22114 (corrects a 16-byte offset
error that prevented the lopdf strict parser from finding the xref).

spec:  docs/superpowers/specs/2026-05-27-pdf-scanned-ocr-spec.md (§4.1 line 600-722)
plan:  docs/superpowers/plans/2026-05-27-pdf-scanned-ocr-plan.md (Step 3 C1+C2)
prior: aeeff36 (Step 2 fixtures) + fb3952d (Step 2 F7 record fix)
contract: §9 (additive minor wire bump, in a later step)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 05:59:10 +00:00
fb3952d54f docs(pdf-ocr): correct F7 conversion engine record in PoC doc (gs, not ImageMagick)
The PoC doc append in aeeff36 (engine-comparison.md L134, L141) recorded the
conversion engine for F7 (`ccitt.pdf`) as "ImageMagick `convert -compress Group4`",
but the actual tests/fixtures/_synth/flate_ccittfax.sh:77-83 uses the
`gs -sDEVICE=pdfwrite -dMonoImageFilter=/CCITTFaxEncode -dEncodeMonoImages=true`
flags (zero ImageMagick `convert` calls).

The fixture binary (`/Filter [ /CCITTFaxDecode ]`, 2060 bytes) satisfies the
invariant (verified by the Step 2 spec compliance + code quality reviews).
This is a factual correction of the historical record only.

review trail:
- impl result: .omc/reviews/2026-05-27-pdf-ocr-step-02-impl-result.md
- spec review: .omc/reviews/2026-05-27-pdf-ocr-step-02-spec-review-result.md
- code review: .omc/reviews/2026-05-27-pdf-ocr-step-02-code-review-result.md (I1)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 05:36:56 +00:00
aeeff3635b poc+test(pdf-ocr): lopdf /Filter probe + 5 fixture commit (F1/F2/F4/F6/F7) for v0.20 sub-item 1
Step 2 (Group B) of v0.20.0 sub-item 1 (scanned PDF OCR) plan.

B1 — lopdf /Filter probe (Python re + shell grep on synthesized fixtures,
result appended to docs/superpowers/poc/2026-05-27-pdf-ocr-engine-comparison.md).

Key findings:
- reportlab default (useA85=1) yields /Filter [ /ASCII85Decode /DCTDecode ];
  useA85=0 gives pure /Filter [ /DCTDecode ] with JPEG magic ffd8ffe0.
- Pillow RGB.save('.pdf','PDF') uses DCTDecode — F6 FlateDecode requires
  manual PDF construction via zlib.compress.
- ghostscript pdfwrite rejects TIFF input (/undefined in II*) —
  ImageMagick `convert -compress Group4` used for F7 CCITTFax.

B2: synthesized and committed 5 fixtures under crates/kebab-parse-pdf/tests/fixtures/:
- F1 scanned_page1.pdf — /Filter [ /DCTDecode ], JPEG magic ffd8ffe0 (page1-clean.png, Korean).
- F2 scanned_page2.pdf — /Filter [ /DCTDecode ], JPEG magic ffd8ffe0 (page2-clean.png, Korean final consonants).
- F4 mojibake.pdf       — DejaVu TTF + ToUnicode CMap stripped (count=0);
                          Noto CJK TTC has PostScript outlines unsupported by reportlab.
- F6 flate_raw.pdf      — /Filter /FlateDecode, DCTDecode absent (skip path input).
- F7 ccitt.pdf          — /Filter [ /CCITTFaxDecode ], DCTDecode absent (skip path input).

Synth scripts under tests/fixtures/_synth/:
- scanned_pdf.py    — F1/F2 reportlab drawImage + JPEG passthrough (useA85=0).
- mojibake.py       — F4 reportlab DejaVu TTF + ToUnicode strip.
- flate_ccittfax.sh — F6 manual zlib PDF + F7 Pillow TIFF group4 + ImageMagick convert.

spec:  docs/superpowers/specs/2026-05-27-pdf-scanned-ocr-spec.md (§5.1)
plan:  docs/superpowers/plans/2026-05-27-pdf-scanned-ocr-plan.md (Step 2 B1+B2)
contract: §9 (additive minor wire bump, in a later step)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 04:04:47 +00:00
9d7faab650 docs+chore(plan-bootstrap): apply spec L-1 cosmetic fix + capture cargo tree baselines for v0.20 sub-item 1 verifier gates
Step 1 (Group A) of v0.20.0 sub-item 1 (scanned PDF OCR) implementation plan.

A1: spec §4.2 line 740 prose pseudo-code fix: `app.pdf_ocr_engine.as_ref()`
→ local `pdf_ocr_engine: Option<OllamaVisionOcr>` built in
`ingest_with_config_opts` (consistent with the §4.4 eager init; no App field introduced).

A2: Cargo.toml dep invariant verified (image crate not introduced, preserving the
H-3 DCTDecode-only v1 invariant; kebab-parse-pdf and kebab-parse-image are already
deps of kebab-app). The description update is deferred to Step 3 (after the module is added).

A3: cargo tree baseline captured as ground truth for K5 rows #9/#10
(.omc/state/pdf-ocr-{app-parse,parse-pdf}-deps.baseline.txt). It is the verifier
baseline for the invariant that the other steps of this sub-item leave the dep graph unchanged.
Note: .omc/ is gitignored, so the baseline files exist only locally.

spec:  docs/superpowers/specs/2026-05-27-pdf-scanned-ocr-spec.md
plan:  docs/superpowers/plans/2026-05-27-pdf-scanned-ocr-plan.md (round 1c ACCEPT)
contract: §9 (additive minor wire bump, in a later step)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 03:40:49 +00:00
bcd1e37dab chore(repo): .omc/ ignore + AGENTS·GEMINI symlinks + stronger release notes writing guide
- .gitignore: add .omc/ (OMC state directory), on par with .claude/ and .superpowers/
- AGENTS.md / GEMINI.md: symlinks to CLAUDE.md so that Codex / Gemini CLI follow the same instructions
- CLAUDE.md release procedure: strengthen the guide so release notes give a user-friendly explanation plus dogfooding/test results instead of merely listing commit subjects

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-26 23:58:04 +00:00
e7a4330798 Merge pull request 'docs: v0.20 image+pdf handoff + sub-item 3 spec/plan backfill' (#188) from docs/v0-20-image-pdf-handoff into main
Reviewed-on: #188
2026-05-26 23:36:49 +00:00
574e1b1ca1 docs: v0.20 image+pdf handoff + sub-item 3 spec/plan backfill
Handoff document for the next session after the v0.19.0 release, plus an after-the-fact backfill.

- docs/superpowers/handoffs/2026-05-26-v0.20-image-pdf-normalize-handoff.md (540 lines, 9 sections)
  - sub-item 1/2/3 merge results + dogfooding baseline (1781 docs / 9050 chunks) + user memory + OMC workflow + build environment
  - current implementation state (v0.19.0, image+pdf): exact file:line + struct/fn signatures + flow
  - 8 TODOs in detail (problem + scope + affected files + risk + trigger conditions)
  - priorities + recommended sequencing + suggested first steps for a new session

- docs/superpowers/specs/2026-05-26-extractor-dispatch-unification-spec.md (sub-item 3 spec)
- docs/superpowers/plans/2026-05-26-extractor-dispatch-unification-plan.md (sub-item 3 plan)

When PR #187 was merged, only the source code went in and the spec/plan were left out, so the reference links in that PR return 404 on main. This commit backfills them.

Assisted-by: Claude Code
2026-05-26 23:34:17 +00:00
377 changed files with 41275 additions and 4628 deletions

1
.gitignore vendored

@@ -1,6 +1,7 @@
.superpowers/
.worktrees/
.claude/
.omc/
/target
**/*.rs.bk
Cargo.lock.bak

1
AGENTS.md Symbolic link

@@ -0,0 +1 @@
CLAUDE.md

View File

@@ -81,11 +81,68 @@ Bump 자체는 단순 minor / patch 한 줄 수정 (`Cargo.toml` workspace `vers
Release procedure:
1. Tag, push, and publish release notes with `gitea-release v<X.Y.Z>` (gitea-ops skill).
2. Release notes focus on surface changes that affect user dogfooding: wire schema additions, new CLI flags, TUI key changes, V00X migrations, and so on.
2. Release notes cover the surface changes that affect user dogfooding: wire schema additions, new CLI flags, TUI key changes, V00X migrations, and so on. Explain the added features and changes in a friendly, detailed way that users can understand; do not stop at listing commit subjects. Where helpful, also record dogfooding or test results.
3. Pre-1.0 (`0.x.y`) stage: a minor bump accumulates additive wire schema / surface changes; a patch bump is bug fixes only.
**Bump point = release point, the same commit**. That is, tag the commit `chore: bump version 0.x → 0.y` itself, right after it lands. Tagging without a bump, as with v0.1.0 (`2319206`), makes it hard for later releases to identify the target commit; for a pre-release snapshot, a SHA reference is enough.
## Dogfood trigger
Dogfooding = an end-to-end verification that runs the new binary against a real KB and real queries and checks that user-visible behavior matches the intent of the spec. It catches regressions that unit / integration tests miss (awkward UX, performance regressions, unexpected token handling, embedding drift, RAG hallucination). Run it before merging a PR, or right after merging and before writing the release notes.
### When dogfooding is required
If any one of the following triggers is hit, dogfooding is mandatory. **All of them are release-level or user-visible behavior changes**.
**Schema / migration**:
- A new V00X migration (e.g. V007 trigram, V008 OCR mirror, V009 morphological): check the user experience of the `corpus_revision` cascade + auto-backfill policy.
- A change to the frozen design contract (an update to §X of `docs/superpowers/specs/2026-04-27-kebab-final-form-design.md`): check user-visible side effects beyond the verbatim CI diff-check.
**Wire schema / CLI surface**:
- A new `--json` field, an exit code change, or a schema major bump (v1 → v2): verify compatibility for agents / external integrations.
- Adding, removing, or renaming a `kebab` subcommand or flag: affects agent skills / muscle memory.
**Search / RAG behavior**:
- A change to the FTS5 tokenizer / chunker / embedder model / RAG prompt template: check whether hit ordering, snippets, and RAG citation patterns for the same query change in a natural way.
- A change to the default of a ranking parameter such as the score gate, RRF fusion ratio, or NLI threshold.
**Performance**:
- An intended change in ingest / search / ask latency (e.g. lindera tokenize, added OCR, multi-hop RAG): measure actual wall-clock time and state it in the release notes.
- Whether the first-boot eager backfill on a large KB (thousands of docs / about ten thousand chunks) stays short enough that users do not perceive it as a hang.
**Language / locale**:
- Changes to Korean / Japanese / Chinese lexical behavior (V007 trigram, V009 morphological, future N-gram).
- Regressions in ad-hoc by-products such as English substring matching.
**File / asset surface**:
- A new source format (PDF OCR, audio, video): actual corpus behavior of the extractor / chunker.
- Workspace policy changes such as `.kebabignore` / `_external/`.
**Release-level**: if one of the triggers above is hit and requires a `Cargo.toml` workspace `version` bump, **dogfooding evidence must be recorded in HOTFIXES + release notes before the bump commit**. Without evidence, users cannot trace why the release was bumped.
### Dogfooding data store
All dogfooding source documents + KB state + logs accumulate under the single directory `/build/dogfood/`. **Classify only by document meaning / kind / format**: prefixes such as kebab version, creation time, or scenario name are forbidden (do not create directories like `v0.20.1-dogfood/` or `dogfood-v018/`). See `/build/dogfood/README.md` for the detailed layout.
- `/build/dogfood/corpus/`: source documents (read-only). Classified by format (`markdown/`, `code/`, `html/`, `images/`, `pdf/`, `manifest/`, `resources/`) and, within each format, by category (e.g. `markdown/{korean,english,bilingual,tech-docs,coding-md-corpus,topics,notes,edge-cases}`, `code/{rust,python,...}`). Add new fixtures to the appropriate category subdir.
- `/build/dogfood/kb/`: KB output of dogfooding runs (SQLite + LanceDB + assets + models). Can be reset on every run. Do not create separate KB directories.
- `/build/dogfood/logs/`: accumulated run logs (ndjson + stderr + summary).
- `/build/dogfood/config.toml`: canonical dogfooding config (if absent, run `kebab init` and then override the paths).
- `/build/dogfood/_archive/`: regeneratable stale state (sqlite/lancedb from previous runs, XDG snapshots). Can be wiped under disk pressure.
Do not start using locations such as `/tmp/kebab-smoke/`, `/tmp/kebab-*`, `/build/cache/dogfood*`, `/home/altair823/KnowledgeBase`, `~/.config/kebab/`, `~/.local/share/kebab/`, or `~/.local/state/kebab/`; everything goes under `/build/dogfood/` consistently. If an ad-hoc fixture is needed, add it to `corpus/<format>/<category>/`.
### Recording dogfooding results
Dogfooding evidence cascades to two places:
1. **A dated entry in `tasks/HOTFIXES.md`**: a per-scenario hit count table + snippet evidence + known limitations. It serves as an immediate reference, beyond git history, when spec drift is suspected later.
2. **`docs/release-notes/v<X.Y.Z>-draft.md`** (or the gitea release body): explain the surface changes that affect user dogfooding in 4 paragraphs (the change / trade-off / mitigation / upgrade procedure), with evidence links.
*Bugs found* during dogfooding (mismatch between spec and actual behavior, performance regression, awkward UX) are fixed immediately and then re-dogfooded. If the fix is split into a separate PR, add a dated entry to HOTFIXES after the merge.
The DOGFOOD scenario catalog (§1~§13) lives in `docs/DOGFOOD.md`. For every new release, update the scenario list in the relevant § section and add new scenarios.
## Naming + paths
- Crate prefix: `kebab-` (kebab-case package, `kebab_` snake_case in Rust modules).

1356
Cargo.lock generated

File diff suppressed because it is too large

View File

@@ -11,6 +11,8 @@ members = [
"crates/kebab-search",
"crates/kebab-embed",
"crates/kebab-embed-local",
"crates/kebab-embed-candle",
"crates/kebab-embed-ollama",
"crates/kebab-llm",
"crates/kebab-llm-local",
"crates/kebab-rag",
@@ -30,7 +32,7 @@ edition = "2024"
rust-version = "1.85"
license = "MIT OR Apache-2.0"
repository = "https://github.com/altair823/kebab"
version = "0.19.0"
version = "0.26.0" # v0.26.0: arctic-embed-l-v2.0 embedder integration: kebab-embed-candle multi-model registry (e5 mean + arctic CLS, per-model pooling/prefix branching) + new kebab-embed-ollama crate (provider="ollama", POST /api/embed, L2 normalization, batch+fail-soft). Adds "ollama" to config models.embedding.provider + endpoint: Option<String>. Default behavior unchanged (provider=fastembed e5), arctic is opt-in, embedding_version cascade (arctic-cls / ollama:{model} tags). See CLAUDE.md §Release
# pre-v0.18 workspace-wide cleanup: enable clippy::pedantic group with
# intentional allow-list. The allowed lints are either cosmetic (doc style),
@@ -141,6 +143,7 @@ proptest = "1"
# p9-fb-19: LRU cache for `App::search` results. Bounded capacity
# from `config.search.cache_capacity` (default 256, ~1.3 MB cap).
lru = "0.12"
lopdf = "0.32"
# fastembed-rs ships ONNX runtime via the `ort-download-binaries` feature
# in its default set (which also pulls `hf-hub` for first-run model
# downloads). Pinned to the 4.x line per task p3-2 (current 5.x release
@@ -202,6 +205,10 @@ ort = { version = "=2.0.0-rc.9", default-features = false, features = [
tokenizers = { version = "0.21", default-features = false, features = ["onig"] }
hf-hub = { version = "0.4", default-features = false, features = ["ureq", "rustls-tls"] }
ndarray = "0.16"
# Korean morphological tokenizer (FTS v0.20.x, §6.1). lindera-ko-dic bundles
# the KO-DIC dictionary as an embedded blob via the embed-ko-dic feature.
lindera = "3"
lindera-ko-dic = "3"
# Disk-footprint trim for dev / test builds. Codegen, opt-level, and
# behavior are unchanged — only DWARF debug info is reduced (line

1
GEMINI.md Symbolic link

@@ -0,0 +1 @@
CLAUDE.md

View File

@@ -17,7 +17,7 @@ P0~P5 + P6 + P7 + P9-1/2/3/4 (Library / Search / Ask / Inspect) + all of P10
| **P4** | Local LLM + RAG + grounded answer | `kebab-llm`, `kebab-llm-local`, `kebab-rag` | P3 | ✅ Done |
| **P5** | Golden query / regression eval | `kebab-eval` | P4 | ✅ Done |
| **P6** | Image ingestion (OCR + caption) | `kebab-parse-image` | P5 | ✅ Done (4/4 components, OCR/caption Ollama-vision) |
| **P7** | PDF text + page citation | `kebab-parse-pdf` | P5 | ✅ Done (3/3 components, page-level chunker + ingest wiring) |
| **P7** | PDF text + page citation + scanned OCR (v0.20.0 sub-item 1) | `kebab-parse-pdf` + `kebab-app::pdf_ocr_apply` | P5 + P6 | ✅ Done (3/3 components, page-level chunker + ingest wiring + post-extract OCR enrichment via qwen2.5vl:3b vision LLM) |
| **P8** | Audio transcription + timestamp citation | `kebab-parse-audio` | P5 | ⏸ On hold (needs a brainstorm on the whisper-rs system dep) |
| **P9** | TUI + desktop app | `kebab-tui`, `kebab-desktop` | P5 | 🟡 In progress (4/5 components: P9-1/2/3/4 done [Library / Search / Ask / Inspect], P9-5 desktop planned · dogfooding feedback **20/20 ✅**) |
| **P10** | code ingest framework | `kebab-parse-code` | P5 | 🟡 In progress: 1A-1 ✅ (wire schema + parse-code skeleton + filter flags), 1A-2 ✅ (Rust AST chunker, `code-rust-ast-v1`, v0.7.0), 1B ✅ (Python/TS/JS AST chunkers, v0.8.0 and later), **1C-Go ✅ (Go AST chunker, `code-go-ast-v1`, v0.12.0)**, **1C-JavaKotlin ✅ (Java + Kotlin AST chunkers, `code-java-ast-v1` / `code-kotlin-ast-v1`, v0.13.0)**, **2 ✅ (Tier 2 resource-aware: yaml/k8s + dockerfile + manifest, `k8s-manifest-resource-v1` / `dockerfile-file-v1` / `manifest-file-v1`, v0.14.0)**, **3 ✅ (Tier 3 paragraph fallback: code-text-paragraph-v1, v0.15.0)**, **1D ✅ (C + C++ AST chunkers, code-c-ast-v1 + code-cpp-ast-v1, v0.16.0)** |
@@ -30,8 +30,16 @@ P0~P5 are serial. P6~P9 can run in parallel after P5.
## Bugs / decisions found after merge (summary)
- **candle embedding backend diversification** (2026-06-01, Track 1, v0.22.0): added `provider = "candle"` as an opt-in, running the same `multilingual-e5-large` model in pure Rust (candle) to avoid the onnxruntime 48-thread double-free on dual-socket NUMA servers. `[models.embedding].num_threads` (+ env `KEBAB_EMBED_THREADS`) caps CPU threads. fastembed default behavior and vectors are unchanged, and `embedding_version` is kept (zero re-indexing). Phase 0 spike parity: cosine 1.000000. Details: HOTFIXES, same date.
- **config migration** (2026-05-31, PR #198): added `kebab config migrate`, which fills in sections missing from an existing config.toml together with comments and cleans up deprecated keys (idempotent, `.bak`, dry-run, values/comments preserved). `schema_version` 1→2, `init` also emits section comments, and doctor gains a `config_migration` check. Details: HOTFIXES, same date.
The dated log of every deviation / hotfix found after merge is in [tasks/HOTFIXES.md](tasks/HOTFIXES.md). This summary lists only the items that "save a lot of time to know when someone takes over":
- **2026-06-03 arctic-embed-l-v2.0 embedder integration**: v0.26.0. Strengthens descriptive-query recall after the alias removal (measured recall@10 130/132, +7 over e5). `kebab-embed-candle` became a model registry (e5 mean + `snowflake-arctic-embed-l-v2.0` CLS, per-model pooling/prefix) + new `kebab-embed-ollama` (`provider="ollama"`, `/api/embed`). Added config `endpoint: Option<String>`. The default stays e5 (opt-in); switching to arctic triggers the embedding_version cascade → re-indexing. A candle↔Ollama cosine>0.99 gate pins pooling/prefix correctness (`#[ignore]`). Details: `tasks/HOTFIXES.md` (2026-06-03 arctic), spec `docs/superpowers/specs/2026-06-03-arctic-embedder-spec.md`.
- **2026-06-03 doc-side expansion (alias) feature fully removed**: v0.25.0. **Removes entirely** the index-time per-chunk LLM alias generation + alias search channel from the 2026-05-31 item below (negative ROI: cross-lingual retrieval is sufficiently covered by e5-large alone, and the only contribution was +2 groups of descriptive queries, at the cost of an index-time LLM call per chunk). Removed `Chunk.aliases`/`expansion.rs`/`IngestExpansionCfg`/alias lexical arm/`expansion_progress` wire kind; the new migration **V013** DROPs `chunk_aliases_fts`+`chunks.aliases`. Aliases were default-off, so user-visible impact is zero, and existing KBs need no re-indexing (leftover alias vectors are mapped gracefully by `strip_alias_suffix` / cleaned up by `reset`). `AssetTimings.expansion_ms` is kept at value 0 for wire compatibility. Details: `tasks/HOTFIXES.md` (2026-06-03), spec `docs/superpowers/specs/2026-06-03-remove-doc-expansion-spec.md`.
- **2026-05-31 Phase 2 doc-side expansion aliases (individual dense vectors) + derivation cache (V012)**: v0.21.0 cut. At index time the LLM generates per-chunk aliases ("same meaning, different wording"), indexed line by line as **individual dense vectors** (sentinel `{chunk}#alias#N`) (a single bundled vector regressed due to averaging dilution → discarded) + boilerplate chunks are skipped. `[ingest.expansion]` default off. Measurement (Namuwiki CS corpus, ~1000 documents): variant consistency 14/18 → **16/18**, spread 0.222→0.111, and the control group showed no alias-induced false positives. The cost bottleneck (aliases for 18 documents took 2.5h) was resolved by the **derivation cache (V012, keyed by chunk content hash)**: the 3 ground-truth items went from cold 1879s → warm 13s, **≈ 145x**, caching embeddings + alias LLM output, consistent with the version_key cascade. search/ask work with only `kebab.sqlite`+`lancedb` → this enables a portable workflow of indexing on an external server and copying just the DB. **Decision/known limitation**: the grounded/refusal judgment misclassifies partial citations as grounded (honest refusals are counted as false positives); a candidate for separate improvement. Two descriptive queries (stack, svm) remain unresolved. Details: `tasks/HOTFIXES.md` (2026-05-31), measurements: `docs/superpowers/handoffs/2026-05-31-namu-wiki-alias-cache-study.md`.
- **2026-05-29 v0.20.2 dogfood findings + search quality baseline**: 8-finding round complete. (1) Ask answer language: rag-v3 default (question language = answer language). (2) An eval `--config` facade patch lets eval run directly against the dogfood KB. (3) Search quality baseline: hybrid hit@3=1.0 / MRR=0.833, lexical hit@3=1.0 / MRR=0.7 (10 golden queries). **O-2 known limitation**: refusal messages from small models (gemma4:e4b) may not match the query language; the judgment itself is correct, only the displayed wording is affected. Details: `tasks/HOTFIXES.md` (2026-05-29).
- **v0.20 sub-item 1 (scanned PDF OCR via qwen2.5vl:3b)**: post-extract enrichment pattern (`kebab-app::pdf_ocr_apply`, H-1 resolution), DCTDecode-only v1 scope (FlateDecode/CCITTFax pages get a warning + skip), parser_version `"pdf-text-v1"` preserved + force-reingest UX documented explicitly (H-4).
- **2026-05-26 kebab-normalize + kebab-parse-types absorbed (24 → 22 crates, design §3.7b rewritten)**: v0.19.0 cut. In reality only the markdown branch among the 4 parsers goes through lift, which diverged from the fan-in ≥ 2 assumption of design §3.7b → the thin layer (`kebab-parse-types`) and `kebab-normalize` crates were absorbed into `kebab-parse-md`. The 5 used types + 3 forward-declared structs are all preserved as `pub` re-exports of the `kebab-parse-md::{types,normalize}` modules. wire / surface impact = 0 (CLI / TUI / MCP / `--json` / config / XDG / parser_version all unchanged). Details: `tasks/HOTFIXES.md` (2026-05-26 design deviation entry).
- **2026-05-26 v0.18.0 fb-41 multi-hop RAG + NLI verification ship (PR #176-180) + post-PR9 cleanup (PR #181)**: the root cause of the S7 caffeine hallucination found in the pre-v0.18.0 dogfood (`/build/cache/dogfood-v018/`, 33 assets / 205 chunks, gemma3:4b CPU only / 16 GB RAM) = the LLM-self-judge ceiling (synthesize silently emitted an Adam optimizer gradient formula unrelated to the chunks, and the self-judge failed to reject it). The conclusion drawn from the academic standard (Self-RAG, CRAG, Auto-GDA, MedTrust-RAG) = deterministic post-synthesis verification. mDeBERTa-v3 XNLI ONNX (280 MB, Xenova HF) checks `(packed_chunks, answer)` entailment and is active when `[rag] nli_threshold > 0` (default 0.0 = disabled, 0.5 recommended for production). Dogfood retest measurement: S7 PR-8 baseline `grounded=true + Adam hallucination` → PR-9 `nli_verification_failed, nli_score 0.0035`. Wire additive minor: `answer.v1.verification` field + `nli_verification_failed` / `nli_model_unavailable` added to `refusal_reason`; pre-v0.18 readers are unaffected. 5 sub-PR sequence + cleanup PR (clippy::pedantic baseline + 30+ intentional allows + H1 `[models.nli].model` config wiring + 9 new tests). Post-refactor retest = byte-identical to PR-9d (determinism confirmed). Details: `tasks/HOTFIXES.md` (2026-05-25 fb-41 PR-9 closure entry + S3 follow-up).
- **2026-05-25 v0.17.2 post-v0.17.1 polish (PR #164 + #165)**: closes two v0.17.1 follow-ups. (1) A separate `[image.ocr] request_timeout_secs` knob: removes the hard 300s `crates/kebab-parse-image/src/ocr.rs::REQUEST_TIMEOUT` and applies the LLM-side pattern (PR #162) to the OCR adapter. It is a separate knob by user decision (OCR and LLM cold start patterns differ, so they are tuned independently). This closes an item left undone in v0.17.1. (2) Closes the issue that the `heading_path` column of `chunks_fts` was trigram-indexed down to its JSON notation + path segments, which could cause query false positives. `lexical.rs::build_match_string` wraps the non-raw branch result as `text : (<expr>)`; heading indexing stays verbatim V007 and only matching is restricted to text. To search headings explicitly, use the raw-mode escape hatch `'heading_path : <token>'` (SKILL.md updated). Both are additive (old configs stay compatible) / no re-ingest needed. Details: `tasks/HOTFIXES.md` (2026-05-25, two v0.17.2 entries).
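
The column-restriction idea from the v0.17.2 entry can be sketched as follows. This is a hypothetical stand-in that borrows the `build_match_string` name for illustration only: the real function's signature, its quoting rules, and its empty-query handling are not shown in this diff and are assumed here. Only the `text : (<expr>)` wrapper and the raw-mode passthrough come from the entry.

```rust
/// Hypothetical match-string builder; not the crate's actual implementation.
fn build_match_string(query: &str, raw: bool) -> Option<String> {
    let trimmed = query.trim();
    if trimmed.is_empty() {
        return None; // nothing to match (assumed handling)
    }
    if raw {
        // Raw mode: the caller wrote FTS5 syntax, so pass it through untouched.
        return Some(trimmed.to_string());
    }
    // Quote each term (doubling embedded quotes) so user input cannot inject
    // FTS5 operators, then restrict the whole expression to the `text` column.
    let terms: Vec<String> = trimmed
        .split_whitespace()
        .map(|t| format!("\"{}\"", t.replace('"', "\"\"")))
        .collect();
    Some(format!("text : ({})", terms.join(" ")))
}
```

Because the column filter applies to the whole parenthesized expression, tokens that appear only in `heading_path` no longer match, while a raw query such as `heading_path : intro` still reaches that column.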
@@ -97,6 +105,59 @@ P0~P5 are serial. P6~P9 can run in parallel after P5.
- **fb-41 multi-hop reasoning**: ⏳ not implemented, XL, needs eval infrastructure first + a brainstorm.
- **Rust symbol path retrofit**: Rust `code-rust-ast-v1` symbols are file-scope-only (1B+ carry a module prefix). A `code-rust-ast-v2` bump + Rust corpus re-ingest is costly → on hold until the user explicitly asks. HOTFIXES `2026-05-20`.
### Priorities after merging v0.20.0 sub-item 1 (PDF scanned OCR) (2026-05-28, user decision)
PR #189 (merged 2026-05-28, commit `09333d0`) brought PDF scanned OCR (qwen2.5vl:3b vision LLM) + 4 rounds of bugfixes (#2/#3/#4/#6/#7/#9/#10/#11/#13/#14) + the ingest log feature into main. Next work order = **C → B → A → G**.
- **C: Korean morphological tokenizer (Bug #8 follow-up)** ✅ **merged in v0.20.1**.
- The ≥3 char query constraint of the V007 trigram (HOTFIXES `2026-05-22`) gave 0 hits for 2-char Korean queries such as '한국' → resolved with the V009 migration + lindera-ko-dic tokenizer + tokenized_korean_text column + first-boot eager backfill. Branch `feat/korean-morphological-tokenizer` (8 commits + 5 follow-ups).
- scope: search index rebuild cascade (corpus_revision bump) + V007 trigram preserved (backward-compat).
- user surface: `kebab search` matches 2-character Korean queries ('한국', '서울'). Reflected in README + SKILL + release notes.
- **B: OCR dense page coverage** ⏳ after C.
- metro-korea.pdf page 8/13 times out (180s, dense newspaper article). The vision LLM produces too many output tokens → an expected timeout.
- Possible paths: (a) dynamic per-page `max_pixels` adjustment (downscale only high-resolution pages), (b) column-level sub-region OCR (split the newspaper layout, then issue separate OCR calls), (c) model upgrade (qwen2.5vl:7b: Ollama model change + max_pixels trade-off), (d) gradually shrinking the OCR timeout (180s → 120s → 90s), measuring p90 after each round.
- mojibake.pdf `pdf_ocr_pages: 0` (unchanged since round 1): consider strengthening the text-detect path fallback.
- Separate sub-item.
- **A: deferred v0.20 sub-items (frozen design contract)** ⏳ after B.
- **sub-item 2**: Multi-region image dispatch (`OcrText.regions` bbox split) for image OCR + PDF column-aware OCR.
- **sub-item 3**: PDF normalize integration (`ParsedPdfPage` production caller + `build_canonical_document_from_pdf_pages` + cross-page reference graph).
- **TODO #4**: Per-page image / table extraction (PDF figure / table extract).
- **TODO #5**: Introduce an Enricher trait, unifying OCR + caption under the `Extractor` trait (a generalization of post-extract enrichment).
- One spec/plan/executor cycle per sub-item.
- **G: v0.20.1 patch release + release notes** ⏳ after A is merged (or cut earlier depending on C/B timing).
- CLAUDE.md release rule: sub-item 1 base + bugfix1-4 + log feature + logging r2 accumulated → many minor surface changes + wire schema additive minor + new config → **v0.20.1 patch bump + release notes**.
- Key surface (in user dogfooding guide form):
- **2-character Korean query support** (`kebab search` matches 2-character words such as '한국', '서울' via the V009 morphological tokenizer).
- OCR timeout default 180s (HOTFIXES 2026-05-28).
- `[logging]` config section (default enabled) + `{state_dir}/logs/ingest-{run_id}.ndjson` created automatically.
- `[logging] keep_recent_runs` (100) + `retention_days` (30): OR-on-stale cleanup.
- 4 additional fields on `ingest_progress.v1.pdf_ocr_finished` (image_byte_size, image_width, image_height, failure_reason); image_w/h are actually captured as of round 2 (PR #190).
- `active_parsers` + `active_chunkers` on `schema.v1.models` (additive minor).
- V008 migration: `pdf_ocr_events` table (per-OCR-call historical record).
- New wire schemas: `ocr_stats.v1` + `ocr_failures.v1` (emitted by CLI inspect).
- CLI `kebab inspect ocr-stats` + `kebab inspect ocr-failures`: incremental sweet-spot analysis.
- CLI `--media code` first-class, empty query → `invalid_input`, `--config` missing → `config_not_found` + exit 2.
- capabilities.streaming_ask + single_file_ingest are now true (corrects the earlier, incorrect false).
- bump work: workspace `Cargo.toml` version → 0.20.1, tag, gitea-release.
### v0.20 follow-up bug catalog (non-blocking known)
Classified as **falsified** or **design constraint** in the PR #189 dogfood, so not fixed:
- Bug #8 (V007 trigram 2-char query limit) → item C above.
- Bug #12 (Code block wire `.code` field; a jq fallback artifact, not `.text`): falsified.
- ask refusals that are sensitive to Korean query phrasing: RAG corner case / NLI gate behavior. Separate brainstorm.
### Logging feature enhancements: ✅ closed (PR #190, merged 2026-05-28, commit `7bbdc89a`)
All 4 enhancements were closed by logging round 2 (PR #190):
- ✅ `image_width` + `image_height` capture (raster JPEG decode).
- ✅ SQLite mirror (V008 `pdf_ocr_events` table + dual-write).
- ✅ CLI query (`kebab inspect ocr-stats` + `ocr-failures`, with the `ocr_stats.v1` + `ocr_failures.v1` wire schemas).
- ✅ log retention (`keep_recent_runs` + `retention_days`: file + SQLite cleanup).
### P9 dogfooding backlog (fb-26 ~ fb-42): release split
Accumulated feedback from the 2026-05-06 dogfooding + surface expansion toward the ultimate goal of "having AI agents use kebab". Considering cascade impact / volume, it is split across releases rather than bundled into one minor.

335
README.md

@@ -1,135 +1,196 @@
# kebab — Local-first Knowledge Base
# kebab — Local-first Knowledge Base + RAG
`kebab` is a personal local knowledge base + RAG tool. It indexes Markdown / PDF / images in one place and provides semantic search plus LLM answers with page-level citations as a single binary. All inference runs locally (Ollama / fastembed). Target hardware: one M4 48GB MacBook, one user.
## Prerequisites
- **Rust toolchain** ≥ 1.85 (the workspace uses edition 2024 + resolver 3). [rustup](https://rustup.rs) recommended.
- **Ollama**: used by `kebab ask` and image OCR/caption. Install from `https://ollama.com/download`, then run `ollama serve`. The default LLM is the gemma4 family (`ollama pull gemma4:e4b`); OCR / caption use the same family, so pulling one model is enough. For a larger variant, override in config with e.g. `gemma4:26b`. Specify host:port in `[models.llm].endpoint` in the config.
- **Recommended models for CPU-only / RAM ≤ 16 GB environments**: gemma4:e4b (8B) is heavy for CPU inference, and a single RAG answer easily exceeds 5 minutes, hitting the default 300 s limit of `[models.llm] request_timeout_secs` and failing with `error: kb-rag: llm.generate_stream` (HOTFIXES 2026-05-25). Switching to a ≤ 4B Q4 model such as `gemma3:4b` / `qwen2.5:3b` / `phi3:mini` gives stable answers in 1-3 minutes (verified in extended dogfooding). If model storage is a concern, the location can be moved with the `OLLAMA_MODELS=/path` env.
- **`request_timeout_secs` knob (v0.17.0)**: raise the limit with `[models.llm] request_timeout_secs = 1200` (or `KEBAB_MODELS_LLM_REQUEST_TIMEOUT_SECS=1200`) to try larger models as well. Note that RAM stays occupied longer during the response. **`= 0` does not disable the timeout; it means "time out immediately"** (per reqwest semantics). For an "effectively unlimited" intent, use `u64::MAX` or a large finite value such as `86400`.
- **Install without sudo (isolated directory)**: if you would rather not have `install.sh` touch `/usr/local/bin/ollama` + the `systemd` unit, download only the binary tarball, unpack it into a user directory, and separate the model location via env.
```bash
mkdir -p /opt/ollama/{models,logs}
curl -fL https://ollama.com/download/ollama-linux-amd64.tar.zst -o /tmp/ollama.tar.zst
zstd -d /tmp/ollama.tar.zst -o /tmp/ollama.tar && tar -xf /tmp/ollama.tar -C /opt/ollama/
# This unpacks bin/ollama + lib/ollama/. The model directory is separated via OLLAMA_MODELS.
OLLAMA_MODELS=/opt/ollama/models OLLAMA_HOST=127.0.0.1:11434 \
/opt/ollama/bin/ollama serve > /opt/ollama/logs/serve.log 2>&1 &
/opt/ollama/bin/ollama pull gemma3:4b
```
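Useful as-is when you want to move the load off the root disk (`~/.ollama/models` is the default). Handy on containers without systemd, WSL2, company machines, and the like.
- **`kebab ask --stream` recommended (fb-33)**: when model cold start is long (8B+ or the first call), streaming tokens to stderr as ndjson with `--stream` shows the first token quickly even within the 5-minute timeout limit, which improves perceived responsiveness. Even at the same inference time, progressive output is more stable than wait-and-pray. CLI: `kebab ask "..." --stream 2> events.ndjson > final.json`. MCP hosts are also advised to use it automatically when the `streaming_ask` capability flag is `true`.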
- **Build disk**: on the first build `target/` takes 6~10 GB (Lance + DataFusion + fastembed). Check free space.
- **fastembed model**: the first `kebab ingest` automatically downloads `multilingual-e5-large` (~1.3 GB, fb-39b). Set `model = "multilingual-e5-small"` explicitly in `config.toml` to use the previous model.
## Installation
The standard path is `cargo install`; just make sure `~/.cargo/bin/kebab` is on PATH.
```bash
# 1) repo clone
git clone https://gitea.altair823.xyz/altair823-org/kebab.git
cd kebab
# 2) build + install the binary (~/.cargo/bin/kebab)
cargo install --path crates/kebab-cli --locked
# 3) check PATH (if not added yet, add it in ~/.bashrc / ~/.zshrc)
which kebab # → a path like /Users/<you>/.cargo/bin/kebab
kebab --version # → kebab 0.1.0
```
Installing directly from the git URL also works (without cloning):
```bash
cargo install --git https://gitea.altair823.xyz/altair823-org/kebab.git --bin kebab --locked
```
To update, run `git pull && cargo install --path crates/kebab-cli --locked --force`, or for the git URL form, `cargo install --git ... --force`.
To uninstall, run `cargo uninstall kebab-cli`. This removes only the binary and leaves workspace data in place. To clean up the data as well, run `kebab reset --all --yes` (wipes all 4 XDG paths: config + data + cache + state; **irreversible**, run `kebab init` again on restart). Partial wipes: `kebab reset --data-only` (keeps config), `kebab reset --vector-only` (only Lance + `embedding_records`; the next ingest re-embeds), **`kebab reset --orphans-only`** (cleans up only stored docs outside the current walker scope, as an explicit reconcile after narrowing `config.workspace.include` or moving a sub-dir; files on the fs are not touched), and so on.
`kebab` is a personal local knowledge base + RAG tool. It indexes Markdown · PDF · images · source code in one place and provides hybrid semantic search and LLM answers with source citations as a **single binary**. All inference runs locally (Ollama + fastembed).
## Quick start
There are only two prerequisites.
- **Rust toolchain** ≥ 1.85 (the workspace uses edition 2024). [rustup](https://rustup.rs).
- **Ollama**: used by `kebab ask` and image/PDF OCR. See the [official install guide](https://ollama.com/download), then run `ollama serve`. The default LLM family is gemma4 (`ollama pull gemma4:e4b`); OCR/caption use the same family, so one model is enough. On CPU-only environments a small model (e.g. `gemma3:4b`) is recommended.
```bash
# First run: create the data directory + config.toml under the XDG paths
# 1) build + install (~/.cargo/bin/kebab)
git clone https://gitea.altair823.xyz/altair823-org/kebab.git
cd kebab
cargo install --path crates/kebab-cli --locked
# 2) create the data directory + config.toml (XDG paths)
kebab init
# edit the config: workspace.root, model endpoint, etc. (supported formats: md / png / jpg / pdf / rs / py / ts / js / go)
# 3) minimal config edits: workspace.root (the folder to index) and the LLM endpoint
${EDITOR:-vi} ~/.config/kebab/config.toml
# index (Markdown / images / PDF all at once)
# 4) index (Markdown · PDF · images · source code at once)
kebab ingest
# search (the citation's source_span is line / region / page depending on the medium)
kebab search "Markdown chunking rules" --mode hybrid
# 5) search (hybrid = lexical + vector RRF, with citations)
kebab search "Markdown chunking rules"
# ask (requires Ollama; page numbers surface when citing PDFs)
# 6) ask (RAG answer + source citations, requires Ollama)
kebab ask "What is the storage strategy in my KB design?"
# Ratatui shell (Library + Search + Ask + Inspect panels, desktop in progress)
kebab tui
# health check (config path / whether the data directory is writable)
kebab doctor
```
For the procedure to try it in an isolated temporary workspace, see [docs/SMOKE.md](docs/SMOKE.md), separated via `--config <path>`. If you need image / PDF fixtures, the two example binaries (`cargo run --release --example gen_smoke_pdf -p kebab-parse-pdf` / `gen_smoke_png -p kebab-parse-image`) can generate them in-tree without system deps.
You can also install straight from the git URL without cloning: `cargo install --git https://gitea.altair823.xyz/altair823-org/kebab.git --bin kebab --locked`. To update, add `--force` to the same command. To uninstall, run `cargo uninstall kebab-cli` (data is preserved; to delete the data too, run `kebab reset --all --yes`).
To try the dev flow without installing, use `cargo run --release -p kebab-cli -- <subcommand>` or `cargo build --release && ./target/release/kebab <subcommand>`.
To try the dev flow without installing, use `cargo run --release -p kebab-cli -- <subcommand>`. The procedure for verifying in an isolated temporary workspace is in [docs/SMOKE.md](docs/SMOKE.md) (separated via `--config <path>`).
## Key features
### Hybrid search + citation
Search fuses two channels, lexical (FTS5 BM25) and vector (cosine), with **RRF fusion**. Every hit carries its exact source location per medium: line for Markdown/code, region for images, page for PDF. It supports a range of filters such as `--tag` · `--media` · `--lang` · `--path-glob` and agent budget flags such as `--max-tokens` · `--cursor`.
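
For readers unfamiliar with reciprocal rank fusion, a minimal sketch follows. It is an illustration under assumptions, not kebab's implementation: the function name `rrf_fuse` is hypothetical, both channels are weighted equally, and `k = 60` is only the conventional constant rather than a value taken from this project.

```rust
use std::collections::HashMap;

/// Fuses ranked lists of chunk ids; `k` dampens the weight of top ranks.
fn rrf_fuse(rankings: &[Vec<&str>], k: f64) -> Vec<(String, f64)> {
    let mut scores: HashMap<String, f64> = HashMap::new();
    for ranking in rankings {
        for (idx, id) in ranking.iter().enumerate() {
            // Ranks are 1-based in the usual formulation: score += 1 / (k + rank).
            *scores.entry((*id).to_string()).or_insert(0.0) += 1.0 / (k + idx as f64 + 1.0);
        }
    }
    let mut fused: Vec<(String, f64)> = scores.into_iter().collect();
    // Highest score first; ties are broken by id so the output is deterministic.
    fused.sort_by(|a, b| b.1.total_cmp(&a.1).then_with(|| a.0.cmp(&b.0)));
    fused
}
```

Only ranks enter the score, so BM25 and cosine values never need to be normalized against each other; a chunk that appears in both lists outranks one that appears in a single list at a similar position.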
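### Derivation cache (automatic)
Embedding vectors are cached by chunk **content hash** (`derivation_cache`). On re-indexing or updates, chunks with identical content skip recomputation. The cache key includes the model and dimension version, so a version change invalidates entries automatically (cascade-safe). It works transparently with no extra configuration. (Automatic TTL/LRU cleanup is not implemented yet; accumulated cache is cleared only via `kebab reset`.)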
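
The keying scheme can be sketched as below. Everything here is hypothetical: the `CacheKey` and `DerivationCache` types are illustrative, the store is an in-memory map rather than the real persisted cache, and std's `DefaultHasher` stands in for whatever content hash the project actually uses (its output is not stable across Rust releases, so it would not suit a persisted cache).

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Hypothetical key: content hash plus the model identity it was computed with.
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
struct CacheKey {
    content_hash: u64,
    model: String,
    dimensions: usize,
}

fn cache_key(chunk_text: &str, model: &str, dimensions: usize) -> CacheKey {
    let mut hasher = DefaultHasher::new(); // stand-in hash, see the note above
    chunk_text.hash(&mut hasher);
    CacheKey { content_hash: hasher.finish(), model: model.to_string(), dimensions }
}

/// In-memory stand-in for the cache; `misses` counts real embed calls.
struct DerivationCache {
    entries: HashMap<CacheKey, Vec<f32>>,
    misses: usize,
}

impl DerivationCache {
    fn new() -> Self {
        Self { entries: HashMap::new(), misses: 0 }
    }

    /// Returns the cached vector, or computes and stores it on a miss.
    fn get_or_embed<F>(&mut self, text: &str, model: &str, dims: usize, embed: F) -> Vec<f32>
    where
        F: FnOnce(&str) -> Vec<f32>,
    {
        let key = cache_key(text, model, dims);
        if let Some(hit) = self.entries.get(&key) {
            return hit.clone();
        }
        self.misses += 1;
        let vector = embed(text);
        self.entries.insert(key, vector.clone());
        vector
    }
}
```

Because the model name and dimension are part of the key, switching embedders simply misses the cache instead of returning stale vectors, which is the cascade-safe behavior described above.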
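### External compute + local search workflow
search/ask work from the KB artifacts alone, without the original files (chunk bodies are stored in SQLite and document paths are recorded as relative paths). After running the expensive indexing (embedding, OCR) on a powerful machine (e.g. candle Metal GPU on an Apple Silicon Mac), copy **just two artifacts** to another machine (e.g. a NUMA server) and you can search and ask there as-is.
**What to copy: the two paths defined in `[storage]`:**
| Copy target | config key (`[storage]`) | Default path | Contents |
|-----------|------------------------|-----------|------|
| `kebab.sqlite` | `sqlite = "{data_dir}/kebab.sqlite"` | `{data_dir}/kebab.sqlite` | documents, chunks, bodies, FTS5, metadata |
| `lancedb/` | `vector_dir = "{data_dir}/lancedb"` | `{data_dir}/lancedb/` | embedding vectors |
`{data_dir}` is `[storage].data_dir` (e.g. `~/.local/share/kebab`). `models/` (`model_dir`) and `assets/` (`asset_dir`) **do not need to be copied**: each machine fetches its own model cache, and the original asset bytes are not used for search or ask (to also preserve re-reading / re-indexing of originals from single-file/`stdin` ingests, copy `assets/` as well).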
```bash
# copy once ingest has finished (no writes in progress)
rsync -a <src-data_dir>/kebab.sqlite user@server:<dst-data_dir>/
rsync -a <src-data_dir>/lancedb/ user@server:<dst-data_dir>/lancedb/
```
조건: **양쪽 동일 `kebab` 버전 + 동일 임베딩 모델/차원** (`[models.embedding].model`·`dimensions`). provider 는 달라도 됨 (예: 맥 `candle`/Metal ↔ 서버 `candle`/CPU 또는 `fastembed` — 같은 모델이면 벡터 호환). 복사는 반드시 ingest 가 돌지 않을 때.
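### Multimedia indexing
Markdown, PDF, images (OCR + caption), source code (Rust/Python/TS/JS/Go/Java/Kotlin/C/C++ AST), and resource files (YAML/Dockerfile/TOML/JSON/XML and others) are routed to the appropriate chunker by file extension. Scanned PDFs with no embedded text get page-level OCR via `[pdf.ocr]` (opt-in). The full extension→chunker mapping is in [docs/ARCHITECTURE.md](docs/ARCHITECTURE.md).
### RAG (grounded citations + refusal)
The LLM generates an answer grounded in the search results and attaches [#number] citations. When the evidence is insufficient it refuses instead of making up an answer. Compound questions are decomposed and then synthesized with `--multi-hop`. Answer groundedness can be verified with mDeBERTa XNLI (`[rag] nli_threshold`, default off).
### TUI
`kebab tui` is a Ratatui shell: the Library / Search / Ask / Inspect panels are driven in vim-style modes. The in-app `F1` cheatsheet is the authoritative source for key mappings.
## Commands
| Command | Behavior |
|------|------|
| `kebab init` | Creates the data directory and config.toml under the XDG paths |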
| `kebab ingest [<path>]` | Markdown / 이미지 / PDF / Rust 소스코드 색인 (idempotent). TTY 에서는 stderr 진행 바, non-TTY (CI / pipe) 는 stderr 한 줄씩, `--json` 은 stdout 에 `ingest_progress.v1` 라인 streaming 후 마지막에 `ingest_report.v1`. Ctrl-C 한 번이면 현재 asset 마무리 후 abort (부분 commit 보존, idempotent re-run), 두 번째 Ctrl-C 는 hard exit. Markdown title 이 frontmatter 에 없어도 첫 H1 → H2 → 첫 paragraph 80 자 → 파일명 순으로 자동 채움 (parser_version `md-frontmatter-v2`) — 기존 색인된 doc 도 다음 ingest 에서 새 title 로 갱신. **Incremental** (p9-fb-23): 두 번째 이후의 ingest 는 변하지 않은 doc (blake3 + parser/chunker/embedder version 모두 동일) 의 parse/chunk/embed/vector upsert 를 자동 스킵. final summary 에 `N unchanged` 카운트 표시. `--force-reingest` 로 skip 무시 강제 재처리. **지원 형식** (extractor 자동 결정 — config 에 명시 불가): Markdown (`.md`), 이미지 (`.png` / `.jpg` / `.jpeg`, OCR + caption), PDF (`.pdf`), **소스코드** (`.rs` → `code-rust-ast-v1`, `.py` → `code-python-ast-v1`, `.ts`/`.tsx` → `code-ts-ast-v1`, `.js`/`.mjs`/`.cjs`/`.jsx` → `code-js-ast-v1`, `.go` → `code-go-ast-v1`, `.java` → `code-java-ast-v1`, `.kt`/`.kts` → `code-kotlin-ast-v1`, `.c`/`.h` → `code-c-ast-v1`, `.cpp`/`.cc`/`.cxx`/`.hpp`/`.hh`/`.hxx` → `code-cpp-ast-v1` — 모두 tree-sitter AST chunker; **Tier 2 리소스 파일**: `.yaml`/`.yml` → `k8s-manifest-resource-v1` (apiVersion+kind 파싱), `Dockerfile`/`Dockerfile.*`/`*.dockerfile` → `dockerfile-file-v1` (전체 파일), `Cargo.toml`/`pyproject.toml`/`.toml`/`package.json`/`tsconfig.json`/`.json`/`pom.xml`/`.xml`/`build.gradle`/`.gradle`/`go.mod` → `manifest-file-v1` (전체 파일) — yaml (k8s) / dockerfile / toml / json / xml / groovy / go-mod 지원); **Tier 3 paragraph fallback** (`.sh`/`.bash`/`.zsh` → `code-text-paragraph-v1`, blank-line paragraph split + 80-line/20-overlap line-window. Tier 1/2 가 0 chunk 또는 Err 시 자동 fallback — 비-k8s YAML 같은 케이스 picked up. symbol = None, lang 은 원본 보존.). 다른 확장자는 자동 skip — `IngestItem.warnings` 에 사유 (`"unsupported media type: .docx"` 등), `IngestReport.skipped_by_extension` 에 카운트 분류, CLI / TUI summary 에 breakdown 표시. 
코드 chunk 는 `citation.kind = "code"` 에 `citation.lang = "<lang>"` + `symbol` + line range 를 담고, SearchHit top-level 에 `code_lang` + `repo` (`.git/` walk-up 의 디렉토리 이름) 가 backfill 됨. `--code-lang rust` / `--code-lang python` / `--code-lang typescript` / `--code-lang javascript` / `--code-lang go` / `--code-lang java` / `--code-lang kotlin` / `--code-lang yaml` / `--code-lang dockerfile` / `--code-lang toml` / `--code-lang json` / `--code-lang xml` / `--code-lang groovy` / `--code-lang go-mod` / `--code-lang shell` / `--code-lang c` / `--code-lang cpp` / `--media code` filter 로 언어별·코드 전용 검색 가능 (p10-1A-1 filter flags). Python symbol 은 workspace 경로 → dotted module path prefix (예: `kebab_eval.metrics.compute_mrr`), TS/JS symbol 은 slash-style module path prefix (예: `src/Foo.Foo.search`), Go symbol 은 `package.Func` / `package.(*Receiver).Method` 형식, Java / Kotlin symbol 은 `com.foo.Foo.bar` 형식 (패키지 + 클래스 + 메서드/필드). |
| `kebab search --mode {lexical,vector,hybrid} "<query>" [--no-cache] [--max-tokens N] [--snippet-chars N] [--cursor <opaque>] [--tag T] [--lang L] [--path-glob G] [--trust-min LEVEL] [--media TYPE] [--ingested-after RFC3339] [--doc-id ID] [--trace] [--bulk] [--repo NAME ...] [--code-lang LIST]` | 검색. hybrid는 RRF fusion, citation 포함. 같은 process 안에서 동일 query (NFKC + trim + lowercase 정규화) 반복 시 in-process LRU 캐시 hit (capacity = `[search] cache_capacity`, default 256). `--no-cache` 로 강제 bypass — 디버깅용. ingest commit 발생 시 `kv['corpus_revision']` bump 으로 모든 entry 자동 stale. **`--max-tokens` / `--snippet-chars` / `--cursor` (p9-fb-34)** — agent budget controls. `--json` 출력은 `search_response.v1` wrapper (`{hits, next_cursor, truncated}`) — pre-fb-34 의 bare array 와 호환 안 됨. mismatched cursor → `error.v1.code = stale_cursor`. **filter flags (p9-fb-36):** `--tag` 는 반복 가능 flag (`--tag rust --tag async`) 로 OR 매칭, `--media` 는 `,` 구분 다중 값 OR 매칭, 나머지 flags 간은 AND 조합. `--trust-min` 은 `primary\|secondary\|generated` 중 하나 (해당 level 이상 포함). `--ingested-after` 는 RFC3339 UTC — 파싱 실패 시 `error.v1.code = config_invalid` (exit 2). `--media md` 는 `markdown` alias 로 정규화. 알 수 없는 `--media` 값은 무조건 empty hits (오류 아님). **`--trace` (p9-fb-37)** — `search_response.v1.trace` 에 lexical / vector pre-fusion 후보 + RRF union + per-stage timing (`lexical_ms` / `vector_ms` / `fusion_ms` / `total_ms`) 노출. trace 요청은 캐시 우회 (`--no-cache` 없이도 항상 cold). **`--bulk` (p9-fb-42)** — stdin ndjson 으로 N query 한 번에 실행. `--json` 면 stdout per-query ndjson (`bulk_search_item.v1`) + stderr summary (`bulk_summary: total=N succeeded=S failed=F`). Cap 100. agent 가 query decomposition 후 sub-query 일괄 실행 시 single round-trip — App instance 재사용으로 캐시 / embedder cold-start 비용 한 번만. Per-query failure 는 item 의 `error` (error.v1) 에 격리, 다른 query 계속 진행. **code corpus filters (p10-1A-1):** `--repo` 는 반복 가능 (`--repo kebab --repo other`) OR 매칭. `--code-lang` 는 반복 또는 comma 다중 값 (`--code-lang rust,python`), 알 수 없는 값은 빈 hits. 
`--media code` 는 Tier 1/2/3 모든 code chunk 포함. 1A-1 시점에서는 indexed 된 code chunk 가 없어 filter 가 항상 빈 결과 — 1A-2 (Rust AST chunker) 머지 이후 실효. **v0.17.0 trigram tokenizer (한국어 + 영어 동작 변경):** `chunks_fts` 가 FTS5 `trigram` 으로 동작 — 한국어 query 는 3자 이상 substring 매칭 (`해시 충돌` 같은 multi-token 도 whole-phrase 후보로 hit), 영어도 substring 매칭 (`token` 이 `tokenizer` 도 hit, recall ↑ / 단어 경계 ↓). 2자 이하 query 는 0-hit + stderr `[hint] 3자 이상 키워드 권장` + `search_response.v1.hint` 필드 (raw FTS5 mode `'...'` 제외). `kebab.sqlite` 파일 크기는 trigram index 비대화로 ~2-5배 또는 수백 MB 증가 (V007 자동 backfill, re-ingest 불필요). |
| `kebab ingest [<path>]` | Scans the workspace and indexes new/changed documents (idempotent, incremental; `--force-reingest` forces reprocessing). Unsupported extensions are skipped automatically. The progress bar shows the chunk count per document and, when a document finishes, per-phase timings (parse/chunk/embed/store) (with `--json`, as `asset_chunked`/`asset_timings` events) |
| `kebab ingest-file <path>` | Ingests a single file (may live outside the workspace; copied deterministically into `_external/`) |
| `kebab ingest-stdin --title <T>` | Ingests a markdown body from stdin |
| `kebab search --mode {lexical,vector,hybrid} "<query>" [flags]` | Search (default hybrid = RRF fusion, citations included). See `--help` for filter/budget flags |
| `kebab ask "<query>" [flags]` | RAG answer with grounded citations (requires Ollama). `--session` (multi-turn) · `--stream` · `--multi-hop` |
| `kebab list docs` | Lists indexed documents |
| `kebab inspect doc <id>` / `kebab inspect chunk <id>` | Shows the raw record |
| `kebab fetch chunk <id> [--context N]` / `kebab fetch doc <id> [--max-tokens N]` / `kebab fetch span <doc_id> <ls> <le> [--max-tokens N]` | (p9-fb-35) verbatim text fetch from indexed corpus. wire = `fetch_result.v1` (kind discriminator). chunk: target + ±N ordinal-context chunks. doc: full normalized markdown. span: 1-based line range (PDF/audio rejected as `error.v1.code = span_not_supported`). chars/4 budget on doc/span. |
| `kebab ask "<query>" [--show-citations / --hide-citations] [--session <id>] [--stream] [--multi-hop]` | RAG 답변 + 근거 인용. 답변 후 `근거:` block 으로 full path / line range / score 한 줄씩 (default ON — `--hide-citations` 로 끄기, pipe 시 유용). 근거 부족 시 거절. Ollama 필요. `--session <id>` 로 multi-turn — 첫 호출에서 SQLite `chat_sessions` 에 자동 생성, 이후 호출은 prior turns 를 history 로 받아 follow-up. session id 는 사용자 지정 (e.g. `kb-rust-async-2026-05`) — `kebab reset --data-only` 로 모든 session wipe. **`--stream` (p9-fb-33)** 로 ndjson `answer_event.v1` event (retrieval_done → token* → final) 를 stderr 에 흘리고 stdout 마지막 줄에 기존 `answer.v1` — agent 가 token 즉시 소비 가능. **`--multi-hop` (v0.18.0 fb-41)** — single-pass 대신 decompose → decide → synthesize 의 N-hop loop. compound 질문 (cross-doc / prereq chain) 에 효과적. 최종 답변 후 mDeBERTa-v3 XNLI 가 `(packed_chunks, generated_answer)` entailment 검사 — `[rag] nli_threshold > 0` (default 0.0 = disabled, production 권장 0.5) 일 때 활성. entailment < threshold → `refusal_reason = "nli_verification_failed"` (LLM-self-judge ceiling 극복, S7 caffeine hallucination 같은 케이스 catch). 첫 호출 시 ~280 MB ONNX model 자동 다운로드 + RAM peak ~7-8 GB (gemma3:4b 기준). model unavailable 시 `refusal_reason = "nli_model_unavailable"`, 우회는 `[rag] nli_threshold = 0` 임시 disable. |
| `kebab doctor` | 설정/모델/DB 헬스 체크 |
| `kebab tui` | Ratatui 셸 (Library + Search + Ask + Inspect 패널, desktop 진행 중). Library 에서 `r` 키로 background ingest 시작 — 화면 하단 status bar 가 진행 표시, 완료/abort 시 final 라인 잠시 유지 후 자동 hide. ingest 진행 중 `Esc` / `Ctrl-C` 가 cancel signal (그 외에는 quit). vim-style mode (header 우측 `-- NORMAL --` / `-- INSERT --`) — Library/Inspect 는 자동 NORMAL, Search/Ask 는 자동 INSERT. `i` 로 Normal→Insert (모든 pane — p9-fb-21), `Esc` 로 Insert→Normal 어디서나. mode-authoritative dispatch — Search 의 `j/k/o/g`, Ask 의 `e/j/k` 는 NORMAL 모드에서만 명령으로 동작, INSERT 에서는 입력 문자로 typing. (Search 의 chunk inspect 키는 `i`→`o` 로 rebind — `i` 가 universal Insert toggle.) **`F1` 로 cheatsheet popup** (현재 pane 의 키 매핑 + global 토글 표) — `Esc` / `F1` 로 닫기. Search 패널은 200ms debounce 후 background worker 가 검색 — 키 입력으로 UI freeze 안 됨, 사용자가 계속 타이핑하면 stale 결과 자동 폐기 (generation counter). Ask 패널은 multi-turn — 같은 conversation 안에서 Q1/A1, Q2/A2 transcript 누적, 다음 질문이 이전 턴을 history 로 받아 답변. 답변 본문은 markdown 렌더 (bold/italic/inline code/heading/list/code fence/table/blockquote, raw `**bold**` 가 실제 굵게 표시). `Ctrl-L` 로 새 conversation 시작. Search 의 `g` 키가 `$EDITOR` (기본 `vi`) 로 hit 의 citation 위치 열기 — 종료 후 TUI 화면이 자동으로 깨끗이 redraw. CLI `kebab ask` 는 raw markdown 그대로 (terminal 호환성 위해). Library 의 doc-list 가 한글 / 일본어 / 중국어 (CJK) 제목을 wide-char 정확한 column width 로 truncate — 한글 제목이 한 줄을 넘기지 않음 (CJK 1 자 = 2 col). Search/Ask/Filter 입력의 cursor 가 wide char 위에서 column 단위로 정렬 — 한글 입력 시 caret 이 글자 옆에 정확히 놓임. `← / →` 로 입력 문자열 중간 cursor 이동 (한글 한 글자 = 2 column 이라도 한 번에 이동), `Home / End` 로 양 끝 점프, `Delete` 로 cursor 위치 char 삭제 — 모든 input pane (Ask / Search / Library filter overlay) 동일 (p9-fb-22). Ask 트랜스크립트는 새 답변이 viewport 아래로 누적될 때 자동으로 tail 을 따라감 (auto-scroll); `j` / `k` 로 위로 스크롤하면 freeze, `Shift-G` 로 다시 bottom + auto-tail 재개. 화면 하단 hint line 은 한국어 동사구로 (`"위로"` / `"아래로"` / `"필터"` / `"타이핑 검색어"` / `"Esc 로 NORMAL 모드"` / `"i 입력모드"` 등) + 현재 (pane, mode) 조합에 맞춰 자동 분기, **첫 fragment 가 항상 `F1 도움말`** (cheatsheet 발견성 보장). 
모든 모드에서 항상 떠 있는 상태바 — `kebab v<version> │ <pane> │ <docs> docs │ <state>` (state: streaming/searching/indexing/idle, ingest 진행 중에는 progress 가 같은 자리에 흡수됨). Ask 진입 시 conversation id 8 자 prefix 도 함께 표시. Ask 트랜스크립트와 Inspect 양쪽에서 `PgUp / PgDn` 으로 10 줄씩 페이지 스크롤. Library 의 doc list 위에는 `TITLE / TAGS / UPDATED / CHUNKS` 컬럼 헤더 행 표시 (display-width 정렬, Hangul / CJK 안전). |
| `kebab reset [--all / --data-only / --vector-only / --config-only] [--yes]` | XDG 데이터 wipe. **Irreversible.** TTY 면 confirm prompt, 아니면 `--yes` 필수. `--vector-only` 는 SQLite `embedding_records` 도 함께 truncate (orphan 방지) |
| `kebab eval run / compare` | golden query 회귀 측정 |
| `kebab schema [--json]` | introspection — wire schemas / capabilities / models / stats 한 번에. `--json` 은 `schema.v1` wire; 사람 모드는 서식 출력. **stats 에 (p9-fb-37) `media_breakdown` (5 keys: markdown / pdf / image / audio / other) + `lang_breakdown` (BCP-47 코드, NULL 은 literal `"null"`) + `index_bytes` (sqlite + lancedb on-disk 합계) + `stale_doc_count` (`config.search.stale_threshold_days` 초과 doc 수) 추가.** |
| `kebab ingest-file <path>` | 단일 파일 ingest (workspace 외부 가능). 바이트는 `<workspace.root>/_external/<hash12>.<ext>` 로 copy. `.kebabignore` 매치 시 stderr warn 후 진행 (explicit ingest 가 bypass intent). |
| `kebab ingest-stdin --title <T> [--source-uri <URI>]` | stdin 의 markdown 본문 ingest. frontmatter (title + source_uri) 자동 prepend. v1 markdown only. |
| `kebab mcp` | MCP (Model Context Protocol) stdio server. agent host (Claude Code / Cursor / OpenAI Agents) 가 spawn 하여 tool 호출 (`search` / `bulk_search` / `ask` / `fetch` / `schema` / `doctor` / `ingest_file` / `ingest_stdin`). `--config` honor. |
| `kebab inspect doc <id>` / `inspect chunk <id>` | Shows the raw record |
| `kebab fetch chunk\|doc\|span <id> [flags]` | Fetches verbatim text from the indexed corpus |
| `kebab eval run \| aggregate \| compare \| variants` | Golden-query regression measurement plus variant-consistency diagnostics |
| `kebab schema [--json]` | introspection: wire schemas / capabilities / models / stats |
| `kebab doctor` | Health check for config / models / DB |
| `kebab tui` | Ratatui shell (Library / Search / Ask / Inspect) |
| `kebab mcp` | MCP stdio server (`search` / `bulk_search` / `ask` / `fetch` / `schema` / `doctor` / `ingest_file` / `ingest_stdin`) |
| `kebab reset [--all \| --data-only \| --vector-only \| --config-only \| --orphans-only] [--yes]` | Wipes XDG data (**irreversible**) |
Every command takes a `--json` flag. Output follows the frozen wire schema v1 (`schema_version` is always included, e.g. `ingest_report.v1`, `ingest_progress.v1`, `search_hit.v1`, `answer.v1`, `doctor.v1`, `reset_report.v1`, `schema.v1`). In `--json` mode, fatal errors are emitted to stderr as `error.v1` ndjson (exit codes 0/1/2/3 unchanged).
Every command has a `--json` flag, and output follows the frozen **wire schema v1** (`schema_version` is always included). In `--json` mode, fatal errors are emitted to stderr as `error.v1` ndjson (exit codes 0/1/2/3 unchanged). Global flags: `--readonly` (disables write paths), `--quiet` (suppresses human-readable stderr), env `KEBAB_PROGRESS=plain`. For full flag and wire semantics see `kebab <cmd> --help` and [docs/wire-schema/v1/](docs/wire-schema/v1/). For external agent integration (Claude Code skill / MCP) see [docs/mcp-usage.md](docs/mcp-usage.md) and [integrations/](integrations/).
Global flags: `--readonly` (or `KEBAB_READONLY=1`) disables every write-path command (`ingest` / `ingest-file` / `ingest-stdin` / `reset`), which then exit with code 1. `--quiet` suppresses human-readable stderr such as progress bars and hints (exit code and stdout output are unchanged). `KEBAB_PROGRESS=plain` prints progress to stderr as plain-text lines (instead of a spinner) even without a TTY.
## Configuration
### Interpreting scores (fb-38)
`~/.config/kebab/config.toml` is created by `kebab init` under the XDG path. Only the key knobs are summarized here (see the comments in the generated file for every section, and [docs/SMOKE.md](docs/SMOKE.md) for an example).
`search_hit.v1.score` is a **ranking signal**, not a confidence. The `score_kind` field declares its meaning:
```toml
[workspace]
root = "~/KnowledgeBase" # Folder to index. Absolute / tilde / env / relative paths all work.
# Relative paths resolve against the config.toml location (independent of cwd).
| `score_kind` | Meaning | Range |
|--------------|------|------|
| `rrf` (hybrid) | RRF normalized | `[0, 1]`, ceiling = 1.0 (rank=1 in both channels) |
| `bm25` (lexical) | raw BM25 | unbounded (≥ 0) |
| `cosine` (vector) | cosine sim | `[-1, 1]` |
#### RRF formula (hybrid mode)
```
raw RRF of chunk c = Σ_m 1 / (k_rrf + rank_m(c))
where m ∈ {lexical, vector}, k_rrf = config.search.rrf_k (default 60).
With rank=1 in both channels, raw RRF = 2 / (k_rrf + 1) ≈ 0.0328.
normalize: rrf_score = raw_rrf / (2 / (k_rrf + 1))
→ rrf_score ∈ [0, 1]. Rank 1 in both channels → 1.0; present in only one channel → ceiling ≈ 0.5.
[models.embedding]
provider = "fastembed" # "fastembed" (default, onnxruntime) / "candle" (pure Rust)
# / "ollama" (remote HTTP) / "none" (lexical-only).
# candle runs the same model and produces the same vectors in pure Rust;
# it is an opt-in backend that avoids the onnxruntime 48-thread
# double-free on NUMA servers (no re-index needed for e5).
model = "multilingual-e5-large" # Multilingual sentence embedding (1024-dim).
# ONNX (~1.3GB) is downloaded automatically on first ingest.
# The candle provider downloads safetensors (~2GB).
# candle/ollama also support "snowflake-arctic-embed-l-v2.0"
# (better recall on descriptive queries); see below.
dimensions = 1024 # If config and the dim stored in LanceDB disagree, search returns 0 results.
num_threads = 0 # CPU thread cap, candle only (0=auto=#cores).
# env KEBAB_EMBED_THREADS takes precedence. For NUMA node binding,
# combine with numactl. Ignored by the fastembed provider.
# endpoint = "http://127.0.0.1:11434" # HTTP endpoint, provider="ollama" only.
# Falls back to [models.llm].endpoint when omitted.
# Ignored by the fastembed/candle providers.
```
What `rrf_score = 0.5` means: the chunk appears at rank 1 in only one channel (lexical or vector). It is not 50% confidence; it is the arithmetic ceiling of the RRF formula.
**arctic-embed-l-v2.0 (better recall on descriptive queries)**: instead of the default e5-large,
you can use the Snowflake `arctic-embed-l-v2.0` embedder (1024-dim, opt-in). In measurements,
recall@10 on descriptive / abbreviation / English-term queries improved over e5. Two paths:
An agent that needs a trust threshold should use the nested `retrieval.lexical_score` (raw BM25) / `retrieval.vector_score` (raw cosine) rather than the top-level `score`.
```toml
# (A) candle backend: pure Rust, in-process (NUMA-safe, Metal GPU capable):
[models.embedding]
provider = "candle"
model = "snowflake-arctic-embed-l-v2.0" # CLS pooling, "query: " prefix on queries
# (no prefix on documents). Downloads ~2GB of safetensors.
## Logical architecture
# (B) ollama backend: delegates to a remote/local Ollama daemon (POST /api/embed):
[models.embedding]
provider = "ollama"
model = "snowflake-arctic-embed2" # Ollama model tag (requires ollama pull)
endpoint = "http://127.0.0.1:11434" # falls back to [models.llm].endpoint when omitted
```
> ⚠️ Switching e5 → arctic triggers the `embedding_version` cascade (a different model
> means different vectors). It cannot be mixed with an existing e5 KB; switching requires a **re-index**
> (`kebab reset`, then ingest again). The default is e5, so existing users are unaffected.
**Apple Silicon GPU acceleration (candle / macOS)**: running candle embedding on the
GPU (Metal) of an M-series Mac makes large ingests much faster than on CPU. Enable the
`embed_metal` feature when building or installing:
```bash
# build only:
cargo build --release --features embed_metal
# global install (~/.cargo/bin/kebab):
cargo install --path crates/kebab-cli --features embed_metal --locked
```
The vectors come from the same model as CPU candle and are therefore compatible, so
`kebab.sqlite` + `lancedb/` indexed on the Mac GPU can be copied as-is to a Linux server
(CPU candle) and queried there. If the indexing log shows `candle device = Metal (GPU)`,
the GPU is in use. The metal feature is macOS-only (Linux/servers use the default CPU build).
```toml
[models.llm]
endpoint = "http://localhost:11434" # Ollama host:port
model = "gemma4:e4b"
# request_timeout_secs = 300 # Raise for large models. 0 does not disable; it means "time out immediately".
[search]
stale_threshold_days = 30 # Threshold for the stale flag on search hits / citations (0 = off).
[rag]
prompt_template_version = "rag-v3" # Answer language = question language. rag-v1/v2 are legacy.
nli_threshold = 0.0 # >0 (e.g. 0.5) enables mDeBERTa XNLI groundedness verification.
```
- **Derivation cache**: embedding results are cached automatically by content hash (see the 「핵심 기능」 section above). No configuration items.
- **`[ingest.code]`**: skip policy for code ingest (`skip_generated_header`, `max_file_bytes`, `extra_skip_globs`). `.gitignore` is honored automatically; `.kebabignore` is an additional layer.
- **`[pdf.ocr]`**: page-level OCR for scanned PDFs (default off / opt-in, costs roughly tens of seconds per page). After enabling it, reprocess documents indexed back in v0.19 with `kebab ingest --force-reingest`.
- **`--config <path>`**: for temporary workspaces / isolated tests (honored by both CLI and TUI).
- **`kebab config migrate`**: fills config sections added in newer versions into an existing `config.toml`, with explanatory comments (values, comments, and ordering you edited are preserved; idempotent; an automatic `.bak` backup is made on change). `--dry-run` previews the changes. `kebab doctor` tells you when an update is needed. A config.toml newly created by `kebab init` also includes per-section comments.
- **`KEBAB_*` env**: overrides some keys (`KEBAB_RAG_SCORE_GATE`, `KEBAB_EVAL_GOLDEN`, etc.).
- **XDG layout**: `~/.config/kebab/`, `~/.local/share/kebab/`, `~/.cache/kebab/`, `~/.local/state/kebab/`.
## Architecture
```mermaid
flowchart TB
@@ -146,7 +207,7 @@ flowchart TB
subgraph Pipeline["Domain + pipeline"]
parse["parse-md / parse-pdf / parse-image / parse-code"]
chunker["chunker (md-heading-v1, pdf-page-v1, code-{rust,python,ts,js,go,java,kotlin,c,cpp}-ast-v1, k8s-manifest-resource-v1, dockerfile-file-v1, manifest-file-v1, code-text-paragraph-v1)"]
chunker["chunker (md / pdf / code-AST / manifest)"]
embedder["embedder (fastembed multilingual-e5-large)"]
retriever["retriever (lexical / vector / hybrid RRF)"]
rag["RAG pipeline"]
@@ -188,70 +249,22 @@ flowchart TB
rag --> ollama
```
`kebab-app` is the facade: UI binaries do not reference store / parse / search / llm / rag directly (frozen design §8). Crate-level dependencies, the directory layout, and the key technical decisions are detailed in [docs/ARCHITECTURE.md](docs/ARCHITECTURE.md).
Key design points as of v0.21.0:
## Configuration
- **crate facade**: `kebab-app` is the only facade. The UI binaries (`kebab-cli` / `kebab-tui`) do not reference store / parse / search / llm / rag directly (frozen design §8). Each user-facing entry point has a companion `*_with_config(cfg, …)` function that threads an explicit config through.
- **chunk_id is position-based**: a chunk's identity is its position within the document (ordinal + span). The derivation cache key, by contrast, is a **content hash**, so identical content reuses the same cache entry even at a different position or in a different document.
- **wire schema v1**: every `--json` output is a frozen contract that carries `schema_version`. Breaking changes require a `*.v2` major bump.
- **versioning cascade**: a change to `parser_version` / `chunker_version` / `embedding_version` / `prompt_template_version` / `index_version` invalidates downstream records (chunks, embeddings, cache, eval).
- `~/.config/kebab/config.toml`: created by `kebab init` under the XDG path. Sections: `[workspace]` (root, exclude; the include field was removed and supported formats are determined automatically), `[storage]`, `[chunking]`, `[models.embedding]`, `[models.llm]`, `[image.ocr]`, `[image.caption]`, `[search]`, `[rag]`, `[ui]`.
- `[models.embedding]`:
- `model` (default `"multilingual-e5-large"`, fb-39b): multilingual sentence embedding model, 1024-dim. The ONNX file (~1.3 GB) is downloaded automatically to the fastembed cache (`config.storage.model_dir/fastembed/`) on first run. `"multilingual-e5-small"` (384 dim) remains available for backwards compatibility when set explicitly in TOML.
- `dimensions` (default `1024`): the model's embedding dimension. If config and the dim stored in LanceDB disagree, search returns 0 results (orphan table). After changing the model, rebuilding the vector index with `kebab reset --vector-only && kebab ingest` is recommended.
- `[ui] theme = "dark" | "light"` selects the TUI palette (default `"dark"`; unknown values fall back to dark).
- `[search] stale_threshold_days = 30` (p9-fb-32): threshold for the `stale` flag on search hits / RAG citations (default 30 days, `0` disables). `workspace.include = [...]` in an old config is silently ignored with a one-time deprecation warning (p9-fb-25).
- `[ingest.code]` (p10-1A-1): skip policy for code ingest plus chunker defaults.
- `skip_generated_header = true`: skip the file when a generated marker (`@generated` / `DO NOT EDIT`, etc.) is detected in the first ~512 bytes.
- `max_file_bytes = 262144` (256 KiB) / `max_file_lines = 5000`: per-file caps; files over a cap are skipped.
- `extra_skip_globs = []`: user-supplied extra skip patterns (`.gitignore` syntax).
- `.gitignore` is honored automatically. `.kebabignore` is an additional layer. Precedence: built-in safety net (`node_modules/` / `target/` / `__pycache__/` / `.venv/` / `venv/` / `env/`) > `.gitignore` > `.kebabignore`.
- `[rag] prompt_template_version` (default `"rag-v2"`): RAG system prompt version. `"rag-v1"` is kept for legacy backwards compatibility (retained when set explicitly). Rules tightened in v2: (1) when citing a fact, quote the original chunk text in double quotes before the [#number]; (2) no use of knowledge from training; (3) state "not certain" explicitly when the evidence is ambiguous.
- `--config <path>` flag: for temporary workspaces / isolated tests. Honored by both CLI and TUI.
- `KEBAB_*` env: overrides some keys (`KEBAB_RAG_SCORE_GATE`, `KEBAB_EVAL_GOLDEN`, `KEBAB_COMMIT_HASH`, etc.).
- XDG layout: `~/.config/kebab/`, `~/.local/share/kebab/`, `~/.cache/kebab/`, `~/.local/state/kebab/`.
- `workspace.root` path forms: absolute (`/foo/bar`) / tilde (`~/KnowledgeBase`, default) / env (`${XDG_DATA_HOME}/kebab`) / relative (`./notes`, `notes`, `../shared/x`) are all accepted. **Relative paths resolve against the directory containing config.toml itself**, independent of the user's `cwd` (`--config /tmp/cfg.toml` + `root = "kb"` → `/tmp/kb`). Policy from p9-fb-05.
For a config example, see the `/tmp/kebab-smoke/config.toml` block in [docs/SMOKE.md](docs/SMOKE.md).
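The relative-path rule for `workspace.root` (base = the directory holding config.toml, not the current working directory) can be sketched with std's path types. `resolve_workspace_root` is a hypothetical helper, not a function from `kebab-config`, and it deliberately leaves out the tilde and `${ENV}` expansion that the real loader is described as supporting.

```rust
use std::path::{Path, PathBuf};

/// Resolve `workspace.root` against the directory that contains config.toml.
/// Hypothetical sketch: handles only absolute and relative roots.
fn resolve_workspace_root(config_path: &Path, root: &str) -> PathBuf {
    let root = Path::new(root);
    if root.is_absolute() {
        // Absolute roots are used as-is.
        return root.to_path_buf();
    }
    // Relative roots are joined onto the config file's parent directory,
    // so the caller's cwd never influences the result.
    let base = config_path.parent().unwrap_or_else(|| Path::new("."));
    base.join(root)
}

fn main() {
    let config = Path::new("/tmp/cfg.toml");
    println!("{}", resolve_workspace_root(config, "kb").display());
    println!("{}", resolve_workspace_root(config, "/foo/bar").display());
}
```

With `--config /tmp/cfg.toml` and `root = "kb"` this yields `/tmp/kb`, matching the example in the bullet above (on a Unix-style filesystem).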
## External AI integration
`--json` output plus the frozen wire schema v1 is the stable contract. Integration options:
- **Claude Code skill**: [`integrations/claude-code/`](integrations/claude-code/) in the repo is a ship-ready skill. A single `cp -r integrations/claude-code/kebab ~/.claude/skills/` makes it trigger automatically from the next Claude Code session (internal systems / wiki lookup / in-house runbook questions). Multi-turn state persists through `kebab ask --session <id> --json`; if the skill manages the conversation id, an external agent can hold a stateful conversation without `--repl` (p9-fb-18).
- **Codex / other agent hosts**: `--json` plus the frozen wire schema v1 is the stable contract. A wrapper of roughly 50 lines can be written along the same pattern. PRs adding `integrations/<host>/` are welcome.
- **MCP server**: exposes the `kebab-app` facade 1:1 over stdio JSON-RPC. See `kebab mcp`.
- **HTTP wrapper**: `kebab serve --bind 127.0.0.1:7711` (P+; to be weighed carefully against the value of staying local-only).
## Using MCP
`kebab mcp` is a stdio MCP server with 8 tools: `search` / `bulk_search` (p9-fb-42, N queries in one call) / `ask` / `fetch` (p9-fb-35) / `schema` / `doctor` / `ingest_file` / `ingest_stdin`.
Quick registration for Claude Code (`~/.claude/mcp.json` or the host's equivalent location):
```json
{
"mcpServers": {
"kebab": {
"command": "kebab",
"args": ["mcp"]
}
}
}
```
For detailed usage (Cursor / OpenAI Agents / Copilot CLI config, per-tool input/output examples, troubleshooting, multi-turn ask + session management, performance / security), see **[docs/mcp-usage.md](docs/mcp-usage.md)**.
The crate-level dependency graph, directory tree, full extension→chunker mapping, and key technical decisions are in [docs/ARCHITECTURE.md](docs/ARCHITECTURE.md); progress is tracked in [HANDOFF.md](HANDOFF.md).
## Non-goals
Multi-user SaaS / K8s / remote vector DB / enterprise RBAC / real-time collaboration / perfect parsing of every file format / arbitrary file modification by agents / multi-workspace / LLM-as-judge eval / CLIP visual embedding / `kebab://` protocol handler. See frozen design §11 / §0.
Multi-user SaaS / K8s / remote vector DB / enterprise RBAC / real-time collaboration / arbitrary file modification by agents / multi-workspace / LLM-as-judge eval / CLIP visual embedding. See frozen design §0 / §11.
## License
## Version / License / References
`MIT OR Apache-2.0` (the `license` field of the workspace `Cargo.toml`).
## References
- Progress: [HANDOFF.md](HANDOFF.md)
- Architecture: [docs/ARCHITECTURE.md](docs/ARCHITECTURE.md)
- Frozen design: [docs/superpowers/specs/2026-04-27-kebab-final-form-design.md](docs/superpowers/specs/2026-04-27-kebab-final-form-design.md)
- Task index: [tasks/INDEX.md](tasks/INDEX.md)
- Post-merge hotfix log: [tasks/HOTFIXES.md](tasks/HOTFIXES.md)
- Smoke procedure: [docs/SMOKE.md](docs/SMOKE.md)
- **Version**: v0.21.0 (check with `kebab --version`).
- **License**: `MIT OR Apache-2.0`.
- Progress: [HANDOFF.md](HANDOFF.md) · Architecture: [docs/ARCHITECTURE.md](docs/ARCHITECTURE.md) · Frozen design: [docs/superpowers/specs/2026-04-27-kebab-final-form-design.md](docs/superpowers/specs/2026-04-27-kebab-final-form-design.md)
- Task index: [tasks/INDEX.md](tasks/INDEX.md) · Hotfix log: [tasks/HOTFIXES.md](tasks/HOTFIXES.md) · Smoke procedure: [docs/SMOKE.md](docs/SMOKE.md) · MCP usage: [docs/mcp-usage.md](docs/mcp-usage.md)


@@ -18,6 +18,8 @@ kebab-store-vector = { path = "../kebab-store-vector" }
kebab-search = { path = "../kebab-search" }
kebab-embed = { path = "../kebab-embed" }
kebab-embed-local = { path = "../kebab-embed-local" }
kebab-embed-candle = { path = "../kebab-embed-candle" }
kebab-embed-ollama = { path = "../kebab-embed-ollama" }
kebab-llm = { path = "../kebab-llm" }
kebab-llm-local = { path = "../kebab-llm-local" }
kebab-rag = { path = "../kebab-rag" }
@@ -35,6 +37,11 @@ kebab-parse-image = { path = "../kebab-parse-image" }
# per-asset dispatch (see `ingest_one_asset` PDF branch) and runs the
# resulting `CanonicalDocument` through `kebab-chunk::PdfPageV1Chunker`.
kebab-parse-pdf = { path = "../kebab-parse-pdf" }
lopdf = { workspace = true }
# Enhancement 1 (v0.20.x r2): JPEG dimension decode in pdf_ocr_apply.rs.
# jpeg feature added explicitly (F3 closure-r1) rather than relying on
# feature unification via kebab-parse-image.
image = { version = "0.25", default-features = false, features = ["png", "jpeg"] }
# p10-1A-2: Rust AST extractor lives here. App threads it into the
# per-asset dispatch (see `ingest_one_asset` Code branch) and runs the
# resulting `CanonicalDocument` through `kebab-chunk::CodeRustAstV1Chunker`.
@@ -44,6 +51,7 @@ blake3 = { workspace = true }
serde = { workspace = true }
serde_json = { workspace = true }
time = { workspace = true }
uuid = { workspace = true }
tracing = { workspace = true }
tracing-subscriber = { version = "0.3", features = ["env-filter", "fmt", "json"] }
tracing-appender = "0.2"
@@ -61,24 +69,40 @@ unicode-normalization = "0.1"
ignore = "0.4"
# p9-fb-34: opaque pagination cursor encodes payload as base64.
base64 = { workspace = true }
# Enhancement 3 (v0.20.x r2): direct SQL queries for inspect_ocr_stats/failures.
rusqlite = { workspace = true }
[dev-dependencies]
kebab-config = { path = "../kebab-config" }
# doc-side expansion (Phase 2) Task 4: ExpansionGenerator unit tests build
# MockLanguageModel (gated behind kebab-llm's `mock` feature, default OFF in
# [dependencies]). Enabling it here turns it on for the test build only.
kebab-llm = { path = "../kebab-llm", features = ["mock"] }
rusqlite = { workspace = true }
filetime = "0.2"
tempfile = { workspace = true }
# Image-pipeline integration tests use wiremock to stub Ollama for OCR
# / caption HTTP calls. Async runtime to host the mock server only;
# the kb-app code under test stays sync.
wiremock = { workspace = true }
tokio = { workspace = true, features = ["rt-multi-thread"] }
image = { version = "0.25", default-features = false, features = ["png"] }
image = { version = "0.25", default-features = false, features = ["png", "jpeg"] }
# P7-3 PDF integration tests build in-memory PDF fixtures via the same
# lopdf builder pattern `kebab-parse-pdf::tests::common` uses; pinned
# to the same major (0.32) so byte output is identical between the two
# fixture surfaces.
lopdf = "0.32"
lopdf = { workspace = true }
# error_wire::tests::llm_unreachable_classifies_to_model_unreachable needs a real
# reqwest::Error (private constructor) — built from a connect-refused call.
reqwest = { version = "0.12", default-features = false, features = ["blocking", "rustls-tls"] }
[features]
# Marker feature, spec §6.3 Option A (simple): lindera is owned by kebab-chunk as a default dep.
# There is no disable path; this feature only serves to honor what spec §6.3 states.
default = ["fts_korean_morphological"]
fts_korean_morphological = []
# opt-in (macOS): candle embedder runs on the Apple Silicon GPU. See kebab-embed-candle.
embed_metal = ["kebab-embed-candle/metal"]
[lints]
workspace = true


@@ -43,12 +43,13 @@ use kebab_core::{
Answer, DocumentStore, Embedder, ExtractContext, Extractor, IndexVersion, LanguageModel,
MediaType, Retriever, SearchHit, SearchMode, SearchOpts, SearchQuery, VectorStore,
};
use kebab_embed_candle::CandleEmbedder;
use kebab_embed_local::FastembedEmbedder;
use kebab_embed_ollama::OllamaEmbedder;
use kebab_llm_local::OllamaLanguageModel;
use kebab_parse_code::{
CAstExtractor, CppAstExtractor, GoAstExtractor, JavaAstExtractor,
JavascriptAstExtractor, KotlinAstExtractor, PythonAstExtractor, RustAstExtractor,
TypescriptAstExtractor,
CAstExtractor, CppAstExtractor, GoAstExtractor, JavaAstExtractor, JavascriptAstExtractor,
KotlinAstExtractor, PythonAstExtractor, RustAstExtractor, TypescriptAstExtractor,
};
use kebab_parse_image::ImageExtractor;
use kebab_parse_pdf::PdfTextExtractor;
@@ -90,29 +91,6 @@ pub struct SearchResponse {
pub hint: Option<String>,
}
/// v0.17.0 A5 Step 4b: decide whether to attach a "3자 이상 키워드 권장"
/// hint to a `SearchResponse`. Fires only when the result set is empty
/// *and* the trimmed query is shorter than the trigram tokenizer can
/// resolve. Raw FTS5 mode (`'...'`) opts out — the user explicitly
/// invoked FTS5 syntax. Identical condition powers the CLI stderr line
/// and (separately) the TUI status bar.
pub fn short_query_hint(query_text: &str, hits_empty: bool) -> Option<String> {
if !hits_empty {
return None;
}
let trimmed = query_text.trim();
let bytes = trimmed.as_bytes();
// Raw single-quote mode: user opted into FTS5 syntax, no advisory.
if bytes.len() >= 2 && bytes[0] == b'\'' && bytes[bytes.len() - 1] == b'\'' {
return None;
}
if trimmed.chars().count() < 3 {
Some("3자 이상 키워드 권장 (trigram tokenizer 제약)".to_string())
} else {
None
}
}
/// Facade state — see module docs for lifetime rules.
///
/// The struct is public so long-lived callers (kb-eval, the future P9
@@ -213,6 +191,34 @@ impl App {
sqlite
.run_migrations()
.context("kb-app: run SqliteStore migrations")?;
// First-boot eager backfill of V009's tokenized_korean_text column.
// The chunks_ai trigger already fills it for new ingests, so with no NULL rows this returns 0 immediately (idempotent).
// On a V007 → V009 upgrade the cost scales with KB size (~30-60s per ~10000 chunks).
let backfill_count = sqlite
.backfill_tokenized_korean_text(
|done, total| {
if total > 0 && done % 500 == 0 {
tracing::info!(
target: "kebab-app",
"korean tokenizer backfill: {done}/{total}"
);
}
},
kebab_chunk::tokenize_korean_morphological,
)
.unwrap_or_else(|e| {
tracing::warn!(
target: "kebab-app",
"korean tokenizer backfill failed: {e}"
);
0
});
if backfill_count > 0 {
tracing::info!(
target: "kebab-app",
"korean tokenizer backfill complete: {backfill_count} chunks updated"
);
}
// p9-fb-19: build the LRU cache from config. Capacity 0 →
// `None` (cache disabled — every search hits the retrievers).
let search_cache = NonZeroUsize::new(config.search.cache_capacity)
@@ -242,15 +248,15 @@ impl App {
// kebab-nli construction. Failure (`?`) surfaces as a user-
// facing error at App boot — never a panic in the pipeline's
// `expect("verifier must be Some when nli_threshold > 0.0")`.
let pipeline_verifier: Option<Arc<dyn kebab_nli::NliVerifier>> =
if config.rag.nli_threshold > 0.0 {
let v = kebab_nli::OnnxNliVerifier::new(&config).context(
"kebab-app: construct OnnxNliVerifier (config.rag.nli_threshold > 0)",
)?;
Some(Arc::new(v))
} else {
None
};
let pipeline_verifier: Option<Arc<dyn kebab_nli::NliVerifier>> = if config.rag.nli_threshold
> 0.0
{
let v = kebab_nli::OnnxNliVerifier::new(&config)
.context("kebab-app: construct OnnxNliVerifier (config.rag.nli_threshold > 0)")?;
Some(Arc::new(v))
} else {
None
};
Ok(Self {
config,
sqlite: Arc::new(sqlite),
@@ -350,7 +356,9 @@ impl App {
// so other in-flight searches can use the cache concurrently.
drop(guard);
let hits = self.search_uncached(query)?;
let mut guard = cache.lock().unwrap_or_else(std::sync::PoisonError::into_inner);
let mut guard = cache
.lock()
.unwrap_or_else(std::sync::PoisonError::into_inner);
guard.put(key, hits.clone());
Ok(hits)
}
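The `lock().unwrap_or_else(PoisonError::into_inner)` idiom used for the search cache can be sketched in isolation. The `Cache` type below is a hypothetical stand-in (a plain `HashMap`, not the LRU cache the facade actually builds), and `poison` exists only to simulate a panic while the lock is held.

```rust
use std::collections::HashMap;
use std::sync::{Mutex, PoisonError};

/// Hypothetical stand-in for the search cache: a map guarded by a mutex.
pub struct Cache {
    inner: Mutex<HashMap<String, Vec<String>>>,
}

impl Cache {
    pub fn new() -> Self {
        Self { inner: Mutex::new(HashMap::new()) }
    }

    /// Store hits. If another thread panicked while holding the lock, the
    /// mutex is poisoned; `into_inner` hands back the guard anyway, which is
    /// acceptable for a cache whose entries can always be recomputed.
    pub fn put(&self, key: &str, hits: Vec<String>) {
        let mut guard = self.inner.lock().unwrap_or_else(PoisonError::into_inner);
        guard.insert(key.to_string(), hits);
    }

    pub fn get(&self, key: &str) -> Option<Vec<String>> {
        let guard = self.inner.lock().unwrap_or_else(PoisonError::into_inner);
        guard.get(key).cloned()
    }
}

/// Poison the mutex by panicking while the guard is alive (demo only).
pub fn poison(cache: &Cache) {
    let _ = std::panic::catch_unwind(std::panic::AssertUnwindSafe(|| {
        let _guard = cache.inner.lock().unwrap();
        panic!("simulated panic while holding the cache lock");
    }));
}
```

After `poison`, a plain `lock().unwrap()` would panic, while `put` and `get` keep working because they recover the guard from the `PoisonError`.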
@@ -430,11 +438,7 @@ impl App {
///
/// `SearchResponse.next_cursor` and `truncated` are independent
/// signals — see `SearchResponse` doc for details.
pub fn search_with_opts(
&self,
query: SearchQuery,
opts: SearchOpts,
) -> Result<SearchResponse> {
pub fn search_with_opts(&self, query: SearchQuery, opts: SearchOpts) -> Result<SearchResponse> {
use crate::cursor;
let corpus_revision = self.sqlite.corpus_revision().to_string();
@@ -519,8 +523,7 @@ impl App {
// Apply offset + k_effective truncation (mirrors non-trace path).
let drop_n = offset.min(traced_hits.len());
traced_hits.drain(..drop_n);
let mut hits: Vec<SearchHit> =
traced_hits.into_iter().take(k_effective).collect();
let mut hits: Vec<SearchHit> = traced_hits.into_iter().take(k_effective).collect();
// Snippet truncation if opts.snippet_chars set (mirror non-trace path).
if opts.snippet_chars.is_some() {
@@ -533,7 +536,7 @@ impl App {
// Trace path skips the budget loop. Caller will inspect
// `hits.len()` and `trace.timing` rather than paginate.
let hint = short_query_hint(&query.text, hits.is_empty());
let hint: Option<String> = None;
return Ok(SearchResponse {
hits,
next_cursor: None,
@@ -551,8 +554,7 @@ impl App {
// Skip offset.
let drop_n = offset.min(all_hits.len());
all_hits.drain(..drop_n);
let mut hits: Vec<SearchHit> =
all_hits.into_iter().take(k_effective).collect();
let mut hits: Vec<SearchHit> = all_hits.into_iter().take(k_effective).collect();
// Apply snippet_chars override if shorter than what the
// retriever returned (retriever already honored
@@ -573,15 +575,11 @@ impl App {
// Step 1: shorten snippets progressively to a 60-char floor.
const SNIPPET_FLOOR: usize = 60;
let mut current_snippet_cap = snippet_chars;
while estimate_chars(&hits) > max_chars
&& current_snippet_cap > SNIPPET_FLOOR
{
current_snippet_cap =
(current_snippet_cap / 2).max(SNIPPET_FLOOR);
while estimate_chars(&hits) > max_chars && current_snippet_cap > SNIPPET_FLOOR {
current_snippet_cap = (current_snippet_cap / 2).max(SNIPPET_FLOOR);
for h in &mut hits {
if h.snippet.chars().count() > current_snippet_cap {
h.snippet =
trim_to_chars(&h.snippet, current_snippet_cap);
h.snippet = trim_to_chars(&h.snippet, current_snippet_cap);
truncated = true;
}
}
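The progressive snippet shortening in this hunk can be read as a small standalone routine. This is a sketch under assumptions: `estimate_chars` and `trim_to_chars` are hypothetical re-implementations (the real helpers are not shown in the diff), and snippets are plain `String`s rather than `SearchHit` fields.

```rust
/// Floor below which snippets are never shortened (60 in the diff).
const SNIPPET_FLOOR: usize = 60;

/// Hypothetical helper: keep at most `cap` characters (not bytes).
fn trim_to_chars(s: &str, cap: usize) -> String {
    s.chars().take(cap).collect()
}

/// Hypothetical size estimate: total characters across all snippets.
fn estimate_chars(snippets: &[String]) -> usize {
    snippets.iter().map(|s| s.chars().count()).sum()
}

/// Halve the per-snippet cap until the estimate fits `max_chars` or the cap
/// reaches the floor. Returns `(final_cap, truncated)`.
pub fn shrink_snippets(
    snippets: &mut [String],
    start_cap: usize,
    max_chars: usize,
) -> (usize, bool) {
    let mut cap = start_cap;
    let mut truncated = false;
    while estimate_chars(snippets) > max_chars && cap > SNIPPET_FLOOR {
        cap = (cap / 2).max(SNIPPET_FLOOR);
        for s in snippets.iter_mut() {
            if s.chars().count() > cap {
                let shorter = trim_to_chars(s.as_str(), cap);
                *s = shorter;
                truncated = true;
            }
        }
    }
    (cap, truncated)
}
```

Because the cap is clamped with `.max(SNIPPET_FLOOR)`, the loop always terminates: once the cap equals the floor the `cap > SNIPPET_FLOOR` guard fails even if the budget is still exceeded.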
@@ -622,7 +620,7 @@ impl App {
None
};
let hint = short_query_hint(&query.text, hits.is_empty());
let hint: Option<String> = None;
Ok(SearchResponse {
hits,
next_cursor,
@@ -651,8 +649,7 @@ impl App {
retriever: Arc<dyn Retriever>,
llm: Arc<dyn LanguageModel>,
) -> RagPipeline {
let pipeline =
RagPipeline::new(self.config.clone(), retriever, llm, self.sqlite.clone());
let pipeline = RagPipeline::new(self.config.clone(), retriever, llm, self.sqlite.clone());
match &self.pipeline_verifier {
Some(v) => pipeline.with_verifier(v.clone()),
None => pipeline,
@@ -723,12 +720,7 @@ impl App {
/// returns; on persistence error, the answer is still returned
/// (don't lose the user's compute) but the error is logged so
/// the operator notices.
pub fn ask_with_session(
&self,
session_id: &str,
query: &str,
opts: AskOpts,
) -> Result<Answer> {
pub fn ask_with_session(&self, session_id: &str, query: &str, opts: AskOpts) -> Result<Answer> {
use kebab_core::traits::{ChatSessionRepo, ChatSessionRow, ChatTurnRow};
use std::time::{SystemTime, UNIX_EPOCH};
@@ -766,13 +758,8 @@ impl App {
let retriever = self.build_retriever(opts.mode)?;
let llm = self.llm()?;
let pipeline = self.build_pipeline(retriever, llm);
let answer = pipeline.ask_with_history(
query,
history,
session_id.to_string(),
next_index,
opts,
)?;
let answer =
pipeline.ask_with_history(query, history, session_id.to_string(), next_index, opts)?;
// Auto-create the session header on first use. Title from
// the first question (≤40 chars after trim).
@@ -813,7 +800,8 @@ impl App {
turn_index: next_index,
question: query.to_string(),
answer: answer.answer.clone(),
citations_json: serde_json::to_string(&answer.citations).unwrap_or_else(|_| "[]".to_string()),
citations_json: serde_json::to_string(&answer.citations)
.unwrap_or_else(|_| "[]".to_string()),
created_at: now_unix,
};
if let Err(e) = self.sqlite.append_turn(&turn_row) {
@@ -847,10 +835,31 @@ impl App {
if let Some(e) = self.embedder.get() {
return Ok(Some(e.clone()));
}
let emb: Arc<dyn Embedder + Send + Sync> = Arc::new(
FastembedEmbedder::new(&self.config)
.context("kb-app: load FastembedEmbedder")?,
);
// Provider branch (Track 1 spec §3 + arctic-embedder spec). The
// `embeddings_disabled()` check above already handled `"none"`; here we
// route the live providers. `fastembed`/`onnx`/(empty) keep the default
// onnxruntime path (vectors unchanged — `embedding_version` is
// preserved); `candle` selects the pure-Rust NUMA-safe backend (e5 or
// arctic via its model registry); `ollama` offloads to a remote
// `/api/embed` daemon.
let provider = self.config.models.embedding.provider.as_str();
let emb: Arc<dyn Embedder + Send + Sync> = match provider {
"fastembed" | "onnx" | "" => Arc::new(
FastembedEmbedder::new(&self.config).context("kb-app: load FastembedEmbedder")?,
),
"candle" => Arc::new(
CandleEmbedder::new(&self.config).context("kb-app: load CandleEmbedder")?,
),
"ollama" => Arc::new(
OllamaEmbedder::new(&self.config).context("kb-app: load OllamaEmbedder")?,
),
other => {
return Err(anyhow!(
"kb-app: unknown embedding provider {other:?}; expected one of \
`fastembed` (default), `candle`, `ollama`, or `none` (lexical-only)"
));
}
};
// `set` returns Err if another thread won the race; in that case
// the loser still returns the (now-cached) winner via `get()`.
let _ = self.embedder.set(emb.clone());
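The provider branch is a string-keyed dispatch to a trait object. A minimal sketch follows, assuming a much smaller hypothetical `Embedder` trait and unit structs in place of `FastembedEmbedder`, `CandleEmbedder`, and `OllamaEmbedder`; none of the real constructors or config types are reproduced here.

```rust
use std::sync::Arc;

/// Hypothetical minimal trait; the real `Embedder` trait is richer.
pub trait Embedder: Send + Sync {
    fn backend(&self) -> &'static str;
}

struct Fastembed;
struct Candle;
struct Ollama;

impl Embedder for Fastembed {
    fn backend(&self) -> &'static str { "fastembed" }
}
impl Embedder for Candle {
    fn backend(&self) -> &'static str { "candle" }
}
impl Embedder for Ollama {
    fn backend(&self) -> &'static str { "ollama" }
}

/// Route a provider string to a backend. The empty string and `onnx` fall
/// through to the default path, mirroring the match arms in the diff.
/// `none` is assumed to be handled earlier by the caller, so it is rejected.
pub fn select_embedder(provider: &str) -> Result<Arc<dyn Embedder>, String> {
    let emb: Arc<dyn Embedder> = match provider {
        "fastembed" | "onnx" | "" => Arc::new(Fastembed),
        "candle" => Arc::new(Candle),
        "ollama" => Arc::new(Ollama),
        other => {
            return Err(format!(
                "unknown embedding provider {other:?}; expected one of \
                 `fastembed` (default), `candle`, `ollama`, or `none`"
            ));
        }
    };
    Ok(emb)
}
```

Keeping the unknown-provider arm as an error rather than a silent default means a typo in the config surfaces at the first embedder request instead of quietly selecting a different model.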
@@ -925,7 +934,9 @@ impl App {
/// clear` admin command). No-op when the cache is disabled.
pub fn clear_search_cache(&self) {
if let Some(cache) = self.search_cache.as_ref() {
let mut guard = cache.lock().unwrap_or_else(std::sync::PoisonError::into_inner);
let mut guard = cache
.lock()
.unwrap_or_else(std::sync::PoisonError::into_inner);
guard.clear();
}
}
@@ -946,8 +957,8 @@ impl App {
/// git tree) correctly keep `repo: None` — `Metadata.repo` is already
/// `None` for those, so the assignment is a no-op.
fn backfill_repo(&self, hits: &mut [SearchHit]) {
use std::collections::HashMap;
use kebab_core::DocumentId;
use std::collections::HashMap;
// doc_id → Option<String> where None means "not found / no repo"
let mut cache: HashMap<DocumentId, Option<String>> = HashMap::new();
@@ -956,26 +967,24 @@ impl App {
if hit.repo.is_some() {
continue;
}
let repo_val = cache
.entry(hit.doc_id.clone())
.or_insert_with(|| {
// Deliberately non-aborting: a failed store lookup for
// one hit must not abort the whole search response. Log
// the error so it's observable rather than silently
// dropped (review #140 round 1).
match self.sqlite.get_document(&hit.doc_id) {
Ok(opt) => opt.and_then(|doc| doc.metadata.repo),
Err(e) => {
tracing::warn!(
target: "kebab-app",
doc_id = %hit.doc_id,
error = %e,
"backfill_repo: get_document failed; leaving hit.repo = None"
);
None
}
let repo_val = cache.entry(hit.doc_id.clone()).or_insert_with(|| {
// Deliberately non-aborting: a failed store lookup for
// one hit must not abort the whole search response. Log
// the error so it's observable rather than silently
// dropped (review #140 round 1).
match self.sqlite.get_document(&hit.doc_id) {
Ok(opt) => opt.and_then(|doc| doc.metadata.repo),
Err(e) => {
tracing::warn!(
target: "kebab-app",
doc_id = %hit.doc_id,
error = %e,
"backfill_repo: get_document failed; leaving hit.repo = None"
);
None
}
});
}
});
if let Some(r) = repo_val {
hit.repo = Some(r.clone());
}
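The per-document memoization in `backfill_repo` relies on `HashMap::entry(..).or_insert_with(..)`. The sketch below isolates that pattern with a hypothetical `Hit` struct and an injected `lookup` closure standing in for `self.sqlite.get_document`; unlike the original it drops the error silently instead of logging it.

```rust
use std::collections::HashMap;

/// Hypothetical hit type: only the fields the memoization needs.
pub struct Hit {
    pub doc_id: String,
    pub repo: Option<String>,
}

/// Fill `repo` for hits that lack one, calling `lookup` at most once per
/// distinct `doc_id`. A failed lookup is cached as `None`, so one bad
/// document neither aborts the pass nor triggers repeated lookups.
pub fn backfill_repo<F>(hits: &mut [Hit], mut lookup: F)
where
    F: FnMut(&str) -> Result<Option<String>, String>,
{
    // doc_id -> Option<String>, where None means "not found / no repo".
    let mut cache: HashMap<String, Option<String>> = HashMap::new();
    for hit in hits.iter_mut() {
        if hit.repo.is_some() {
            continue;
        }
        let repo_val = cache
            .entry(hit.doc_id.clone())
            .or_insert_with(|| lookup(&hit.doc_id).unwrap_or(None));
        if let Some(r) = repo_val {
            hit.repo = Some(r.clone());
        }
    }
}
```

Caching the `None` outcome as well as the `Some` outcome is what bounds the number of store reads by the number of distinct documents rather than the number of hits.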
@@ -986,10 +995,7 @@ impl App {
/// "switch to --mode lexical" error when embeddings are disabled.
fn require_embeddings(
&self,
) -> Result<(
Arc<dyn Embedder + Send + Sync>,
Arc<LanceVectorStore>,
)> {
) -> Result<(Arc<dyn Embedder + Send + Sync>, Arc<LanceVectorStore>)> {
let emb = self.embedder()?.ok_or_else(|| {
anyhow!(
"embeddings disabled (config.models.embedding.provider == \"none\" \
@@ -1011,8 +1017,16 @@ impl App {
/// the active config. This token surfaces in `SearchHit.index_version`
/// and on snapshot tests; including the chunker version pins it to
/// the chunking policy in effect.
///
/// V009 (2026-05-28): the FTS5 tokenizer changed from trigram to unicode61
/// plus a Korean morphological-decomposition column. The
/// `fts5-v009-korean-morphological` suffix distinguishes this index from the
/// V007 baseline, so the eval runner's config snapshot and search-cache
/// invalidation both pick up the change.
fn lexical_index_version(config: &kebab_config::Config) -> IndexVersion {
IndexVersion(format!("lex:{}", config.chunking.chunker_version))
IndexVersion(format!(
"lex:{}:fts5-v009-korean-morphological",
config.chunking.chunker_version
))
}
/// p9-fb-37: stand-in for the vector retriever in the trace path when
@@ -1116,6 +1130,223 @@ fn backfill_code_lang(hits: &mut [SearchHit]) {
}
}
// ── v0.20.x r2 Enhancement 3: OCR stats + failures inspect ──────────────
/// Wire type for `kebab inspect ocr-stats --json` (`ocr_stats.v1`).
#[derive(serde::Serialize)]
pub struct OcrStatsV1 {
pub schema_version: &'static str,
pub total_events: u64,
pub total_runs: u64,
pub success_count: u64,
pub failure_count: u64,
pub success_rate: f64,
pub p50_ms: Option<u64>,
pub p90_ms: Option<u64>,
pub p99_ms: Option<u64>,
pub max_ms: Option<u64>,
pub by_engine: std::collections::BTreeMap<String, u64>,
pub by_doc: Vec<OcrStatsByDoc>,
}
/// Per-doc breakdown row inside `OcrStatsV1`.
#[derive(serde::Serialize)]
pub struct OcrStatsByDoc {
pub doc_id: String,
pub failure_count: u64,
pub success_count: u64,
pub p90_ms: Option<u64>,
}
/// Wire type for `kebab inspect ocr-failures --json` (`ocr_failures.v1`).
#[derive(serde::Serialize)]
pub struct OcrFailuresV1 {
pub schema_version: &'static str,
pub doc_id: Option<String>,
pub failure_count: u64,
pub failures: Vec<OcrFailureRow>,
}
/// Single failure row inside `OcrFailuresV1`.
#[derive(serde::Serialize)]
pub struct OcrFailureRow {
pub ts: String,
pub page: u32,
pub ms: u64,
pub reason: String,
pub image_byte_size: Option<u64>,
}
impl App {
/// Corpus-wide OCR statistics from the `pdf_ocr_events` SQLite mirror.
pub fn inspect_ocr_stats(&self) -> Result<OcrStatsV1> {
self.inspect_ocr_stats_with_config(&self.config)
}
#[doc(hidden)]
pub fn inspect_ocr_stats_with_config(&self, _cfg: &kebab_config::Config) -> Result<OcrStatsV1> {
use crate::ingest_log::percentiles;
let conn = self.sqlite.read_conn();
// 1. Aggregate counters
let (total_events, success_count, failure_count, total_runs): (u64, u64, u64, u64) = conn
.query_row(
"SELECT COUNT(*), \
SUM(CASE WHEN success=1 THEN 1 ELSE 0 END), \
SUM(CASE WHEN success=0 THEN 1 ELSE 0 END), \
COUNT(DISTINCT run_id) \
FROM pdf_ocr_events",
[],
|r| Ok((r.get(0)?, r.get(1)?, r.get(2)?, r.get(3)?)),
)
.unwrap_or((0, 0, 0, 0));
let success_rate = if total_events == 0 {
0.0
} else {
success_count as f64 / total_events as f64
};
// 2. Latency percentiles from successful events
let samples: Vec<u64> = {
let mut stmt = conn
.prepare("SELECT ms FROM pdf_ocr_events WHERE success=1 ORDER BY ms")
.context("prepare ms query")?;
stmt.query_map([], |r| r.get::<_, u64>(0))
.context("query ms")?
.filter_map(Result::ok)
.collect()
};
let (p50_ms, p90_ms, p99_ms, max_ms) = percentiles(&samples);
// 3. Engine breakdown
let mut by_engine = std::collections::BTreeMap::new();
{
let mut stmt = conn
.prepare("SELECT ocr_engine, COUNT(*) FROM pdf_ocr_events GROUP BY ocr_engine")
.context("prepare engine query")?;
let rows = stmt
.query_map([], |r| Ok((r.get::<_, String>(0)?, r.get::<_, u64>(1)?)))
.context("query engine")?;
for row in rows.filter_map(Result::ok) {
by_engine.insert(row.0, row.1);
}
}
// 4. Top-10 docs by failure count
let by_doc: Vec<OcrStatsByDoc> = {
let mut stmt = conn
.prepare(
"SELECT doc_id, \
SUM(CASE WHEN success=0 THEN 1 ELSE 0 END), \
SUM(CASE WHEN success=1 THEN 1 ELSE 0 END) \
FROM pdf_ocr_events \
WHERE doc_id IS NOT NULL \
GROUP BY doc_id \
ORDER BY 2 DESC \
LIMIT 10",
)
.context("prepare by_doc query")?;
stmt.query_map([], |r| {
Ok(OcrStatsByDoc {
doc_id: r.get(0)?,
failure_count: r.get(1)?,
success_count: r.get(2)?,
p90_ms: None, // per-doc p90 deferred (open question #3)
})
})
.context("query by_doc")?
.filter_map(Result::ok)
.collect()
};
Ok(OcrStatsV1 {
schema_version: "ocr_stats.v1",
total_events,
total_runs,
success_count,
failure_count,
success_rate,
p50_ms,
p90_ms,
p99_ms,
max_ms,
by_engine,
by_doc,
})
}
/// Recent OCR failure rows, optionally filtered by `doc_id`.
pub fn inspect_ocr_failures(
&self,
doc_id: Option<&str>,
limit: usize,
) -> Result<OcrFailuresV1> {
self.inspect_ocr_failures_with_config(&self.config, doc_id, limit)
}
#[doc(hidden)]
pub fn inspect_ocr_failures_with_config(
&self,
_cfg: &kebab_config::Config,
doc_id: Option<&str>,
limit: usize,
) -> Result<OcrFailuresV1> {
let conn = self.sqlite.read_conn();
let failures: Vec<OcrFailureRow> = if let Some(did) = doc_id {
let mut stmt = conn
.prepare(
"SELECT ts, page, ms, COALESCE(reason,'unknown'), image_byte_size \
FROM pdf_ocr_events \
WHERE success=0 AND doc_id=? \
ORDER BY ts DESC \
LIMIT ?",
)
.context("prepare failures by doc_id")?;
stmt.query_map(rusqlite::params![did, limit as i64], |r| {
Ok(OcrFailureRow {
ts: r.get(0)?,
page: r.get(1)?,
ms: r.get(2)?,
reason: r.get(3)?,
image_byte_size: r.get(4)?,
})
})
.context("query failures by doc_id")?
.filter_map(Result::ok)
.collect()
} else {
let mut stmt = conn
.prepare(
"SELECT ts, page, ms, COALESCE(reason,'unknown'), image_byte_size \
FROM pdf_ocr_events \
WHERE success=0 \
ORDER BY ts DESC \
LIMIT ?",
)
.context("prepare failures corpus-wide")?;
stmt.query_map(rusqlite::params![limit as i64], |r| {
Ok(OcrFailureRow {
ts: r.get(0)?,
page: r.get(1)?,
ms: r.get(2)?,
reason: r.get(3)?,
image_byte_size: r.get(4)?,
})
})
.context("query failures corpus-wide")?
.filter_map(Result::ok)
.collect()
};
Ok(OcrFailuresV1 {
schema_version: "ocr_failures.v1",
doc_id: doc_id.map(String::from),
failure_count: failures.len() as u64,
failures,
})
}
}
#[cfg(test)]
mod tests {
use super::*;
@@ -1278,8 +1509,8 @@ mod tests_extractor_dispatch {
MediaType::Code("kotlin".into()),
MediaType::Code("c".into()),
MediaType::Code("cpp".into()),
MediaType::Code("yaml".into()), // not covered by the registry
MediaType::Code("shell".into()), // not covered by the registry
MediaType::Code("yaml".into()), // not covered by the registry
MediaType::Code("shell".into()), // not covered by the registry
MediaType::Audio(AudioType::Wav), // not covered by the registry
];
for sample in &samples {


@@ -126,7 +126,10 @@ fn parse_one(raw: &Value) -> Result<(SearchQuery, SearchOpts), String> {
let text = obj
.get("query")
.and_then(|v| v.as_str())
.ok_or("missing required field: query")?
.ok_or(
"missing required field: query \
(expected {\"query\":\"<text>\",\"mode\":\"lexical|vector|hybrid\",\"k\":3,...})",
)?
.to_string();
let mode = match obj.get("mode").and_then(|v| v.as_str()) {
@@ -215,7 +218,10 @@ fn parse_one(raw: &Value) -> Result<(SearchQuery, SearchOpts), String> {
.and_then(serde_json::Value::as_u64)
.map(|n| n as usize),
cursor: obj.get("cursor").and_then(|v| v.as_str()).map(String::from),
trace: obj.get("trace").and_then(serde_json::Value::as_bool).unwrap_or(false),
trace: obj
.get("trace")
.and_then(serde_json::Value::as_bool)
.unwrap_or(false),
};
Ok((
@@ -299,4 +305,17 @@ mod tests {
assert!(items[1].error.is_some());
assert_eq!(items[1].error.as_ref().unwrap()["code"], "invalid_input");
}
#[test]
fn missing_query_error_message_includes_shape_hint() {
let cfg = open_temp();
let raw = vec![serde_json::json!({"mode": "lexical"})];
let (items, _summary) = bulk_search_with_config(cfg, raw).unwrap();
let err = items[0].error.as_ref().unwrap();
let msg = err["message"].as_str().unwrap();
assert!(
msg.contains("query") && msg.contains("mode"),
"missing shape hint in error message: {msg}"
);
}
}


@@ -0,0 +1,61 @@
//! Derivation-cache payload encoding helpers (design 2026-05-31 §3.3).
//!
//! - embedding: `dimensions × f32` little-endian bytes (1024×4 = 4096 B/chunk).
//! - alias / korean_tokens: UTF-8 as-is (handled inline by the caller — no
//! helper needed, `String::as_bytes` / `String::from_utf8`).
/// Encode an embedding vector as a little-endian `f32` byte string (§3.3).
pub fn encode_embedding(vector: &[f32]) -> Vec<u8> {
let mut out = Vec::with_capacity(vector.len() * 4);
for &v in vector {
out.extend_from_slice(&v.to_le_bytes());
}
out
}
/// Decode a little-endian `f32` byte string back into a vector (§3.3).
///
/// Returns `None` if the payload length is not a multiple of 4 (corrupt
/// entry) — the caller treats this as a cache miss and recomputes, so a bad
/// payload never produces a wrong vector.
pub fn decode_embedding(payload: &[u8]) -> Option<Vec<f32>> {
if payload.len() % 4 != 0 {
return None;
}
Some(
payload
.chunks_exact(4)
.map(|c| f32::from_le_bytes([c[0], c[1], c[2], c[3]]))
.collect(),
)
}
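The doc comment says a corrupt payload is treated as a cache miss and recomputed. A minimal sketch of such a caller follows; `load_or_compute` and its dimension check are assumptions for illustration (the diff only specifies the multiple-of-4 check), and the two codec functions are repeated so the block compiles on its own.

```rust
/// Same encoding as in the diff: each f32 as 4 little-endian bytes.
pub fn encode_embedding(vector: &[f32]) -> Vec<u8> {
    let mut out = Vec::with_capacity(vector.len() * 4);
    for &v in vector {
        out.extend_from_slice(&v.to_le_bytes());
    }
    out
}

/// Returns `None` when the payload length is not a multiple of 4.
pub fn decode_embedding(payload: &[u8]) -> Option<Vec<f32>> {
    if payload.len() % 4 != 0 {
        return None;
    }
    Some(
        payload
            .chunks_exact(4)
            .map(|c| f32::from_le_bytes([c[0], c[1], c[2], c[3]]))
            .collect(),
    )
}

/// Hypothetical caller: reuse the cached payload only when it decodes and
/// has the expected dimension; otherwise recompute. The second tuple field
/// reports whether the cache was used.
pub fn load_or_compute<F>(cached: Option<&[u8]>, dimensions: usize, compute: F) -> (Vec<f32>, bool)
where
    F: FnOnce() -> Vec<f32>,
{
    match cached.and_then(|p| decode_embedding(p)) {
        Some(v) if v.len() == dimensions => (v, true),
        _ => (compute(), false),
    }
}
```

The extra length comparison guards against a payload that is well-formed but was written for a model with a different dimension; whether the real cache needs it depends on how its keys are derived.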
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn roundtrips_vector() {
let v = vec![0.0_f32, 1.5, -2.25, 3.125e10, f32::MIN, f32::MAX];
let bytes = encode_embedding(&v);
assert_eq!(bytes.len(), v.len() * 4);
assert_eq!(decode_embedding(&bytes), Some(v));
}
#[test]
fn empty_vector_roundtrips() {
assert_eq!(encode_embedding(&[]), Vec::<u8>::new());
assert_eq!(decode_embedding(&[]), Some(vec![]));
}
#[test]
fn misaligned_payload_is_none() {
assert_eq!(decode_embedding(&[1, 2, 3]), None);
}
#[test]
fn little_endian_layout_is_fixed() {
// 1.0_f32 == 0x3F800000, little-endian bytes [0x00,0x00,0x80,0x3F].
assert_eq!(encode_embedding(&[1.0]), vec![0x00, 0x00, 0x80, 0x3F]);
}
}


@@ -10,6 +10,6 @@
pub use crate::doctor_signal::{DoctorUnhealthy, NoHitSignal, RefusalSignal};
pub use kebab_config::{ConfigInvalid, ConfigNotFound};
pub use kebab_llm_local::LlmError;
pub use kebab_config::ConfigInvalid;
pub use kebab_store_sqlite::NotIndexed;


@@ -9,7 +9,7 @@
use serde::{Deserialize, Serialize};
use serde_json::{Value, json};
use crate::error_signal::{ConfigInvalid, LlmError, NotIndexed};
use crate::error_signal::{ConfigInvalid, ConfigNotFound, LlmError, NotIndexed};
// p9-fb-34: `stale_cursor` is constructed directly by `cursor::decode`
// and surfaced through `StructuredError` (an anyhow-friendly wrapper
@@ -65,6 +65,20 @@ pub fn classify(err: &anyhow::Error, verbose: bool) -> ErrorV1 {
hint: Some("check `--config <path>` and TOML syntax".to_string()),
};
}
if let Some(s) = err.downcast_ref::<ConfigNotFound>() {
return ErrorV1 {
schema_version: ERROR_V1_ID.to_string(),
code: "config_not_found".to_string(),
message: s.to_string(),
details: json!({
"path": s.path.to_string_lossy(),
}),
hint: Some(
"verify --config <path>; pass an existing toml file or omit --config to use XDG default"
.to_string(),
),
};
}
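`classify` maps typed errors to stable codes by downcasting rather than by matching on message text. The sketch below shows the same idea with the standard library's `dyn Error` downcast instead of `anyhow`, which is a deliberate substitution to keep the example dependency-free. The error structs are simplified stand-ins, and the `not_indexed` code string is an assumption; `config_not_found` and `generic` are the codes named in the diff.

```rust
use std::error::Error;
use std::fmt;
use std::path::PathBuf;

/// Simplified stand-in for `kebab_config::ConfigNotFound`.
#[derive(Debug)]
pub struct ConfigNotFound {
    pub path: PathBuf,
}

impl fmt::Display for ConfigNotFound {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "config not found: {}", self.path.display())
    }
}

impl Error for ConfigNotFound {}

/// Simplified stand-in for `kebab_store_sqlite::NotIndexed`.
#[derive(Debug)]
pub struct NotIndexed;

impl fmt::Display for NotIndexed {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "corpus is not indexed")
    }
}

impl Error for NotIndexed {}

/// Return `(code, path_detail)`. Typed errors are recognised by downcast;
/// anything else falls through to "generic", never by string matching.
pub fn classify(err: &(dyn Error + 'static)) -> (&'static str, Option<String>) {
    if let Some(e) = err.downcast_ref::<ConfigNotFound>() {
        return ("config_not_found", Some(e.path.to_string_lossy().into_owned()));
    }
    if err.downcast_ref::<NotIndexed>().is_some() {
        return ("not_indexed", None); // code string assumed for illustration
    }
    ("generic", None)
}
```

Downcasting keeps the error type as the single source of truth: renaming a message cannot change the code a client sees.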
if let Some(s) = err.downcast_ref::<NotIndexed>() {
return ErrorV1 {
schema_version: ERROR_V1_ID.to_string(),
@@ -158,7 +172,10 @@ mod tests {
});
let v1 = classify(&err, false);
assert_eq!(v1.code, "config_invalid");
assert_eq!(v1.details.get("path").and_then(|p| p.as_str()), Some("/tmp/x.toml"));
assert_eq!(
v1.details.get("path").and_then(|p| p.as_str()),
Some("/tmp/x.toml")
);
assert!(v1.hint.is_some());
}
@@ -182,7 +199,8 @@ mod tests {
// the resulting LlmError::Unreachable maps to "model_unreachable".
let client = reqwest::blocking::Client::builder()
.timeout(std::time::Duration::from_millis(500))
.build().unwrap();
.build()
.unwrap();
let err = client.get("http://127.0.0.1:1").send().unwrap_err();
let llm = LlmError::Unreachable {
endpoint: "http://127.0.0.1:1".to_string(),
@@ -198,7 +216,10 @@ mod tests {
let llm = LlmError::ModelNotPulled("gemma4:e4b".to_string());
let v1 = classify(&anyhow::Error::new(llm), false);
assert_eq!(v1.code, "model_not_pulled");
assert_eq!(v1.details.get("model").and_then(|p| p.as_str()), Some("gemma4:e4b"));
assert_eq!(
v1.details.get("model").and_then(|p| p.as_str()),
Some("gemma4:e4b")
);
}
#[test]
@@ -235,7 +256,10 @@ mod tests {
// (single source of truth). classify must not pattern-match on
// anyhow string contents — that would create two sources of
// truth. The bare anyhow string falls through to "generic".
assert_ne!(v1.code, "stale_cursor", "classify must not produce stale_cursor from bare anyhow string");
assert_ne!(
v1.code, "stale_cursor",
"classify must not produce stale_cursor from bare anyhow string"
);
}
#[test]


@@ -36,9 +36,7 @@ pub fn ensure_kebabignore_entry(workspace_root: &Path) -> Result<()> {
} else {
String::new()
};
let already = existing
.lines()
.any(|line| line.trim() == KEBABIGNORE_LINE);
let already = existing.lines().any(|line| line.trim() == KEBABIGNORE_LINE);
if already {
return Ok(());
}
@@ -57,11 +55,7 @@ pub fn ensure_kebabignore_entry(workspace_root: &Path) -> Result<()> {
/// Copy bytes to `<external_dir>/<blake3-12>.<ext>`. Idempotent — if the
/// destination file already exists with the expected hash, the existing
/// file is reused (no second write). Returns the destination path.
pub fn copy_to_external(
external_dir: &Path,
bytes: &[u8],
ext: &str,
) -> Result<PathBuf> {
pub fn copy_to_external(external_dir: &Path, bytes: &[u8], ext: &str) -> Result<PathBuf> {
let hash = blake3::hash(bytes);
let hex = hash.to_hex();
let prefix = &hex.as_str()[..12];
@@ -82,11 +76,7 @@ pub fn copy_to_external(
/// Internal `yaml_quote` always uses double-quoted YAML form with backslash
/// escapes for `"` / `\` / control chars — agent-supplied titles with
/// special characters are safe.
pub fn inject_frontmatter(
body: &str,
title: &str,
source_uri: Option<&str>,
) -> Result<String> {
pub fn inject_frontmatter(body: &str, title: &str, source_uri: Option<&str>) -> Result<String> {
let head = body.trim_start();
if head.starts_with("---\n") || head.starts_with("---\r\n") || head.starts_with("---\r") {
anyhow::bail!(


@@ -50,14 +50,14 @@ impl App {
fn fetch_chunk(app: &App, id: ChunkId, opts: FetchOpts) -> Result<FetchResult> {
let target = <kebab_store_sqlite::SqliteStore as DocumentStore>::get_chunk(&app.sqlite, &id)?
.ok_or_else(|| {
anyhow::Error::new(StructuredError(ErrorV1 {
schema_version: ERROR_V1_ID.to_string(),
code: "chunk_not_found".to_string(),
message: format!("chunk_id '{}' not found", id.0),
details: serde_json::Value::Null,
hint: None,
}))
})?;
anyhow::Error::new(StructuredError(ErrorV1 {
schema_version: ERROR_V1_ID.to_string(),
code: "chunk_not_found".to_string(),
message: format!("chunk_id '{}' not found", id.0),
details: serde_json::Value::Null,
hint: None,
}))
})?;
let doc_id = target.doc_id.clone();
let doc =
@@ -107,14 +107,14 @@ fn fetch_chunk(app: &App, id: ChunkId, opts: FetchOpts) -> Result<FetchResult> {
fn fetch_doc(app: &App, id: DocumentId, opts: FetchOpts) -> Result<FetchResult> {
let doc = <kebab_store_sqlite::SqliteStore as DocumentStore>::get_document(&app.sqlite, &id)?
.ok_or_else(|| {
anyhow::Error::new(StructuredError(ErrorV1 {
schema_version: ERROR_V1_ID.to_string(),
code: "doc_not_found".to_string(),
message: format!("doc_id '{}' not found", id.0),
details: serde_json::Value::Null,
hint: None,
}))
})?;
anyhow::Error::new(StructuredError(ErrorV1 {
schema_version: ERROR_V1_ID.to_string(),
code: "doc_not_found".to_string(),
message: format!("doc_id '{}' not found", id.0),
details: serde_json::Value::Null,
hint: None,
}))
})?;
let mut text = fmt_canonical_to_markdown(&doc);
let mut truncated = false;
@@ -176,14 +176,14 @@ fn fetch_span(
) -> Result<FetchResult> {
let doc = <kebab_store_sqlite::SqliteStore as DocumentStore>::get_document(&app.sqlite, &id)?
.ok_or_else(|| {
anyhow::Error::new(StructuredError(ErrorV1 {
schema_version: ERROR_V1_ID.to_string(),
code: "doc_not_found".to_string(),
message: format!("doc_id '{}' not found", id.0),
details: serde_json::Value::Null,
hint: None,
}))
})?;
anyhow::Error::new(StructuredError(ErrorV1 {
schema_version: ERROR_V1_ID.to_string(),
code: "doc_not_found".to_string(),
message: format!("doc_id '{}' not found", id.0),
details: serde_json::Value::Null,
hint: None,
}))
})?;
// Reject line-incompatible media types (PDF / audio). `SourceType`
// (markdown / note / paper / reference / inbox) is the *user-facing*


@@ -0,0 +1,446 @@
//! Per-ingest-run structured ndjson log writer (v0.20.x ingest log feature).
//!
//! Each `kebab ingest` run produces one `ingest-{run_id}.ndjson` file in
//! `config.logging.ingest_log_dir`. Records are appended line by line; the
//! last record is always `kind="summary"`. `IngestLogWriter::open` returns
//! `Ok(None)` when `ingest_log_enabled = false` so callers need not branch.
use std::fs::File;
use std::io::{BufWriter, Write};
use std::path::{Path, PathBuf};
use std::time::SystemTime;
use serde::{Deserialize, Serialize};
use time::format_description::well_known::Rfc3339;
pub struct IngestLogWriter {
file: BufWriter<File>,
path: PathBuf,
run_id: String,
started_at: SystemTime,
}
impl IngestLogWriter {
/// Open a new log file. Returns `Ok(None)` when `cfg.ingest_log_enabled == false` (AC-6).
pub fn open(cfg: &kebab_config::LoggingCfg) -> anyhow::Result<Option<Self>> {
if !cfg.ingest_log_enabled {
return Ok(None);
}
let run_id = generate_run_id();
let log_dir = expand_log_dir(&cfg.ingest_log_dir);
std::fs::create_dir_all(&log_dir)?;
// Cleanup before creating the new file (non-critical: warn on error).
if let Err(e) = cleanup_old_logs(&log_dir, cfg.keep_recent_runs, cfg.retention_days) {
tracing::warn!(target: "kebab-app", "ingest log cleanup failed: {e}");
}
let path = log_dir.join(format!("ingest-{run_id}.ndjson"));
let file = BufWriter::new(File::create(&path)?);
Ok(Some(Self {
file,
path,
run_id,
started_at: SystemTime::now(),
}))
}
pub fn write_event(&mut self, event: &LogEvent<'_>) -> anyhow::Result<()> {
serde_json::to_writer(&mut self.file, event)?;
writeln!(self.file)?;
Ok(())
}
pub fn write_summary(&mut self, summary: &IngestSummary) -> anyhow::Result<()> {
serde_json::to_writer(&mut self.file, summary)?;
writeln!(self.file)?;
Ok(())
}
pub fn flush(&mut self) -> anyhow::Result<()> {
self.file.flush()?;
Ok(())
}
pub fn run_id(&self) -> &str {
&self.run_id
}
pub fn path(&self) -> &Path {
&self.path
}
pub fn started_at(&self) -> SystemTime {
self.started_at
}
}
impl Drop for IngestLogWriter {
fn drop(&mut self) {
let _ = self.file.flush();
}
}
/// ISO 8601 compact timestamp + uuid v7 suffix: `20260528T013000Z-abc123de`.
/// uuid v7 is the workspace dep (Cargo.toml); `rand` is not added (spec §6 R-5).
fn generate_run_id() -> String {
use time::macros::format_description;
let now = time::OffsetDateTime::now_utc();
let ts = now
.format(format_description!(
"[year][month][day]T[hour][minute][second]Z"
))
.unwrap_or_else(|_| "19700101T000000Z".to_string());
let uid = uuid::Uuid::now_v7().simple().to_string();
let suffix = &uid[uid.len() - 8..];
format!("{ts}-{suffix}")
}
/// Expand `{state_dir}` placeholder → XDG state dir (spec §6 R-3).
/// Other tilde/env expansion is delegated to `kebab_config::expand_path`.
fn expand_log_dir(path: &Path) -> PathBuf {
let path_str = path.to_string_lossy();
if path_str.contains("{state_dir}") {
let state_dir = kebab_config::Config::xdg_state_dir();
PathBuf::from(path_str.replace("{state_dir}", &state_dir.to_string_lossy()))
} else {
path.to_path_buf()
}
}
/// RFC 3339 UTC timestamp for log records.
#[allow(dead_code)]
pub(crate) fn now_ts() -> String {
time::OffsetDateTime::now_utc()
.format(&Rfc3339)
.unwrap_or_else(|_| "1970-01-01T00:00:00Z".to_string())
}
/// Ingest event record (ndjson line). `kind` is the discriminator.
#[derive(Serialize, Deserialize)]
#[serde(tag = "kind", rename_all = "snake_case")]
pub enum LogEvent<'a> {
Ocr {
ts: String,
/// v0.20.x r2: additive field — doc_id for dual-write SQLite correlation.
/// Round 1 ndjson logs deserialize with doc_id=None (Serde Option default).
#[serde(skip_serializing_if = "Option::is_none")]
doc_id: Option<&'a str>,
doc_path: &'a str,
page: u32,
image_byte_size: Option<u64>,
image_width: Option<u32>,
image_height: Option<u32>,
ms: u64,
chars: u32,
success: bool,
reason: Option<&'a str>,
ocr_engine: &'a str,
},
ParseError {
ts: String,
doc_path: &'a str,
reason: &'a str,
message: &'a str,
},
Skip {
ts: String,
doc_path: &'a str,
reason: &'a str,
detail: Option<&'a str>,
},
Error {
ts: String,
code: &'a str,
message: &'a str,
},
}
/// Final summary record — always the last line of the log file.
/// Explicit `kind` field serializes to `"kind": "summary"`.
#[derive(Serialize, Deserialize)]
pub struct IngestSummary {
pub kind: String,
pub ts: String,
pub run_id: String,
pub scanned: u32,
pub new: u32,
pub errors: u32,
pub ocr_pages: u32,
pub ocr_failures: u32,
pub ocr_p50_ms: Option<u64>,
pub ocr_p90_ms: Option<u64>,
pub ocr_max_ms: Option<u64>,
pub duration_ms: u64,
}
impl IngestSummary {
#[allow(clippy::too_many_arguments)]
pub fn new(
ts: String,
run_id: String,
scanned: u32,
new: u32,
errors: u32,
ocr_pages: u32,
ocr_failures: u32,
ocr_ms_samples: &[u64],
duration_ms: u64,
) -> Self {
let (p50, p90, _p99, max) = percentiles(ocr_ms_samples);
Self {
kind: "summary".to_string(),
ts,
run_id,
scanned,
new,
errors,
ocr_pages,
ocr_failures,
ocr_p50_ms: p50,
ocr_p90_ms: p90,
ocr_max_ms: max,
duration_ms,
}
}
}
/// Simple percentile extraction on a sorted copy of `samples`.
/// Returns `(p50, p90, p99, max)`. All `None` when samples is empty.
/// p99 surfaces via `inspect ocr-stats`; `IngestSummary` uses p50/p90/max only.
pub(crate) fn percentiles(samples: &[u64]) -> (Option<u64>, Option<u64>, Option<u64>, Option<u64>) {
if samples.is_empty() {
return (None, None, None, None);
}
let mut sorted = samples.to_vec();
sorted.sort_unstable();
let n = sorted.len();
let p50 = sorted[(n.saturating_sub(1) * 50) / 100];
let p90 = sorted[(n.saturating_sub(1) * 90) / 100];
let p99 = sorted[(n.saturating_sub(1) * 99) / 100];
let max = *sorted.last().unwrap();
(Some(p50), Some(p90), Some(p99), Some(max))
}
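The index rule used by `percentiles` is `floor((n - 1) * pct / 100)` into the sorted samples. A small worked sketch, with `percentile` and `summarize` as hypothetical names that factor the same arithmetic into one helper:

```rust
/// Pick the sample at index floor((n - 1) * pct / 100) from sorted data.
/// `pct` is expected to be in 0..=100; `None` for an empty slice.
pub fn percentile(sorted: &[u64], pct: usize) -> Option<u64> {
    if sorted.is_empty() {
        return None;
    }
    Some(sorted[(sorted.len() - 1) * pct / 100])
}

/// Return `(p50, p90, p99, max)` for unsorted samples.
pub fn summarize(samples: &[u64]) -> (Option<u64>, Option<u64>, Option<u64>, Option<u64>) {
    let mut sorted = samples.to_vec();
    sorted.sort_unstable();
    (
        percentile(&sorted, 50),
        percentile(&sorted, 90),
        percentile(&sorted, 99),
        sorted.last().copied(),
    )
}
```

For the ten samples 1 through 10 the indices are 4, 8, and 8, so p50 is 5 and both p90 and p99 are 9 while max is 10. With few samples the rule rounds down, which is why p99 can sit below the maximum.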
/// Delete old ingest log files from `log_dir`.
///
/// **Retention rule (§3.4 OR-on-stale semantics):**
/// Keep a file iff BOTH conditions hold: (idx < keep_recent) AND (modified > cutoff).
/// Delete iff (idx >= keep_recent) OR (modified <= cutoff) — either stale condition
/// triggers deletion. Files are indexed newest-first so `idx=0` is the most recent.
pub(crate) fn cleanup_old_logs(
log_dir: &Path,
keep_recent: u32,
retention_days: u32,
) -> anyhow::Result<()> {
let mut entries: Vec<_> = std::fs::read_dir(log_dir)?
.filter_map(Result::ok)
.filter(|e| {
e.path()
.file_name()
.and_then(|n| n.to_str())
.is_some_and(|s| s.starts_with("ingest-") && s.ends_with(".ndjson"))
})
.collect();
// Sort newest-first by mtime (files without mtime go to the end).
entries.sort_by_key(|e| std::cmp::Reverse(e.metadata().ok().and_then(|m| m.modified().ok())));
let cutoff = SystemTime::now()
.checked_sub(std::time::Duration::from_secs(
u64::from(retention_days) * 86400,
))
.unwrap_or(SystemTime::UNIX_EPOCH);
for (idx, entry) in entries.into_iter().enumerate() {
let modified = entry
.metadata()
.ok()
.and_then(|m| m.modified().ok())
.unwrap_or(SystemTime::UNIX_EPOCH);
// Keep iff (idx < keep_recent) AND (modified > cutoff).
if (idx as u32) < keep_recent && modified > cutoff {
continue;
}
if let Err(e) = std::fs::remove_file(entry.path()) {
tracing::warn!(
target: "kebab-app",
"failed to remove old log {}: {e}",
entry.path().display()
);
}
}
Ok(())
}
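The retention rule (keep only when both conditions hold, delete when either fails) is easier to check as a pure function over modification times. In this sketch `stale_indices` is a hypothetical name, the filesystem walk is omitted, and `now` is passed in so the decision is deterministic.

```rust
use std::time::{Duration, SystemTime};

/// Given modification times sorted newest-first, return the indices that the
/// retention rule would delete: idx >= keep_recent OR modified <= cutoff.
pub fn stale_indices(
    mtimes: &[SystemTime],
    keep_recent: u32,
    retention_days: u32,
    now: SystemTime,
) -> Vec<usize> {
    let cutoff = now
        .checked_sub(Duration::from_secs(u64::from(retention_days) * 86400))
        .unwrap_or(SystemTime::UNIX_EPOCH);
    mtimes
        .iter()
        .enumerate()
        // Keep iff (idx < keep_recent) AND (modified > cutoff); delete otherwise.
        .filter(|(idx, modified)| !((*idx as u32) < keep_recent && **modified > cutoff))
        .map(|(idx, _)| idx)
        .collect()
}
```

With files aged 0, 1, 2, and 40 days and a 30-day retention, the 40-day file is deleted whether `keep_recent` is 3 (it is the fourth file) or 10 (it is older than the cutoff), and `keep_recent = 1` deletes everything except the newest file.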
#[cfg(test)]
mod tests {
use super::*;
use kebab_config::LoggingCfg;
use std::time::SystemTime;
use tempfile::TempDir;
#[test]
fn generate_run_id_has_iso_prefix_and_8_hex_suffix() {
let id = generate_run_id();
// Format: YYYYMMDDTHHmmssZ-xxxxxxxx (total len = 16+1+8 = 25)
assert_eq!(id.len(), 25, "run_id len should be 25: {id}");
let (prefix, suffix) = id.split_once('-').expect("run_id should contain '-'");
assert_eq!(prefix.len(), 16, "prefix should be 16 chars: {prefix}");
assert!(prefix.contains('T'), "prefix should contain T: {prefix}");
assert!(prefix.ends_with('Z'), "prefix should end with Z: {prefix}");
assert_eq!(suffix.len(), 8, "suffix should be 8 chars: {suffix}");
assert!(
suffix.chars().all(|c| c.is_ascii_hexdigit()),
"suffix should be hex: {suffix}"
);
}
#[test]
fn expand_log_dir_substitutes_state_dir_placeholder() {
let input = PathBuf::from("{state_dir}/logs");
let expanded = expand_log_dir(&input);
let expected = kebab_config::Config::xdg_state_dir().join("logs");
assert_eq!(expanded, expected);
assert!(!expanded.to_string_lossy().contains("{state_dir}"));
}
#[test]
fn writer_disabled_returns_none() {
let cfg = LoggingCfg {
ingest_log_enabled: false,
ingest_log_dir: PathBuf::from("/tmp/should-not-exist"),
..Default::default()
};
let result = IngestLogWriter::open(&cfg).expect("open should not error");
assert!(result.is_none(), "disabled writer should return None");
}
#[test]
fn writer_writes_one_event_per_line_with_kind_discriminator() {
let tmp = TempDir::new().unwrap();
let cfg = LoggingCfg {
ingest_log_enabled: true,
ingest_log_dir: tmp.path().to_path_buf(),
..Default::default()
};
let mut writer = IngestLogWriter::open(&cfg).unwrap().unwrap();
let path = writer.path().to_path_buf();
writer
.write_event(&LogEvent::Skip {
ts: now_ts(),
doc_path: "a.zip",
reason: "builtin_blacklist",
detail: Some(".zip extension"),
})
.unwrap();
writer
.write_event(&LogEvent::Error {
ts: now_ts(),
code: "ingest_fatal",
message: "something bad",
})
.unwrap();
writer
.write_event(&LogEvent::ParseError {
ts: now_ts(),
doc_path: "weird.pdf",
reason: "lopdf_error",
message: "unexpected EOF",
})
.unwrap();
writer.flush().unwrap();
let contents = std::fs::read_to_string(&path).unwrap();
let lines: Vec<&str> = contents.lines().collect();
assert_eq!(lines.len(), 3, "expected 3 lines, got: {}", lines.len());
for line in &lines {
assert!(
line.starts_with('{'),
"each line should be JSON object: {line}"
);
assert!(
line.contains("\"kind\""),
"each line should have 'kind': {line}"
);
}
}
#[test]
fn drop_flushes_pending_buffer() {
let tmp = TempDir::new().unwrap();
let cfg = LoggingCfg {
ingest_log_enabled: true,
ingest_log_dir: tmp.path().to_path_buf(),
..Default::default()
};
let mut writer = IngestLogWriter::open(&cfg).unwrap().unwrap();
let path = writer.path().to_path_buf();
writer
.write_event(&LogEvent::Error {
ts: now_ts(),
code: "test",
message: "drop flush test",
})
.unwrap();
// Drop without explicit flush — Drop impl should flush BufWriter.
drop(writer);
let contents = std::fs::read_to_string(&path).unwrap();
assert!(
contents.lines().count() >= 1,
"file should have at least 1 line after drop"
);
}
/// AC-7: keep_recent=3 with 5 files, oldest 2 should be deleted.
#[test]
fn cleanup_keeps_recent_n_drops_old() {
let tmp = TempDir::new().unwrap();
let dir = tmp.path();
// Create 5 files with mtime spread across 60 days
for i in 0..5u64 {
let path = dir.join(format!("ingest-file{i}.ndjson"));
std::fs::write(&path, b"x").unwrap();
// Set mtime: file 0 = newest, file 4 = 60 days old
let age_days = i * 15; // 0, 15, 30, 45, 60 days old
let mtime = SystemTime::now()
.checked_sub(std::time::Duration::from_secs(age_days * 86400))
.unwrap();
filetime::set_file_mtime(&path, filetime::FileTime::from_system_time(mtime)).unwrap();
}
// keep_recent=3, retention_days=90 (no time-based deletion)
cleanup_old_logs(dir, 3, 90).unwrap();
let remaining: Vec<_> = std::fs::read_dir(dir)
.unwrap()
.filter_map(Result::ok)
.collect();
assert_eq!(remaining.len(), 3, "expected 3 files after cleanup");
}
/// F5 OR-on-stale: files within keep_recent count but older than retention_days
/// must still be deleted.
#[test]
fn cleanup_drops_stale_even_within_count() {
let tmp = TempDir::new().unwrap();
let dir = tmp.path();
// 2 files, both 90 days old — well past retention_days=30
for i in 0..2u64 {
let path = dir.join(format!("ingest-old{i}.ndjson"));
std::fs::write(&path, b"x").unwrap();
let mtime = SystemTime::now()
.checked_sub(std::time::Duration::from_secs(90 * 86400))
.unwrap();
filetime::set_file_mtime(&path, filetime::FileTime::from_system_time(mtime)).unwrap();
}
// keep_recent=10 (both within count) but retention_days=30 → both stale
cleanup_old_logs(dir, 10, 30).unwrap();
let remaining: Vec<_> = std::fs::read_dir(dir)
.unwrap()
.filter_map(Result::ok)
.collect();
assert_eq!(
remaining.len(),
0,
"stale files must be deleted even within keep_recent"
);
}
}
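
The retention rule documented on `cleanup_old_logs` can be isolated as a pure predicate, which makes the OR-on-stale behaviour easy to check without touching the filesystem. The following is a minimal sketch, not code from this change: `should_keep` and `indices_to_delete` are hypothetical helper names, and `now` is passed in explicitly (an assumption made here for determinism; the original calls `SystemTime::now()`).

```rust
use std::time::{Duration, SystemTime};

/// Hypothetical helper: keep a file iff it is within the newest
/// `keep_recent` entries AND its mtime is newer than `cutoff`.
fn should_keep(idx: usize, modified: SystemTime, keep_recent: u32, cutoff: SystemTime) -> bool {
    (idx as u64) < u64::from(keep_recent) && modified > cutoff
}

/// Hypothetical helper: `mtimes` must already be sorted newest-first.
/// Returns the indices that the retention rule would delete.
fn indices_to_delete(
    mtimes: &[SystemTime],
    keep_recent: u32,
    retention_days: u32,
    now: SystemTime,
) -> Vec<usize> {
    // Same cutoff computation as the original: fall back to the epoch
    // if the subtraction is not representable.
    let cutoff = now
        .checked_sub(Duration::from_secs(u64::from(retention_days) * 86_400))
        .unwrap_or(SystemTime::UNIX_EPOCH);
    mtimes
        .iter()
        .enumerate()
        .filter(|(idx, modified)| !should_keep(*idx, **modified, keep_recent, cutoff))
        .map(|(idx, _)| idx)
        .collect()
}
```

With five files aged 0, 15, 30, 45 and 60 days, `keep_recent = 3` and `retention_days = 90`, only the count condition fires, mirroring `cleanup_keeps_recent_n_drops_old`. With two 90-day-old files, `keep_recent = 10` and `retention_days = 30`, only the age condition fires, mirroring `cleanup_drops_stale_even_within_count`.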

@@ -46,10 +46,21 @@ pub struct AggregateCounts {
/// Ordering invariant per design §2.4a:
///
/// ```text
/// ScanStarted < ScanCompleted < (AssetStarted < AssetFinished)*
/// < (Completed | Aborted)
/// ScanStarted < ScanCompleted
/// < ( AssetStarted
/// [< (PdfOcrStarted < PdfOcrFinished)*]
/// [< AssetChunked]
/// [< AssetTimings]
/// < AssetFinished )*
/// < (Completed | Aborted)
/// ```
///
/// `[]` = optional. `PdfOcr*` is per-PDF asset only (v0.20.0 sub-item 1).
/// `AssetChunked` / `AssetTimings` are the v0.24.0 asset-internal phase
/// events: `AssetChunked` fires once right after chunking (markdown /
/// image / PDF); `AssetTimings` reports per-phase wall-clock once
/// (markdown only).
///
/// Embed-batch events (`embed_batch_started` / `embed_batch_finished`
/// in §2.4a) are reserved for a future iteration and are not emitted
/// by this task; the spec calls them out as "arbitrary position" (optional).
@@ -79,12 +90,59 @@ pub enum IngestEvent {
result: IngestItemKind,
chunks: u32,
},
/// v0.24.0 (additive): emitted right after an asset is chunked, before
/// expansion / embed / store. Surfaces "this document is N chunks"
/// immediately so a single large document no longer looks frozen at
/// `idx/total` while its per-chunk phases churn. `chunks` is the chunk
/// count for asset `idx`.
AssetChunked { idx: u32, total: u32, chunks: u32 },
/// v0.24.0 (additive): per-phase wall-clock (milliseconds) for asset
/// `idx`, emitted once the asset's markdown pipeline finishes. Lets a
/// user see *where* the time went (parse / chunk / embed / store)
/// without parsing logs. Only the markdown path emits this; the
/// image / PDF paths surface `AssetChunked` but skip phase timing (their
/// phase shapes differ — OCR / caption). `expansion_ms` is retained for
/// wire compatibility but is always 0 since doc-side expansion was
/// removed (HOTFIXES 2026-06-03).
AssetTimings {
idx: u32,
total: u32,
parse_ms: u64,
chunk_ms: u64,
expansion_ms: u64,
embed_ms: u64,
store_ms: u64,
},
/// Run finished normally. `counts` is the final aggregate.
Completed { counts: AggregateCounts },
/// Run finished by user cancellation. `counts` is the partial
/// aggregate at the cancel boundary. Emitted by `p9-fb-04`; this
/// task never produces `Aborted`.
Aborted { counts: AggregateCounts },
/// Emitted when OCR starts for a PDF page. v0.20.0 sub-item 1.
PdfOcrStarted { page: u32 },
/// Emitted when OCR ends for a PDF page. v0.20.0 sub-item 1.
/// `skipped = true` means OCR was not performed (no DCTDecode image, or engine failure).
/// `chars = 0` alone cannot distinguish a skip from a 0-char OCR result, so the `skipped` field makes it explicit.
PdfOcrFinished {
page: u32,
ms: u64,
chars: u32,
ocr_engine: String,
skipped: bool,
/// v0.20.x ingest log: raster image byte size (additive minor, optional).
#[serde(skip_serializing_if = "Option::is_none")]
image_byte_size: Option<u64>,
/// v0.20.x ingest log: raster image width in pixels (additive minor, optional).
#[serde(skip_serializing_if = "Option::is_none")]
image_width: Option<u32>,
/// v0.20.x ingest log: raster image height in pixels (additive minor, optional).
#[serde(skip_serializing_if = "Option::is_none")]
image_height: Option<u32>,
/// v0.20.x ingest log: OCR failure reason (additive minor, optional).
#[serde(skip_serializing_if = "Option::is_none")]
failure_reason: Option<String>,
},
}
/// Map a `MediaType` to the short label used by `IngestEvent::AssetStarted`.
@@ -118,10 +176,7 @@ pub fn render_skipped_breakdown(map: &std::collections::BTreeMap<String, u32>) -
/// Best-effort send into an optional `mpsc::Sender`. A dropped receiver
/// is silently absorbed — the ingest hot path must not stall on a slow
/// consumer. Logged at `trace` for diagnostics.
pub(crate) fn emit(
progress: Option<&std::sync::mpsc::Sender<IngestEvent>>,
event: IngestEvent,
) {
pub(crate) fn emit(progress: Option<&std::sync::mpsc::Sender<IngestEvent>>, event: IngestEvent) {
if let Some(tx) = progress {
if tx.send(event).is_err() {
tracing::trace!(
@@ -165,13 +220,69 @@ mod tests {
media: "markdown".into(),
};
let v = serde_json::to_value(&ev).unwrap();
assert_eq!(v.get("kind").and_then(|s| s.as_str()), Some("asset_started"));
assert_eq!(
v.get("kind").and_then(|s| s.as_str()),
Some("asset_started")
);
assert_eq!(v.get("idx").and_then(serde_json::Value::as_u64), Some(1));
assert_eq!(v.get("total").and_then(serde_json::Value::as_u64), Some(10));
assert_eq!(v.get("path").and_then(|s| s.as_str()), Some("notes/foo.md"));
assert_eq!(v.get("media").and_then(|s| s.as_str()), Some("markdown"));
}
#[test]
fn asset_chunked_serializes_with_discriminator() {
// v0.24.0 additive variant — `kind` must be snake_case
// `asset_chunked` so wire v1 consumers branch on it cleanly.
let ev = IngestEvent::AssetChunked {
idx: 3,
total: 10,
chunks: 142,
};
let v = serde_json::to_value(&ev).unwrap();
assert_eq!(
v.get("kind").and_then(|s| s.as_str()),
Some("asset_chunked")
);
assert_eq!(v.get("idx").and_then(serde_json::Value::as_u64), Some(3));
assert_eq!(
v.get("chunks").and_then(serde_json::Value::as_u64),
Some(142)
);
}
#[test]
fn asset_timings_serializes_all_phase_fields() {
let ev = IngestEvent::AssetTimings {
idx: 2,
total: 7,
parse_ms: 12,
chunk_ms: 3,
expansion_ms: 45_000,
embed_ms: 800,
store_ms: 20,
};
let v = serde_json::to_value(&ev).unwrap();
assert_eq!(
v.get("kind").and_then(|s| s.as_str()),
Some("asset_timings")
);
// All five phase fields are present (plain u64, always serialized).
for (field, want) in [
("parse_ms", 12u64),
("chunk_ms", 3),
("expansion_ms", 45_000),
("embed_ms", 800),
("store_ms", 20),
] {
assert_eq!(
v.get(field).and_then(serde_json::Value::as_u64),
Some(want),
"field {field}"
);
}
}
#[test]
fn ingest_event_completed_has_counts() {
let ev = IngestEvent::Completed {
@@ -184,8 +295,14 @@ mod tests {
let v = serde_json::to_value(&ev).unwrap();
assert_eq!(v.get("kind").and_then(|s| s.as_str()), Some("completed"));
let counts = v.get("counts").unwrap();
assert_eq!(counts.get("scanned").and_then(serde_json::Value::as_u64), Some(5));
assert_eq!(counts.get("new").and_then(serde_json::Value::as_u64), Some(2));
assert_eq!(
counts.get("scanned").and_then(serde_json::Value::as_u64),
Some(5)
);
assert_eq!(
counts.get("new").and_then(serde_json::Value::as_u64),
Some(2)
);
}
#[test]
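
The ordering invariant in the `IngestEvent` doc comment reads as a small grammar, so it can be checked with a state machine. The sketch below is an illustration under stated assumptions: `Ev` is a hypothetical payload-free mirror of the event kinds, `State`, `step` and `is_valid_sequence` are names invented here, and the grammar is taken literally (so `Completed` / `Aborted` are only accepted between assets, which may be stricter than what a real cancellation produces).

```rust
/// Hypothetical payload-free mirror of the event kinds named in the invariant.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Ev {
    ScanStarted,
    ScanCompleted,
    AssetStarted,
    PdfOcrStarted,
    PdfOcrFinished,
    AssetChunked,
    AssetTimings,
    AssetFinished,
    Completed,
    Aborted,
}

/// Positions in the grammar. `Asset` covers "after AssetStarted" and
/// "after a finished OCR page", since both accept the same next events.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum State {
    Start,
    Scanning,
    Idle,
    Asset,
    Ocr,
    Chunked,
    Timed,
    Done,
}

/// One transition of the grammar; `None` means the event is out of order.
fn step(state: State, ev: Ev) -> Option<State> {
    match (state, ev) {
        (State::Start, Ev::ScanStarted) => Some(State::Scanning),
        (State::Scanning, Ev::ScanCompleted) => Some(State::Idle),
        (State::Idle, Ev::AssetStarted) => Some(State::Asset),
        (State::Idle, Ev::Completed | Ev::Aborted) => Some(State::Done),
        // (PdfOcrStarted < PdfOcrFinished)* may repeat before chunking.
        (State::Asset, Ev::PdfOcrStarted) => Some(State::Ocr),
        (State::Ocr, Ev::PdfOcrFinished) => Some(State::Asset),
        // AssetChunked and AssetTimings are optional, but ordered.
        (State::Asset, Ev::AssetChunked) => Some(State::Chunked),
        (State::Asset | State::Chunked, Ev::AssetTimings) => Some(State::Timed),
        (State::Asset | State::Chunked | State::Timed, Ev::AssetFinished) => Some(State::Idle),
        _ => None,
    }
}

/// A sequence is valid iff every transition is allowed and the run
/// ends with `Completed` or `Aborted`.
fn is_valid_sequence(events: &[Ev]) -> bool {
    let mut state = State::Start;
    for &ev in events {
        match step(state, ev) {
            Some(next) => state = next,
            None => return false,
        }
    }
    state == State::Done
}
```

A consumer of the progress channel could run such a check in a debug build to catch emitters that report `AssetTimings` before `AssetChunked`, for example.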

File diff suppressed because it is too large

@@ -26,7 +26,9 @@ pub fn init(level: LogLevel) -> Result<WorkerGuard> {
let (nb, guard) = tracing_appender::non_blocking(file_appender);
let env_filter = match level {
LogLevel::Default => EnvFilter::try_from_default_env().unwrap_or_else(|_| EnvFilter::new("warn")),
LogLevel::Default => {
EnvFilter::try_from_default_env().unwrap_or_else(|_| EnvFilter::new("warn"))
}
LogLevel::Verbose => EnvFilter::new("info"),
LogLevel::Debug => EnvFilter::new("debug"),
};

@@ -0,0 +1,362 @@
// crates/kebab-app/src/pdf_ocr_apply.rs
//
// PDF post-extract OCR enrichment. Preserves parser isolation: the helper lives
// in kebab-app so that kebab-parse-pdf does not import
// kebab-parse-image::OcrEngine. This is the PDF-page variant of the image
// path's apply_ocr (kebab-parse-image::ocr::apply_ocr): the image path mutates
// ImageRefBlock.ocr, while the PDF path either mutates Block::Paragraph.text /
// inlines in place (single OCR fallback) or pushes a new Block::Paragraph
// (always_on dual-block).
use std::sync::Arc;
use std::sync::atomic::AtomicBool;
use std::time::Instant;
use anyhow::{Context, Result};
use kebab_core::{
Block, CanonicalDocument, CommonBlock, Inline, Lang, ProvenanceEvent, ProvenanceKind,
SourceSpan, TextBlock, id_for_block,
};
use kebab_parse_image::OcrEngine;
use kebab_parse_pdf::{compute_valid_char_ratio, extract_dctdecode_page_image};
use lopdf::Document as LopdfDocument;
use time::OffsetDateTime;
use tracing::warn;
/// Extract width/height from a JPEG (or any image format) byte slice.
/// Returns `None` on corrupt / unsupported data — callers fall back to
/// `(None, None)` so OCR results remain valid (R-4 mitigation).
fn extract_image_dimensions(bytes: &[u8]) -> Option<(u32, u32)> {
use image::ImageReader;
ImageReader::new(std::io::Cursor::new(bytes))
.with_guessed_format()
.ok()?
.into_dimensions()
.ok()
}
/// Per-page OCR knobs threaded through [`apply_ocr_to_pdf_pages`].
/// Mirrors the `[pdf.ocr]` config block (spec §4.5); the facade
/// (`kebab_app::ingest_one_pdf_asset`) fills these from
/// `kebab_config::Config::pdf::ocr` plus runtime flags (CLI / SIGINT).
pub struct PdfOcrOpts {
/// Master switch. `false` short-circuits to
/// `PdfOcrSummary { pages_ocrd: 0, ms_total: 0 }` without lopdf reparse.
pub enabled: bool,
/// `true` → OCR every page (dual-block path, pushes a new `Block::Paragraph`).
/// `false` → OCR only pages whose text-detect block falls below
/// `min_char_count` or `valid_ratio_threshold` (in-place mutate).
pub always_on: bool,
/// 0.0..=1.0. Falls back to OCR when `compute_valid_char_ratio` of the
/// text-detect block is below this threshold. Default `0.5`.
pub valid_ratio_threshold: f32,
/// Falls back to OCR when the char count of the text-detect block is below
/// this threshold. Empty pages (cover, blank separator) are skipped
/// automatically. Default `20`.
pub min_char_count: u32,
/// Language hint passed to the OCR engine (e.g. `Lang("kor".into())`).
/// `None` → no hint passed to engine.
pub lang_hint: Option<Lang>,
/// Optional per-page cancellation handle. Checked at the start of each page
/// loop iteration; when set to `true`, a `cancelled mid-PDF` error is
/// returned. Specified in plan §6 E4, verifier LOW L-1 resolution, and
/// spec §4.8 line 1159.
pub cancel: Option<Arc<AtomicBool>>,
}
/// OCR run summary returned by [`apply_ocr_to_pdf_pages`] for the caller's
/// `IngestItem.pdf_ocr_pages` + `pdf_ocr_ms_total` wire fields (§4.6.2).
#[derive(Debug)]
pub struct PdfOcrSummary {
/// Number of pages that actually went through the OCR pipeline (skipped pages excluded).
pub pages_ocrd: u32,
/// Cumulative wall-clock duration of successful OCR engine calls (ms).
/// Uses `saturating_add`; overflow-safe up to a 24-day cumulative total.
pub ms_total: u64,
}
/// Post-extract OCR enrichment for PDF. Walks `canonical.blocks` page-by-page,
/// classifies each page via `text_quality::compute_valid_char_ratio` +
/// `min_char_count`, and either:
/// - skips (vector PDF + sufficient text + `always_on=false`),
/// - mutates the text-detect `Block::Paragraph` in-place with OCR output
/// (scanned/mojibake page), or
/// - pushes a new `Block::Paragraph` with dual ordinal (`always_on=true` +
/// vector page).
///
/// Errors:
/// - cancel handle (`opts.cancel = Some(true)`) → `Err("PDF OCR cancelled mid-PDF at page N")`.
/// - lopdf re-parse failure → `Err(...)`.
/// - per-page OCR engine failure or missing DCTDecode image → `ProvenanceKind::Warning`
/// event push + `emit_progress(Finished { skipped: true })` + continue
/// (no `Err` propagation).
///
/// See spec §4.1 + §4.4 for the full pipeline.
pub fn apply_ocr_to_pdf_pages<F>(
canonical: &mut CanonicalDocument,
engine: &dyn OcrEngine,
pdf_bytes: &[u8],
opts: &PdfOcrOpts,
mut emit_progress: F,
) -> Result<PdfOcrSummary>
where
F: FnMut(PdfOcrProgress),
{
if !opts.enabled {
return Ok(PdfOcrSummary {
pages_ocrd: 0,
ms_total: 0,
});
}
let pdf_doc = LopdfDocument::load_mem(pdf_bytes)
.context("kb-app::pdf_ocr_apply: re-parse PDF for image extract")?;
let page_count = pdf_doc.get_pages().len() as u32;
let mut new_events: Vec<ProvenanceEvent> = Vec::new();
let mut ocr_blocks: Vec<Block> = Vec::new();
let mut pages_ocrd: u32 = 0;
let mut ms_total: u64 = 0;
// Map each page to its block index in canonical.blocks (used to choose between
// in-place mutation of the text-detect block and a dual-block push).
// PdfTextExtractor creates one Block::Paragraph + SourceSpan::Page per page
// (§1.4); this loop relies on that invariant.
for page_num in 1..=page_count {
if let Some(cancel) = &opts.cancel {
if cancel.load(std::sync::atomic::Ordering::Relaxed) {
anyhow::bail!("PDF OCR cancelled mid-PDF at page {page_num}");
}
}
let text_block_idx = find_paragraph_block_idx(&canonical.blocks, page_num);
let text = match &canonical.blocks[text_block_idx] {
Block::Paragraph(tb) => tb.text.clone(),
_ => String::new(),
};
let chars = text.chars().count() as u32;
let valid_ratio = compute_valid_char_ratio(&text);
let needs_ocr = chars < opts.min_char_count || valid_ratio < opts.valid_ratio_threshold;
// Decision matrix:
// always_on=true → OCR every page (dual-block when the page already has
// usable text, in-place when needs_ocr).
// always_on=false + needs_ocr → in-place OCR (mutates the text-detect block).
// always_on=false + needs_ocr=false → skip.
let do_ocr = opts.always_on || needs_ocr;
if !do_ocr {
continue;
}
emit_progress(PdfOcrProgress::Started { page: page_num });
let page_image_bytes = if let Some(b) = extract_dctdecode_page_image(&pdf_doc, page_num)? {
b
} else {
let note = format!(
"page={page_num} skipped: no DCTDecode image XObject (vector PDF page or unsupported /Filter — v1 supports DCTDecode passthrough only; see release notes for normalization guidance)"
);
warn!(target: "kebab-app", "{}", note);
new_events.push(ProvenanceEvent {
at: OffsetDateTime::now_utc(),
agent: "kb-parse-pdf".to_string(),
kind: ProvenanceKind::Warning,
note: Some(note),
});
emit_progress(PdfOcrProgress::Finished {
page: page_num,
ms: 0,
chars: 0,
skipped: true,
image_byte_size: None,
image_width: None,
image_height: None,
failure_reason: None,
});
continue;
};
let start = Instant::now();
let ocr = match engine.recognize(&page_image_bytes, opts.lang_hint.as_ref()) {
Ok(t) => t,
Err(e) => {
// OCR failure: warning event + skip (text-detect block left unchanged).
let note = format!(
"page={} OCR failed engine={} version={} err={}",
page_num,
engine.engine_name(),
engine.engine_version(),
e
);
warn!(target: "kebab-app", "{}", note);
new_events.push(ProvenanceEvent {
at: OffsetDateTime::now_utc(),
agent: "kb-parse-pdf".to_string(),
kind: ProvenanceKind::Warning,
note: Some(note),
});
let (image_width, image_height) = extract_image_dimensions(&page_image_bytes)
.map_or((None, None), |(w, h)| (Some(w), Some(h)));
emit_progress(PdfOcrProgress::Finished {
page: page_num,
ms: start.elapsed().as_millis() as u64,
chars: 0,
skipped: true,
image_byte_size: Some(page_image_bytes.len() as u64),
image_width,
image_height,
failure_reason: Some("ocr_error".to_string()),
});
continue;
}
};
let elapsed_ms = start.elapsed().as_millis() as u64;
let chars_ocr = ocr.joined.chars().count() as u32;
pages_ocrd = pages_ocrd.saturating_add(1);
ms_total = ms_total.saturating_add(elapsed_ms);
if opts.always_on && !needs_ocr {
// dual-block path: push a new Block::Paragraph, ordinal = page-1 + page_count.
let ocr_ordinal = (page_num - 1) + page_count;
let span_ocr = SourceSpan::Page {
page: page_num,
char_start: Some(0),
char_end: Some(chars_ocr),
};
let block_id =
id_for_block(&canonical.doc_id, "paragraph", &[], ocr_ordinal, &span_ocr);
let common = CommonBlock {
block_id,
heading_path: Vec::new(),
source_span: span_ocr,
};
ocr_blocks.push(Block::Paragraph(TextBlock {
common,
text: ocr.joined.clone(),
inlines: if ocr.joined.is_empty() {
Vec::new()
} else {
vec![Inline::Text {
text: ocr.joined.clone(),
}]
},
}));
} else {
// in-place mutate: replace text/inlines of the text-detect block (empty or low-valid).
// block_id / ordinal are preserved; only the span's char_end is updated.
if let Block::Paragraph(tb) = &mut canonical.blocks[text_block_idx] {
tb.text = ocr.joined.clone();
tb.inlines = if ocr.joined.is_empty() {
Vec::new()
} else {
vec![Inline::Text {
text: ocr.joined.clone(),
}]
};
if let SourceSpan::Page { char_end, .. } = &mut tb.common.source_span {
*char_end = Some(chars_ocr);
}
}
}
new_events.push(ProvenanceEvent {
at: OffsetDateTime::now_utc(),
agent: "kb-parse-pdf".to_string(),
kind: ProvenanceKind::OcrApplied,
note: Some(format!(
"page={} engine={} version={} regions={} ms={} chars={}",
page_num,
engine.engine_name(),
engine.engine_version(),
ocr.regions.len(),
elapsed_ms,
chars_ocr
)),
});
let (image_width, image_height) = extract_image_dimensions(&page_image_bytes)
.map_or((None, None), |(w, h)| (Some(w), Some(h)));
emit_progress(PdfOcrProgress::Finished {
page: page_num,
ms: elapsed_ms,
chars: chars_ocr,
skipped: false,
image_byte_size: Some(page_image_bytes.len() as u64),
image_width,
image_height,
failure_reason: None,
});
}
canonical.blocks.extend(ocr_blocks);
canonical.provenance.events.extend(new_events);
Ok(PdfOcrSummary {
pages_ocrd,
ms_total,
})
}
fn find_paragraph_block_idx(blocks: &[Block], page_num: u32) -> usize {
blocks
.iter()
.position(|b| match b {
Block::Paragraph(tb) => matches!(
tb.common.source_span,
SourceSpan::Page { page, .. } if page == page_num
),
_ => false,
})
.expect("PdfTextExtractor emits 1 Block::Paragraph per page (invariant)")
}
/// Per-page OCR progress event, emitted through the caller's `emit_progress` closure.
/// `ingest_one_pdf_asset` (Step 6) carries it over as
/// IngestEvent::PdfOcrStarted / PdfOcrFinished (spec §4.6.1 wire schema).
pub enum PdfOcrProgress {
/// Emitted when OCR starts for a page, right before the `engine.recognize` call.
Started {
/// 1-based PDF page number.
page: u32,
},
/// Emitted when OCR ends for a page (success, skip, and failure alike).
Finished {
/// 1-based PDF page number.
page: u32,
/// `engine.recognize` wall-clock duration. On the skip path the meaning is mixed
/// (`0` when there is no DCTDecode image, actual latency before bail on OCR engine failure).
ms: u64,
/// Char count of the OCR result text. `0` on skip.
chars: u32,
/// `true` = skipped because of a missing DCTDecode image or an OCR engine failure.
/// `false` = OCR completed normally.
skipped: bool,
/// v0.20.x ingest log: raster image byte size (additive, optional).
image_byte_size: Option<u64>,
/// v0.20.x ingest log: raster image width in pixels (additive, optional).
image_width: Option<u32>,
/// v0.20.x ingest log: raster image height in pixels (additive, optional).
image_height: Option<u32>,
/// v0.20.x ingest log: failure reason string when OCR failed (additive, optional).
/// Values: "timeout" | "ocr_error" | "network_error" | None (success).
failure_reason: Option<String>,
},
}
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn extract_image_dimensions_valid_jpeg() {
let img = image::RgbImage::new(16, 12);
let mut bytes = Vec::new();
image::DynamicImage::from(img)
.write_to(
&mut std::io::Cursor::new(&mut bytes),
image::ImageFormat::Jpeg,
)
.expect("encode jpeg");
assert_eq!(extract_image_dimensions(&bytes), Some((16, 12)));
}
#[test]
fn extract_image_dimensions_corrupt_returns_none() {
assert_eq!(extract_image_dimensions(b"not a jpeg"), None);
}
}
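
The per-page branching in `apply_ocr_to_pdf_pages` reduces to two booleans plus the dual-block ordinal formula, which can be sketched as a pure function. This is an illustration only: `PageAction`, `needs_ocr` and `decide` are hypothetical names, and the sketch leaves out the skip paths that the original takes when no DCTDecode image is found or the engine fails. `page_num` is assumed to be 1-based, as in the original loop.

```rust
/// Hypothetical summary of what happens to one page.
#[derive(Debug, PartialEq, Eq)]
enum PageAction {
    /// Enough usable text and `always_on = false`: no OCR.
    Skip,
    /// Replace the text-detect block's text with the OCR output.
    ReplaceInPlace,
    /// Keep the text-detect block and append a second block for OCR output.
    AppendOcrBlock { ordinal: u32 },
}

/// Same condition as the original: too few chars OR too low a valid-char ratio.
fn needs_ocr(chars: u32, valid_ratio: f32, min_char_count: u32, valid_ratio_threshold: f32) -> bool {
    chars < min_char_count || valid_ratio < valid_ratio_threshold
}

/// Decide the action for a 1-based `page_num` in a PDF of `page_count` pages.
fn decide(always_on: bool, needs_ocr: bool, page_num: u32, page_count: u32) -> PageAction {
    if !(always_on || needs_ocr) {
        return PageAction::Skip;
    }
    if always_on && !needs_ocr {
        // Dual-block ordinals start after the per-page text blocks:
        // ordinal = (page_num - 1) + page_count.
        PageAction::AppendOcrBlock {
            ordinal: (page_num - 1) + page_count,
        }
    } else {
        PageAction::ReplaceInPlace
    }
}
```

Note that `always_on = true` together with `needs_ocr = true` still takes the in-place path, because there is no usable text-detect content worth keeping next to the OCR output.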

@@ -85,8 +85,7 @@ pub fn enumerate_paths(scope: ResetScope, cfg: &Config) -> Vec<PathBuf> {
ResetScope::All => vec![cfg_dir, data_dir, cache_dir, state_dir],
ResetScope::DataOnly => vec![data_dir, cache_dir, state_dir],
ResetScope::VectorOnly => {
let vector_dir =
expand_path(&cfg.storage.vector_dir, &data_dir.to_string_lossy());
let vector_dir = expand_path(&cfg.storage.vector_dir, &data_dir.to_string_lossy());
vec![vector_dir]
}
ResetScope::ConfigOnly => vec![cfg_dir],
@@ -137,8 +136,8 @@ pub fn estimate_size_bytes(paths: &[PathBuf]) -> u64 {
/// the double scan is acceptable for a rare destructive operation.
pub fn enumerate_orphans(cfg: &Config) -> Result<Vec<WorkspacePath>> {
use kebab_core::DocumentStore as _;
use kebab_source_fs::FsSourceConnector;
use kebab_core::SourceScope;
use kebab_source_fs::FsSourceConnector;
let store = kebab_store_sqlite::SqliteStore::open(cfg)
.context("enumerate_orphans: open SqliteStore")?;
@@ -160,16 +159,13 @@ pub fn enumerate_orphans(cfg: &Config) -> Result<Vec<WorkspacePath>> {
..Default::default()
};
let connector = FsSourceConnector::new(cfg)
.context("enumerate_orphans: build FsSourceConnector")?;
let connector =
FsSourceConnector::new(cfg).context("enumerate_orphans: build FsSourceConnector")?;
let (assets, _skips) = connector
.scan_with_skips(&scope)
.context("enumerate_orphans: scan workspace")?;
let scanned: HashSet<WorkspacePath> = assets
.into_iter()
.map(|a| a.workspace_path)
.collect();
let scanned: HashSet<WorkspacePath> = assets.into_iter().map(|a| a.workspace_path).collect();
let mut orphans: Vec<WorkspacePath> = stored
.into_iter()
@@ -206,8 +202,7 @@ pub fn execute(scope: ResetScope, cfg: &Config) -> Result<ResetReport> {
if !p.exists() {
continue;
}
std::fs::remove_dir_all(p)
.with_context(|| format!("remove {}", p.display()))?;
std::fs::remove_dir_all(p).with_context(|| format!("remove {}", p.display()))?;
removed.push(p.clone());
}
@@ -229,8 +224,7 @@ pub fn execute(scope: ResetScope, cfg: &Config) -> Result<ResetReport> {
/// Execute the `OrphansOnly` variant: reconcile stored docs against the
/// current walker scope without touching any filesystem directory.
fn execute_orphans_only(cfg: &Config) -> Result<ResetReport> {
let orphans = enumerate_orphans(cfg)
.context("execute_orphans_only: enumerate orphans")?;
let orphans = enumerate_orphans(cfg).context("execute_orphans_only: enumerate orphans")?;
if orphans.is_empty() {
return Ok(ResetReport {

@@ -39,6 +39,14 @@ pub struct Capabilities {
pub struct Models {
pub parser_version: String,
pub chunker_version: String,
/// v0.20.1+ (Bug #13). All parser versions active in the corpus.
/// Empty corpus → empty Vec. Backward compat: the `parser_version` field is kept.
#[serde(default)]
pub active_parsers: Vec<String>,
/// v0.20.1+ (Bug #13). All chunker versions active in the corpus.
/// Empty corpus → empty Vec.
#[serde(default)]
pub active_chunkers: Vec<String>,
pub embedding_version: String,
pub prompt_template_version: String,
pub index_version: String,
@@ -100,6 +108,7 @@ const WIRE_SCHEMAS: &[&str] = &[
"doc_summary.v1",
"chunk_inspection.v1",
"doctor.v1",
"config_migration.v1",
"ingest_report.v1",
"ingest_progress.v1",
"reset_report.v1",
@@ -108,6 +117,9 @@ const WIRE_SCHEMAS: &[&str] = &[
"error.v1",
"bulk_search_item.v1",
"bulk_search_response.v1",
// v0.20.x r2 Enhancement 3: OCR statistics + failures introspection.
"ocr_stats.v1",
"ocr_failures.v1",
];
/// Build a [`SchemaV1`] introspection report for the given config.
@@ -142,10 +154,10 @@ fn capabilities_snapshot() -> Capabilities {
rag_multi_turn: true,
search_cache: true,
incremental_ingest: true,
streaming_ask: false,
streaming_ask: true,
http_daemon: false,
mcp_server: true,
single_file_ingest: false,
single_file_ingest: true,
bulk_search: true,
}
}
@@ -160,12 +172,8 @@ fn open_store_for_stats(cfg: &Config) -> anyhow::Result<kebab_store_sqlite::Sqli
kebab_store_sqlite::SqliteStore::open_existing(&db_path)
}
fn collect_stats(
cfg: &Config,
store: &kebab_store_sqlite::SqliteStore,
) -> anyhow::Result<Stats> {
let counts = store
.count_summary_with_threshold(u64::from(cfg.search.stale_threshold_days))?;
fn collect_stats(cfg: &Config, store: &kebab_store_sqlite::SqliteStore) -> anyhow::Result<Stats> {
let counts = store.count_summary_with_threshold(u64::from(cfg.search.stale_threshold_days))?;
let data_dir = kebab_config::expand_path(&cfg.storage.data_dir, "");
let index_bytes = kebab_store_sqlite::stats_ext::index_bytes(&data_dir)
.map_err(|e| anyhow::anyhow!("index_bytes: {e}"))?;
@@ -190,12 +198,16 @@ fn collect_stats(
}
fn collect_models(cfg: &Config, store: &kebab_store_sqlite::SqliteStore) -> Models {
let active_parsers = store.fetch_distinct_parser_versions().unwrap_or_default();
let active_chunkers = store.fetch_distinct_chunker_versions().unwrap_or_default();
Models {
// markdown parser only — pdf-page-v1 (P7) / image extractors (P6)
// maintain their own versions; surface those when SchemaV1.models
// becomes a multi-medium map (P+).
parser_version: kebab_parse_md::PARSER_VERSION.to_string(),
chunker_version: cfg.chunking.chunker_version.clone(),
active_parsers,
active_chunkers,
// EmbeddingModelCfg uses `.model` (not `.id`) — adapt from plan.
embedding_version: cfg.models.embedding.model.clone(),
prompt_template_version: cfg.rag.prompt_template_version.clone(),
@@ -268,3 +280,27 @@ mod tests_stats_ext {
assert_eq!(s.stats.stale_doc_count, 0);
}
}
#[cfg(test)]
mod tests_capabilities {
use super::*;
#[test]
fn capabilities_streaming_ask_matches_cli_surface() {
// Bug #9: `kebab ask --stream` emits answer_event.v1 ndjson correctly (191 events) →
// capabilities.streaming_ask must be true.
let caps = capabilities_snapshot();
assert!(caps.streaming_ask, "streaming_ask must be true (Bug #9)");
}
#[test]
fn capabilities_single_file_ingest_matches_cli_surface() {
// Bug #9: both `kebab ingest-file <path>` and `kebab ingest-stdin --title <T>`
// emit ingest_report.v1 correctly → capabilities.single_file_ingest must be true.
let caps = capabilities_snapshot();
assert!(
caps.single_file_ingest,
"single_file_ingest must be true (Bug #9)"
);
}
}

@@ -10,11 +10,7 @@ use kebab_core::SearchHit;
///
/// p9-fb-32: mirrored in `kebab_rag::pipeline::compute_stale` (dep-boundary
/// rule prevents `kebab-rag → kebab-app`). Update both together.
pub fn compute_stale(
indexed_at: OffsetDateTime,
now: OffsetDateTime,
threshold_days: u32,
) -> bool {
pub fn compute_stale(indexed_at: OffsetDateTime, now: OffsetDateTime, threshold_days: u32) -> bool {
if threshold_days == 0 {
return false;
}
@@ -23,11 +19,7 @@ pub fn compute_stale(
}
/// Sets `stale` on each hit in place using `compute_stale`.
pub fn mark_stale_in_place(
hits: &mut [SearchHit],
now: OffsetDateTime,
threshold_days: u32,
) {
pub fn mark_stale_in_place(hits: &mut [SearchHit], now: OffsetDateTime, threshold_days: u32) {
for h in hits {
h.stale = compute_stale(h.indexed_at, now, threshold_days);
}
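
The hunks above show only the signature of `compute_stale` and its `threshold_days == 0` early return; the rest of the body is outside the diff. The sketch below shows one plausible shape of such a check using only the standard library. It is an assumption, not the project's implementation: `is_stale` is a hypothetical name, `SystemTime` stands in for `OffsetDateTime`, and both the strict `>` comparison and the handling of future timestamps are guesses.

```rust
use std::time::{Duration, SystemTime};

/// Hypothetical stand-in for `compute_stale`, using `SystemTime`.
fn is_stale(indexed_at: SystemTime, now: SystemTime, threshold_days: u32) -> bool {
    // Shown in the diff: a threshold of 0 disables staleness entirely.
    if threshold_days == 0 {
        return false;
    }
    let threshold = Duration::from_secs(u64::from(threshold_days) * 86_400);
    match now.duration_since(indexed_at) {
        // Assumption: stale only when strictly older than the threshold.
        Ok(age) => age > threshold,
        // Assumption: an `indexed_at` in the future is never stale.
        Err(_) => false,
    }
}
```

Because the same logic is mirrored in `kebab_rag::pipeline::compute_stale`, a pure function of this kind is easy to cover with one shared table of cases on both sides of the dependency boundary.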

@@ -29,9 +29,8 @@ fn rust_file_ingests_and_searches_as_code_citation() {
)
.unwrap();
let report =
kebab_app::ingest_with_config(env.config.clone(), env.scope(), false)
.expect("ingest must succeed");
let report = kebab_app::ingest_with_config(env.config.clone(), env.scope(), false)
.expect("ingest must succeed");
assert_eq!(report.errors, 0, "no errors expected: {report:?}");
let items = report.items.as_ref().expect("items present");
@@ -127,9 +126,8 @@ fn rust_code_search_hit_has_repo() {
)
.unwrap();
let report =
kebab_app::ingest_with_config(env.config.clone(), env.scope(), false)
.expect("ingest must succeed");
let report = kebab_app::ingest_with_config(env.config.clone(), env.scope(), false)
.expect("ingest must succeed");
assert_eq!(report.errors, 0, "no ingest errors: {report:?}");
let hits = kebab_app::search_with_config(env.config.clone(), lexical_query("mul"))
@@ -147,8 +145,7 @@ fn rust_code_search_hit_has_repo() {
.and_then(|n| n.to_str())
.map(str::to_owned);
assert_eq!(
h.repo,
expected_repo,
h.repo, expected_repo,
"SearchHit.repo must match the workspace dir name (detect_repo result)"
);
// Also sanity-check code_lang is still filled.
@@ -177,9 +174,8 @@ fn python_file_ingests_and_searches_as_code_citation() {
)
.unwrap();
let report =
kebab_app::ingest_with_config(env.config.clone(), env.scope(), false)
.expect("ingest must succeed");
let report = kebab_app::ingest_with_config(env.config.clone(), env.scope(), false)
.expect("ingest must succeed");
assert!(report.new >= 1, "python file ingested: {report:?}");
@@ -254,9 +250,8 @@ fn typescript_file_ingests_and_searches_as_code_citation() {
)
.unwrap();
let report =
kebab_app::ingest_with_config(env.config.clone(), env.scope(), false)
.expect("ingest must succeed");
let report = kebab_app::ingest_with_config(env.config.clone(), env.scope(), false)
.expect("ingest must succeed");
assert!(report.new >= 1, "ts file ingested: {report:?}");
@@ -331,9 +326,8 @@ fn javascript_file_ingests_and_searches_as_code_citation() {
)
.unwrap();
let report =
kebab_app::ingest_with_config(env.config.clone(), env.scope(), false)
.expect("ingest must succeed");
let report = kebab_app::ingest_with_config(env.config.clone(), env.scope(), false)
.expect("ingest must succeed");
assert!(report.new >= 1, "js file ingested: {report:?}");
@@ -515,7 +509,11 @@ fn java_file_ingests_and_searches_as_code_citation() {
line_start,
..
} => {
assert_eq!(lang.as_deref(), Some("java"), "citation.lang must be 'java'");
assert_eq!(
lang.as_deref(),
Some("java"),
"citation.lang must be 'java'"
);
assert_eq!(
symbol.as_deref(),
Some("com.foo.Foo.bar"),
@@ -586,7 +584,11 @@ fn kotlin_file_ingests_and_searches_as_code_citation() {
line_start,
..
} => {
assert_eq!(lang.as_deref(), Some("kotlin"), "citation.lang must be 'kotlin'");
assert_eq!(
lang.as_deref(),
Some("kotlin"),
"citation.lang must be 'kotlin'"
);
assert_eq!(
symbol.as_deref(),
Some("com.foo.Foo.bar"),
@@ -651,8 +653,8 @@ fn tier2_k8s_yaml_ingest_searchable() {
..Default::default()
},
};
let hits = kebab_app::search_with_config(env.config.clone(), query)
.expect("search must succeed");
let hits =
kebab_app::search_with_config(env.config.clone(), query).expect("search must succeed");
let h = hits
.iter()
@@ -666,7 +668,11 @@ fn tier2_k8s_yaml_ingest_searchable() {
line_start,
..
} => {
assert_eq!(lang.as_deref(), Some("yaml"), "citation.lang must be 'yaml'");
assert_eq!(
lang.as_deref(),
Some("yaml"),
"citation.lang must be 'yaml'"
);
assert_eq!(
symbol.as_deref(),
Some("Deployment/prod/api"),
@@ -730,8 +736,8 @@ fn tier2_dockerfile_ingest_searchable() {
..Default::default()
},
};
let hits = kebab_app::search_with_config(env.config.clone(), query)
.expect("search must succeed");
let hits =
kebab_app::search_with_config(env.config.clone(), query).expect("search must succeed");
let h = hits
.iter()
@@ -813,8 +819,8 @@ fn tier2_cargo_toml_ingest_searchable() {
..Default::default()
},
};
let hits = kebab_app::search_with_config(env.config.clone(), query)
.expect("search must succeed");
let hits =
kebab_app::search_with_config(env.config.clone(), query).expect("search must succeed");
let h = hits
.iter()
@@ -896,8 +902,8 @@ fn tier3_shell_ingest_searchable() {
..Default::default()
},
};
let hits = kebab_app::search_with_config(env.config.clone(), query)
.expect("search must succeed");
let hits =
kebab_app::search_with_config(env.config.clone(), query).expect("search must succeed");
let h = hits
.iter()
@@ -987,8 +993,8 @@ fn tier3_yaml_fallback_picks_up_non_k8s_yaml() {
..Default::default()
},
};
let hits = kebab_app::search_with_config(env.config.clone(), query)
.expect("search must succeed");
let hits =
kebab_app::search_with_config(env.config.clone(), query).expect("search must succeed");
let h = hits
.iter()
@@ -1031,14 +1037,9 @@ fn tier3_yaml_fallback_picks_up_non_k8s_yaml() {
fn rust_file_re_ingest_is_unchanged() {
let env = TestEnv::lexical_only();
std::fs::write(
env.workspace_root.join("stable.rs"),
"pub fn noop() {}\n",
)
.unwrap();
std::fs::write(env.workspace_root.join("stable.rs"), "pub fn noop() {}\n").unwrap();
let r1 =
kebab_app::ingest_with_config(env.config.clone(), env.scope(), false).unwrap();
let r1 = kebab_app::ingest_with_config(env.config.clone(), env.scope(), false).unwrap();
let item1 = r1
.items
.as_ref()
@@ -1049,8 +1050,7 @@ fn rust_file_re_ingest_is_unchanged() {
.unwrap();
assert_eq!(item1.kind, IngestItemKind::New);
let r2 =
kebab_app::ingest_with_config(env.config.clone(), env.scope(), false).unwrap();
let r2 = kebab_app::ingest_with_config(env.config.clone(), env.scope(), false).unwrap();
let item2 = r2
.items
.unwrap()
@@ -1081,9 +1081,8 @@ fn tier3_yaml_fallback_reingest_is_unchanged() {
)
.unwrap();
let report1 =
kebab_app::ingest_with_config(env.config.clone(), env.scope(), false)
.expect("first ingest");
let report1 = kebab_app::ingest_with_config(env.config.clone(), env.scope(), false)
.expect("first ingest");
let item1 = report1
.items
.as_ref()
@@ -1093,7 +1092,8 @@ fn tier3_yaml_fallback_reingest_is_unchanged() {
.expect("docker-compose.yml in first report");
assert!(
matches!(item1.kind, IngestItemKind::New),
"first ingest must be New, got {:?}", item1.kind
"first ingest must be New, got {:?}",
item1.kind
);
assert_eq!(
item1.chunker_version.as_ref().map(|c| c.0.as_str()),
@@ -1101,9 +1101,8 @@ fn tier3_yaml_fallback_reingest_is_unchanged() {
"first ingest must use Tier 3 fallback chunker"
);
let report2 =
kebab_app::ingest_with_config(env.config.clone(), env.scope(), false)
.expect("second ingest");
let report2 = kebab_app::ingest_with_config(env.config.clone(), env.scope(), false)
.expect("second ingest");
let item2 = report2
.items
.as_ref()
@@ -1113,7 +1112,8 @@ fn tier3_yaml_fallback_reingest_is_unchanged() {
.expect("docker-compose.yml in second report");
assert!(
matches!(item2.kind, IngestItemKind::Unchanged),
"second ingest must be Unchanged, got {:?}", item2.kind
"second ingest must be Unchanged, got {:?}",
item2.kind
);
}
@@ -1163,8 +1163,8 @@ fn tier1_c_ingest_searchable() {
..Default::default()
},
};
let hits = kebab_app::search_with_config(env.config.clone(), query)
.expect("search must succeed");
let hits =
kebab_app::search_with_config(env.config.clone(), query).expect("search must succeed");
let h = hits
.iter()
@@ -1247,8 +1247,8 @@ fn tier1_cpp_ingest_searchable() {
..Default::default()
},
};
let hits = kebab_app::search_with_config(env.config.clone(), query)
.expect("search must succeed");
let hits =
kebab_app::search_with_config(env.config.clone(), query).expect("search must succeed");
let h = hits
.iter()
@@ -1266,7 +1266,9 @@ fn tier1_cpp_ingest_searchable() {
// Symbol could be "kebab::chunk::Foo" (class) or "kebab::chunk::Foo::bar"
// (method) depending on which chunk ranks first.
assert!(
symbol.as_deref().is_some_and(|s| s.starts_with("kebab::chunk::Foo")),
symbol
.as_deref()
.is_some_and(|s| s.starts_with("kebab::chunk::Foo")),
"C++ symbol must start with namespace::Class prefix, got {symbol:?}"
);
assert!(*line_start >= 1, "line_start must be >=1");
@@ -1335,8 +1337,8 @@ fn tier2_k8s_multi_resource_yaml_ingests_without_collision() {
..Default::default()
},
};
let hits = kebab_app::search_with_config(env.config.clone(), query)
.expect("search must succeed");
let hits =
kebab_app::search_with_config(env.config.clone(), query).expect("search must succeed");
assert!(
hits.len() >= 2,
"expected ≥2 hits (Deployment + Service), got {}",
@@ -1359,9 +1361,8 @@ fn tier3_shell_reingest_is_unchanged() {
)
.unwrap();
let report1 =
kebab_app::ingest_with_config(env.config.clone(), env.scope(), false)
.expect("first ingest");
let report1 = kebab_app::ingest_with_config(env.config.clone(), env.scope(), false)
.expect("first ingest");
let item1 = report1
.items
.as_ref()
@@ -1371,12 +1372,12 @@ fn tier3_shell_reingest_is_unchanged() {
.expect("deploy.sh in first report");
assert!(
matches!(item1.kind, IngestItemKind::New),
"first ingest must be New, got {:?}", item1.kind
"first ingest must be New, got {:?}",
item1.kind
);
let report2 =
kebab_app::ingest_with_config(env.config.clone(), env.scope(), false)
.expect("second ingest");
let report2 = kebab_app::ingest_with_config(env.config.clone(), env.scope(), false)
.expect("second ingest");
let item2 = report2
.items
.as_ref()
@@ -1386,6 +1387,7 @@ fn tier3_shell_reingest_is_unchanged() {
.expect("deploy.sh in second report");
assert!(
matches!(item2.kind, IngestItemKind::Unchanged),
"shell reingest must be Unchanged, got {:?}", item2.kind
"shell reingest must be Unchanged, got {:?}",
item2.kind
);
}


@@ -0,0 +1,60 @@
use std::sync::Mutex;
use anyhow::Result;
use kebab_core::{Lang, OcrText};
use kebab_parse_image::OcrEngine;
pub struct MockOcrEngine {
expected_texts: Vec<String>,
call_index: Mutex<usize>,
fail: bool,
}
impl MockOcrEngine {
/// Single text (backward-compat ctor for pdf_ocr_apply.rs 10 sites).
pub fn single(text: impl Into<String>, fail: bool) -> Self {
Self {
expected_texts: vec![text.into()],
call_index: Mutex::new(0),
fail,
}
}
/// Per-page texts (cursor advances per recognize call).
pub fn per_page(texts: Vec<String>, fail: bool) -> Self {
Self {
expected_texts: texts,
call_index: Mutex::new(0),
fail,
}
}
}
impl OcrEngine for MockOcrEngine {
fn engine_name(&self) -> &'static str {
"mock-ocr"
}
fn engine_version(&self) -> String {
"mock-v1".to_string()
}
fn recognize(&self, _img: &[u8], _hint: Option<&Lang>) -> Result<OcrText> {
if self.fail {
anyhow::bail!("mock failure");
}
let mut idx = self.call_index.lock().unwrap();
let text = self
.expected_texts
.get(*idx)
.cloned()
.unwrap_or_else(|| self.expected_texts.last().cloned().unwrap_or_default());
*idx += 1;
Ok(OcrText {
joined: text,
regions: vec![],
engine: "mock-ocr".to_string(),
engine_version: "mock-v1".to_string(),
})
}
}
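
The cursor behaviour in `recognize` above (advance one text per call, then keep repeating the last one) can be sketched in isolation. This is a minimal std-only illustration, not code from the kebab crates: `TextCursor` and `next_text` are hypothetical names, and the OCR types are left out so that only the indexing logic remains.

```rust
use std::sync::Mutex;

/// Hypothetical stand-in for the mock's text cursor (not part of kebab).
pub struct TextCursor {
    texts: Vec<String>,
    next: Mutex<usize>,
}

impl TextCursor {
    pub fn new(texts: Vec<String>) -> Self {
        Self {
            texts,
            next: Mutex::new(0),
        }
    }

    /// Returns the text for the current call and advances the cursor.
    /// Once the list is exhausted the last text is repeated; an empty
    /// list yields an empty string.
    pub fn next_text(&self) -> String {
        let mut idx = self.next.lock().unwrap();
        let text = self
            .texts
            .get(*idx)
            .cloned()
            .unwrap_or_else(|| self.texts.last().cloned().unwrap_or_default());
        *idx += 1;
        text
    }
}
```

The `Mutex` is there because the trait method takes `&self`, so the call counter needs interior mutability. In a strictly single-threaded test a `Cell<usize>` would presumably work too, unless the trait also requires `Sync`.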


@@ -93,8 +93,7 @@ impl TestEnv {
/// directly. Caller can invoke this multiple times to simulate
/// re-opening the binary after a corpus revision bump.
pub fn app(&self) -> kebab_app::App {
kebab_app::App::open_with_config(self.config.clone())
.expect("App::open_with_config")
kebab_app::App::open_with_config(self.config.clone()).expect("App::open_with_config")
}
}
@@ -169,3 +168,5 @@ fn copy_dir_recursive(src: &Path, dest: &Path) {
}
}
}
pub mod mock_ocr;


@@ -0,0 +1,85 @@
use std::fs;
#[test]
fn migrate_writes_backup_and_atomic_with_dry_run_noop() {
let dir = tempfile::tempdir().unwrap();
let cfg = dir.path().join("config.toml");
fs::write(
&cfg,
"schema_version = 1\n\n[workspace]\nroot = \"/n\"\ninclude = [\"*.md\"]\n",
)
.unwrap();
// dry-run: file and backup are left unchanged.
let report = kebab_app::config_migrate_with_config_path(Some(&cfg), true).unwrap();
assert!(report.changed);
assert!(report.dry_run);
assert!(report.backup_path.is_none());
assert!(!dir.path().join("config.toml.bak").exists());
assert!(
fs::read_to_string(&cfg).unwrap().contains("include"),
"dry-run modified file"
);
// Real apply: backup is created and the file is updated.
let report = kebab_app::config_migrate_with_config_path(Some(&cfg), false).unwrap();
assert!(report.changed);
assert!(!report.dry_run);
assert!(report.backup_path.is_some());
assert!(dir.path().join("config.toml.bak").exists());
let new = fs::read_to_string(&cfg).unwrap();
assert!(!new.contains("include"));
assert!(new.contains("[ingest.code]"));
// Idempotent: a re-run reports changed=false.
let report = kebab_app::config_migrate_with_config_path(Some(&cfg), false).unwrap();
assert!(!report.changed);
}
#[test]
fn migrate_missing_file_errors() {
let dir = tempfile::tempdir().unwrap();
let cfg = dir.path().join("nope.toml");
assert!(kebab_app::config_migrate_with_config_path(Some(&cfg), false).is_err());
}
#[test]
fn annotated_default_serialization_contains_section_comments() {
let doc = kebab_config::migrate::annotated_default_document();
let text = doc.to_string();
assert!(
text.contains("code ingest skip 정책"),
"section comment missing:\n{text}"
);
assert!(text.contains("[ingest.code]"));
}
#[test]
fn doctor_flags_outdated_config() {
let dir = tempfile::tempdir().unwrap();
let cfg = dir.path().join("config.toml");
fs::write(
&cfg,
"schema_version = 1\n\n[workspace]\nroot = \"/n\"\ninclude=[\"*.md\"]\n",
)
.unwrap();
let report = kebab_app::doctor_with_config_path(Some(&cfg)).unwrap();
let check = report
.checks
.iter()
.find(|c| c.name == "config_migration")
.unwrap();
assert!(!check.ok, "outdated config should fail check");
assert!(check.hint.as_deref().unwrap().contains("config migrate"));
assert!(!report.ok, "overall doctor should be false");
// Passes after migrate.
kebab_app::config_migrate_with_config_path(Some(&cfg), false).unwrap();
let report = kebab_app::doctor_with_config_path(Some(&cfg)).unwrap();
let check = report
.checks
.iter()
.find(|c| c.name == "config_migration")
.unwrap();
assert!(check.ok, "after migrate should pass");
}
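
The contract these migration tests check (dry-run touches nothing, a real run leaves a `.bak` copy and rewrites the file, a second run is a no-op, a missing file is an error) could be met by a routine along the following lines. This is only a sketch under assumptions: `replace_with_backup` and `WriteReport` are hypothetical names, the `.tmp` sibling file is an illustrative choice, and the real `config_migrate_with_config_path` may well be implemented differently.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Hypothetical report type, loosely mirroring the fields the tests assert on.
#[derive(Debug)]
pub struct WriteReport {
    pub changed: bool,
    pub dry_run: bool,
    pub backup_path: Option<PathBuf>,
}

/// Appends `suffix` to the full file name (config.toml -> config.toml.bak).
fn with_suffix(path: &Path, suffix: &str) -> PathBuf {
    let mut name = path.as_os_str().to_owned();
    name.push(suffix);
    PathBuf::from(name)
}

/// Replaces `path` with `new_text`, keeping a `.bak` copy of the old file.
/// With `dry_run = true` nothing on disk is touched.
pub fn replace_with_backup(path: &Path, new_text: &str, dry_run: bool) -> io::Result<WriteReport> {
    // Reading first means a missing file surfaces as an error.
    let old_text = fs::read_to_string(path)?;
    let changed = old_text != new_text;
    if !changed || dry_run {
        return Ok(WriteReport { changed, dry_run, backup_path: None });
    }

    let backup = with_suffix(path, ".bak");
    fs::copy(path, &backup)?;

    // Write a sibling temp file, then rename it over the target so readers
    // see either the old or the new content, never a partial write.
    let tmp = with_suffix(path, ".tmp");
    fs::write(&tmp, new_text)?;
    fs::rename(&tmp, path)?;

    Ok(WriteReport { changed, dry_run, backup_path: Some(backup) })
}
```

Keeping the temp file next to the target matters for the rename step, since a rename across filesystems is not guaranteed to be atomic.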


@@ -12,7 +12,11 @@ fn open(env: &common::TestEnv) -> App {
#[test]
fn fetch_chunk_returns_target_only_when_no_context() {
let env = common::TestEnv::new();
common::ingest_md(&env, "a.md", "# Title\n\nFirst paragraph.\n\n## Section\n\nSecond.\n");
common::ingest_md(
&env,
"a.md",
"# Title\n\nFirst paragraph.\n\n## Section\n\nSecond.\n",
);
let app = open(&env);
// Find a chunk via search to obtain its id.
@@ -42,7 +46,8 @@ fn fetch_chunk_with_context_returns_neighbors() {
// match. The earlier fixture used 2-char tokens like `A1`/`A3` for
// section bodies — those zero-hit under trigram. Use 5-char unique
// words per section so the query can pin one chunk deterministically.
let body = "# H1\n\napples\n\n# H2\n\nbanana\n\n# H3\n\ncherry\n\n# H4\n\ndurian\n\n# H5\n\nelder\n";
let body =
"# H1\n\napples\n\n# H2\n\nbanana\n\n# H3\n\ncherry\n\n# H4\n\ndurian\n\n# H5\n\nelder\n";
common::ingest_md(&env, "multi.md", body);
let app = env.app();
@@ -110,7 +115,10 @@ fn fetch_doc_returns_serialized_markdown() {
.unwrap();
assert_eq!(result.kind, FetchKind::Doc);
let text = result.text.expect("doc text");
assert!(text.contains("Heading One"), "doc text contains heading: {text:?}");
assert!(
text.contains("Heading One"),
"doc text contains heading: {text:?}"
);
assert!(text.contains("First paragraph"), "doc text contains body");
assert!(!result.truncated);
}
@@ -155,7 +163,11 @@ fn fetch_doc_with_max_tokens_truncates() {
.unwrap();
assert!(result.truncated);
let text = result.text.expect("doc text");
assert!(text.chars().count() <= 100, "trimmed text len {}", text.chars().count());
assert!(
text.chars().count() <= 100,
"trimmed text len {}",
text.chars().count()
);
}
#[test]
@@ -292,8 +304,7 @@ fn fetch_span_line_start_beyond_total_returns_empty_text() {
fn fetch_chunk_context_at_first_chunk_clamps_lower_bound() {
let env = common::TestEnv::new();
// Multi-chunk markdown so context ±N has neighbors.
let body =
"# H1\n\nFirst chunk text body.\n\n# H2\n\nSecond chunk.\n\n# H3\n\nThird chunk.\n";
let body = "# H1\n\nFirst chunk text body.\n\n# H2\n\nSecond chunk.\n\n# H3\n\nThird chunk.\n";
common::ingest_md(&env, "boundary.md", body);
let app = env.app();
let q = kebab_core::SearchQuery {


@@ -16,8 +16,8 @@
mod common;
use common::TestEnv;
use kebab_app::ingest_with_config_opts;
use kebab_app::IngestOpts;
use kebab_app::ingest_with_config_opts;
use kebab_core::{DocFilter, DocumentStore, SearchMode, SearchQuery, SourceScope};
/// Helper: open the store via `TestEnv` and run `list_documents`.
@@ -125,17 +125,10 @@ fn include_scope_narrowing_does_not_purge() {
include: vec!["**/*.rs".to_string()],
exclude: env.config.workspace.exclude.clone(),
};
let first = ingest_with_config_opts(
env.config.clone(),
wide_scope,
false,
IngestOpts::default(),
)
.expect("first ingest (wide) must succeed");
assert!(
first.new >= 2,
"expected at least 2 new docs: {first:?}"
);
let first =
ingest_with_config_opts(env.config.clone(), wide_scope, false, IngestOpts::default())
.expect("first ingest (wide) must succeed");
assert!(first.new >= 2, "expected at least 2 new docs: {first:?}");
assert_eq!(
first.purged_deleted_files, 0,
"no purges on first ingest: {first:?}"


@@ -24,8 +24,7 @@ use wiremock::{Mock, MockServer, ResponseTemplate};
/// inspectable in stored DB rows.
fn write_red_png(root: &Path, name: &str) -> std::path::PathBuf {
use image::{ImageBuffer, Rgb};
let img: ImageBuffer<Rgb<u8>, _> =
ImageBuffer::from_fn(100, 50, |_, _| Rgb([255, 0, 0]));
let img: ImageBuffer<Rgb<u8>, _> = ImageBuffer::from_fn(100, 50, |_, _| Rgb([255, 0, 0]));
let path = root.join(name);
img.save(&path).expect("write PNG fixture");
path
@@ -80,7 +79,12 @@ async fn ingest_image_with_ocr_produces_chunk_containing_ocr_text() {
// Counters: scanned should include the PNG; new ≥ 1 (markdown
// fixtures from the workspace tree may also count).
assert!(report.scanned >= 1, "scanned={}, items={:?}", report.scanned, report.items);
assert!(
report.scanned >= 1,
"scanned={}, items={:?}",
report.scanned,
report.items
);
assert_eq!(report.errors, 0, "no errors on lenient OCR path");
// Locate the image doc in the report items.
@@ -94,7 +98,11 @@ async fn ingest_image_with_ocr_produces_chunk_containing_ocr_text() {
kebab_core::IngestItemKind::New,
"image asset must be classified New on first ingest"
);
assert_eq!(img_item.chunk_count, Some(1), "image emits exactly one chunk");
assert_eq!(
img_item.chunk_count,
Some(1),
"image emits exactly one chunk"
);
// Inspect the stored chunk text via kb-app's inspect_chunk facade.
let doc_id = img_item.doc_id.clone().expect("image doc id");
@@ -117,10 +125,12 @@ async fn ingest_image_with_ocr_produces_chunk_containing_ocr_text() {
// Sanity: the doc was actually persisted into SQLite (kb-app's
// list_docs facade reads the same store the chunker writes to).
let summaries = kebab_app::list_docs_with_config(cfg, kebab_core::DocFilter::default())
.expect("list_docs");
let summaries =
kebab_app::list_docs_with_config(cfg, kebab_core::DocFilter::default()).expect("list_docs");
assert!(
summaries.iter().any(|s| s.doc_path.0.ends_with("diagram.png")),
summaries
.iter()
.any(|s| s.doc_path.0.ends_with("diagram.png")),
"image doc must appear in list_docs"
);
@@ -171,8 +181,7 @@ async fn ingest_image_with_ocr_and_caption_populates_both_fields() {
.iter()
.find(|i| i.doc_path.0.ends_with("diagram.png"))
.unwrap();
let doc = kebab_app::inspect_doc_with_config(cfg, img_item.doc_id.as_ref().unwrap())
.unwrap();
let doc = kebab_app::inspect_doc_with_config(cfg, img_item.doc_id.as_ref().unwrap()).unwrap();
let block = match &doc.blocks[0] {
kebab_core::Block::ImageRef(b) => b,
_ => unreachable!(),
@@ -267,8 +276,7 @@ async fn image_indexed_with_filename_when_ocr_and_caption_disabled() {
let cfg_clone = cfg.clone();
let scope = env.scope();
let report = spawn_blocking(move || {
kebab_app::ingest_with_config(cfg_clone, scope, false)
.expect("ingest with no OCR/caption")
kebab_app::ingest_with_config(cfg_clone, scope, false).expect("ingest with no OCR/caption")
})
.await
.expect("task");
@@ -282,8 +290,7 @@ async fn image_indexed_with_filename_when_ocr_and_caption_disabled() {
.find(|i| i.doc_path.0.ends_with("raw.png"))
.unwrap();
assert_eq!(img_item.chunk_count, Some(1), "image emits one chunk");
let doc = kebab_app::inspect_doc_with_config(cfg, img_item.doc_id.as_ref().unwrap())
.unwrap();
let doc = kebab_app::inspect_doc_with_config(cfg, img_item.doc_id.as_ref().unwrap()).unwrap();
let block = match &doc.blocks[0] {
kebab_core::Block::ImageRef(b) => b,
_ => unreachable!(),
@@ -392,16 +399,12 @@ async fn re_ingest_image_produces_unchanged_with_same_doc_id() {
let scope1 = scope.clone();
let scope2 = scope.clone();
let r1 = spawn_blocking(move || {
kebab_app::ingest_with_config(cfg1, scope1, false).unwrap()
})
.await
.unwrap();
let r2 = spawn_blocking(move || {
kebab_app::ingest_with_config(cfg2, scope2, false).unwrap()
})
.await
.unwrap();
let r1 = spawn_blocking(move || kebab_app::ingest_with_config(cfg1, scope1, false).unwrap())
.await
.unwrap();
let r2 = spawn_blocking(move || kebab_app::ingest_with_config(cfg2, scope2, false).unwrap())
.await
.unwrap();
let id1 = r1
.items


@@ -21,11 +21,16 @@ fn second_ingest_of_unchanged_corpus_marks_all_unchanged() {
// First ingest — populates the DB. Use the legacy entry so the
// assertions cover the "previously ingested" set without needing
// IngestOpts::default() to behave identically.
let first =
ingest_with_config(env.config.clone(), env.scope(), false).unwrap();
let first = ingest_with_config(env.config.clone(), env.scope(), false).unwrap();
assert_eq!(first.errors, 0, "first ingest must not error: {first:?}");
assert!(first.new >= 1, "first ingest must create new docs: {first:?}");
assert_eq!(first.unchanged, 0, "first ingest cannot have unchanged: {first:?}");
assert!(
first.new >= 1,
"first ingest must create new docs: {first:?}"
);
assert_eq!(
first.unchanged, 0,
"first ingest cannot have unchanged: {first:?}"
);
let scanned = first.scanned;
@@ -38,9 +43,15 @@ fn second_ingest_of_unchanged_corpus_marks_all_unchanged() {
IngestOpts::default(),
)
.unwrap();
assert_eq!(second.scanned, scanned, "second scanned matches first: {second:?}");
assert_eq!(
second.scanned, scanned,
"second scanned matches first: {second:?}"
);
assert_eq!(second.new, 0, "no new docs on re-ingest: {second:?}");
assert_eq!(second.updated, 0, "nothing should be marked updated: {second:?}");
assert_eq!(
second.updated, 0,
"nothing should be marked updated: {second:?}"
);
assert_eq!(
second.unchanged, scanned,
"every doc must be Unchanged: {second:?}"
@@ -52,10 +63,12 @@ fn second_ingest_of_unchanged_corpus_marks_all_unchanged() {
fn force_reingest_bypasses_skip() {
let env = TestEnv::lexical_only();
let first =
ingest_with_config(env.config.clone(), env.scope(), false).unwrap();
let first = ingest_with_config(env.config.clone(), env.scope(), false).unwrap();
assert_eq!(first.errors, 0, "first ingest must not error: {first:?}");
assert!(first.new >= 1, "first ingest must create new docs: {first:?}");
assert!(
first.new >= 1,
"first ingest must create new docs: {first:?}"
);
let scanned = first.scanned;
let second = ingest_with_config_opts(


@@ -107,13 +107,9 @@ fn cancel_none_is_uncancellable_default() {
// ingest_with_config_progress (no cancel) runs to completion.
let env = TestEnv::lexical_only();
let (tx, rx) = mpsc::channel::<IngestEvent>();
let report = kebab_app::ingest_with_config_progress(
env.config.clone(),
env.scope(),
true,
Some(tx),
)
.unwrap();
let report =
kebab_app::ingest_with_config_progress(env.config.clone(), env.scope(), true, Some(tx))
.unwrap();
assert_eq!(report.scanned, 3);
assert_eq!(report.new, 3);


@@ -107,5 +107,8 @@ fn ingest_file_errors_on_unsupported_extension() {
let err = kebab_app::ingest_file_with_config(cfg, &docx).unwrap_err();
assert!(err.to_string().contains("unsupported extension"), "{err}");
assert!(err.to_string().contains(".docx") || err.to_string().contains("docx"), "{err}");
assert!(
err.to_string().contains(".docx") || err.to_string().contains("docx"),
"{err}"
);
}


@@ -8,8 +8,7 @@ use common::TestEnv;
#[test]
fn ingest_then_list_inspects_round_trip() {
let env = TestEnv::lexical_only();
let report =
kebab_app::ingest_with_config(env.config.clone(), env.scope(), false).unwrap();
let report = kebab_app::ingest_with_config(env.config.clone(), env.scope(), false).unwrap();
// The fixture has 3 markdown files; first ingest should label them
// all as New.
@@ -27,17 +26,14 @@ fn ingest_then_list_inspects_round_trip() {
}
// list_docs returns the 3 docs.
let docs = kebab_app::list_docs_with_config(
env.config.clone(),
kebab_core::DocFilter::default(),
)
.unwrap();
let docs =
kebab_app::list_docs_with_config(env.config.clone(), kebab_core::DocFilter::default())
.unwrap();
assert_eq!(docs.len(), 3, "docs: {docs:?}");
// inspect_doc round-trips one of them.
let any_doc_id = docs[0].doc_id.clone();
let canonical = kebab_app::inspect_doc_with_config(env.config.clone(), &any_doc_id)
.unwrap();
let canonical = kebab_app::inspect_doc_with_config(env.config.clone(), &any_doc_id).unwrap();
assert_eq!(canonical.doc_id, any_doc_id);
assert!(!canonical.blocks.is_empty(), "blocks empty");
}
@@ -46,12 +42,10 @@ fn ingest_then_list_inspects_round_trip() {
fn ingest_idempotent_on_second_run() {
let env = TestEnv::lexical_only();
let r1 =
kebab_app::ingest_with_config(env.config.clone(), env.scope(), false).unwrap();
let r1 = kebab_app::ingest_with_config(env.config.clone(), env.scope(), false).unwrap();
assert_eq!(r1.new, 3);
let r2 =
kebab_app::ingest_with_config(env.config.clone(), env.scope(), false).unwrap();
let r2 = kebab_app::ingest_with_config(env.config.clone(), env.scope(), false).unwrap();
// Same files re-ingested — p9-fb-23 task 7 introduced the early-skip
// path: when checksum + parser/chunker/embedding versions all match,
// the second run reports `Unchanged` rather than `Updated`. Pre-p9-fb-23
@@ -63,19 +57,16 @@ fn ingest_idempotent_on_second_run() {
assert_eq!(r2.unchanged, 3, "second run unchanged: {r2:?}");
// list_docs still has 3 docs (no duplicates).
let docs = kebab_app::list_docs_with_config(
env.config.clone(),
kebab_core::DocFilter::default(),
)
.unwrap();
let docs =
kebab_app::list_docs_with_config(env.config.clone(), kebab_core::DocFilter::default())
.unwrap();
assert_eq!(docs.len(), 3);
}
#[test]
fn ingest_summary_only_drops_items() {
let env = TestEnv::lexical_only();
let report =
kebab_app::ingest_with_config(env.config.clone(), env.scope(), true).unwrap();
let report = kebab_app::ingest_with_config(env.config.clone(), env.scope(), true).unwrap();
assert_eq!(report.scanned, 3);
assert!(report.items.is_none(), "summary-only should null items");
}
@@ -87,12 +78,10 @@ fn ingest_records_ingest_runs_row_with_aggregate_counts() {
// of every run. `summary_only=true` writes `items_json=NULL`; the
// counts MUST still be present.
let env = TestEnv::lexical_only();
let report = kebab_app::ingest_with_config(env.config.clone(), env.scope(), true)
.unwrap();
let report = kebab_app::ingest_with_config(env.config.clone(), env.scope(), true).unwrap();
assert_eq!(report.scanned, 3);
let db_path = std::path::PathBuf::from(&env.config.storage.data_dir)
.join("kebab.sqlite");
let db_path = std::path::PathBuf::from(&env.config.storage.data_dir).join("kebab.sqlite");
let conn = rusqlite::Connection::open(&db_path).expect("open kebab.sqlite");
let (scanned, new_c, updated, skipped, errors, items_json): (
i64,
@@ -141,25 +130,18 @@ fn ingest_provider_none_skips_lance() {
// tree shape (no `<data_dir>/lancedb` directory, or no `*.lance`
// tables under it).
let env = TestEnv::lexical_only();
let report =
kebab_app::ingest_with_config(env.config.clone(), env.scope(), false).unwrap();
let report = kebab_app::ingest_with_config(env.config.clone(), env.scope(), false).unwrap();
assert_eq!(report.errors, 0, "lexical-only run must not error");
assert_eq!(report.new, 3);
let lance_dir = std::path::PathBuf::from(&env.config.storage.data_dir)
.join("lancedb");
let lance_dir = std::path::PathBuf::from(&env.config.storage.data_dir).join("lancedb");
if lance_dir.exists() {
// If the dir was created (e.g., by an earlier consumer touching
// the path), it MUST contain no `.lance` tables.
let mut had_lance_table = false;
for entry in std::fs::read_dir(&lance_dir).expect("read lance_dir") {
let entry = entry.unwrap();
if entry
.path()
.extension()
.and_then(|s| s.to_str())
== Some("lance")
{
if entry.path().extension().and_then(|s| s.to_str()) == Some("lance") {
had_lance_table = true;
break;
}
@@ -189,8 +171,7 @@ fn list_docs_filters_by_tags_any() {
tags_any: vec!["rust".to_string()],
..Default::default()
};
let rust_docs =
kebab_app::list_docs_with_config(env.config.clone(), rust_filter).unwrap();
let rust_docs = kebab_app::list_docs_with_config(env.config.clone(), rust_filter).unwrap();
// intro.md and notes/cargo.md both tag "rust".
assert_eq!(rust_docs.len(), 2, "expected 2 rust docs: {rust_docs:?}");
}
@@ -198,8 +179,9 @@ fn list_docs_filters_by_tags_any() {
#[test]
fn inspect_doc_not_found_returns_actionable_error() {
let env = TestEnv::lexical_only();
let bogus =
kebab_core::DocumentId("0000000000000000000000000000000000000000000000000000000000000000".to_string());
let bogus = kebab_core::DocumentId(
"0000000000000000000000000000000000000000000000000000000000000000".to_string(),
);
let err = kebab_app::inspect_doc_with_config(env.config.clone(), &bogus).unwrap_err();
let msg = format!("{err:#}");
assert!(
@@ -218,8 +200,7 @@ fn inspect_chunk_not_found_returns_actionable_error() {
let bogus = kebab_core::ChunkId(
"0000000000000000000000000000000000000000000000000000000000000000".to_string(),
);
let err = kebab_app::inspect_chunk_with_config(env.config.clone(), &bogus)
.unwrap_err();
let err = kebab_app::inspect_chunk_with_config(env.config.clone(), &bogus).unwrap_err();
let msg = format!("{err:#}");
assert!(msg.contains("not found"), "got: {msg}");
}
@@ -251,22 +232,18 @@ fn ingest_with_config_opts_default_matches_legacy_behaviour() {
#[test]
fn ingest_stamps_chunker_version_on_document() {
let env = TestEnv::lexical_only();
let report =
kebab_app::ingest_with_config(env.config.clone(), env.scope(), false).unwrap();
let report = kebab_app::ingest_with_config(env.config.clone(), env.scope(), false).unwrap();
assert!(report.new >= 1, "expected at least one new doc: {report:?}");
assert_eq!(report.errors, 0, "no errors expected: {report:?}");
let docs = kebab_app::list_docs_with_config(
env.config.clone(),
kebab_core::DocFilter::default(),
)
.unwrap();
let docs =
kebab_app::list_docs_with_config(env.config.clone(), kebab_core::DocFilter::default())
.unwrap();
assert!(!docs.is_empty(), "no docs after ingest");
for doc_entry in &docs {
let canonical =
kebab_app::inspect_doc_with_config(env.config.clone(), &doc_entry.doc_id)
.unwrap();
kebab_app::inspect_doc_with_config(env.config.clone(), &doc_entry.doc_id).unwrap();
assert!(
canonical.last_chunker_version.is_some(),
"last_chunker_version must be stamped for doc {}: got {:?}",


@@ -0,0 +1,171 @@
// crates/kebab-app/tests/ingest_log_smoke.rs
//
// Integration tests for ingest_log feature (v0.20.x). Spec §5 AC-9 + AC-6.
use std::path::PathBuf;
use kebab_app::{IngestOpts, ingest_with_config_opts};
use kebab_config::{Config, LoggingCfg};
use kebab_core::SourceScope;
use serde_json::Value;
use tempfile::TempDir;
fn minimal_config(workspace: &std::path::Path, log_dir: &std::path::Path) -> Config {
let data_dir = workspace.parent().unwrap().join("data");
std::fs::create_dir_all(&data_dir).unwrap();
let model_dir = workspace.parent().unwrap().join("models");
std::fs::create_dir_all(&model_dir).unwrap();
let mut cfg = Config::defaults();
cfg.workspace.root = workspace.to_string_lossy().into_owned();
cfg.workspace.exclude.clear();
cfg.storage.data_dir = data_dir.to_string_lossy().into_owned();
cfg.storage.model_dir = model_dir.to_string_lossy().into_owned();
cfg.models.embedding.provider = "none".to_string();
cfg.models.embedding.dimensions = 0;
cfg.chunking.target_tokens = 80;
cfg.chunking.overlap_tokens = 20;
cfg.logging = LoggingCfg {
ingest_log_enabled: true,
ingest_log_dir: log_dir.to_path_buf(),
..Default::default()
};
cfg
}
/// AC-9: ingest → log file exists + each line valid JSON + last line kind=summary + scanned>0.
#[test]
fn ingest_log_smoke() {
let tmp = TempDir::new().unwrap();
let workspace = tmp.path().join("kb");
std::fs::create_dir_all(&workspace).unwrap();
let log_dir = tmp.path().join("logs");
// 1. Minimal corpus: 1 markdown + 1 scanned PDF (OCR disabled — no Ollama needed).
std::fs::write(
workspace.join("hello.md"),
"# Hello\n\nThis is a smoke test.\n",
)
.unwrap();
let pdf_src = PathBuf::from("../kebab-parse-pdf/tests/fixtures/scanned_page1.pdf");
if pdf_src.exists() {
std::fs::copy(&pdf_src, workspace.join("scanned.pdf")).unwrap();
}
// 2. Config with logging enabled.
let cfg = minimal_config(&workspace, &log_dir);
let scope = SourceScope {
root: workspace.clone(),
exclude: vec![],
..Default::default()
};
// 3. Run ingest.
ingest_with_config_opts(cfg, scope, false, IngestOpts::default())
.expect("ingest should succeed");
// 4. Assert log file exists in log_dir.
let log_files: Vec<_> = std::fs::read_dir(&log_dir)
.unwrap()
.filter_map(Result::ok)
.filter(|e| {
e.file_name().to_string_lossy().starts_with("ingest-")
&& e.file_name().to_string_lossy().ends_with(".ndjson")
})
.collect();
assert_eq!(
log_files.len(),
1,
"expected exactly 1 ingest-*.ndjson file, found: {log_files:?}"
);
// 5. Parse each line as JSON — assert kind field present and valid.
let body = std::fs::read_to_string(log_files[0].path()).unwrap();
let lines: Vec<&str> = body.lines().collect();
assert!(!lines.is_empty(), "log file should not be empty");
let valid_kinds = ["ocr", "parse_error", "skip", "error", "summary"];
for line in &lines {
let v: Value = serde_json::from_str(line)
.unwrap_or_else(|e| panic!("line is not valid JSON: {e}\nline: {line}"));
let kind = v
.get("kind")
.and_then(|k| k.as_str())
.unwrap_or_else(|| panic!("line missing 'kind' field: {line}"));
assert!(
valid_kinds.contains(&kind),
"unexpected kind '{kind}' in line: {line}"
);
}
// 6. Last line must be kind=summary with scanned > 0.
let last = lines.last().unwrap();
let last_v: Value = serde_json::from_str(last).unwrap();
assert_eq!(
last_v.get("kind").and_then(|k| k.as_str()),
Some("summary"),
"last line must be kind=summary, got: {last}"
);
let scanned = last_v.get("scanned").and_then(Value::as_u64).unwrap_or(0);
assert!(scanned > 0, "summary.scanned should be > 0, got: {last}");
}
/// AC-6: ingest_log_enabled=false → no log file created.
#[test]
fn ingest_log_disabled_emits_no_file() {
let tmp = TempDir::new().unwrap();
let workspace = tmp.path().join("kb");
std::fs::create_dir_all(&workspace).unwrap();
let log_dir = tmp.path().join("logs");
std::fs::write(
workspace.join("hello.md"),
"# Hello\n\nDisabled log test.\n",
)
.unwrap();
let data_dir = tmp.path().join("data");
std::fs::create_dir_all(&data_dir).unwrap();
let model_dir = tmp.path().join("models");
std::fs::create_dir_all(&model_dir).unwrap();
let mut cfg = Config::defaults();
cfg.workspace.root = workspace.to_string_lossy().into_owned();
cfg.workspace.exclude.clear();
cfg.storage.data_dir = data_dir.to_string_lossy().into_owned();
cfg.storage.model_dir = model_dir.to_string_lossy().into_owned();
cfg.models.embedding.provider = "none".to_string();
cfg.models.embedding.dimensions = 0;
cfg.logging = LoggingCfg {
ingest_log_enabled: false,
ingest_log_dir: log_dir.clone(),
..Default::default()
};
let scope = SourceScope {
root: workspace.clone(),
exclude: vec![],
..Default::default()
};
ingest_with_config_opts(cfg, scope, false, IngestOpts::default())
.expect("ingest should succeed");
// log_dir should either not exist or contain 0 ingest-*.ndjson files.
let log_file_count = if log_dir.exists() {
std::fs::read_dir(&log_dir)
.unwrap()
.filter_map(Result::ok)
.filter(|e| {
e.file_name().to_string_lossy().starts_with("ingest-")
&& e.file_name().to_string_lossy().ends_with(".ndjson")
})
.count()
} else {
0
};
assert_eq!(
log_file_count, 0,
"no ingest-*.ndjson file should be created when disabled"
);
}


@@ -0,0 +1,117 @@
//! Integration smoke tests for the PDF OCR pipeline (§ Acceptance §9 #1 + #2).
//!
//! Tests 1 and 2 require a live Ollama endpoint — `#[ignore]` by default.
//! Manual invoke:
//! KEBAB_PDF_OCR_ENDPOINT=http://192.168.0.47:11434 \
//! cargo test -p kebab-app --test ingest_pdf_ocr_smoke --ignored -j 4
//!
//! Test 3 (cancel) uses a dummy endpoint + pre-set cancel — runs by default
//! to verify the cancel wiring doesn't panic/deadlock.
mod common;
use std::path::PathBuf;
use std::sync::Arc;
use std::sync::atomic::AtomicBool;
use common::TestEnv;
fn ollama_endpoint() -> String {
std::env::var("KEBAB_PDF_OCR_ENDPOINT").unwrap_or_else(|_| "http://localhost:11434".to_string())
}
fn make_ocr_env_real() -> TestEnv {
let mut env = TestEnv::lexical_only();
env.config.pdf.ocr.enabled = true;
env.config.pdf.ocr.endpoint = Some(ollama_endpoint());
env.config.models.embedding.provider = "none".to_string();
let src = PathBuf::from(env!("CARGO_MANIFEST_DIR"))
.parent()
.unwrap()
.join("kebab-parse-pdf/tests/fixtures/scanned_page1.pdf");
let dest = env.workspace_root.join("scanned_page1.pdf");
std::fs::copy(&src, &dest).expect("copy scanned_page1.pdf to workspace");
env
}
/// § Acceptance §9 #1 — real Ollama OCR + IngestItem.pdf_ocr_pages = Some(1).
#[test]
#[ignore = "real Ollama qwen2.5vl:3b dependency"]
fn ingest_with_mock_ocr_yields_pdf_ocr_summary() {
let env = make_ocr_env_real();
let report =
kebab_app::ingest_with_config(env.config.clone(), env.scope(), false).expect("ingest");
assert!(report.new >= 1, "at least one PDF ingested: {report:?}");
let items = report.items.unwrap_or_default();
let pdf_item = items.iter().find(|i| i.doc_path.0.ends_with(".pdf"));
assert!(
pdf_item.is_some(),
"PDF item must appear in ingest report items: {items:?}"
);
let pdf_item = pdf_item.unwrap();
assert!(
pdf_item.pdf_ocr_pages.is_some(),
"pdf_ocr_pages must be set for scanned PDF: {pdf_item:?}"
);
assert_eq!(
pdf_item.pdf_ocr_pages.unwrap(),
1,
"scanned_page1.pdf has exactly 1 page"
);
}
/// § Acceptance §9 #2 — OCR text indexed and retrievable via lexical search.
#[test]
#[ignore = "real Ollama qwen2.5vl:3b dependency"]
fn ocr_text_indexed_and_searchable() {
let env = make_ocr_env_real();
kebab_app::ingest_with_config(env.config.clone(), env.scope(), false).expect("ingest");
// Search for a Korean morpheme expected to appear in the qwen2.5vl:3b OCR
// output of the PoC ground-truth page. "다음" is a high-frequency token
// in the page1.txt truth file.
let query = common::lexical_query("다음");
let hits = kebab_app::search_with_config(env.config.clone(), query).expect("search");
assert!(
!hits.is_empty(),
"OCR-indexed text must surface in lexical search results"
);
}
/// Production cancel wiring smoke — pre-set cancel exits before any OCR call.
/// Dummy endpoint (port 1 = connection-refused) means OCR HTTP calls would
/// fail, but cancel=true prevents the loop from reaching OCR at all.
/// Verifies no panic/deadlock regardless of Ok/Err outcome.
#[test]
fn ingest_with_cancel_aborts_mid_pdf() {
let mut env = TestEnv::lexical_only();
env.config.pdf.ocr.enabled = true;
env.config.pdf.ocr.endpoint = Some("http://127.0.0.1:1".to_string());
let src = PathBuf::from(env!("CARGO_MANIFEST_DIR"))
.parent()
.unwrap()
.join("kebab-parse-pdf/tests/fixtures/scanned_page1.pdf");
let dest = env.workspace_root.join("scanned_page1.pdf");
std::fs::copy(&src, &dest).expect("copy scanned_page1.pdf to workspace");
let cancel = Arc::new(AtomicBool::new(true)); // pre-set — abort immediately
let result = kebab_app::ingest_with_config_cancellable(
env.config.clone(),
env.scope(),
false,
None,
Some(cancel),
);
// Both Ok (pre-cancel exit) and Err (eager OCR engine fail) are acceptable —
// key assertion is no panic/deadlock.
let _ = result;
}


@@ -13,13 +13,9 @@ use kebab_core::IngestItemKind;
fn run_with_progress() -> Vec<IngestEvent> {
let env = TestEnv::lexical_only();
let (tx, rx) = mpsc::channel::<IngestEvent>();
let report = kebab_app::ingest_with_config_progress(
env.config.clone(),
env.scope(),
false,
Some(tx),
)
.unwrap();
let report =
kebab_app::ingest_with_config_progress(env.config.clone(), env.scope(), false, Some(tx))
.unwrap();
assert_eq!(report.scanned, 3);
assert_eq!(report.new, 3);
@@ -73,40 +69,74 @@ fn progress_event_sequence_matches_design_section_2_4a() {
other => panic!("expected Completed last, got {other:?}"),
}
// Middle: 3 AssetStarted/AssetFinished pairs in monotonic idx order.
let asset_events: Vec<&IngestEvent> = events[2..events.len() - 1].iter().collect();
assert_eq!(
asset_events.len(),
6,
"expected 3 (Started + Finished) pairs, got {asset_events:?}"
);
for (chunk_idx, pair) in asset_events.chunks(2).enumerate() {
let expected_idx = chunk_idx as u32 + 1;
match (pair[0], pair[1]) {
(
IngestEvent::AssetStarted {
idx: si,
total: st,
media,
..
},
IngestEvent::AssetFinished {
idx: fi,
total: ft,
result,
chunks,
},
) => {
assert_eq!(*si, expected_idx, "Started idx mismatch: {pair:?}");
assert_eq!(*fi, expected_idx, "Finished idx mismatch: {pair:?}");
assert_eq!(*st, 3, "Started total mismatch");
assert_eq!(*ft, 3, "Finished total mismatch");
assert_eq!(media, "markdown", "fixture is markdown only");
assert_eq!(*result, IngestItemKind::New, "first ingest → New");
assert!(*chunks >= 1, "chunks: {pair:?}");
// Middle (v0.24.0 ordering invariant §2.4a): per asset the stream is
// AssetStarted < AssetChunked < [ExpansionProgress*] < AssetTimings
// < AssetFinished
// Expansion is disabled in the lexical fixture, so no ExpansionProgress
// frames appear here — but AssetChunked + AssetTimings are emitted for
// every markdown asset.
let middle = &events[2..events.len() - 1];
// 3 AssetStarted events, monotonic idx 1..=3, all markdown, total = 3.
let started: Vec<u32> = middle
.iter()
.filter_map(|e| match e {
IngestEvent::AssetStarted {
idx, total, media, ..
} => {
assert_eq!(*total, 3, "Started total mismatch: {e:?}");
assert_eq!(media, "markdown", "fixture is markdown only: {e:?}");
Some(*idx)
}
other => panic!("expected Started+Finished pair, got {other:?}"),
}
_ => None,
})
.collect();
assert_eq!(started, vec![1, 2, 3], "AssetStarted idx order: {middle:?}");
// 3 AssetFinished events, monotonic idx 1..=3, each New with ≥1 chunk.
let finished: Vec<u32> = middle
.iter()
.filter_map(|e| match e {
IngestEvent::AssetFinished {
idx,
total,
result,
chunks,
} => {
assert_eq!(*total, 3, "Finished total mismatch: {e:?}");
assert_eq!(*result, IngestItemKind::New, "first ingest → New: {e:?}");
assert!(*chunks >= 1, "chunks: {e:?}");
Some(*idx)
}
_ => None,
})
.collect();
assert_eq!(finished, vec![1, 2, 3], "AssetFinished idx order: {middle:?}");
// v0.24.0 additive events: exactly one AssetChunked + one AssetTimings
// per asset, each strictly bracketed by that asset's Started / Finished.
for target in 1u32..=3 {
let started_at = middle
.iter()
.position(|e| matches!(e, IngestEvent::AssetStarted { idx, .. } if *idx == target))
.unwrap_or_else(|| panic!("missing AssetStarted for idx {target}: {middle:?}"));
let finished_at = middle
.iter()
.position(|e| matches!(e, IngestEvent::AssetFinished { idx, .. } if *idx == target))
.unwrap_or_else(|| panic!("missing AssetFinished for idx {target}: {middle:?}"));
let chunked_at = middle
.iter()
.position(|e| matches!(e, IngestEvent::AssetChunked { idx, chunks, .. } if *idx == target && *chunks >= 1))
.unwrap_or_else(|| panic!("missing AssetChunked for idx {target}: {middle:?}"));
let timings_at = middle
.iter()
.position(|e| matches!(e, IngestEvent::AssetTimings { idx, .. } if *idx == target))
.unwrap_or_else(|| panic!("missing AssetTimings for idx {target}: {middle:?}"));
assert!(
started_at < chunked_at && chunked_at < timings_at && timings_at < finished_at,
"idx {target} ordering: started={started_at} chunked={chunked_at} \
timings={timings_at} finished={finished_at}: {middle:?}"
);
}
}
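
The per-asset ordering invariant checked above can be illustrated in isolation. The following is a minimal sketch, not the project's code: `Event` is a hypothetical stand-in for `IngestEvent` that keeps only an `idx` field, and `first_pos` / `is_bracketed` are illustrative helper names.

```rust
/// Hypothetical stand-in for the real `IngestEvent`; only the variants
/// needed to illustrate the ordering check are modelled.
#[derive(Debug)]
enum Event {
    Started { idx: u32 },
    Chunked { idx: u32 },
    Timings { idx: u32 },
    Finished { idx: u32 },
}

/// Index of the first event matching `pred`, if any.
fn first_pos(events: &[Event], pred: impl Fn(&Event) -> bool) -> Option<usize> {
    events.iter().position(|e| pred(e))
}

/// True when asset `target` has all four events and they appear in the
/// order Started < Chunked < Timings < Finished.
fn is_bracketed(events: &[Event], target: u32) -> bool {
    let started = first_pos(events, |e| matches!(e, Event::Started { idx } if *idx == target));
    let chunked = first_pos(events, |e| matches!(e, Event::Chunked { idx } if *idx == target));
    let timings = first_pos(events, |e| matches!(e, Event::Timings { idx } if *idx == target));
    let finished = first_pos(events, |e| matches!(e, Event::Finished { idx } if *idx == target));
    match (started, chunked, timings, finished) {
        (Some(s), Some(c), Some(t), Some(f)) => s < c && c < t && t < f,
        // A missing event counts as a violation, mirroring the panics above.
        _ => false,
    }
}
```

Returning `false` for a missing event keeps the sketch panic-free; the test above panics instead so that the failure message can include the full event stream.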
@@ -116,13 +146,9 @@ fn ingest_with_config_progress_none_matches_ingest_with_config() {
// `ingest_with_config_progress(..., None)` must produce identical
// reports modulo wall-clock duration.
let env = TestEnv::lexical_only();
let r_none = kebab_app::ingest_with_config_progress(
env.config.clone(),
env.scope(),
true,
None,
)
.unwrap();
let r_none =
kebab_app::ingest_with_config_progress(env.config.clone(), env.scope(), true, None)
.unwrap();
assert_eq!(r_none.scanned, 3);
assert_eq!(r_none.new, 3);
}
@@ -134,12 +160,77 @@ fn dropped_receiver_does_not_panic_or_fail_ingest() {
let env = TestEnv::lexical_only();
let (tx, rx) = mpsc::channel::<IngestEvent>();
drop(rx);
let report = kebab_app::ingest_with_config_progress(
env.config.clone(),
env.scope(),
true,
Some(tx),
)
.unwrap();
let report =
kebab_app::ingest_with_config_progress(env.config.clone(), env.scope(), true, Some(tx))
.unwrap();
assert_eq!(report.scanned, 3);
}
/// v0.20.0 sub-item 1: verifies that pdf_ocr_started + pdf_ocr_finished events are
/// emitted when a PDF asset is ingested with OCR enabled. Depends on a real Ollama
/// instance, so it is `#[ignore]` by default.
///
/// Manual invoke:
/// ```
/// KEBAB_PDF_OCR_ENABLED=true \
/// KEBAB_PDF_OCR_ENDPOINT=http://192.168.0.47:11434 \
/// cargo test -p kebab-app --test ingest_progress \
/// --ignored pdf_ocr_progress_emits_started_finished_events
/// ```
#[test]
#[ignore = "real Ollama dependency — manual invoke via KEBAB_PDF_OCR_ENABLED=true"]
fn pdf_ocr_progress_emits_started_finished_events() {
// Copy the F1 fixture (DCTDecode JPEG passthrough) into a workspace under tmpdir.
let tmpdir = tempfile::tempdir().expect("create tmpdir");
let workspace = tmpdir.path().join("workspace");
std::fs::create_dir_all(&workspace).expect("create workspace dir");
let f1_src = std::path::PathBuf::from(env!("CARGO_MANIFEST_DIR"))
.join("../kebab-parse-pdf/tests/fixtures/scanned_page1.pdf");
let f1 = std::fs::read(&f1_src).expect("F1 fixture present");
std::fs::write(workspace.join("page1.pdf"), &f1).expect("copy F1");
let data_dir = tmpdir.path().join("data");
std::fs::create_dir_all(&data_dir).expect("create data dir");
let mut config = kebab_config::Config::defaults();
config.workspace.root = workspace.to_string_lossy().into_owned();
config.storage.data_dir = data_dir.to_string_lossy().into_owned();
config.models.embedding.provider = "none".to_string();
config.models.embedding.dimensions = 0;
config.pdf.ocr.enabled = true;
if let Ok(endpoint) = std::env::var("KEBAB_PDF_OCR_ENDPOINT") {
config.pdf.ocr.endpoint = Some(endpoint);
}
let scope = kebab_core::SourceScope {
root: workspace.clone(),
..Default::default()
};
let (tx, rx) = mpsc::channel::<IngestEvent>();
let _report = kebab_app::ingest_with_config_progress(config, scope, false, Some(tx))
.expect("ingest_with_config_progress");
let events: Vec<_> = rx.iter().collect();
let started_count = events
.iter()
.filter(|e| matches!(e, IngestEvent::PdfOcrStarted { .. }))
.count();
let finished_count = events
.iter()
.filter(|e| matches!(e, IngestEvent::PdfOcrFinished { .. }))
.count();
assert!(
started_count >= 1,
"PdfOcrStarted must be emitted at least once (got {started_count})"
);
assert!(
finished_count >= 1,
"PdfOcrFinished must be emitted at least once (got {finished_count})"
);
assert_eq!(
started_count, finished_count,
"Started and Finished counts must match"
);
}


@@ -29,12 +29,14 @@ fn ingest_stdin_writes_frontmatter_and_reports_new() {
"## Body content\n\nMore.",
"Article X",
Some("https://example.com/x"),
).unwrap();
)
.unwrap();
assert_eq!(report.new, 1, "{report:?}");
// _external/ contains exactly one .md file with frontmatter.
let ext_dir = std::path::PathBuf::from(&cfg.workspace.root).join("_external");
let entries: Vec<_> = fs::read_dir(&ext_dir).unwrap()
let entries: Vec<_> = fs::read_dir(&ext_dir)
.unwrap()
.filter_map(std::result::Result::ok)
.collect();
assert_eq!(entries.len(), 1);
@@ -50,16 +52,13 @@ fn ingest_stdin_without_source_uri() {
let dir = tempfile::tempdir().unwrap();
let cfg = fresh_cfg(dir.path());
let report = kebab_app::ingest_stdin_with_config(
cfg.clone(),
"## Body",
"Title",
None,
).unwrap();
let report =
kebab_app::ingest_stdin_with_config(cfg.clone(), "## Body", "Title", None).unwrap();
assert_eq!(report.new, 1);
let ext_dir = std::path::PathBuf::from(&cfg.workspace.root).join("_external");
let entries: Vec<_> = fs::read_dir(&ext_dir).unwrap()
let entries: Vec<_> = fs::read_dir(&ext_dir)
.unwrap()
.filter_map(std::result::Result::ok)
.collect();
let content = fs::read_to_string(entries[0].path()).unwrap();


@@ -17,9 +17,8 @@ fn init_workspace_header_lists_supported_extensions() {
}
kebab_app::init_workspace(true).expect("init_workspace");
let cfg_path = kebab_config::Config::xdg_config_path();
let body = std::fs::read_to_string(&cfg_path).unwrap_or_else(|e| {
panic!("read config at {}: {e}", cfg_path.display())
});
let body = std::fs::read_to_string(&cfg_path)
.unwrap_or_else(|e| panic!("read config at {}: {e}", cfg_path.display()));
assert!(
body.contains("처리 가능한 형식"),
"header lists supported types section: body=\n{body}"


@@ -0,0 +1,122 @@
//! Bug #3 regression: multi-scanned PDF ingest must produce globally unique chunk_ids.
//! v0.20.0 sub-item 1 bugfix.
//!
//! Strategy: helper-level chain test (apply_ocr_to_pdf_pages → PdfPageV1Chunker).
//! Facade mock injection is unavailable (kebab-app hardcodes OllamaVisionOcr), so
//! this test covers the full OCR→chunk pipeline with real PDF fixtures + MockOcrEngine,
//! adding value beyond kebab-chunk unit test B5 (which tests PdfPageV1Chunker alone).
mod common;
use std::collections::HashSet;
use std::path::{Path, PathBuf};
use common::mock_ocr::MockOcrEngine;
use kebab_app::pdf_ocr_apply::{PdfOcrOpts, apply_ocr_to_pdf_pages};
use kebab_chunk::PdfPageV1Chunker;
use kebab_core::{
AssetStorage, Checksum, ChunkPolicy, Chunker, ExtractConfig, ExtractContext, Extractor,
MediaType, RawAsset, SourceUri, WorkspacePath, id_for_asset,
};
use kebab_parse_image::OcrEngine;
use kebab_parse_pdf::PdfTextExtractor;
use time::OffsetDateTime;
fn make_pdf_asset(path: &str, hash_char: char, byte_len: u64) -> RawAsset {
let fake_hash: String = hash_char.to_string().repeat(64);
let asset_id = id_for_asset(&fake_hash);
RawAsset {
asset_id,
source_uri: SourceUri::File(PathBuf::from(path)),
workspace_path: WorkspacePath::new(path.to_string()).unwrap(),
media_type: MediaType::Pdf,
byte_len,
checksum: Checksum(fake_hash),
discovered_at: OffsetDateTime::UNIX_EPOCH,
stored: AssetStorage::Copied {
path: PathBuf::from(path),
},
}
}
fn extract_and_ocr(
bytes: &[u8],
path: &str,
hash_char: char,
engine: &dyn OcrEngine,
) -> kebab_core::CanonicalDocument {
let asset = make_pdf_asset(path, hash_char, bytes.len() as u64);
let workspace_root = Path::new("/");
let config = ExtractConfig::default();
let ctx = ExtractContext {
asset: &asset,
workspace_root,
config: &config,
};
let mut canonical = PdfTextExtractor::new().extract(&ctx, bytes).unwrap();
let opts = PdfOcrOpts {
enabled: true,
always_on: false,
valid_ratio_threshold: 0.5,
min_char_count: 20,
lang_hint: None,
cancel: None,
};
apply_ocr_to_pdf_pages(&mut canonical, engine, bytes, &opts, |_| {}).unwrap();
canonical
}
#[test]
fn multi_scanned_pdf_ingest_no_chunk_id_collision() {
let f1_bytes = std::fs::read("../kebab-parse-pdf/tests/fixtures/scanned_page1.pdf")
.expect("F1 fixture missing");
let f2_bytes = std::fs::read("../kebab-parse-pdf/tests/fixtures/scanned_page2.pdf")
.expect("F2 fixture missing");
// Bug #3 trigger shape: 10-char early segment + ". " + 500-char tail.
// byte_len = 10*3 + 2 + 500*3 = 1532 > target_bytes=1500 → multi-chunk.
// overlap_bytes = min(240, 750) = 240 / chars=80 → second chunk's actual_start
// collapses to prev_min=0 without the fix → same #c0 suffix → chunk_id collision.
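// NOTE: the filler character was lost in this listing; "가" is a placeholder.
// Any 3-byte UTF-8 character satisfies the byte arithmetic above.
let trigger_text = format!("{}. {}", "가".repeat(10), "가".repeat(500));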
let f1_engine = MockOcrEngine::single("F1 mock OCR page text", false);
let f2_engine = MockOcrEngine::single(&trigger_text, false);
let f1_canonical = extract_and_ocr(&f1_bytes, "page1.pdf", '1', &f1_engine);
let f2_canonical = extract_and_ocr(&f2_bytes, "page2.pdf", '2', &f2_engine);
let chunk_policy = ChunkPolicy {
target_tokens: 500,
overlap_tokens: 80,
respect_markdown_headings: false,
chunker_version: PdfPageV1Chunker.chunker_version(),
};
let f1_chunks = PdfPageV1Chunker
.chunk(&f1_canonical, &chunk_policy)
.unwrap();
let f2_chunks = PdfPageV1Chunker
.chunk(&f2_canonical, &chunk_policy)
.unwrap();
assert!(
f2_chunks.len() >= 2,
"F2 trigger text must produce ≥2 chunks for the collision to be possible; got {}",
f2_chunks.len()
);
let all_ids: Vec<&str> = f1_chunks
.iter()
.chain(f2_chunks.iter())
.map(|c| c.chunk_id.0.as_str())
.collect();
let total = all_ids.len();
let unique: HashSet<&str> = all_ids.iter().copied().collect();
assert_eq!(
unique.len(),
total,
"all chunk_ids must be globally unique across F1 + F2 ({} unique vs {} total — collision detected)",
unique.len(),
total,
);
}
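
The byte arithmetic in the trigger-shape comment (10*3 + 2 + 500*3 = 1532) can be checked on its own. This is a small sketch under stated assumptions: `trigger_text` is a hypothetical helper, and the filler character is a parameter because the character used in the original test is not recoverable from this listing.

```rust
/// Builds the trigger-shaped text: 10 filler chars, ". ", then 500 filler chars.
/// `filler` is an assumption; the test only needs a 3-byte UTF-8 character.
fn trigger_text(filler: char) -> String {
    let unit = filler.to_string();
    format!("{}. {}", unit.repeat(10), unit.repeat(500))
}

/// Byte length as seen by a byte-based chunker (`str::len` counts bytes, not chars).
fn byte_len(text: &str) -> usize {
    text.len()
}

/// Character count, for comparison with the byte length.
fn char_len(text: &str) -> usize {
    text.chars().count()
}
```

With a Hangul syllable such as '가' (3 bytes in UTF-8) the text is 512 characters but 1532 bytes, which exceeds the 1500-byte target mentioned in the comment. With an ASCII filler it is only 512 bytes and would not force a second chunk.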


@@ -0,0 +1,156 @@
//! Integration smoke tests for `kebab inspect ocr-stats / ocr-failures`.
//! AC-4, AC-5, AC-6, AC-11 (ocr_inspect_smoke binary), AC-13.
mod common;
use common::TestEnv;
use kebab_app::App;
use kebab_store_sqlite::SqliteStore;
/// Insert synthetic pdf_ocr_events rows directly so the test runs without
/// a live Ollama endpoint.
fn seed_ocr_events(env: &TestEnv, store: &SqliteStore) {
// Success rows
for i in 0..3u32 {
store
.record_pdf_ocr_event(
"run-aaa",
&format!("2026-05-28T0{i}:00:00Z"),
Some("doc-abc"),
"path/scanned.pdf",
i + 1,
Some(50_000),
Some(200),
Some(150),
100 + u64::from(i) * 20,
42,
true,
None,
"qwen2.5vl",
)
.expect("seed success row");
}
// Failure row
store
.record_pdf_ocr_event(
"run-bbb",
"2026-05-28T10:00:00Z",
Some("doc-abc"),
"path/scanned.pdf",
4,
Some(30_000),
Some(200),
Some(150),
9999,
0,
false,
Some("ocr_error"),
"qwen2.5vl",
)
.expect("seed failure row");
// Row for different doc
store
.record_pdf_ocr_event(
"run-ccc",
"2026-05-28T11:00:00Z",
Some("doc-xyz"),
"path/other.pdf",
1,
None,
None,
None,
200,
10,
true,
None,
"qwen2.5vl",
)
.expect("seed doc-xyz row");
// `env` is intentionally unused here; migrations are run by the caller
// (`open_app_with_seeded_events`) before seeding.
let _ = env;
}
fn open_app_with_seeded_events(env: &TestEnv) -> App {
let app = env.app();
let store = SqliteStore::open(&env.config).expect("open store for seed");
store.run_migrations().expect("run migrations for seed");
seed_ocr_events(env, &store);
app
}
/// AC-4: `inspect_ocr_stats` returns `schema_version = "ocr_stats.v1"`,
/// `total_events >= 1`, `0 ≤ success_rate ≤ 1`.
#[test]
fn ocr_stats_after_seeded_events() {
let env = TestEnv::lexical_only();
let app = open_app_with_seeded_events(&env);
let stats = app.inspect_ocr_stats().expect("inspect_ocr_stats");
assert_eq!(stats.schema_version, "ocr_stats.v1");
assert!(stats.total_events >= 1, "total_events should be >= 1");
assert!(
(0.0..=1.0).contains(&stats.success_rate),
"success_rate must be in [0, 1]: {}",
stats.success_rate
);
assert!(stats.total_runs >= 1, "total_runs should be >= 1");
// by_engine should have at least one entry
assert!(!stats.by_engine.is_empty(), "by_engine must be non-empty");
}
/// AC-6: `inspect_ocr_failures` (no doc_id, corpus-wide) returns failures list.
#[test]
fn ocr_failures_corpus_wide() {
let env = TestEnv::lexical_only();
let app = open_app_with_seeded_events(&env);
let result = app
.inspect_ocr_failures(None, 10)
.expect("inspect_ocr_failures");
assert_eq!(result.schema_version, "ocr_failures.v1");
assert!(result.failure_count >= 1, "expected at least 1 failure");
assert!(
!result.failures.is_empty(),
"failures list must be non-empty"
);
}
/// AC-5: `inspect_ocr_failures` with doc_id filter returns matching rows.
#[test]
fn ocr_failures_filter_by_doc_id() {
let env = TestEnv::lexical_only();
let app = open_app_with_seeded_events(&env);
let result = app
.inspect_ocr_failures(Some("doc-abc"), 10)
.expect("inspect_ocr_failures by doc_id");
assert_eq!(result.schema_version, "ocr_failures.v1");
assert_eq!(
result.doc_id.as_deref(),
Some("doc-abc"),
"doc_id must be echoed back"
);
// All rows must belong to doc-abc (no cross-doc leak)
for row in &result.failures {
// rows are failure rows for doc-abc only (reason = ocr_error)
assert_eq!(row.reason, "ocr_error");
}
}
/// AC-13: SKILL.md lists both new wire schemas.
#[test]
fn skill_md_lists_new_schemas() {
let skill_md = std::fs::read_to_string("../../integrations/claude-code/kebab/SKILL.md")
.expect("read SKILL.md");
assert!(
skill_md.contains("ocr_stats.v1"),
"SKILL.md must mention ocr_stats.v1"
);
assert!(
skill_md.contains("ocr_failures.v1"),
"SKILL.md must mention ocr_failures.v1"
);
}


@@ -0,0 +1,358 @@
//! Integration tests for pdf_ocr_apply helper. spec §5.5 MockOcrEngine pattern.
mod common;
use std::path::{Path, PathBuf};
use std::sync::Arc;
use std::sync::atomic::AtomicBool;
use common::mock_ocr::MockOcrEngine;
use kebab_app::pdf_ocr_apply::{PdfOcrOpts, apply_ocr_to_pdf_pages};
use kebab_core::{
AssetStorage, Block, CanonicalDocument, Checksum, ExtractConfig, ExtractContext, Extractor,
Inline, Lang, MediaType, RawAsset, SourceSpan, SourceUri, WorkspacePath, id_for_asset,
};
use kebab_parse_pdf::PdfTextExtractor;
use time::OffsetDateTime;
// ── Fixture helpers ───────────────────────────────────────────────────────
fn f1_pdf_bytes() -> Vec<u8> {
std::fs::read("../kebab-parse-pdf/tests/fixtures/scanned_page1.pdf")
.expect("F1 fixture missing")
}
fn make_raw_asset(path: &str, media_type: MediaType, byte_len: u64) -> RawAsset {
let fake_hash = "0".repeat(64);
let asset_id = id_for_asset(&fake_hash);
RawAsset {
asset_id,
source_uri: SourceUri::File(PathBuf::from(path)),
workspace_path: WorkspacePath::new(path.to_string()).unwrap(),
media_type,
byte_len,
checksum: Checksum(fake_hash.clone()),
discovered_at: OffsetDateTime::UNIX_EPOCH,
stored: AssetStorage::Copied {
path: PathBuf::from(path),
},
}
}
/// Build a CanonicalDocument from raw PDF bytes using PdfTextExtractor.
/// F1 (scanned) returns an empty-text Block::Paragraph per page.
fn extract_canonical_from_bytes(bytes: &[u8]) -> CanonicalDocument {
let asset = make_raw_asset("test.pdf", MediaType::Pdf, bytes.len() as u64);
let workspace_root = Path::new("/");
let config = ExtractConfig::default();
let ctx = ExtractContext {
asset: &asset,
workspace_root,
config: &config,
};
PdfTextExtractor::new().extract(&ctx, bytes).unwrap()
}
/// F1 bytes → canonical with 1 empty Block::Paragraph for page 1.
fn canonical_with_empty_block() -> CanonicalDocument {
extract_canonical_from_bytes(&f1_pdf_bytes())
}
/// F1-based canonical with block text replaced by `text` (high valid_ratio, chars≥20).
fn canonical_with_filled_block(text: &str) -> CanonicalDocument {
let mut canonical = extract_canonical_from_bytes(&f1_pdf_bytes());
if let Some(Block::Paragraph(tb)) = canonical.blocks.first_mut() {
let char_count = text.chars().count() as u32;
tb.text = text.to_string();
tb.inlines = vec![Inline::Text {
text: text.to_string(),
}];
if let SourceSpan::Page { char_end, .. } = &mut tb.common.source_span {
*char_end = Some(char_count);
}
}
canonical
}
/// F1-based canonical with block text replaced by PUA codepoints (low valid_ratio).
fn canonical_with_mojibake_block() -> CanonicalDocument {
let mut canonical = extract_canonical_from_bytes(&f1_pdf_bytes());
if let Some(Block::Paragraph(tb)) = canonical.blocks.first_mut() {
let pua = "\u{E000}".repeat(25); // 25 PUA codepoints → valid_ratio ≈ 0
let char_count = pua.chars().count() as u32;
tb.text = pua.clone();
tb.inlines = vec![Inline::Text { text: pua }];
if let SourceSpan::Page { char_end, .. } = &mut tb.common.source_span {
*char_end = Some(char_count);
}
}
canonical
}
fn default_opts(enabled: bool) -> PdfOcrOpts {
PdfOcrOpts {
enabled,
always_on: false,
valid_ratio_threshold: 0.5,
min_char_count: 20,
lang_hint: None,
cancel: None,
}
}
// ── Tests ─────────────────────────────────────────────────────────────────
// Test 1: F1 + enabled=true → in-place mutate
#[test]
fn f1_input_with_ocr_enabled_replaces_empty_block() {
let bytes = f1_pdf_bytes();
let mut canonical = canonical_with_empty_block();
let engine = MockOcrEngine::single("MOCK_OCR_TEXT", false);
let opts = PdfOcrOpts {
enabled: true,
always_on: false,
valid_ratio_threshold: 0.5,
min_char_count: 20,
lang_hint: Some(Lang("kor".into())),
cancel: None,
};
let summary = apply_ocr_to_pdf_pages(&mut canonical, &engine, &bytes, &opts, |_| {}).unwrap();
assert_eq!(summary.pages_ocrd, 1);
let first_para = canonical.blocks.iter().find_map(|b| match b {
Block::Paragraph(tb) => Some(tb),
_ => None,
});
assert!(first_para.is_some());
assert_eq!(first_para.unwrap().text, "MOCK_OCR_TEXT");
}
// Test 2: F3 vector (mock filled canonical) + enabled=true → OCR skip (needs_ocr=false)
#[test]
fn f3_input_with_ocr_enabled_keeps_text_detect_blocks() {
let bytes = f1_pdf_bytes(); // reuse F1 bytes; decision is based on canonical text
let text = "충분한 한국어 텍스트 컨텐츠입니다. This has more than twenty characters.";
let mut canonical = canonical_with_filled_block(text);
let engine = MockOcrEngine::single("SHOULD_NOT_BE_CALLED", false);
let opts = default_opts(true);
let summary = apply_ocr_to_pdf_pages(&mut canonical, &engine, &bytes, &opts, |_| {}).unwrap();
assert_eq!(summary.pages_ocrd, 0, "vector PDF must trigger 0 OCR calls");
let first_para = canonical.blocks.iter().find_map(|b| match b {
Block::Paragraph(tb) => Some(tb),
_ => None,
});
if let Some(tb) = first_para {
assert!(tb.text.starts_with("충분한"), "original text must be preserved");
}
}
// Test 3: F1 + enabled=false → no-op
#[test]
fn f1_input_with_ocr_disabled_keeps_empty_block() {
let bytes = f1_pdf_bytes();
let mut canonical = canonical_with_empty_block();
let engine = MockOcrEngine::single("IGNORED", false);
let opts = default_opts(false);
let summary = apply_ocr_to_pdf_pages(&mut canonical, &engine, &bytes, &opts, |_| {}).unwrap();
assert_eq!(summary.pages_ocrd, 0);
assert_eq!(summary.ms_total, 0);
}
// Test 4: mojibake canonical (PUA chars) + enabled=true → in-place mutate
#[test]
fn f4_input_with_ocr_enabled_replaces_mojibake_block() {
let bytes = f1_pdf_bytes(); // F1 bytes carry DCTDecode image
let mut canonical = canonical_with_mojibake_block();
let engine = MockOcrEngine::single("OCR_MOJIBAKE_REPLACEMENT", false);
let opts = PdfOcrOpts {
enabled: true,
always_on: false,
valid_ratio_threshold: 0.5,
min_char_count: 20,
lang_hint: None,
cancel: None,
};
let summary = apply_ocr_to_pdf_pages(&mut canonical, &engine, &bytes, &opts, |_| {}).unwrap();
assert_eq!(summary.pages_ocrd, 1, "mojibake page must be OCR'd");
let first_para = canonical.blocks.iter().find_map(|b| match b {
Block::Paragraph(tb) => Some(tb),
_ => None,
});
if let Some(tb) = first_para {
assert_eq!(tb.text, "OCR_MOJIBAKE_REPLACEMENT");
}
}
// Test 5: filled canonical + always_on=true → dual-block (+1 OCR block)
#[test]
fn f3_input_with_always_on_pushes_dual_blocks() {
let bytes = f1_pdf_bytes();
let text = "vector PDF 충분한 텍스트 컨텐츠입니다. This has enough characters for valid ratio.";
let mut canonical = canonical_with_filled_block(text);
let original_block_count = canonical.blocks.len();
let engine = MockOcrEngine::single("OCR_DUAL", false);
let opts = PdfOcrOpts {
enabled: true,
always_on: true,
valid_ratio_threshold: 0.5,
min_char_count: 20,
lang_hint: None,
cancel: None,
};
let summary = apply_ocr_to_pdf_pages(&mut canonical, &engine, &bytes, &opts, |_| {}).unwrap();
assert_eq!(summary.pages_ocrd, 1);
assert_eq!(
canonical.blocks.len(),
original_block_count + 1,
"always_on must push a new Block::Paragraph"
);
let texts: Vec<&str> = canonical
.blocks
.iter()
.filter_map(|b| match b {
Block::Paragraph(tb) => Some(tb.text.as_str()),
_ => None,
})
.collect();
assert!(texts.contains(&"OCR_DUAL"), "OCR block must be present");
assert!(
texts.iter().any(|t| t.starts_with("vector")),
"original text-detect block must be preserved"
);
}
// Test 6: F6 FlateDecode → extract_dctdecode_page_image=None → skip + warning
#[test]
fn f6_flatedecode_skipped_with_warning() {
let bytes = std::fs::read("../kebab-parse-pdf/tests/fixtures/flate_raw.pdf")
.expect("F6 fixture missing");
let mut canonical = canonical_with_empty_block(); // page-1 block from F1
let engine = MockOcrEngine::single("SHOULD_NOT_BE_CALLED", false);
let opts = default_opts(true);
let summary = apply_ocr_to_pdf_pages(&mut canonical, &engine, &bytes, &opts, |_| {}).unwrap();
assert_eq!(
summary.pages_ocrd, 0,
"FlateDecode page is skipped (DCTDecode-only v1 invariant)"
);
let warning_count = canonical
.provenance
.events
.iter()
.filter(|e| e.kind == kebab_core::ProvenanceKind::Warning)
.count();
assert!(warning_count >= 1, "a Warning event must be emitted when a FlateDecode page is skipped");
}
// Test 7: F7 CCITTFax → skip + warning (verifier M-4 split)
#[test]
fn f7_ccittfax_skipped_with_warning() {
let bytes =
std::fs::read("../kebab-parse-pdf/tests/fixtures/ccitt.pdf").expect("F7 fixture missing");
let mut canonical = canonical_with_empty_block(); // page-1 block from F1
let engine = MockOcrEngine::single("SHOULD_NOT_BE_CALLED", false);
let opts = default_opts(true);
let summary = apply_ocr_to_pdf_pages(&mut canonical, &engine, &bytes, &opts, |_| {}).unwrap();
assert_eq!(summary.pages_ocrd, 0, "CCITTFax page is skipped");
let warning_count = canonical
.provenance
.events
.iter()
.filter(|e| e.kind == kebab_core::ProvenanceKind::Warning)
.count();
assert!(warning_count >= 1, "a Warning event must be emitted when a CCITTFax page is skipped");
}
// Test 8: OCR engine failure → warning event + skip
#[test]
fn ocr_engine_failure_surfaces_as_warning() {
let bytes = f1_pdf_bytes();
let mut canonical = canonical_with_empty_block();
let engine = MockOcrEngine::single("", true);
let opts = default_opts(true);
let summary = apply_ocr_to_pdf_pages(&mut canonical, &engine, &bytes, &opts, |_| {}).unwrap();
assert_eq!(summary.pages_ocrd, 0, "pages_ocrd must be 0 on OCR failure");
let warning_with_failure = canonical.provenance.events.iter().any(|e| {
e.kind == kebab_core::ProvenanceKind::Warning
&& e.note.as_deref().unwrap_or("").contains("mock failure")
});
assert!(
warning_with_failure,
"the OCR failure's error message must appear in the warning event's note"
);
}
// Test 9: dual-block ordinals are deterministic and unique
#[test]
fn dual_block_ordinals_are_deterministic_and_unique() {
let bytes = f1_pdf_bytes(); // 1-page PDF → page_count=1
let text = "vector 충분한 텍스트. This text has more than twenty characters total.";
let mut canonical = canonical_with_filled_block(text);
let engine = MockOcrEngine::single("DUAL", false);
let opts = PdfOcrOpts {
enabled: true,
always_on: true,
valid_ratio_threshold: 0.5,
min_char_count: 20,
lang_hint: None,
cancel: None,
};
apply_ocr_to_pdf_pages(&mut canonical, &engine, &bytes, &opts, |_| {}).unwrap();
// page_count=1 → text-detect ordinal=0, ocr ordinal=1 (page_num-1 + page_count = 0+1=1)
let para_count = canonical
.blocks
.iter()
.filter(|b| matches!(b, Block::Paragraph(_)))
.count();
assert_eq!(para_count, 2, "dual-block: text-detect + OCR");
let all_page_1 = canonical
.blocks
.iter()
.filter_map(|b| match b {
Block::Paragraph(tb) => Some(&tb.common.source_span),
_ => None,
})
.all(|s| matches!(s, SourceSpan::Page { page: 1, .. }));
assert!(all_page_1, "both blocks have page=1");
}
// Test 10: cancel handle aborts mid-PDF
#[test]
fn cancel_handle_aborts_mid_pdf() {
let bytes = f1_pdf_bytes();
let mut canonical = canonical_with_empty_block();
let cancel = Arc::new(AtomicBool::new(true)); // pre-cancel
let engine = MockOcrEngine::single("IGNORED", false);
let opts = PdfOcrOpts {
enabled: true,
always_on: false,
valid_ratio_threshold: 0.5,
min_char_count: 20,
lang_hint: None,
cancel: Some(cancel.clone()),
};
let result = apply_ocr_to_pdf_pages(&mut canonical, &engine, &bytes, &opts, |_| {});
let err = result.expect_err("cancel=true must return an error");
assert!(
format!("{err}").contains("cancelled mid-PDF"),
"error message must contain 'cancelled mid-PDF': {err}"
);
}
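
The ordinal comment in Test 9 (`page_num-1 + page_count`) can be illustrated with a small sketch. The helper names `text_ordinal`, `ocr_ordinal`, and `all_ordinals` are hypothetical and not taken from the crate. The sketch only assumes what the comment describes: text-detect blocks take ordinals `0..page_count`, and OCR blocks are offset by `page_count`, so the two ranges never collide.

```rust
/// Hypothetical helper: ordinal of the text-detect block for a 1-based page.
fn text_ordinal(page_num: u32) -> u32 {
    page_num - 1
}

/// Hypothetical helper: ordinal of the OCR block for the same page,
/// shifted past every text-detect ordinal in the document.
fn ocr_ordinal(page_num: u32, page_count: u32) -> u32 {
    page_num - 1 + page_count
}

/// Collects every ordinal for a document: text-detect first, then OCR.
fn all_ordinals(page_count: u32) -> Vec<u32> {
    let text = (1..=page_count).map(text_ordinal);
    let ocr = (1..=page_count).map(|p| ocr_ordinal(p, page_count));
    text.chain(ocr).collect()
}
```

Under this assumption a 1-page PDF yields ordinals 0 (text-detect) and 1 (OCR), matching the comment in the test. Page numbers are assumed to be 1-based; passing 0 would underflow.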


@@ -0,0 +1,139 @@
//! Integration smoke test: dual-write (ndjson + SQLite) for PDF OCR events.
//! AC-3: SQLite row count and doc_id match ndjson LogEvent::Ocr.
//!
//! Uses wiremock to stub the Ollama `/api/generate` endpoint so the test
//! runs without a live Ollama instance.
mod common;
use std::path::PathBuf;
use common::TestEnv;
use kebab_config::LoggingCfg;
use serde_json::Value;
use tokio::task::spawn_blocking;
use wiremock::matchers::{method, path};
use wiremock::{Mock, MockServer, ResponseTemplate};
fn scanned_pdf_src() -> PathBuf {
PathBuf::from(env!("CARGO_MANIFEST_DIR"))
.parent()
.unwrap()
.join("kebab-parse-pdf/tests/fixtures/scanned_page1.pdf")
}
/// AC-3: ndjson OCR line count == pdf_ocr_events row count, and doc_id matches.
#[tokio::test]
async fn ingest_dual_write_doc_id_matches_ndjson() {
let src = scanned_pdf_src();
if !src.exists() {
eprintln!("skipping test: scanned_page1.pdf fixture not found");
return;
}
let server = MockServer::start().await;
// Stub Ollama /api/generate to return a minimal OCR response.
Mock::given(method("POST"))
.and(path("/api/generate"))
.respond_with(ResponseTemplate::new(200).set_body_json(serde_json::json!({
"model": "qwen2.5vl:3b",
"response": "test ocr output",
"done": true,
"done_reason": "stop"
})))
.mount(&server)
.await;
let mock_url = server.uri();
let result = spawn_blocking(move || {
let mut env = TestEnv::lexical_only();
// Enable PDF OCR + set up mock endpoint
env.config.pdf.ocr.enabled = true;
env.config.pdf.ocr.endpoint = Some(mock_url.clone());
env.config.pdf.ocr.model = "qwen2.5vl:3b".to_string();
// Enable ingest log
let log_dir = env.temp.path().join("logs");
std::fs::create_dir_all(&log_dir).unwrap();
env.config.logging = LoggingCfg {
ingest_log_enabled: true,
ingest_log_dir: log_dir.clone(),
..Default::default()
};
// Copy scanned PDF into workspace
let dest = env.workspace_root.join("scanned.pdf");
std::fs::copy(scanned_pdf_src(), &dest).expect("copy scanned PDF");
// Run ingest
kebab_app::ingest_with_config(env.config.clone(), env.scope(), false).expect("ingest");
// Read ndjson log
let log_files: Vec<_> = std::fs::read_dir(&log_dir)
.unwrap()
.filter_map(Result::ok)
.filter(|e| {
let name = e.file_name().to_string_lossy().to_string();
name.starts_with("ingest-") && name.ends_with(".ndjson")
})
.collect();
assert_eq!(log_files.len(), 1, "expected 1 ndjson log file");
let body = std::fs::read_to_string(log_files[0].path()).unwrap();
let ocr_lines: Vec<Value> = body
.lines()
.filter_map(|l| serde_json::from_str(l).ok())
.filter(|v: &Value| v.get("kind").and_then(Value::as_str) == Some("ocr"))
.collect();
// Read pdf_ocr_events from SQLite
let db_path = PathBuf::from(&env.config.storage.data_dir).join("kebab.sqlite");
let conn = rusqlite::Connection::open(&db_path).expect("open db");
let rows: Vec<(Option<String>, String)> = {
let mut stmt = conn
.prepare("SELECT doc_id, doc_path FROM pdf_ocr_events ORDER BY id")
.expect("prepare");
stmt.query_map([], |r| Ok((r.get(0)?, r.get(1)?)))
.expect("query")
.map(|r| r.expect("row"))
.collect()
};
(ocr_lines, rows)
})
.await
.expect("spawn_blocking");
let (ocr_lines, rows) = result;
// At least one OCR event must be produced
assert!(!ocr_lines.is_empty(), "expected ≥1 ndjson ocr line");
assert!(!rows.is_empty(), "expected ≥1 pdf_ocr_events row");
// Row counts must match
assert_eq!(
ocr_lines.len(),
rows.len(),
"ndjson ocr lines ({}) must equal pdf_ocr_events rows ({})",
ocr_lines.len(),
rows.len()
);
// doc_id in both sources must be non-null and consistent
for (line, (sql_doc_id, _sql_doc_path)) in ocr_lines.iter().zip(rows.iter()) {
let json_doc_id = line.get("doc_id").and_then(Value::as_str);
assert!(
json_doc_id.is_some(),
"ndjson ocr line should have doc_id: {line}"
);
assert!(
sql_doc_id.is_some(),
"pdf_ocr_events row should have doc_id"
);
assert_eq!(
json_doc_id,
sql_doc_id.as_deref(),
"ndjson doc_id must equal SQLite doc_id"
);
}
}
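
The consistency check at the end of the test above (equal counts, non-null `doc_id`, pairwise equality) can be sketched without any I/O. The function name `first_mismatch` and the `Option<&str>` slice inputs are assumptions for illustration only; the real test reads these values from the ndjson log and the `pdf_ocr_events` table.

```rust
/// Hypothetical helper: returns the index of the first inconsistent pair,
/// or `None` when both sources agree.
///
/// A pair is inconsistent when either `doc_id` is missing or the two differ.
/// A length mismatch is reported at the index where the shorter list ends.
fn first_mismatch(ndjson_ids: &[Option<&str>], sqlite_ids: &[Option<&str>]) -> Option<usize> {
    if ndjson_ids.len() != sqlite_ids.len() {
        return Some(ndjson_ids.len().min(sqlite_ids.len()));
    }
    ndjson_ids
        .iter()
        .zip(sqlite_ids)
        .position(|(a, b)| a.is_none() || a != b)
}
```

Returning an index rather than a boolean is a design choice for this sketch: it lets a failing assertion point at the offending event.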


@@ -46,17 +46,13 @@ fn build_text_pdf(pages: &[Option<&str>]) -> Vec<u8> {
operations: vec![
Operation::new("BT", vec![]),
Operation::new("Tf", vec!["F1".into(), 24.into()]),
Operation::new(
"Td",
vec![Object::Integer(100), Object::Integer(700)],
),
Operation::new("Td", vec![Object::Integer(100), Object::Integer(700)]),
Operation::new("Tj", vec![Object::string_literal(*text)]),
Operation::new("ET", vec![]),
],
};
let stream_data = content.encode().expect("content encode");
let content_id =
doc.add_object(Stream::new(dictionary! {}, stream_data));
let content_id = doc.add_object(Stream::new(dictionary! {}, stream_data));
page_dict.set("Contents", content_id);
}
let page_id = doc.add_object(page_dict);
@@ -76,8 +72,7 @@ fn build_text_pdf(pages: &[Option<&str>]) -> Vec<u8> {
Object::Integer(842),
],
};
doc.objects
.insert(pages_id, Object::Dictionary(pages_dict));
doc.objects.insert(pages_id, Object::Dictionary(pages_dict));
let catalog_id = doc.add_object(dictionary! {
"Type" => "Catalog",
@@ -146,9 +141,8 @@ fn ingest_3_page_pdf_produces_one_doc_and_per_page_chunks() {
write_pdf(&env.workspace_root, "three.pdf", &bytes);
let cfg = cfg_with_pdf(&env);
let report =
kebab_app::ingest_with_config(cfg.clone(), env.scope(), false)
.expect("PDF ingest must succeed");
let report = kebab_app::ingest_with_config(cfg.clone(), env.scope(), false)
.expect("PDF ingest must succeed");
assert_eq!(report.errors, 0);
let items = report.items.as_ref().expect("items present");
@@ -157,23 +151,28 @@ fn ingest_3_page_pdf_produces_one_doc_and_per_page_chunks() {
.find(|i| i.doc_path.0.ends_with("three.pdf"))
.expect("PDF item present");
assert_eq!(pdf_item.kind, IngestItemKind::New);
assert_eq!(pdf_item.block_count, Some(3), "one Block::Paragraph per page");
assert_eq!(pdf_item.chunk_count, Some(3), "one chunk per non-empty page");
assert_eq!(
pdf_item.block_count,
Some(3),
"one Block::Paragraph per page"
);
assert_eq!(
pdf_item.chunk_count,
Some(3),
"one chunk per non-empty page"
);
assert_eq!(
pdf_item.parser_version.as_ref().map(|p| p.0.as_str()),
Some("pdf-text-v1")
);
assert_eq!(
pdf_item.chunker_version.as_ref().map(|c| c.0.as_str()),
Some("pdf-page-v1")
Some("pdf-page-v1.1")
);
// Inspect the stored doc to confirm SourceSpan::Page round-trip.
let doc = kebab_app::inspect_doc_with_config(
cfg,
pdf_item.doc_id.as_ref().unwrap(),
)
.expect("inspect_doc returns the PDF document");
let doc = kebab_app::inspect_doc_with_config(cfg, pdf_item.doc_id.as_ref().unwrap())
.expect("inspect_doc returns the PDF document");
assert_eq!(doc.blocks.len(), 3);
for (i, block) in doc.blocks.iter().enumerate() {
let want_page = (i as u32) + 1;
@@ -202,8 +201,7 @@ fn re_ingest_identical_pdf_produces_unchanged_with_same_doc_id() {
write_pdf(&env.workspace_root, "stable.pdf", &bytes);
let cfg = cfg_with_pdf(&env);
let report1 =
kebab_app::ingest_with_config(cfg.clone(), env.scope(), false).unwrap();
let report1 = kebab_app::ingest_with_config(cfg.clone(), env.scope(), false).unwrap();
let item1 = report1
.items
.as_ref()
@@ -214,8 +212,7 @@ fn re_ingest_identical_pdf_produces_unchanged_with_same_doc_id() {
.unwrap();
assert_eq!(item1.kind, IngestItemKind::New);
let report2 =
kebab_app::ingest_with_config(cfg.clone(), env.scope(), false).unwrap();
let report2 = kebab_app::ingest_with_config(cfg.clone(), env.scope(), false).unwrap();
let item2 = report2
.items
.unwrap()
@@ -239,8 +236,7 @@ fn re_ingest_edited_pdf_produces_new_doc_id() {
std::fs::write(&path, &bytes_v1).unwrap();
let cfg = cfg_with_pdf(&env);
let report_v1 =
kebab_app::ingest_with_config(cfg.clone(), env.scope(), false).unwrap();
let report_v1 = kebab_app::ingest_with_config(cfg.clone(), env.scope(), false).unwrap();
let id_v1 = report_v1
.items
.as_ref()
@@ -252,12 +248,10 @@ fn re_ingest_edited_pdf_produces_new_doc_id() {
.clone()
.unwrap();
let bytes_v2 =
build_text_pdf(&[Some("VERSION TWO entirely different body content.")]);
let bytes_v2 = build_text_pdf(&[Some("VERSION TWO entirely different body content.")]);
std::fs::write(&path, &bytes_v2).unwrap();
let report_v2 =
kebab_app::ingest_with_config(cfg.clone(), env.scope(), false).unwrap();
let report_v2 = kebab_app::ingest_with_config(cfg.clone(), env.scope(), false).unwrap();
let item_v2 = report_v2
.items
.as_ref()
@@ -282,9 +276,11 @@ fn encrypted_pdf_fails_with_qpdf_hint() {
write_pdf(&env.workspace_root, "secret.pdf", &bytes);
let cfg = cfg_with_pdf(&env);
let report =
kebab_app::ingest_with_config(cfg, env.scope(), false).unwrap();
assert_eq!(report.errors, 1, "encrypted PDF must increment errors exactly once");
let report = kebab_app::ingest_with_config(cfg, env.scope(), false).unwrap();
assert_eq!(
report.errors, 1,
"encrypted PDF must increment errors exactly once"
);
let items = report.items.as_ref().unwrap();
let pdf_item = items
.iter()
@@ -310,9 +306,11 @@ fn corrupt_pdf_fails_without_storing() {
write_pdf(&env.workspace_root, "corrupt.pdf", &bytes);
let cfg = cfg_with_pdf(&env);
let report =
kebab_app::ingest_with_config(cfg.clone(), env.scope(), false).unwrap();
assert_eq!(report.errors, 1, "corrupt PDF must increment errors exactly once");
let report = kebab_app::ingest_with_config(cfg.clone(), env.scope(), false).unwrap();
assert_eq!(
report.errors, 1,
"corrupt PDF must increment errors exactly once"
);
let items = report.items.as_ref().unwrap();
let pdf_item = items
.iter()
@@ -322,11 +320,8 @@ fn corrupt_pdf_fails_without_storing() {
// Confirm the doc was NOT stored — list_docs returns nothing for
// this path.
let summaries = kebab_app::list_docs_with_config(
cfg,
kebab_core::DocFilter::default(),
)
.unwrap();
let summaries =
kebab_app::list_docs_with_config(cfg, kebab_core::DocFilter::default()).unwrap();
assert!(
!summaries
.iter()
@@ -341,14 +336,15 @@ fn corrupt_pdf_fails_without_storing() {
#[test]
fn mixed_page_pdf_stores_asset_with_scanned_candidate_warning() {
let env = TestEnv::lexical_only();
let bytes =
build_text_pdf(&[Some("first page"), None, Some("third page")]);
let bytes = build_text_pdf(&[Some("first page"), None, Some("third page")]);
write_pdf(&env.workspace_root, "mixed.pdf", &bytes);
let cfg = cfg_with_pdf(&env);
let report =
kebab_app::ingest_with_config(cfg.clone(), env.scope(), false).unwrap();
assert_eq!(report.errors, 0, "scanned candidate is a Warning, not Error");
let report = kebab_app::ingest_with_config(cfg.clone(), env.scope(), false).unwrap();
assert_eq!(
report.errors, 0,
"scanned candidate is a Warning, not Error"
);
let pdf_item = report
.items
.as_ref()
@@ -365,14 +361,10 @@ fn mixed_page_pdf_stores_asset_with_scanned_candidate_warning() {
assert_eq!(
pdf_item.chunk_count,
Some(2),
"pdf-page-v1 emits 0 chunks for the empty page; total = 2"
"pdf-page-v1.1 emits 0 chunks for the empty page; total = 2"
);
let doc = kebab_app::inspect_doc_with_config(
cfg,
pdf_item.doc_id.as_ref().unwrap(),
)
.unwrap();
let doc = kebab_app::inspect_doc_with_config(cfg, pdf_item.doc_id.as_ref().unwrap()).unwrap();
let warnings: Vec<_> = doc
.provenance
.events
@@ -419,8 +411,7 @@ fn ingest_report_arithmetic_invariant_holds_with_corrupt_pdf() {
write_pdf(&env.workspace_root, "broken.pdf", &corrupt_pdf());
let cfg = cfg_with_pdf(&env);
let report =
kebab_app::ingest_with_config(cfg, env.scope(), false).unwrap();
let report = kebab_app::ingest_with_config(cfg, env.scope(), false).unwrap();
let total = report.new + report.updated + report.skipped + report.errors;
assert_eq!(
report.scanned, total,
@@ -441,14 +432,12 @@ fn long_pdf_round_trips_through_lexical_pipeline() {
let pages: Vec<String> = (1..=50)
.map(|i| format!("Page {i} body — lorem ipsum dolor sit amet."))
.collect();
let page_refs: Vec<Option<&str>> =
pages.iter().map(|s| Some(s.as_str())).collect();
let page_refs: Vec<Option<&str>> = pages.iter().map(|s| Some(s.as_str())).collect();
let bytes = build_text_pdf(&page_refs);
write_pdf(&env.workspace_root, "long.pdf", &bytes);
let cfg = cfg_with_pdf(&env);
let report =
kebab_app::ingest_with_config(cfg.clone(), env.scope(), false).unwrap();
let report = kebab_app::ingest_with_config(cfg.clone(), env.scope(), false).unwrap();
assert_eq!(report.errors, 0);
let pdf_item = report
.items
@@ -466,8 +455,7 @@ fn long_pdf_round_trips_through_lexical_pipeline() {
// Round-trip: list_docs sees the long PDF.
let summaries =
kebab_app::list_docs_with_config(cfg, kebab_core::DocFilter::default())
.unwrap();
kebab_app::list_docs_with_config(cfg, kebab_core::DocFilter::default()).unwrap();
assert!(summaries.iter().any(|s| s.doc_path.0.ends_with("long.pdf")));
}
@@ -476,13 +464,11 @@ fn long_pdf_round_trips_through_lexical_pipeline() {
#[test]
fn inspect_doc_surfaces_page_spans() {
let env = TestEnv::lexical_only();
let bytes =
build_text_pdf(&[Some("alpha body"), Some("beta body"), Some("gamma body")]);
let bytes = build_text_pdf(&[Some("alpha body"), Some("beta body"), Some("gamma body")]);
write_pdf(&env.workspace_root, "inspect.pdf", &bytes);
let cfg = cfg_with_pdf(&env);
let report =
kebab_app::ingest_with_config(cfg.clone(), env.scope(), false).unwrap();
let report = kebab_app::ingest_with_config(cfg.clone(), env.scope(), false).unwrap();
let pdf_item = report
.items
.as_ref()
@@ -490,19 +476,12 @@ fn inspect_doc_surfaces_page_spans() {
.iter()
.find(|i| i.doc_path.0.ends_with("inspect.pdf"))
.unwrap();
let doc = kebab_app::inspect_doc_with_config(
cfg,
pdf_item.doc_id.as_ref().unwrap(),
)
.unwrap();
let doc = kebab_app::inspect_doc_with_config(cfg, pdf_item.doc_id.as_ref().unwrap()).unwrap();
assert_eq!(doc.parser_version.0, "pdf-text-v1");
assert_eq!(doc.blocks.len(), 3);
for block in &doc.blocks {
match block {
Block::Paragraph(p) => assert!(matches!(
p.common.source_span,
SourceSpan::Page { .. }
)),
Block::Paragraph(p) => assert!(matches!(p.common.source_span, SourceSpan::Page { .. })),
other => panic!("expected Paragraph, got {other:?}"),
}
}
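
The arithmetic invariant asserted in `ingest_report_arithmetic_invariant_holds_with_corrupt_pdf` (`scanned == new + updated + skipped + errors`) can be stated as a standalone check. The `Counts` struct below is a hypothetical stand-in for the report counters; the field names mirror those used in the test, but the integer type is an assumption.

```rust
/// Hypothetical stand-in for the counters on an ingest report.
#[derive(Debug, Default, Clone, Copy)]
struct Counts {
    scanned: u64,
    new: u64,
    updated: u64,
    skipped: u64,
    errors: u64,
}

impl Counts {
    /// True when every scanned file is accounted for in exactly one bucket.
    fn is_balanced(&self) -> bool {
        self.scanned == self.new + self.updated + self.skipped + self.errors
    }
}
```

For example, three scanned files split into one new, one skipped, and one errored file balance, while a report that drops the errored file from the counters does not.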


@@ -78,19 +78,15 @@ fn reset_orphans_only_purges_out_of_scope_docs() {
narrow_cfg.workspace.exclude = vec!["b.rs".to_string(), "c.rs".to_string()];
// Run orphans-only reset.
let report = execute(ResetScope::OrphansOnly, &narrow_cfg)
.expect("orphans-only reset must succeed");
let report =
execute(ResetScope::OrphansOnly, &narrow_cfg).expect("orphans-only reset must succeed");
assert_eq!(
report.orphans_purged, 2,
"expected 2 orphans purged (b.rs + c.rs): {report:?}"
);
let mut purged: Vec<String> = report
.purged_paths
.iter()
.map(|p| p.0.clone())
.collect();
let mut purged: Vec<String> = report.purged_paths.iter().map(|p| p.0.clone()).collect();
purged.sort();
assert_eq!(
purged,


@@ -0,0 +1,79 @@
//! Integration tests for Bug #13: schema.v1.models.active_parsers + active_chunkers.
use kebab_app::schema_with_config;
use kebab_config::Config;
use kebab_core::SourceScope;
fn minimal_config(data_dir: &std::path::Path, workspace_root: &std::path::Path) -> Config {
let mut cfg = Config::defaults();
cfg.workspace.root = workspace_root.to_string_lossy().into_owned();
cfg.workspace.exclude.clear();
cfg.storage.data_dir = data_dir.to_string_lossy().into_owned();
cfg.storage.model_dir = data_dir.join("models").to_string_lossy().into_owned();
cfg.models.embedding.provider = "none".to_string();
cfg.models.embedding.dimensions = 0;
cfg.chunking.target_tokens = 80;
cfg.chunking.overlap_tokens = 20;
cfg
}
fn minimal_scope(workspace_root: &std::path::Path) -> SourceScope {
SourceScope {
root: workspace_root.to_path_buf(),
include: vec![],
exclude: vec![],
}
}
#[test]
fn schema_models_active_arrays_empty_on_empty_corpus() {
let dir = tempfile::tempdir().unwrap();
let workspace = dir.path().join("kb");
std::fs::create_dir_all(&workspace).unwrap();
let cfg = minimal_config(dir.path(), &workspace);
let store = kebab_store_sqlite::SqliteStore::open(&cfg).unwrap();
store.run_migrations().unwrap();
drop(store);
let s = schema_with_config(&cfg).unwrap();
assert!(
s.models.active_parsers.is_empty(),
"empty corpus → no parsers"
);
assert!(
s.models.active_chunkers.is_empty(),
"empty corpus → no chunkers"
);
// backward compat: the existing single field keeps the markdown default.
assert_eq!(s.models.parser_version, kebab_parse_md::PARSER_VERSION);
}
#[test]
fn schema_emits_active_parsers_and_chunkers_array_after_ingest() {
let dir = tempfile::tempdir().unwrap();
let workspace = dir.path().join("kb");
std::fs::create_dir_all(&workspace).unwrap();
std::fs::write(workspace.join("a.md"), "# A\nhello world\n").unwrap();
let cfg = minimal_config(dir.path(), &workspace);
let scope = minimal_scope(&workspace);
kebab_app::ingest_with_config(cfg.clone(), scope, false).unwrap();
let s = schema_with_config(&cfg).unwrap();
assert!(
!s.models.active_parsers.is_empty(),
"active_parsers populated after ingest"
);
assert!(
!s.models.active_chunkers.is_empty(),
"active_chunkers populated after ingest"
);
// active arrays must be sorted (ORDER BY in SQL).
let mut sorted = s.models.active_parsers.clone();
sorted.sort();
assert_eq!(
s.models.active_parsers, sorted,
"active_parsers must be sorted"
);
}


@@ -57,7 +57,7 @@ fn schema_report_reflects_freshly_ingested_kb() {
schema.wire.schemas
);
assert!(schema.capabilities.json_mode);
assert!(!schema.capabilities.streaming_ask);
assert!(schema.capabilities.streaming_ask); // Bug #9: streaming_ask is now true
assert!(
schema.capabilities.mcp_server,
"mcp_server should be true after fb-30",


@@ -27,7 +27,10 @@ fn search_with_opts_no_budget_matches_search() {
assert_eq!(resp.hits.len(), baseline.len());
assert!(!resp.truncated);
assert!(resp.next_cursor.is_none(), "k=5 against 1 doc → no next page");
assert!(
resp.next_cursor.is_none(),
"k=5 against 1 doc → no next page"
);
}
#[test]
@@ -62,7 +65,11 @@ fn budget_truncates_snippets_when_below_threshold() {
fn cursor_paginates_to_next_page() {
let env = common::TestEnv::new();
for i in 0..6 {
common::ingest_md(&env, &format!("d{i}.md"), &format!("# T{i}\n\nrust topic {i}\n"));
common::ingest_md(
&env,
&format!("d{i}.md"),
&format!("# T{i}\n\nrust topic {i}\n"),
);
}
let app = env.app();
@@ -88,7 +95,10 @@ fn cursor_paginates_to_next_page() {
page1.hits.iter().map(|h| h.chunk_id.0.clone()).collect();
let p2_ids: std::collections::HashSet<_> =
page2.hits.iter().map(|h| h.chunk_id.0.clone()).collect();
assert!(p1_ids.is_disjoint(&p2_ids), "page 2 must not repeat page 1 hits");
assert!(
p1_ids.is_disjoint(&p2_ids),
"page 2 must not repeat page 1 hits"
);
}
#[test]


@@ -75,11 +75,9 @@ fn lexical_multi_token_korean_query_hits() {
kebab_app::ingest_with_config(env.config.clone(), env.scope(), true)
.expect("ingest must succeed");
let hits = kebab_app::search_with_config(
env.config.clone(),
common::lexical_query("해시 충돌"),
)
.expect("search must succeed");
let hits =
kebab_app::search_with_config(env.config.clone(), common::lexical_query("해시 충돌"))
.expect("search must succeed");
assert!(
!hits.is_empty(),
@@ -113,11 +111,9 @@ fn lexical_mixed_korean_english_multi_token_query_hits() {
kebab_app::ingest_with_config(env.config.clone(), env.scope(), true)
.expect("ingest must succeed");
let hits = kebab_app::search_with_config(
env.config.clone(),
common::lexical_query("Rust 충돌은"),
)
.expect("search must succeed");
let hits =
kebab_app::search_with_config(env.config.clone(), common::lexical_query("Rust 충돌은"))
.expect("search must succeed");
assert!(
!hits.is_empty(),
@@ -131,3 +127,71 @@ fn lexical_mixed_korean_english_multi_token_query_hits() {
hits.iter().map(|h| &h.doc_path.0).collect::<Vec<_>>()
);
}
// ── S7 V009 morphological tokenizer end-to-end tests ─────────────────
/// S7 (V009 morphological tokenizer): a two-character Korean query hits on the
/// end-to-end lexical path. lindera ko-dic splits '한국어를' into the morpheme
/// '한국어' and '서울은' into '서울', records them in the
/// tokenized_korean_text column, and FTS5 matches against that column.
#[test]
fn korean_morphological_2char_query_lexical_mode() {
let env = TestEnv::lexical_only();
let doc_path = env.workspace_root.join("korean-wiki.md");
std::fs::write(
&doc_path,
"# 한국어 위키\n\n한국어를 공부합니다.\n서울은 한국의 수도입니다.\n",
)
.expect("write korean-wiki fixture");
kebab_app::ingest_with_config(env.config.clone(), env.scope(), true)
.expect("ingest must succeed");
let hits = kebab_app::search_with_config(env.config.clone(), common::lexical_query("한국"))
.expect("search 한국");
assert!(
!hits.is_empty(),
"'한국' 2-char Korean query must return at least one hit (V009 morphological); got {:?}",
hits.iter().map(|h| &h.doc_path.0).collect::<Vec<_>>()
);
let hits = kebab_app::search_with_config(env.config.clone(), common::lexical_query("서울"))
.expect("search 서울");
assert!(
!hits.is_empty(),
"'서울' 2-char Korean query must return at least one hit; got {:?}",
hits.iter().map(|h| &h.doc_path.0).collect::<Vec<_>>()
);
}
/// S7 (V009 morphological tokenizer): a mixed Korean-English query hits lexically.
/// 'Rust' (English whole-token) and '최적화' (Korean morpheme) each hit.
#[test]
fn korean_morphological_mixed_english_korean_query() {
let env = TestEnv::lexical_only();
let doc_path = env.workspace_root.join("rust-optimization.md");
std::fs::write(
&doc_path,
"# Rust 최적화 노트\n\nRust 최적화는 zero-cost abstraction 을 강조한다.\n",
)
.expect("write rust-optimization fixture");
kebab_app::ingest_with_config(env.config.clone(), env.scope(), true)
.expect("ingest must succeed");
let hits = kebab_app::search_with_config(env.config.clone(), common::lexical_query("Rust"))
.expect("search Rust");
assert!(
!hits.is_empty(),
"'Rust' English whole-token must hit; got {:?}",
hits.iter().map(|h| &h.doc_path.0).collect::<Vec<_>>()
);
let hits = kebab_app::search_with_config(env.config.clone(), common::lexical_query("최적화"))
.expect("search 최적화");
assert!(
!hits.is_empty(),
"'최적화' Korean morpheme must hit; got {:?}",
hits.iter().map(|h| &h.doc_path.0).collect::<Vec<_>>()
);
}


@@ -35,8 +35,8 @@ fn lexical_search_returns_hits_after_ingest() {
fn lexical_search_empty_query_returns_empty() {
let env = TestEnv::lexical_only();
kebab_app::ingest_with_config(env.config.clone(), env.scope(), true).unwrap();
let hits = kebab_app::search_with_config(env.config.clone(), common::lexical_query(" "))
.unwrap();
let hits =
kebab_app::search_with_config(env.config.clone(), common::lexical_query(" ")).unwrap();
assert!(hits.is_empty(), "blank query must short-circuit empty");
}
@@ -107,20 +107,25 @@ fn search_uncached_returns_same_hits_as_cached() {
#[test]
fn first_ingest_bumps_corpus_revision() {
let env = TestEnv::lexical_only();
let store_before =
kebab_store_sqlite::SqliteStore::open(&env.config).unwrap();
let store_before = kebab_store_sqlite::SqliteStore::open(&env.config).unwrap();
store_before.run_migrations().unwrap();
assert_eq!(store_before.corpus_revision(), 0, "fresh store seeds 0");
// V004 seeds 0; V009 + V010 + V011 migrations each bump by 1 to
// invalidate stale LRU caches (spec §5.2). Baseline before ingest = 3.
// (V012 derivation_cache + V013 drop-chunk-aliases are structural/additive
// — neither bumps corpus_revision.)
let baseline = store_before.corpus_revision();
assert_eq!(baseline, 3, "fresh store post-V011 baseline = 3");
let report =
kebab_app::ingest_with_config(env.config.clone(), env.scope(), true).unwrap();
assert!(report.new + report.updated > 0, "first ingest must commit ≥1 doc");
let store_after =
kebab_store_sqlite::SqliteStore::open(&env.config).unwrap();
let report = kebab_app::ingest_with_config(env.config.clone(), env.scope(), true).unwrap();
assert!(
store_after.corpus_revision() >= 1,
"ingest commit must bump corpus_revision (got {})",
report.new + report.updated > 0,
"first ingest must commit ≥1 doc"
);
let store_after = kebab_store_sqlite::SqliteStore::open(&env.config).unwrap();
assert!(
store_after.corpus_revision() > baseline,
"ingest commit must bump corpus_revision past baseline {baseline} (got {})",
store_after.corpus_revision(),
);
}
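
The baseline of 3 in `first_ingest_bumps_corpus_revision` follows from the comment: V004 seeds the counter at 0, V009, V010, and V011 each bump it by 1, and V012 and V013 do not. A minimal sketch of that arithmetic, assuming a hypothetical table of (migration, bumps) pairs that is not part of the store's real API:

```rust
/// Hypothetical table: migration name paired with whether it bumps
/// corpus_revision, as described in the test comment.
const MIGRATIONS: &[(&str, bool)] = &[
    ("V004", false), // seeds the counter at 0
    ("V009", true),
    ("V010", true),
    ("V011", true),
    ("V012", false), // derivation_cache: structural, no bump
    ("V013", false), // drop-chunk-aliases: structural, no bump
];

/// Revision a fresh store reports after all migrations have run,
/// starting from a seed of 0.
fn baseline_revision(migrations: &[(&str, bool)]) -> u64 {
    migrations.iter().filter(|m| m.1).count() as u64
}
```

Asserting `> baseline` rather than a fixed post-ingest value, as the updated test does, keeps the check valid however many times a single ingest bumps the counter.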


@@ -29,7 +29,9 @@ fn fresh_doc_is_not_stale_with_default_threshold() {
assert!(
hits.iter().all(|h| !h.stale),
"freshly-ingested doc must not be stale at default 30d threshold: {:?}",
hits.iter().map(|h| (h.doc_path.0.clone(), h.stale)).collect::<Vec<_>>()
hits.iter()
.map(|h| (h.doc_path.0.clone(), h.stale))
.collect::<Vec<_>>()
);
}
@@ -50,7 +52,9 @@ fn threshold_zero_disables_staleness() {
assert!(
hits.iter().all(|h| !h.stale),
"threshold=0 disables staleness even for year-old docs: {:?}",
hits.iter().map(|h| (h.doc_path.0.clone(), h.stale)).collect::<Vec<_>>()
hits.iter()
.map(|h| (h.doc_path.0.clone(), h.stale))
.collect::<Vec<_>>()
);
}


@@ -14,7 +14,8 @@ use common::TestEnv;
fn require_avx_or_panic() {
#[cfg(target_arch = "x86_64")]
{
assert!(std::is_x86_feature_detected!("avx"),
assert!(
std::is_x86_feature_detected!("avx"),
"kb-app vector integration test requires AVX-capable hardware; \
host CPU lacks AVX. Run on an AVX-capable machine."
);
@@ -28,8 +29,7 @@ fn ingest_then_hybrid_search_returns_hits() {
require_avx_or_panic();
let env = TestEnv::with_embeddings();
let report =
kebab_app::ingest_with_config(env.config.clone(), env.scope(), true).unwrap();
let report = kebab_app::ingest_with_config(env.config.clone(), env.scope(), true).unwrap();
assert_eq!(report.errors, 0, "no per-file errors: {report:?}");
assert_eq!(report.new, 3);
@@ -55,8 +55,7 @@ fn ingest_then_vector_search_carries_embedding_model() {
require_avx_or_panic();
let env = TestEnv::with_embeddings();
let report =
kebab_app::ingest_with_config(env.config.clone(), env.scope(), true).unwrap();
let report = kebab_app::ingest_with_config(env.config.clone(), env.scope(), true).unwrap();
assert_eq!(report.errors, 0, "no per-file errors: {report:?}");
assert_eq!(report.new, 3);


@@ -13,11 +13,7 @@ fn unsupported_extension_skip_carries_warning_and_is_aggregated() {
std::fs::write(workspace_root.join("legacy.docx"), b"unsupported").unwrap();
std::fs::write(workspace_root.join("Makefile"), b"unsupported").unwrap();
let report = kebab_app::ingest_with_config(
env.config.clone(),
env.scope(),
false,
).unwrap();
let report = kebab_app::ingest_with_config(env.config.clone(), env.scope(), false).unwrap();
let items = report.items.as_ref().expect("items array populated");
let docx_item = items
@@ -39,5 +35,8 @@ fn unsupported_extension_skip_carries_warning_and_is_aggregated() {
vec!["unsupported media type: <no-ext>".to_string()],
);
assert_eq!(report.skipped_by_extension.get("docx").copied(), Some(1));
assert_eq!(report.skipped_by_extension.get("<no-ext>").copied(), Some(1));
assert_eq!(
report.skipped_by_extension.get("<no-ext>").copied(),
Some(1)
);
}


@@ -44,8 +44,8 @@ fn twin_files_fetch_span_uses_correct_asset() {
std::fs::write(dir_b.join("note.md"), content).unwrap();
// Ingest all files (fixture workspace + our two new twins).
let report = ingest_with_config(env.config.clone(), env.scope(), false)
.expect("ingest must succeed");
let report =
ingest_with_config(env.config.clone(), env.scope(), false).expect("ingest must succeed");
assert_eq!(report.errors, 0, "no ingest errors; report={report:?}");
// Both twin paths must appear as New in the report.
@@ -53,8 +53,7 @@ fn twin_files_fetch_span_uses_correct_asset() {
let twin_items: Vec<_> = items
.iter()
.filter(|i| {
i.doc_path.0.ends_with("src_a/note.md")
|| i.doc_path.0.ends_with("src_b/note.md")
i.doc_path.0.ends_with("src_a/note.md") || i.doc_path.0.ends_with("src_b/note.md")
})
.collect();
assert_eq!(
@@ -149,7 +148,10 @@ fn twin_files_fetch_span_uses_correct_asset() {
// at either twin, making one twin's span fetch behave incorrectly.
let report2 = ingest_with_config(env.config.clone(), env.scope(), false)
.expect("second ingest must succeed");
assert_eq!(report2.errors, 0, "no ingest errors on second run; report={report2:?}");
assert_eq!(
report2.errors, 0,
"no ingest errors on second run; report={report2:?}"
);
// Re-open app after second ingest and verify span still works on both.
let app2 = env.app();


@@ -43,9 +43,7 @@ fn twin_files_second_ingest_is_unchanged() {
let items = first.items.as_ref().expect("items must be present");
let twin_items: Vec<_> = items
.iter()
.filter(|i| {
i.doc_path.0.ends_with("__init__.py")
})
.filter(|i| i.doc_path.0.ends_with("__init__.py"))
.collect();
assert_eq!(
twin_items.len(),
@@ -63,8 +61,14 @@ fn twin_files_second_ingest_is_unchanged() {
// Second ingest — same files, same content → both must be Unchanged.
let second = ingest_with_config(env.config.clone(), env.scope(), false)
.expect("second ingest must succeed");
assert_eq!(second.errors, 0, "second ingest: no errors; report={second:?}");
assert_eq!(second.new, 0, "second ingest: no new docs; report={second:?}");
assert_eq!(
second.errors, 0,
"second ingest: no errors; report={second:?}"
);
assert_eq!(
second.new, 0,
"second ingest: no new docs; report={second:?}"
);
assert_eq!(
second.updated, 0,
"second ingest: no updated docs (twin-file bug would set this to 2); report={second:?}"


@@ -14,6 +14,8 @@ blake3 = { workspace = true }
anyhow = { workspace = true }
tracing = { workspace = true }
serde_yaml = { workspace = true }
lindera = { workspace = true, features = ["embed-ko-dic"] }
lindera-ko-dic = { workspace = true, features = ["embed-ko-dic"] }
[dev-dependencies]
# kb-parse-md / kb-parse-code are dev-only — used by the snapshot integration


@@ -39,17 +39,11 @@ impl Chunker for CodeCAstV1Chunker {
hex[..POLICY_HASH_HEX_LEN].to_string()
}
fn chunk(
&self,
doc: &CanonicalDocument,
policy: &ChunkPolicy,
) -> anyhow::Result<Vec<Chunk>> {
fn chunk(&self, doc: &CanonicalDocument, policy: &ChunkPolicy) -> anyhow::Result<Vec<Chunk>> {
for b in &doc.blocks {
let c = match b {
Block::Code(c) => c,
_ => anyhow::bail!(
"CodeCAstV1Chunker only handles code docs (got non-Code block)"
),
_ => anyhow::bail!("CodeCAstV1Chunker only handles code docs (got non-Code block)"),
};
if !matches!(c.common.source_span, SourceSpan::Code { .. }) {
anyhow::bail!(
@@ -68,9 +62,12 @@ impl Chunker for CodeCAstV1Chunker {
_ => unreachable!("validated above"),
};
let (ls, le, symbol, lang) = match &cb.common.source_span {
SourceSpan::Code { line_start, line_end, symbol, lang } => {
(*line_start, *line_end, symbol.clone(), lang.clone())
}
SourceSpan::Code {
line_start,
line_end,
symbol,
lang,
} => (*line_start, *line_end, symbol.clone(), lang.clone()),
_ => unreachable!("validated above"),
};
let block_ids: Vec<BlockId> = vec![cb.common.block_id.clone()];
@@ -84,8 +81,13 @@ impl Chunker for CodeCAstV1Chunker {
lang: lang.clone(),
};
out.push(make_chunk(
doc, &chunker_version, &block_ids, &base_policy_hash,
None, span, cb.code.clone(),
doc,
&chunker_version,
&block_ids,
&base_policy_hash,
None,
span,
cb.code.clone(),
));
} else {
let parts = split_oversize(&cb.code);
@@ -93,9 +95,7 @@ impl Chunker for CodeCAstV1Chunker {
for (i, (off_start, off_end, text)) in parts.into_iter().enumerate() {
let part_ls = ls + off_start;
let part_le = ls + off_end;
let part_sym = symbol
.as_ref()
.map(|s| format!("{s} [part {}/{n}]", i + 1));
let part_sym = symbol.as_ref().map(|s| format!("{s} [part {}/{n}]", i + 1));
let span = SourceSpan::Code {
line_start: part_ls,
line_end: part_le,
@@ -103,8 +103,13 @@ impl Chunker for CodeCAstV1Chunker {
lang: lang.clone(),
};
out.push(make_chunk(
doc, &chunker_version, &block_ids, &base_policy_hash,
Some(part_ls), span, text,
doc,
&chunker_version,
&block_ids,
&base_policy_hash,
Some(part_ls),
span,
text,
));
}
}
@@ -140,6 +145,7 @@ fn make_chunk(
chunk_id,
doc_id: DocumentId(doc.doc_id.0.clone()),
block_ids: block_ids.to_vec(),
tokenized_korean_text: crate::tokenize_korean_morphological(&text),
text,
heading_path: Vec::new(),
source_spans: vec![span],
@@ -183,9 +189,9 @@ fn split_oversize(code: &str) -> Vec<(u32, u32, String)> {
mod tests {
use super::*;
use kebab_core::{
Block, CanonicalDocument, ChunkPolicy, Chunker, ChunkerVersion, CodeBlock, CommonBlock,
SourceSpan, id_for_block, id_for_doc, AssetId, Lang, Metadata, ParserVersion, Provenance,
SourceType, TrustLevel, WorkspacePath,
AssetId, Block, CanonicalDocument, ChunkPolicy, Chunker, ChunkerVersion, CodeBlock,
CommonBlock, Lang, Metadata, ParserVersion, Provenance, SourceSpan, SourceType, TrustLevel,
WorkspacePath, id_for_block, id_for_doc,
};
use time::OffsetDateTime;
@@ -206,39 +212,60 @@ mod tests {
};
let bid = id_for_block(&doc_id, "code", &[], i as u32, &span);
Block::Code(CodeBlock {
common: CommonBlock { block_id: bid, heading_path: vec![], source_span: span },
common: CommonBlock {
block_id: bid,
heading_path: vec![],
source_span: span,
},
lang: Some("c".into()),
code: (*code).to_string(),
})
})
.collect();
CanonicalDocument {
doc_id, source_asset_id: aid, workspace_path: wp, title: "a".into(),
lang: Lang("und".into()), blocks,
doc_id,
source_asset_id: aid,
workspace_path: wp,
title: "a".into(),
lang: Lang("und".into()),
blocks,
metadata: Metadata {
aliases: vec![], tags: vec![],
aliases: vec![],
tags: vec![],
created_at: OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(),
updated_at: OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(),
source_type: SourceType::Note, trust_level: TrustLevel::Primary,
user_id_alias: None, user: Default::default(),
repo: Some("kebab".into()), git_branch: Some("main".into()),
git_commit: Some("0".repeat(40)), code_lang: Some("c".into()),
source_type: SourceType::Note,
trust_level: TrustLevel::Primary,
user_id_alias: None,
user: Default::default(),
repo: Some("kebab".into()),
git_branch: Some("main".into()),
git_commit: Some("0".repeat(40)),
code_lang: Some("c".into()),
},
provenance: Provenance { events: vec![] },
parser_version: pv, schema_version: 1, doc_version: 1,
last_chunker_version: None, last_embedding_version: None,
parser_version: pv,
schema_version: 1,
doc_version: 1,
last_chunker_version: None,
last_embedding_version: None,
}
}
fn policy() -> ChunkPolicy {
ChunkPolicy { target_tokens: 500, overlap_tokens: 80,
ChunkPolicy {
target_tokens: 500,
overlap_tokens: 80,
respect_markdown_headings: false,
chunker_version: ChunkerVersion(VERSION_LABEL.into()) }
chunker_version: ChunkerVersion(VERSION_LABEL.into()),
}
}
#[test]
fn chunker_version_is_code_c_ast_v1() {
assert_eq!(CodeCAstV1Chunker.chunker_version(),
ChunkerVersion("code-c-ast-v1".into()));
assert_eq!(
CodeCAstV1Chunker.chunker_version(),
ChunkerVersion("code-c-ast-v1".into())
);
}
#[test]
@@ -256,7 +283,12 @@ mod tests {
assert_eq!(c.chunker_version.0, "code-c-ast-v1");
}
match &chunks[0].source_spans[0] {
SourceSpan::Code { symbol, line_start, line_end, .. } => {
SourceSpan::Code {
symbol,
line_start,
line_end,
..
} => {
assert_eq!(symbol.as_deref(), Some("parse"));
assert_eq!((*line_start, *line_end), (1, 3));
}
@@ -266,22 +298,32 @@ mod tests {
#[test]
fn oversize_unit_splits_into_parts_with_unique_ids() {
let body = (0..500).map(|i| format!("\tx{i} = {i};\n")).collect::<String>();
let body = (0..500)
.map(|i| format!("\tx{i} = {i};\n"))
.collect::<String>();
let code = format!("int big() {{\n{body}\n}}");
let doc = code_doc(&[("big", 1, 502, &code)]);
let chunks = CodeCAstV1Chunker.chunk(&doc, &policy()).unwrap();
assert!(chunks.len() >= 2, "oversize unit must split, got {}", chunks.len());
assert!(
chunks.len() >= 2,
"oversize unit must split, got {}",
chunks.len()
);
for c in &chunks {
match &c.source_spans[0] {
SourceSpan::Code { symbol, .. } => {
assert!(symbol.as_deref().unwrap().starts_with("big [part "),
"part-numbered symbol, got {symbol:?}");
assert!(
symbol.as_deref().unwrap().starts_with("big [part "),
"part-numbered symbol, got {symbol:?}"
);
}
_ => unreachable!(),
}
}
let mut ids: Vec<&str> = chunks.iter().map(|c| c.chunk_id.0.as_str()).collect();
let n = ids.len(); ids.sort_unstable(); ids.dedup();
let n = ids.len();
ids.sort_unstable();
ids.dedup();
assert_eq!(ids.len(), n, "chunk_ids unique across split parts");
}
@@ -295,7 +337,8 @@ mod tests {
heading_path: vec![],
source_span: SourceSpan::Line { start: 1, end: 1 },
},
text: "x".into(), inlines: vec![],
text: "x".into(),
inlines: vec![],
})];
let err = CodeCAstV1Chunker.chunk(&doc, &policy()).unwrap_err();
assert!(err.to_string().contains("CodeCAstV1Chunker"));
@@ -304,11 +347,19 @@ mod tests {
#[test]
fn deterministic_chunk_ids_1000() {
let doc = code_doc(&[("parse", 1, 2, "int parse() {}\n")]);
let base: Vec<String> = CodeCAstV1Chunker.chunk(&doc, &policy())
.unwrap().into_iter().map(|c| c.chunk_id.0).collect();
let base: Vec<String> = CodeCAstV1Chunker
.chunk(&doc, &policy())
.unwrap()
.into_iter()
.map(|c| c.chunk_id.0)
.collect();
for _ in 0..1000 {
let again: Vec<String> = CodeCAstV1Chunker.chunk(&doc, &policy())
.unwrap().into_iter().map(|c| c.chunk_id.0).collect();
let again: Vec<String> = CodeCAstV1Chunker
.chunk(&doc, &policy())
.unwrap()
.into_iter()
.map(|c| c.chunk_id.0)
.collect();
assert_eq!(again, base);
}
}
@@ -316,7 +367,9 @@ mod tests {
#[test]
fn policy_hash_matches_md_heading_v1() {
let p = policy();
assert_eq!(CodeCAstV1Chunker.policy_hash(&p),
crate::MdHeadingV1Chunker.policy_hash(&p));
assert_eq!(
CodeCAstV1Chunker.policy_hash(&p),
crate::MdHeadingV1Chunker.policy_hash(&p)
);
}
}

@@ -39,17 +39,13 @@ impl Chunker for CodeCppAstV1Chunker {
hex[..POLICY_HASH_HEX_LEN].to_string()
}
fn chunk(
&self,
doc: &CanonicalDocument,
policy: &ChunkPolicy,
) -> anyhow::Result<Vec<Chunk>> {
fn chunk(&self, doc: &CanonicalDocument, policy: &ChunkPolicy) -> anyhow::Result<Vec<Chunk>> {
for b in &doc.blocks {
let c = match b {
Block::Code(c) => c,
_ => anyhow::bail!(
"CodeCppAstV1Chunker only handles code docs (got non-Code block)"
),
_ => {
anyhow::bail!("CodeCppAstV1Chunker only handles code docs (got non-Code block)")
}
};
if !matches!(c.common.source_span, SourceSpan::Code { .. }) {
anyhow::bail!(
@@ -68,9 +64,12 @@ impl Chunker for CodeCppAstV1Chunker {
_ => unreachable!("validated above"),
};
let (ls, le, symbol, lang) = match &cb.common.source_span {
SourceSpan::Code { line_start, line_end, symbol, lang } => {
(*line_start, *line_end, symbol.clone(), lang.clone())
}
SourceSpan::Code {
line_start,
line_end,
symbol,
lang,
} => (*line_start, *line_end, symbol.clone(), lang.clone()),
_ => unreachable!("validated above"),
};
let block_ids: Vec<BlockId> = vec![cb.common.block_id.clone()];
@@ -84,8 +83,13 @@ impl Chunker for CodeCppAstV1Chunker {
lang: lang.clone(),
};
out.push(make_chunk(
doc, &chunker_version, &block_ids, &base_policy_hash,
None, span, cb.code.clone(),
doc,
&chunker_version,
&block_ids,
&base_policy_hash,
None,
span,
cb.code.clone(),
));
} else {
let parts = split_oversize(&cb.code);
@@ -93,9 +97,7 @@ impl Chunker for CodeCppAstV1Chunker {
for (i, (off_start, off_end, text)) in parts.into_iter().enumerate() {
let part_ls = ls + off_start;
let part_le = ls + off_end;
let part_sym = symbol
.as_ref()
.map(|s| format!("{s} [part {}/{n}]", i + 1));
let part_sym = symbol.as_ref().map(|s| format!("{s} [part {}/{n}]", i + 1));
let span = SourceSpan::Code {
line_start: part_ls,
line_end: part_le,
@@ -103,8 +105,13 @@ impl Chunker for CodeCppAstV1Chunker {
lang: lang.clone(),
};
out.push(make_chunk(
doc, &chunker_version, &block_ids, &base_policy_hash,
Some(part_ls), span, text,
doc,
&chunker_version,
&block_ids,
&base_policy_hash,
Some(part_ls),
span,
text,
));
}
}
@@ -140,6 +147,7 @@ fn make_chunk(
chunk_id,
doc_id: DocumentId(doc.doc_id.0.clone()),
block_ids: block_ids.to_vec(),
tokenized_korean_text: crate::tokenize_korean_morphological(&text),
text,
heading_path: Vec::new(),
source_spans: vec![span],
@@ -183,9 +191,9 @@ fn split_oversize(code: &str) -> Vec<(u32, u32, String)> {
mod tests {
use super::*;
use kebab_core::{
Block, CanonicalDocument, ChunkPolicy, Chunker, ChunkerVersion, CodeBlock, CommonBlock,
SourceSpan, id_for_block, id_for_doc, AssetId, Lang, Metadata, ParserVersion, Provenance,
SourceType, TrustLevel, WorkspacePath,
AssetId, Block, CanonicalDocument, ChunkPolicy, Chunker, ChunkerVersion, CodeBlock,
CommonBlock, Lang, Metadata, ParserVersion, Provenance, SourceSpan, SourceType, TrustLevel,
WorkspacePath, id_for_block, id_for_doc,
};
use time::OffsetDateTime;
@@ -206,39 +214,60 @@ mod tests {
};
let bid = id_for_block(&doc_id, "code", &[], i as u32, &span);
Block::Code(CodeBlock {
common: CommonBlock { block_id: bid, heading_path: vec![], source_span: span },
common: CommonBlock {
block_id: bid,
heading_path: vec![],
source_span: span,
},
lang: Some("cpp".into()),
code: (*code).to_string(),
})
})
.collect();
CanonicalDocument {
doc_id, source_asset_id: aid, workspace_path: wp, title: "a".into(),
lang: Lang("und".into()), blocks,
doc_id,
source_asset_id: aid,
workspace_path: wp,
title: "a".into(),
lang: Lang("und".into()),
blocks,
metadata: Metadata {
aliases: vec![], tags: vec![],
aliases: vec![],
tags: vec![],
created_at: OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(),
updated_at: OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(),
source_type: SourceType::Note, trust_level: TrustLevel::Primary,
user_id_alias: None, user: Default::default(),
repo: Some("kebab".into()), git_branch: Some("main".into()),
git_commit: Some("0".repeat(40)), code_lang: Some("cpp".into()),
source_type: SourceType::Note,
trust_level: TrustLevel::Primary,
user_id_alias: None,
user: Default::default(),
repo: Some("kebab".into()),
git_branch: Some("main".into()),
git_commit: Some("0".repeat(40)),
code_lang: Some("cpp".into()),
},
provenance: Provenance { events: vec![] },
parser_version: pv, schema_version: 1, doc_version: 1,
last_chunker_version: None, last_embedding_version: None,
parser_version: pv,
schema_version: 1,
doc_version: 1,
last_chunker_version: None,
last_embedding_version: None,
}
}
fn policy() -> ChunkPolicy {
ChunkPolicy { target_tokens: 500, overlap_tokens: 80,
ChunkPolicy {
target_tokens: 500,
overlap_tokens: 80,
respect_markdown_headings: false,
chunker_version: ChunkerVersion(VERSION_LABEL.into()) }
chunker_version: ChunkerVersion(VERSION_LABEL.into()),
}
}
#[test]
fn chunker_version_is_code_cpp_ast_v1() {
assert_eq!(CodeCppAstV1Chunker.chunker_version(),
ChunkerVersion("code-cpp-ast-v1".into()));
assert_eq!(
CodeCppAstV1Chunker.chunker_version(),
ChunkerVersion("code-cpp-ast-v1".into())
);
}
#[test]
@@ -256,7 +285,12 @@ mod tests {
assert_eq!(c.chunker_version.0, "code-cpp-ast-v1");
}
match &chunks[0].source_spans[0] {
SourceSpan::Code { symbol, line_start, line_end, .. } => {
SourceSpan::Code {
symbol,
line_start,
line_end,
..
} => {
assert_eq!(symbol.as_deref(), Some("parse"));
assert_eq!((*line_start, *line_end), (1, 3));
}
@@ -266,22 +300,32 @@ mod tests {
#[test]
fn oversize_unit_splits_into_parts_with_unique_ids() {
let body = (0..500).map(|i| format!("\tx{i} = {i};\n")).collect::<String>();
let body = (0..500)
.map(|i| format!("\tx{i} = {i};\n"))
.collect::<String>();
let code = format!("int big() {{\n{body}\n}}");
let doc = code_doc(&[("big", 1, 502, &code)]);
let chunks = CodeCppAstV1Chunker.chunk(&doc, &policy()).unwrap();
assert!(chunks.len() >= 2, "oversize unit must split, got {}", chunks.len());
assert!(
chunks.len() >= 2,
"oversize unit must split, got {}",
chunks.len()
);
for c in &chunks {
match &c.source_spans[0] {
SourceSpan::Code { symbol, .. } => {
assert!(symbol.as_deref().unwrap().starts_with("big [part "),
"part-numbered symbol, got {symbol:?}");
assert!(
symbol.as_deref().unwrap().starts_with("big [part "),
"part-numbered symbol, got {symbol:?}"
);
}
_ => unreachable!(),
}
}
let mut ids: Vec<&str> = chunks.iter().map(|c| c.chunk_id.0.as_str()).collect();
let n = ids.len(); ids.sort_unstable(); ids.dedup();
let n = ids.len();
ids.sort_unstable();
ids.dedup();
assert_eq!(ids.len(), n, "chunk_ids unique across split parts");
}
@@ -295,7 +339,8 @@ mod tests {
heading_path: vec![],
source_span: SourceSpan::Line { start: 1, end: 1 },
},
text: "x".into(), inlines: vec![],
text: "x".into(),
inlines: vec![],
})];
let err = CodeCppAstV1Chunker.chunk(&doc, &policy()).unwrap_err();
assert!(err.to_string().contains("CodeCppAstV1Chunker"));
@@ -304,11 +349,19 @@ mod tests {
#[test]
fn deterministic_chunk_ids_1000() {
let doc = code_doc(&[("parse", 1, 2, "int parse() {}\n")]);
let base: Vec<String> = CodeCppAstV1Chunker.chunk(&doc, &policy())
.unwrap().into_iter().map(|c| c.chunk_id.0).collect();
let base: Vec<String> = CodeCppAstV1Chunker
.chunk(&doc, &policy())
.unwrap()
.into_iter()
.map(|c| c.chunk_id.0)
.collect();
for _ in 0..1000 {
let again: Vec<String> = CodeCppAstV1Chunker.chunk(&doc, &policy())
.unwrap().into_iter().map(|c| c.chunk_id.0).collect();
let again: Vec<String> = CodeCppAstV1Chunker
.chunk(&doc, &policy())
.unwrap()
.into_iter()
.map(|c| c.chunk_id.0)
.collect();
assert_eq!(again, base);
}
}
@@ -316,7 +369,9 @@ mod tests {
#[test]
fn policy_hash_matches_md_heading_v1() {
let p = policy();
assert_eq!(CodeCppAstV1Chunker.policy_hash(&p),
crate::MdHeadingV1Chunker.policy_hash(&p));
assert_eq!(
CodeCppAstV1Chunker.policy_hash(&p),
crate::MdHeadingV1Chunker.policy_hash(&p)
);
}
}

@@ -39,17 +39,13 @@ impl Chunker for CodeGoAstV1Chunker {
hex[..POLICY_HASH_HEX_LEN].to_string()
}
fn chunk(
&self,
doc: &CanonicalDocument,
policy: &ChunkPolicy,
) -> anyhow::Result<Vec<Chunk>> {
fn chunk(&self, doc: &CanonicalDocument, policy: &ChunkPolicy) -> anyhow::Result<Vec<Chunk>> {
for b in &doc.blocks {
let c = match b {
Block::Code(c) => c,
_ => anyhow::bail!(
"CodeGoAstV1Chunker only handles code docs (got non-Code block)"
),
_ => {
anyhow::bail!("CodeGoAstV1Chunker only handles code docs (got non-Code block)")
}
};
if !matches!(c.common.source_span, SourceSpan::Code { .. }) {
anyhow::bail!(
@@ -68,9 +64,12 @@ impl Chunker for CodeGoAstV1Chunker {
_ => unreachable!("validated above"),
};
let (ls, le, symbol, lang) = match &cb.common.source_span {
SourceSpan::Code { line_start, line_end, symbol, lang } => {
(*line_start, *line_end, symbol.clone(), lang.clone())
}
SourceSpan::Code {
line_start,
line_end,
symbol,
lang,
} => (*line_start, *line_end, symbol.clone(), lang.clone()),
_ => unreachable!("validated above"),
};
let block_ids: Vec<BlockId> = vec![cb.common.block_id.clone()];
@@ -84,8 +83,13 @@ impl Chunker for CodeGoAstV1Chunker {
lang: lang.clone(),
};
out.push(make_chunk(
doc, &chunker_version, &block_ids, &base_policy_hash,
None, span, cb.code.clone(),
doc,
&chunker_version,
&block_ids,
&base_policy_hash,
None,
span,
cb.code.clone(),
));
} else {
let parts = split_oversize(&cb.code);
@@ -93,9 +97,7 @@ impl Chunker for CodeGoAstV1Chunker {
for (i, (off_start, off_end, text)) in parts.into_iter().enumerate() {
let part_ls = ls + off_start;
let part_le = ls + off_end;
let part_sym = symbol
.as_ref()
.map(|s| format!("{s} [part {}/{n}]", i + 1));
let part_sym = symbol.as_ref().map(|s| format!("{s} [part {}/{n}]", i + 1));
let span = SourceSpan::Code {
line_start: part_ls,
line_end: part_le,
@@ -103,8 +105,13 @@ impl Chunker for CodeGoAstV1Chunker {
lang: lang.clone(),
};
out.push(make_chunk(
doc, &chunker_version, &block_ids, &base_policy_hash,
Some(part_ls), span, text,
doc,
&chunker_version,
&block_ids,
&base_policy_hash,
Some(part_ls),
span,
text,
));
}
}
@@ -140,6 +147,7 @@ fn make_chunk(
chunk_id,
doc_id: DocumentId(doc.doc_id.0.clone()),
block_ids: block_ids.to_vec(),
tokenized_korean_text: crate::tokenize_korean_morphological(&text),
text,
heading_path: Vec::new(),
source_spans: vec![span],
@@ -183,9 +191,9 @@ fn split_oversize(code: &str) -> Vec<(u32, u32, String)> {
mod tests {
use super::*;
use kebab_core::{
Block, CanonicalDocument, ChunkPolicy, Chunker, ChunkerVersion, CodeBlock, CommonBlock,
SourceSpan, id_for_block, id_for_doc, AssetId, Lang, Metadata, ParserVersion, Provenance,
SourceType, TrustLevel, WorkspacePath,
AssetId, Block, CanonicalDocument, ChunkPolicy, Chunker, ChunkerVersion, CodeBlock,
CommonBlock, Lang, Metadata, ParserVersion, Provenance, SourceSpan, SourceType, TrustLevel,
WorkspacePath, id_for_block, id_for_doc,
};
use time::OffsetDateTime;
@@ -206,46 +214,72 @@ mod tests {
};
let bid = id_for_block(&doc_id, "code", &[], i as u32, &span);
Block::Code(CodeBlock {
common: CommonBlock { block_id: bid, heading_path: vec![], source_span: span },
common: CommonBlock {
block_id: bid,
heading_path: vec![],
source_span: span,
},
lang: Some("go".into()),
code: (*code).to_string(),
})
})
.collect();
CanonicalDocument {
doc_id, source_asset_id: aid, workspace_path: wp, title: "a".into(),
lang: Lang("und".into()), blocks,
doc_id,
source_asset_id: aid,
workspace_path: wp,
title: "a".into(),
lang: Lang("und".into()),
blocks,
metadata: Metadata {
aliases: vec![], tags: vec![],
aliases: vec![],
tags: vec![],
created_at: OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(),
updated_at: OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(),
source_type: SourceType::Note, trust_level: TrustLevel::Primary,
user_id_alias: None, user: Default::default(),
repo: Some("kebab".into()), git_branch: Some("main".into()),
git_commit: Some("0".repeat(40)), code_lang: Some("go".into()),
source_type: SourceType::Note,
trust_level: TrustLevel::Primary,
user_id_alias: None,
user: Default::default(),
repo: Some("kebab".into()),
git_branch: Some("main".into()),
git_commit: Some("0".repeat(40)),
code_lang: Some("go".into()),
},
provenance: Provenance { events: vec![] },
parser_version: pv, schema_version: 1, doc_version: 1,
last_chunker_version: None, last_embedding_version: None,
parser_version: pv,
schema_version: 1,
doc_version: 1,
last_chunker_version: None,
last_embedding_version: None,
}
}
fn policy() -> ChunkPolicy {
ChunkPolicy { target_tokens: 500, overlap_tokens: 80,
ChunkPolicy {
target_tokens: 500,
overlap_tokens: 80,
respect_markdown_headings: false,
chunker_version: ChunkerVersion(VERSION_LABEL.into()) }
chunker_version: ChunkerVersion(VERSION_LABEL.into()),
}
}
#[test]
fn chunker_version_is_code_go_ast_v1() {
assert_eq!(CodeGoAstV1Chunker.chunker_version(),
ChunkerVersion("code-go-ast-v1".into()));
assert_eq!(
CodeGoAstV1Chunker.chunker_version(),
ChunkerVersion("code-go-ast-v1".into())
);
}
#[test]
fn one_chunk_per_unit_preserves_code_span() {
let doc = code_doc(&[
("parse", 1, 3, "func parse() {\n\t// x\n}"),
("Foo.double", 5, 7, "func double() int {\n\t//\n\treturn 0\n}"),
(
"Foo.double",
5,
7,
"func double() int {\n\t//\n\treturn 0\n}",
),
]);
let chunks = CodeGoAstV1Chunker.chunk(&doc, &policy()).unwrap();
assert_eq!(chunks.len(), 2);
@@ -256,7 +290,12 @@ mod tests {
assert_eq!(c.chunker_version.0, "code-go-ast-v1");
}
match &chunks[0].source_spans[0] {
SourceSpan::Code { symbol, line_start, line_end, .. } => {
SourceSpan::Code {
symbol,
line_start,
line_end,
..
} => {
assert_eq!(symbol.as_deref(), Some("parse"));
assert_eq!((*line_start, *line_end), (1, 3));
}
@@ -266,22 +305,33 @@ mod tests {
#[test]
fn oversize_unit_splits_into_parts_with_unique_ids() {
let body = (0..500).map(|i| format!("\tx{i} := {i}")).collect::<Vec<_>>().join("\n");
let body = (0..500)
.map(|i| format!("\tx{i} := {i}"))
.collect::<Vec<_>>()
.join("\n");
let code = format!("func big() {{\n{body}\n}}");
let doc = code_doc(&[("big", 1, 502, &code)]);
let chunks = CodeGoAstV1Chunker.chunk(&doc, &policy()).unwrap();
assert!(chunks.len() >= 2, "oversize unit must split, got {}", chunks.len());
assert!(
chunks.len() >= 2,
"oversize unit must split, got {}",
chunks.len()
);
for c in &chunks {
match &c.source_spans[0] {
SourceSpan::Code { symbol, .. } => {
assert!(symbol.as_deref().unwrap().starts_with("big [part "),
"part-numbered symbol, got {symbol:?}");
assert!(
symbol.as_deref().unwrap().starts_with("big [part "),
"part-numbered symbol, got {symbol:?}"
);
}
_ => unreachable!(),
}
}
let mut ids: Vec<&str> = chunks.iter().map(|c| c.chunk_id.0.as_str()).collect();
let n = ids.len(); ids.sort_unstable(); ids.dedup();
let n = ids.len();
ids.sort_unstable();
ids.dedup();
assert_eq!(ids.len(), n, "chunk_ids unique across split parts");
}
@@ -295,7 +345,8 @@ mod tests {
heading_path: vec![],
source_span: SourceSpan::Line { start: 1, end: 1 },
},
text: "x".into(), inlines: vec![],
text: "x".into(),
inlines: vec![],
})];
let err = CodeGoAstV1Chunker.chunk(&doc, &policy()).unwrap_err();
assert!(err.to_string().contains("CodeGoAstV1Chunker"));
@@ -304,11 +355,19 @@ mod tests {
#[test]
fn deterministic_chunk_ids_1000() {
let doc = code_doc(&[("parse", 1, 2, "func parse() {}\n")]);
let base: Vec<String> = CodeGoAstV1Chunker.chunk(&doc, &policy())
.unwrap().into_iter().map(|c| c.chunk_id.0).collect();
let base: Vec<String> = CodeGoAstV1Chunker
.chunk(&doc, &policy())
.unwrap()
.into_iter()
.map(|c| c.chunk_id.0)
.collect();
for _ in 0..1000 {
let again: Vec<String> = CodeGoAstV1Chunker.chunk(&doc, &policy())
.unwrap().into_iter().map(|c| c.chunk_id.0).collect();
let again: Vec<String> = CodeGoAstV1Chunker
.chunk(&doc, &policy())
.unwrap()
.into_iter()
.map(|c| c.chunk_id.0)
.collect();
assert_eq!(again, base);
}
}
@@ -316,7 +375,9 @@ mod tests {
#[test]
fn policy_hash_matches_md_heading_v1() {
let p = policy();
assert_eq!(CodeGoAstV1Chunker.policy_hash(&p),
crate::MdHeadingV1Chunker.policy_hash(&p));
assert_eq!(
CodeGoAstV1Chunker.policy_hash(&p),
crate::MdHeadingV1Chunker.policy_hash(&p)
);
}
}

@@ -39,11 +39,7 @@ impl Chunker for CodeJavaAstV1Chunker {
hex[..POLICY_HASH_HEX_LEN].to_string()
}
fn chunk(
&self,
doc: &CanonicalDocument,
policy: &ChunkPolicy,
) -> anyhow::Result<Vec<Chunk>> {
fn chunk(&self, doc: &CanonicalDocument, policy: &ChunkPolicy) -> anyhow::Result<Vec<Chunk>> {
for b in &doc.blocks {
let c = match b {
Block::Code(c) => c,
@@ -68,9 +64,12 @@ impl Chunker for CodeJavaAstV1Chunker {
_ => unreachable!("validated above"),
};
let (ls, le, symbol, lang) = match &cb.common.source_span {
SourceSpan::Code { line_start, line_end, symbol, lang } => {
(*line_start, *line_end, symbol.clone(), lang.clone())
}
SourceSpan::Code {
line_start,
line_end,
symbol,
lang,
} => (*line_start, *line_end, symbol.clone(), lang.clone()),
_ => unreachable!("validated above"),
};
let block_ids: Vec<BlockId> = vec![cb.common.block_id.clone()];
@@ -84,8 +83,13 @@ impl Chunker for CodeJavaAstV1Chunker {
lang: lang.clone(),
};
out.push(make_chunk(
doc, &chunker_version, &block_ids, &base_policy_hash,
None, span, cb.code.clone(),
doc,
&chunker_version,
&block_ids,
&base_policy_hash,
None,
span,
cb.code.clone(),
));
} else {
let parts = split_oversize(&cb.code);
@@ -93,9 +97,7 @@ impl Chunker for CodeJavaAstV1Chunker {
for (i, (off_start, off_end, text)) in parts.into_iter().enumerate() {
let part_ls = ls + off_start;
let part_le = ls + off_end;
let part_sym = symbol
.as_ref()
.map(|s| format!("{s} [part {}/{n}]", i + 1));
let part_sym = symbol.as_ref().map(|s| format!("{s} [part {}/{n}]", i + 1));
let span = SourceSpan::Code {
line_start: part_ls,
line_end: part_le,
@@ -103,8 +105,13 @@ impl Chunker for CodeJavaAstV1Chunker {
lang: lang.clone(),
};
out.push(make_chunk(
doc, &chunker_version, &block_ids, &base_policy_hash,
Some(part_ls), span, text,
doc,
&chunker_version,
&block_ids,
&base_policy_hash,
Some(part_ls),
span,
text,
));
}
}
@@ -140,6 +147,7 @@ fn make_chunk(
chunk_id,
doc_id: DocumentId(doc.doc_id.0.clone()),
block_ids: block_ids.to_vec(),
tokenized_korean_text: crate::tokenize_korean_morphological(&text),
text,
heading_path: Vec::new(),
source_spans: vec![span],
@@ -183,9 +191,9 @@ fn split_oversize(code: &str) -> Vec<(u32, u32, String)> {
mod tests {
use super::*;
use kebab_core::{
Block, CanonicalDocument, ChunkPolicy, Chunker, ChunkerVersion, CodeBlock, CommonBlock,
SourceSpan, id_for_block, id_for_doc, AssetId, Lang, Metadata, ParserVersion, Provenance,
SourceType, TrustLevel, WorkspacePath,
AssetId, Block, CanonicalDocument, ChunkPolicy, Chunker, ChunkerVersion, CodeBlock,
CommonBlock, Lang, Metadata, ParserVersion, Provenance, SourceSpan, SourceType, TrustLevel,
WorkspacePath, id_for_block, id_for_doc,
};
use time::OffsetDateTime;
@@ -206,39 +214,60 @@ mod tests {
};
let bid = id_for_block(&doc_id, "code", &[], i as u32, &span);
Block::Code(CodeBlock {
common: CommonBlock { block_id: bid, heading_path: vec![], source_span: span },
common: CommonBlock {
block_id: bid,
heading_path: vec![],
source_span: span,
},
lang: Some("java".into()),
code: (*code).to_string(),
})
})
.collect();
CanonicalDocument {
doc_id, source_asset_id: aid, workspace_path: wp, title: "a".into(),
lang: Lang("und".into()), blocks,
doc_id,
source_asset_id: aid,
workspace_path: wp,
title: "a".into(),
lang: Lang("und".into()),
blocks,
metadata: Metadata {
aliases: vec![], tags: vec![],
aliases: vec![],
tags: vec![],
created_at: OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(),
updated_at: OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(),
source_type: SourceType::Note, trust_level: TrustLevel::Primary,
user_id_alias: None, user: Default::default(),
repo: Some("kebab".into()), git_branch: Some("main".into()),
git_commit: Some("0".repeat(40)), code_lang: Some("java".into()),
source_type: SourceType::Note,
trust_level: TrustLevel::Primary,
user_id_alias: None,
user: Default::default(),
repo: Some("kebab".into()),
git_branch: Some("main".into()),
git_commit: Some("0".repeat(40)),
code_lang: Some("java".into()),
},
provenance: Provenance { events: vec![] },
parser_version: pv, schema_version: 1, doc_version: 1,
last_chunker_version: None, last_embedding_version: None,
parser_version: pv,
schema_version: 1,
doc_version: 1,
last_chunker_version: None,
last_embedding_version: None,
}
}
fn policy() -> ChunkPolicy {
ChunkPolicy { target_tokens: 500, overlap_tokens: 80,
ChunkPolicy {
target_tokens: 500,
overlap_tokens: 80,
respect_markdown_headings: false,
chunker_version: ChunkerVersion(VERSION_LABEL.into()) }
chunker_version: ChunkerVersion(VERSION_LABEL.into()),
}
}
#[test]
fn chunker_version_is_code_java_ast_v1() {
assert_eq!(CodeJavaAstV1Chunker.chunker_version(),
ChunkerVersion("code-java-ast-v1".into()));
assert_eq!(
CodeJavaAstV1Chunker.chunker_version(),
ChunkerVersion("code-java-ast-v1".into())
);
}
#[test]
@@ -256,7 +285,12 @@ mod tests {
assert_eq!(c.chunker_version.0, "code-java-ast-v1");
}
match &chunks[0].source_spans[0] {
SourceSpan::Code { symbol, line_start, line_end, .. } => {
SourceSpan::Code {
symbol,
line_start,
line_end,
..
} => {
assert_eq!(symbol.as_deref(), Some("parse"));
assert_eq!((*line_start, *line_end), (1, 3));
}
@@ -266,22 +300,33 @@ mod tests {
#[test]
fn oversize_unit_splits_into_parts_with_unique_ids() {
let body = (0..500).map(|i| format!("\tint x{i} = {i};")).collect::<Vec<_>>().join("\n");
let body = (0..500)
.map(|i| format!("\tint x{i} = {i};"))
.collect::<Vec<_>>()
.join("\n");
let code = format!("void big() {{\n{body}\n}}");
let doc = code_doc(&[("big", 1, 502, &code)]);
let chunks = CodeJavaAstV1Chunker.chunk(&doc, &policy()).unwrap();
assert!(chunks.len() >= 2, "oversize unit must split, got {}", chunks.len());
assert!(
chunks.len() >= 2,
"oversize unit must split, got {}",
chunks.len()
);
for c in &chunks {
match &c.source_spans[0] {
SourceSpan::Code { symbol, .. } => {
assert!(symbol.as_deref().unwrap().starts_with("big [part "),
"part-numbered symbol, got {symbol:?}");
assert!(
symbol.as_deref().unwrap().starts_with("big [part "),
"part-numbered symbol, got {symbol:?}"
);
}
_ => unreachable!(),
}
}
let mut ids: Vec<&str> = chunks.iter().map(|c| c.chunk_id.0.as_str()).collect();
let n = ids.len(); ids.sort_unstable(); ids.dedup();
let n = ids.len();
ids.sort_unstable();
ids.dedup();
assert_eq!(ids.len(), n, "chunk_ids unique across split parts");
}
@@ -295,7 +340,8 @@ mod tests {
heading_path: vec![],
source_span: SourceSpan::Line { start: 1, end: 1 },
},
text: "x".into(), inlines: vec![],
text: "x".into(),
inlines: vec![],
})];
let err = CodeJavaAstV1Chunker.chunk(&doc, &policy()).unwrap_err();
assert!(err.to_string().contains("CodeJavaAstV1Chunker"));
@@ -304,11 +350,19 @@ mod tests {
#[test]
fn deterministic_chunk_ids_1000() {
let doc = code_doc(&[("parse", 1, 2, "void parse() {}\n")]);
let base: Vec<String> = CodeJavaAstV1Chunker.chunk(&doc, &policy())
.unwrap().into_iter().map(|c| c.chunk_id.0).collect();
let base: Vec<String> = CodeJavaAstV1Chunker
.chunk(&doc, &policy())
.unwrap()
.into_iter()
.map(|c| c.chunk_id.0)
.collect();
for _ in 0..1000 {
let again: Vec<String> = CodeJavaAstV1Chunker.chunk(&doc, &policy())
.unwrap().into_iter().map(|c| c.chunk_id.0).collect();
let again: Vec<String> = CodeJavaAstV1Chunker
.chunk(&doc, &policy())
.unwrap()
.into_iter()
.map(|c| c.chunk_id.0)
.collect();
assert_eq!(again, base);
}
}
@@ -316,7 +370,9 @@ mod tests {
#[test]
fn policy_hash_matches_md_heading_v1() {
let p = policy();
assert_eq!(CodeJavaAstV1Chunker.policy_hash(&p),
crate::MdHeadingV1Chunker.policy_hash(&p));
assert_eq!(
CodeJavaAstV1Chunker.policy_hash(&p),
crate::MdHeadingV1Chunker.policy_hash(&p)
);
}
}

@@ -39,17 +39,13 @@ impl Chunker for CodeJsAstV1Chunker {
hex[..POLICY_HASH_HEX_LEN].to_string()
}
fn chunk(
&self,
doc: &CanonicalDocument,
policy: &ChunkPolicy,
) -> anyhow::Result<Vec<Chunk>> {
fn chunk(&self, doc: &CanonicalDocument, policy: &ChunkPolicy) -> anyhow::Result<Vec<Chunk>> {
for b in &doc.blocks {
let c = match b {
Block::Code(c) => c,
_ => anyhow::bail!(
"CodeJsAstV1Chunker only handles code docs (got non-Code block)"
),
_ => {
anyhow::bail!("CodeJsAstV1Chunker only handles code docs (got non-Code block)")
}
};
if !matches!(c.common.source_span, SourceSpan::Code { .. }) {
anyhow::bail!(
@@ -68,9 +64,12 @@ impl Chunker for CodeJsAstV1Chunker {
_ => unreachable!("validated above"),
};
let (ls, le, symbol, lang) = match &cb.common.source_span {
SourceSpan::Code { line_start, line_end, symbol, lang } => {
(*line_start, *line_end, symbol.clone(), lang.clone())
}
SourceSpan::Code {
line_start,
line_end,
symbol,
lang,
} => (*line_start, *line_end, symbol.clone(), lang.clone()),
_ => unreachable!("validated above"),
};
let block_ids: Vec<BlockId> = vec![cb.common.block_id.clone()];
@@ -84,8 +83,13 @@ impl Chunker for CodeJsAstV1Chunker {
lang: lang.clone(),
};
out.push(make_chunk(
doc, &chunker_version, &block_ids, &base_policy_hash,
None, span, cb.code.clone(),
doc,
&chunker_version,
&block_ids,
&base_policy_hash,
None,
span,
cb.code.clone(),
));
} else {
let parts = split_oversize(&cb.code);
@@ -93,9 +97,7 @@ impl Chunker for CodeJsAstV1Chunker {
for (i, (off_start, off_end, text)) in parts.into_iter().enumerate() {
let part_ls = ls + off_start;
let part_le = ls + off_end;
let part_sym = symbol
.as_ref()
.map(|s| format!("{s} [part {}/{n}]", i + 1));
let part_sym = symbol.as_ref().map(|s| format!("{s} [part {}/{n}]", i + 1));
let span = SourceSpan::Code {
line_start: part_ls,
line_end: part_le,
@@ -103,8 +105,13 @@ impl Chunker for CodeJsAstV1Chunker {
lang: lang.clone(),
};
out.push(make_chunk(
doc, &chunker_version, &block_ids, &base_policy_hash,
Some(part_ls), span, text,
doc,
&chunker_version,
&block_ids,
&base_policy_hash,
Some(part_ls),
span,
text,
));
}
}
@@ -140,6 +147,7 @@ fn make_chunk(
chunk_id,
doc_id: DocumentId(doc.doc_id.0.clone()),
block_ids: block_ids.to_vec(),
tokenized_korean_text: crate::tokenize_korean_morphological(&text),
text,
heading_path: Vec::new(),
source_spans: vec![span],
@@ -183,9 +191,9 @@ fn split_oversize(code: &str) -> Vec<(u32, u32, String)> {
mod tests {
use super::*;
use kebab_core::{
Block, CanonicalDocument, ChunkPolicy, Chunker, ChunkerVersion, CodeBlock, CommonBlock,
SourceSpan, id_for_block, id_for_doc, AssetId, Lang, Metadata, ParserVersion, Provenance,
SourceType, TrustLevel, WorkspacePath,
AssetId, Block, CanonicalDocument, ChunkPolicy, Chunker, ChunkerVersion, CodeBlock,
CommonBlock, Lang, Metadata, ParserVersion, Provenance, SourceSpan, SourceType, TrustLevel,
WorkspacePath, id_for_block, id_for_doc,
};
use time::OffsetDateTime;
@@ -206,46 +214,72 @@ mod tests {
};
let bid = id_for_block(&doc_id, "code", &[], i as u32, &span);
Block::Code(CodeBlock {
common: CommonBlock { block_id: bid, heading_path: vec![], source_span: span },
common: CommonBlock {
block_id: bid,
heading_path: vec![],
source_span: span,
},
lang: Some("javascript".into()),
code: (*code).to_string(),
})
})
.collect();
CanonicalDocument {
doc_id, source_asset_id: aid, workspace_path: wp, title: "a".into(),
lang: Lang("und".into()), blocks,
doc_id,
source_asset_id: aid,
workspace_path: wp,
title: "a".into(),
lang: Lang("und".into()),
blocks,
metadata: Metadata {
aliases: vec![], tags: vec![],
aliases: vec![],
tags: vec![],
created_at: OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(),
updated_at: OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(),
source_type: SourceType::Note, trust_level: TrustLevel::Primary,
user_id_alias: None, user: Default::default(),
repo: Some("kebab".into()), git_branch: Some("main".into()),
git_commit: Some("0".repeat(40)), code_lang: Some("javascript".into()),
source_type: SourceType::Note,
trust_level: TrustLevel::Primary,
user_id_alias: None,
user: Default::default(),
repo: Some("kebab".into()),
git_branch: Some("main".into()),
git_commit: Some("0".repeat(40)),
code_lang: Some("javascript".into()),
},
provenance: Provenance { events: vec![] },
parser_version: pv, schema_version: 1, doc_version: 1,
last_chunker_version: None, last_embedding_version: None,
parser_version: pv,
schema_version: 1,
doc_version: 1,
last_chunker_version: None,
last_embedding_version: None,
}
}
fn policy() -> ChunkPolicy {
ChunkPolicy { target_tokens: 500, overlap_tokens: 80,
ChunkPolicy {
target_tokens: 500,
overlap_tokens: 80,
respect_markdown_headings: false,
chunker_version: ChunkerVersion(VERSION_LABEL.into()) }
chunker_version: ChunkerVersion(VERSION_LABEL.into()),
}
}
#[test]
fn chunker_version_is_code_js_ast_v1() {
assert_eq!(CodeJsAstV1Chunker.chunker_version(),
ChunkerVersion("code-js-ast-v1".into()));
assert_eq!(
CodeJsAstV1Chunker.chunker_version(),
ChunkerVersion("code-js-ast-v1".into())
);
}
#[test]
fn one_chunk_per_unit_preserves_code_span() {
let doc = code_doc(&[
("parse", 1, 3, "function parse() {\n // x\n}"),
("Foo.double", 5, 7, "function double() {\n //\n return 0;\n}"),
(
"Foo.double",
5,
7,
"function double() {\n //\n return 0;\n}",
),
]);
let chunks = CodeJsAstV1Chunker.chunk(&doc, &policy()).unwrap();
assert_eq!(chunks.len(), 2);
@@ -256,7 +290,12 @@ mod tests {
assert_eq!(c.chunker_version.0, "code-js-ast-v1");
}
match &chunks[0].source_spans[0] {
SourceSpan::Code { symbol, line_start, line_end, .. } => {
SourceSpan::Code {
symbol,
line_start,
line_end,
..
} => {
assert_eq!(symbol.as_deref(), Some("parse"));
assert_eq!((*line_start, *line_end), (1, 3));
}
@@ -266,22 +305,33 @@ mod tests {
#[test]
fn oversize_unit_splits_into_parts_with_unique_ids() {
let body = (0..500).map(|i| format!(" const x{i} = {i};")).collect::<Vec<_>>().join("\n");
let body = (0..500)
.map(|i| format!(" const x{i} = {i};"))
.collect::<Vec<_>>()
.join("\n");
let code = format!("function big() {{\n{body}\n}}");
let doc = code_doc(&[("big", 1, 502, &code)]);
let chunks = CodeJsAstV1Chunker.chunk(&doc, &policy()).unwrap();
assert!(chunks.len() >= 2, "oversize unit must split, got {}", chunks.len());
assert!(
chunks.len() >= 2,
"oversize unit must split, got {}",
chunks.len()
);
for c in &chunks {
match &c.source_spans[0] {
SourceSpan::Code { symbol, .. } => {
assert!(symbol.as_deref().unwrap().starts_with("big [part "),
"part-numbered symbol, got {symbol:?}");
assert!(
symbol.as_deref().unwrap().starts_with("big [part "),
"part-numbered symbol, got {symbol:?}"
);
}
_ => unreachable!(),
}
}
let mut ids: Vec<&str> = chunks.iter().map(|c| c.chunk_id.0.as_str()).collect();
let n = ids.len(); ids.sort_unstable(); ids.dedup();
let n = ids.len();
ids.sort_unstable();
ids.dedup();
assert_eq!(ids.len(), n, "chunk_ids unique across split parts");
}
@@ -295,7 +345,8 @@ mod tests {
heading_path: vec![],
source_span: SourceSpan::Line { start: 1, end: 1 },
},
text: "x".into(), inlines: vec![],
text: "x".into(),
inlines: vec![],
})];
let err = CodeJsAstV1Chunker.chunk(&doc, &policy()).unwrap_err();
assert!(err.to_string().contains("CodeJsAstV1Chunker"));
@@ -304,11 +355,19 @@ mod tests {
#[test]
fn deterministic_chunk_ids_1000() {
let doc = code_doc(&[("parse", 1, 2, "function parse() {}\n")]);
let base: Vec<String> = CodeJsAstV1Chunker.chunk(&doc, &policy())
.unwrap().into_iter().map(|c| c.chunk_id.0).collect();
let base: Vec<String> = CodeJsAstV1Chunker
.chunk(&doc, &policy())
.unwrap()
.into_iter()
.map(|c| c.chunk_id.0)
.collect();
for _ in 0..1000 {
let again: Vec<String> = CodeJsAstV1Chunker.chunk(&doc, &policy())
.unwrap().into_iter().map(|c| c.chunk_id.0).collect();
let again: Vec<String> = CodeJsAstV1Chunker
.chunk(&doc, &policy())
.unwrap()
.into_iter()
.map(|c| c.chunk_id.0)
.collect();
assert_eq!(again, base);
}
}
@@ -316,7 +375,9 @@ mod tests {
#[test]
fn policy_hash_matches_md_heading_v1() {
let p = policy();
assert_eq!(CodeJsAstV1Chunker.policy_hash(&p),
crate::MdHeadingV1Chunker.policy_hash(&p));
assert_eq!(
CodeJsAstV1Chunker.policy_hash(&p),
crate::MdHeadingV1Chunker.policy_hash(&p)
);
}
}

@@ -39,11 +39,7 @@ impl Chunker for CodeKotlinAstV1Chunker {
hex[..POLICY_HASH_HEX_LEN].to_string()
}
fn chunk(
&self,
doc: &CanonicalDocument,
policy: &ChunkPolicy,
) -> anyhow::Result<Vec<Chunk>> {
fn chunk(&self, doc: &CanonicalDocument, policy: &ChunkPolicy) -> anyhow::Result<Vec<Chunk>> {
for b in &doc.blocks {
let c = match b {
Block::Code(c) => c,
@@ -68,9 +64,12 @@ impl Chunker for CodeKotlinAstV1Chunker {
_ => unreachable!("validated above"),
};
let (ls, le, symbol, lang) = match &cb.common.source_span {
SourceSpan::Code { line_start, line_end, symbol, lang } => {
(*line_start, *line_end, symbol.clone(), lang.clone())
}
SourceSpan::Code {
line_start,
line_end,
symbol,
lang,
} => (*line_start, *line_end, symbol.clone(), lang.clone()),
_ => unreachable!("validated above"),
};
let block_ids: Vec<BlockId> = vec![cb.common.block_id.clone()];
@@ -84,8 +83,13 @@ impl Chunker for CodeKotlinAstV1Chunker {
lang: lang.clone(),
};
out.push(make_chunk(
doc, &chunker_version, &block_ids, &base_policy_hash,
None, span, cb.code.clone(),
doc,
&chunker_version,
&block_ids,
&base_policy_hash,
None,
span,
cb.code.clone(),
));
} else {
let parts = split_oversize(&cb.code);
@@ -93,9 +97,7 @@ impl Chunker for CodeKotlinAstV1Chunker {
for (i, (off_start, off_end, text)) in parts.into_iter().enumerate() {
let part_ls = ls + off_start;
let part_le = ls + off_end;
let part_sym = symbol
.as_ref()
.map(|s| format!("{s} [part {}/{n}]", i + 1));
let part_sym = symbol.as_ref().map(|s| format!("{s} [part {}/{n}]", i + 1));
let span = SourceSpan::Code {
line_start: part_ls,
line_end: part_le,
@@ -103,8 +105,13 @@ impl Chunker for CodeKotlinAstV1Chunker {
lang: lang.clone(),
};
out.push(make_chunk(
doc, &chunker_version, &block_ids, &base_policy_hash,
Some(part_ls), span, text,
doc,
&chunker_version,
&block_ids,
&base_policy_hash,
Some(part_ls),
span,
text,
));
}
}
@@ -140,6 +147,7 @@ fn make_chunk(
chunk_id,
doc_id: DocumentId(doc.doc_id.0.clone()),
block_ids: block_ids.to_vec(),
tokenized_korean_text: crate::tokenize_korean_morphological(&text),
text,
heading_path: Vec::new(),
source_spans: vec![span],
@@ -183,9 +191,9 @@ fn split_oversize(code: &str) -> Vec<(u32, u32, String)> {
mod tests {
use super::*;
use kebab_core::{
Block, CanonicalDocument, ChunkPolicy, Chunker, ChunkerVersion, CodeBlock, CommonBlock,
SourceSpan, id_for_block, id_for_doc, AssetId, Lang, Metadata, ParserVersion, Provenance,
SourceType, TrustLevel, WorkspacePath,
AssetId, Block, CanonicalDocument, ChunkPolicy, Chunker, ChunkerVersion, CodeBlock,
CommonBlock, Lang, Metadata, ParserVersion, Provenance, SourceSpan, SourceType, TrustLevel,
WorkspacePath, id_for_block, id_for_doc,
};
use time::OffsetDateTime;
@@ -206,46 +214,72 @@ mod tests {
};
let bid = id_for_block(&doc_id, "code", &[], i as u32, &span);
Block::Code(CodeBlock {
common: CommonBlock { block_id: bid, heading_path: vec![], source_span: span },
common: CommonBlock {
block_id: bid,
heading_path: vec![],
source_span: span,
},
lang: Some("kotlin".into()),
code: (*code).to_string(),
})
})
.collect();
CanonicalDocument {
doc_id, source_asset_id: aid, workspace_path: wp, title: "a".into(),
lang: Lang("und".into()), blocks,
doc_id,
source_asset_id: aid,
workspace_path: wp,
title: "a".into(),
lang: Lang("und".into()),
blocks,
metadata: Metadata {
aliases: vec![], tags: vec![],
aliases: vec![],
tags: vec![],
created_at: OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(),
updated_at: OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(),
source_type: SourceType::Note, trust_level: TrustLevel::Primary,
user_id_alias: None, user: Default::default(),
repo: Some("kebab".into()), git_branch: Some("main".into()),
git_commit: Some("0".repeat(40)), code_lang: Some("kotlin".into()),
source_type: SourceType::Note,
trust_level: TrustLevel::Primary,
user_id_alias: None,
user: Default::default(),
repo: Some("kebab".into()),
git_branch: Some("main".into()),
git_commit: Some("0".repeat(40)),
code_lang: Some("kotlin".into()),
},
provenance: Provenance { events: vec![] },
parser_version: pv, schema_version: 1, doc_version: 1,
last_chunker_version: None, last_embedding_version: None,
parser_version: pv,
schema_version: 1,
doc_version: 1,
last_chunker_version: None,
last_embedding_version: None,
}
}
fn policy() -> ChunkPolicy {
ChunkPolicy { target_tokens: 500, overlap_tokens: 80,
ChunkPolicy {
target_tokens: 500,
overlap_tokens: 80,
respect_markdown_headings: false,
chunker_version: ChunkerVersion(VERSION_LABEL.into()) }
chunker_version: ChunkerVersion(VERSION_LABEL.into()),
}
}
#[test]
fn chunker_version_is_code_kotlin_ast_v1() {
assert_eq!(CodeKotlinAstV1Chunker.chunker_version(),
ChunkerVersion("code-kotlin-ast-v1".into()));
assert_eq!(
CodeKotlinAstV1Chunker.chunker_version(),
ChunkerVersion("code-kotlin-ast-v1".into())
);
}
#[test]
fn one_chunk_per_unit_preserves_code_span() {
let doc = code_doc(&[
("parse", 1, 3, "fun parse() {\n\t// x\n}"),
("Foo.double", 5, 7, "fun double(): Int {\n\t//\n\treturn 0\n}"),
(
"Foo.double",
5,
7,
"fun double(): Int {\n\t//\n\treturn 0\n}",
),
]);
let chunks = CodeKotlinAstV1Chunker.chunk(&doc, &policy()).unwrap();
assert_eq!(chunks.len(), 2);
@@ -256,7 +290,12 @@ mod tests {
assert_eq!(c.chunker_version.0, "code-kotlin-ast-v1");
}
match &chunks[0].source_spans[0] {
SourceSpan::Code { symbol, line_start, line_end, .. } => {
SourceSpan::Code {
symbol,
line_start,
line_end,
..
} => {
assert_eq!(symbol.as_deref(), Some("parse"));
assert_eq!((*line_start, *line_end), (1, 3));
}
@@ -266,22 +305,33 @@ mod tests {
#[test]
fn oversize_unit_splits_into_parts_with_unique_ids() {
let body = (0..500).map(|i| format!("\tval x{i} = {i}")).collect::<Vec<_>>().join("\n");
let body = (0..500)
.map(|i| format!("\tval x{i} = {i}"))
.collect::<Vec<_>>()
.join("\n");
let code = format!("fun big() {{\n{body}\n}}");
let doc = code_doc(&[("big", 1, 502, &code)]);
let chunks = CodeKotlinAstV1Chunker.chunk(&doc, &policy()).unwrap();
assert!(chunks.len() >= 2, "oversize unit must split, got {}", chunks.len());
assert!(
chunks.len() >= 2,
"oversize unit must split, got {}",
chunks.len()
);
for c in &chunks {
match &c.source_spans[0] {
SourceSpan::Code { symbol, .. } => {
assert!(symbol.as_deref().unwrap().starts_with("big [part "),
"part-numbered symbol, got {symbol:?}");
assert!(
symbol.as_deref().unwrap().starts_with("big [part "),
"part-numbered symbol, got {symbol:?}"
);
}
_ => unreachable!(),
}
}
let mut ids: Vec<&str> = chunks.iter().map(|c| c.chunk_id.0.as_str()).collect();
let n = ids.len(); ids.sort_unstable(); ids.dedup();
let n = ids.len();
ids.sort_unstable();
ids.dedup();
assert_eq!(ids.len(), n, "chunk_ids unique across split parts");
}
@@ -295,7 +345,8 @@ mod tests {
heading_path: vec![],
source_span: SourceSpan::Line { start: 1, end: 1 },
},
text: "x".into(), inlines: vec![],
text: "x".into(),
inlines: vec![],
})];
let err = CodeKotlinAstV1Chunker.chunk(&doc, &policy()).unwrap_err();
assert!(err.to_string().contains("CodeKotlinAstV1Chunker"));
@@ -304,11 +355,19 @@ mod tests {
#[test]
fn deterministic_chunk_ids_1000() {
let doc = code_doc(&[("parse", 1, 2, "fun parse() {}\n")]);
let base: Vec<String> = CodeKotlinAstV1Chunker.chunk(&doc, &policy())
.unwrap().into_iter().map(|c| c.chunk_id.0).collect();
let base: Vec<String> = CodeKotlinAstV1Chunker
.chunk(&doc, &policy())
.unwrap()
.into_iter()
.map(|c| c.chunk_id.0)
.collect();
for _ in 0..1000 {
let again: Vec<String> = CodeKotlinAstV1Chunker.chunk(&doc, &policy())
.unwrap().into_iter().map(|c| c.chunk_id.0).collect();
let again: Vec<String> = CodeKotlinAstV1Chunker
.chunk(&doc, &policy())
.unwrap()
.into_iter()
.map(|c| c.chunk_id.0)
.collect();
assert_eq!(again, base);
}
}
@@ -316,7 +375,9 @@ mod tests {
#[test]
fn policy_hash_matches_md_heading_v1() {
let p = policy();
assert_eq!(CodeKotlinAstV1Chunker.policy_hash(&p),
crate::MdHeadingV1Chunker.policy_hash(&p));
assert_eq!(
CodeKotlinAstV1Chunker.policy_hash(&p),
crate::MdHeadingV1Chunker.policy_hash(&p)
);
}
}

@@ -39,11 +39,7 @@ impl Chunker for CodePythonAstV1Chunker {
hex[..POLICY_HASH_HEX_LEN].to_string()
}
fn chunk(
&self,
doc: &CanonicalDocument,
policy: &ChunkPolicy,
) -> anyhow::Result<Vec<Chunk>> {
fn chunk(&self, doc: &CanonicalDocument, policy: &ChunkPolicy) -> anyhow::Result<Vec<Chunk>> {
for b in &doc.blocks {
let c = match b {
Block::Code(c) => c,
@@ -68,9 +64,12 @@ impl Chunker for CodePythonAstV1Chunker {
_ => unreachable!("validated above"),
};
let (ls, le, symbol, lang) = match &cb.common.source_span {
SourceSpan::Code { line_start, line_end, symbol, lang } => {
(*line_start, *line_end, symbol.clone(), lang.clone())
}
SourceSpan::Code {
line_start,
line_end,
symbol,
lang,
} => (*line_start, *line_end, symbol.clone(), lang.clone()),
_ => unreachable!("validated above"),
};
let block_ids: Vec<BlockId> = vec![cb.common.block_id.clone()];
@@ -84,8 +83,13 @@ impl Chunker for CodePythonAstV1Chunker {
lang: lang.clone(),
};
out.push(make_chunk(
doc, &chunker_version, &block_ids, &base_policy_hash,
None, span, cb.code.clone(),
doc,
&chunker_version,
&block_ids,
&base_policy_hash,
None,
span,
cb.code.clone(),
));
} else {
let parts = split_oversize(&cb.code);
@@ -93,9 +97,7 @@ impl Chunker for CodePythonAstV1Chunker {
for (i, (off_start, off_end, text)) in parts.into_iter().enumerate() {
let part_ls = ls + off_start;
let part_le = ls + off_end;
let part_sym = symbol
.as_ref()
.map(|s| format!("{s} [part {}/{n}]", i + 1));
let part_sym = symbol.as_ref().map(|s| format!("{s} [part {}/{n}]", i + 1));
let span = SourceSpan::Code {
line_start: part_ls,
line_end: part_le,
@@ -103,8 +105,13 @@ impl Chunker for CodePythonAstV1Chunker {
lang: lang.clone(),
};
out.push(make_chunk(
doc, &chunker_version, &block_ids, &base_policy_hash,
Some(part_ls), span, text,
doc,
&chunker_version,
&block_ids,
&base_policy_hash,
Some(part_ls),
span,
text,
));
}
}
@@ -140,6 +147,7 @@ fn make_chunk(
chunk_id,
doc_id: DocumentId(doc.doc_id.0.clone()),
block_ids: block_ids.to_vec(),
tokenized_korean_text: crate::tokenize_korean_morphological(&text),
text,
heading_path: Vec::new(),
source_spans: vec![span],
@@ -183,9 +191,9 @@ fn split_oversize(code: &str) -> Vec<(u32, u32, String)> {
mod tests {
use super::*;
use kebab_core::{
Block, CanonicalDocument, ChunkPolicy, Chunker, ChunkerVersion, CodeBlock, CommonBlock,
SourceSpan, id_for_block, id_for_doc, AssetId, Lang, Metadata, ParserVersion, Provenance,
SourceType, TrustLevel, WorkspacePath,
AssetId, Block, CanonicalDocument, ChunkPolicy, Chunker, ChunkerVersion, CodeBlock,
CommonBlock, Lang, Metadata, ParserVersion, Provenance, SourceSpan, SourceType, TrustLevel,
WorkspacePath, id_for_block, id_for_doc,
};
use time::OffsetDateTime;
@@ -206,39 +214,60 @@ mod tests {
};
let bid = id_for_block(&doc_id, "code", &[], i as u32, &span);
Block::Code(CodeBlock {
common: CommonBlock { block_id: bid, heading_path: vec![], source_span: span },
common: CommonBlock {
block_id: bid,
heading_path: vec![],
source_span: span,
},
lang: Some("python".into()),
code: (*code).to_string(),
})
})
.collect();
CanonicalDocument {
doc_id, source_asset_id: aid, workspace_path: wp, title: "a".into(),
lang: Lang("und".into()), blocks,
doc_id,
source_asset_id: aid,
workspace_path: wp,
title: "a".into(),
lang: Lang("und".into()),
blocks,
metadata: Metadata {
aliases: vec![], tags: vec![],
aliases: vec![],
tags: vec![],
created_at: OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(),
updated_at: OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(),
source_type: SourceType::Note, trust_level: TrustLevel::Primary,
user_id_alias: None, user: Default::default(),
repo: Some("kebab".into()), git_branch: Some("main".into()),
git_commit: Some("0".repeat(40)), code_lang: Some("python".into()),
source_type: SourceType::Note,
trust_level: TrustLevel::Primary,
user_id_alias: None,
user: Default::default(),
repo: Some("kebab".into()),
git_branch: Some("main".into()),
git_commit: Some("0".repeat(40)),
code_lang: Some("python".into()),
},
provenance: Provenance { events: vec![] },
parser_version: pv, schema_version: 1, doc_version: 1,
last_chunker_version: None, last_embedding_version: None,
parser_version: pv,
schema_version: 1,
doc_version: 1,
last_chunker_version: None,
last_embedding_version: None,
}
}
fn policy() -> ChunkPolicy {
ChunkPolicy { target_tokens: 500, overlap_tokens: 80,
ChunkPolicy {
target_tokens: 500,
overlap_tokens: 80,
respect_markdown_headings: false,
chunker_version: ChunkerVersion(VERSION_LABEL.into()) }
chunker_version: ChunkerVersion(VERSION_LABEL.into()),
}
}
#[test]
fn chunker_version_is_code_python_ast_v1() {
assert_eq!(CodePythonAstV1Chunker.chunker_version(),
ChunkerVersion("code-python-ast-v1".into()));
assert_eq!(
CodePythonAstV1Chunker.chunker_version(),
ChunkerVersion("code-python-ast-v1".into())
);
}
#[test]
@@ -256,7 +285,12 @@ mod tests {
assert_eq!(c.chunker_version.0, "code-python-ast-v1");
}
match &chunks[0].source_spans[0] {
SourceSpan::Code { symbol, line_start, line_end, .. } => {
SourceSpan::Code {
symbol,
line_start,
line_end,
..
} => {
assert_eq!(symbol.as_deref(), Some("parse"));
assert_eq!((*line_start, *line_end), (1, 3));
}
@@ -266,22 +300,33 @@ mod tests {
#[test]
fn oversize_unit_splits_into_parts_with_unique_ids() {
let body = (0..500).map(|i| format!(" x{i} = {i}")).collect::<Vec<_>>().join("\n");
let body = (0..500)
.map(|i| format!(" x{i} = {i}"))
.collect::<Vec<_>>()
.join("\n");
let code = format!("def big():\n{body}\n");
let doc = code_doc(&[("big", 1, 502, &code)]);
let chunks = CodePythonAstV1Chunker.chunk(&doc, &policy()).unwrap();
assert!(chunks.len() >= 2, "oversize unit must split, got {}", chunks.len());
assert!(
chunks.len() >= 2,
"oversize unit must split, got {}",
chunks.len()
);
for c in &chunks {
match &c.source_spans[0] {
SourceSpan::Code { symbol, .. } => {
assert!(symbol.as_deref().unwrap().starts_with("big [part "),
"part-numbered symbol, got {symbol:?}");
assert!(
symbol.as_deref().unwrap().starts_with("big [part "),
"part-numbered symbol, got {symbol:?}"
);
}
_ => unreachable!(),
}
}
let mut ids: Vec<&str> = chunks.iter().map(|c| c.chunk_id.0.as_str()).collect();
let n = ids.len(); ids.sort_unstable(); ids.dedup();
let n = ids.len();
ids.sort_unstable();
ids.dedup();
assert_eq!(ids.len(), n, "chunk_ids unique across split parts");
}
@@ -295,7 +340,8 @@ mod tests {
heading_path: vec![],
source_span: SourceSpan::Line { start: 1, end: 1 },
},
text: "x".into(), inlines: vec![],
text: "x".into(),
inlines: vec![],
})];
let err = CodePythonAstV1Chunker.chunk(&doc, &policy()).unwrap_err();
assert!(err.to_string().contains("CodePythonAstV1Chunker"));
@@ -304,11 +350,19 @@ mod tests {
#[test]
fn deterministic_chunk_ids_1000() {
let doc = code_doc(&[("parse", 1, 2, "def parse(): pass\n")]);
let base: Vec<String> = CodePythonAstV1Chunker.chunk(&doc, &policy())
.unwrap().into_iter().map(|c| c.chunk_id.0).collect();
let base: Vec<String> = CodePythonAstV1Chunker
.chunk(&doc, &policy())
.unwrap()
.into_iter()
.map(|c| c.chunk_id.0)
.collect();
for _ in 0..1000 {
let again: Vec<String> = CodePythonAstV1Chunker.chunk(&doc, &policy())
.unwrap().into_iter().map(|c| c.chunk_id.0).collect();
let again: Vec<String> = CodePythonAstV1Chunker
.chunk(&doc, &policy())
.unwrap()
.into_iter()
.map(|c| c.chunk_id.0)
.collect();
assert_eq!(again, base);
}
}
@@ -316,7 +370,9 @@ mod tests {
#[test]
fn policy_hash_matches_md_heading_v1() {
let p = policy();
assert_eq!(CodePythonAstV1Chunker.policy_hash(&p),
crate::MdHeadingV1Chunker.policy_hash(&p));
assert_eq!(
CodePythonAstV1Chunker.policy_hash(&p),
crate::MdHeadingV1Chunker.policy_hash(&p)
);
}
}

@@ -39,11 +39,7 @@ impl Chunker for CodeRustAstV1Chunker {
hex[..POLICY_HASH_HEX_LEN].to_string()
}
fn chunk(
&self,
doc: &CanonicalDocument,
policy: &ChunkPolicy,
) -> anyhow::Result<Vec<Chunk>> {
fn chunk(&self, doc: &CanonicalDocument, policy: &ChunkPolicy) -> anyhow::Result<Vec<Chunk>> {
for b in &doc.blocks {
let c = match b {
Block::Code(c) => c,
@@ -68,9 +64,12 @@ impl Chunker for CodeRustAstV1Chunker {
_ => unreachable!("validated above"),
};
let (ls, le, symbol, lang) = match &cb.common.source_span {
SourceSpan::Code { line_start, line_end, symbol, lang } => {
(*line_start, *line_end, symbol.clone(), lang.clone())
}
SourceSpan::Code {
line_start,
line_end,
symbol,
lang,
} => (*line_start, *line_end, symbol.clone(), lang.clone()),
_ => unreachable!("validated above"),
};
let block_ids: Vec<BlockId> = vec![cb.common.block_id.clone()];
@@ -84,8 +83,13 @@ impl Chunker for CodeRustAstV1Chunker {
lang: lang.clone(),
};
out.push(make_chunk(
doc, &chunker_version, &block_ids, &base_policy_hash,
None, span, cb.code.clone(),
doc,
&chunker_version,
&block_ids,
&base_policy_hash,
None,
span,
cb.code.clone(),
));
} else {
let parts = split_oversize(&cb.code);
@@ -93,9 +97,7 @@ impl Chunker for CodeRustAstV1Chunker {
for (i, (off_start, off_end, text)) in parts.into_iter().enumerate() {
let part_ls = ls + off_start;
let part_le = ls + off_end;
let part_sym = symbol
.as_ref()
.map(|s| format!("{s} [part {}/{n}]", i + 1));
let part_sym = symbol.as_ref().map(|s| format!("{s} [part {}/{n}]", i + 1));
let span = SourceSpan::Code {
line_start: part_ls,
line_end: part_le,
@@ -103,8 +105,13 @@ impl Chunker for CodeRustAstV1Chunker {
lang: lang.clone(),
};
out.push(make_chunk(
doc, &chunker_version, &block_ids, &base_policy_hash,
Some(part_ls), span, text,
doc,
&chunker_version,
&block_ids,
&base_policy_hash,
Some(part_ls),
span,
text,
));
}
}
@@ -140,6 +147,7 @@ fn make_chunk(
chunk_id,
doc_id: DocumentId(doc.doc_id.0.clone()),
block_ids: block_ids.to_vec(),
tokenized_korean_text: crate::tokenize_korean_morphological(&text),
text,
heading_path: Vec::new(),
source_spans: vec![span],
@@ -183,9 +191,9 @@ fn split_oversize(code: &str) -> Vec<(u32, u32, String)> {
mod tests {
use super::*;
use kebab_core::{
Block, CanonicalDocument, ChunkPolicy, Chunker, ChunkerVersion, CodeBlock, CommonBlock,
SourceSpan, id_for_block, id_for_doc, AssetId, Lang, Metadata, ParserVersion, Provenance,
SourceType, TrustLevel, WorkspacePath,
AssetId, Block, CanonicalDocument, ChunkPolicy, Chunker, ChunkerVersion, CodeBlock,
CommonBlock, Lang, Metadata, ParserVersion, Provenance, SourceSpan, SourceType, TrustLevel,
WorkspacePath, id_for_block, id_for_doc,
};
use time::OffsetDateTime;
@@ -206,39 +214,60 @@ mod tests {
};
let bid = id_for_block(&doc_id, "code", &[], i as u32, &span);
Block::Code(CodeBlock {
common: CommonBlock { block_id: bid, heading_path: vec![], source_span: span },
common: CommonBlock {
block_id: bid,
heading_path: vec![],
source_span: span,
},
lang: Some("rust".into()),
code: (*code).to_string(),
})
})
.collect();
CanonicalDocument {
doc_id, source_asset_id: aid, workspace_path: wp, title: "a".into(),
lang: Lang("und".into()), blocks,
doc_id,
source_asset_id: aid,
workspace_path: wp,
title: "a".into(),
lang: Lang("und".into()),
blocks,
metadata: Metadata {
aliases: vec![], tags: vec![],
aliases: vec![],
tags: vec![],
created_at: OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(),
updated_at: OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(),
source_type: SourceType::Note, trust_level: TrustLevel::Primary,
user_id_alias: None, user: Default::default(),
repo: Some("kebab".into()), git_branch: Some("main".into()),
git_commit: Some("0".repeat(40)), code_lang: Some("rust".into()),
source_type: SourceType::Note,
trust_level: TrustLevel::Primary,
user_id_alias: None,
user: Default::default(),
repo: Some("kebab".into()),
git_branch: Some("main".into()),
git_commit: Some("0".repeat(40)),
code_lang: Some("rust".into()),
},
provenance: Provenance { events: vec![] },
parser_version: pv, schema_version: 1, doc_version: 1,
last_chunker_version: None, last_embedding_version: None,
parser_version: pv,
schema_version: 1,
doc_version: 1,
last_chunker_version: None,
last_embedding_version: None,
}
}
fn policy() -> ChunkPolicy {
ChunkPolicy { target_tokens: 500, overlap_tokens: 80,
ChunkPolicy {
target_tokens: 500,
overlap_tokens: 80,
respect_markdown_headings: false,
chunker_version: ChunkerVersion(VERSION_LABEL.into()) }
chunker_version: ChunkerVersion(VERSION_LABEL.into()),
}
}
#[test]
fn chunker_version_is_code_rust_ast_v1() {
assert_eq!(CodeRustAstV1Chunker.chunker_version(),
ChunkerVersion("code-rust-ast-v1".into()));
assert_eq!(
CodeRustAstV1Chunker.chunker_version(),
ChunkerVersion("code-rust-ast-v1".into())
);
}
#[test]
@@ -256,7 +285,12 @@ mod tests {
assert_eq!(c.chunker_version.0, "code-rust-ast-v1");
}
match &chunks[0].source_spans[0] {
SourceSpan::Code { symbol, line_start, line_end, .. } => {
SourceSpan::Code {
symbol,
line_start,
line_end,
..
} => {
assert_eq!(symbol.as_deref(), Some("parse"));
assert_eq!((*line_start, *line_end), (1, 3));
}
@@ -266,22 +300,33 @@ mod tests {
#[test]
fn oversize_unit_splits_into_parts_with_unique_ids() {
let body = (0..500).map(|i| format!(" let x{i} = {i};")).collect::<Vec<_>>().join("\n");
let body = (0..500)
.map(|i| format!(" let x{i} = {i};"))
.collect::<Vec<_>>()
.join("\n");
let code = format!("pub fn big() {{\n{body}\n}}");
let doc = code_doc(&[("big", 1, 502, &code)]);
let chunks = CodeRustAstV1Chunker.chunk(&doc, &policy()).unwrap();
assert!(chunks.len() >= 2, "oversize unit must split, got {}", chunks.len());
assert!(
chunks.len() >= 2,
"oversize unit must split, got {}",
chunks.len()
);
for c in &chunks {
match &c.source_spans[0] {
SourceSpan::Code { symbol, .. } => {
assert!(symbol.as_deref().unwrap().starts_with("big [part "),
"part-numbered symbol, got {symbol:?}");
assert!(
symbol.as_deref().unwrap().starts_with("big [part "),
"part-numbered symbol, got {symbol:?}"
);
}
_ => unreachable!(),
}
}
let mut ids: Vec<&str> = chunks.iter().map(|c| c.chunk_id.0.as_str()).collect();
let n = ids.len(); ids.sort_unstable(); ids.dedup();
let n = ids.len();
ids.sort_unstable();
ids.dedup();
assert_eq!(ids.len(), n, "chunk_ids unique across split parts");
}
@@ -295,7 +340,8 @@ mod tests {
heading_path: vec![],
source_span: SourceSpan::Line { start: 1, end: 1 },
},
text: "x".into(), inlines: vec![],
text: "x".into(),
inlines: vec![],
})];
let err = CodeRustAstV1Chunker.chunk(&doc, &policy()).unwrap_err();
assert!(err.to_string().contains("CodeRustAstV1Chunker"));
@@ -304,11 +350,19 @@ mod tests {
#[test]
fn deterministic_chunk_ids_1000() {
let doc = code_doc(&[("parse", 1, 2, "fn parse(){}\n}")]);
let base: Vec<String> = CodeRustAstV1Chunker.chunk(&doc, &policy())
.unwrap().into_iter().map(|c| c.chunk_id.0).collect();
let base: Vec<String> = CodeRustAstV1Chunker
.chunk(&doc, &policy())
.unwrap()
.into_iter()
.map(|c| c.chunk_id.0)
.collect();
for _ in 0..1000 {
let again: Vec<String> = CodeRustAstV1Chunker.chunk(&doc, &policy())
.unwrap().into_iter().map(|c| c.chunk_id.0).collect();
let again: Vec<String> = CodeRustAstV1Chunker
.chunk(&doc, &policy())
.unwrap()
.into_iter()
.map(|c| c.chunk_id.0)
.collect();
assert_eq!(again, base);
}
}
@@ -316,7 +370,9 @@ mod tests {
#[test]
fn policy_hash_matches_md_heading_v1() {
let p = policy();
assert_eq!(CodeRustAstV1Chunker.policy_hash(&p),
crate::MdHeadingV1Chunker.policy_hash(&p));
assert_eq!(
CodeRustAstV1Chunker.policy_hash(&p),
crate::MdHeadingV1Chunker.policy_hash(&p)
);
}
}

@@ -9,7 +9,7 @@
use crate::tier2_shared::{build_chunk_no_symbol, policy_hash};
use anyhow::Result;
use kebab_core::{Block, CanonicalDocument, Chunk, ChunkPolicy, ChunkerVersion, Chunker};
use kebab_core::{Block, CanonicalDocument, Chunk, ChunkPolicy, Chunker, ChunkerVersion};
pub const VERSION_LABEL: &str = "code-text-paragraph-v1";

@@ -39,17 +39,13 @@ impl Chunker for CodeTsAstV1Chunker {
hex[..POLICY_HASH_HEX_LEN].to_string()
}
fn chunk(
&self,
doc: &CanonicalDocument,
policy: &ChunkPolicy,
) -> anyhow::Result<Vec<Chunk>> {
fn chunk(&self, doc: &CanonicalDocument, policy: &ChunkPolicy) -> anyhow::Result<Vec<Chunk>> {
for b in &doc.blocks {
let c = match b {
Block::Code(c) => c,
_ => anyhow::bail!(
"CodeTsAstV1Chunker only handles code docs (got non-Code block)"
),
_ => {
anyhow::bail!("CodeTsAstV1Chunker only handles code docs (got non-Code block)")
}
};
if !matches!(c.common.source_span, SourceSpan::Code { .. }) {
anyhow::bail!(
@@ -68,9 +64,12 @@ impl Chunker for CodeTsAstV1Chunker {
_ => unreachable!("validated above"),
};
let (ls, le, symbol, lang) = match &cb.common.source_span {
SourceSpan::Code { line_start, line_end, symbol, lang } => {
(*line_start, *line_end, symbol.clone(), lang.clone())
}
SourceSpan::Code {
line_start,
line_end,
symbol,
lang,
} => (*line_start, *line_end, symbol.clone(), lang.clone()),
_ => unreachable!("validated above"),
};
let block_ids: Vec<BlockId> = vec![cb.common.block_id.clone()];
@@ -84,8 +83,13 @@ impl Chunker for CodeTsAstV1Chunker {
lang: lang.clone(),
};
out.push(make_chunk(
doc, &chunker_version, &block_ids, &base_policy_hash,
None, span, cb.code.clone(),
doc,
&chunker_version,
&block_ids,
&base_policy_hash,
None,
span,
cb.code.clone(),
));
} else {
let parts = split_oversize(&cb.code);
@@ -93,9 +97,7 @@ impl Chunker for CodeTsAstV1Chunker {
for (i, (off_start, off_end, text)) in parts.into_iter().enumerate() {
let part_ls = ls + off_start;
let part_le = ls + off_end;
let part_sym = symbol
.as_ref()
.map(|s| format!("{s} [part {}/{n}]", i + 1));
let part_sym = symbol.as_ref().map(|s| format!("{s} [part {}/{n}]", i + 1));
let span = SourceSpan::Code {
line_start: part_ls,
line_end: part_le,
@@ -103,8 +105,13 @@ impl Chunker for CodeTsAstV1Chunker {
lang: lang.clone(),
};
out.push(make_chunk(
doc, &chunker_version, &block_ids, &base_policy_hash,
Some(part_ls), span, text,
doc,
&chunker_version,
&block_ids,
&base_policy_hash,
Some(part_ls),
span,
text,
));
}
}
@@ -140,6 +147,7 @@ fn make_chunk(
chunk_id,
doc_id: DocumentId(doc.doc_id.0.clone()),
block_ids: block_ids.to_vec(),
tokenized_korean_text: crate::tokenize_korean_morphological(&text),
text,
heading_path: Vec::new(),
source_spans: vec![span],
@@ -183,9 +191,9 @@ fn split_oversize(code: &str) -> Vec<(u32, u32, String)> {
mod tests {
use super::*;
use kebab_core::{
Block, CanonicalDocument, ChunkPolicy, Chunker, ChunkerVersion, CodeBlock, CommonBlock,
SourceSpan, id_for_block, id_for_doc, AssetId, Lang, Metadata, ParserVersion, Provenance,
SourceType, TrustLevel, WorkspacePath,
AssetId, Block, CanonicalDocument, ChunkPolicy, Chunker, ChunkerVersion, CodeBlock,
CommonBlock, Lang, Metadata, ParserVersion, Provenance, SourceSpan, SourceType, TrustLevel,
WorkspacePath, id_for_block, id_for_doc,
};
use time::OffsetDateTime;
@@ -206,46 +214,72 @@ mod tests {
};
let bid = id_for_block(&doc_id, "code", &[], i as u32, &span);
Block::Code(CodeBlock {
common: CommonBlock { block_id: bid, heading_path: vec![], source_span: span },
common: CommonBlock {
block_id: bid,
heading_path: vec![],
source_span: span,
},
lang: Some("typescript".into()),
code: (*code).to_string(),
})
})
.collect();
CanonicalDocument {
doc_id, source_asset_id: aid, workspace_path: wp, title: "a".into(),
lang: Lang("und".into()), blocks,
doc_id,
source_asset_id: aid,
workspace_path: wp,
title: "a".into(),
lang: Lang("und".into()),
blocks,
metadata: Metadata {
aliases: vec![], tags: vec![],
aliases: vec![],
tags: vec![],
created_at: OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(),
updated_at: OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(),
source_type: SourceType::Note, trust_level: TrustLevel::Primary,
user_id_alias: None, user: Default::default(),
repo: Some("kebab".into()), git_branch: Some("main".into()),
git_commit: Some("0".repeat(40)), code_lang: Some("typescript".into()),
source_type: SourceType::Note,
trust_level: TrustLevel::Primary,
user_id_alias: None,
user: Default::default(),
repo: Some("kebab".into()),
git_branch: Some("main".into()),
git_commit: Some("0".repeat(40)),
code_lang: Some("typescript".into()),
},
provenance: Provenance { events: vec![] },
parser_version: pv, schema_version: 1, doc_version: 1,
last_chunker_version: None, last_embedding_version: None,
parser_version: pv,
schema_version: 1,
doc_version: 1,
last_chunker_version: None,
last_embedding_version: None,
}
}
fn policy() -> ChunkPolicy {
ChunkPolicy { target_tokens: 500, overlap_tokens: 80,
ChunkPolicy {
target_tokens: 500,
overlap_tokens: 80,
respect_markdown_headings: false,
chunker_version: ChunkerVersion(VERSION_LABEL.into()) }
chunker_version: ChunkerVersion(VERSION_LABEL.into()),
}
}
#[test]
fn chunker_version_is_code_ts_ast_v1() {
assert_eq!(CodeTsAstV1Chunker.chunker_version(),
ChunkerVersion("code-ts-ast-v1".into()));
assert_eq!(
CodeTsAstV1Chunker.chunker_version(),
ChunkerVersion("code-ts-ast-v1".into())
);
}
#[test]
fn one_chunk_per_unit_preserves_code_span() {
let doc = code_doc(&[
("parse", 1, 3, "function parse(): void {\n // x\n}"),
("Foo.double", 5, 7, "function double(): number {\n //\n return 0;\n}"),
(
"Foo.double",
5,
7,
"function double(): number {\n //\n return 0;\n}",
),
]);
let chunks = CodeTsAstV1Chunker.chunk(&doc, &policy()).unwrap();
assert_eq!(chunks.len(), 2);
@@ -256,7 +290,12 @@ mod tests {
assert_eq!(c.chunker_version.0, "code-ts-ast-v1");
}
match &chunks[0].source_spans[0] {
SourceSpan::Code { symbol, line_start, line_end, .. } => {
SourceSpan::Code {
symbol,
line_start,
line_end,
..
} => {
assert_eq!(symbol.as_deref(), Some("parse"));
assert_eq!((*line_start, *line_end), (1, 3));
}
@@ -266,22 +305,33 @@ mod tests {
#[test]
fn oversize_unit_splits_into_parts_with_unique_ids() {
let body = (0..500).map(|i| format!(" const x{i} = {i};")).collect::<Vec<_>>().join("\n");
let body = (0..500)
.map(|i| format!(" const x{i} = {i};"))
.collect::<Vec<_>>()
.join("\n");
let code = format!("function big(): void {{\n{body}\n}}");
let doc = code_doc(&[("big", 1, 502, &code)]);
let chunks = CodeTsAstV1Chunker.chunk(&doc, &policy()).unwrap();
assert!(chunks.len() >= 2, "oversize unit must split, got {}", chunks.len());
assert!(
chunks.len() >= 2,
"oversize unit must split, got {}",
chunks.len()
);
for c in &chunks {
match &c.source_spans[0] {
SourceSpan::Code { symbol, .. } => {
assert!(symbol.as_deref().unwrap().starts_with("big [part "),
"part-numbered symbol, got {symbol:?}");
assert!(
symbol.as_deref().unwrap().starts_with("big [part "),
"part-numbered symbol, got {symbol:?}"
);
}
_ => unreachable!(),
}
}
let mut ids: Vec<&str> = chunks.iter().map(|c| c.chunk_id.0.as_str()).collect();
let n = ids.len(); ids.sort_unstable(); ids.dedup();
let n = ids.len();
ids.sort_unstable();
ids.dedup();
assert_eq!(ids.len(), n, "chunk_ids unique across split parts");
}
@@ -295,7 +345,8 @@ mod tests {
heading_path: vec![],
source_span: SourceSpan::Line { start: 1, end: 1 },
},
text: "x".into(), inlines: vec![],
text: "x".into(),
inlines: vec![],
})];
let err = CodeTsAstV1Chunker.chunk(&doc, &policy()).unwrap_err();
assert!(err.to_string().contains("CodeTsAstV1Chunker"));
@@ -304,11 +355,19 @@ mod tests {
#[test]
fn deterministic_chunk_ids_1000() {
let doc = code_doc(&[("parse", 1, 2, "function parse(): void {}\n")]);
let base: Vec<String> = CodeTsAstV1Chunker.chunk(&doc, &policy())
.unwrap().into_iter().map(|c| c.chunk_id.0).collect();
let base: Vec<String> = CodeTsAstV1Chunker
.chunk(&doc, &policy())
.unwrap()
.into_iter()
.map(|c| c.chunk_id.0)
.collect();
for _ in 0..1000 {
let again: Vec<String> = CodeTsAstV1Chunker.chunk(&doc, &policy())
.unwrap().into_iter().map(|c| c.chunk_id.0).collect();
let again: Vec<String> = CodeTsAstV1Chunker
.chunk(&doc, &policy())
.unwrap()
.into_iter()
.map(|c| c.chunk_id.0)
.collect();
assert_eq!(again, base);
}
}
@@ -316,7 +375,9 @@ mod tests {
#[test]
fn policy_hash_matches_md_heading_v1() {
let p = policy();
assert_eq!(CodeTsAstV1Chunker.policy_hash(&p),
crate::MdHeadingV1Chunker.policy_hash(&p));
assert_eq!(
CodeTsAstV1Chunker.policy_hash(&p),
crate::MdHeadingV1Chunker.policy_hash(&p)
);
}
}


@@ -7,7 +7,7 @@
use crate::tier2_shared::{policy_hash, push_chunks_with_oversize};
use anyhow::Result;
use kebab_core::{Block, CanonicalDocument, Chunk, ChunkPolicy, ChunkerVersion, Chunker};
use kebab_core::{Block, CanonicalDocument, Chunk, ChunkPolicy, Chunker, ChunkerVersion};
pub const VERSION_LABEL: &str = "dockerfile-file-v1";


@@ -8,7 +8,7 @@
use crate::tier2_shared::{policy_hash, push_chunks_with_oversize};
use anyhow::Result;
use kebab_core::{Block, CanonicalDocument, Chunk, ChunkPolicy, ChunkerVersion, Chunker};
use kebab_core::{Block, CanonicalDocument, Chunk, ChunkPolicy, Chunker, ChunkerVersion};
pub const VERSION_LABEL: &str = "k8s-manifest-resource-v1";
@@ -49,19 +49,14 @@ impl Chunker for K8sManifestResourceV1Chunker {
.get("apiVersion")
.and_then(|v| v.as_str())
.unwrap_or("");
let kind = mapping
.get("kind")
.and_then(|v| v.as_str())
.unwrap_or("");
let kind = mapping.get("kind").and_then(|v| v.as_str()).unwrap_or("");
// Skip non-k8s documents.
if api.is_empty() || kind.is_empty() {
continue;
}
let metadata = mapping
.get("metadata")
.and_then(|v| v.as_mapping());
let metadata = mapping.get("metadata").and_then(|v| v.as_mapping());
let name = metadata
.and_then(|m| m.get("name"))
.and_then(|v| v.as_str())
@@ -118,10 +113,7 @@ fn split_yaml_documents(text: &str) -> Vec<YamlSlice<'_>> {
.enumerate()
.filter_map(|(i, l)| {
let trimmed = l.trim_end();
if trimmed == "---"
|| trimmed.starts_with("--- ")
|| trimmed.starts_with("---\t")
{
if trimmed == "---" || trimmed.starts_with("--- ") || trimmed.starts_with("---\t") {
Some(i)
} else {
None
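
Reviewer note: the reformatted condition above treats a line as a YAML document separator when, after trimming trailing whitespace, it is exactly `---` or starts with `--- ` or `---\t`. A minimal standalone sketch of that check follows. It assumes the iterator being enumerated is `text.lines()`, and `separator_lines` is a hypothetical name, not a function from this crate.

```rust
/// Returns the indices of lines that act as YAML document separators,
/// using the same three-way check as the hunk above.
/// Assumption: the input is split with `str::lines`.
fn separator_lines(text: &str) -> Vec<usize> {
    text.lines()
        .enumerate()
        .filter_map(|(i, l)| {
            let trimmed = l.trim_end();
            if trimmed == "---" || trimmed.starts_with("--- ") || trimmed.starts_with("---\t") {
                Some(i)
            } else {
                None
            }
        })
        .collect()
}

fn main() {
    // Line 2 is a bare separator, line 4 carries a trailing comment,
    // line 5 ("----") is not a separator.
    let text = "apiVersion: v1\nkind: ConfigMap\n---\nkind: Pod\n--- # trailing comment\n----\n";
    println!("{:?}", separator_lines(text));
}
```

Requiring a space or tab after `---` keeps lines such as `----` or `---foo` from being misread as separators.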


@@ -23,14 +23,14 @@ mod code_js_ast_v1;
mod code_kotlin_ast_v1;
mod code_python_ast_v1;
mod code_rust_ast_v1;
pub mod code_text_paragraph_v1;
mod code_ts_ast_v1;
pub mod dockerfile_file_v1;
pub mod k8s_manifest_resource_v1;
pub mod manifest_file_v1;
mod md_heading_v1;
mod pdf_page_v1;
mod tier2_shared;
pub mod k8s_manifest_resource_v1;
pub mod dockerfile_file_v1;
pub mod manifest_file_v1;
pub mod code_text_paragraph_v1;
pub use code_c_ast_v1::CodeCAstV1Chunker;
pub use code_cpp_ast_v1::CodeCppAstV1Chunker;
@@ -40,10 +40,82 @@ pub use code_js_ast_v1::CodeJsAstV1Chunker;
pub use code_kotlin_ast_v1::CodeKotlinAstV1Chunker;
pub use code_python_ast_v1::CodePythonAstV1Chunker;
pub use code_rust_ast_v1::CodeRustAstV1Chunker;
pub use code_text_paragraph_v1::CodeTextParagraphV1Chunker;
pub use code_ts_ast_v1::CodeTsAstV1Chunker;
pub use dockerfile_file_v1::DockerfileFileV1Chunker;
pub use k8s_manifest_resource_v1::K8sManifestResourceV1Chunker;
pub use manifest_file_v1::ManifestFileV1Chunker;
pub use md_heading_v1::MdHeadingV1Chunker;
pub use pdf_page_v1::PdfPageV1Chunker;
pub use k8s_manifest_resource_v1::K8sManifestResourceV1Chunker;
pub use dockerfile_file_v1::DockerfileFileV1Chunker;
pub use manifest_file_v1::ManifestFileV1Chunker;
pub use code_text_paragraph_v1::CodeTextParagraphV1Chunker;
// ── Korean morphological tokenizer ───────────────────────────────────────────
use lindera::dictionary::{DictionaryKind, load_embedded_dictionary};
use lindera::mode::Mode;
use lindera::segmenter::Segmenter;
use lindera::tokenizer::Tokenizer;
static KOREAN_TOKENIZER: std::sync::OnceLock<Option<Tokenizer>> = std::sync::OnceLock::new();
/// Returns whether a codepoint is a Hangul syllable or jamo; used to filter which tokens the N-gram supplement emits for.
fn is_hangul(c: char) -> bool {
matches!(
c,
'\u{AC00}'..='\u{D7A3}' // Hangul syllables (precomposed)
| '\u{1100}'..='\u{11FF}' // Hangul Jamo
| '\u{3130}'..='\u{318F}' // Hangul Compatibility Jamo
)
}
/// Segments Korean chunk text into morphemes with lindera ko-dic and returns
/// the tokens joined by single spaces. Chunkers use this to pre-fill
/// `Chunk.tokenized_korean_text`.
/// Returns None when analysis fails; the caller handles the NULL fallback.
/// The Tokenizer is initialized once via OnceLock; if the dictionary fails
/// to load, the result is permanently None.
///
/// v0.21.0: N-gram supplement (Option β, post-v0.20.1 enhancement).
/// Works around ko-dic's policy of storing compound nouns (`한국정부`,
/// `서울특별시`, etc.) as a single token: for every Hangul token whose
/// morpheme length is ≥ 3, 2-char sliding-window n-grams are emitted as well.
/// The morpheme `'한국정부'` expands to the 4 tokens
/// `[한국정부, 한국, 국정, 정부]`, so a user's 2-char query (`'한국'`) also
/// hits compound chunks. English and numeric tokens are unaffected
/// (`is_hangul` filter). The DB size and ingest latency trade-off is recorded
/// in the HOTFIXES 2026-05-28 follow-up entry "N-gram supplement (Option β)".
pub fn tokenize_korean_morphological(text: &str) -> Option<String> {
if text.trim().is_empty() {
return None;
}
let tokenizer = KOREAN_TOKENIZER.get_or_init(|| {
let dict = match load_embedded_dictionary(DictionaryKind::KoDic) {
Ok(d) => d,
Err(e) => {
tracing::warn!(target: "kebab-chunk", "tokenize_korean_morphological: dict load failed: {e}");
return None;
}
};
let segmenter = Segmenter::new(Mode::Normal, dict, None);
Some(Tokenizer::new(segmenter))
});
let tokenizer = tokenizer.as_ref()?;
let tokens = tokenizer.tokenize(text).ok()?;
let mut out_tokens: Vec<String> = Vec::with_capacity(tokens.len() * 2);
for tok in tokens.iter() {
let surface = tok.surface.as_ref();
out_tokens.push(surface.to_string());
// N-gram supplement: 2-char sliding window over each Hangul morpheme.
let chars: Vec<char> = surface.chars().collect();
if chars.len() >= 3 && chars.iter().all(|c| is_hangul(*c)) {
for window in chars.windows(2) {
out_tokens.push(window.iter().collect());
}
}
}
let joined = out_tokens.join(" ");
if joined.is_empty() {
None
} else {
Some(joined)
}
}
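
Reviewer note: the N-gram supplement can be exercised without the lindera dictionary. The sketch below takes tokens that are assumed to be already segmented and applies only the expansion step from the function above. `expand_with_bigrams` is a hypothetical helper name, and the morphological segmentation itself is out of scope here.

```rust
/// Same Hangul range check as `is_hangul` in the diff.
fn is_hangul(c: char) -> bool {
    matches!(
        c,
        '\u{AC00}'..='\u{D7A3}' | '\u{1100}'..='\u{11FF}' | '\u{3130}'..='\u{318F}'
    )
}

/// Hypothetical helper: every all-Hangul token of 3 or more chars also
/// emits its 2-char sliding-window n-grams, right after the token itself.
fn expand_with_bigrams(tokens: &[&str]) -> Vec<String> {
    let mut out = Vec::with_capacity(tokens.len() * 2);
    for tok in tokens {
        out.push((*tok).to_string());
        let chars: Vec<char> = tok.chars().collect();
        if chars.len() >= 3 && chars.iter().all(|c| is_hangul(*c)) {
            for window in chars.windows(2) {
                out.push(window.iter().collect());
            }
        }
    }
    out
}

fn main() {
    // "API" is not Hangul and "정부" is only 2 chars, so neither expands.
    let expanded = expand_with_bigrams(&["한국정부", "API", "정부"]);
    println!("{}", expanded.join(" "));
}
```

Because the original token is always emitted first, exact matches on the compound keep working and the bigrams only add recall for 2-char queries.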


@@ -8,7 +8,7 @@
use crate::tier2_shared::{policy_hash, push_chunks_with_oversize};
use anyhow::Result;
use kebab_core::{Block, CanonicalDocument, Chunk, ChunkPolicy, ChunkerVersion, Chunker};
use kebab_core::{Block, CanonicalDocument, Chunk, ChunkPolicy, Chunker, ChunkerVersion};
pub const VERSION_LABEL: &str = "manifest-file-v1";


@@ -1,8 +1,8 @@
//! `md-heading-v1` — heading-aware Markdown chunker.
use kebab_core::{
Block, BlockId, CanonicalDocument, Chunk, ChunkPolicy, Chunker,
ChunkerVersion, DocumentId, SourceSpan, id_for_chunk,
Block, BlockId, CanonicalDocument, Chunk, ChunkPolicy, Chunker, ChunkerVersion, DocumentId,
SourceSpan, id_for_chunk,
};
/// Version label emitted by [`MdHeadingV1Chunker`]. Bumping this label
@@ -99,11 +99,7 @@ impl Chunker for MdHeadingV1Chunker {
hex[..POLICY_HASH_HEX_LEN].to_string()
}
fn chunk(
&self,
doc: &CanonicalDocument,
policy: &ChunkPolicy,
) -> anyhow::Result<Vec<Chunk>> {
fn chunk(&self, doc: &CanonicalDocument, policy: &ChunkPolicy) -> anyhow::Result<Vec<Chunk>> {
let policy_hash = self.policy_hash(policy);
let chunker_version = self.chunker_version();
let mut out: Vec<Chunk> = Vec::new();
@@ -152,22 +148,12 @@ impl Chunker for MdHeadingV1Chunker {
// `collect_overlap_seed` keeps seed ≤ target/2, so
// a flush here never produces a chunk smaller than
// the seed budget.
let would_exceed = acc.text_tokens + next_tokens
> policy.target_tokens
let would_exceed = acc.text_tokens + next_tokens > policy.target_tokens
&& acc.has_non_heading_content();
if would_exceed {
let overlap_seed = collect_overlap_seed(
&acc,
policy.overlap_tokens,
policy.target_tokens,
);
flush(
&mut acc,
doc,
&chunker_version,
&policy_hash,
&mut out,
);
let overlap_seed =
collect_overlap_seed(&acc, policy.overlap_tokens, policy.target_tokens);
flush(&mut acc, doc, &chunker_version, &policy_hash, &mut out);
// Seed next accumulator with the prior chunk's
// tail blocks (paragraph-level overlap). The
// heading is *not* re-included here — it lives
@@ -292,10 +278,11 @@ fn build_chunk(
) -> Chunk {
debug_assert!(!blocks.is_empty(), "build_chunk requires ≥1 block");
let block_ids: Vec<BlockId> =
blocks.iter().map(|b| common(b).block_id.clone()).collect();
let source_spans: Vec<SourceSpan> =
blocks.iter().map(|b| common(b).source_span.clone()).collect();
let block_ids: Vec<BlockId> = blocks.iter().map(|b| common(b).block_id.clone()).collect();
let source_spans: Vec<SourceSpan> = blocks
.iter()
.map(|b| common(b).source_span.clone())
.collect();
// heading_path: pick the first non-Heading block's heading_path
// (which already includes every parent heading per kb-normalize).
@@ -339,17 +326,13 @@ fn build_chunk(
text.len().div_ceil(BYTES_PER_TOKEN)
};
let chunk_id = id_for_chunk(
&doc.doc_id,
chunker_version,
&block_ids,
policy_hash,
);
let chunk_id = id_for_chunk(&doc.doc_id, chunker_version, &block_ids, policy_hash);
Chunk {
chunk_id,
doc_id: DocumentId(doc.doc_id.0.clone()),
block_ids,
tokenized_korean_text: crate::tokenize_korean_morphological(&text),
text,
heading_path,
source_spans,
@@ -400,14 +383,8 @@ fn render_block_text(b: &Block) -> String {
} else {
i.alt.clone()
};
let ocr = i
.ocr
.as_ref()
.map_or("", |o| o.joined.as_str());
let cap = i
.caption
.as_ref()
.map_or("", |c| c.text.as_str());
let ocr = i.ocr.as_ref().map_or("", |o| o.joined.as_str());
let cap = i.caption.as_ref().map_or("", |c| c.text.as_str());
[alt.as_str(), ocr, cap]
.iter()
.filter(|s| !s.is_empty())
@@ -447,9 +424,8 @@ fn common(b: &Block) -> &kebab_core::CommonBlock {
mod tests {
use super::*;
use kebab_core::{
AssetId, CodeBlock, CommonBlock, HeadingBlock, ImageRefBlock, Lang,
Metadata, Provenance, SourceType, TableBlock, TextBlock, TrustLevel,
WorkspacePath, id_for_block,
AssetId, CodeBlock, CommonBlock, HeadingBlock, ImageRefBlock, Lang, Metadata, Provenance,
SourceType, TableBlock, TextBlock, TrustLevel, WorkspacePath, id_for_block,
};
use time::OffsetDateTime;
@@ -492,12 +468,7 @@ mod tests {
SourceSpan::Line { start, end }
}
fn common_for(
kind: &str,
heading_path: &[String],
ordinal: u32,
s: SourceSpan,
) -> CommonBlock {
fn common_for(kind: &str, heading_path: &[String], ordinal: u32, s: SourceSpan) -> CommonBlock {
CommonBlock {
block_id: id_for_block(&doc_id(), kind, heading_path, ordinal, &s),
heading_path: heading_path.to_vec(),
@@ -532,12 +503,7 @@ mod tests {
})
}
fn paragraph(
text: &str,
heading_path: &[&str],
ordinal: u32,
line: u32,
) -> Block {
fn paragraph(text: &str, heading_path: &[&str], ordinal: u32, line: u32) -> Block {
let hp: Vec<String> = heading_path.iter().map(|s| (*s).into()).collect();
Block::Paragraph(TextBlock {
common: common_for("paragraph", &hp, ordinal, span(line, line)),
@@ -546,12 +512,7 @@ mod tests {
})
}
fn code_block(
code: &str,
heading_path: &[&str],
ordinal: u32,
s: SourceSpan,
) -> Block {
fn code_block(code: &str, heading_path: &[&str], ordinal: u32, s: SourceSpan) -> Block {
let hp: Vec<String> = heading_path.iter().map(|s| (*s).into()).collect();
Block::Code(CodeBlock {
common: common_for("code", &hp, ordinal, s),
@@ -578,12 +539,7 @@ mod tests {
})
}
fn image_ref(
alt: &str,
heading_path: &[&str],
ordinal: u32,
line: u32,
) -> Block {
fn image_ref(alt: &str, heading_path: &[&str], ordinal: u32, line: u32) -> Block {
let hp: Vec<String> = heading_path.iter().map(|s| (*s).into()).collect();
Block::ImageRef(ImageRefBlock {
common: common_for("imageref", &hp, ordinal, span(line, line)),


@@ -53,18 +53,21 @@
//! one chunk per atomic block. PdfPageV1 cannot.
//!
//! Workaround that doesn't change the §4.2 recipe: feed a per-chunk
//! variant `format!("{base_policy_hash}#c{char_start}")` into the
//! recipe's `policy_hash` slot (so distinct chunks distinguish via
//! different policy_hash inputs), while storing the unmodified
//! `base_policy_hash` in `Chunk.policy_hash` so the field still answers
//! "what policy was active". Logged in `tasks/HOTFIXES.md`.
//! variant `format!("{base_policy_hash}#c{segment_start}")` into the
//! recipe's `policy_hash` slot. `segment_start` is the pre-overlap
//! segment boundary, strictly increasing across the returned chunks
//! even when the overlap walk collapses `actual_start` to a previous
//! chunk's `prev_min`. Unmodified `base_policy_hash` is stored in
//! `Chunk.policy_hash` so the field still answers "what policy was
//! active". v1.1 second-iteration patch — logged in
//! `tasks/HOTFIXES.md` (2026-05-27).
use kebab_core::{
Block, BlockId, CanonicalDocument, Chunk, ChunkPolicy, Chunker, ChunkerVersion, DocumentId,
SourceSpan, id_for_chunk,
};
const VERSION_LABEL: &str = "pdf-page-v1";
const VERSION_LABEL: &str = "pdf-page-v1.1";
const BYTES_PER_TOKEN: usize = 3;
const POLICY_HASH_HEX_LEN: usize = 16;
@@ -89,11 +92,7 @@ impl Chunker for PdfPageV1Chunker {
hex[..POLICY_HASH_HEX_LEN].to_string()
}
fn chunk(
&self,
doc: &CanonicalDocument,
policy: &ChunkPolicy,
) -> anyhow::Result<Vec<Chunk>> {
fn chunk(&self, doc: &CanonicalDocument, policy: &ChunkPolicy) -> anyhow::Result<Vec<Chunk>> {
// Validate up front — every block must be a Paragraph carrying
// SourceSpan::Page. A mixed document signals a routing bug in
// the caller (e.g. running this chunker on Markdown) and is
@@ -106,18 +105,13 @@ impl Chunker for PdfPageV1Chunker {
),
};
if !matches!(common.source_span, SourceSpan::Page { .. }) {
anyhow::bail!(
"PdfPageV1Chunker only handles PDF docs (got non-Page source_span)"
);
anyhow::bail!("PdfPageV1Chunker only handles PDF docs (got non-Page source_span)");
}
}
let base_policy_hash = self.policy_hash(policy);
let chunker_version = self.chunker_version();
let target_bytes = policy
.target_tokens
.saturating_mul(BYTES_PER_TOKEN)
.max(1);
let target_bytes = policy.target_tokens.saturating_mul(BYTES_PER_TOKEN).max(1);
// Clamp the overlap to half the target. Without this, a policy
// with `overlap_tokens >= target_tokens` would make every chunk
// fully re-emit the previous chunk's text — mirrors
@@ -146,7 +140,7 @@ impl Chunker for PdfPageV1Chunker {
continue;
}
for (char_start, char_end, slice) in
for (segment_start, char_start, char_end, slice) in
chunk_page(&p.text, target_bytes, overlap_bytes)
{
// PDF chars-per-page comfortably fits in u32 (a single
@@ -154,20 +148,20 @@ impl Chunker for PdfPageV1Chunker {
// typography); silent `as u32` truncation would only
// surface on corrupted input, where an explicit panic
// is preferable to an off-by-2^32 span.
let char_start_u32 = u32::try_from(char_start)
.expect("page chars fit in u32");
let char_end_u32 =
u32::try_from(char_end).expect("page chars fit in u32");
let char_start_u32 = u32::try_from(char_start).expect("page chars fit in u32");
let char_end_u32 = u32::try_from(char_end).expect("page chars fit in u32");
let span = SourceSpan::Page {
page: page_num,
char_start: Some(char_start_u32),
char_end: Some(char_end_u32),
};
let block_ids: Vec<BlockId> = vec![p.common.block_id.clone()];
// Per-chunk policy_hash variant prevents chunk_id
// collision when a page produces multiple chunks. See
// module docs for rationale.
let per_chunk_hash = format!("{base_policy_hash}#c{char_start}");
// v0.20.0 sub-item 1 bugfix (#3): per-chunk policy_hash
// variant uses `segment_start` (pre-overlap boundary,
// strictly increasing) instead of `char_start` (post-
// overlap, may collapse to prev_min). See module docs +
// spec §4.1 root cause + HOTFIXES.md 2026-05-27.
let per_chunk_hash = format!("{base_policy_hash}#c{segment_start}");
let chunk_id =
id_for_chunk(&doc.doc_id, &chunker_version, &block_ids, &per_chunk_hash);
let token_estimate = slice.len().div_ceil(BYTES_PER_TOKEN);
@@ -176,6 +170,7 @@ impl Chunker for PdfPageV1Chunker {
chunk_id,
doc_id: DocumentId(doc.doc_id.0.clone()),
block_ids,
tokenized_korean_text: crate::tokenize_korean_morphological(&slice),
text: slice,
heading_path: Vec::new(),
source_spans: vec![span],
@@ -198,18 +193,28 @@ impl Chunker for PdfPageV1Chunker {
}
/// Split a single page's text into ordered chunks, each represented as
/// `(char_start, char_end, text_slice)`. Char positions are within the
/// page text, suitable for `SourceSpan::Page::char_start` / `char_end`.
/// `(segment_start, actual_start, chunk_end, text_slice)`.
///
/// - `segment_start` = pre-overlap segment boundary. Strictly increasing
/// across the returned vec. Use this for chunk_id uniqueness suffixes.
/// - `actual_start` = post-overlap start char index. May collapse to a
/// previous chunk's `actual_start` under aggressive overlap policy.
/// Use this for `SourceSpan::Page::char_start`.
/// - `chunk_end` = chunk's end char index (exclusive).
///
/// Returns an empty vector when `text` is empty or whitespace-only.
fn chunk_page(text: &str, target_bytes: usize, overlap_bytes: usize) -> Vec<(usize, usize, String)> {
fn chunk_page(
text: &str,
target_bytes: usize,
overlap_bytes: usize,
) -> Vec<(usize, usize, usize, String)> {
let chars: Vec<char> = text.chars().collect();
let n = chars.len();
if n == 0 {
return Vec::new();
}
if text.len() <= target_bytes {
return vec![(0, n, text.to_string())];
return vec![(0, 0, n, text.to_string())];
}
// Build candidate boundary positions (char indices where a chunk
@@ -222,8 +227,7 @@ fn chunk_page(text: &str, target_bytes: usize, overlap_bytes: usize) -> Vec<(usi
let c = chars[k];
let nx = chars[k + 1];
let is_paragraph_break = c == '\n' && nx == '\n';
let is_sentence_end =
matches!(c, '.' | '?' | '!') && nx.is_whitespace();
let is_sentence_end = matches!(c, '.' | '?' | '!') && nx.is_whitespace();
if (is_paragraph_break || is_sentence_end) && k + 2 <= n {
bounds.push(k + 2);
}
@@ -235,11 +239,9 @@ fn chunk_page(text: &str, target_bytes: usize, overlap_bytes: usize) -> Vec<(usi
bounds.dedup();
// UTF-8 byte length of the slice between two char indices.
let byte_len = |a: usize, b: usize| -> usize {
chars[a..b].iter().map(|c| c.len_utf8()).sum()
};
let byte_len = |a: usize, b: usize| -> usize { chars[a..b].iter().map(|c| c.len_utf8()).sum() };
let mut chunks: Vec<(usize, usize, String)> = Vec::new();
let mut chunks: Vec<(usize, usize, usize, String)> = Vec::new();
let mut seg_idx: usize = 0;
while seg_idx + 1 < bounds.len() {
let start = bounds[seg_idx];
@@ -264,7 +266,9 @@ fn chunk_page(text: &str, target_bytes: usize, overlap_bytes: usize) -> Vec<(usi
// have absorbed up to `overlap_bytes` of bytes, but never past
// the previous chunk's start (no full re-emission).
let actual_start = if let Some(prev) = chunks.last() {
let prev_min = prev.0;
// prev tuple shape = (segment_start, actual_start, chunk_end, slice).
// overlap walk floor = previous chunk's actual_start (prev.1).
let prev_min = prev.1;
let mut a = start;
let mut acc_o: usize = 0;
while a > prev_min {
@@ -281,7 +285,7 @@ fn chunk_page(text: &str, target_bytes: usize, overlap_bytes: usize) -> Vec<(usi
};
let slice: String = chars[actual_start..chunk_end].iter().collect();
chunks.push((actual_start, chunk_end, slice));
chunks.push((start, actual_start, chunk_end, slice));
seg_idx = end_idx;
}
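
Reviewer note: the boundary rule used by `chunk_page` (split after a blank-line paragraph break, or after `.`, `?`, `!` followed by whitespace) can be checked in isolation. This is a simplified sketch: it assumes that `0` and the text length are always included as boundaries, which the visible hunks do not show, and `candidate_bounds` is a hypothetical name.

```rust
/// Candidate chunk boundaries as char indices.
/// Assumption: 0 and the total char count are always boundaries.
fn candidate_bounds(text: &str) -> Vec<usize> {
    let chars: Vec<char> = text.chars().collect();
    let n = chars.len();
    let mut bounds = vec![0];
    for k in 0..n.saturating_sub(1) {
        let c = chars[k];
        let nx = chars[k + 1];
        let is_paragraph_break = c == '\n' && nx == '\n';
        let is_sentence_end = matches!(c, '.' | '?' | '!') && nx.is_whitespace();
        if is_paragraph_break || is_sentence_end {
            // The next segment starts after the two-char delimiter.
            bounds.push(k + 2);
        }
    }
    bounds.push(n);
    bounds.sort_unstable();
    bounds.dedup();
    bounds
}

fn main() {
    // Same shape as the regression test in this file: 10 chars, ". ", 500 chars.
    let page = format!("{}. {}", "가".repeat(10), "나".repeat(500));
    println!("{:?}", candidate_bounds(&page));
}
```

For the regression-test input this yields the two segments `[0, 12)` and `[12, 512)` described in the test comment.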
@@ -390,7 +394,11 @@ mod tests {
assert_eq!(c.heading_path, Vec::<String>::new());
assert_eq!(c.source_spans.len(), 1);
match c.source_spans[0] {
SourceSpan::Page { page, char_start, char_end } => {
SourceSpan::Page {
page,
char_start,
char_end,
} => {
assert_eq!(page, (i as u32) + 1);
assert_eq!(char_start, Some(0));
assert!(char_end.unwrap() > 0);
@@ -435,11 +443,16 @@ mod tests {
// N-1's char_end).
for w in chunks.windows(2) {
let prev_end = match w[0].source_spans[0] {
SourceSpan::Page { char_end: Some(e), .. } => e,
SourceSpan::Page {
char_end: Some(e), ..
} => e,
_ => panic!("missing char_end"),
};
let next_start = match w[1].source_spans[0] {
SourceSpan::Page { char_start: Some(s), .. } => s,
SourceSpan::Page {
char_start: Some(s),
..
} => s,
_ => panic!("missing char_start"),
};
assert!(
@@ -653,11 +666,17 @@ mod tests {
// overlap) is the failure mode.
for w in chunks.windows(2) {
let prev_start = match w[0].source_spans[0] {
SourceSpan::Page { char_start: Some(s), .. } => s,
SourceSpan::Page {
char_start: Some(s),
..
} => s,
_ => panic!("missing char_start"),
};
let next_start = match w[1].source_spans[0] {
SourceSpan::Page { char_start: Some(s), .. } => s,
SourceSpan::Page {
char_start: Some(s),
..
} => s,
_ => panic!("missing char_start"),
};
assert!(
@@ -674,6 +693,43 @@ mod tests {
assert_eq!(ids.len(), total, "chunk_ids must remain unique");
}
#[test]
fn multi_chunk_page_with_aggressive_overlap_produces_unique_chunk_ids() {
// Trigger shape seen in Korean OCR text: 10 × "가" + ". " + 500 × "나".
// → first segment [0, 12), second segment [12, n).
// page_text byte_len = 10*3 + 2 + 500*3 = 1532 > target_bytes=1500
// → multi-chunk. overlap_bytes = min(240, 750) = 240 bytes (80 chars)
// → the second chunk's actual_start collapses to prev_min=0 → same `#c0`.
//
// default_policy(500, 80): target_tokens=500 → target_bytes=500*3=1500
// (Korean counted at 3 bytes/char), overlap_tokens=80 → overlap_bytes=min(240, 750)=240.
// Added in response to verifier round 1, item L-3.
let early_seg = "가".repeat(10);
let tail = "나".repeat(500);
let page_text = format!("{early_seg}. {tail}");
let doc = make_pdf_doc(&[&page_text]);
let policy = default_policy(500, 80); // target=1500 byte, overlap=240 byte
let chunks = PdfPageV1Chunker.chunk(&doc, &policy).unwrap();
assert!(
chunks.len() >= 2,
"expected ≥2 chunks for {} byte page; got {}",
page_text.len(),
chunks.len()
);
let mut ids: Vec<&str> = chunks.iter().map(|c| c.chunk_id.0.as_str()).collect();
ids.sort_unstable();
let total = ids.len();
ids.dedup();
assert_eq!(
ids.len(),
total,
"all chunk_ids must be unique even when overlap walks actual_start back to prev_min"
);
}
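
Reviewer note: the collision this test guards against can be reproduced with nothing but the suffix format string. The sketch below uses `(segment_start, actual_start)` pairs as a simplified stand-in for the tuples returned by `chunk_page`. `suffixes` and `all_unique` are hypothetical helpers, and the real `id_for_chunk` hashing is not modelled.

```rust
use std::collections::HashSet;

/// Builds the per-chunk policy-hash variant from either the pre-overlap
/// `segment_start` (v1.1 behaviour) or the post-overlap `actual_start`
/// (previous behaviour).
fn suffixes(base: &str, starts: &[(usize, usize)], use_segment_start: bool) -> Vec<String> {
    starts
        .iter()
        .map(|&(segment_start, actual_start)| {
            let key = if use_segment_start { segment_start } else { actual_start };
            format!("{base}#c{key}")
        })
        .collect()
}

/// True when no two entries are equal.
fn all_unique(items: &[String]) -> bool {
    let set: HashSet<&String> = items.iter().collect();
    set.len() == items.len()
}

fn main() {
    // Second chunk: its segment starts at char 12, but the overlap walk
    // pulls actual_start all the way back to 0.
    let starts = [(0, 0), (12, 0)];
    println!("old keys unique: {}", all_unique(&suffixes("abcd", &starts, false)));
    println!("new keys unique: {}", all_unique(&suffixes("abcd", &starts, true)));
}
```

Keying on `segment_start` works because segment boundaries are strictly increasing, while `actual_start` is only bounded below by the previous chunk's start.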
#[test]
fn policy_hash_matches_md_heading_v1_for_identical_policy() {
// Cross-chunker policy fingerprint identity — important so a


@@ -113,7 +113,14 @@ pub(crate) fn build_chunk(
symbol: Some(symbol.to_string()),
lang: Some(lang.to_string()),
};
build_chunk_from_span(doc, chunker_version, base_policy_hash, text, span, split_key)
build_chunk_from_span(
doc,
chunker_version,
base_policy_hash,
text,
span,
split_key,
)
}
/// Like `build_chunk` but emits `symbol: None`. Used by Tier 3 (per spec §9.3).
@@ -182,6 +189,7 @@ fn build_chunk_from_span(
chunk_id,
doc_id: DocumentId(doc.doc_id.0.clone()),
block_ids,
tokenized_korean_text: crate::tokenize_korean_morphological(text),
text: text.to_string(),
heading_path: Vec::new(),
source_spans: vec![span],


@@ -13,9 +13,9 @@ use std::path::PathBuf;
use kebab_chunk::CodeCAstV1Chunker;
use kebab_core::{
AssetId, Block, CanonicalDocument, ChunkPolicy, Chunker, ChunkerVersion, CodeBlock, CommonBlock,
Lang, Metadata, ParserVersion, Provenance, SourceSpan, SourceType, TrustLevel, WorkspacePath,
id_for_block, id_for_doc,
AssetId, Block, CanonicalDocument, ChunkPolicy, Chunker, ChunkerVersion, CodeBlock,
CommonBlock, Lang, Metadata, ParserVersion, Provenance, SourceSpan, SourceType, TrustLevel,
WorkspacePath, id_for_block, id_for_doc,
};
use serde_json::Value;
use time::OffsetDateTime;


@@ -15,9 +15,9 @@ use std::path::PathBuf;
use kebab_chunk::CodeCppAstV1Chunker;
use kebab_core::{
AssetId, Block, CanonicalDocument, ChunkPolicy, Chunker, ChunkerVersion, CodeBlock, CommonBlock,
Lang, Metadata, ParserVersion, Provenance, SourceSpan, SourceType, TrustLevel, WorkspacePath,
id_for_block, id_for_doc,
AssetId, Block, CanonicalDocument, ChunkPolicy, Chunker, ChunkerVersion, CodeBlock,
CommonBlock, Lang, Metadata, ParserVersion, Provenance, SourceSpan, SourceType, TrustLevel,
WorkspacePath, id_for_block, id_for_doc,
};
use kebab_parse_code::CppAstExtractor;
use serde_json::Value;
@@ -171,7 +171,9 @@ fn extract_cpp_fixture() -> CanonicalDocument {
workspace_root: &root,
config: &cfg,
};
CppAstExtractor::new().extract(&ctx, src.as_bytes()).unwrap()
CppAstExtractor::new()
.extract(&ctx, src.as_bytes())
.unwrap()
}
// ---------------------------------------------------------------------------
@@ -261,43 +263,61 @@ fn code_cpp_ast_extractor_snapshot() {
let doc = extract_cpp_fixture();
// Verify the extractor emits all expected named units.
let block_syms: Vec<Option<String>> = doc.blocks.iter().filter_map(|b| match b {
Block::Code(c) => match &c.common.source_span {
SourceSpan::Code { symbol, .. } => Some(symbol.clone()),
let block_syms: Vec<Option<String>> = doc
.blocks
.iter()
.filter_map(|b| match b {
Block::Code(c) => match &c.common.source_span {
SourceSpan::Code { symbol, .. } => Some(symbol.clone()),
_ => None,
},
_ => None,
},
_ => None,
}).collect();
})
.collect();
// Must include namespace-qualified class and its methods
assert!(
block_syms.iter().any(|s| s.as_deref() == Some("kebab::chunk::MdHeadingV1Chunker")),
block_syms
.iter()
.any(|s| s.as_deref() == Some("kebab::chunk::MdHeadingV1Chunker")),
"class unit missing: {block_syms:?}"
);
assert!(
block_syms.iter().any(|s| s.as_deref() == Some("kebab::chunk::MdHeadingV1Chunker::MdHeadingV1Chunker")),
block_syms
.iter()
.any(|s| s.as_deref() == Some("kebab::chunk::MdHeadingV1Chunker::MdHeadingV1Chunker")),
"ctor unit missing: {block_syms:?}"
);
assert!(
block_syms.iter().any(|s| s.as_deref() == Some("kebab::chunk::MdHeadingV1Chunker::~MdHeadingV1Chunker")),
block_syms
.iter()
.any(|s| s.as_deref() == Some("kebab::chunk::MdHeadingV1Chunker::~MdHeadingV1Chunker")),
"dtor unit missing: {block_syms:?}"
);
assert!(
block_syms.iter().any(|s| s.as_deref() == Some("kebab::chunk::MdHeadingV1Chunker::chunk_doc")),
block_syms
.iter()
.any(|s| s.as_deref() == Some("kebab::chunk::MdHeadingV1Chunker::chunk_doc")),
"chunk_doc unit missing: {block_syms:?}"
);
assert!(
block_syms.iter().any(|s| s.as_deref() == Some("kebab::chunk::MdHeadingV1Chunker::operator()")),
block_syms
.iter()
.any(|s| s.as_deref() == Some("kebab::chunk::MdHeadingV1Chunker::operator()")),
"operator() unit missing: {block_syms:?}"
);
// Template function (inside kebab::chunk namespace in the fixture)
assert!(
block_syms.iter().any(|s| s.as_deref() == Some("kebab::chunk::identity")),
block_syms
.iter()
.any(|s| s.as_deref() == Some("kebab::chunk::identity")),
"identity template fn unit missing: {block_syms:?}"
);
// Free function in outer namespace
assert!(
block_syms.iter().any(|s| s.as_deref() == Some("kebab::global_helper")),
block_syms
.iter()
.any(|s| s.as_deref() == Some("kebab::global_helper")),
"global_helper unit missing: {block_syms:?}"
);
// Global main
@@ -312,14 +332,23 @@ fn code_cpp_ast_extractor_snapshot() {
fn code_cpp_ast_extractor_chunks_deterministic() {
let doc1 = extract_cpp_fixture();
let doc2 = extract_cpp_fixture();
assert_eq!(doc1.blocks, doc2.blocks, "extractor output non-deterministic");
assert_eq!(
doc1.blocks, doc2.blocks,
"extractor output non-deterministic"
);
let policy = fixed_policy();
let chunks1 = CodeCppAstV1Chunker.chunk(&doc1, &policy).unwrap();
let chunks2 = CodeCppAstV1Chunker.chunk(&doc2, &policy).unwrap();
assert_eq!(
chunks1.iter().map(|c| c.chunk_id.0.clone()).collect::<Vec<_>>(),
chunks2.iter().map(|c| c.chunk_id.0.clone()).collect::<Vec<_>>(),
chunks1
.iter()
.map(|c| c.chunk_id.0.clone())
.collect::<Vec<_>>(),
chunks2
.iter()
.map(|c| c.chunk_id.0.clone())
.collect::<Vec<_>>(),
"chunker output non-deterministic"
);
}


@@ -13,9 +13,9 @@ use std::path::PathBuf;
use kebab_chunk::CodeGoAstV1Chunker;
use kebab_core::{
AssetId, Block, CanonicalDocument, ChunkPolicy, Chunker, ChunkerVersion, CodeBlock, CommonBlock,
Lang, Metadata, ParserVersion, Provenance, SourceSpan, SourceType, TrustLevel, WorkspacePath,
id_for_block, id_for_doc,
AssetId, Block, CanonicalDocument, ChunkPolicy, Chunker, ChunkerVersion, CodeBlock,
CommonBlock, Lang, Metadata, ParserVersion, Provenance, SourceSpan, SourceType, TrustLevel,
WorkspacePath, id_for_block, id_for_doc,
};
use serde_json::Value;
use time::OffsetDateTime;

@@ -13,9 +13,9 @@ use std::path::PathBuf;
use kebab_chunk::CodeJavaAstV1Chunker;
use kebab_core::{
AssetId, Block, CanonicalDocument, ChunkPolicy, Chunker, ChunkerVersion, CodeBlock, CommonBlock,
Lang, Metadata, ParserVersion, Provenance, SourceSpan, SourceType, TrustLevel, WorkspacePath,
id_for_block, id_for_doc,
AssetId, Block, CanonicalDocument, ChunkPolicy, Chunker, ChunkerVersion, CodeBlock,
CommonBlock, Lang, Metadata, ParserVersion, Provenance, SourceSpan, SourceType, TrustLevel,
WorkspacePath, id_for_block, id_for_doc,
};
use serde_json::Value;
use time::OffsetDateTime;

@@ -13,9 +13,9 @@ use std::path::PathBuf;
use kebab_chunk::CodeJsAstV1Chunker;
use kebab_core::{
AssetId, Block, CanonicalDocument, ChunkPolicy, Chunker, ChunkerVersion, CodeBlock, CommonBlock,
Lang, Metadata, ParserVersion, Provenance, SourceSpan, SourceType, TrustLevel, WorkspacePath,
id_for_block, id_for_doc,
AssetId, Block, CanonicalDocument, ChunkPolicy, Chunker, ChunkerVersion, CodeBlock,
CommonBlock, Lang, Metadata, ParserVersion, Provenance, SourceSpan, SourceType, TrustLevel,
WorkspacePath, id_for_block, id_for_doc,
};
use serde_json::Value;
use time::OffsetDateTime;

@@ -13,9 +13,9 @@ use std::path::PathBuf;
use kebab_chunk::CodeKotlinAstV1Chunker;
use kebab_core::{
AssetId, Block, CanonicalDocument, ChunkPolicy, Chunker, ChunkerVersion, CodeBlock, CommonBlock,
Lang, Metadata, ParserVersion, Provenance, SourceSpan, SourceType, TrustLevel, WorkspacePath,
id_for_block, id_for_doc,
AssetId, Block, CanonicalDocument, ChunkPolicy, Chunker, ChunkerVersion, CodeBlock,
CommonBlock, Lang, Metadata, ParserVersion, Provenance, SourceSpan, SourceType, TrustLevel,
WorkspacePath, id_for_block, id_for_doc,
};
use serde_json::Value;
use time::OffsetDateTime;

@@ -13,9 +13,9 @@ use std::path::PathBuf;
use kebab_chunk::CodePythonAstV1Chunker;
use kebab_core::{
AssetId, Block, CanonicalDocument, ChunkPolicy, Chunker, ChunkerVersion, CodeBlock, CommonBlock,
Lang, Metadata, ParserVersion, Provenance, SourceSpan, SourceType, TrustLevel, WorkspacePath,
id_for_block, id_for_doc,
AssetId, Block, CanonicalDocument, ChunkPolicy, Chunker, ChunkerVersion, CodeBlock,
CommonBlock, Lang, Metadata, ParserVersion, Provenance, SourceSpan, SourceType, TrustLevel,
WorkspacePath, id_for_block, id_for_doc,
};
use serde_json::Value;
use time::OffsetDateTime;

@@ -13,9 +13,9 @@ use std::path::PathBuf;
use kebab_chunk::CodeRustAstV1Chunker;
use kebab_core::{
AssetId, Block, CanonicalDocument, ChunkPolicy, Chunker, ChunkerVersion, CodeBlock, CommonBlock,
Lang, Metadata, ParserVersion, Provenance, SourceSpan, SourceType, TrustLevel, WorkspacePath,
id_for_block, id_for_doc,
AssetId, Block, CanonicalDocument, ChunkPolicy, Chunker, ChunkerVersion, CodeBlock,
CommonBlock, Lang, Metadata, ParserVersion, Provenance, SourceSpan, SourceType, TrustLevel,
WorkspacePath, id_for_block, id_for_doc,
};
use serde_json::Value;
use time::OffsetDateTime;

@@ -13,9 +13,9 @@ use std::path::PathBuf;
use kebab_chunk::CodeTsAstV1Chunker;
use kebab_core::{
AssetId, Block, CanonicalDocument, ChunkPolicy, Chunker, ChunkerVersion, CodeBlock, CommonBlock,
Lang, Metadata, ParserVersion, Provenance, SourceSpan, SourceType, TrustLevel, WorkspacePath,
id_for_block, id_for_doc,
AssetId, Block, CanonicalDocument, ChunkPolicy, Chunker, ChunkerVersion, CodeBlock,
CommonBlock, Lang, Metadata, ParserVersion, Provenance, SourceSpan, SourceType, TrustLevel,
WorkspacePath, id_for_block, id_for_doc,
};
use serde_json::Value;
use time::OffsetDateTime;

@@ -124,7 +124,11 @@ fn dockerfile_emits_single_chunk() {
Some("<dockerfile>"),
"symbol must be '<dockerfile>'"
);
assert_eq!(lang.as_deref(), Some("dockerfile"), "lang must be 'dockerfile'");
assert_eq!(
lang.as_deref(),
Some("dockerfile"),
"lang must be 'dockerfile'"
);
}
other => panic!("expected SourceSpan::Code, got {other:?}"),
}

@@ -18,7 +18,8 @@
}
],
"text": "#include <stdio.h>\n#include <stdlib.h>\n\n#define MAX_BUF 4096\n\ntypedef enum {\n OK = 0,\n ERR_PARSE,\n ERR_IO,\n} status_t;\n\ntypedef struct {\n int id;\n char name[64];\n status_t status;\n} record_t;\n\nstatic int counter = 0;",
"token_estimate": 78
"token_estimate": 78,
"tokenized_korean_text": "# include < stdio . h > # include < stdlib . h > # define MAX _ BUF 4096 typedef enum { OK = 0 , ERR _ PARSE , ERR _ IO , } status _ t ; typedef struct { int id ; char name [ 64 ]; status _ t status ; } record _ t ; static int counter = 0 ;"
},
{
"block_ids": [
@@ -39,7 +40,8 @@
}
],
"text": "int parse_record(const char *line, record_t *out) {\n if (line == NULL || out == NULL) return ERR_PARSE;\n return OK;\n}",
"token_estimate": 41
"token_estimate": 41,
"tokenized_korean_text": "int parse _ record ( const char * line , record _ t * out ) { if ( line == NULL || out == NULL ) return ERR _ PARSE ; return OK ; }"
},
{
"block_ids": [
@@ -60,7 +62,8 @@
}
],
"text": "void print_record(const record_t *r) {\n printf(\"[%d] %s (status=%d)\\n\", r->id, r->name, r->status);\n}",
"token_estimate": 35
"token_estimate": 35,
"tokenized_korean_text": "void print _ record ( const record _ t * r ) { printf (\"[% d ] % s ( status =% d )\\ n \", r -> id , r -> name , r -> status ); }"
},
{
"block_ids": [
@@ -81,6 +84,7 @@
}
],
"text": "int main(void) {\n record_t r = { .id = 1, .name = \"foo\", .status = OK };\n print_record(&r);\n return 0;\n}",
"token_estimate": 38
"token_estimate": 38,
"tokenized_korean_text": "int main ( void ) { record _ t r = { . id = 1 , . name = \" foo \", . status = OK }; print _ record (& r ); return 0 ; }"
}
]

File diff suppressed because one or more lines are too long

@@ -18,7 +18,8 @@
}
],
"text": "#include <string>\n#include <vector>\n\nnamespace kebab {",
"token_estimate": 18
"token_estimate": 18,
"tokenized_korean_text": "# include < string > # include < vector > namespace kebab {"
},
{
"block_ids": [
@@ -39,7 +40,8 @@
}
],
"text": "class MdHeadingV1Chunker {\npublic:\n MdHeadingV1Chunker() = default;\n ~MdHeadingV1Chunker() = default;\n\n std::string chunk_doc(const std::string& doc) {\n return doc;\n }\n\n int operator()(int x) const {\n return x * 2;\n }\n\nprivate:\n int counter_ = 0;\n};",
"token_estimate": 95
"token_estimate": 95,
"tokenized_korean_text": "class MdHeadingV 1 Chunker { public : MdHeadingV 1 Chunker ( ) = default ; ~ MdHeadingV 1 Chunker ( ) = default ; std : : string chunk _ doc ( const std : : string & doc ) { return doc ; } int operator ( ) ( int x ) const { return x * 2 ; } private : int counter _ = 0 ; };"
},
{
"block_ids": [
@@ -60,7 +62,8 @@
}
],
"text": "template <typename T>\nT identity(T value) {\n return value;\n}",
"token_estimate": 21
"token_estimate": 21,
"tokenized_korean_text": "template < typename T > T identity ( T value ) { return value ; }"
},
{
"block_ids": [
@@ -81,7 +84,8 @@
}
],
"text": "void global_helper() {\n // free function in kebab namespace\n}",
"token_estimate": 22
"token_estimate": 22,
"tokenized_korean_text": "void global _ helper ( ) { / / free function in kebab namespace }"
},
{
"block_ids": [
@@ -102,6 +106,7 @@
}
],
"text": "int main() {\n kebab::chunk::MdHeadingV1Chunker c;\n return 0;\n}",
"token_estimate": 23
"token_estimate": 23,
"tokenized_korean_text": "int main ( ) { kebab : : chunk : : MdHeadingV 1 Chunker c ; return 0 ; }"
}
]

@@ -18,7 +18,8 @@
}
],
"text": "import (\n\t\"fmt\"\n\t\"os\"\n\t\"strings\"\n)",
"token_estimate": 12
"token_estimate": 12,
"tokenized_korean_text": "import ( \" fmt \" \" os \" \" strings \" )"
},
{
"block_ids": [
@@ -39,7 +40,8 @@
}
],
"text": "func ComputeMRR(scores []float64) float64 {\n\tif len(scores) == 0 {\n\t\treturn 0.0\n\t}\n\t_ = fmt.Sprintf(\"%v\", scores)\n\treturn 1.0 / float64(len(scores))\n}",
"token_estimate": 50
"token_estimate": 50,
"tokenized_korean_text": "func ComputeMRR ( scores [ ] float 64 ) float 64 { if len ( scores ) == 0 { return 0 . 0 } _ = fmt . Sprintf (\"% v \", scores ) return 1 . 0 / float 64 ( len ( scores ) ) }"
},
{
"block_ids": [
@@ -60,7 +62,8 @@
}
],
"text": "type MetricsCollector struct {\n\tScores []float64\n\tLabels []string\n\tCounts map[string]int\n\tTotals map[string]float64\n\tTags []string\n}",
"token_estimate": 45
"token_estimate": 45,
"tokenized_korean_text": "type MetricsCollector struct { Scores [ ] float 64 Labels [ ] string Counts map [ string ] int Totals map [ string ] float 64 Tags [ ] string }"
},
{
"block_ids": [
@@ -81,7 +84,8 @@
}
],
"text": "type BaseEvaluator struct {\n\tName string\n}\n\nfunc (e *BaseEvaluator) Evaluate(data []string) error {\n\t_ = os.Stderr\n\t_ = strings.Join(data, \",\")\n\treturn nil\n}",
"token_estimate": 53
"token_estimate": 53,
"tokenized_korean_text": "type BaseEvaluator struct { Name string } func ( e * BaseEvaluator ) Evaluate ( data [ ] string ) error { _ = os . Stderr _ = strings . Join ( data , \",\") return nil }"
},
{
"block_ids": [
@@ -102,7 +106,8 @@
}
],
"text": "func (m *MetricsCollector) Run(inputs []float64) {\n\tfor _, inp := range inputs {\n\t\tm.Scores = append(\n\t\t\tm.Scores,\n\t\t\tinp,\n\t\t)\n\t}\n}",
"token_estimate": 44
"token_estimate": 44,
"tokenized_korean_text": "func ( m * MetricsCollector ) Run ( inputs [ ] float 64 ) { for _, inp := range inputs { m . Scores = append ( m . Scores , inp , ) } }"
},
{
"block_ids": [
@@ -123,7 +128,8 @@
}
],
"text": "func (m *MetricsCollector) Report() map[string]interface{} {\n\treturn map[string]interface{}{\n\t\t\"mean\": 0.0,\n\t\t\"count\": len(m.Scores),\n\t\t\"tags\": m.Tags,\n\t}\n}",
"token_estimate": 53
"token_estimate": 53,
"tokenized_korean_text": "func ( m * MetricsCollector ) Report ( ) map [ string ] interface {} { return map [ string ] interface {}{ \" mean \": 0 . 0 , \" count \": len ( m . Scores ) , \" tags \": m . Tags , } }"
},
{
"block_ids": [
@@ -144,7 +150,8 @@
}
],
"text": "func BigCompute(data []int) int {\n\tv0 := 0\n\tif 0 < len(data) {\n\t\tv0 = data[0]\n\t}\n\tv1 := 0\n\tif 1 < len(data) {\n\t\tv1 = data[1]\n\t}\n\tv2 := 0\n\tif 2 < len(data) {\n\t\tv2 = data[2]\n\t}\n\tv3 := 0\n\tif 3 < len(data) {\n\t\tv3 = data[3]\n\t}\n\tv4 := 0\n\tif 4 < len(data) {\n\t\tv4 = data[4]\n\t}\n\tv5 := 0\n\tif 5 < len(data) {\n\t\tv5 = data[5]\n\t}\n\tv6 := 0\n\tif 6 < len(data) {\n\t\tv6 = data[6]\n\t}\n\tv7 := 0\n\tif 7 < len(data) {\n\t\tv7 = data[7]\n\t}\n\tv8 := 0\n\tif 8 < len(data) {\n\t\tv8 = data[8]\n\t}\n\tv9 := 0\n\tif 9 < len(data) {\n\t\tv9 = data[9]\n\t}\n\tv10 := 0\n\tif 10 < len(data) {\n\t\tv10 = data[10]\n\t}\n\tv11 := 0\n\tif 11 < len(data) {\n\t\tv11 = data[11]\n\t}\n\tv12 := 0\n\tif 12 < len(data) {\n\t\tv12 = data[12]\n\t}\n\tv13 := 0\n\tif 13 < len(data) {\n\t\tv13 = data[13]\n\t}\n\tv14 := 0\n\tif 14 < len(data) {\n\t\tv14 = data[14]\n\t}\n\tv15 := 0\n\tif 15 < len(data) {\n\t\tv15 = data[15]\n\t}\n\tv16 := 0\n\tif 16 < len(data) {\n\t\tv16 = data[16]\n\t}\n\tv17 := 0\n\tif 17 < len(data) {\n\t\tv17 = data[17]\n\t}\n\tv18 := 0\n\tif 18 < len(data) {\n\t\tv18 = data[18]\n\t}\n\tv19 := 0\n\tif 19 < len(data) {\n\t\tv19 = data[19]\n\t}\n\tv20 := 0\n\tif 20 < len(data) {\n\t\tv20 = data[20]\n\t}\n\tv21 := 0\n\tif 21 < len(data) {\n\t\tv21 = data[21]\n\t}\n\tv22 := 0\n\tif 22 < len(data) {\n\t\tv22 = data[22]\n\t}\n\tv23 := 0\n\tif 23 < len(data) {\n\t\tv23 = data[23]\n\t}\n\tv24 := 0\n\tif 24 < len(data) {\n\t\tv24 = data[24]\n\t}\n\tv25 := 0\n\tif 25 < len(data) {\n\t\tv25 = data[25]\n\t}\n\tv26 := 0\n\tif 26 < len(data) {\n\t\tv26 = data[26]\n\t}\n\tv27 := 0\n\tif 27 < len(data) {\n\t\tv27 = data[27]\n\t}\n\tv28 := 0\n\tif 28 < len(data) {\n\t\tv28 = data[28]\n\t}\n\tv29 := 0\n\tif 29 < len(data) {\n\t\tv29 = data[29]\n\t}\n\tv30 := 0\n\tif 30 < len(data) {\n\t\tv30 = data[30]\n\t}\n\tv31 := 0\n\tif 31 < len(data) {\n\t\tv31 = data[31]\n\t}\n\tv32 := 0\n\tif 32 < len(data) {\n\t\tv32 = data[32]\n\t}\n\tv33 := 
0\n\tif 33 < len(data) {\n\t\tv33 = data[33]\n\t}\n\tv34 := 0\n\tif 34 < len(data) {\n\t\tv34 = data[34]\n\t}\n\tv35 := 0\n\tif 35 < len(data) {\n\t\tv35 = data[35]\n\t}\n\tv36 := 0\n\tif 36 < len(data) {\n\t\tv36 = data[36]\n\t}\n\tv37 := 0\n\tif 37 < len(data) {\n\t\tv37 = data[37]\n\t}\n\tv38 := 0\n\tif 38 < len(data) {\n\t\tv38 = data[38]\n\t}\n\tv39 := 0\n\tif 39 < len(data) {\n\t\tv39 = data[39]\n\t}\n\tv40 := 0\n\tif 40 < len(data) {\n\t\tv40 = data[40]\n\t}\n\tv41 := 0\n\tif 41 < len(data) {\n\t\tv41 = data[41]\n\t}\n\tv42 := 0\n\tif 42 < len(data) {\n\t\tv42 = data[42]\n\t}\n\tv43 := 0\n\tif 43 < len(data) {\n\t\tv43 = data[43]\n\t}\n\tv44 := 0\n\tif 44 < len(data) {\n\t\tv44 = data[44]\n\t}\n\tv45 := 0\n\tif 45 < len(data) {\n\t\tv45 = data[45]\n\t}\n\tv46 := 0\n\tif 46 < len(data) {\n\t\tv46 = data[46]\n\t}\n\tv47 := 0\n\tif 47 < len(data) {\n\t\tv47 = data[47]\n\t}\n\tv48 := 0\n\tif 48 < len(data) {\n\t\tv48 = data[48]\n\t}\n\tv49 := 0\n\tif 49 < len(data) {\n\t\tv49 = data[49]",
"token_estimate": 847
"token_estimate": 847,
"tokenized_korean_text": "func BigCompute ( data [ ] int ) int { v 0 := 0 if 0 < len ( data ) { v 0 = data [ 0 ] } v 1 := 0 if 1 < len ( data ) { v 1 = data [ 1 ] } v 2 := 0 if 2 < len ( data ) { v 2 = data [ 2 ] } v 3 := 0 if 3 < len ( data ) { v 3 = data [ 3 ] } v 4 := 0 if 4 < len ( data ) { v 4 = data [ 4 ] } v 5 := 0 if 5 < len ( data ) { v 5 = data [ 5 ] } v 6 := 0 if 6 < len ( data ) { v 6 = data [ 6 ] } v 7 := 0 if 7 < len ( data ) { v 7 = data [ 7 ] } v 8 := 0 if 8 < len ( data ) { v 8 = data [ 8 ] } v 9 := 0 if 9 < len ( data ) { v 9 = data [ 9 ] } v 10 := 0 if 10 < len ( data ) { v 10 = data [ 10 ] } v 11 := 0 if 11 < len ( data ) { v 11 = data [ 11 ] } v 12 := 0 if 12 < len ( data ) { v 12 = data [ 12 ] } v 13 := 0 if 13 < len ( data ) { v 13 = data [ 13 ] } v 14 := 0 if 14 < len ( data ) { v 14 = data [ 14 ] } v 15 := 0 if 15 < len ( data ) { v 15 = data [ 15 ] } v 16 := 0 if 16 < len ( data ) { v 16 = data [ 16 ] } v 17 := 0 if 17 < len ( data ) { v 17 = data [ 17 ] } v 18 := 0 if 18 < len ( data ) { v 18 = data [ 18 ] } v 19 := 0 if 19 < len ( data ) { v 19 = data [ 19 ] } v 20 := 0 if 20 < len ( data ) { v 20 = data [ 20 ] } v 21 := 0 if 21 < len ( data ) { v 21 = data [ 21 ] } v 22 := 0 if 22 < len ( data ) { v 22 = data [ 22 ] } v 23 := 0 if 23 < len ( data ) { v 23 = data [ 23 ] } v 24 := 0 if 24 < len ( data ) { v 24 = data [ 24 ] } v 25 := 0 if 25 < len ( data ) { v 25 = data [ 25 ] } v 26 := 0 if 26 < len ( data ) { v 26 = data [ 26 ] } v 27 := 0 if 27 < len ( data ) { v 27 = data [ 27 ] } v 28 := 0 if 28 < len ( data ) { v 28 = data [ 28 ] } v 29 := 0 if 29 < len ( data ) { v 29 = data [ 29 ] } v 30 := 0 if 30 < len ( data ) { v 30 = data [ 30 ] } v 31 := 0 if 31 < len ( data ) { v 31 = data [ 31 ] } v 32 := 0 if 32 < len ( data ) { v 32 = data [ 32 ] } v 33 := 0 if 33 < len ( data ) { v 33 = data [ 33 ] } v 34 := 0 if 34 < len ( data ) { v 34 = data [ 34 ] } v 35 := 0 if 35 < len ( data ) { v 35 = data [ 35 ] } v 36 := 0 if 36 < len ( data ) 
{ v 36 = data [ 36 ] } v 37 := 0 if 37 < len ( data ) { v 37 = data [ 37 ] } v 38 := 0 if 38 < len ( data ) { v 38 = data [ 38 ] } v 39 := 0 if 39 < len ( data ) { v 39 = data [ 39 ] } v 40 := 0 if 40 < len ( data ) { v 40 = data [ 40 ] } v 41 := 0 if 41 < len ( data ) { v 41 = data [ 41 ] } v 42 := 0 if 42 < len ( data ) { v 42 = data [ 42 ] } v 43 := 0 if 43 < len ( data ) { v 43 = data [ 43 ] } v 44 := 0 if 44 < len ( data ) { v 44 = data [ 44 ] } v 45 := 0 if 45 < len ( data ) { v 45 = data [ 45 ] } v 46 := 0 if 46 < len ( data ) { v 46 = data [ 46 ] } v 47 := 0 if 47 < len ( data ) { v 47 = data [ 47 ] } v 48 := 0 if 48 < len ( data ) { v 48 = data [ 48 ] } v 49 := 0 if 49 < len ( data ) { v 49 = data [ 49 ]"
},
{
"block_ids": [
@@ -165,7 +172,8 @@
}
],
"text": "\t}\n\tv50 := 0\n\tif 50 < len(data) {\n\t\tv50 = data[50]\n\t}\n\tv51 := 0\n\tif 51 < len(data) {\n\t\tv51 = data[51]\n\t}\n\tv52 := 0\n\tif 52 < len(data) {\n\t\tv52 = data[52]\n\t}\n\tv53 := 0\n\tif 53 < len(data) {\n\t\tv53 = data[53]\n\t}\n\tv54 := 0\n\tif 54 < len(data) {\n\t\tv54 = data[54]\n\t}\n\tv55 := 0\n\tif 55 < len(data) {\n\t\tv55 = data[55]\n\t}\n\tv56 := 0\n\tif 56 < len(data) {\n\t\tv56 = data[56]\n\t}\n\tv57 := 0\n\tif 57 < len(data) {\n\t\tv57 = data[57]\n\t}\n\tv58 := 0\n\tif 58 < len(data) {\n\t\tv58 = data[58]\n\t}\n\tv59 := 0\n\tif 59 < len(data) {\n\t\tv59 = data[59]\n\t}\n\tv60 := 0\n\tif 60 < len(data) {\n\t\tv60 = data[60]\n\t}\n\tv61 := 0\n\tif 61 < len(data) {\n\t\tv61 = data[61]\n\t}\n\tv62 := 0\n\tif 62 < len(data) {\n\t\tv62 = data[62]\n\t}\n\tv63 := 0\n\tif 63 < len(data) {\n\t\tv63 = data[63]\n\t}\n\tv64 := 0\n\tif 64 < len(data) {\n\t\tv64 = data[64]\n\t}\n\tv65 := 0\n\tif 65 < len(data) {\n\t\tv65 = data[65]\n\t}\n\tv66 := 0\n\tif 66 < len(data) {\n\t\tv66 = data[66]\n\t}\n\tv67 := 0\n\tif 67 < len(data) {\n\t\tv67 = data[67]\n\t}\n\tv68 := 0\n\tif 68 < len(data) {\n\t\tv68 = data[68]\n\t}\n\tv69 := 0\n\tif 69 < len(data) {\n\t\tv69 = data[69]\n\t}\n\tv70 := 0\n\tif 70 < len(data) {\n\t\tv70 = data[70]\n\t}\n\tv71 := 0\n\tif 71 < len(data) {\n\t\tv71 = data[71]\n\t}\n\tv72 := 0\n\tif 72 < len(data) {\n\t\tv72 = data[72]\n\t}\n\tv73 := 0\n\tif 73 < len(data) {\n\t\tv73 = data[73]\n\t}\n\tv74 := 0\n\tif 74 < len(data) {\n\t\tv74 = data[74]\n\t}\n\tv75 := 0\n\tif 75 < len(data) {\n\t\tv75 = data[75]\n\t}\n\tv76 := 0\n\tif 76 < len(data) {\n\t\tv76 = data[76]\n\t}\n\tv77 := 0\n\tif 77 < len(data) {\n\t\tv77 = data[77]\n\t}\n\tv78 := 0\n\tif 78 < len(data) {\n\t\tv78 = data[78]\n\t}\n\tv79 := 0\n\tif 79 < len(data) {\n\t\tv79 = data[79]\n\t}\n\tv80 := 0\n\tif 80 < len(data) {\n\t\tv80 = data[80]\n\t}\n\tv81 := 0\n\tif 81 < len(data) {\n\t\tv81 = data[81]\n\t}\n\tv82 := 0\n\tif 82 < len(data) {\n\t\tv82 = data[82]\n\t}\n\tv83 
:= 0\n\tif 83 < len(data) {\n\t\tv83 = data[83]\n\t}\n\tv84 := 0\n\tif 84 < len(data) {\n\t\tv84 = data[84]\n\t}\n\tv85 := 0\n\tif 85 < len(data) {\n\t\tv85 = data[85]\n\t}\n\tv86 := 0\n\tif 86 < len(data) {\n\t\tv86 = data[86]\n\t}\n\tv87 := 0\n\tif 87 < len(data) {\n\t\tv87 = data[87]\n\t}\n\tv88 := 0\n\tif 88 < len(data) {\n\t\tv88 = data[88]\n\t}\n\tv89 := 0\n\tif 89 < len(data) {\n\t\tv89 = data[89]\n\t}\n\tv90 := 0\n\tif 90 < len(data) {\n\t\tv90 = data[90]\n\t}\n\tv91 := 0\n\tif 91 < len(data) {\n\t\tv91 = data[91]\n\t}\n\tv92 := 0\n\tif 92 < len(data) {\n\t\tv92 = data[92]\n\t}\n\tv93 := 0\n\tif 93 < len(data) {\n\t\tv93 = data[93]\n\t}\n\tv94 := 0\n\tif 94 < len(data) {\n\t\tv94 = data[94]\n\t}\n\tv95 := 0\n\tif 95 < len(data) {\n\t\tv95 = data[95]\n\t}\n\tv96 := 0\n\tif 96 < len(data) {\n\t\tv96 = data[96]\n\t}\n\tv97 := 0\n\tif 97 < len(data) {\n\t\tv97 = data[97]\n\t}\n\tv98 := 0\n\tif 98 < len(data) {\n\t\tv98 = data[98]\n\t}\n\tv99 := 0\n\tif 99 < len(data) {\n\t\tv99 = data[99]",
"token_estimate": 850
"token_estimate": 850,
"tokenized_korean_text": "} v 50 := 0 if 50 < len ( data ) { v 50 = data [ 50 ] } v 51 := 0 if 51 < len ( data ) { v 51 = data [ 51 ] } v 52 := 0 if 52 < len ( data ) { v 52 = data [ 52 ] } v 53 := 0 if 53 < len ( data ) { v 53 = data [ 53 ] } v 54 := 0 if 54 < len ( data ) { v 54 = data [ 54 ] } v 55 := 0 if 55 < len ( data ) { v 55 = data [ 55 ] } v 56 := 0 if 56 < len ( data ) { v 56 = data [ 56 ] } v 57 := 0 if 57 < len ( data ) { v 57 = data [ 57 ] } v 58 := 0 if 58 < len ( data ) { v 58 = data [ 58 ] } v 59 := 0 if 59 < len ( data ) { v 59 = data [ 59 ] } v 60 := 0 if 60 < len ( data ) { v 60 = data [ 60 ] } v 61 := 0 if 61 < len ( data ) { v 61 = data [ 61 ] } v 62 := 0 if 62 < len ( data ) { v 62 = data [ 62 ] } v 63 := 0 if 63 < len ( data ) { v 63 = data [ 63 ] } v 64 := 0 if 64 < len ( data ) { v 64 = data [ 64 ] } v 65 := 0 if 65 < len ( data ) { v 65 = data [ 65 ] } v 66 := 0 if 66 < len ( data ) { v 66 = data [ 66 ] } v 67 := 0 if 67 < len ( data ) { v 67 = data [ 67 ] } v 68 := 0 if 68 < len ( data ) { v 68 = data [ 68 ] } v 69 := 0 if 69 < len ( data ) { v 69 = data [ 69 ] } v 70 := 0 if 70 < len ( data ) { v 70 = data [ 70 ] } v 71 := 0 if 71 < len ( data ) { v 71 = data [ 71 ] } v 72 := 0 if 72 < len ( data ) { v 72 = data [ 72 ] } v 73 := 0 if 73 < len ( data ) { v 73 = data [ 73 ] } v 74 := 0 if 74 < len ( data ) { v 74 = data [ 74 ] } v 75 := 0 if 75 < len ( data ) { v 75 = data [ 75 ] } v 76 := 0 if 76 < len ( data ) { v 76 = data [ 76 ] } v 77 := 0 if 77 < len ( data ) { v 77 = data [ 77 ] } v 78 := 0 if 78 < len ( data ) { v 78 = data [ 78 ] } v 79 := 0 if 79 < len ( data ) { v 79 = data [ 79 ] } v 80 := 0 if 80 < len ( data ) { v 80 = data [ 80 ] } v 81 := 0 if 81 < len ( data ) { v 81 = data [ 81 ] } v 82 := 0 if 82 < len ( data ) { v 82 = data [ 82 ] } v 83 := 0 if 83 < len ( data ) { v 83 = data [ 83 ] } v 84 := 0 if 84 < len ( data ) { v 84 = data [ 84 ] } v 85 := 0 if 85 < len ( data ) { v 85 = data [ 85 ] } v 86 := 0 if 86 < len ( 
data ) { v 86 = data [ 86 ] } v 87 := 0 if 87 < len ( data ) { v 87 = data [ 87 ] } v 88 := 0 if 88 < len ( data ) { v 88 = data [ 88 ] } v 89 := 0 if 89 < len ( data ) { v 89 = data [ 89 ] } v 90 := 0 if 90 < len ( data ) { v 90 = data [ 90 ] } v 91 := 0 if 91 < len ( data ) { v 91 = data [ 91 ] } v 92 := 0 if 92 < len ( data ) { v 92 = data [ 92 ] } v 93 := 0 if 93 < len ( data ) { v 93 = data [ 93 ] } v 94 := 0 if 94 < len ( data ) { v 94 = data [ 94 ] } v 95 := 0 if 95 < len ( data ) { v 95 = data [ 95 ] } v 96 := 0 if 96 < len ( data ) { v 96 = data [ 96 ] } v 97 := 0 if 97 < len ( data ) { v 97 = data [ 97 ] } v 98 := 0 if 98 < len ( data ) { v 98 = data [ 98 ] } v 99 := 0 if 99 < len ( data ) { v 99 = data [ 99 ]"
},
{
"block_ids": [
@@ -186,7 +194,8 @@
}
],
"text": "\t}\n\tv100 := 0\n\tif 100 < len(data) {\n\t\tv100 = data[100]\n\t}\n\tv101 := 0\n\tif 101 < len(data) {\n\t\tv101 = data[101]\n\t}\n\tv102 := 0\n\tif 102 < len(data) {\n\t\tv102 = data[102]\n\t}\n\tv103 := 0\n\tif 103 < len(data) {\n\t\tv103 = data[103]\n\t}\n\tv104 := 0\n\tif 104 < len(data) {\n\t\tv104 = data[104]\n\t}\n\tv105 := 0\n\tif 105 < len(data) {\n\t\tv105 = data[105]\n\t}\n\tv106 := 0\n\tif 106 < len(data) {\n\t\tv106 = data[106]\n\t}\n\tv107 := 0\n\tif 107 < len(data) {\n\t\tv107 = data[107]\n\t}\n\tv108 := 0\n\tif 108 < len(data) {\n\t\tv108 = data[108]\n\t}\n\tv109 := 0\n\tif 109 < len(data) {\n\t\tv109 = data[109]\n\t}\n\tv110 := 0\n\tif 110 < len(data) {\n\t\tv110 = data[110]\n\t}\n\tv111 := 0\n\tif 111 < len(data) {\n\t\tv111 = data[111]\n\t}\n\tv112 := 0\n\tif 112 < len(data) {\n\t\tv112 = data[112]\n\t}\n\tv113 := 0\n\tif 113 < len(data) {\n\t\tv113 = data[113]\n\t}\n\tv114 := 0\n\tif 114 < len(data) {\n\t\tv114 = data[114]\n\t}\n\tv115 := 0\n\tif 115 < len(data) {\n\t\tv115 = data[115]\n\t}\n\tv116 := 0\n\tif 116 < len(data) {\n\t\tv116 = data[116]\n\t}\n\tv117 := 0\n\tif 117 < len(data) {\n\t\tv117 = data[117]\n\t}\n\tv118 := 0\n\tif 118 < len(data) {\n\t\tv118 = data[118]\n\t}\n\tv119 := 0\n\tif 119 < len(data) {\n\t\tv119 = data[119]\n\t}\n\tv120 := 0\n\tif 120 < len(data) {\n\t\tv120 = data[120]\n\t}\n\tv121 := 0\n\tif 121 < len(data) {\n\t\tv121 = data[121]\n\t}\n\tv122 := 0\n\tif 122 < len(data) {\n\t\tv122 = data[122]\n\t}\n\tv123 := 0\n\tif 123 < len(data) {\n\t\tv123 = data[123]\n\t}\n\tv124 := 0\n\tif 124 < len(data) {\n\t\tv124 = data[124]\n\t}\n\tv125 := 0\n\tif 125 < len(data) {\n\t\tv125 = data[125]\n\t}\n\tv126 := 0\n\tif 126 < len(data) {\n\t\tv126 = data[126]\n\t}\n\tv127 := 0\n\tif 127 < len(data) {\n\t\tv127 = data[127]\n\t}\n\tv128 := 0\n\tif 128 < len(data) {\n\t\tv128 = data[128]\n\t}\n\tv129 := 0\n\tif 129 < len(data) {\n\t\tv129 = data[129]\n\t}\n\tv130 := 0\n\tif 130 < len(data) {\n\t\tv130 = 
data[130]\n\t}\n\tv131 := 0\n\tif 131 < len(data) {\n\t\tv131 = data[131]\n\t}\n\tv132 := 0\n\tif 132 < len(data) {\n\t\tv132 = data[132]\n\t}\n\tv133 := 0\n\tif 133 < len(data) {\n\t\tv133 = data[133]\n\t}\n\tv134 := 0\n\tif 134 < len(data) {\n\t\tv134 = data[134]\n\t}\n\tv135 := 0\n\tif 135 < len(data) {\n\t\tv135 = data[135]\n\t}\n\tv136 := 0\n\tif 136 < len(data) {\n\t\tv136 = data[136]\n\t}\n\tv137 := 0\n\tif 137 < len(data) {\n\t\tv137 = data[137]\n\t}\n\tv138 := 0\n\tif 138 < len(data) {\n\t\tv138 = data[138]\n\t}\n\tv139 := 0\n\tif 139 < len(data) {\n\t\tv139 = data[139]\n\t}\n\tv140 := 0\n\tif 140 < len(data) {\n\t\tv140 = data[140]\n\t}\n\tv141 := 0\n\tif 141 < len(data) {\n\t\tv141 = data[141]\n\t}\n\tv142 := 0\n\tif 142 < len(data) {\n\t\tv142 = data[142]\n\t}\n\tv143 := 0\n\tif 143 < len(data) {\n\t\tv143 = data[143]\n\t}\n\tv144 := 0\n\tif 144 < len(data) {\n\t\tv144 = data[144]\n\t}\n\tv145 := 0\n\tif 145 < len(data) {\n\t\tv145 = data[145]\n\t}\n\tv146 := 0\n\tif 146 < len(data) {\n\t\tv146 = data[146]\n\t}\n\tv147 := 0\n\tif 147 < len(data) {\n\t\tv147 = data[147]\n\t}\n\tv148 := 0\n\tif 148 < len(data) {\n\t\tv148 = data[148]\n\t}\n\tv149 := 0\n\tif 149 < len(data) {\n\t\tv149 = data[149]",
"token_estimate": 917
"token_estimate": 917,
"tokenized_korean_text": "} v 100 := 0 if 100 < len ( data ) { v 100 = data [ 100 ] } v 101 := 0 if 101 < len ( data ) { v 101 = data [ 101 ] } v 102 := 0 if 102 < len ( data ) { v 102 = data [ 102 ] } v 103 := 0 if 103 < len ( data ) { v 103 = data [ 103 ] } v 104 := 0 if 104 < len ( data ) { v 104 = data [ 104 ] } v 105 := 0 if 105 < len ( data ) { v 105 = data [ 105 ] } v 106 := 0 if 106 < len ( data ) { v 106 = data [ 106 ] } v 107 := 0 if 107 < len ( data ) { v 107 = data [ 107 ] } v 108 := 0 if 108 < len ( data ) { v 108 = data [ 108 ] } v 109 := 0 if 109 < len ( data ) { v 109 = data [ 109 ] } v 110 := 0 if 110 < len ( data ) { v 110 = data [ 110 ] } v 111 := 0 if 111 < len ( data ) { v 111 = data [ 111 ] } v 112 := 0 if 112 < len ( data ) { v 112 = data [ 112 ] } v 113 := 0 if 113 < len ( data ) { v 113 = data [ 113 ] } v 114 := 0 if 114 < len ( data ) { v 114 = data [ 114 ] } v 115 := 0 if 115 < len ( data ) { v 115 = data [ 115 ] } v 116 := 0 if 116 < len ( data ) { v 116 = data [ 116 ] } v 117 := 0 if 117 < len ( data ) { v 117 = data [ 117 ] } v 118 := 0 if 118 < len ( data ) { v 118 = data [ 118 ] } v 119 := 0 if 119 < len ( data ) { v 119 = data [ 119 ] } v 120 := 0 if 120 < len ( data ) { v 120 = data [ 120 ] } v 121 := 0 if 121 < len ( data ) { v 121 = data [ 121 ] } v 122 := 0 if 122 < len ( data ) { v 122 = data [ 122 ] } v 123 := 0 if 123 < len ( data ) { v 123 = data [ 123 ] } v 124 := 0 if 124 < len ( data ) { v 124 = data [ 124 ] } v 125 := 0 if 125 < len ( data ) { v 125 = data [ 125 ] } v 126 := 0 if 126 < len ( data ) { v 126 = data [ 126 ] } v 127 := 0 if 127 < len ( data ) { v 127 = data [ 127 ] } v 128 := 0 if 128 < len ( data ) { v 128 = data [ 128 ] } v 129 := 0 if 129 < len ( data ) { v 129 = data [ 129 ] } v 130 := 0 if 130 < len ( data ) { v 130 = data [ 130 ] } v 131 := 0 if 131 < len ( data ) { v 131 = data [ 131 ] } v 132 := 0 if 132 < len ( data ) { v 132 = data [ 132 ] } v 133 := 0 if 133 < len ( data ) { v 133 = data [ 133 ] } 
v 134 := 0 if 134 < len ( data ) { v 134 = data [ 134 ] } v 135 := 0 if 135 < len ( data ) { v 135 = data [ 135 ] } v 136 := 0 if 136 < len ( data ) { v 136 = data [ 136 ] } v 137 := 0 if 137 < len ( data ) { v 137 = data [ 137 ] } v 138 := 0 if 138 < len ( data ) { v 138 = data [ 138 ] } v 139 := 0 if 139 < len ( data ) { v 139 = data [ 139 ] } v 140 := 0 if 140 < len ( data ) { v 140 = data [ 140 ] } v 141 := 0 if 141 < len ( data ) { v 141 = data [ 141 ] } v 142 := 0 if 142 < len ( data ) { v 142 = data [ 142 ] } v 143 := 0 if 143 < len ( data ) { v 143 = data [ 143 ] } v 144 := 0 if 144 < len ( data ) { v 144 = data [ 144 ] } v 145 := 0 if 145 < len ( data ) { v 145 = data [ 145 ] } v 146 := 0 if 146 < len ( data ) { v 146 = data [ 146 ] } v 147 := 0 if 147 < len ( data ) { v 147 = data [ 147 ] } v 148 := 0 if 148 < len ( data ) { v 148 = data [ 148 ] } v 149 := 0 if 149 < len ( data ) { v 149 = data [ 149 ]"
},
{
"block_ids": [
@@ -207,7 +216,8 @@
}
],
"text": "\t}\n\tv150 := 0\n\tif 150 < len(data) {\n\t\tv150 = data[150]\n\t}\n\tv151 := 0\n\tif 151 < len(data) {\n\t\tv151 = data[151]\n\t}\n\tv152 := 0\n\tif 152 < len(data) {\n\t\tv152 = data[152]\n\t}\n\tv153 := 0\n\tif 153 < len(data) {\n\t\tv153 = data[153]\n\t}\n\tv154 := 0\n\tif 154 < len(data) {\n\t\tv154 = data[154]\n\t}\n\tv155 := 0\n\tif 155 < len(data) {\n\t\tv155 = data[155]\n\t}\n\tv156 := 0\n\tif 156 < len(data) {\n\t\tv156 = data[156]\n\t}\n\tv157 := 0\n\tif 157 < len(data) {\n\t\tv157 = data[157]\n\t}\n\tv158 := 0\n\tif 158 < len(data) {\n\t\tv158 = data[158]\n\t}\n\tv159 := 0\n\tif 159 < len(data) {\n\t\tv159 = data[159]\n\t}\n\tv160 := 0\n\tif 160 < len(data) {\n\t\tv160 = data[160]\n\t}\n\tv161 := 0\n\tif 161 < len(data) {\n\t\tv161 = data[161]\n\t}\n\tv162 := 0\n\tif 162 < len(data) {\n\t\tv162 = data[162]\n\t}\n\tv163 := 0\n\tif 163 < len(data) {\n\t\tv163 = data[163]\n\t}\n\tv164 := 0\n\tif 164 < len(data) {\n\t\tv164 = data[164]\n\t}\n\tv165 := 0\n\tif 165 < len(data) {\n\t\tv165 = data[165]\n\t}\n\tv166 := 0\n\tif 166 < len(data) {\n\t\tv166 = data[166]\n\t}\n\tv167 := 0\n\tif 167 < len(data) {\n\t\tv167 = data[167]\n\t}\n\tv168 := 0\n\tif 168 < len(data) {\n\t\tv168 = data[168]\n\t}\n\tv169 := 0\n\tif 169 < len(data) {\n\t\tv169 = data[169]\n\t}\n\tv170 := 0\n\tif 170 < len(data) {\n\t\tv170 = data[170]\n\t}\n\tv171 := 0\n\tif 171 < len(data) {\n\t\tv171 = data[171]\n\t}\n\tv172 := 0\n\tif 172 < len(data) {\n\t\tv172 = data[172]\n\t}\n\tv173 := 0\n\tif 173 < len(data) {\n\t\tv173 = data[173]\n\t}\n\tv174 := 0\n\tif 174 < len(data) {\n\t\tv174 = data[174]\n\t}\n\tv175 := 0\n\tif 175 < len(data) {\n\t\tv175 = data[175]\n\t}\n\tv176 := 0\n\tif 176 < len(data) {\n\t\tv176 = data[176]\n\t}\n\tv177 := 0\n\tif 177 < len(data) {\n\t\tv177 = data[177]\n\t}\n\tv178 := 0\n\tif 178 < len(data) {\n\t\tv178 = data[178]\n\t}\n\tv179 := 0\n\tif 179 < len(data) {\n\t\tv179 = data[179]\n\t}\n\tv180 := 0\n\tif 180 < len(data) {\n\t\tv180 = 
data[180]\n\t}\n\tv181 := 0\n\tif 181 < len(data) {\n\t\tv181 = data[181]\n\t}\n\tv182 := 0\n\tif 182 < len(data) {\n\t\tv182 = data[182]\n\t}\n\tv183 := 0\n\tif 183 < len(data) {\n\t\tv183 = data[183]\n\t}\n\tv184 := 0\n\tif 184 < len(data) {\n\t\tv184 = data[184]\n\t}\n\tv185 := 0\n\tif 185 < len(data) {\n\t\tv185 = data[185]\n\t}\n\tv186 := 0\n\tif 186 < len(data) {\n\t\tv186 = data[186]\n\t}\n\tv187 := 0\n\tif 187 < len(data) {\n\t\tv187 = data[187]\n\t}\n\tv188 := 0\n\tif 188 < len(data) {\n\t\tv188 = data[188]\n\t}\n\tv189 := 0\n\tif 189 < len(data) {\n\t\tv189 = data[189]\n\t}\n\tv190 := 0\n\tif 190 < len(data) {\n\t\tv190 = data[190]\n\t}\n\tv191 := 0\n\tif 191 < len(data) {\n\t\tv191 = data[191]\n\t}\n\tv192 := 0\n\tif 192 < len(data) {\n\t\tv192 = data[192]\n\t}\n\tv193 := 0\n\tif 193 < len(data) {\n\t\tv193 = data[193]\n\t}\n\tv194 := 0\n\tif 194 < len(data) {\n\t\tv194 = data[194]\n\t}\n\tv195 := 0\n\tif 195 < len(data) {\n\t\tv195 = data[195]\n\t}\n\tv196 := 0\n\tif 196 < len(data) {\n\t\tv196 = data[196]\n\t}\n\tv197 := 0\n\tif 197 < len(data) {\n\t\tv197 = data[197]\n\t}\n\tv198 := 0\n\tif 198 < len(data) {\n\t\tv198 = data[198]\n\t}\n\tv199 := 0\n\tif 199 < len(data) {\n\t\tv199 = data[199]",
"token_estimate": 917
"token_estimate": 917,
"tokenized_korean_text": "} v 150 := 0 if 150 < len ( data ) { v 150 = data [ 150 ] } v 151 := 0 if 151 < len ( data ) { v 151 = data [ 151 ] } v 152 := 0 if 152 < len ( data ) { v 152 = data [ 152 ] } v 153 := 0 if 153 < len ( data ) { v 153 = data [ 153 ] } v 154 := 0 if 154 < len ( data ) { v 154 = data [ 154 ] } v 155 := 0 if 155 < len ( data ) { v 155 = data [ 155 ] } v 156 := 0 if 156 < len ( data ) { v 156 = data [ 156 ] } v 157 := 0 if 157 < len ( data ) { v 157 = data [ 157 ] } v 158 := 0 if 158 < len ( data ) { v 158 = data [ 158 ] } v 159 := 0 if 159 < len ( data ) { v 159 = data [ 159 ] } v 160 := 0 if 160 < len ( data ) { v 160 = data [ 160 ] } v 161 := 0 if 161 < len ( data ) { v 161 = data [ 161 ] } v 162 := 0 if 162 < len ( data ) { v 162 = data [ 162 ] } v 163 := 0 if 163 < len ( data ) { v 163 = data [ 163 ] } v 164 := 0 if 164 < len ( data ) { v 164 = data [ 164 ] } v 165 := 0 if 165 < len ( data ) { v 165 = data [ 165 ] } v 166 := 0 if 166 < len ( data ) { v 166 = data [ 166 ] } v 167 := 0 if 167 < len ( data ) { v 167 = data [ 167 ] } v 168 := 0 if 168 < len ( data ) { v 168 = data [ 168 ] } v 169 := 0 if 169 < len ( data ) { v 169 = data [ 169 ] } v 170 := 0 if 170 < len ( data ) { v 170 = data [ 170 ] } v 171 := 0 if 171 < len ( data ) { v 171 = data [ 171 ] } v 172 := 0 if 172 < len ( data ) { v 172 = data [ 172 ] } v 173 := 0 if 173 < len ( data ) { v 173 = data [ 173 ] } v 174 := 0 if 174 < len ( data ) { v 174 = data [ 174 ] } v 175 := 0 if 175 < len ( data ) { v 175 = data [ 175 ] } v 176 := 0 if 176 < len ( data ) { v 176 = data [ 176 ] } v 177 := 0 if 177 < len ( data ) { v 177 = data [ 177 ] } v 178 := 0 if 178 < len ( data ) { v 178 = data [ 178 ] } v 179 := 0 if 179 < len ( data ) { v 179 = data [ 179 ] } v 180 := 0 if 180 < len ( data ) { v 180 = data [ 180 ] } v 181 := 0 if 181 < len ( data ) { v 181 = data [ 181 ] } v 182 := 0 if 182 < len ( data ) { v 182 = data [ 182 ] } v 183 := 0 if 183 < len ( data ) { v 183 = data [ 183 ] } 
v 184 := 0 if 184 < len ( data ) { v 184 = data [ 184 ] } v 185 := 0 if 185 < len ( data ) { v 185 = data [ 185 ] } v 186 := 0 if 186 < len ( data ) { v 186 = data [ 186 ] } v 187 := 0 if 187 < len ( data ) { v 187 = data [ 187 ] } v 188 := 0 if 188 < len ( data ) { v 188 = data [ 188 ] } v 189 := 0 if 189 < len ( data ) { v 189 = data [ 189 ] } v 190 := 0 if 190 < len ( data ) { v 190 = data [ 190 ] } v 191 := 0 if 191 < len ( data ) { v 191 = data [ 191 ] } v 192 := 0 if 192 < len ( data ) { v 192 = data [ 192 ] } v 193 := 0 if 193 < len ( data ) { v 193 = data [ 193 ] } v 194 := 0 if 194 < len ( data ) { v 194 = data [ 194 ] } v 195 := 0 if 195 < len ( data ) { v 195 = data [ 195 ] } v 196 := 0 if 196 < len ( data ) { v 196 = data [ 196 ] } v 197 := 0 if 197 < len ( data ) { v 197 = data [ 197 ] } v 198 := 0 if 198 < len ( data ) { v 198 = data [ 198 ] } v 199 := 0 if 199 < len ( data ) { v 199 = data [ 199 ]"
},
{
"block_ids": [
@@ -228,6 +238,7 @@
}
],
"text": "\t}\n\tv200 := 0\n\tif 200 < len(data) {\n\t\tv200 = data[200]\n\t}\n\tv201 := 0\n\tif 201 < len(data) {\n\t\tv201 = data[201]\n\t}\n\tv202 := 0\n\tif 202 < len(data) {\n\t\tv202 = data[202]\n\t}\n\tv203 := 0\n\tif 203 < len(data) {\n\t\tv203 = data[203]\n\t}\n\tv204 := 0\n\tif 204 < len(data) {\n\t\tv204 = data[204]\n\t}\n\tv205 := 0\n\tif 205 < len(data) {\n\t\tv205 = data[205]\n\t}\n\tv206 := 0\n\tif 206 < len(data) {\n\t\tv206 = data[206]\n\t}\n\tv207 := 0\n\tif 207 < len(data) {\n\t\tv207 = data[207]\n\t}\n\tv208 := 0\n\tif 208 < len(data) {\n\t\tv208 = data[208]\n\t}\n\tv209 := 0\n\tif 209 < len(data) {\n\t\tv209 = data[209]\n\t}\n\treturn len(data)\n}",
"token_estimate": 191
"token_estimate": 191,
"tokenized_korean_text": "} v 200 := 0 if 200 < len ( data ) { v 200 = data [ 200 ] } v 201 := 0 if 201 < len ( data ) { v 201 = data [ 201 ] } v 202 := 0 if 202 < len ( data ) { v 202 = data [ 202 ] } v 203 := 0 if 203 < len ( data ) { v 203 = data [ 203 ] } v 204 := 0 if 204 < len ( data ) { v 204 = data [ 204 ] } v 205 := 0 if 205 < len ( data ) { v 205 = data [ 205 ] } v 206 := 0 if 206 < len ( data ) { v 206 = data [ 206 ] } v 207 := 0 if 207 < len ( data ) { v 207 = data [ 207 ] } v 208 := 0 if 208 < len ( data ) { v 208 = data [ 208 ] } v 209 := 0 if 209 < len ( data ) { v 209 = data [ 209 ] } return len ( data ) }"
}
]

File diff suppressed because one or more lines are too long

View File

@@ -110,13 +110,11 @@ fn k8s_multi_doc_emits_one_chunk_per_resource() {
let symbols: Vec<&str> = chunks
.iter()
.map(|c| {
match &c.source_spans[0] {
SourceSpan::Code { symbol, .. } => {
symbol.as_deref().expect("symbol must be Some for k8s chunks")
}
other => panic!("expected Code span, got {other:?}"),
}
.map(|c| match &c.source_spans[0] {
SourceSpan::Code { symbol, .. } => symbol
.as_deref()
.expect("symbol must be Some for k8s chunks"),
other => panic!("expected Code span, got {other:?}"),
})
.collect();
@@ -270,7 +268,11 @@ fn k8s_oversize_splits_into_line_windows_sharing_symbol() {
let ranges: Vec<(u32, u32)> = chunks
.iter()
.map(|c| match &c.source_spans[0] {
SourceSpan::Code { line_start, line_end, .. } => (*line_start, *line_end),
SourceSpan::Code {
line_start,
line_end,
..
} => (*line_start, *line_end),
other => panic!("expected Code span, got {other:?}"),
})
.collect();

View File

@@ -15,7 +15,7 @@ use std::path::PathBuf;
use kebab_chunk::MdHeadingV1Chunker;
use kebab_core::{
AssetId, AssetStorage, Checksum, ChunkPolicy, ChunkerVersion, Chunker, MediaType,
AssetId, AssetStorage, Checksum, ChunkPolicy, Chunker, ChunkerVersion, MediaType,
ParserVersion, RawAsset, SourceUri, WorkspacePath,
};
use kebab_parse_md::{BodyHints, build_canonical_document, parse_blocks, parse_frontmatter};
@@ -65,8 +65,7 @@ fn long_section_chunks_snapshot() {
Some(span) => bytes[..span.end].iter().filter(|b| **b == b'\n').count() as u32 + 1,
None => 1,
};
let (blocks, parse_warns) =
parse_blocks(&bytes, body_offset_lines).expect("blocks parse");
let (blocks, parse_warns) = parse_blocks(&bytes, body_offset_lines).expect("blocks parse");
// Pin parser_version so doc_id / block_ids are reproducible.
let parser_version = ParserVersion("kb-chunk-snapshot-test-0".into());
@@ -74,9 +73,8 @@ fn long_section_chunks_snapshot() {
metadata.aliases.sort();
metadata.tags.sort();
let doc =
build_canonical_document(&asset, metadata, blocks, &parser_version, parse_warns)
.expect("build_canonical_document");
let doc = build_canonical_document(&asset, metadata, blocks, &parser_version, parse_warns)
.expect("build_canonical_document");
// Pin policy so policy_hash and chunk_ids are reproducible.
let policy = ChunkPolicy {
@@ -102,8 +100,7 @@ fn long_section_chunks_snapshot() {
baseline_path.display()
),
};
let expected: Value =
serde_json::from_str(&baseline_text).expect("baseline parses as json");
let expected: Value = serde_json::from_str(&baseline_text).expect("baseline parses as json");
if actual != expected {
if std::env::var("UPDATE_SNAPSHOTS").is_ok() {
@@ -154,14 +151,8 @@ fn long_section_chunks_are_deterministic() {
let mut metadata = metadata;
metadata.aliases.sort();
metadata.tags.sort();
let doc = build_canonical_document(
&asset,
metadata,
blocks,
&parser_version,
parse_warns,
)
.expect("build_canonical_document");
let doc = build_canonical_document(&asset, metadata, blocks, &parser_version, parse_warns)
.expect("build_canonical_document");
let ids: Vec<String> = MdHeadingV1Chunker
.chunk(&doc, &policy)
.unwrap()

View File

@@ -107,9 +107,7 @@ fn cargo_toml_single_chunk_with_toml_lang() {
.unwrap_or_else(|e| panic!("cannot read fixture {}: {e}", fixture_path.display()));
let doc = manifest_doc("toml", &text);
let chunks = ManifestFileV1Chunker
.chunk(&doc, &policy())
.expect("chunk");
let chunks = ManifestFileV1Chunker.chunk(&doc, &policy()).expect("chunk");
assert_eq!(
chunks.len(),
@@ -149,9 +147,7 @@ fn package_json_single_chunk_with_json_lang() {
.unwrap_or_else(|e| panic!("cannot read fixture {}: {e}", fixture_path.display()));
let doc = manifest_doc("json", &text);
let chunks = ManifestFileV1Chunker
.chunk(&doc, &policy())
.expect("chunk");
let chunks = ManifestFileV1Chunker.chunk(&doc, &policy()).expect("chunk");
assert_eq!(
chunks.len(),
@@ -191,9 +187,7 @@ fn pom_xml_single_chunk_with_xml_lang() {
.unwrap_or_else(|e| panic!("cannot read fixture {}: {e}", fixture_path.display()));
let doc = manifest_doc("xml", &text);
let chunks = ManifestFileV1Chunker
.chunk(&doc, &policy())
.expect("chunk");
let chunks = ManifestFileV1Chunker.chunk(&doc, &policy()).expect("chunk");
assert_eq!(
chunks.len(),
@@ -233,9 +227,7 @@ fn go_mod_single_chunk_with_go_mod_lang() {
.unwrap_or_else(|e| panic!("cannot read fixture {}: {e}", fixture_path.display()));
let doc = manifest_doc("go-mod", &text);
let chunks = ManifestFileV1Chunker
.chunk(&doc, &policy())
.expect("chunk");
let chunks = ManifestFileV1Chunker.chunk(&doc, &policy()).expect("chunk");
assert_eq!(
chunks.len(),

View File

@@ -0,0 +1,81 @@
#[test]
fn tokenize_korean_morphological_splits_2char_word() {
let out = kebab_chunk::tokenize_korean_morphological("한국 문화는 오래되었다").unwrap();
let tokens: Vec<&str> = out.split_whitespace().collect();
assert!(tokens.contains(&"한국"), "tokens = {tokens:?}");
}
#[test]
fn tokenize_korean_morphological_empty_returns_none() {
assert!(kebab_chunk::tokenize_korean_morphological("").is_none());
assert!(kebab_chunk::tokenize_korean_morphological(" ").is_none());
}
/// v0.21.0 N-gram supplement (Option β): for Hangul tokens whose morpheme
/// length is ≥ 3 (cases that ko-dic stores as a single compound), emit
/// supplementary sliding-window 2-grams. When ko-dic already splits well,
/// e.g. `한국정부` → `[한국, 정부]`, the morphemes are 2 chars long, so no
/// supplement is emitted (the filter is intentional).
#[test]
fn tokenize_korean_morphological_emits_2gram_for_long_morpheme() {
// Verifies ko-dic's segmentation policy: which inputs emit a 3+ char morpheme.
// This test depends on lindera ko-dic's segmentation, so the concrete fixtures
// are cases whose morpheme list contains a token of ≥ 3 chars.
let probe_inputs: &[&str] = &[
"한국문화", // ko-dic may register this as a single noun → 3+ char morpheme
"주민등록번호", // 4+ char compound, a supplement target
"서울특별시", // 3+ char
"대한민국", // 3+ char
"오래되었다", // conjugated verb form; some 3+ char morphemes are possible
];
let mut found_supplement = false;
for input in probe_inputs {
let out = kebab_chunk::tokenize_korean_morphological(input).unwrap_or_default();
let tokens: Vec<&str> = out.split_whitespace().collect();
let unique: std::collections::HashSet<&&str> = tokens.iter().collect();
// If the supplement fired, the number of distinct tokens exceeds the morpheme count of the lindera output,
// or the 2-char prefix of the input exists as a separate token.
let prefix: String = input.chars().take(2).collect();
if tokens.contains(&prefix.as_str()) && tokens.iter().any(|t| t.chars().count() >= 3) {
found_supplement = true;
println!("supplement fired for input '{input}' → tokens = {tokens:?}");
}
// No English/numeric prefix is emitted (only Hangul is a supplement target).
// The unique token count is always ≥ 1.
assert!(!unique.is_empty(), "input '{input}' produced empty token list");
}
// Confirm that the supplement fires for at least one fixture.
// If ko-dic splits every probe into 2-char units only, found_supplement=false is possible.
// In that case this test demonstrates that the N-gram supplement is marginal under ko-dic's policy (warning only).
if !found_supplement {
eprintln!(
"WARNING: ko-dic 가 모든 probe input 을 2-char morpheme 으로 분해. \
N-gram supplement 의 marginal benefit 은 corpus 의 morpheme 길이 분포 의존."
);
}
}
/// The N-gram supplement applies only to Korean (Hangul) morphemes. English,
/// numeric, and mixed tokens get no sliding-window emit (avoids false positives).
#[test]
fn tokenize_korean_morphological_no_2gram_for_english() {
let out = kebab_chunk::tokenize_korean_morphological("Rust optimization").unwrap();
let tokens: Vec<&str> = out.split_whitespace().collect();
// Rust and optimization themselves must be present as tokens (lindera output).
assert!(
tokens.iter().any(|t| t.eq_ignore_ascii_case("rust") || t.eq_ignore_ascii_case("optimization")),
"lindera 의 영어 token 자체는 emit 되어야 함 — tokens = {tokens:?}"
);
// English substrings (`Rus`, `imi`, `tion`, etc.) are not emitted as N-grams.
let supplements: Vec<&&str> = tokens
.iter()
.filter(|t| matches!(t.chars().count(), 2 | 3) && t.chars().all(|c| c.is_ascii_alphabetic()))
.collect();
// Either empty or only short ASCII tokens emitted by lindera; no substring that we emitted ourselves.
assert!(
supplements.iter().all(|t| !t.contains("Rus") && !t.contains("ust") && !t.contains("imi")),
"영어 N-gram supplement 가 emit 됨 — false positive 위험. tokens = {tokens:?}"
);
}
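
The tests above describe the 2-gram supplement only through its observable behavior; the implementation inside `kebab_chunk` is not part of this diff. As a rough illustration of the described rule (Hangul morphemes of 3+ chars get sliding-window 2-grams, everything else is left alone), a minimal sketch might look like the following. The helper names `is_hangul_token`, `bigram_supplement`, and `with_supplement` are hypothetical, and the check against the precomposed Hangul syllable range is an assumption about what "Hangul token" means here.

```rust
/// Assumption: a "Hangul token" consists only of precomposed Hangul
/// syllables (U+AC00..=U+D7A3). The real filter may differ.
fn is_hangul_token(token: &str) -> bool {
    !token.is_empty() && token.chars().all(|c| ('\u{AC00}'..='\u{D7A3}').contains(&c))
}

/// Hypothetical helper: sliding-window 2-grams for a Hangul morpheme of
/// 3 or more chars. Shorter or non-Hangul tokens yield no supplement.
fn bigram_supplement(token: &str) -> Vec<String> {
    let chars: Vec<char> = token.chars().collect();
    if chars.len() < 3 || !is_hangul_token(token) {
        return Vec::new();
    }
    chars
        .windows(2)
        .map(|w| w.iter().collect::<String>())
        .collect()
}

/// Hypothetical helper: keep the original morphemes first, then append
/// the supplements, joined by single spaces like the tokenizer output.
fn with_supplement(morphemes: &[&str]) -> String {
    let mut out: Vec<String> = morphemes.iter().map(|m| m.to_string()).collect();
    for m in morphemes {
        out.extend(bigram_supplement(m));
    }
    out.join(" ")
}
```

Under these assumptions a single 4-char morpheme such as `대한민국` gains three 2-grams, while an already well-split pair like `한국` / `문화` passes through unchanged, which matches the intent stated in the doc comment.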

View File

@@ -51,5 +51,10 @@ tempfile = { workspace = true }
rusqlite = { workspace = true }
time = { workspace = true }
[features]
# opt-in (macOS): build the `kebab` binary with candle on the Apple Silicon GPU.
# cargo build --release --features embed_metal
embed_metal = ["kebab-app/embed_metal"]
[lints]
workspace = true

View File

@@ -60,6 +60,12 @@ enum Cmd {
force: bool,
},
/// Manage config.toml (schema migration, etc.).
Config {
#[command(subcommand)]
what: ConfigWhat,
},
/// Scan the workspace and ingest new/updated documents.
Ingest {
/// Workspace root override.
@@ -156,7 +162,7 @@ enum Cmd {
/// p9-fb-36: filter by `assets.media_type` kind. Comma-separated.
/// Aliases: `md` → `markdown`. Other accepted: `markdown`, `pdf`,
/// `image`, `audio`, `other`. Unknown values match nothing.
/// `image`, `audio`, `code`, `other`. Unknown values match nothing.
#[arg(long, value_delimiter = ',')]
media: Vec<String>,
@@ -179,7 +185,12 @@ enum Cmd {
/// canonical). Repeatable or comma-separated.
/// Examples: `rust`, `python`, `typescript`.
/// Unknown values produce empty hits.
#[arg(long = "code-lang", value_name = "LANG", num_args = 1, value_delimiter = ',')]
#[arg(
long = "code-lang",
value_name = "LANG",
num_args = 1,
value_delimiter = ','
)]
code_lang: Vec<String>,
/// p9-fb-37: emit pre-fusion lexical / vector / RRF candidate
@@ -341,6 +352,16 @@ enum Cmd {
},
}
#[derive(Subcommand, Debug)]
enum ConfigWhat {
/// Migrate an existing config.toml to the new schema (adds missing sections; idempotent; writes a .bak backup).
Migrate {
/// Print the changes only; do not modify the file.
#[arg(long)]
dry_run: bool,
},
}
#[derive(Subcommand, Debug)]
enum ListWhat {
/// List documents currently indexed.
@@ -353,6 +374,17 @@ enum InspectWhat {
Doc { id: String },
/// Inspect a single chunk by ID.
Chunk { id: String },
/// Corpus-wide OCR statistics (total events, latency percentiles, engine breakdown).
OcrStats,
/// Recent OCR failures, optionally filtered by document ID.
OcrFailures {
/// Filter failures to a single document UUID.
#[arg(long)]
doc_id: Option<String>,
/// Maximum number of failure rows to return.
#[arg(long, default_value_t = 10)]
limit: usize,
},
}
#[derive(Subcommand, Debug)]
@@ -406,6 +438,14 @@ enum EvalWhat {
/// into `eval_runs.aggregate_json` (P5-2).
Aggregate { run_id: String },
/// Compute variant-consistency metrics for a stored run and print
/// a Markdown report (or JSON with `--json`).
Variants {
run_id: String,
#[arg(long)]
json: bool,
},
/// Diff two stored runs (P5-2). Default output is a Markdown
/// summary; use `--json` (top-level flag) for the raw report.
Compare {
@@ -464,7 +504,9 @@ fn parse_bool_env(s: &str) -> Result<bool, String> {
match s.to_ascii_lowercase().as_str() {
"1" | "true" | "yes" | "on" => Ok(true),
"0" | "false" | "no" | "off" => Ok(false),
other => Err(format!("expected 1/0/true/false/yes/no/on/off, got {other:?}")),
other => Err(format!(
"expected 1/0/true/false/yes/no/on/off, got {other:?}"
)),
}
}
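
For reference, the hunk above only reformats the error arm of `parse_bool_env`; behavior is unchanged. A standalone copy of the function as it appears in the diff, plus a hypothetical caller `flag_or_default` (not in the source) showing how an unset variable could fall back to a default, might look like this:

```rust
/// Parses a boolean-like environment value, case-insensitively.
/// Copied from the diff; input is not trimmed, so " true" is rejected.
fn parse_bool_env(s: &str) -> Result<bool, String> {
    match s.to_ascii_lowercase().as_str() {
        "1" | "true" | "yes" | "on" => Ok(true),
        "0" | "false" | "no" | "off" => Ok(false),
        other => Err(format!(
            "expected 1/0/true/false/yes/no/on/off, got {other:?}"
        )),
    }
}

/// Hypothetical caller: use `default` when the variable is unset,
/// otherwise surface the parse error to the user.
fn flag_or_default(raw: Option<&str>, default: bool) -> Result<bool, String> {
    match raw {
        Some(s) => parse_bool_env(s),
        None => Ok(default),
    }
}
```

Note that the value echoed in the error message is the lowercased input, because the match runs on the lowercased copy.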
@@ -551,9 +593,18 @@ fn run(cli: &Cli) -> anyhow::Result<()> {
"created {}",
kebab_config::Config::xdg_config_path().display()
);
println!("created {}", kebab_config::Config::xdg_data_dir().display());
println!("created {}", kebab_config::Config::xdg_state_dir().display());
println!(
"created {}",
kebab_config::Config::xdg_data_dir().display()
);
println!(
"created {}",
kebab_config::Config::xdg_state_dir().display()
);
println!("hint edit the config above, then `kebab ingest`");
println!(
"hint remote Ollama 사용 시 config 의 `[models.llm] endpoint` 를 갱신 (기본 http://127.0.0.1:11434)"
);
}
Ok(())
}
@@ -565,7 +616,9 @@ fn run(cli: &Cli) -> anyhow::Result<()> {
} => {
let cfg = kebab_config::Config::load(cli.config.as_deref())?;
let scope = kebab_core::SourceScope {
root: root.clone().unwrap_or_else(|| PathBuf::from(&cfg.workspace.root)),
root: root
.clone()
.unwrap_or_else(|| PathBuf::from(&cfg.workspace.root)),
exclude: cfg.workspace.exclude.clone(),
..Default::default()
};
@@ -579,10 +632,27 @@ fn run(cli: &Cli) -> anyhow::Result<()> {
.map(|v| v.eq_ignore_ascii_case("plain"))
.unwrap_or(false);
let mode = progress::ProgressMode::from_flags(cli.json, cli.quiet, plain_env);
// Surface the active embedding backend/device on the terminal so the
// user sees it without grepping kb.log (the per-device tracing line
// only lands in the log file at --verbose). Suppressed under
// --json/--quiet. The Metal note reflects the build (`embed_metal`);
// the confirmed runtime device is in kb.log (`candle device = ...`).
if !cli.json && !cli.quiet {
let backend = match cfg.models.embedding.provider.as_str() {
"candle" if cfg!(feature = "embed_metal") => "candle (Metal/GPU 빌드)",
"candle" => "candle (CPU, 순수 Rust)",
"fastembed" | "onnx" | "" => "fastembed (onnxruntime)",
"none" => "비활성 (lexical-only)",
other => other,
};
eprintln!("임베딩 백엔드: {backend} · 모델 {} ({}-dim)",
cfg.models.embedding.model, cfg.models.embedding.dimensions);
}
let (tx, rx) = std::sync::mpsc::channel::<kebab_app::IngestEvent>();
let display_handle = std::thread::spawn(move || {
progress::ProgressDisplay::new(mode).run(rx)
});
let display_handle =
std::thread::spawn(move || progress::ProgressDisplay::new(mode).run(rx));
// p9-fb-04: register a Ctrl-C handler that flips the same
// AtomicBool the facade polls at each step boundary. The
@@ -614,7 +684,8 @@ fn run(cli: &Cli) -> anyhow::Result<()> {
if cli.json {
println!("{}", serde_json::to_string(&wire::wire_ingest(&report))?);
} else {
let skipped_breakdown = kebab_app::render_skipped_breakdown(&report.skipped_by_extension);
let skipped_breakdown =
kebab_app::render_skipped_breakdown(&report.skipped_by_extension);
let purged_suffix = if report.purged_deleted_files > 0 {
format!(" purged {}", report.purged_deleted_files)
} else {
@@ -640,10 +711,13 @@ fn run(cli: &Cli) -> anyhow::Result<()> {
let cfg = kebab_config::Config::load(cli.config.as_deref())?;
let docs = kebab_app::list_docs_with_config(cfg, kebab_core::DocFilter::default())?;
if cli.json {
println!("{}", serde_json::to_string(&wire::wire_doc_summaries(&docs))?);
println!(
"{}",
serde_json::to_string(&wire::wire_doc_summaries(&docs))?
);
} else {
for d in &docs {
println!("{}\t{}", d.doc_id, d.doc_path.0);
println!("{}", wire::format_doc_row(d));
}
}
Ok(())
@@ -667,7 +741,25 @@ fn run(cli: &Cli) -> anyhow::Result<()> {
let cfg = kebab_config::Config::load(cli.config.as_deref())?;
let chunk_id: kebab_core::ChunkId = id.parse()?;
let chunk = kebab_app::inspect_chunk_with_config(cfg, &chunk_id)?;
println!("{}", serde_json::to_string(&wire::wire_chunk_inspection(&chunk))?);
println!(
"{}",
serde_json::to_string(&wire::wire_chunk_inspection(&chunk))?
);
Ok(())
}
InspectWhat::OcrStats => {
let cfg = kebab_config::Config::load(cli.config.as_deref())?;
let app = kebab_app::App::open_with_config(cfg.clone())?;
let stats = app.inspect_ocr_stats_with_config(&cfg)?;
println!("{}", serde_json::to_string(&stats)?);
Ok(())
}
InspectWhat::OcrFailures { doc_id, limit } => {
let cfg = kebab_config::Config::load(cli.config.as_deref())?;
let app = kebab_app::App::open_with_config(cfg.clone())?;
let failures =
app.inspect_ocr_failures_with_config(&cfg, doc_id.as_deref(), *limit)?;
println!("{}", serde_json::to_string(&failures)?);
Ok(())
}
},
@@ -708,7 +800,10 @@ fn run(cli: &Cli) -> anyhow::Result<()> {
};
let result = kebab_app::fetch_with_config(cfg, query, opts)?;
if cli.json {
println!("{}", serde_json::to_string(&wire::wire_fetch_result(&result))?);
println!(
"{}",
serde_json::to_string(&wire::wire_fetch_result(&result))?
);
} else {
render_fetch_plain(&result);
}
@@ -752,30 +847,21 @@ fn run(cli: &Cli) -> anyhow::Result<()> {
if line.trim().is_empty() {
continue;
}
let v: serde_json::Value =
serde_json::from_str(&line).map_err(|e| {
anyhow::Error::new(kebab_app::StructuredError(
kebab_app::ErrorV1 {
schema_version: kebab_app::ERROR_V1_ID
.to_string(),
code: "config_invalid".to_string(),
message: format!(
"stdin ndjson line {} parse error: {e}",
lineno + 1
),
details: serde_json::Value::Null,
hint: Some(
"each line must be a JSON object with at least `query`"
.to_string(),
),
},
))
})?;
let v: serde_json::Value = serde_json::from_str(&line).map_err(|e| {
anyhow::Error::new(kebab_app::StructuredError(kebab_app::ErrorV1 {
schema_version: kebab_app::ERROR_V1_ID.to_string(),
code: "config_invalid".to_string(),
message: format!("stdin ndjson line {} parse error: {e}", lineno + 1),
details: serde_json::Value::Null,
hint: Some(
"each line must be a JSON object with at least `query`".to_string(),
),
}))
})?;
raw_items.push(v);
}
let (items, summary) =
kebab_app::bulk_search_with_config(cfg, raw_items)?;
let (items, summary) = kebab_app::bulk_search_with_config(cfg, raw_items)?;
if cli.json {
let mut stdout = std::io::stdout().lock();
@@ -799,11 +885,7 @@ fn run(cli: &Cli) -> anyhow::Result<()> {
if let Some(err) = &item.error {
writeln!(stdout, "error: {err}")?;
} else if let Some(resp) = &item.response {
writeln!(
stdout,
"{}",
serde_json::to_string_pretty(resp)?
)?;
writeln!(stdout, "{}", serde_json::to_string_pretty(resp)?)?;
}
writeln!(stdout)?;
}
@@ -819,6 +901,17 @@ fn run(cli: &Cli) -> anyhow::Result<()> {
// p9-fb-42: bulk mode requires no query; single-query mode requires query.
let query_text = match query.as_ref() {
Some(q) if q.trim().is_empty() => {
return Err(anyhow::Error::new(kebab_app::StructuredError(
kebab_app::ErrorV1 {
schema_version: kebab_app::ERROR_V1_ID.to_string(),
code: "invalid_input".to_string(),
message: "query is empty; provide a non-empty search term or use --bulk".into(),
details: serde_json::Value::Null,
hint: Some("e.g. `kebab search 'rust async'` or `kebab search --bulk < queries.ndjson`".into()),
},
)));
}
Some(q) => q.clone(),
None => {
return Err(anyhow::anyhow!("query is required unless --bulk is set"));
@@ -832,8 +925,7 @@ fn run(cli: &Cli) -> anyhow::Result<()> {
other => other.to_string(),
}
}
let media_norm: Vec<String> =
media.iter().map(|s| normalize_media_alias(s)).collect();
let media_norm: Vec<String> = media.iter().map(|s| normalize_media_alias(s)).collect();
// p9-fb-36: parse --ingested-after as RFC3339; structured error on failure.
let ingested_after_parsed: Option<time::OffsetDateTime> =
@@ -845,8 +937,8 @@ fn run(cli: &Cli) -> anyhow::Result<()> {
) {
Ok(ts) => Some(ts),
Err(e) => {
return Err(anyhow::Error::new(
kebab_app::StructuredError(kebab_app::ErrorV1 {
return Err(anyhow::Error::new(kebab_app::StructuredError(
kebab_app::ErrorV1 {
schema_version: kebab_app::ERROR_V1_ID.to_string(),
code: "config_invalid".to_string(),
message: format!(
@@ -856,8 +948,8 @@ fn run(cli: &Cli) -> anyhow::Result<()> {
hint: Some(
"expected format like 2026-04-01T00:00:00Z".to_string(),
),
}),
));
},
)));
}
}
}
@@ -932,11 +1024,7 @@ fn run(cli: &Cli) -> anyhow::Result<()> {
};
println!(
"{:>2}. {:.4} {}{}{}",
h.rank,
h.retrieval.fusion_score,
stale_tag,
h.doc_path.0,
heading,
h.rank, h.retrieval.fusion_score, stale_tag, h.doc_path.0, heading,
);
}
// p9-fb-34: truncation hint goes to stderr so it
@@ -958,15 +1046,33 @@ fn run(cli: &Cli) -> anyhow::Result<()> {
if let Some(t) = &resp.trace {
eprintln!();
eprintln!("Trace:");
eprintln!(" lexical ({} hits, {}ms):", t.lexical.len(), t.timing.lexical_ms);
eprintln!(
" lexical ({} hits, {}ms):",
t.lexical.len(),
t.timing.lexical_ms
);
for c in t.lexical.iter().take(3) {
eprintln!(" rank={} score={:.4} chunk={}", c.rank, c.score, c.chunk_id.0);
eprintln!(
" rank={} score={:.4} chunk={}",
c.rank, c.score, c.chunk_id.0
);
}
eprintln!(" vector ({} hits, {}ms):", t.vector.len(), t.timing.vector_ms);
eprintln!(
" vector ({} hits, {}ms):",
t.vector.len(),
t.timing.vector_ms
);
for c in t.vector.iter().take(3) {
eprintln!(" rank={} score={:.4} chunk={}", c.rank, c.score, c.chunk_id.0);
eprintln!(
" rank={} score={:.4} chunk={}",
c.rank, c.score, c.chunk_id.0
);
}
eprintln!(" fusion ({} inputs, {}ms)", t.rrf_inputs.len(), t.timing.fusion_ms);
eprintln!(
" fusion ({} inputs, {}ms)",
t.rrf_inputs.len(),
t.timing.fusion_ms
);
eprintln!(" total: {}ms", t.timing.total_ms);
}
}
@@ -988,6 +1094,17 @@ fn run(cli: &Cli) -> anyhow::Result<()> {
multi_hop,
} => {
let cfg = kebab_config::Config::load(cli.config.as_deref())?;
if query.trim().is_empty() {
return Err(anyhow::Error::new(kebab_app::StructuredError(
kebab_app::ErrorV1 {
schema_version: kebab_app::ERROR_V1_ID.to_string(),
code: "invalid_input".to_string(),
message: "query is empty; provide a non-empty prompt".into(),
details: serde_json::Value::Null,
hint: Some("e.g. `kebab ask \"explain this code\"`".into()),
},
)));
}
if *stream {
// p9-fb-33: streaming branch. Background thread runs
// ask_with_config (which calls into the rag pipeline);
@@ -1017,16 +1134,12 @@ fn run(cli: &Cli) -> anyhow::Result<()> {
let cfg2 = cfg.clone();
let q = query.clone();
let session2 = session.clone();
let handle = std::thread::spawn(
move || -> anyhow::Result<kebab_core::Answer> {
match session2.as_deref() {
Some(sid) => kebab_app::ask_with_session_with_config(
cfg2, sid, &q, opts,
),
None => kebab_app::ask_with_config(cfg2, &q, opts),
}
},
);
let handle = std::thread::spawn(move || -> anyhow::Result<kebab_core::Answer> {
match session2.as_deref() {
Some(sid) => kebab_app::ask_with_session_with_config(cfg2, sid, &q, opts),
None => kebab_app::ask_with_config(cfg2, &q, opts),
}
});
// Drain receiver, write ndjson to stderr until
// completion or BrokenPipe.
@@ -1231,6 +1344,42 @@ fn run(cli: &Cli) -> anyhow::Result<()> {
Ok(())
}
Cmd::Config { what } => match what {
ConfigWhat::Migrate { dry_run } => {
let report =
kebab_app::config_migrate_with_config_path(cli.config.as_deref(), *dry_run)?;
if cli.json {
println!(
"{}",
serde_json::to_string(&wire::wire_config_migration(&report))?
);
} else if !report.changed {
println!(
"config 이미 최신입니다 (schema v{}).",
report.to_schema_version
);
} else {
let verb = if report.dry_run { "변경 예정" } else { "적용됨" };
println!(
"config 마이그레이션 {verb}: v{} → v{} ({} changes)",
report.from_schema_version,
report.to_schema_version,
report.changes.len()
);
for c in &report.changes {
println!(" - [{:?}] {}{}", c.kind, c.path, c.detail);
}
if let Some(bak) = &report.backup_path {
println!("백업: {bak}");
}
if report.dry_run {
println!("(--dry-run: 파일 미수정. 적용하려면 --dry-run 없이 재실행)");
}
}
Ok(())
}
},
Cmd::Doctor => {
let report = kebab_app::doctor_with_config_path(cli.config.as_deref())?;
if cli.json {
@@ -1266,84 +1415,106 @@ fn run(cli: &Cli) -> anyhow::Result<()> {
app.run()
}
Cmd::Eval { what } => match what {
EvalWhat::Run {
suite,
mode,
k,
with_rag,
temperature,
seed,
} => {
let opts = kebab_eval::EvalRunOpts {
suite: suite.clone(),
mode: (*mode).into(),
with_rag: *with_rag,
k: *k,
temperature: *temperature,
seed: *seed,
};
let run = kebab_eval::run_eval(&opts)?;
if cli.json {
println!("{}", serde_json::to_string_pretty(&run)?);
} else {
println!("run_id: {}", run.run_id);
println!("queries: {}", run.per_query.len());
let failed = run.per_query.iter().filter(|q| q.error.is_some()).count();
println!("failed: {failed}");
}
Ok(())
}
EvalWhat::Aggregate { run_id } => {
let agg = kebab_eval::compute_aggregate(run_id)?;
kebab_eval::store_aggregate(run_id, &agg)?;
if cli.json {
println!("{}", serde_json::to_string_pretty(&agg)?);
} else {
println!("run_id: {run_id}");
println!("queries: {} ({} failed)", agg.total_queries, agg.failed_queries);
println!("hit@1: {:.4}", agg.hit_at_k.get(&1).copied().unwrap_or(0.0));
println!("hit@5: {:.4}", agg.hit_at_k.get(&5).copied().unwrap_or(0.0));
println!("MRR: {:.4}", agg.mrr);
}
Ok(())
}
EvalWhat::Compare {
run_a,
run_b,
strict_chunker_version,
write_report,
} => {
let cfg = kebab_config::Config::load(None)?;
let opts = kebab_eval::CompareOpts {
strict_chunker_version: *strict_chunker_version,
};
let report = kebab_eval::compare_runs_with_config(&cfg, run_a, run_b, &opts)?;
let md = kebab_eval::render_report_md(&report);
if cli.json {
println!("{}", serde_json::to_string_pretty(&report)?);
} else {
print!("{md}");
}
if *write_report {
let resolved_data_dir = kebab_config::expand_path(&cfg.storage.data_dir, "");
let runs_dir = kebab_config::expand_path(
&cfg.storage.runs_dir,
&resolved_data_dir.to_string_lossy(),
);
let dir = runs_dir.join(run_b);
std::fs::create_dir_all(&dir)?;
let path = dir.join("report.md");
std::fs::write(&path, &md)?;
if !cli.json {
eprintln!("wrote {}", path.display());
Cmd::Eval { what } => {
let cfg = kebab_config::Config::load(cli.config.as_deref())?;
match what {
EvalWhat::Run {
suite,
mode,
k,
with_rag,
temperature,
seed,
} => {
let opts = kebab_eval::EvalRunOpts {
suite: suite.clone(),
mode: (*mode).into(),
with_rag: *with_rag,
k: *k,
temperature: *temperature,
seed: *seed,
};
let run = kebab_eval::run_eval_with_config(&cfg, &opts)?;
if cli.json {
println!("{}", serde_json::to_string_pretty(&run)?);
} else {
println!("run_id: {}", run.run_id);
println!("queries: {}", run.per_query.len());
let failed = run.per_query.iter().filter(|q| q.error.is_some()).count();
println!("failed: {failed}");
}
Ok(())
}
EvalWhat::Aggregate { run_id } => {
let agg = kebab_eval::compute_aggregate_with_config(&cfg, run_id)?;
kebab_eval::store_aggregate_with_config(&cfg, run_id, &agg)?;
if cli.json {
println!("{}", serde_json::to_string_pretty(&agg)?);
} else {
println!("run_id: {run_id}");
println!(
"queries: {} ({} failed)",
agg.total_queries, agg.failed_queries
);
println!(
"hit@1: {:.4}",
agg.hit_at_k.get(&1).copied().unwrap_or(0.0)
);
println!(
"hit@5: {:.4}",
agg.hit_at_k.get(&5).copied().unwrap_or(0.0)
);
println!("MRR: {:.4}", agg.mrr);
}
Ok(())
}
EvalWhat::Variants { run_id, json } => {
let rep = kebab_eval::compute_variant_consistency_with_config(&cfg, run_id)?;
if *json {
println!("{}", serde_json::to_string_pretty(&rep)?);
} else {
print!("{}", kebab_eval::render_variants_md(&rep));
}
Ok(())
}
EvalWhat::Compare {
run_a,
run_b,
strict_chunker_version,
write_report,
} => {
let opts = kebab_eval::CompareOpts {
strict_chunker_version: *strict_chunker_version,
};
let report = kebab_eval::compare_runs_with_config(&cfg, run_a, run_b, &opts)?;
let md = kebab_eval::render_report_md(&report);
if cli.json {
println!("{}", serde_json::to_string_pretty(&report)?);
} else {
print!("{md}");
}
if *write_report {
let resolved_data_dir =
kebab_config::expand_path(&cfg.storage.data_dir, "");
let runs_dir = kebab_config::expand_path(
&cfg.storage.runs_dir,
&resolved_data_dir.to_string_lossy(),
);
let dir = runs_dir.join(run_b);
std::fs::create_dir_all(&dir)?;
let path = dir.join("report.md");
std::fs::write(&path, &md)?;
if !cli.json {
eprintln!("wrote {}", path.display());
}
}
Ok(())
}
Ok(())
}
},
}
Cmd::IngestFile { path } => {
let cfg = kebab_config::Config::load(cli.config.as_deref())?;
@@ -1354,8 +1525,12 @@ fn run(cli: &Cli) -> anyhow::Result<()> {
} else {
println!(
"ingest-file: scanned={} new={} updated={} unchanged={} skipped={} errors={}",
report.scanned, report.new, report.updated,
report.unchanged, report.skipped, report.errors
report.scanned,
report.new,
report.updated,
report.unchanged,
report.skipped,
report.errors
);
}
Ok(())
@@ -1368,20 +1543,20 @@ fn run(cli: &Cli) -> anyhow::Result<()> {
.read_to_string(&mut body)
.context("kebab ingest-stdin: read stdin")?;
let cfg = kebab_config::Config::load(cli.config.as_deref())?;
let report = kebab_app::ingest_stdin_with_config(
cfg,
&body,
title,
source_uri.as_deref(),
)?;
let report =
kebab_app::ingest_stdin_with_config(cfg, &body, title, source_uri.as_deref())?;
if cli.json {
let v = wire::wire_ingest(&report);
println!("{}", serde_json::to_string(&v)?);
} else {
println!(
"ingest-stdin: scanned={} new={} updated={} unchanged={} skipped={} errors={}",
report.scanned, report.new, report.updated,
report.unchanged, report.skipped, report.errors
report.scanned,
report.new,
report.updated,
report.unchanged,
report.skipped,
report.errors
);
}
Ok(())
@@ -1410,10 +1585,7 @@ fn render_ask_plain_citations(
writeln!(w)?;
writeln!(w, "근거:")?;
for (idx, c) in ans.citations.iter().enumerate() {
let marker = c
.marker
.clone()
.unwrap_or_else(|| format!("{}", idx + 1));
let marker = c.marker.clone().unwrap_or_else(|| format!("{}", idx + 1));
// p9-fb-32: `[stale]` prefix on the URI for citations whose
// `stale: true`. Yellow on TTY, plain otherwise — mirrors the
// search-plain renderer in `Cmd::Search`.
@@ -1474,7 +1646,10 @@ fn print_schema_text(s: &kebab_app::SchemaV1) {
println!(" parser_version {}", s.models.parser_version);
println!(" chunker_version {}", s.models.chunker_version);
println!(" embedding_version {}", s.models.embedding_version);
println!(" prompt_template_version {}", s.models.prompt_template_version);
println!(
" prompt_template_version {}",
s.models.prompt_template_version
);
println!(" index_version {}", s.models.index_version);
println!(" corpus_revision {}", s.models.corpus_revision);
println!();
@@ -1523,9 +1698,7 @@ fn confirm_destructive(
/// Confirm prompt for `--orphans-only`: shows the orphan count + a
/// sample of up to 5 paths so the user knows what will be purged before
/// committing. No filesystem paths are removed — only store records.
fn confirm_orphans_only(
orphan_paths: &[kebab_core::WorkspacePath],
) -> anyhow::Result<bool> {
fn confirm_orphans_only(orphan_paths: &[kebab_core::WorkspacePath]) -> anyhow::Result<bool> {
use std::io::Write;
let n = orphan_paths.len();
let mut out = std::io::stderr().lock();
@@ -1538,11 +1711,7 @@ fn confirm_orphans_only(
return Ok(true);
}
let sample: Vec<&str> = orphan_paths
.iter()
.take(5)
.map(|p| p.0.as_str())
.collect();
let sample: Vec<&str> = orphan_paths.iter().take(5).map(|p| p.0.as_str()).collect();
let sample_str = sample.join(", ");
let ellipsis = if n > 5 { ", …" } else { "" };
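Review note: the preview assembled in this hunk (the first five orphan paths, comma-joined, with a trailing ellipsis when more exist) can be sketched in isolation. The helper name `orphan_preview` and the `&[String]` input are assumptions for illustration only; the function in the diff takes `&[kebab_core::WorkspacePath]` and reads `p.0`, and the surrounding prompt text is not shown here.

```rust
/// Hypothetical helper (not part of the diff): builds the preview text from
/// up to five paths, mirroring the `take(5)` / `join(", ")` / ellipsis logic.
fn orphan_preview(paths: &[String]) -> String {
    // Borrow at most the first five paths as &str.
    let sample: Vec<&str> = paths.iter().take(5).map(String::as_str).collect();
    // Signal truncation only when there are more than five paths.
    let ellipsis = if paths.len() > 5 { ", …" } else { "" };
    format!("{}{}", sample.join(", "), ellipsis)
}
```

Keeping the preview in a pure function like this would make the truncation rule testable without touching stderr, though whether that refactor is worthwhile here is a judgment call.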
@@ -1571,19 +1740,28 @@ fn render_fetch_plain(r: &kebab_core::FetchResult) {
if !r.context_before.is_empty() {
println!("\n=== before ===");
for c in &r.context_before {
let heading = c.heading_path.last().map_or("", std::string::String::as_str);
let heading = c
.heading_path
.last()
.map_or("", std::string::String::as_str);
println!("[{} § {}]\n{}\n", c.chunk_id.0, heading, c.text);
}
}
if let Some(c) = &r.chunk {
println!("\n=== target ===");
let heading = c.heading_path.last().map_or("", std::string::String::as_str);
let heading = c
.heading_path
.last()
.map_or("", std::string::String::as_str);
println!("[{} § {}]\n{}\n", c.chunk_id.0, heading, c.text);
}
if !r.context_after.is_empty() {
println!("\n=== after ===");
for c in &r.context_after {
let heading = c.heading_path.last().map_or("", std::string::String::as_str);
let heading = c
.heading_path
.last()
.map_or("", std::string::String::as_str);
println!("[{} § {}]\n{}\n", c.chunk_id.0, heading, c.text);
}
}
@@ -1615,8 +1793,8 @@ mod tests {
//! against a synthetic `Answer` instead.
use super::*;
use kebab_core::{
Answer, AnswerCitation, AnswerRetrievalSummary, Citation, ModelRef,
PromptTemplateVersion, SearchMode, TokenUsage, TraceId, WorkspacePath,
Answer, AnswerCitation, AnswerRetrievalSummary, Citation, ModelRef, PromptTemplateVersion,
SearchMode, TokenUsage, TraceId, WorkspacePath,
};
use time::OffsetDateTime;
@@ -1712,4 +1890,3 @@ mod tests {
);
}
}


@@ -124,11 +124,9 @@ impl ProgressDisplay {
bar.set_length(u64::from(*total));
bar.set_position(0);
bar.set_style(
ProgressStyle::with_template(
"ingest [{bar:30}] {pos}/{len} {wide_msg}",
)
.unwrap()
.progress_chars("=> "),
ProgressStyle::with_template("ingest [{bar:30}] {pos}/{len} {wide_msg}")
.unwrap()
.progress_chars("=> "),
);
bar.set_message("");
}
@@ -159,6 +157,42 @@ impl ProgressDisplay {
// in Completed handles the final state. No per-asset bar update
// here avoids the duplicate-frame artifact in TTY scrollback.
}
// v0.24.0: asset-internal phase visibility. AssetChunked uses the
// bar *message* (live sub-progress for the current asset) —
// distinct from the per-file position draw, so a single large
// document no longer looks frozen. AssetTimings prints a one-line
// breakdown when the asset finishes.
IngestEvent::AssetChunked { idx, total, chunks } => {
if let Some(bar) = self.bar.as_ref() {
bar.set_message(format!("{chunks} chunks"));
}
if !tty && !quiet {
let mut err = std::io::stderr().lock();
let _ = writeln!(err, "ingest: {idx}/{total} → {chunks} chunks");
}
}
IngestEvent::AssetTimings {
parse_ms,
chunk_ms,
embed_ms,
store_ms,
..
} => {
if let Some(bar) = self.bar.as_ref() {
bar.set_message("");
}
if !quiet {
let mut err = std::io::stderr().lock();
let _ = writeln!(
err,
" ⏱ parse {} · chunk {} · embed {} · store {}",
fmt_ms(*parse_ms),
fmt_ms(*chunk_ms),
fmt_ms(*embed_ms),
fmt_ms(*store_ms),
);
}
}
IngestEvent::Completed { counts } => {
if let Some(bar) = self.bar.take() {
bar.finish_and_clear();
@@ -170,11 +204,7 @@ impl ProgressDisplay {
let _ = writeln!(
err,
"ingest: complete (scanned={} new={} updated={} skipped={} errors={})",
counts.scanned,
counts.new,
counts.updated,
counts.skipped,
counts.errors,
counts.scanned, counts.new, counts.updated, counts.skipped, counts.errors,
);
}
}
@@ -193,14 +223,42 @@ impl ProgressDisplay {
let _ = writeln!(
err,
"ingest: aborted (scanned={} new={} updated={} skipped={} errors={})",
counts.scanned,
counts.new,
counts.updated,
counts.skipped,
counts.errors,
counts.scanned, counts.new, counts.updated, counts.skipped, counts.errors,
);
}
}
// v0.20.0 sub-item 1: per-page PDF OCR events — sub-progress lines
// under AssetStarted for scanned PDF. spec §4.6.1 line 1085-1086.
// When skipped=true (DCTDecode missing or engine failure), a skip line is printed.
IngestEvent::PdfOcrStarted { page } => {
if !quiet {
let mut err = std::io::stderr().lock();
let _ = writeln!(err, " 📷 OCR page {page}...");
}
}
IngestEvent::PdfOcrFinished {
page,
ms,
chars,
ocr_engine,
skipped,
..
} => {
if !quiet {
let mut err = std::io::stderr().lock();
if *skipped {
let _ = writeln!(
err,
" ⊘ OCR page {page} skipped (no DCTDecode or engine fail, {ms}ms)"
);
} else {
let _ = writeln!(
err,
" ✓ OCR page {page} ({chars} chars, {ms}ms via {ocr_engine})"
);
}
}
}
}
Ok(())
}
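Review note: the human-mode lines added in this hunk for `AssetTimings` and `PdfOcrFinished` can be sketched as a pure function that returns the line instead of writing to stderr. This is a minimal sketch under stated assumptions: the `Event` enum, the `render_line` name, and the field types (`u32` page, `usize` chars, `String` engine) are hypothetical stand-ins, leading indentation is omitted, and `fmt_ms` is copied here only so the block is self-contained.

```rust
/// Hypothetical, trimmed-down stand-in for two `IngestEvent` variants.
enum Event {
    Timings { parse_ms: u64, chunk_ms: u64, embed_ms: u64, store_ms: u64 },
    OcrFinished { page: u32, ms: u64, chars: usize, engine: String, skipped: bool },
}

/// Durations under one second stay in ms; longer ones become one-decimal seconds.
fn fmt_ms(ms: u64) -> String {
    if ms >= 1000 {
        format!("{:.1}s", ms as f64 / 1000.0)
    } else {
        format!("{ms}ms")
    }
}

/// Returns the line human mode would print, or `None` when `quiet` is set.
fn render_line(event: &Event, quiet: bool) -> Option<String> {
    if quiet {
        return None;
    }
    let line = match event {
        Event::Timings { parse_ms, chunk_ms, embed_ms, store_ms } => format!(
            "⏱ parse {} · chunk {} · embed {} · store {}",
            fmt_ms(*parse_ms),
            fmt_ms(*chunk_ms),
            fmt_ms(*embed_ms),
            fmt_ms(*store_ms),
        ),
        // A skipped page reports only the page number and elapsed time.
        Event::OcrFinished { page, ms, skipped: true, .. } => {
            format!("⊘ OCR page {page} skipped (no DCTDecode or engine fail, {ms}ms)")
        }
        Event::OcrFinished { page, ms, chars, engine, .. } => {
            format!("✓ OCR page {page} ({chars} chars, {ms}ms via {engine})")
        }
    };
    Some(line)
}
```

For example, `render_line(&Event::Timings { parse_ms: 12, chunk_ms: 340, embed_ms: 45_000, store_ms: 1500 }, false)` yields `⏱ parse 12ms · chunk 340ms · embed 45.0s · store 1.5s`. Returning a `String` rather than writing directly would let these formats be asserted in unit tests alongside `fmt_ms`.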
@@ -217,6 +275,17 @@ fn emit_json(event: &IngestEvent) -> anyhow::Result<()> {
Ok(())
}
/// Render a phase duration (milliseconds) compactly for the human-mode
/// `AssetTimings` line: `< 1000ms` stays in `ms`, larger spans collapse to
/// one-decimal seconds so a 45-second embed reads `45.0s`, not `45000ms`.
fn fmt_ms(ms: u64) -> String {
if ms >= 1000 {
format!("{:.1}s", ms as f64 / 1000.0)
} else {
format!("{ms}ms")
}
}
/// Format the current wall-clock as RFC 3339 — used by `wire_ingest_progress`
/// so every emitted event carries an `ts` field per §2.4a / the wire schema.
pub(crate) fn now_rfc3339() -> anyhow::Result<String> {
@@ -231,7 +300,10 @@ mod tests {
#[test]
fn from_flags_json_takes_priority_over_tty() {
assert_eq!(ProgressMode::from_flags(true, false, false), ProgressMode::Json);
assert_eq!(
ProgressMode::from_flags(true, false, false),
ProgressMode::Json
);
}
#[test]
@@ -260,6 +332,15 @@ mod tests {
}
}
#[test]
fn fmt_ms_switches_unit_at_one_second() {
assert_eq!(fmt_ms(0), "0ms");
assert_eq!(fmt_ms(999), "999ms");
assert_eq!(fmt_ms(1000), "1.0s");
assert_eq!(fmt_ms(45_000), "45.0s");
assert_eq!(fmt_ms(1500), "1.5s");
}
#[test]
fn now_rfc3339_parses_back() {
let s = now_rfc3339().unwrap();

Some files were not shown because too many files have changed in this diff