별칭을 줄별 개별 dense 벡터(sentinel `{chunk}#alias#N`)로 색인하고
boilerplate 청크는 별칭 생성을 skip. 묶음 1벡터 방식은 평균화로 특정
표현이 희석돼 오히려 회귀(13/18)했던 것을 폐기. 변형 일관성 14/18 →
16/18, mean_spread@10 0.222 → 0.111 (나무위키 ~1000 문서 CS corpus).
`kebab-core::strip_alias_suffix` 가 suffix 형과 per-alias 형 둘 다 처리.
파생물 캐시(V012): embedding 벡터 + 별칭 LLM 결과를 청크 내용 해시
키로 캐싱해 재색인 시 내용 불변 청크의 재계산을 skip. cache_key =
blake3(kind ‖ text_blake3 ‖ version_key)[:32], version_key 에
model/prompt/dimensions 포함 → §9 cascade 와 정합(버전 bump 시 자동
miss). 측정: 정답 3개 cold 1879s → warm 13s ≈ 145배. 순수 가산이라
corpus_revision bump 없음. search/ask 는 kebab.sqlite+lancedb 만으로
동작 → 외부 서버 색인 후 DB 만 복사하는 이식 워크플로 가능.
V012 schema migration + 신규 surface 로 workspace version 0.20.2 →
0.21.0 (minor) bump. README/HANDOFF/ARCHITECTURE/HOTFIXES sync.
known limitation: stack·svm 설명형 2개 잔존 + grounded 판정이 부분
인용을 grounded 로 오분류(후속 후보).
측정 상세: docs/superpowers/handoffs/2026-05-31-namu-wiki-alias-cache-study.md
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
별칭 dense 벡터를 sentinel chunk_id({orig}#alias)로 색인하려면 chunks 에 없는
chunk_id 가 embedding_records 에 들어가야 한다. V001 의 chunk_id REFERENCES chunks
ON DELETE CASCADE FK 가 이를 SQLite 787 로 막으므로 테이블을 FK 없이 재생성한다.
status/vector_committed(V003) + 3개 인덱스 보존, chunks_bd_tombstone_embeddings
trigger 무수정. DROP→RENAME 시 dangling trigger 재파싱을 피하려 legacy_alter_table=ON.
사라진 CASCADE 는 put_chunks + purge 두 경로(purge_orphan_at_workspace_path,
purge_deleted_workspace_path)의 명시 DELETE 로 대체 — chunks 삭제 직전 원본 +
{id}#alias sentinel embedding_records 를 함께 정리. corpus_revision baseline 2→3.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- M1: chunk_aliases trigger 가드에 AND aliases <> '' (빈 문자열 미색인)
- M2: 재색인 멱등 테스트 (재-put 후 별칭 행 1개)
- N1: 본문 격리 음성 단언 (별칭 term 이 chunks_fts 로 누출 안 됨)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
V007 trigram tokenizer 의 한국어 2자 query 0-hit 한계 (Bug #8) 해소를
위한 V009 migration 추가. unicode61 tokenizer 로 환원 + 한국어 형태소
분해 결과를 별 column `tokenized_korean_text` 에 pre-fill 하는 방식.
- migrations/V009__fts_korean_morphological.sql 신규: column ADD,
chunks_fts DROP+재정의, 3 trigger CASE expression, backfill INSERT,
corpus_revision bump.
- design §5.5 갱신: trigram → unicode61 + 형태소 column. CASE
expression trigger 본문.
- crates/kebab-store-sqlite/tests/fts.rs: V007 verbatim test 를
V009 source-of-truth 로 rename. v009_bumps_corpus_revision unit
test 추가.
- store.rs: clippy bool_to_int_with_if + cast_lossless 기존 경고 수정
(pdf_ocr_events 관련 코드, S1 작업 중 발견).
영어 substring 매칭은 V002 (whole-token only) 로 회귀 — spec §3
Non-Goals + 후속 release notes (v0.20.1) 에서 정직히 기술.
Spec: docs/superpowers/specs/2026-05-28-v0.20.x-korean-morphological-tokenizer-spec.md
Plan: docs/superpowers/plans/2026-05-28-v0.20.x-korean-morphological-tokenizer-plan.md (S1)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add migrations/V008__pdf_ocr_events.sql with the events table + 3
indices (doc_id, run_id, ts). SqliteStore gains two pub fn:
record_pdf_ocr_event (insert one OCR sample) and prune_pdf_ocr_events
(delete rows older than retention_days; returns the affected row
count). Both follow the existing Mutex<Connection> lock pattern.
Wiring into ingest path lands in the next commit.
Closure r1 F2: explicit lock acquisition in both methods.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Add V006__incremental_ingest.sql to persist last_chunker_version and
last_embedding_version on the documents table. Wire both columns into
upsert_document (INSERT + ON CONFLICT UPDATE) and get_document (SELECT +
row mapper), replacing the previous hardcoded None. Add two round-trip
tests in tests/incremental_ingest.rs covering the set and None cases.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- `App::build_retriever(mode) -> Result<Arc<dyn Retriever>>` 추출.
`ask` 와 `ask_with_session` 모두 사용. 35+ 줄 retriever stack 중복
제거 — 미래 retriever 변경이 한 곳만.
- V005 migration `chat_sessions.sql` 의 `citations_json` doc 수정:
`Vec<Citation>` → `Vec<AnswerCitation>` (실제 stored type 과 일치).
AnswerCitation 가 marker + Citation 등 포함하므로 deserialize 시
type mismatch 회피.
15 app lib + 9 store chat_sessions + clippy 통과.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
도그푸딩 item 15 — TUI / 같은 process 안에서 동일 query 반복 시 SQLite
FTS + Lance + RRF 재계산이 매번 발생하던 비용 해소. in-process LRU
캐시 + 모노토닉 corpus_revision 카운터로 ingest commit 발생 시 모든
entry 자동 stale.
## 핵심 변경
- **SQLite V004 migration**: `kv (key TEXT PRIMARY KEY, value TEXT)
STRICT` + `corpus_revision = '0'` seed. 미래의 다른 scalar 도 같은
테이블에 들어갈 수 있는 generic shape.
- **`SqliteStore::corpus_revision()` / `bump_corpus_revision()`** —
`UPDATE ... CAST AS INTEGER + 1` atomic. INSERT-OR-IGNORE 도 함께
실행 (V004 seed 가 무슨 이유로 누락된 케이스 paranoid).
- **`kebab-app::ingest_with_config_cancellable`** — `new + updated > 0`
시 bump, no-op (skipped-only) reingest 는 cache 보존.
- **`App.search_cache: Option<Mutex<LruCache<SearchCacheKey, Vec<
SearchHit>>>>`** — `config.search.cache_capacity` (default 256, 0
비활성). `lru = "0.12"` workspace dep 추가.
- **`SearchCacheKey`** = `query_norm` (NFKC + trim + lowercase) +
`mode` + `k` + `snippet_chars` + `embedding_version` (vector/hybrid
만, lexical 은 빈 문자열) + `chunker_version` + `corpus_revision`
snapshot.
- **`App::search`** rewrite — cache 활성 시 lookup → miss 면 기존
`search_uncached` 호출 후 put. cache 비활성이거나 lock 실패면
straight-line.
- **`App::search_uncached`** (rename of pre-fb-19 `search` body) +
`search_uncached_with_config` facade — CLI `kebab search --no-cache`
로 진입.
- **`Config.search.cache_capacity: usize`** field, `#[serde(default)]`
로 기존 config 호환.
- **CLI `--no-cache`** flag — 디버깅용 (CLI 는 매 호출이 새 process
라 사실상 no-op 이지만 spec 명시 + 향후 long-lived process 호환).
- **frozen design §9 versioning** 표에 `corpus_revision` row 추가
(기존 `index_version` 라벨과 다른 차원: 라벨은 retrieval 형상,
corpus_revision 은 ingest commit ack).
## 테스트
- `kebab-store-sqlite` 신규 3 unit (fresh=0, monotonic bump, persist
across reopen)
- `kebab-app` 신규 4 integration (cached repeat 같은 hits, NFKC 정규화
로 case/whitespace collapse, --no-cache parity, first ingest bumps
corpus_revision)
- 워크스페이스 전체 `cargo test --workspace --no-fail-fast -j 1` exit 0
- `cargo clippy --workspace --all-targets -- -D warnings` clean
## 문서
- README `kebab search` 행: 캐시 동작 + `--no-cache` 안내 + corpus_
revision 무효화 메커니즘
- docs/SMOKE.md `[search]` 절에 `cache_capacity` 라인 추가
- HANDOFF: 2026-05-03 entry
- spec status planned → in_progress
## Out of scope
- patch-and-merge incremental (RRF 정규화 전체 hit set 기준이라 어려움)
- SQLite 영속 cache (P+)
- 다른 process 간 cache 공유 (in-process 만 — corpus_revision 이
cross-process 무효화는 O(1))
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
First VectorStore implementation. Per-model Lance tables under
config.storage.vector_dir, two-phase upsert (SQLite-pending → Lance
MergeInsert → SQLite-committed) with crash-safe retry, search via
cosine distance with the spec's score-shift (preserves negative
similarity ranking signal that clamping would crush).
V003 migration:
- Adds status (CHECK constraint pending|committed|tombstone, default
pending) and vector_committed columns to embedding_records.
- BEFORE DELETE trigger on chunks flips dependent rows to tombstone.
Currently overshadowed by V001's ON DELETE CASCADE FK; trigger UPDATE
runs first then row vanishes via CASCADE. Spec-faithful tombstone
preservation requires recreating embedding_records to drop the
CASCADE — deferred to a P+ migration since no production rows exist
yet (P3-3 is the first writer). V003 SQL comment explains.
LanceVectorStore:
- ensure_table is idempotent: opens existing or creates with the
Arrow schema (chunk_id, doc_id, embedding FixedSizeList<Float32,
dim>, model_id, embedding_version, text, heading_path, created_at).
- IndexId computed via id_for_index with collection="chunk_embeddings",
index_kind="flat", params_hash = blake3(descriptor JSON). Schema
bumps automatically rotate the IndexId.
- upsert: phase-1 INSERT OR REPLACE INTO embedding_records (status=
'pending') in a single SQLite tx; phase-2 Lance MergeInsert keyed
on chunk_id (idempotent re-run); phase-3 UPDATE status='committed',
vector_committed=1. If phase-2 fails the rows stay 'pending' and
the next upsert call retries idempotently.
- search joins embedding_records WHERE status='committed' so partial-
write rows never surface. Cosine distance from Lance ∈ [0, 2] →
similarity = 1 - distance ∈ [-1, 1] → score = (similarity + 1)/2 ∈
[0, 1]. NaN coerced to 0 with tracing::warn. Filter by SearchFilters
via SqliteStore::filter_chunks (added in this commit).
- Sync trait + async LanceDB bridged by an embedded current-thread
tokio runtime. Doc-comment on the struct flags the "do NOT call
from inside another tokio runtime" panic (block_on cannot nest).
kb-app's job scheduler is sync today.
kb-store-sqlite additions:
- pub fn put_embedding_records_pending(&[EmbeddingRecordRow]) — phase-1
INSERT OR REPLACE (status='pending', vector_committed=0).
- pub fn mark_embedding_records_committed(&[EmbeddingId]) — phase-3
single UPDATE … WHERE embedding_id IN (?, ?, …) via
params_from_iter, guarded by WHERE status='pending' so tombstones
don't get clobbered.
- pub fn filter_chunks(&[ChunkId], &SearchFilters) → Vec<ChunkId>
consolidates the JOIN against documents/document_tags/
embedding_records + path_glob via globset. Lets kb-store-vector
honor SearchFilters without depending on rusqlite or globset
directly. (kb-search's filter logic is structurally different —
interleaved with the FTS5 SELECT — so it stays as-is for now;
consolidation is a P+ refactor.)
- 4 new unit tests cover the phase-1 round-trip, empty batch,
replay reset of pending rows, and the WHERE-status-pending guard.
Tests:
- 9 lib unit tests in kb-store-vector covering paths/sanitization,
arrow_batch dim validation + descriptor hash, bm25-style cosine
score shift math.
- 4 new kb-store-sqlite unit tests on filter_chunks (committed-only,
tags/lang/trust/path_glob, order preservation, empty input).
- 4 new kb-store-sqlite unit tests on the embedding_records helpers.
- 8 integration tests in upsert_search.rs and 1 snapshot test marked
#[ignore = "requires AVX-capable hardware (LanceDB)"]. They invoke
require_avx_or_panic() at the top of each body so a missing-AVX
--ignored run fails loudly instead of silently passing. This dev
host (qemu64 model) lacks AVX so these were NOT exercised end-to-
end here — first CI lane on AVX hardware will validate them.
- Snapshot fixture tests/fixtures/vector/run-1.json is a placeholder
with an _comment marker. Snapshot test panics until the placeholder
is replaced via KB_UPDATE_SNAPSHOTS=1 on AVX hardware.
- Workspace 241 passed, 19 ignored, 0 failed; cargo clippy --workspace
--all-targets -- -D warnings clean.
Allowed deps respected (kb-core, kb-config, kb-store-sqlite, lancedb,
arrow + arrow-array + arrow-schema, serde, serde_json, tracing,
thiserror) plus forced waivers — anyhow (trait return type), tokio
+ futures (LanceDB async-only API), blake3 (params_hash). rusqlite
and globset are NOT direct deps of kb-store-vector — confirmed via
cargo metadata --no-deps. rusqlite stays in [dev-dependencies] for
the test fixture seeder only.
Out of scope: IVF/PQ index tuning (P+), image vectors (P6), kb-app
embed_index orchestration (P3-4 facade).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds FTS5 lexical index for chunks per design §5.5: chunks_fts virtual
table (unicode61 remove_diacritics 2 tokenizer, contentless w/ UNINDEXED
chunk_id+doc_id) plus chunks_ai/chunks_ad/chunks_au triggers that mirror
every chunks mutation into chunks_fts inside the host transaction.
V002 ships the verbatim §5.5 SQL block plus a one-shot backfill INSERT
so existing P1 databases gain searchability without re-ingest. Refinery
bookkeeping makes double-apply naturally idempotent.
Adds rebuild_chunks_fts(&Connection) escape hatch for kb index
--rebuild-fts (CLI wiring deferred to a later task). Uses SAVEPOINT
instead of Transaction so callers can invoke from inside an outer
transaction; WAL serializes writers so no DELETE/INSERT race vs.
concurrent chunks mutators is possible.
Tests (10): V001-only → V002 cold-upgrade backfill (literal path),
chunks_ai/ad/au trigger sync, MATCH-token verification, rebuild
idempotency, drift recovery, double-run no-op, V002 ↔ design §5.5
verbatim diff guard (anchored extraction from both files), WAL/SHM
release on store drop. All 185 workspace tests pass.
Allowed deps respected (kb-core, kb-config, rusqlite, refinery — no
new deps). FTS query implementation deferred to p2-2 (lexical-retriever).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>