feat(p10-r2): V007 trigram migration + design §5.5 + fts diff-check
Task A2 + A3 한 묶음. migrations/V007__fts_trigram.sql 신규: - chunks_fts shadow 를 DROP + 재생성 (tokenize = trigram). - chunks_ai/ad/au trigger 재생성 (V002 와 동일). - chunks 에서 backfill INSERT — 사용자 re-ingest 불필요, V007 자동. - V002 는 historical cold-upgrade replay 위해 그대로 유지. design §5.5 갱신: - verbatim block 의 tokenize 만 trigram 으로 교체. - §5.5 본문 상단에 한국어 채택 사유 + trade-off (영어 lexical 변경, BM25 분포, 디스크 ~2-10x, contentless 아님) prose 한 단락 추가. crates/kebab-store-sqlite/tests/fts.rs: - fts_v002_matches_design_section_5_5_verbatim → fts_v007_matches_design_section_5_5_verbatim 으로 rename. - extract_migration_5_5_verbatim_block() 의 include_str! path 를 V007__fts_trigram.sql 로 변경. 주석/assertion msg V007 로. - V002 cold-upgrade test 들 (fts_v002_backfill_*) 은 그대로 유지. 검증: cargo test -p kebab-store-sqlite --test fts → 10/10 PASS (`fts_v007_matches_design_section_5_5_verbatim` 포함). Codex round 1/2 의 design §5.5 contentless 정정·trigram tokenizer 채택 사유 명시 발견 반영. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
60
migrations/V007__fts_trigram.sql
Normal file
60
migrations/V007__fts_trigram.sql
Normal file
@@ -0,0 +1,60 @@
|
||||
-- V007__fts_trigram.sql — Replace chunks_fts tokenizer: unicode61 → trigram.
|
||||
--
|
||||
-- Per design §5.5 (chunks_fts virtual table + chunks_ai/ad/au triggers).
|
||||
-- The CREATE VIRTUAL TABLE / CREATE TRIGGER block below is reproduced
|
||||
-- VERBATIM from `docs/superpowers/specs/2026-04-27-kebab-final-form-design.md`
|
||||
-- §5.5; CI diff-checks this against the design doc (test
|
||||
-- `fts_v007_matches_design_section_5_5_verbatim` in
|
||||
-- `crates/kebab-store-sqlite/tests/fts.rs`).
|
||||
--
|
||||
-- Tokenizer choice: trigram. Korean is agglutinative — unicode61 tokenizes
|
||||
-- whole eojeol (조사·어미 attached) so substring matching fails. trigram
|
||||
-- indexes 3-character grams, enabling Korean partial matches. Trade-offs:
|
||||
-- DB size grows (~2-10×), English lexical also moves to substring match
|
||||
-- (recall↑, precision↓), BM25 score distribution shifts. See
|
||||
-- `tasks/HOTFIXES.md` (2026-05-22) and the v0.17.0 design doc.
|
||||
--
|
||||
-- chunks_fts is a shadow of chunks (NOT contentless — V002 DDL has no
|
||||
-- `content=''`); this migration drops the old shadow, recreates it with
|
||||
-- the new tokenizer, recreates the sync triggers (verbatim from V002),
|
||||
-- and backfills from `chunks`. The `chunks` table and embeddings are
|
||||
-- untouched, so users do NOT need to re-ingest after upgrading to
|
||||
-- v0.17.0 — the migration is fully automatic.
|
||||
|
||||
DROP TRIGGER IF EXISTS chunks_au;
|
||||
DROP TRIGGER IF EXISTS chunks_ad;
|
||||
DROP TRIGGER IF EXISTS chunks_ai;
|
||||
DROP TABLE IF EXISTS chunks_fts;
|
||||
|
||||
-- ── §5.5 verbatim block ────────────────────────────────────────────────
|
||||
|
||||
CREATE VIRTUAL TABLE chunks_fts USING fts5(
|
||||
chunk_id UNINDEXED,
|
||||
doc_id UNINDEXED,
|
||||
heading_path,
|
||||
text,
|
||||
tokenize = 'trigram'
|
||||
);
|
||||
|
||||
CREATE TRIGGER chunks_ai AFTER INSERT ON chunks BEGIN
|
||||
INSERT INTO chunks_fts(chunk_id, doc_id, heading_path, text)
|
||||
VALUES (new.chunk_id, new.doc_id, new.heading_path_json, new.text);
|
||||
END;
|
||||
CREATE TRIGGER chunks_ad AFTER DELETE ON chunks BEGIN
|
||||
DELETE FROM chunks_fts WHERE chunk_id = old.chunk_id;
|
||||
END;
|
||||
CREATE TRIGGER chunks_au AFTER UPDATE ON chunks BEGIN
|
||||
DELETE FROM chunks_fts WHERE chunk_id = old.chunk_id;
|
||||
INSERT INTO chunks_fts(chunk_id, doc_id, heading_path, text)
|
||||
VALUES (new.chunk_id, new.doc_id, new.heading_path_json, new.text);
|
||||
END;
|
||||
|
||||
-- ── End §5.5 verbatim block ───────────────────────────────────────────
|
||||
|
||||
-- One-shot backfill from existing chunks. Mirrors the V002 backfill
|
||||
-- pattern — direct INSERT into chunks_fts bypasses chunks_ai trigger
|
||||
-- (trigger fires on chunks INSERT, not chunks_fts INSERT), so no
|
||||
-- double-insert. Refinery runs V007 exactly once via its bookkeeping
|
||||
-- table, so this is naturally idempotent across restarts.
|
||||
INSERT INTO chunks_fts(chunk_id, doc_id, heading_path, text)
|
||||
SELECT chunk_id, doc_id, heading_path_json, text FROM chunks;
|
||||
Reference in New Issue
Block a user