Files
kebab/migrations/V007__fts_trigram.sql
altair823 8dcedc4b11 feat(p10-r2): V007 trigram migration + design §5.5 + fts diff-check
Task A2 + A3 한 묶음.

migrations/V007__fts_trigram.sql 신규:
- chunks_fts shadow 를 DROP + 재생성 (tokenize = trigram).
- chunks_ai/ad/au trigger 재생성 (V002 와 동일).
- chunks 에서 backfill INSERT — 사용자 re-ingest 불필요, V007 자동.
- V002 는 historical cold-upgrade replay 위해 그대로 유지.

design §5.5 갱신:
- verbatim block 의 tokenize 만 trigram 으로 교체.
- §5.5 본문 상단에 한국어 채택 사유 + trade-off (영어 lexical 변경,
  BM25 분포, 디스크 ~2-10x, contentless 아님) prose 한 단락 추가.

crates/kebab-store-sqlite/tests/fts.rs:
- fts_v002_matches_design_section_5_5_verbatim →
  fts_v007_matches_design_section_5_5_verbatim 으로 rename.
- extract_migration_5_5_verbatim_block() 의 include_str! path 를
  V007__fts_trigram.sql 로 변경. 주석/assertion msg V007 로.
- V002 cold-upgrade test 들 (fts_v002_backfill_*) 은 그대로 유지.

검증: cargo test -p kebab-store-sqlite --test fts → 10/10 PASS
(`fts_v007_matches_design_section_5_5_verbatim` 포함).

Codex round 1/2 의 design §5.5 contentless 정정·trigram tokenizer
채택 사유 명시 발견 반영.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-23 00:52:40 +00:00

61 lines
2.9 KiB
SQL
Raw Permalink Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
-- V007__fts_trigram.sql — Replace chunks_fts tokenizer: unicode61 → trigram.
--
-- Per design §5.5 (chunks_fts virtual table + chunks_ai/ad/au triggers).
-- The CREATE VIRTUAL TABLE / CREATE TRIGGER block below is reproduced
-- VERBATIM from `docs/superpowers/specs/2026-04-27-kebab-final-form-design.md`
-- §5.5; CI diff-checks this against the design doc (test
-- `fts_v007_matches_design_section_5_5_verbatim` in
-- `crates/kebab-store-sqlite/tests/fts.rs`).
--
-- Tokenizer choice: trigram. Korean is agglutinative — unicode61 tokenizes
-- whole eojeol (조사·어미 attached) so substring matching fails. trigram
-- indexes 3-character grams, enabling Korean partial matches. Trade-offs:
-- DB size grows (~2-10×), English lexical also moves to substring match
-- (recall↑, precision↓), BM25 score distribution shifts. See
-- `tasks/HOTFIXES.md` (2026-05-22) and the v0.17.0 design doc.
--
-- chunks_fts is a shadow of chunks (NOT contentless — V002 DDL has no
-- `content=''`); this migration drops the old shadow, recreates it with
-- the new tokenizer, recreates the sync triggers (verbatim from V002),
-- and backfills from `chunks`. The `chunks` table and embeddings are
-- untouched, so users do NOT need to re-ingest after upgrading to
-- v0.17.0 — the migration is fully automatic.
DROP TRIGGER IF EXISTS chunks_au;
DROP TRIGGER IF EXISTS chunks_ad;
DROP TRIGGER IF EXISTS chunks_ai;
DROP TABLE IF EXISTS chunks_fts;
-- ── §5.5 verbatim block ────────────────────────────────────────────────
CREATE VIRTUAL TABLE chunks_fts USING fts5(
chunk_id UNINDEXED,
doc_id UNINDEXED,
heading_path,
text,
tokenize = 'trigram'
);
CREATE TRIGGER chunks_ai AFTER INSERT ON chunks BEGIN
INSERT INTO chunks_fts(chunk_id, doc_id, heading_path, text)
VALUES (new.chunk_id, new.doc_id, new.heading_path_json, new.text);
END;
CREATE TRIGGER chunks_ad AFTER DELETE ON chunks BEGIN
DELETE FROM chunks_fts WHERE chunk_id = old.chunk_id;
END;
CREATE TRIGGER chunks_au AFTER UPDATE ON chunks BEGIN
DELETE FROM chunks_fts WHERE chunk_id = old.chunk_id;
INSERT INTO chunks_fts(chunk_id, doc_id, heading_path, text)
VALUES (new.chunk_id, new.doc_id, new.heading_path_json, new.text);
END;
-- ── End §5.5 verbatim block ───────────────────────────────────────────
-- One-shot backfill from existing chunks. Mirrors the V002 backfill
-- pattern — direct INSERT into chunks_fts bypasses chunks_ai trigger
-- (trigger fires on chunks INSERT, not chunks_fts INSERT), so no
-- double-insert. Refinery runs V007 exactly once via its bookkeeping
-- table, so this is naturally idempotent across restarts.
INSERT INTO chunks_fts(chunk_id, doc_id, heading_path, text)
SELECT chunk_id, doc_id, heading_path_json, text FROM chunks;