feat(fts): add V009 korean morphological tokenizer migration

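Add the V009 migration to resolve the V007 trigram tokenizer's
limitation of returning 0 hits for two-character Korean queries
(Bug #8). It reverts to the unicode61 tokenizer and pre-fills the
Korean morphological analysis output into a separate column,
`tokenized_korean_text`.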
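
Why a two-character query cannot hit a trigram index can be sketched
as follows. This is a simplified model of trigram indexing, not
SQLite's tokenizer code; the `trigrams` helper is a hypothetical name
used only for illustration.

```rust
/// Split `text` into overlapping 3-character windows (trigrams).
/// Illustrative model only; not SQLite's actual tokenizer.
fn trigrams(text: &str) -> Vec<String> {
    // Work on Unicode scalar values so each Hangul syllable counts as one character.
    let chars: Vec<char> = text.chars().collect();
    // `windows(3)` yields nothing when fewer than 3 characters are present,
    // so a 2-character query produces no terms to look up.
    chars.windows(3).map(|w| w.iter().collect()).collect()
}
```

Under this model, `trigrams("검색")` is empty while `trigrams("검색어")`
yields one term, which matches the 0-hit behavior described above.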

- migrations/V009__fts_korean_morphological.sql (new): column ADD,
  chunks_fts DROP + redefinition, CASE expressions in 3 triggers,
  backfill INSERT, corpus_revision bump.
- design §5.5 updated: trigram → unicode61 + morpheme column. Includes
  the trigger bodies with CASE expressions.
- crates/kebab-store-sqlite/tests/fts.rs: rename the V007 verbatim
  test to use V009 as its source of truth. Add the
  v009_bumps_corpus_revision unit test.
- store.rs: fix pre-existing clippy bool_to_int_with_if + cast_lossless
  warnings (pdf_ocr_events-related code, found during S1 work).
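
The two clippy fixes in store.rs follow the pattern below. This is a
minimal sketch: the function names `success_flag` and
`retention_days_i64` are hypothetical wrappers for illustration, not
functions in the crate.

```rust
/// Convert a success flag to the 0/1 integer bound as a SQL parameter.
fn success_flag(success: bool) -> i32 {
    // Replaces `if success { 1i32 } else { 0i32 }` (clippy::bool_to_int_with_if).
    i32::from(success)
}

/// Widen a retention period from u32 to i64 without an `as` cast.
fn retention_days_i64(retention_days: u32) -> i64 {
    // Replaces `retention_days as i64` (clippy::cast_lossless).
    // `From` only exists for conversions that cannot lose information.
    i64::from(retention_days)
}
```

Both conversions are infallible, so behavior is unchanged; the `From`
form simply stops compiling if the source type is ever changed to one
that no longer fits.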

English substring matching regresses to V002 behavior (whole-token
only); this is documented candidly in spec §3 Non-Goals and the
follow-up release notes (v0.20.1).
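
What "whole-token only" means for English queries can be sketched as
below. This is a rough model that splits on non-alphanumeric
characters and lowercases; it is an assumption for illustration and
does not reproduce unicode61's exact rules. `tokens` and
`whole_token_match` are hypothetical names.

```rust
/// Split text into lowercase tokens on non-alphanumeric boundaries.
/// Simplified stand-in for a word tokenizer; not unicode61 itself.
fn tokens(text: &str) -> Vec<String> {
    text.split(|c: char| !c.is_alphanumeric())
        .filter(|t| !t.is_empty())
        .map(str::to_lowercase)
        .collect()
}

/// True only when `query` equals one complete token of `text`.
fn whole_token_match(text: &str, query: &str) -> bool {
    let query = query.to_lowercase();
    tokens(text).iter().any(|t| *t == query)
}
```

With this model, `tokenizer` matches "Korean tokenizer migration" but
the substring `token` does not, which is the regression noted above.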

Spec: docs/superpowers/specs/2026-05-28-v0.20.x-korean-morphological-tokenizer-spec.md
Plan: docs/superpowers/plans/2026-05-28-v0.20.x-korean-morphological-tokenizer-plan.md (S1)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-28 09:48:46 +00:00
parent 43366b1b15
commit b106120e93
4 changed files with 149 additions and 29 deletions

@@ -1028,7 +1028,7 @@ impl SqliteStore {
image_height,
ms,
chars,
-if success { 1i32 } else { 0i32 },
+i32::from(success),
reason,
ocr_engine
],
@@ -1042,7 +1042,7 @@ impl SqliteStore {
/// means "delete everything older than now" (i.e. all past rows).
pub fn prune_pdf_ocr_events(&self, retention_days: u32) -> anyhow::Result<u64> {
use time::format_description::well_known::Rfc3339;
-let cutoff = time::OffsetDateTime::now_utc() - time::Duration::days(retention_days as i64);
+let cutoff = time::OffsetDateTime::now_utc() - time::Duration::days(i64::from(retention_days));
let cutoff_ts = cutoff
.format(&Rfc3339)
.unwrap_or_else(|_| "1970-01-01T00:00:00Z".to_string());