test(fts,app): V009 morphological tokenizer integration tests

Add 4 new tests:

- crates/kebab-store-sqlite/tests/fts.rs:
  - fts_v009_korean_morphological_2char_query_hits: the 2-char query
    '한국' hits a chunk whose tokenized_korean_text column is populated.
  - fts_v009_english_whole_token_only: V007 trigram substring matching
    regresses (Path A): the query 'token' gets 0 hits on a 'tokenizer'
    chunk.
- crates/kebab-app/tests/search_korean.rs:
  - korean_morphological_2char_query_lexical_mode: end-to-end; ingest a
    Korean wiki fixture, then the queries '한국' / '서울' hit.
  - korean_morphological_mixed_english_korean_query: 'Rust' English
    whole-token + '최적화' Korean morpheme hit.
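
A simplified illustration of the whole-token behavior these tests pin
down. This is a standalone sketch, not the FTS5 tokenizer itself: the
helper names whole_token_hit and substring_hit are hypothetical, and
splitting on non-alphanumeric characters only approximates unicode61.

```rust
/// Whole-token matching (simplified stand-in for unicode61): split on any
/// non-alphanumeric character, lowercase, and compare complete tokens.
fn whole_token_hit(text: &str, query: &str) -> bool {
    let q = query.to_lowercase();
    text.split(|c: char| !c.is_alphanumeric())
        .filter(|t| !t.is_empty())
        .any(|t| t.to_lowercase() == q)
}

/// Substring matching, roughly what a trigram index permits for queries
/// of three or more characters.
fn substring_hit(text: &str, query: &str) -> bool {
    text.to_lowercase().contains(&query.to_lowercase())
}
```

Under this model 'token' is a substring of 'tokenizer' but not a whole
token, so only substring_hit returns true. Likewise '한국' is not a whole
token of '한국어를', which is why the morphologically pre-tokenized column
is needed for the 2-char Korean query to hit.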

crates/kebab-search/src/lexical.rs:
  - build_match_string(): MIN_TRIGRAM_CHARS(3) → MIN_QUERY_CHARS(2).
    V009 unicode61 has no minimum token length, so 2-char Korean
    morpheme queries must pass through. Single-char terms are still
    filtered out.
  - Update 2 related unit tests to V009 behavior.
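
A minimal sketch of the length filter described above. It is not the
actual build_match_string(): the function name, the Option return type,
the whitespace splitting, and the " OR " join are all assumptions made
for illustration. The point it shows is that the threshold has to count
characters, not bytes, because each Hangul syllable is 3 bytes in UTF-8.

```rust
/// Minimum query-term length in characters (value taken from the commit
/// message; everything else in this sketch is hypothetical).
const MIN_QUERY_CHARS: usize = 2;

/// Build an FTS5 MATCH string from a raw query, dropping terms shorter
/// than MIN_QUERY_CHARS. Returns None when no term survives the filter.
fn build_match_string_sketch(query: &str) -> Option<String> {
    let terms: Vec<String> = query
        .split_whitespace()
        // Count chars, not bytes: "한국".len() is 6, but it has 2 chars.
        .filter(|t| t.chars().count() >= MIN_QUERY_CHARS)
        // FTS5 string literals are double-quoted; embedded quotes are doubled.
        .map(|t| format!("\"{}\"", t.replace('"', "\"\"")))
        .collect();
    if terms.is_empty() {
        None
    } else {
        Some(terms.join(" OR "))
    }
}
```

With this filter the 2-char query '한국' passes through, while the
single-char query '한' yields None.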

The fixture text depends on the actual segmentation behavior of lindera
ko-dic (predicted from prior knowledge in spec Appendix B). The fixtures
may be adjusted once real output is measured.

Spec: docs/superpowers/specs/2026-05-28-v0.20.x-korean-morphological-tokenizer-spec.md §9.1, §9.2
Plan: docs/superpowers/plans/2026-05-28-v0.20.x-korean-morphological-tokenizer-plan.md (S7)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-28 11:38:52 +00:00
parent f94e0c4a9b
commit c5de5f812b
3 changed files with 177 additions and 34 deletions


@@ -127,3 +127,72 @@ fn lexical_mixed_korean_english_multi_token_query_hits() {
hits.iter().map(|h| &h.doc_path.0).collect::<Vec<_>>()
);
}
// ── S7 V009 morphological tokenizer end-to-end tests ─────────────────
/// S7: V009 morphological tokenizer. A 2-char Korean query hits on the
/// end-to-end lexical path. lindera ko-dic is expected to split '한국어를'
/// into the morpheme '한국어' and '서울은' into '서울'; the morphemes are
/// written to the tokenized_korean_text column and matched by FTS5.
#[test]
fn korean_morphological_2char_query_lexical_mode() {
    let env = TestEnv::lexical_only();
    let doc_path = env.workspace_root.join("korean-wiki.md");
    std::fs::write(
        &doc_path,
        "# 한국어 위키\n\n한국어를 공부합니다.\n서울은 한국의 수도입니다.\n",
    )
    .expect("write korean-wiki fixture");
    kebab_app::ingest_with_config(env.config.clone(), env.scope(), true)
        .expect("ingest must succeed");
    let hits = kebab_app::search_with_config(env.config.clone(), common::lexical_query("한국"))
        .expect("search 한국");
    assert!(
        !hits.is_empty(),
        "'한국' 2-char Korean query must return at least one hit (V009 morphological); got {:?}",
        hits.iter().map(|h| &h.doc_path.0).collect::<Vec<_>>()
    );
    let hits = kebab_app::search_with_config(env.config.clone(), common::lexical_query("서울"))
        .expect("search 서울");
    assert!(
        !hits.is_empty(),
        "'서울' 2-char Korean query must return at least one hit; got {:?}",
        hits.iter().map(|h| &h.doc_path.0).collect::<Vec<_>>()
    );
}
/// S7: V009 morphological tokenizer. A mixed Korean-English query hits on
/// the lexical path: 'Rust' (English whole-token) and '최적화' (Korean
/// morpheme) each hit.
#[test]
fn korean_morphological_mixed_english_korean_query() {
    let env = TestEnv::lexical_only();
    let doc_path = env.workspace_root.join("rust-optimization.md");
    std::fs::write(
        &doc_path,
        "# Rust 최적화 노트\n\nRust 최적화는 zero-cost abstraction 을 강조한다.\n",
    )
    .expect("write rust-optimization fixture");
    kebab_app::ingest_with_config(env.config.clone(), env.scope(), true)
        .expect("ingest must succeed");
    let hits = kebab_app::search_with_config(env.config.clone(), common::lexical_query("Rust"))
        .expect("search Rust");
    assert!(
        !hits.is_empty(),
        "'Rust' English whole-token must hit; got {:?}",
        hits.iter().map(|h| &h.doc_path.0).collect::<Vec<_>>()
    );
    let hits =
        kebab_app::search_with_config(env.config.clone(), common::lexical_query("최적화"))
            .expect("search 최적화");
    assert!(
        !hits.is_empty(),
        "'최적화' Korean morpheme must hit; got {:?}",
        hits.iter().map(|h| &h.doc_path.0).collect::<Vec<_>>()
    );
}