feat(chunk): integrate lindera korean morphological tokenizer

Split text into the morpheme sequence stored in V009's
tokenized_korean_text column, using lindera ko-dic. The tokenizer is
called right after chunk creation in the chunk builder pipeline → the
result is pre-filled into a field on the Chunk struct → the store's
put_chunks INSERTs it within a single transaction.

- crates/kebab-core/src/chunk.rs: add a
  tokenized_korean_text: Option<String> field to the Chunk struct
  (#[serde(default)]).
- crates/kebab-chunk/src/lib.rs: tokenize_korean_morphological()
  helper + OnceLock caching + fallback (None) policy.
- crates/kebab-chunk/Cargo.toml: add lindera features = ["embed-ko-dic"]
  (required to enable DictionaryKind::KoDic).
- All chunkers (tier2_shared, md_heading_v1, pdf_page_v1, the 9
  code AST v1 chunkers): pre-fill tokenized_korean_text in the Chunk
  literal.
- crates/kebab-store-sqlite/src/documents.rs::put_chunks: update the
  INSERT SQL column list, placeholders, and bindings (12th column).
- crates/kebab-chunk/tests/tokenize_korean.rs: 2 unit tests.
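
The OnceLock caching and None fallback listed above can be sketched roughly as follows. This is a std-only illustration, not the code in crates/kebab-chunk/src/lib.rs: the Analyzer type and its load/surfaces methods are hypothetical stand-ins for the lindera tokenizer, the whitespace split is a placeholder for real morphological analysis, and joining tokens with a single space and returning None for empty output are assumptions.

```rust
use std::sync::OnceLock;

/// Hypothetical stand-in for a loaded morphological analyzer.
/// In the real helper this role is played by a lindera tokenizer
/// backed by the embedded ko-dic dictionary.
struct Analyzer;

impl Analyzer {
    /// Stand-in for dictionary loading; a real load may fail and yield None.
    fn load() -> Option<Analyzer> {
        Some(Analyzer)
    }

    /// Placeholder for morpheme splitting: a plain whitespace split,
    /// NOT real morphological analysis.
    fn surfaces(&self, text: &str) -> Vec<String> {
        text.split_whitespace().map(str::to_owned).collect()
    }
}

/// Process-wide cache. The load runs at most once; a failed load is
/// cached as None so later calls fall back without retrying.
static ANALYZER: OnceLock<Option<Analyzer>> = OnceLock::new();

/// Returns the space-joined token sequence, or None when the analyzer
/// is unavailable or produces no tokens (assumed fallback policy).
fn tokenize_korean_morphological(text: &str) -> Option<String> {
    let analyzer = ANALYZER.get_or_init(Analyzer::load).as_ref()?;
    let tokens = analyzer.surfaces(text);
    if tokens.is_empty() {
        return None;
    }
    Some(tokens.join(" "))
}
```

A chunker would then set tokenized_korean_text: tokenize_korean_morphological(&text) in its Chunk literal, so a tokenizer failure degrades to a NULL column value instead of failing the chunk build.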

lindera 3.0.7 API corrections: load_dictionary_from_kind →
load_embedded_dictionary, Token.text → Token.surface.

Spec: docs/superpowers/specs/2026-05-28-v0.20.x-korean-morphological-tokenizer-spec.md §6.2
Plan: docs/superpowers/plans/2026-05-28-v0.20.x-korean-morphological-tokenizer-plan.md (S3)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-28 10:22:15 +00:00
parent 597d8b70ad
commit b134ae9dd5
20 changed files with 80 additions and 3 deletions


@@ -105,8 +105,9 @@ impl kebab_core::DocumentStore for SqliteStore {
"INSERT INTO chunks (
chunk_id, doc_id, text, heading_path_json,
section_label, source_spans_json, token_estimate,
- chunker_version, policy_hash, block_ids_json, created_at
- ) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)",
+ chunker_version, policy_hash, block_ids_json, created_at,
+ tokenized_korean_text
+ ) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)",
)
.map_err(StoreError::from)?;
for chunk in chunks {
@@ -134,6 +135,7 @@ impl kebab_core::DocumentStore for SqliteStore {
chunk.policy_hash,
block_ids,
now,
+ chunk.tokenized_korean_text.as_deref(),
])
.map_err(StoreError::from)?;
}
@@ -247,6 +249,7 @@ impl kebab_core::DocumentStore for SqliteStore {
token_estimate: row.token_estimate as usize,
chunker_version: kebab_core::ChunkerVersion(row.chunker_version),
policy_hash: row.policy_hash,
+ tokenized_korean_text: None,
}))
}
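
The INSERT above has to keep three things in step: the column list, the `?` placeholders, and the bound parameters. As a hedged alternative to the hand-written SQL in this commit (not what put_chunks actually does), the placeholders could be derived from the column list so the counts cannot drift. CHUNK_COLUMNS and insert_chunks_sql are hypothetical names introduced for this sketch.

```rust
/// Column list taken from the INSERT in this diff, in binding order.
/// The constant name is hypothetical.
const CHUNK_COLUMNS: [&str; 12] = [
    "chunk_id",
    "doc_id",
    "text",
    "heading_path_json",
    "section_label",
    "source_spans_json",
    "token_estimate",
    "chunker_version",
    "policy_hash",
    "block_ids_json",
    "created_at",
    "tokenized_korean_text",
];

/// Builds the INSERT statement with exactly one `?` per column,
/// so adding a column only requires touching CHUNK_COLUMNS.
fn insert_chunks_sql() -> String {
    let placeholders = vec!["?"; CHUNK_COLUMNS.len()].join(", ");
    format!(
        "INSERT INTO chunks ({}) VALUES ({})",
        CHUNK_COLUMNS.join(", "),
        placeholders
    )
}
```

The bound parameter list still has to be kept in the same order by hand; this only removes the column-count versus placeholder-count mismatch.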


@@ -97,6 +97,7 @@ fn make_chunks(doc_id: &DocumentId) -> Vec<Chunk> {
token_estimate: 5,
chunker_version: ChunkerVersion("md-heading-v1".into()),
policy_hash: "deadbeefdeadbeef".into(),
+ tokenized_korean_text: None,
}]
}