feat(p2-2): lexical-retriever — kb-search crate + LexicalRetriever (FTS5 + bm25) #13

altair823 · 2026-05-01T05:21:32Z

altair823 commented

2026-05-01 05:21:32 +00:00

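Summary of changes

This is the P2-2 lexical-retriever work. It adds the first `kb_core::Retriever` implementation, running on top of the `chunks_fts` table laid down in P2-1, as a new crate `kb-search`. With it, the `kb search --mode lexical` path works without any embedding or LLM infrastructure.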

What was done

New crate `kb-search`

- `LexicalRetriever::new(Arc<SqliteStore>, IndexVersion)`: holds the `Arc<SqliteStore>` and, at `search` time, takes the mutex guard via `read_conn()` and runs the SQL.
- The FTS5 MATCH expression builder is the core piece. Every token of the user input is escaped into a `"..."` literal (inner `"` doubled), which neutralizes FTS5 metacharacters wholesale. Input wrapped in `'...'` is passed through as raw FTS5 syntax (opt-in). An empty query short-circuits to `Ok(vec![])` without touching the DB.
- bm25 normalization: `score = -bm25 / (1 + |bm25|)`. SQLite FTS5 always returns a negative bm25, so the normalized score is guaranteed to fall within (0, 1].
- Snippet: `snippet(chunks_fts, 3, '', '', '…', word_budget)`. Column index 3 is the `text` column of chunks_fts (chunk_id/doc_id are UNINDEXED and occupy positions 1 and 2). word_budget is `snippet_chars / 4` clamped to [1, 64]. Immediately after the call returns, `trim_snippet` enforces the cap again in chars: to match the "characters" wording of design §6.4, it cuts by Unicode scalar value rather than by grapheme cluster (the trade-off is spelled out in a comment).
- Citation construction: based on the first span of `chunks.source_spans_json`, `Citation::Line` / `Page` / `Region` / `Time` are forwarded as-is. `Byte` or an empty array falls back to `Line { 1, 1 }` and emits a `tracing::warn!`. The P1 markdown chunker always emits Line only, so this does not occur in practice, but a forward-compat regression would surface in the logs.
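
A minimal sketch of the bm25 normalization described in the list above. The function name `normalize_bm25` is hypothetical (the PR only gives the formula), and the sketch assumes the raw value arrives as an `f64`. For any finite negative input the result stays strictly between 0 and 1, which is inside the stated (0, 1] range.

```rust
/// Map a raw FTS5 bm25 value (more negative means a better match) to a
/// positive relevance score using `score = -bm25 / (1 + |bm25|)`.
/// `normalize_bm25` is an illustrative name, not taken from the PR.
pub fn normalize_bm25(bm25: f64) -> f64 {
    -bm25 / (1.0 + bm25.abs())
}
```

The mapping is monotonic, so ordering by the normalized score (descending) agrees with ordering by raw bm25 (ascending).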

Filters

- Handled in SQL: `tags_any` (subquery on document_tags), `lang` (`=` comparison), `trust_min` (CASE-rank).
- Rust post-filter: `path_glob`. To satisfy the spec Risks/notes requirement that "`*` must not cross `/`", it uses `globset::GlobBuilder::literal_separator(true)`. Empirically, globset's default lets `*` cross `/`, so this had to be turned on explicitly. When path_glob is active, the SQL over-fetches `+128` rows, the Rust side culls them, and rank is then reassigned as 1..k so that ranks stay 1-based and contiguous.
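
The over-fetch, cull, and re-rank steps can be sketched as below. This is an illustration under assumptions, not the crate's code: `Hit`, `fetch_limit`, and `cull_and_rerank` are hypothetical names, the path predicate is passed in as a closure instead of a compiled globset matcher, and the over-fetch is assumed to be `k + 128`.

```rust
/// Hypothetical, simplified stand-in for a search hit.
#[derive(Debug, Clone, PartialEq)]
pub struct Hit {
    pub rank: usize,
    pub path: String,
}

/// Extra rows requested from SQL when a path filter will drop some of them.
const OVERFETCH: usize = 128;

/// LIMIT to request: k, or k + 128 when a path_glob post-filter is active.
pub fn fetch_limit(k: usize, has_path_glob: bool) -> usize {
    if has_path_glob { k.saturating_add(OVERFETCH) } else { k }
}

/// Drop rows whose path fails `keep`, keep at most `k`, and renumber the
/// survivors 1..=k so ranks stay 1-based and contiguous.
pub fn cull_and_rerank<F>(rows: Vec<Hit>, k: usize, keep: F) -> Vec<Hit>
where
    F: Fn(&str) -> bool,
{
    rows.into_iter()
        .filter(|h| keep(&h.path))
        .take(k)
        .enumerate()
        .map(|(i, mut h)| {
            h.rank = i + 1;
            h
        })
        .collect()
}
```

Renumbering after the cull matters because the SQL-side ranks would otherwise have gaps wherever a row was filtered out.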

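Determinism

`ORDER BY score, f.chunk_id` gives a stable order even for identical bm25. The tiebreaker is a lexicographic comparison of 32-character blake3 hex strings, so it is architecture-independent. A separate test with two chunks of identical text covers the case where the tiebreaker actually fires.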
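
An in-memory analogue of that ordering, as a sketch only: the real sort happens in SQL, and `Row` and `sort_rows` are hypothetical names. It assumes the sort key is the raw bm25 value in ascending order (most negative first), with `chunk_id` as the lexicographic tiebreaker.

```rust
/// Hypothetical row shape: raw bm25 plus the chunk's hex id.
#[derive(Debug, Clone, PartialEq)]
pub struct Row {
    pub bm25: f64,
    pub chunk_id: String,
}

/// Sort by bm25 ascending, then by chunk_id. The second key makes the
/// result independent of input order when two rows share the same bm25.
pub fn sort_rows(rows: &mut [Row]) {
    rows.sort_by(|a, b| {
        a.bm25
            .total_cmp(&b.bm25)
            .then_with(|| a.chunk_id.cmp(&b.chunk_id))
    });
}
```

Because the ids are plain ASCII hex strings, the byte-wise `String` comparison gives the same order on every platform.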

`kb-store-sqlite` changes

- Adds `pub fn read_conn(&self) -> MutexGuard<'_, Connection>`. Read-only is a doc-only contract: a `&Connection` does not let the type system block mutation, and a closure scope becomes awkward when kb-search needs `prepare_cached` plus row iteration. The closure-wrapper variant (`with_read_conn`) is left as a P3 follow-up.

Tests

- 13 unit tests + 13 integration tests = 26 new tests. All 211 tests in the workspace pass.
- Coverage: empty corpus, empty/whitespace-only/`'...'`-wrapped query, single-hit citation round-trip, snippet length cap, tags_any exclusion, lang + trust_min composition, the `/` boundary of path_glob, bm25 top-1 ∈ (0, 1], determinism (different scores / identical-score tiebreaker), index_version passthrough, snapshot baseline (`crates/kb-search/tests/fixtures/search/lexical/run-1.json`).
- Snapshot stability relies on rusqlite's bundled SQLite. A comment in the test says to regenerate with `KB_UPDATE_SNAPSHOTS=1` if a future SQLite bump changes the tokenizer or bm25.
- `cargo clippy --workspace --all-targets -- -D warnings` is clean.

Dependencies

Allowed deps respected: `kb-core`, `kb-config`, `kb-store-sqlite`, `rusqlite`, `tracing`, `thiserror`, `anyhow` (forced because the Retriever trait's return type is `anyhow::Result`), `serde_json` (required to parse the `heading_path_json`/`source_spans_json` TEXT columns), `globset` (the `*` boundary requirement from Risks/notes). None of the Forbidden list (kb-source-fs, kb-parse-md, kb-normalize, kb-chunk, kb-store-vector, kb-embed*, kb-llm*, kb-rag, kb-tui, kb-desktop) appears in `cargo tree -p kb-search`.

Changed files

- `crates/kb-search/Cargo.toml` (new)
- `crates/kb-search/src/lib.rs` (new)
- `crates/kb-search/src/lexical.rs` (new)
- `crates/kb-search/tests/lexical.rs` (new, 13 integration tests)
- `crates/kb-search/tests/fixtures/search/lexical/run-1.json` (snapshot baseline)
- `crates/kb-store-sqlite/src/store.rs` (adds `read_conn`)
- `Cargo.toml` (workspace member + `globset`/`tempfile`/`rusqlite` declared under `[workspace.dependencies]`)
- `Cargo.lock`

Out of scope (follow-up work)

- Vector search (P3-3), hybrid fusion (P3-4), reranker (P+), Korean morphological tokenizer (P+).
- Introducing a closure wrapper for `read_conn` (`with_read_conn<R>(...)`): to be done at the start of P3, if read-only is to be enforced at the type-system level.

See design §3.7, §0 Q3, §1.5/1.6, §2.2, §6.4.


altair823 added 1 commit 2026-05-01 05:21:33 +00:00

feat(p2-2): kb-search crate + LexicalRetriever (FTS5 + bm25) b335151d18

Adds the first concrete kb_core::Retriever, exercising chunks_fts (P2-1)
to answer SearchMode::Lexical queries. Returns Vec<SearchHit> with
bm25-derived ranking, snippet() previews, and W3C-fragment-style
Citation built from the chunk's first source_spans entry.

New crate kb-search:
- LexicalRetriever::new(Arc<SqliteStore>, IndexVersion).
- search() builds an FTS5 MATCH expression by escaping every whitespace
  token into a quoted literal (inner " doubled); single-quote-wrapped
  text passes through verbatim as raw FTS5 syntax. Empty query
  short-circuits to Ok(vec![]).
- bm25 normalization: score = -bm25 / (1 + |bm25|), bounded (0, 1] for
  any FTS5-returned negative bm25.
- Snippet via snippet(chunks_fts, 3, '', '', '…', word_budget) where
  word_budget = snippet_chars / 4 clamped to [1, 64]; trim_snippet
  enforces the char cap on the way out (chars per design §6.4 — accepts
  the combining-mark trade-off).
- Citation from chunks.source_spans_json first span: Line / Page /
  Region / Time forwarded; Byte / empty array fall back to Line{1,1}
  with a tracing::warn so forward-compat regressions surface.
- Filters: tags_any (subquery on document_tags), lang (= column),
  trust_min (CASE-rank in SQL) all applied at SQL level. path_glob
  uses globset with literal_separator(true) — guarantees '*' does not
  cross '/' per spec Risks/notes — applied as Rust post-filter with
  +128 row over-fetch when set, then rank reassigned 1..k contiguously.
- Determinism: ORDER BY score, f.chunk_id (lexicographic blake3 hex
  tiebreaker on identical bm25). Tested explicitly with two chunks of
  identical text content.
- RetrievalDetail: method=Lexical, both lexical_score and fusion_score
  set, vector_* None.

kb-store-sqlite:
- Adds pub fn read_conn(&self) -> MutexGuard<'_, Connection>.
  Read-only contract is doc-only — kb-search needs MutexGuard for
  prepare_cached + iter, which a closure-scoped wrapper would awkwardly
  constrain. Closure variant left as a P3 follow-up.

Tests (26 new): empty corpus, empty query, single hit + citation
round-trip, snippet length cap, tags_any exclusion, lang+trust
composition, path_glob with '*' not crossing '/', citation line round-
trip, bm25 top-1 ∈ (0, 1], determinism (varied scores AND identical-
score tiebreaker), index_version passthrough, snapshot
(crates/kb-search/tests/fixtures/search/lexical/run-1.json — stable
under bundled SQLite; KB_UPDATE_SNAPSHOTS=1 to regenerate). Workspace:
211 tests pass, cargo clippy --workspace --all-targets -- -D warnings
clean.

Allowed deps respected: kb-core, kb-config, kb-store-sqlite, rusqlite,
tracing, thiserror, anyhow (forced by trait return type), serde_json
(parses *_json TEXT columns), globset (path_glob '*' boundary).

Out of scope (deferred): vector retriever (p3-3), hybrid fusion (p3-4),
reranker (P+), Korean morphological tokenizer (P+).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

claude-reviewer-01 reviewed 2026-05-01 05:22:20 +00:00

claude-reviewer-01 left a comment

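P2-2 code review. COMMENT only because of the self-merge gate.

Across both the spec-compliance and code-quality reviews there were 0 BLOCKER and 0 MUST-FIX findings. Of the 11 NITs, the 9 worth acting on (tracing wiring, dev-deps deduplication, unused SELECT columns, fetch_limit saturating cast, identical-bm25 tiebreaker test, combining-mark trade-off comment, removal of unnecessary String allocations, snapshot stability comment, Byte fallback warn) have been applied to the PR.

Key points:

- FTS5 metacharacters are neutralized wholesale, without any grammar parsing, by wrapping every token in `"..."` and doubling inner `"`. A `'...'` wrap is the explicit opt-in for raw FTS5 syntax.
- bm25 normalization `-bm25 / (1 + |bm25|)` guarantees (0, 1]; the top score is pinned by a unit test.
- path_glob uses globset `literal_separator(true)` + `+128` over-fetch + rank reassignment after the post-filter, which satisfies both the `*` boundary requirement of spec Risks/notes and 1-based contiguous ranks.
- A separate test verifies that the ORDER BY determinism tiebreaker also works in the identical-bm25 case.
- The Byte/empty source_span fallback is not silent; it surfaces through `tracing::warn!`.

All inline comments are notes on decisions that were made well. OK to proceed with the merge.

Follow-ups are collected under "Out of scope" in the PR body; converting `read_conn` to a closure wrapper will be revisited when P3 starts.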


crates/kb-search/src/lexical.rs

						
				@@ -0,0 +87,4 @@

				        let k = if query.k == 0 { DEFAULT_K } else { query.k };

				        let filters = &query.filters;

				        // One-line summary at request entry. Filter shape only — no

				        // tag/lang/path values, which could be PII-sensitive.

claude-reviewer-01 commented

2026-05-01 05:22:19 +00:00

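Having `tracing::debug!` at both the entry and the exit of search is good. For debugging `kb search --mode lexical`, the match string itself and the filter shape (counts/flags) are logged separately, which strikes the right balance: user-supplied values are not leaked as PII, yet one can still check whether a filter was applied at all.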


crates/kb-search/src/lexical.rs

						
				@@ -0,0 +197,4 @@

				/// Return `Some(inner)` if `s` is wrapped in a matching pair of single

				/// quotes (`'...'`), otherwise `None`. We require the closing quote to

				/// be the last character so `'foo' bar` doesn't accidentally engage

claude-reviewer-01 commented

2026-05-01 05:22:19 +00:00

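The escape strategy of the FTS5 match builder is simple and safe. Wrapping every token in `"..."` and doubling only the inner `"` neutralizes all FTS5 metacharacters (`*`, `^`, `:`, `(`, `)`) at once. It is the most boring correct answer: it closes off grammar bypasses without any parsing. And when a user deliberately wants FTS5 syntax, there is the explicit opt-in of wrapping the query in `'...'`.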
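
The strategy can be sketched roughly as follows. The names `build_match_expr` and `strip_single_quotes` are hypothetical, and details such as returning `None` for an empty query are assumptions made for illustration; the actual builder in `lexical.rs` may differ.

```rust
/// Build an FTS5 MATCH expression from raw user input.
/// Returns None for empty or whitespace-only input so the caller can skip
/// the database entirely. Illustrative sketch, not the crate's code.
pub fn build_match_expr(input: &str) -> Option<String> {
    let trimmed = input.trim();
    if trimmed.is_empty() {
        return None;
    }
    // Explicit opt-in: '...' passes the inner text through as raw FTS5.
    if let Some(raw) = strip_single_quotes(trimmed) {
        let raw = raw.trim();
        return if raw.is_empty() { None } else { Some(raw.to_string()) };
    }
    // Default path: each whitespace-separated token becomes a quoted
    // string literal, with inner double quotes doubled.
    let quoted: Vec<String> = trimmed
        .split_whitespace()
        .map(|tok| format!("\"{}\"", tok.replace('"', "\"\"")))
        .collect();
    Some(quoted.join(" "))
}

/// Some(inner) only when the whole string is wrapped in single quotes,
/// so input like `'foo' bar` does not engage raw mode.
fn strip_single_quotes(s: &str) -> Option<&str> {
    if s.len() >= 2 && s.starts_with('\'') && s.ends_with('\'') {
        Some(&s[1..s.len() - 1])
    } else {
        None
    }
}
```

With this sketch, `foo AND bar*` becomes `"foo" "AND" "bar*"`, so `AND` and `*` are matched as literal text instead of being interpreted as operators.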


crates/kb-search/src/lexical.rs

						
				@@ -0,0 +314,4 @@

				        };

				        params.push(Box::new(rank));

				    }

				    // path_glob is intentionally NOT applied here — see module comment

claude-reviewer-01 commented

2026-05-01 05:22:20 +00:00

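The `i64::try_from(fetch_limit).unwrap_or(i64::MAX)` cast closes off the scenario where a `usize::MAX` input wraps to a negative LIMIT, which SQLite would treat as "no upper bound" rather than as the intended limit. This will almost never happen in practice, but erring on the side of robustness is the right call.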
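
A small sketch of the cast in isolation. The wrapper name `sql_limit` is hypothetical; the expression inside it is the one quoted above.

```rust
/// Convert a row limit to the i64 that SQLite binds, clamping instead of
/// wrapping when the value does not fit. `sql_limit` is an illustrative name.
pub fn sql_limit(fetch_limit: usize) -> i64 {
    i64::try_from(fetch_limit).unwrap_or(i64::MAX)
}
```

By contrast, a plain `fetch_limit as i64` would turn `usize::MAX` into `-1` on 64-bit targets.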


crates/kb-search/src/lexical.rs

						
				@@ -0,0 +457,4 @@

				                None => "empty array",

				            };

				            tracing::warn!(

				                chunk_id,

claude-reviewer-01 commented

2026-05-01 05:22:20 +00:00

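It is good that the trade-off of `trim_snippet` cutting by `.chars()` (Unicode scalar values) is recorded in a comment. The design §6.4 definition of "characters" is USV-based, so this conforms to the spec. The comment also keeps a future reader from mistaking the corner case where a Hebrew niqqud or Devanagari combining mark can be orphaned for a "bug" and "fixing" it with grapheme clusters, which would be a regression.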
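
A hedged sketch of the two size controls discussed here. The function bodies are assumptions: the PR states only that the word budget is `snippet_chars / 4` clamped to [1, 64] and that `trim_snippet` caps by `char`; whether the real function also appends an ellipsis is not stated, so this version does not.

```rust
/// Token budget passed to FTS5 snippet(): snippet_chars / 4, clamped to
/// [1, 64]. Illustrative reconstruction from the PR description.
pub fn word_budget(snippet_chars: usize) -> usize {
    (snippet_chars / 4).clamp(1, 64)
}

/// Cap a snippet at `max_chars` Unicode scalar values. This can separate a
/// base character from a following combining mark, which is the trade-off
/// the review comment refers to.
pub fn trim_snippet(s: &str, max_chars: usize) -> String {
    s.chars().take(max_chars).collect()
}
```

For example, trimming `"e\u{0301}x"` (an `e` followed by a combining acute accent) to one char keeps the bare `e` and drops the accent, where a grapheme-based cut would keep both.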


crates/kb-search/src/lexical.rs

						
				@@ -0,0 +467,4 @@

				                end: 1,

				                section,

				            }

				        }

claude-reviewer-01 commented

2026-05-01 05:22:20 +00:00

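The `Citation::Line { 1, 1 }` fallback + `tracing::warn!` pattern for Byte/empty-array source_spans expresses the intent well. It does not silently hide a data-integrity issue (the warn surfaces it), and at the same time it does not stop retrieval (the fallback still forwards a citation), so a forward-compat regression gets caught in the logs.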
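
The shape of that fallback, as a simplified sketch. The enums below are hypothetical, reduced stand-ins (the real `Citation` also has `Region`, `Time`, and a `section` field), and the warning is returned as a reason string instead of being emitted through `tracing::warn!`, so the example needs only the standard library.

```rust
/// Reduced, hypothetical stand-in for the stored span kinds.
#[derive(Debug, Clone, PartialEq)]
pub enum SourceSpan {
    Line { start: u32, end: u32 },
    Page { page: u32 },
    Byte { start: u64, end: u64 },
}

/// Reduced, hypothetical stand-in for the citation type.
#[derive(Debug, Clone, PartialEq)]
pub enum Citation {
    Line { start: u32, end: u32 },
    Page { page: u32 },
}

/// Build a citation from the first span. Unsupported or missing spans fall
/// back to Line { 1, 1 } and report why, so the caller can log a warning.
pub fn citation_from_spans(spans: &[SourceSpan]) -> (Citation, Option<&'static str>) {
    match spans.first() {
        Some(SourceSpan::Line { start, end }) => {
            (Citation::Line { start: *start, end: *end }, None)
        }
        Some(SourceSpan::Page { page }) => (Citation::Page { page: *page }, None),
        Some(SourceSpan::Byte { .. }) => {
            (Citation::Line { start: 1, end: 1 }, Some("byte span"))
        }
        None => (Citation::Line { start: 1, end: 1 }, Some("empty array")),
    }
}
```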


crates/kb-search/src/lexical.rs

						
				@@ -0,0 +469,4 @@

				            }

				        }

				    }

				}

claude-reviewer-01 commented

2026-05-01 05:22:20 +00:00

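globset's `literal_separator(true)`: this single line satisfies the spec Risks/notes requirement that "`*` must not cross `/`". It is good that the default globset behavior (crossing) was confirmed empirically and the option turned on explicitly, and pinning the invariant with the `compile_glob_star_does_not_cross_slash` unit test is also appropriate.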
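
To make the invariant concrete without pulling in the crate, here is a tiny standard-library matcher that supports only literals and `*`. It is a stand-in written for illustration, not globset and not the PR's code; with globset the equivalent setup would go through `GlobBuilder::new(pattern).literal_separator(true).build()`.

```rust
/// Match `path` against a pattern made of literal characters and `*`,
/// where `*` matches any run of characters that contains no '/'.
/// Illustrative stand-in for the behavior, not a full glob implementation.
pub fn star_match(pattern: &str, path: &str) -> bool {
    let p: Vec<char> = pattern.chars().collect();
    let s: Vec<char> = path.chars().collect();
    matches(&p, &s)
}

fn matches(p: &[char], s: &[char]) -> bool {
    match p.split_first() {
        // Pattern exhausted: succeed only if the path is exhausted too.
        None => s.is_empty(),
        // '*': try every prefix length, stopping at the first '/'.
        Some((&'*', rest)) => {
            let mut i = 0;
            loop {
                if matches(rest, &s[i..]) {
                    return true;
                }
                if i < s.len() && s[i] != '/' {
                    i += 1;
                } else {
                    return false;
                }
            }
        }
        // Literal: must equal the next path character.
        Some((&c, rest)) => s.first() == Some(&c) && matches(rest, &s[1..]),
    }
}
```

Under this rule `docs/*.md` accepts `docs/a.md` but rejects `docs/sub/a.md`, which is the boundary the unit test pins.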


crates/kb-search/tests/fixtures/search/lexical/run-1.json

						
				@@ -0,0 +1,60 @@

				[

claude-reviewer-01 commented

2026-05-01 05:22:20 +00:00

Keeping the snapshot baseline's stability note (rusqlite bundled SQLite + the `KB_UPDATE_SNAPSHOTS=1` regeneration procedure) on the test-code side is appropriate. When a future SQLite bump changes the bm25 algorithm or the tokenizer, a reviewer can immediately see what to do.

crates/kb-search/tests/lexical.rs

						
				@@ -0,0 +552,4 @@

				#[test]

				fn lexical_determinism_chunk_id_tiebreaker_on_equal_bm25() {

				    // Two chunks with byte-identical text + length → identical bm25 scores

claude-reviewer-01 commented

2026-05-01 05:22:20 +00:00

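Adding a separate determinism test that produces identical bm25 is decisive. With only the general determinism test (varied scores), "determinism PASS" would be reported without the chunk_id tiebreaker path ever having run; the case with two chunks of identical text actually exercises that path.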

crates/kb-store-sqlite/src/store.rs

						
				@@ -112,0 +122,4 @@

				    ///

				    /// Poisoning is recovered the same way as [`Self::lock_conn`].

				    pub fn read_conn(&self) -> MutexGuard<'_, Connection> {

				        self.conn.lock().unwrap_or_else(|p| p.into_inner())

claude-reviewer-01 commented

2026-05-01 05:22:20 +00:00

The `read_conn` doc comment makes the intent of the contract clear, and the PR body is candid that the `&Connection` type cannot prevent mutation. The closure-scoped variant (`with_read_conn<R>(&self, f: impl FnOnce(&Connection) -> R) -> R`) is sturdier at the type-system level, but it comes with the trade-off that the `prepare_cached` + row-iterator pattern and its lifetimes become awkward. I agree with the decision to defer it as a P3 follow-up.
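
The two accessor shapes side by side, as a sketch that uses only the standard library. `Conn` and `Store` are hypothetical stand-ins for `rusqlite::Connection` and `SqliteStore`; the poison recovery mirrors the `read_conn` body shown in the diff. One caveat about the stand-in: `&Conn` here really is immutable, whereas rusqlite's `Connection::execute` takes `&self`, so with the real type the closure mainly bounds how long the guard is held.

```rust
use std::sync::{Mutex, MutexGuard};

/// Hypothetical stand-in for a database connection.
pub struct Conn {
    pub rows: Vec<String>,
}

/// Hypothetical stand-in for the store that owns the connection.
pub struct Store {
    conn: Mutex<Conn>,
}

impl Store {
    pub fn new(rows: Vec<String>) -> Self {
        Self { conn: Mutex::new(Conn { rows }) }
    }

    /// Guard-returning accessor: the caller decides how long the lock is
    /// held. A poisoned mutex is recovered by taking the inner guard.
    pub fn read_conn(&self) -> MutexGuard<'_, Conn> {
        self.conn.lock().unwrap_or_else(|p| p.into_inner())
    }

    /// Closure-scoped accessor: the lock is released when `f` returns, and
    /// `f` only ever sees a shared reference.
    pub fn with_read_conn<R>(&self, f: impl FnOnce(&Conn) -> R) -> R {
        let guard = self.read_conn();
        f(&*guard)
    }
}
```

The awkwardness mentioned above comes from the closure form: anything borrowed from the connection, such as a prepared statement and its row iterator, has to be fully consumed inside `f` and cannot be handed back to the caller.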
