kebab-search/tests/lexical.rs 의 alias 채널 테스트 + insert_chunk_with_aliases
헬퍼 제거(body 회수 회귀 테스트로 대체). Chunk 리터럴 aliases: None 제거
(embedding_records_fk/idempotency/inspect). chunk 스냅샷 fixture 의 aliases
키 제거. config_migrate 는 ingest.code 앵커로, corpus_revision/search_lexical
주석은 V013 비-bump 명시로 갱신.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
v0.17.0 trigram tokenizer entry 가 미수정으로 남겨둔
heading_path_json JSON 노이즈 (HOTFIXES 2026-05-24) closure.
trigram 이 chunks_fts.heading_path 컬럼 (V002/V007 트리거가
chunks.heading_path_json 그대로 INSERT) 의 JSON 표기 + 안의 path
세그먼트 (app, src) 까지 3-gram 색인해서 query 가 우연히 false
positive hit 하는 문제. column filter 채택 — heading 색인 유지
(V007 verbatim 불변), 매칭 대상만 text 컬럼 한정.
- build_match_string 가 non-raw 분기에서 combined expression 을
`text : (<expr>)` 로 wrap. FTS5 column filter syntax 가 OR/AND
sub-expression 허용.
- Raw mode (`'...'`) 는 그대로 — 사용자가 명시 의도로
`'heading_path : agent'` 같은 explicit opt-in 가능 (escape hatch).
- 8 기존 build_match_string unit test expected string 갱신 +
`build_match_string_raw_mode_preserves_heading_filter` 신규.
- `lexical_heading_only_token_does_not_hit_default_mode` 신규 회귀 핀
(heading-only unique token 이 default mode 에서 0 hit).
- `lexical_raw_mode_can_opt_into_heading_path_filter` 신규 — 같은
fixture 가 raw mode 로 hit 확인 (escape hatch 동작 핀).
사용자 영향: lexical / hybrid 검색의 본문 precision ↑. recall
변화 없음 (text 본문 token 매칭은 동일). re-ingest 불필요 (FTS
query 시점 매칭만 변경). lexical_snapshot_run_1 + hybrid_snapshot
도 fixture regenerate 불필요 (text 본문 매칭 query 라 BM25 동일).
HOTFIXES: 2026-05-24 v0.17.0 entry 의 `heading_path_json` 노이즈
항목 closure 표기 + 새 2026-05-25 post-v0.17.1 dogfood entry 추가.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
p10-1A-1 (PR #139) added SearchFilters.code_lang + .repo fields and the CLI
--code-lang / --repo flags propagate them correctly into SearchFilters, but
neither the lexical retriever's FTS SQL nor the shared filter_chunks helper
(used by the vector retriever) ever applied them — so a code-lang-filtered
search returned all-doc hits (markdown / pdf / code mixed).
Discovered while dogfooding p10-1B with httpx + zod + lodash clones:
`kebab search 'AsyncClient' --code-lang python --json` returned markdown
hits from httpx/docs/ first.
Fix: add IN-list filters on json_extract(d.metadata_json, '$.code_lang')
and '$.repo' to both lexical.rs and filters.rs, mirroring the existing
media filter pattern. Two regression tests added in each crate covering
the new filter behavior.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit unblocks Tasks 3 and 4 of fb-38:
- VectorRetriever::build_hit now labels hits with ScoreKind::Cosine
- Hybrid retriever test helpers (mk_hit functions) label synthetic hits with ScoreKind::Rrf
- Updated lexical snapshot fixture to reflect new score_kind field in output
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Add ScoreKind::Bm25 to LexicalRetriever::build_hit SearchHit construction
- Import ScoreKind from kebab_core in lexical.rs
- Add integration test lexical_retriever_hits_carry_bm25_score_kind to verify all
hits from LexicalRetriever carry score_kind == ScoreKind::Bm25
- Update lexical snapshot test baseline to include new score_kind field
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
filter_chunks helper in kebab-store-sqlite extended with the same 3
WHERE clauses as lexical. Vector still over-fetches k*2 then
post-filters via SqliteStore::filter_chunks; small k can return < k
hits when filters drop a lot — agent is expected to widen k or
paginate. AND combinator with existing filters.
- kebab-store-sqlite/src/filters.rs: media IN-list subquery, ingested_after
lexicographic >= compare, doc_id equality; mirrors lexical SQL arms
- 3 direct unit tests (filter_chunks_media_type/ingested_after/doc_id)
that run without AVX/Lance
- common/mod.rs: insert_doc / insert_doc_with_media / run_vector_search
helpers on HybridEnv for integration-test use
- hybrid.rs: 2 new #[ignore = "requires AVX..."] integration tests
(vector_filter_by_media, vector_filter_by_doc_id)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
SQL WHERE clause extension. media uses CASE WHEN json_type='text'
to handle both unit (\`"markdown"\`) and tuple (\`{"image":"png"}\`)
MediaType serde shapes. ingested_after relies on RFC3339 lexicographic
ordering with UTC Z (per fb-32 ingest invariant). doc_id is a simple
equality. AND combinator with existing tags / lang / trust filters.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
JOIN documents.updated_at. stale defaults to false; App facade
post-processes against config threshold.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>