Remove the alias-channel tests and the insert_chunk_with_aliases helper from
kebab-search/tests/lexical.rs (replaced by a body-recall regression test).
Remove `aliases: None` from Chunk literals
(embedding_records_fk/idempotency/inspect). Remove the aliases key from the
chunk snapshot fixture. config_migrate now anchors on ingest.code; the
corpus_revision/search_lexical comments are updated to state that V013 does
not bump.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
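Strip alias dense-vector ({orig}#alias) hits back to the original chunk_id
before hydrating; when both body and alias hit, keep only the first
(higher-score) one. overfetch 2→3 (so k is still filled after dedup).
wire/RetrievalDetail unchanged. 0 vector/hybrid regressions, clippy green.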
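A minimal sketch of the strip-and-dedup step described above, not the actual retriever code. `RawHit`, `strip_alias_suffix` and `dedup_alias_hits` are hypothetical names, and the input is assumed to be sorted by descending score already.

```rust
use std::collections::HashSet;

/// Hypothetical hit shape; the real retriever types are not shown in this log.
#[derive(Debug, Clone, PartialEq)]
pub struct RawHit {
    pub id: String,
    pub score: f32,
}

/// Map `{orig}#alias` back to `{orig}`; plain chunk ids pass through unchanged.
pub fn strip_alias_suffix(id: &str) -> &str {
    id.strip_suffix("#alias").unwrap_or(id)
}

/// Assumes `hits` is sorted by descending score, so the first occurrence of
/// each chunk id is the highest-scoring one. Returns at most `k` hits.
pub fn dedup_alias_hits(hits: Vec<RawHit>, k: usize) -> Vec<RawHit> {
    let mut seen: HashSet<String> = HashSet::new();
    let mut out = Vec::new();
    for hit in hits {
        if out.len() >= k {
            break;
        }
        let chunk_id = strip_alias_suffix(&hit.id).to_string();
        // `insert` returns false when the chunk id was already kept.
        if seen.insert(chunk_id.clone()) {
            out.push(RawHit { id: chunk_id, score: hit.score });
        }
    }
    out
}
```

Because a body hit and its alias hit collapse into one entry, fetching only k*2 candidates can leave fewer than k distinct chunks, which is the stated motivation for the larger overfetch.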
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Formatting cleanup for the V009 morphological tokenizer work (S3 chunk + S4
backfill + S5 short_query_hint removal + S7 new tests). No behavior change.
Spec: docs/superpowers/specs/2026-05-28-v0.20.x-korean-morphological-tokenizer-spec.md
Plan: docs/superpowers/plans/2026-05-28-v0.20.x-korean-morphological-tokenizer-plan.md (S11)
Add 4 new tests:
- crates/kebab-store-sqlite/tests/fts.rs:
  - fts_v009_korean_morphological_2char_query_hits: the 2-char query '한국'
    hits a chunk whose tokenized_korean_text column is populated.
  - fts_v009_english_whole_token_only: regression of V007 trigram substring
    matching (Path A): a 'token' query gets 0 hits on a 'tokenizer' chunk.
- crates/kebab-app/tests/search_korean.rs:
  - korean_morphological_2char_query_lexical_mode: end-to-end ingest of a
    Korean wiki fixture, then '한국' / '서울' queries hit.
  - korean_morphological_mixed_english_korean_query: 'Rust' English
    whole-token + '최적화' Korean morpheme hit.
crates/kebab-search/src/lexical.rs:
- build_match_string(): MIN_TRIGRAM_CHARS(3) → MIN_QUERY_CHARS(2).
  V009 unicode61 has no minimum token length, so 2-char Korean morpheme
  queries must pass through. Lone 1-char tokens are still filtered.
- Update 2 related unit tests to V009 behavior.
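A small sketch of the length filter described above, under the assumption that length is counted in Unicode scalar values rather than bytes (a 2-syllable Hangul token is 6 bytes in UTF-8). `keep_query_tokens` is a hypothetical helper, not the real build_match_string.

```rust
/// Assumed value after this change (was 3 under the trigram tokenizer).
const MIN_QUERY_CHARS: usize = 2;

/// Keep whitespace-separated tokens that have at least MIN_QUERY_CHARS
/// characters; single-character tokens are dropped.
pub fn keep_query_tokens(query: &str) -> Vec<&str> {
    query
        .split_whitespace()
        // Count chars, not bytes: "한국".len() is 6 but it is 2 characters.
        .filter(|token| token.chars().count() >= MIN_QUERY_CHARS)
        .collect()
}
```

For example, `keep_query_tokens("한국 a 서울 Rust")` keeps `한국`, `서울` and `Rust` and drops the lone `a`.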
The fixture text depends on the actual segmentation behavior of lindera
ko-dic (predicted from prior knowledge in spec Appendix B). The fixture may be
adjusted once real output is measured.
Spec: docs/superpowers/specs/2026-05-28-v0.20.x-korean-morphological-tokenizer-spec.md §9.1, §9.2
Plan: docs/superpowers/plans/2026-05-28-v0.20.x-korean-morphological-tokenizer-plan.md (S7)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
With the V009 unicode61 + morphological tokenizer, 2-char Korean queries can
now hit, so the V007-era "3 or more characters recommended" hint is obsolete.
The SearchResponse.hint field stays on the struct to preserve the wire schema
and is always None.
- kebab-app/src/app.rs: delete the short_query_hint function + doc-comment.
  The 2 call sites now set hint = None.
- kebab-app/src/lib.rs: remove short_query_hint from the re-export.
- kebab-tui/{app.rs,search.rs,run.rs}: remove the short_query_hint field + the
  4-call cascade.
- kebab-cli/tests/wire_search_response.rs:
  delete the search_plain_emits_short_query_hint_to_stderr test.
  Replace search_json_emits_hint_field_for_short_query with
  search_json_hint_absent_for_short_query_v009
  (verifies hint is always None).
- kebab-search/src/lexical.rs::build_match_string: the V007 trigram
  multi-token OR-combine branch is redundant under V009 but is kept
  (future extensibility); add a 1-line doc-comment.
Wire schema shape unchanged (the hint field at search_response.schema.json:33
is preserved and is always set to None on the struct).
Spec: docs/superpowers/specs/2026-05-28-v0.20.x-korean-morphological-tokenizer-spec.md §7.2, §7.3, §11.3
Plan: docs/superpowers/plans/2026-05-28-v0.20.x-korean-morphological-tokenizer-plan.md (S5)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
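Closes the heading_path_json JSON noise item (HOTFIXES 2026-05-24) that the
v0.17.0 trigram tokenizer entry left unfixed.
Problem: the trigram tokenizer 3-gram-indexed the chunks_fts.heading_path
column (the V002/V007 triggers INSERT chunks.heading_path_json as-is),
including the JSON notation and the path segments inside it (app, src), so
queries could produce accidental false-positive hits. Fix: adopt a column
filter. Heading indexing stays (V007 verbatim, unchanged); only the match
target is restricted to the text column.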
- build_match_string wraps the combined expression in `text : (<expr>)` in
  the non-raw branch. The FTS5 column filter syntax accepts OR/AND
  sub-expressions.
- Raw mode (`'...'`) is unchanged: users can still opt in explicitly with
  something like `'heading_path : agent'` (escape hatch).
- Update the expected strings of 8 existing build_match_string unit tests +
  add `build_match_string_raw_mode_preserves_heading_filter`.
- Add `lexical_heading_only_token_does_not_hit_default_mode` as a regression
  pin (a unique heading-only token gets 0 hits in default mode).
- Add `lexical_raw_mode_can_opt_into_heading_path_filter`: the same fixture
  hits in raw mode (pins the escape hatch behavior).
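A rough sketch of the two branches listed above, not the real build_match_string. `build_match_sketch` is a hypothetical name, and the per-token quoting and OR combination are assumptions made for illustration; only the `text : (<expr>)` wrap and the single-quote raw mode come from this commit.

```rust
/// Hypothetical stand-in for the match-string builder.
pub fn build_match_sketch(query: &str) -> Option<String> {
    let trimmed = query.trim();

    // Raw mode: a query wrapped in single quotes is handed to FTS5 untouched,
    // so an explicit `heading_path : agent` filter keeps working.
    let raw = trimmed
        .strip_prefix('\'')
        .and_then(|q| q.strip_suffix('\''))
        .filter(|r| !r.trim().is_empty());
    if let Some(raw) = raw {
        return Some(raw.to_string());
    }

    // Default mode: quote each token (doubling embedded double quotes, as
    // FTS5 string literals require) and OR-combine them.
    let quoted: Vec<String> = trimmed
        .split_whitespace()
        .map(|t| format!("\"{}\"", t.replace('"', "\"\"")))
        .collect();
    if quoted.is_empty() {
        return None;
    }

    // Restrict matching to the `text` column with an FTS5 column filter.
    Some(format!("text : ({})", quoted.join(" OR ")))
}
```

With this sketch, `agent loop` becomes `text : ("agent" OR "loop")`, while `'heading_path : agent'` is passed through as `heading_path : agent`.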
User impact: higher body precision for lexical / hybrid search. No recall
change (token matching against the text body is identical). No re-ingest
needed (only query-time FTS matching changes). lexical_snapshot_run_1 and
hybrid_snapshot also need no fixture regeneration (their queries match the
text body, so BM25 is identical).
HOTFIXES: mark the `heading_path_json` noise item in the 2026-05-24 v0.17.0
entry as closed + add a new 2026-05-25 post-v0.17.1 dogfood entry.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
p10-1A-1 (PR #139) added SearchFilters.code_lang + .repo fields and the CLI
--code-lang / --repo flags propagate them correctly into SearchFilters, but
neither the lexical retriever's FTS SQL nor the shared filter_chunks helper
(used by the vector retriever) ever applied them — so a code-lang-filtered
search returned all-doc hits (markdown / pdf / code mixed).
Discovered while dogfooding p10-1B with httpx + zod + lodash clones:
`kebab search 'AsyncClient' --code-lang python --json` returned markdown
hits from httpx/docs/ first.
Fix: add IN-list filters on json_extract(d.metadata_json, '$.code_lang')
and '$.repo' to both lexical.rs and filters.rs, mirroring the existing
media filter pattern. Two regression tests added in each crate covering
the new filter behavior.
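A hedged sketch of how such an IN-list clause could be assembled with bound parameters. `json_in_clause` is a hypothetical helper; only the `json_extract(d.metadata_json, ...)` expression comes from the fix above, and an empty filter list is assumed to mean "no filter".

```rust
/// Build `json_extract(d.metadata_json, '<path>') IN (?, ?, ...)` plus the
/// values to bind. Returns None when the filter list is empty, so callers
/// can skip the clause entirely.
pub fn json_in_clause(json_path: &str, values: &[String]) -> Option<(String, Vec<String>)> {
    if values.is_empty() {
        return None;
    }
    // One `?` placeholder per value; values are bound, never interpolated.
    let placeholders = vec!["?"; values.len()].join(", ");
    let sql = format!("json_extract(d.metadata_json, '{json_path}') IN ({placeholders})");
    Some((sql, values.to_vec()))
}
```

Calling it once with `$.code_lang` and once with `$.repo`, then joining the resulting clauses with AND, would mirror the AND-combination the other filters use.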
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Wire two new optional fields onto SearchHit (skip_serializing_if = None)
and two Vec<String> filter fields onto SearchFilters (serde default).
Add RetrievalDetail::Default impl (manual, uses SearchMode::Hybrid as
sentinel). Patch all downstream SearchHit / SearchFilters literal
constructors with repo: None / code_lang: None / vec![] as appropriate.
Also covers Citation::Code arm in kebab-eval metrics match.
This commit unblocks Tasks 3 and 4 of fb-38:
- VectorRetriever::build_hit now labels hits with ScoreKind::Cosine
- Hybrid retriever test helpers (mk_hit functions) label synthetic hits with ScoreKind::Rrf
- Updated lexical snapshot fixture to reflect new score_kind field in output
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Add ScoreKind::Bm25 to LexicalRetriever::build_hit SearchHit construction
- Import ScoreKind from kebab_core in lexical.rs
- Add integration test lexical_retriever_hits_carry_bm25_score_kind to verify all
hits from LexicalRetriever carry score_kind == ScoreKind::Bm25
- Update lexical snapshot test baseline to include new score_kind field
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- doc TraceFusionInput.fusion_score semantics (single-mode vs hybrid)
- comment why total_ms vs stage sum can drift (millis truncation)
- TODO marker on TUI trace popup filter passthrough
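A small worked example of the drift mentioned above, assuming each stage is truncated to whole milliseconds independently of the total. The function names are hypothetical.

```rust
use std::time::Duration;

/// Sum of per-stage values after each one is truncated to whole millis.
pub fn stage_sum_ms(stages: &[Duration]) -> u128 {
    stages.iter().map(|d| d.as_millis()).sum()
}

/// Total elapsed time truncated once, after summing at full precision.
pub fn total_ms(stages: &[Duration]) -> u128 {
    stages.iter().sum::<Duration>().as_millis()
}
```

With two stages of 1.6 ms and 1.7 ms, the stage sum is 1 + 1 = 2 while the total is 3 (3.3 ms truncated), so the two numbers can differ by up to one millisecond per stage.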
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- ingested_after: convert OffsetDateTime to UTC before formatting
so non-Z offsets compare correctly against UTC TEXT storage
(lexical.rs + filters.rs)
- README: --tag is repeatable-only, not csv (only --media is csv)
- test(cli): add multi-value --tag OR-within IN-list coverage
- test(store): add UTC-offset regression test for ingested_after
- mcp: use ERROR_V1_ID const instead of hardcoded "error.v1"
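To illustrate why the ingested_after cutoff has to be normalized to UTC, here is a std-only sketch of the plain text comparison that a lexicographic `>=` performs. The timestamps are made-up examples and `text_ge` is a hypothetical helper.

```rust
/// Lexicographic `>=` on timestamp strings, as a TEXT comparison would do it.
pub fn text_ge(stored: &str, cutoff: &str) -> bool {
    stored >= cutoff
}

/// Stored value: ingested at 01:00 UTC (illustrative).
pub const STORED_UTC: &str = "2026-05-24T01:00:00Z";
/// Cutoff as the caller gave it: 09:30 at +09:00, which is 00:30 UTC.
pub const CUTOFF_OFFSET: &str = "2026-05-24T09:30:00+09:00";
/// The same instant after conversion to UTC.
pub const CUTOFF_UTC: &str = "2026-05-24T00:30:00Z";
```

The document was ingested after the cutoff instant, yet `text_ge(STORED_UTC, CUTOFF_OFFSET)` is false because "01" sorts before "09"; `text_ge(STORED_UTC, CUTOFF_UTC)` is true once the cutoff is expressed in UTC.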
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
filter_chunks helper in kebab-store-sqlite extended with the same 3
WHERE clauses as lexical. Vector still over-fetches k*2 then
post-filters via SqliteStore::filter_chunks; small k can return < k
hits when filters drop a lot — agent is expected to widen k or
paginate. AND combinator with existing filters.
- kebab-store-sqlite/src/filters.rs: media IN-list subquery, ingested_after
lexicographic >= compare, doc_id equality; mirrors lexical SQL arms
- 3 direct unit tests (filter_chunks_media_type/ingested_after/doc_id)
that run without AVX/Lance
- common/mod.rs: insert_doc / insert_doc_with_media / run_vector_search
helpers on HybridEnv for integration-test use
- hybrid.rs: 2 new #[ignore = "requires AVX..."] integration tests
(vector_filter_by_media, vector_filter_by_doc_id)
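A generic sketch of the over-fetch-then-post-filter pattern described above, showing how a small k can come back short. `overfetch_then_filter` is a hypothetical function over an already ranked slice, not the SqliteStore::filter_chunks API.

```rust
/// Take the top `k * factor` ranked candidates, drop those failing `keep`,
/// and return at most `k` survivors. Candidates beyond the over-fetch window
/// are never examined, so the result can hold fewer than `k` items.
pub fn overfetch_then_filter<T>(
    ranked: &[T],
    k: usize,
    factor: usize,
    keep: impl Fn(&T) -> bool,
) -> Vec<&T> {
    ranked
        .iter()
        .take(k.saturating_mul(factor))
        .filter(|candidate| keep(*candidate))
        .take(k)
        .collect()
}
```

With 8 ranked items, k = 3 and factor = 2, a filter that only the last four items pass yields 2 hits, because items 7 and 8 fall outside the 6-item window. Widening k or paginating, as noted above, is the caller's remedy.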
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per code review on 2c80e2a. manual-repeat-n lint triggers
for Rust 1.94+ when repeat().take() can be expressed as
repeat_n directly.
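A before/after sketch of the rewrite the lint asks for, using SQL placeholder generation as an assumed example (the actual call site is not shown in this log). It relies on std::iter::repeat_n from the standard library.

```rust
use std::iter;

/// Older spelling: an infinite repeat() bounded by take().
pub fn placeholders_with_take(n: usize) -> String {
    iter::repeat("?").take(n).collect::<Vec<_>>().join(", ")
}

/// Equivalent spelling with repeat_n, which carries the count directly.
pub fn placeholders_with_repeat_n(n: usize) -> String {
    iter::repeat_n("?", n).collect::<Vec<_>>().join(", ")
}
```

Both functions return the same string for any n; only the iterator construction differs.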
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
SQL WHERE clause extension. media uses CASE WHEN json_type='text'
to handle both unit (`"markdown"`) and tuple (`{"image":"png"}`)
MediaType serde shapes. ingested_after relies on RFC3339 lexicographic
ordering with UTC Z (per fb-32 ingest invariant). doc_id is a simple
equality. AND combinator with existing tags / lang / trust filters.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
JOIN documents.updated_at. stale defaults to false; App facade
post-processes against config threshold.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>