Commit Graph

345 Commits

Author SHA1 Message Date
th-kim0823
2a8451c033 fix(p10-1a-1): tighten kebab-parse-code manifest + tests
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-15 16:05:34 +09:00
th-kim0823
ff11f81f7f feat(p10-1a-1): kebab-parse-code crate (lang + repo + skip)
Tasks 5-8: new `kebab-parse-code` crate with three infrastructure modules
for the code ingest framework. Ships lang.rs (extension→language identifier
mapping), repo.rs (.git walk-up via gix 0.70 for RepoMeta), and skip.rs
(BUILTIN_BLACKLIST, is_generated_file, is_oversized). 14 integration tests
across three test files, all passing; clippy -D warnings clean.

Note: gix pinned to 0.70 (not 0.83 as originally suggested) because 0.83
fails to compile against Rust 1.94.1 due to non-exhaustive match patterns
in gix-hash. 0.70 resolves cleanly and has identical head_name/head_id API.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-15 15:57:59 +09:00
th-kim0823
bf4ebf8d2a feat(p10-1a-1): add Metadata.repo / git_branch / git_commit / code_lang
Four optional, serde-skipped-when-None fields added to `Metadata` for
code ingest context. All 11 downstream construction sites patched with
`repo: None, git_branch: None, git_commit: None, code_lang: None`.
Full workspace check (`--tests`) and per-crate test suite pass clean.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-15 15:44:18 +09:00
th-kim0823
351c7a0826 feat(p10-1a-1): add IngestReport skip counters + SkipExamples
Adds five new u32 counters (skipped_gitignore, skipped_kebabignore,
skipped_builtin_blacklist, skipped_generated, skipped_size_exceeded)
and a SkipExamples struct (≤5 sample paths per category) to
IngestReport. All new fields are #[serde(default)] for backward-compat
deserialization. Downstream literal construction sites patched with
zeros/empty; snapshot re-baked.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-15 15:28:19 +09:00
th-kim0823
7329ba96ee fix(p10-1a-1): patch missed SearchHit test-only construction sites
Add repo: None, code_lang: None to the 3 SearchHit struct literals
inside #[cfg(test)] blocks that were missed by the fa4eeb5 sweep.
2026-05-15 15:17:10 +09:00
th-kim0823
fa4eeb5a87 feat(p10-1a-1): add SearchHit.repo / code_lang + SearchFilters.repo / code_lang
Wire two new optional fields onto SearchHit (skip_serializing_if = None)
and two Vec<String> filter fields onto SearchFilters (serde default).
Add RetrievalDetail::Default impl (manual, uses SearchMode::Hybrid as
sentinel). Patch all downstream SearchHit / SearchFilters literal
constructors with repo: None / code_lang: None / vec![] as appropriate.
Also covers Citation::Code arm in kebab-eval metrics match.
2026-05-15 15:04:23 +09:00
th-kim0823
3b1e878aed feat(p10-1a-1): add Citation::Code variant
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-15 14:39:18 +09:00
th-kim0823
b954e9ce66 fix(fb-39b): address PR #137 round 1 review
- CI-only embed_model.rs tests updated 384 → 1024 + e5-small → e5-large
  references (incl. file header download size, snapshot dim assert,
  L2 norm comment)
- kebab-embed-local module docs + Cargo.toml description list both
  models (small + large)
- Stale tracing message expanded with both model sizes
- Task spec Post-merge deviation section: record dropped
  embedding_dim_mismatch ErrorV1 + reason (LanceDB (model, dim)
  namespacing makes hard-error redundant)
- Task spec + HOTFIXES version bump 0.6→0.7 corrected to 0.5→0.6
  (current Cargo.toml = 0.5.0; fb-42 0.6 cut deferred per user
  direction)
- HOTFIXES "embedding_version bump 아님" line corrected — cascade rule
  DOES trigger release bump, plus deviation note for the dropped error

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-10 23:45:55 +09:00
th-kim0823
69c94b6692 feat(embed,config): add multilingual-e5-large + flip default config (fb-39b)
Task 1: Add multilingual-e5-large arm to kebab-embed-local::resolve_model with tests for 1024-dim variants and error cases.

Task 2: Flip kebab-config defaults from e5-small (384-dim) to e5-large (1024-dim) across defaults(), test assertions, and TOML template.

All tests pass; clippy clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-10 23:05:36 +09:00
th-kim0823
5870a1de15 fix(fb-39): address PR #136 round 1 review
kebab eval compare now surfaces precision_at_k_chunk delta in both
human-readable table + deltas JSON. Snapshot fixture regenerated
additively.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-10 22:39:11 +09:00
th-kim0823
bb0ec0469f feat(eval): precision_at_k_chunk metric (P@5, P@10) (fb-39) 2026-05-10 22:26:21 +09:00
th-kim0823
b53376e96e fix(fb-42): address PR #134 round 1 review
- print_schema_text plain mode: include bulk_search capability row
- README: tool count 7 → 8, fetch added to MCP tool name lists

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-10 21:19:20 +09:00
th-kim0823
441f1192ee docs(fb-42): wire schema + README + SMOKE + design + SKILL + INDEX
- Add bulk_search_item.v1 + bulk_search_response.v1 wire schemas
- Register both in WIRE_SCHEMAS const
- README: --bulk flag mention + MCP tool list 7→8 (bulk_search)
- SMOKE: bulk multi-query walkthrough (CLI + MCP equivalent)
- Design §2.2: Bulk multi-query (fb-42) subsection (additive minor)
- SKILL: mcp__kebab__bulk_search section + tool table row
- Task spec status open→completed, banner replaced
- INDEX: fb-42 row 머지 (rerank hint deferred)
- Fix: missed Capabilities {bulk_search} in cli wire.rs test (Task 7 leftover)
- Fix: missed tools.len() 7→8 in cli_mcp_smoke (Task 5 leftover)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-10 21:07:36 +09:00
th-kim0823
e8da415624 feat(schema): bulk_search capability flag (fb-42)
- Capabilities.bulk_search: true (snapshot)
- schema.v1 wire required list updated

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-10 20:49:09 +09:00
th-kim0823
d8e5f35601 test(mcp): integration tests for bulk_search tool (fb-42) 2026-05-10 20:33:32 +09:00
th-kim0823
6ab0d782ef feat(mcp): kebab__bulk_search tool (fb-42)
Exposes bulk multi-query search via MCP `bulk_search` tool:
- Input: { queries: [SearchInput shapes...] }, capped at 100
- Output: bulk_search_response.v1 with per-query results + summary
- Sequential execution reuses App instance for cache amortization
- Per-query errors embed error.v1 JSON; never aborts bulk call

Updates tool count from 7 to 8 in lib.rs comment + tools_list test.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-10 20:31:20 +09:00
th-kim0823
2bbe94eb05 test(cli): integration tests for kebab search --bulk (fb-42) 2026-05-10 20:26:07 +09:00
th-kim0823
9ac13fa256 fix(cli): make query optional when --bulk is set (fb-42) 2026-05-10 20:26:03 +09:00
th-kim0823
67f2c16cc2 feat(cli): kebab search --bulk flag + stdin ndjson + output stream (fb-42) 2026-05-10 20:22:45 +09:00
th-kim0823
1ebbd6b711 feat(app): bulk_search_with_config facade (fb-42) 2026-05-10 20:18:49 +09:00
th-kim0823
892175d009 feat(core): BulkSearchItem / Summary / Response types (fb-42) 2026-05-10 20:12:31 +09:00
th-kim0823
b35f163f56 fix(fb-40): address PR #132 round 1 review
Module doc still pinned "rag-v1" — update to reflect dispatched
template via system_prompt_for (rag-v1 legacy / rag-v2 default).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-10 19:42:57 +09:00
th-kim0823
0e8b800b6b test(rag): integration tests for rag-v1/v2/unknown dispatch (fb-40)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-10 19:18:36 +09:00
th-kim0823
126559ce7a fix(fb-40): update test fixtures for rag-v2 default 2026-05-10 19:15:15 +09:00
th-kim0823
137fc4ee31 feat(config): default prompt_template_version rag-v1 → rag-v2 (fb-40) 2026-05-10 19:04:55 +09:00
th-kim0823
59f01f8185 feat(rag): pipeline reads prompt_template_version via helper (fb-40) 2026-05-10 19:02:39 +09:00
th-kim0823
9f70681b77 feat(rag): SYSTEM_PROMPT_RAG_V2 + system_prompt_for dispatch helper (fb-40) 2026-05-10 19:01:05 +09:00
th-kim0823
67aee9f480 test(cli): integration tests for score_kind on lexical mode (fb-38) 2026-05-10 18:12:14 +09:00
th-kim0823
4440fa6659 fix(fb-38): add score_kind to remaining SearchHit literals
Add missing score_kind field to SearchHit constructors in:
- kebab-tui/tests/search.rs::make_hit()
- kebab-eval/tests/metrics_and_compare.rs::hit()
- kebab-eval/src/metrics.rs::hit()

All test fixtures default to Rrf (hybrid mode), matching the field's
Default impl and the test semantics.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-10 18:08:29 +09:00
th-kim0823
b51cdb9e8f feat(search/hybrid): fuse hits override score_kind to Rrf (fb-38) 2026-05-10 17:56:56 +09:00
th-kim0823
4e739f3cd8 feat(search): add score_kind to VectorRetriever (Cosine) and hybrid test helpers (Rrf)
This commit unblocks Tasks 3 and 4 of fb-38:
- VectorRetriever::build_hit now labels hits with ScoreKind::Cosine
- Hybrid retriever test helpers (mk_hit functions) label synthetic hits with ScoreKind::Rrf
- Updated lexical snapshot fixture to reflect new score_kind field in output

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-10 17:54:16 +09:00
th-kim0823
3a621bba0d feat(search/lexical): label hits with ScoreKind::Bm25 (fb-38 task 2)
- Add ScoreKind::Bm25 to LexicalRetriever::build_hit SearchHit construction
- Import ScoreKind from kebab_core in lexical.rs
- Add integration test lexical_retriever_hits_carry_bm25_score_kind to verify all
  hits from LexicalRetriever carry score_kind == ScoreKind::Bm25
- Update lexical snapshot test baseline to include new score_kind field

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-10 17:54:11 +09:00
th-kim0823
3c605b1a5d feat(core): ScoreKind enum + SearchHit.score_kind (fb-38) 2026-05-10 17:49:02 +09:00
th-kim0823
6a33d08aea fix(fb-37): address PR #129 round 1 review
- doc TraceFusionInput.fusion_score semantics (single-mode vs hybrid)
- comment why total_ms vs stage sum can drift (millis truncation)
- TODO marker on TUI trace popup filter passthrough

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-10 16:26:34 +09:00
th-kim0823
a40593590b docs(fb-37): wire schema + README + SMOKE + INDEX + SKILL 2026-05-10 14:13:47 +09:00
th-kim0823
5687cbc0e2 feat(tui): search pane t-key opens TracePopup (fb-37) 2026-05-10 13:39:11 +09:00
th-kim0823
653e432a30 feat(mcp): kebab__search trace input + output mirror (fb-37)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-10 13:32:30 +09:00
th-kim0823
f7e2072d66 test(cli): integration tests for --trace + schema breakdowns (fb-37)
Also fixes App::search_with_opts trace branch to use NoopRetriever
for SearchMode::Lexical, removing the embeddings requirement when
the user only wants lexical-mode trace.
2026-05-10 13:21:33 +09:00
th-kim0823
72c227af23 feat(cli): kebab search --trace flag + wire trace + pretty print (fb-37)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-10 13:08:48 +09:00
th-kim0823
69037c313a feat(app): SearchResponse.trace + opts.trace threading (fb-37)
Adds the `trace: Option<SearchTrace>` field to `SearchResponse` and
threads `SearchOpts.trace` through `App::search_with_opts`. When the
caller sets `opts.trace = true` the path bypasses the LRU search cache
and runs through `HybridRetriever::search_with_trace`, which dispatches
all 3 SearchModes internally; this means `--trace` requires embeddings
(same constraint as `--mode hybrid`). The non-trace path keeps its
exact prior behavior with `trace: None` stamped on the response.

Picked up Task 1 / Task 3 follow-ups in the same commit so the
workspace compiles: SearchOpts struct-literals in kebab-cli/main.rs +
kebab-mcp/tools/search.rs default the new `trace` field to false, and
the schema-wrapper test in kebab-cli/wire.rs fills the new
media_breakdown / lang_breakdown / index_bytes / stale_doc_count fields
on Stats with `Default::default()`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-10 13:01:18 +09:00
th-kim0823
6a067e3ab1 feat(search): HybridRetriever::search_with_trace (fb-37) 2026-05-10 12:38:53 +09:00
th-kim0823
231d80e82d feat(stats): media/lang/bytes/stale fields on schema.v1.stats (fb-37)
Extends CountSummary with media_breakdown, lang_breakdown, stale_doc_count
fields populated via stats_ext::breakdowns(). Adds count_summary_with_threshold
for callers that need real stale counts. Mirrors all new fields onto the
wire-bound Stats struct in kebab-app::schema with #[serde(default)] for
backwards-compat. Also fixes search_budget_integration.rs for the trace field
added to SearchOpts in Task 1.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-10 12:34:57 +09:00
th-kim0823
69c6e23432 feat(store): breakdowns + index_bytes helpers (fb-37)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-10 12:24:43 +09:00
th-kim0823
1e943f21dc feat(core): SearchTrace + IndexBytes types + SearchOpts.trace (fb-37)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-10 12:17:04 +09:00
th-kim0823
84287d0ef6 fix(fb-36): address PR #127 round 1 review
- ingested_after: convert OffsetDateTime to UTC before formatting
  so non-Z offsets compare correctly against UTC TEXT storage
  (lexical.rs + filters.rs)
- README: --tag is repeatable-only, not csv (only --media is csv)
- test(cli): add multi-value --tag OR-within IN-list coverage
- test(store): add UTC-offset regression test for ingested_after
- mcp: use ERROR_V1_ID const instead of hardcoded "error.v1"

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-10 04:47:55 +09:00
th-kim0823
b06f4654e7 feat(mcp): kebab__search filter inputs (fb-36)
7 new optional inputs on SearchInput: tags, lang, path_glob,
trust_min, media, ingested_after, doc_id. Validation surfaces as
error.v1 code = invalid_input via StructuredError. Dispatch builds
SearchFilters from the inputs and forwards through the existing
search_with_opts_with_config facade.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-10 04:11:27 +09:00
th-kim0823
4e0379c04f test(cli): wire_search_filters — lexical-only integration tests (fb-36)
Cover: --doc-id scoping, --ingested-after validation error,
--media md alias, --tag repeatable + frontmatter parsing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-10 04:06:21 +09:00
th-kim0823
6a18847892 feat(cli): kebab search filter flags (fb-36)
7 new flags: --tag (repeatable), --lang, --path-glob,
--trust-min (value_enum), --media (csv with `md` alias),
--ingested-after (RFC3339; config_invalid on parse fail),
--doc-id. Dispatch translates clap values into SearchFilters
and propagates structured errors through the existing
StructuredError wrapper from fb-34.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-10 03:57:55 +09:00
th-kim0823
c6cc1e2bfe feat(search/vector): media / ingested_after / doc_id filters (fb-36)
filter_chunks helper in kebab-store-sqlite extended with the same 3
WHERE clauses as lexical. Vector still over-fetches k*2 then
post-filters via SqliteStore::filter_chunks; small k can return < k
hits when filters drop a lot — agent is expected to widen k or
paginate. AND combinator with existing filters.

- kebab-store-sqlite/src/filters.rs: media IN-list subquery, ingested_after
  lexicographic >= compare, doc_id equality; mirrors lexical SQL arms
- 3 direct unit tests (filter_chunks_media_type/ingested_after/doc_id)
  that run without AVX/Lance
- common/mod.rs: insert_doc / insert_doc_with_media / run_vector_search
  helpers on HybridEnv for integration-test use
- hybrid.rs: 2 new #[ignore = "requires AVX..."] integration tests
  (vector_filter_by_media, vector_filter_by_doc_id)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-10 03:50:56 +09:00
th-kim0823
86475e5ba2 fix(search/lexical): use std::iter::repeat_n (clippy)
Per code review on 2c80e2a. manual-repeat-n lint triggers
for Rust 1.94+ when repeat().take() can be expressed as
repeat_n directly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-10 03:43:51 +09:00