kebab/integrations/claude-code/kebab/SKILL.md
altair823 8a2f7affa6 feat(mcp): fb-41 PR-5 — MCP ask multi_hop arg + SKILL.md guidance
**PR-5** of fb-41 multi-hop RAG (lands right after PR-4 merged). Sister surface to
PR-4's CLI `--multi-hop` flag: agents (MCP hosts such as Claude Code) can pass the
`multi_hop: true` option when calling `mcp__kebab__ask`.

Design: docs/superpowers/specs/2026-05-25-p9-fb-41-multi-hop-rag-design.md
Plan: docs/superpowers/plans/2026-05-25-p9-fb-41-multi-hop-rag.md (PR-5 section)

## MCP surface

- `crates/kebab-mcp/src/tools/ask.rs`:
  - Added `AskInput.multi_hop: Option<bool>`. The JsonSchema derive reflects it in
    tools/list automatically, so agent capability discovery sees the new field.
  - `handle()` sets `AskOpts.multi_hop = input.multi_hop.unwrap_or(false)`, so
    existing callers (field omitted / null) stay single-pass.
- `crates/kebab-mcp/src/lib.rs` (tools/list):
  - One multi-hop line added to the `ask` tool description (decompose → retrieve →
    synthesize, 2-5× LLM cost, per-hop trace on Answer.hops).

## SKILL.md guidance

- In the `mcp__kebab__ask` section of `integrations/claude-code/kebab/SKILL.md`:
  - Added `multi_hop: false` to the input-shape JSON example.
  - Added `hops` (multi-hop only) to the Returns bullet.
  - New bullet (p9-fb-41): opt-in conditions / cost trade-off / use cases
    (compound questions / prereq chains / cross-doc reasoning) /
    per-hop trace shape of `Answer.hops` / handling of the
    `multi_hop_decompose_failed` refusal.

## Tests (new `tests/tools_call_ask_multi_hop.rs`, 2 Ollama-free pins)

- `ask_tool_routes_multi_hop_true_to_decompose_first`: pins dispatch
  divergence. An invalid LLM endpoint (`http://127.0.0.1:1`,
  request_timeout_secs=2) forces the model to be unreachable. multi_hop=true
  calls decompose first → `error.v1` (code=model_unreachable) /
  isError=true. multi_hop=false (single-pass) retrieves first on an empty KB
  → no LLM call → `answer.v1` grounded=false / isError=false. The two
  diverging shapes show that dispatch really routes to different paths.
- `ask_input_schema_advertises_multi_hop_field`: the JsonSchema of AskInput
  exposes the `multi_hop` property. This is a regression pin for MCP host
  capability discovery (the input schema in tools/list).

The AskInput literal in the existing `tools_call_ask.rs` also gains `multi_hop: None`
(minimal cascade from the new struct field).

## Unchanged

- `prompt_template_version` (`rag-multi-hop-v1`): as before.
- TUI surface: PR-6's responsibility.
- error.v1 mapping: PR-4's enum reservation as before (no error_wire
  promotion).

## Verification

- `cargo test -p kebab-mcp -j 1`: the 2 new tools_call_ask_multi_hop tests plus
  the existing ask / search / bulk_search / fetch / ingest / schema / doctor /
  tools_list / initialize suites all pass (no regressions).
- `cargo clippy -p kebab-mcp --all-targets -j 1 -- -D warnings` clean.
- Single-crate serial build (16 GB RAM constraint).

## Next PRs

- PR-6: TUI Ask panel multi-hop toggle (F2 / Ctrl-T) + hop trace rendering +
  cheatsheet update.
- v0.18.0 cut (after PR-6 merges).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-25 09:06:28 +00:00


name: kebab
description: Local knowledge base + RAG over the user's pre-indexed documents (wiki crawls, Markdown notes, PDFs, images). Use when answering questions that need internal context the user has indexed locally — e.g. team-specific procedures, internal runbooks, infrastructure docs, credentials registries, project-specific conventions. Also use when a domain question (Kubernetes, MLOps, internal tooling, etc.) needs additional grounding from indexed docs before answering. Do NOT use for general public questions, code in the working directory, or anything obviously outside the indexed corpus.

kebab — local KB / RAG access

kebab indexes the user's personal documents and exposes them via lexical / vector / hybrid search and a local-LLM RAG answer. All output speaks frozen wire schema v1 — every JSON record carries a schema_version field.

Two surfaces ship: an MCP server (kebab mcp, preferred — process stays hot across calls) and a CLI (~/.cargo/bin/kebab, fallback for hosts without MCP).

When to invoke

Trigger when the user's question matches any of:

  • Refers to internal/organization-specific systems, procedures, or jargon that a generic public answer would miss.
  • Names a runbook or procedure the user is likely to have indexed ("how do I X", "what's our policy on Y", "where's the doc for Z").
  • Domain-technical question where additional internal context (custom CRDs, internal naming, team conventions) would change the answer vs. a generic public answer.
  • User explicitly references "the wiki", "내부 문서" ("internal docs"), "kb", or asks "do we have docs on X".

Skip when:

  • The question is about public OSS, language semantics, or anything in the current working directory.
  • The user is editing kebab's own source — that's a code task, not a KB query.
  • A previous kebab call in this session already returned grounded: false on a near-identical query (don't loop).

User-specific trigger keywords (team names, system names, internal acronyms) belong in a per-user override of this SKILL.md, not in this repo-shipped version.

MCP tools (preferred)

When kebab is registered as an MCP server (see ~/.claude/mcp.json example below), eight tools are exposed as mcp__kebab__<name>:

tool | purpose | mutation
--- | --- | ---
mcp__kebab__search | corpus search → search_response.v1 ({hits, next_cursor, truncated}) | no
mcp__kebab__bulk_search | N queries in one call → bulk_search_response.v1 ({results, summary}) | no
mcp__kebab__ask | RAG answer → answer.v1 | no
mcp__kebab__fetch | verbatim text → fetch_result.v1 (chunk / doc / span) | no
mcp__kebab__schema | capability discovery → schema.v1 | no
mcp__kebab__doctor | health check → doctor.v1 | no
mcp__kebab__ingest_file | save single file → ingest_report.v1 | yes
mcp__kebab__ingest_stdin | save markdown blob → ingest_report.v1 | yes

Mutation tools require explicit user intent — never auto-invoke.
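A host that routes tool calls itself could enforce this rule with a small guard. The following is a minimal Python sketch. The tool names come from the table above. `may_invoke` and its `user_requested` flag are hypothetical names, not part of kebab or MCP.

```python
# Tool names as listed in the table above, split by the "mutation" column.
READ_ONLY_TOOLS = {
    "mcp__kebab__search",
    "mcp__kebab__bulk_search",
    "mcp__kebab__ask",
    "mcp__kebab__fetch",
    "mcp__kebab__schema",
    "mcp__kebab__doctor",
}
MUTATING_TOOLS = {
    "mcp__kebab__ingest_file",
    "mcp__kebab__ingest_stdin",
}


def may_invoke(tool: str, user_requested: bool) -> bool:
    """Return True if an agent may call `tool` right now.

    Read-only tools are always allowed. Mutating tools are allowed only when
    the user explicitly asked for the write. Unknown names are rejected.
    """
    if tool in READ_ONLY_TOOLS:
        return True
    if tool in MUTATING_TOOLS:
        return user_requested
    return False
```

Unknown tool names are refused rather than allowed, so a tool added in a later kebab version stays blocked until someone classifies it.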

mcp__kebab__search — when you need the source

Use when the user wants to find a doc, or when you (the model) need raw chunks to reason from before answering.

Input:

{ "query": "<query>", "mode": "hybrid", "k": 10, "max_tokens": null, "snippet_chars": null, "cursor": null, "tags": null, "lang": null, "path_glob": null, "trust_min": null, "media": null, "ingested_after": null, "doc_id": null, "trace": null }
  • mode = "hybrid" is the default-correct choice. Use "vector" for semantic-only ("docs about X concept"), "lexical" for exact strings ("the literal flag --foo-bar").
  • max_tokens / snippet_chars / cursor (p9-fb-34) — agent budget controls. Set max_tokens to cap result wire size (chars/4 estimate); set cursor to the previous response's next_cursor to fetch the next page.
  • p9-fb-36 filter inputs: tags (string array; OR-within, AND across keys), lang (BCP-47 language code), path_glob (glob pattern matched against doc path), trust_min ("primary" | "secondary" | "generated"; includes that level and above), media (string array: IN-list of "markdown" | "pdf" | "image" | "audio" | "other"; alias "md" → "markdown"), ingested_after (RFC3339 UTC string), doc_id (exact doc UUID). AND combinator across keys. Invalid ingested_after or unknown trust_min → error.v1.code = invalid_input. Unknown media value → empty hits, no error.
  • Output is search_response.v1: { hits: search_hit.v1[], next_cursor: string|null, truncated: bool }. Iterate response.hits[] for individual hits. Key hit fields: rank, score, doc_path, heading_path[], section_label, snippet, citation (line range / page), chunk_id.
  • hits[].score_kind (p9-fb-38): "rrf" (hybrid) / "bm25" (lexical) / "cosine" (vector). Declares the meaning of the top-level score — it is a ranking signal, not a confidence value. If you need a trust threshold, use retrieval.lexical_score (BM25 raw) / retrieval.vector_score (cosine raw) instead of the top-level score.
  • Cite back to the user as doc_path § heading_path[-1] so they can open the source.
  • When truncated: true, the budget loop modified the page (snippet shortening or k reduction). next_cursor is independent — non-null whenever more hits may be reachable. Caller may widen max_tokens (re-issue same query for fuller snippets / more hits per page) or follow next_cursor (advance through more hits) or both. Mismatched cursor (corpus_revision changed) returns error.v1.code = stale_cursor — re-issue the search to obtain a fresh one.
  • trace: true (p9-fb-37) — debug aid. Response carries an extra trace block: lexical[] + vector[] (pre-fusion candidates), rrf_inputs[] (RRF union before final cut), and timing (lexical_ms, vector_ms, fusion_ms, total_ms). Trace bypasses the search cache (always cold). Use sparingly — it bloats the wire response and is for diagnosing "why did this hit / not hit", not normal retrieval.
  • hint (v0.17.0): optional advisory string on search_response.v1. Present only when the result is empty AND the trimmed query is shorter than the FTS5 trigram tokenizer's 3-char minimum. Surface it to the user instead of retrying the same short query. Korean lexical search benefits most from ≥3-char keywords: 충돌 ("conflict", 2 chars) gets zero hits, while 충돌은 (the same word plus a topic particle, 3 chars) gets substring hits. Raw FTS5 mode ('...') opts out, because the user opted into FTS5 syntax. Vector / hybrid modes carry the field too, but it is rarely triggered (semantic embeddings handle short queries).
  • Column scoping (post-v0.17.1 dogfood) — default lexical / hybrid matching is scoped to the text column only. The heading_path column is indexed (path segments like app, src plus JSON punctuation are 3-gram'd) but excluded from the default MATCH — past JSON noise produced false positives where a query coincidentally shared a 3-gram with a file's heading path. To deliberately search heading paths, escape into raw FTS5 mode with an explicit column filter: 'heading_path : <token>' (e.g. kebab search "'heading_path : agent'"). Same applies to MCP search — quote the inner expression. Raw mode bypasses both column scoping and the 3-char hint short-circuit.
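The input, filter, paging, and citation rules above can be sketched in Python as follows. `search` is a placeholder for whatever function your host uses to call mcp__kebab__search and return the parsed search_response.v1. It and the helper names are hypothetical. Handling of stale_cursor is omitted. On that error, restart from `cursor = None`.

```python
from typing import Callable, Optional

# Accepted trust_min values, per the filter list above.
TRUST_LEVELS = ("primary", "secondary", "generated")


def build_search_input(query: str, *, mode: str = "hybrid", k: int = 10,
                       cursor: Optional[str] = None, **filters) -> dict:
    """Build a mcp__kebab__search argument object.

    `filters` holds optional keys such as tags, lang, path_glob, trust_min,
    media, ingested_after, doc_id or max_tokens. An unknown trust_min is
    rejected locally, since kebab would answer error.v1 invalid_input anyway.
    """
    trust_min = filters.get("trust_min")
    if trust_min is not None and trust_min not in TRUST_LEVELS:
        raise ValueError(f"unknown trust_min: {trust_min!r}")
    args = {"query": query, "mode": mode, "k": k, "cursor": cursor}
    args.update(filters)
    return args


def collect_hits(search: Callable[[dict], dict], query: str,
                 max_pages: int = 3, **filters) -> list:
    """Follow next_cursor page by page until it is null or max_pages is hit."""
    hits: list = []
    cursor: Optional[str] = None
    for _ in range(max_pages):
        response = search(build_search_input(query, cursor=cursor, **filters))
        hits.extend(response["hits"])  # hits sit under .hits since fb-34
        cursor = response.get("next_cursor")
        if cursor is None:
            break
    return hits


def cite(hit: dict) -> str:
    """Render a hit as `doc_path § heading_path[-1]` for the user."""
    heading_path = hit.get("heading_path") or []
    if not heading_path:
        return hit["doc_path"]
    return f"{hit['doc_path']} § {heading_path[-1]}"
```

Capping the number of pages keeps an agent loop from walking an entire broad result set when the first page or two already answer the question.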

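mcp__kebab__bulk_search: N queries in one call

Runs N queries in a single call, which makes agent loops more efficient. Each query takes the same input shape as mcp__kebab__search (query is required; everything else is optional). Capped at 100 queries per call.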

Input:

{"queries": [{"query": "..."}, {"query": "...", "mode": "lexical"}, ...]}

Output: a bulk_search_response.v1 envelope with results: [bulk_search_item.v1] (each item = {query, response | null, error | null}) and summary: {total, succeeded, failed}. A per-query failure is isolated in that item's error field, and the remaining queries keep running.
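For more than 100 queries, a caller has to split them into batches. The following minimal Python sketch does that and separates failed items from successful ones. `bulk_search` is a placeholder for your host's call to mcp__kebab__bulk_search, and `bulk_search_all` is a hypothetical helper name.

```python
from typing import Callable

BULK_CAP = 100  # max queries per mcp__kebab__bulk_search call (see above)


def bulk_search_all(bulk_search: Callable[[dict], dict],
                    queries: list) -> tuple:
    """Run any number of query strings in batches of at most BULK_CAP.

    Returns (succeeded, failed) as lists of bulk_search_item.v1 objects.
    A failed item never aborts the remaining queries or batches.
    """
    succeeded: list = []
    failed: list = []
    for start in range(0, len(queries), BULK_CAP):
        batch = [{"query": q} for q in queries[start:start + BULK_CAP]]
        envelope = bulk_search({"queries": batch})
        for item in envelope["results"]:
            if item.get("error") is not None:
                failed.append(item)
            else:
                succeeded.append(item)
    return succeeded, failed
```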

mcp__kebab__ask — when you need the answer

Use when the user wants a synthesized answer, not a list of links.

Input:

{ "query": "<question>", "session_id": "<optional-slug>", "mode": "hybrid", "multi_hop": false }
  • Returns answer.v1: answer (markdown), citations[], grounded (bool), refusal_reason, model, conversation_id, turn_index, hops (multi-hop only).
  • If grounded == false → KB doesn't have enough context. Don't paraphrase the refusal as if it were an answer. Tell the user the KB came up dry and fall back to your own knowledge or ask for the source.
  • For follow-up turns on the same topic, pass session_id (e.g. "team-onboarding-2026-05") and reuse it across the conversation. Sessions persist until kebab reset --data-only.
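  • p9-fb-40: the default prompt_template_version is "rag-v2". Answers are stricter. Facts are quoted as verbatim spans, the model may not draw on its trained knowledge, and when the evidence is ambiguous the answer may say "확실하지 않다" ("not certain"). If the user explicitly sets [rag] prompt_template_version = "rag-v1", the legacy behavior applies.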
  • p9-fb-41 multi_hop: true opts the ask into the multi-hop pipeline. The query is decomposed into sub-questions, and each is retrieved independently (an LLM-driven decide loop of up to rag.multi_hop_max_depth iterations). The answer is then synthesized over the merged chunk pool. Cost trade-off: 2-5× the LLM calls of single-pass. Use it for compound questions ("X 와 Y 의 차이는?", i.e. "what's the difference between X and Y?"), prerequisite chains, and cross-doc reasoning where one chunk alone is insufficient. Don't use it for simple fact-finding (single-pass is faster and cheaper). When set, answer.v1.hops[] carries the per-hop trace ({iter, kind, sub_queries[], context_chunks_added, forced_stop, llm_call_ms}). Surface a brief "Searched in N hops" note when the trace is non-trivial. A decompose failure (the model emitted non-JSON) yields refusal_reason = "multi_hop_decompose_failed"; treat it like any other refusal.
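The answer-handling rules above can be summarized in one function. This minimal Python sketch turns a parsed answer.v1 into text for the user. `render_answer` is a hypothetical name. Treating "non-trivial" as "more than one hop" is an assumption made for this sketch.

```python
def render_answer(answer: dict) -> str:
    """Turn a parsed answer.v1 object into user-facing text.

    A refusal is never presented as an answer. Grounded answers list their
    cited doc_paths, and multi-hop depth is mentioned when there was more
    than one hop (assumed threshold for a "non-trivial" trace).
    """
    if not answer.get("grounded"):
        reason = answer.get("refusal_reason") or "unknown"
        return f"The KB had no grounded answer (refusal_reason={reason})."
    lines = [answer["answer"]]
    hops = answer.get("hops") or []
    if len(hops) > 1:
        lines.append(f"(Searched in {len(hops)} hops)")
    # Deduplicate doc_paths while keeping citation order.
    sources = list(dict.fromkeys(c["doc_path"] for c in answer.get("citations", [])))
    if sources:
        lines.append("Sources: " + ", ".join(sources))
    return "\n".join(lines)
```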

mcp__kebab__fetch — when you need raw text

Use after search to read the verbatim chunk text + surrounding context, or to pull a full doc / line range.

Input:

{ "kind": "chunk", "chunk_id": "<id>", "context": 2 }
{ "kind": "doc", "doc_id": "<id>", "max_tokens": 1000 }
{ "kind": "span", "doc_id": "<id>", "line_start": 1, "line_end": 5 }
  • chunk mode: context: N returns ordinal-adjacent chunks before/after for surrounding paragraphs.
  • doc mode: full normalized markdown. max_tokens (chars/4) caps the response — truncated: true when applied.
  • span mode: 1-based inclusive line range. PDF / audio docs reject as error.v1.code = span_not_supported (use chunk mode instead — PDF chunks are page-aligned).
  • error.v1.code = chunk_not_found / doc_not_found are non-retryable from the same id — re-issue search to get a fresh one.
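The three fetch modes, together with the span-to-chunk fallback for PDF and audio documents, could be wrapped as shown in the following minimal Python sketch. `fetch` is a placeholder for your host's call to mcp__kebab__fetch. The sketch assumes an error comes back as the error.v1 object parsed into a dict with a `code` key. `fetch_args` and `fetch_span_or_chunk` are hypothetical helpers.

```python
from typing import Callable


def fetch_args(kind: str, **kw) -> dict:
    """Build mcp__kebab__fetch arguments for chunk, doc, or span mode."""
    if kind == "chunk":
        args = {"kind": "chunk", "chunk_id": kw["chunk_id"]}
        if "context" in kw:
            args["context"] = kw["context"]  # ordinal-adjacent chunks before/after
        return args
    if kind == "doc":
        args = {"kind": "doc", "doc_id": kw["doc_id"]}
        if "max_tokens" in kw:
            args["max_tokens"] = kw["max_tokens"]  # chars/4 cap
        return args
    if kind == "span":
        start, end = kw["line_start"], kw["line_end"]
        if not 1 <= start <= end:
            raise ValueError("span lines are 1-based and inclusive")
        return {"kind": "span", "doc_id": kw["doc_id"],
                "line_start": start, "line_end": end}
    raise ValueError(f"unknown fetch kind: {kind!r}")


def fetch_span_or_chunk(fetch: Callable[[dict], dict], doc_id: str,
                        line_start: int, line_end: int, chunk_id: str) -> dict:
    """Try span mode; on span_not_supported (PDF / audio), use chunk mode."""
    result = fetch(fetch_args("span", doc_id=doc_id,
                              line_start=line_start, line_end=line_end))
    if result.get("code") == "span_not_supported":
        return fetch(fetch_args("chunk", chunk_id=chunk_id, context=1))
    return result
```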

CLI fallback

If MCP tools aren't in scope (host without MCP support, or mcp.json not configured), call the CLI via Bash:

kebab search "<query>" --mode hybrid --json 2>/dev/null
kebab ask "<question>" --json 2>/dev/null
kebab ask "<question>" --session <stable-id> --json 2>/dev/null
kebab ask "<question>" --stream  # ndjson answer_event.v1 on stderr, final answer.v1 on stdout

Same wire shapes as MCP. CLI pays cold start (~1-2s) per call — prefer MCP when available.
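An agent harness that shells out to the CLI can build argv lists instead of shell strings, which sidesteps quoting problems with user queries. The following minimal Python sketch uses only the flags shown above and assumes `kebab` is on PATH. The helper names are hypothetical.

```python
import json
import subprocess
from typing import Optional


def search_argv(query: str, mode: str = "hybrid") -> list:
    """argv for `kebab search "<query>" --mode <mode> --json`."""
    return ["kebab", "search", query, "--mode", mode, "--json"]


def ask_argv(question: str, session: Optional[str] = None) -> list:
    """argv for `kebab ask "<question>" [--session <id>] --json`."""
    argv = ["kebab", "ask", question]
    if session is not None:
        argv += ["--session", session]
    return argv + ["--json"]


def run_kebab(argv: list) -> dict:
    """Run the CLI and parse stdout as one JSON value.

    stderr (progress / warnings) is discarded, like `2>/dev/null` above.
    """
    proc = subprocess.run(argv, stdout=subprocess.PIPE,
                          stderr=subprocess.DEVNULL, text=True, check=False)
    return json.loads(proc.stdout)
```

For example, `run_kebab(ask_argv("how do we rotate certs?", session="team-onboarding-2026-05"))` returns the parsed answer.v1.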

--stream (p9-fb-33) emits retrieval_done → token* → final events on stderr while the answer streams; the final stdout line is the standard answer.v1 for backwards compat. Use it when you need progressive token consumption; otherwise the default non-streaming path is simpler. Refusal paths (score-gate / no-chunks) emit retrieval_done and then no token/final events; read the stdout answer.v1 for the canonical refusal signal.
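The event order described above can be checked on a completed stream. The following minimal Python sketch assumes you have already extracted each ndjson line's event kind as a string. The field that carries the kind is not named here, so check docs/wire-schema/v1 for the answer_event.v1 shape. `classify_stream` is a hypothetical helper.

```python
def classify_stream(kinds: list) -> str:
    """Classify a finished `kebab ask --stream` event sequence.

    Expected shape: retrieval_done, then zero or more token, then final.
    A lone retrieval_done is the refusal path (read stdout answer.v1).
    """
    if not kinds or kinds[0] != "retrieval_done":
        return "malformed"
    rest = kinds[1:]
    if not rest:
        return "refusal"
    if rest[-1] != "final" or any(kind != "token" for kind in rest[:-1]):
        return "malformed"
    return "answered"
```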

MCP host config

Register kebab mcp once in your host's MCP config. For Claude Code, edit ~/.claude/mcp.json:

{
  "mcpServers": {
    "kebab": {
      "command": "kebab",
      "args": ["mcp"]
    }
  }
}

Claude Code spawns kebab mcp at session start; the process stays alive across all tool calls so SQLite / Lance / fastembed are hot after the first call (~1-2s cold, sub-100ms thereafter). For Cursor / OpenAI Agents / Copilot CLI host examples plus per-tool input/output reference, see docs/mcp-usage.md in the kebab repo.
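If ~/.claude/mcp.json already registers other servers, the kebab entry should be merged in rather than overwriting the file. The following minimal Python sketch does that. It is meant to be run once by the user, not by an agent. `add_kebab_server` and `update_file` are hypothetical helpers.

```python
import json
from pathlib import Path


def add_kebab_server(config: dict) -> dict:
    """Return a copy of an MCP host config with the kebab server registered.

    Other entries under mcpServers and other top-level keys are preserved.
    """
    merged = dict(config)
    servers = dict(merged.get("mcpServers", {}))
    servers["kebab"] = {"command": "kebab", "args": ["mcp"]}
    merged["mcpServers"] = servers
    return merged


def update_file(path: Path) -> None:
    """Read (or start) the config file, merge kebab in, write it back."""
    config = json.loads(path.read_text()) if path.exists() else {}
    path.write_text(json.dumps(add_kebab_server(config), indent=2) + "\n")
```

Usage: `update_file(Path.home() / ".claude" / "mcp.json")`.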

Parsing tips

  • MCP tools return JSON content blocks; CLI prints one JSON value to stdout, progress / warnings to stderr. Capture stdout only: kebab search ... --json 2>/dev/null.
  • search output can be large for broad queries. Project relevant fields when summarizing — for CLI: jq '.hits[] | {rank, doc_path, heading: .heading_path[-1], snippet}' (note: .hits[], not .[] — fb-34 wrapped the array). Use --max-tokens N (CLI) / max_tokens (MCP) to cap wire size in advance.
  • Pagination: search_response.v1.next_cursor is opaque base64 — pass back as --cursor (CLI) or cursor (MCP) for the next page. null means no more hits. corpus_revision mismatch returns error.v1.code = stale_cursor — re-issue search to obtain a fresh cursor.
  • search_response.v1.truncated = true means budget forced snippet shortening or k reduction. Independent of next_cursor: widen max_tokens for fuller snippets, follow next_cursor for more hits, or both.
  • ask's citations[] mirrors search_hit.v1 minus retrieval internals — same doc_path / citation shape.
  • Schema reference lives in the kebab repo at docs/wire-schema/v1/*.schema.json if a field is unclear.
  • search_hit.v1 and answer.v1.citations[] carry indexed_at (RFC3339) + stale (bool). When stale == true, the source doc hasn't been re-processed since config.search.stale_threshold_days. Surface this caveat to the user when summarizing — the cited snapshot may not reflect current reality.
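For hosts that parse JSON in code rather than through jq, the following minimal Python sketch mirrors the projection above. It also keeps the stale flag, so the caveat can be surfaced. `project_hits` and `stale_caveat` are hypothetical helpers, and the caveat wording is only an example.

```python
def project_hits(response: dict, limit: int = 5) -> list:
    """Python counterpart of the jq projection above, plus the stale flag."""
    projected = []
    for hit in response["hits"][:limit]:
        heading_path = hit.get("heading_path") or []
        projected.append({
            "rank": hit["rank"],
            "doc_path": hit["doc_path"],
            "heading": heading_path[-1] if heading_path else None,  # jq yields null too
            "snippet": hit["snippet"],
            "stale": hit.get("stale", False),
        })
    return projected


def stale_caveat(projected: list) -> str:
    """Return a one-line caveat naming stale sources, or "" if none."""
    stale = [p["doc_path"] for p in projected if p["stale"]]
    if not stale:
        return ""
    return "Note: these sources may be outdated: " + ", ".join(stale)
```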

Capability discovery

Before using streaming or multi-turn features, probe what this binary supports — call mcp__kebab__schema (or CLI kebab schema --json):

Returns schema.v1: wire.schemas (supported wire ids), capabilities (bool flags — e.g. streaming_ask, rag_multi_turn), models (version cascade 6-axis), stats (doc/chunk/asset count + last_ingest_at, plus p9-fb-37 health surface: media_breakdown per-kind doc counts (5 zero-padded keys: markdown / pdf / image / audio / other), lang_breakdown per BCP-47 lang (NULL keyed as the literal string "null"), index_bytes.{sqlite,lancedb} on-disk byte sums, stale_doc_count for docs older than config.search.stale_threshold_days). Gate streaming / session flows on capabilities.streaming_ask / capabilities.rag_multi_turn being true. Cheap call (no LLM), once per session.
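Gating on those flags might look like the following minimal Python sketch. It treats a missing flag as unsupported. `plan_features` and `ask_args` are hypothetical helpers. The ask arguments use the input shape shown in the mcp__kebab__ask section.

```python
from typing import Optional


def plan_features(schema: dict) -> dict:
    """Read the two capability flags from a parsed schema.v1 object."""
    caps = schema.get("capabilities") or {}
    return {
        "stream": caps.get("streaming_ask") is True,
        "sessions": caps.get("rag_multi_turn") is True,
    }


def ask_args(question: str, schema: dict,
             session_id: Optional[str] = None) -> dict:
    """Build mcp__kebab__ask arguments, adding session_id only when supported."""
    args = {"query": question, "mode": "hybrid"}
    if session_id is not None and plan_features(schema)["sessions"]:
        args["session_id"] = session_id
    return args
```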

Quick health check

If a call fails or returns suspicious output, call mcp__kebab__doctor (or CLI kebab doctor) first — it surfaces config-load / data-dir / Ollama-reachability problems in one line each. Don't silently retry on errors; report the doctor output.

Workflow recipes

Recipe A — user asks an internal-context question, you want grounded answer:

  1. Call mcp__kebab__ask (or CLI kebab ask "<question>" --json).
  2. If grounded, cite citations[].doc_path in your reply and quote the answer (translate / condense as needed).
  3. If !grounded, call mcp__kebab__search with the same query and look at top 3 hits — sometimes content exists but RAG threshold rejected it. If hits look relevant, summarize from snippets and cite. If still nothing, tell the user.
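The control flow of Recipe A can be sketched in Python as follows. `ask` and `search` are placeholders for your host's tool calls, and `recipe_a` is a hypothetical name. The relevance judgment in step 3 is left to the model, so the sketch only gathers the top three snippets for it.

```python
from typing import Callable


def recipe_a(ask: Callable[[dict], dict], search: Callable[[dict], dict],
             question: str) -> dict:
    """Ask first; if not grounded, fall back to the top 3 search hits."""
    answer = ask({"query": question})
    if answer.get("grounded"):
        return {"source": "ask", "answer": answer["answer"],
                "cite": [c["doc_path"] for c in answer.get("citations", [])]}
    top = search({"query": question, "mode": "hybrid", "k": 3})["hits"][:3]
    if top:
        # The model still has to decide whether these snippets are relevant.
        return {"source": "search", "snippets": [h["snippet"] for h in top],
                "cite": [h["doc_path"] for h in top]}
    return {"source": "none"}  # tell the user the KB came up dry
```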

Recipe B — domain question where internal context might exist:

  1. Call mcp__kebab__search with key terms (cheap — no LLM).
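  2. If there are no hits or the top hit is weak, answer from general knowledge without mentioning the KB. Judge weakness from retrieval.vector_score (e.g. cosine < ~0.3) or from the snippet itself, not from the top-level score, which is only a ranking signal (see score_kind).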
  3. If top hit is relevant, fold its content into your answer and cite doc_path.

Recipe C — user wants to know "what's in the KB about X":

  1. Call mcp__kebab__search with the topic.
  2. List unique doc_paths back to the user as a discovery surface.
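Step 2 amounts to deduplicating doc_paths in rank order, because several chunks of one document often appear in the same result page. The following minimal Python sketch shows this. `unique_doc_paths` is a hypothetical helper.

```python
def unique_doc_paths(response: dict) -> list:
    """Return each doc_path once, in order of its best-ranked hit."""
    ordered = sorted(response["hits"], key=lambda hit: hit["rank"])
    # dict.fromkeys keeps first-seen order while dropping duplicates.
    return list(dict.fromkeys(hit["doc_path"] for hit in ordered))
```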

Recipe D — agent fetched a web doc, save to KB:

When you've fetched a markdown article (e.g. via WebFetch) that the user might query later:

  1. Call mcp__kebab__ingest_stdin with:
    • content: the markdown body
    • title: a stable title (article H1 or page title)
    • source_uri: the URL you fetched from

The doc lands in <workspace.root>/_external/<hash>.md and is indexed for search / ask immediately. Subsequent calls with identical content are no-ops (content-hash dedup).

Don't loop ingest the same article — dedup makes it safe but wastes embedding cost.

For files already on disk the user references, prefer mcp__kebab__ingest_file with the path — kebab handles the copy + dedup.
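kebab already deduplicates by content hash, but an agent can skip the redundant call entirely by remembering what it ingested in the current session. The following minimal Python sketch assumes the user has already asked for the save, since ingest is a mutation. `IngestGuard` is hypothetical, and its SHA-256 digest is a local choice that need not match kebab's internal hash.

```python
import hashlib
from typing import Optional


class IngestGuard:
    """Session-local guard against re-ingesting the same article."""

    def __init__(self) -> None:
        self._seen: set = set()

    def args_if_new(self, content: str, title: str,
                    source_uri: str) -> Optional[dict]:
        """Return mcp__kebab__ingest_stdin arguments, or None if already sent."""
        digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
        if digest in self._seen:
            return None
        self._seen.add(digest)
        return {"content": content, "title": title, "source_uri": source_uri}
```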

Don't

  • Don't auto-invoke mcp__kebab__ingest_file / mcp__kebab__ingest_stdin / kebab ingest / kebab reset / kebab init. Those mutate state — the user must explicitly request.
  • Don't pass user-supplied raw text into the query without trimming — long queries (> a few hundred chars) waste embedding budget. Extract the question.
  • Don't fabricate doc_paths. If you didn't see a doc in search / ask output, it's not in the KB.
  • Don't use kebab tui from a skill — it's interactive only.
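The query-trimming rule can be approximated with a simple heuristic. The following minimal Python sketch prefers the last sentence ending in "?", and otherwise truncates at a word boundary. The 300-char limit is an assumption standing in for "a few hundred chars", and `to_query` is a hypothetical helper. The heuristic can misfire on text such as URLs that contain "?".

```python
import re

MAX_QUERY_CHARS = 300  # assumption: "a few hundred chars"


def to_query(raw: str, limit: int = MAX_QUERY_CHARS) -> str:
    """Reduce pasted user text to a compact search / ask query."""
    text = " ".join(raw.split())  # collapse newlines and runs of spaces
    questions = re.findall(r"[^.?!]*\?", text)
    if questions:
        text = questions[-1].strip()  # keep the last question asked
    if len(text) <= limit:
        return text
    cut = text[:limit]
    space = cut.rfind(" ")
    return cut[:space] if space > 0 else cut  # avoid splitting a word
```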