diff --git a/README.md b/README.md
index 5dd9ef7..6ce9b76 100644
--- a/README.md
+++ b/README.md
@@ -89,6 +89,32 @@ kebab doctor
 
 Global flags: `--readonly` (or `KEBAB_READONLY=1`): disables every write-path command (`ingest` / `ingest-file` / `ingest-stdin` / `reset`) and exits with code 1. `--quiet`: suppresses human-readable stderr output such as progress bars and hints (exit code and stdout output are unchanged). `KEBAB_PROGRESS=plain`: prints progress to stderr as plain-text lines, even in environments without a TTY (instead of a spinner).
 
+### Interpreting scores (fb-38)
+
+`search_hit.v1.score` is a **ranking signal**, not a confidence value. The `score_kind` field declares what it means:
+
+| `score_kind` | Meaning | Range |
+|--------------|---------|-------|
+| `rrf` (hybrid) | normalized RRF | `[0, 1]`, ceiling = 1.0 (rank 1 in both channels) |
+| `bm25` (lexical) | raw BM25 | unbounded (≥ 0) |
+| `cosine` (vector) | cosine similarity | `[-1, 1]` |
+
+#### RRF formula (hybrid mode)
+
+```
+raw RRF of chunk c = Σ_m 1 / (k_rrf + rank_m(c))
+
+where m ∈ {lexical, vector} and k_rrf = config.search.rrf_k (default 60).
+With rank 1 in both channels, raw RRF = 2 / (k_rrf + 1) ≈ 0.0328.
+
+normalize: rrf_score = raw_rrf / (2 / (k_rrf + 1))
+  → rrf_score ∈ [0, 1]. Rank 1 in both channels → 1.0; present in only one channel → ceiling of 0.5.
+```
+
+What `rrf_score = 0.5` means: the chunk appeared at rank 1 in only one channel (lexical or vector). It is not 50% confidence; it is the arithmetic ceiling of the RRF formula for a single channel.
+
+An agent that needs a trust threshold should use the nested `retrieval.lexical_score` (raw BM25) / `retrieval.vector_score` (raw cosine) rather than the top-level `score`.
+
 ## Logical architecture
 
 ```mermaid
diff --git a/docs/superpowers/specs/2026-04-27-kebab-final-form-design.md b/docs/superpowers/specs/2026-04-27-kebab-final-form-design.md
index 5f5835f..90e6240 100644
--- a/docs/superpowers/specs/2026-04-27-kebab-final-form-design.md
+++ b/docs/superpowers/specs/2026-04-27-kebab-final-form-design.md
@@ -194,6 +194,7 @@ Each variant fills in only its own keys. `path` and `uri` are always filled in (`uri` is
   "schema_version": "search_hit.v1",
   "rank": 1,
   "score": 0.82,
+  "score_kind": "rrf",
   "chunk_id": "9b4a8c1e7d3f2a05",
   "doc_id": "3f9a2c10ee4d6b78",
   "doc_path": "notes/rust/kebab-architecture.md",
@@ -218,6 +219,32 @@ Each variant fills in only its own keys. `path` and `uri` are always filled in (`uri` is
 
 `retrieval.method ∈ {lexical, vector, hybrid}`. In single-channel modes, the other channel's score/rank is null.
 
+#### Score scale (fb-38)
+
+`score_kind` ∈ {`rrf`, `bm25`, `cosine`} declares the meaning of the top-level `score`. It is a **ranking signal**, not a confidence value.
+
+| `score_kind` | Mode | Meaning | Range |
+|--------------|------|---------|-------|
+| `rrf` | hybrid | normalized RRF | `[0, 1]`, ceiling = 1.0 (rank 1 in both channels) |
+| `bm25` | lexical | raw BM25 | unbounded (≥ 0) |
+| `cosine` | vector | cosine similarity | `[-1, 1]` |
+
+RRF formula (hybrid mode):
+
+```text
+raw RRF of chunk c = Σ_m 1 / (k_rrf + rank_m(c))
+
+where m ∈ {lexical, vector} and k_rrf = config.search.rrf_k (default 60).
+With rank 1 in both channels, raw RRF = 2 / (k_rrf + 1) ≈ 0.0328.
+
+normalize: rrf_score = raw_rrf / (2 / (k_rrf + 1))
+  → rrf_score ∈ [0, 1]. Rank 1 in both channels → 1.0; present in only one channel → ceiling of 0.5.
+```
+
+`rrf_score = 0.5` = the chunk appeared at rank 1 in only one channel (the arithmetic ceiling). It is not 50% confidence. An agent that needs a trust threshold should use the nested `retrieval.lexical_score` (raw BM25) / `retrieval.vector_score` (raw cosine).
+
+`score_kind` is added to wire schema v1 as an **optional** field (additive, backwards-compatible). When it is missing, interpret it as the historical default `rrf`.
+
 ### 2.3 Answer
 
 ```json
diff --git a/docs/wire-schema/v1/search_hit.schema.json b/docs/wire-schema/v1/search_hit.schema.json
index 1083104..88256e1 100644
--- a/docs/wire-schema/v1/search_hit.schema.json
+++ b/docs/wire-schema/v1/search_hit.schema.json
@@ -24,6 +24,11 @@
     "schema_version": { "const": "search_hit.v1" },
     "rank": { "type": "integer", "minimum": 1 },
     "score": { "type": "number" },
+    "score_kind": {
+      "type": "string",
+      "enum": ["rrf", "bm25", "cosine"],
+      "description": "p9-fb-38: kind of `score` value. `rrf` = RRF normalized [0,1] (hybrid mode); `bm25` = raw BM25 score (lexical-only); `cosine` = raw cosine similarity (vector-only). Older clients that omit this field can treat absence as `rrf` (the historical default)."
+    },
     "chunk_id": { "type": "string" },
     "doc_id": { "type": "string" },
     "doc_path": { "type": "string" },
diff --git a/integrations/claude-code/kebab/SKILL.md b/integrations/claude-code/kebab/SKILL.md
index f3571af..35dcd6d 100644
--- a/integrations/claude-code/kebab/SKILL.md
+++ b/integrations/claude-code/kebab/SKILL.md
@@ -55,6 +55,7 @@ Input:
 - **`max_tokens` / `snippet_chars` / `cursor` (p9-fb-34):** agent budget controls. Set `max_tokens` to cap result wire size (chars/4 estimate); set `cursor` to the previous response's `next_cursor` to fetch the next page.
 - **p9-fb-36 filter inputs:** `tags` (string array; OR-within, AND across keys), `lang` (BCP-47 language code), `path_glob` (glob pattern matched against doc path), `trust_min` (`"primary"` | `"secondary"` | `"generated"`; includes that level and above), `media` (string array; IN-list of `"markdown"` | `"pdf"` | `"image"` | `"audio"` | `"other"`; alias `"md"` → `"markdown"`), `ingested_after` (RFC3339 UTC string), `doc_id` (exact doc UUID). AND combinator across keys. Invalid `ingested_after` or unknown `trust_min` → `error.v1.code = invalid_input`. Unknown `media` value → empty hits, no error.
 - Output is `search_response.v1`: `{ hits: search_hit.v1[], next_cursor: string|null, truncated: bool }`. Iterate `response.hits[]` for individual hits. Key hit fields: `rank`, `score`, `doc_path`, `heading_path[]`, `section_label`, `snippet`, `citation` (line range / page), `chunk_id`.
+- **`hits[].score_kind` (p9-fb-38):** `"rrf"` (hybrid) / `"bm25"` (lexical) / `"cosine"` (vector). Declares the meaning of the top-level `score`, which is a **ranking signal**, not a confidence value. If you need a trust threshold, use `retrieval.lexical_score` (BM25 raw) / `retrieval.vector_score` (cosine raw) instead of the top-level `score`.
 - Cite back to the user as `doc_path § heading_path[-1]` so they can open the source.
 - When `truncated: true`, the budget loop modified the page (snippet shortening or k reduction). `next_cursor` is **independent**: it is non-null whenever more hits may be reachable. Caller may widen `max_tokens` (re-issue same query for fuller snippets / more hits per page) or follow `next_cursor` (advance through more hits) or both. Mismatched cursor (corpus_revision changed) returns `error.v1.code = stale_cursor`; re-issue the search to obtain a fresh one.
 - **`trace: true` (p9-fb-37):** debug aid. Response carries an extra `trace` block: `lexical[]` + `vector[]` (pre-fusion candidates), `rrf_inputs[]` (RRF union before final cut), and `timing` (`lexical_ms`, `vector_ms`, `fusion_ms`, `total_ms`). Trace bypasses the search cache (always cold). Use sparingly: it bloats the wire response and is for diagnosing "why did this hit / not hit", not normal retrieval.
diff --git a/tasks/INDEX.md b/tasks/INDEX.md
index 803acbc..c11d72c 100644
--- a/tasks/INDEX.md
+++ b/tasks/INDEX.md
@@ -128,7 +128,7 @@ P0~P5 are serial. P6~P9 can run in parallel after P5.
  - [p9-fb-37 trace + stats](p9/p9-fb-37-trace-and-stats.md): ✅ merged (2026-05-10)
 
 ### 🎯 0.5.0: RAG quality (with cascade: V00X + reindex)
- - [p9-fb-38 score semantics](p9/p9-fb-38-score-semantics.md): ⏳ not implemented, brainstorm needed
+ - [p9-fb-38 score semantics](p9/p9-fb-38-score-semantics.md): ✅ merged (2026-05-10)
  - [p9-fb-39 retrieval precision tuning](p9/p9-fb-39-retrieval-precision-tuning.md): ⏳ not implemented, brainstorm needed (embedding_version cascade)
  - [p9-fb-40 fact-grounded answer](p9/p9-fb-40-fact-grounded-answer.md): ⏳ not implemented, brainstorm needed (prompt_template_version cascade)
 
diff --git a/tasks/p9/p9-fb-38-score-semantics.md b/tasks/p9/p9-fb-38-score-semantics.md
index 84bcad3..ad7a301 100644
--- a/tasks/p9/p9-fb-38-score-semantics.md
+++ b/tasks/p9/p9-fb-38-score-semantics.md
@@ -3,7 +3,7 @@ phase: P9
 component: kebab-search + kebab-app + wire-schema
 task_id: p9-fb-38
 title: "Expose and document score semantics (RRF score ceiling / per-channel score separation)"
-status: open
+status: completed
 target_version: 0.5.0
 depends_on: []
 unblocks: []
@@ -14,7 +14,10 @@ source_feedback: user dogfooding 2026-05-06: Claude Code, kebab CLI
 
 # p9-fb-38: Expose and document score semantics
 
-> ⏳ **Backlog only, not implemented.** This spec is a skeleton built from dogfooding feedback. Before implementation starts, a design phase using [superpowers:brainstorming](../../docs/superpowers/) is required. Score field naming, the scope of the wire schema change, and the per-channel score exposure policy are to be settled after brainstorming.
+> ✅ **Implemented.** This spec is frozen as of the time of implementation.
+>
+> - Design: [`docs/superpowers/specs/2026-05-10-p9-fb-38-score-semantics-design.md`](../../docs/superpowers/specs/2026-05-10-p9-fb-38-score-semantics-design.md)
+> - Plan: [`docs/superpowers/plans/2026-05-10-p9-fb-38-score-semantics.md`](../../docs/superpowers/plans/2026-05-10-p9-fb-38-score-semantics.md)
 
 ## Symptoms / motivation
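
Reviewer sketch (not part of the patch): the RRF normalization documented in the README and spec hunks above can be checked with a few lines of Python. This is a hypothetical illustration, not kebab's implementation. The function name `rrf_score` and the convention of passing `None` for a channel where the chunk did not appear are assumptions; only the formula and the default `k_rrf = 60` come from the diff.

```python
from typing import Optional


def rrf_score(
    lexical_rank: Optional[int],
    vector_rank: Optional[int],
    k_rrf: int = 60,
) -> float:
    """Normalized RRF score in [0, 1] for a single chunk.

    Ranks are 1-based. None means the chunk did not appear in that
    channel (an assumed convention for this sketch).
    """
    ranks = [r for r in (lexical_rank, vector_rank) if r is not None]
    if any(r < 1 for r in ranks):
        raise ValueError("ranks are 1-based")
    # raw RRF = sum over channels of 1 / (k_rrf + rank)
    raw = sum(1.0 / (k_rrf + r) for r in ranks)
    # Best possible raw value: rank 1 in both channels.
    ceiling = 2.0 / (k_rrf + 1)
    return raw / ceiling


if __name__ == "__main__":
    print(rrf_score(1, 1))      # rank 1 in both channels
    print(rrf_score(1, None))   # rank 1 in the lexical channel only
    print(rrf_score(62, 62))    # rank 62 in both channels
```

With the default `k_rrf = 60`, rank 62 in both channels gives the same 0.5 as rank 1 in a single channel, since `2 / 122 = 1 / 61`. A score of 0.5 therefore does not by itself identify a single-channel hit, which is one more reason to treat the top-level `score` as a ranking signal and read the nested per-channel scores when a threshold is needed.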