closure of HOTFIXES 2026-05-22 "code_lang_breakdown chunk granularity"
LOW. Chunk-level companion of the existing doc-count metric.
- crates/kebab-store-sqlite/src/store.rs: code_lang_chunk_breakdown()
method. chunks INNER JOIN documents → COUNT(c.chunk_id) GROUP BY
metadata_json.code_lang, NULL skipped. BTreeMap<String, u32>.
+ lib unit test code_lang_chunk_breakdown_counts_chunks_not_docs
(1 rust doc + 3 chunks → rust=3 chunks vs rust=1 doc).
- crates/kebab-app/src/schema.rs: Stats.code_lang_chunk_breakdown
additive field + collect_stats builder. tests_stats_ext 의
stats_includes_code_lang_and_repo_breakdown_fields 가 신규 필드도
검증.
- docs/wire-schema/v1/schema.schema.json: 신규 additive 필드
명세 + 기존 code_lang_breakdown / repo_breakdown description
정정 ("code chunk count" → "doc count", Gemini round 2 권고).
- tasks/HOTFIXES.md: 2026-05-24 PR-C closure entry.
wire additive, schema_version bump 불필요. v0.16.x 호출 호환.
cargo test --workspace --no-fail-fast -j 1 + clippy 통과.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
101 lines
4.4 KiB
JSON
101 lines
4.4 KiB
JSON
{
|
|
"$schema": "https://json-schema.org/draft/2020-12/schema",
|
|
"$id": "https://kebab.local/wire-schema/v1/schema.schema.json",
|
|
"title": "schema.v1",
|
|
"description": "kebab introspection report — wire schemas, capabilities, model versions, and index stats.",
|
|
"type": "object",
|
|
"required": ["schema_version", "kebab_version", "wire", "capabilities", "models", "stats"],
|
|
"properties": {
|
|
"schema_version": { "const": "schema.v1" },
|
|
"kebab_version": { "type": "string" },
|
|
"wire": {
|
|
"type": "object",
|
|
"required": ["schemas"],
|
|
"properties": {
|
|
"schemas": {
|
|
"type": "array",
|
|
"items": { "type": "string", "pattern": "^[a-z_]+\\.v[0-9]+$" }
|
|
}
|
|
}
|
|
},
|
|
"capabilities": {
|
|
"type": "object",
|
|
"additionalProperties": { "type": "boolean" },
|
|
"required": [
|
|
"json_mode", "ingest_progress", "ingest_cancellation",
|
|
"rag_multi_turn", "search_cache", "incremental_ingest",
|
|
"streaming_ask", "http_daemon", "mcp_server", "single_file_ingest", "bulk_search"
|
|
]
|
|
},
|
|
"models": {
|
|
"type": "object",
|
|
"required": [
|
|
"parser_version", "chunker_version", "embedding_version",
|
|
"prompt_template_version", "index_version", "corpus_revision"
|
|
],
|
|
"properties": {
|
|
"parser_version": { "type": "string" },
|
|
"chunker_version": { "type": "string" },
|
|
"embedding_version": { "type": "string" },
|
|
"prompt_template_version": { "type": "string" },
|
|
"index_version": { "type": "string" },
|
|
"corpus_revision": { "type": "integer", "minimum": 0 }
|
|
}
|
|
},
|
|
"stats": {
|
|
"type": "object",
|
|
"required": ["doc_count", "chunk_count", "asset_count", "last_ingest_at"],
|
|
"properties": {
|
|
"doc_count": { "type": "integer", "minimum": 0 },
|
|
"chunk_count": { "type": "integer", "minimum": 0 },
|
|
"asset_count": { "type": "integer", "minimum": 0 },
|
|
"last_ingest_at": {
|
|
"anyOf": [
|
|
{ "type": "string", "format": "date-time" },
|
|
{ "type": "null" }
|
|
]
|
|
},
|
|
"media_breakdown": {
|
|
"type": "object",
|
|
"description": "p9-fb-37: per-media-kind doc count. 5 keys (markdown/pdf/image/audio/other), zero-padded.",
|
|
"additionalProperties": { "type": "integer", "minimum": 0 }
|
|
},
|
|
"lang_breakdown": {
|
|
"type": "object",
|
|
"description": "p9-fb-37: per-language doc count. NULL lang keyed as the literal string 'null'. Map may be empty on empty corpus.",
|
|
"additionalProperties": { "type": "integer", "minimum": 0 }
|
|
},
|
|
"index_bytes": {
|
|
"type": "object",
|
|
"description": "p9-fb-37: on-disk byte sums.",
|
|
"required": ["sqlite", "lancedb"],
|
|
"properties": {
|
|
"sqlite": { "type": "integer", "minimum": 0 },
|
|
"lancedb": { "type": "integer", "minimum": 0 }
|
|
}
|
|
},
|
|
"stale_doc_count": {
|
|
"type": "integer",
|
|
"minimum": 0,
|
|
"description": "p9-fb-37: docs whose updated_at exceeds config.search.stale_threshold_days. 0 when threshold=0."
|
|
},
|
|
"code_lang_breakdown": {
|
|
"type": "object",
|
|
"description": "p10-1A-1: per-language **doc** count (one entry per indexed code document). Key = lowercase language name (e.g. 'rust', 'python'). Empty on markdown-only corpora. Pair with `code_lang_chunk_breakdown` for chunk-level granularity (one file's 200 chunks vs one doc).",
|
|
"additionalProperties": { "type": "integer", "minimum": 0 }
|
|
},
|
|
"repo_breakdown": {
|
|
"type": "object",
|
|
"description": "p10-1A-1: per-repo **doc** count. Key = repo name as detected by kebab-parse-code::repo. Empty on markdown-only corpora.",
|
|
"additionalProperties": { "type": "integer", "minimum": 0 }
|
|
},
|
|
"code_lang_chunk_breakdown": {
|
|
"type": "object",
|
|
"description": "v0.17.0 PR-C: per-language **chunk** count (closes HOTFIXES 2026-05-22 'code_lang_breakdown chunk granularity'). Companion to `code_lang_breakdown` (doc count) — chunk-level granularity is the indexing-pressure metric (a 200-chunk PDF + a 5-chunk Rust file both appear as `1 doc` but `200` vs `5` chunks). Key = lowercase language name. Empty on markdown-only corpora.",
|
|
"additionalProperties": { "type": "integer", "minimum": 0 }
|
|
}
|
|
}
|
|
}
|
|
}
|
|
}
|