마지막 commit. 모든 .md 안의 `kb` 단어 일괄 갱신. - 19 개 crate 이름 (`kb-core`, `kb-app`, …) → `kebab-*` (Rust 모듈 path 표기 `kb_*` → `kebab_*` 포함). - 미래 component (`kb-tui`, `kb-desktop`, `kb-asr-whisper`, `kb-ocr`, `kb-mcp`, `kb-vlm`, `kb-rerank`, `kb-vision-ocr`, `kb-index`, `kb-smoke`, `kb-architecture`) → `kebab-*` (P6+ 가 시작될 때 같은 prefix 사용). - CLI 명령 예제: `kb ingest` / `kb search` / `kb ask` / `kb init` / `kb doctor` / `kb inspect` / `kb list` / `kb eval` → `kebab <verb>`. fenced code block + 인라인 backtick 모두. - XDG paths + env vars + binary 경로 (`target/release/kb` → `target/release/kebab`) 동기화. - design doc / 최초 보고서 / SMOKE / HOTFIXES / phase epic / task spec 모든 reference 통일. - task-decomposition.md 의 `git -c user.name=kb` 는 과거 git history 기록용 author 정보라 그대로 유지 (실제 git history 의 author 는 변경 불가). - `tasks/phase-5-evaluation.md` 의 `status: planned` → `completed` 도 같이 (P5-1 + P5-2 PR 머지 후 미반영분). ## 검증 - `grep -rEn "\bkb-[a-z]|\bkb_[a-z]|\.config/kb\b|kb\.sqlite|\bKB_[A-Z]" --include="*.md"` 0 hits (task-decomposition.md 의 git author 제외). - 모든 file path reference 살아있음 (renamed file 들 모두 새 path 로 update). 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
114 lines
4.3 KiB
Markdown
114 lines
4.3 KiB
Markdown
---
|
||
phase: P1
|
||
component: kebab-chunk
|
||
task_id: p1-5
|
||
title: "Markdown heading-aware chunker (md-heading-v1)"
|
||
status: completed
|
||
depends_on: [p1-4]
|
||
unblocks: [p1-6, p2-2, p3-2]
|
||
contract_source: ../../docs/superpowers/specs/2026-04-27-kebab-final-form-design.md
|
||
contract_sections: [§3.5 Chunk, §4.2 chunk_id recipe, §7.2 Chunker, §0 Q3 citation]
|
||
---
|
||
|
||
# p1-5 — Markdown heading-aware chunker
|
||
|
||
## Goal
|
||
|
||
Implement `Chunker` trait emitting `chunker_version = "md-heading-v1"`. Block-aware: respect heading boundaries, never split code/table, propagate `heading_path` and merged `source_spans`.
|
||
|
||
## Why now / why this size
|
||
|
||
The first concrete `Chunker`. Establishes how subsequent chunkers (PDF page chunker, audio segment chunker) are scoped: per-medium chunker version label. Independent of any store/embed.
|
||
|
||
## Allowed dependencies
|
||
|
||
- `kebab-core`
|
||
- `kebab-config`
|
||
- `serde`
|
||
- `blake3` (policy_hash)
|
||
- `serde-json-canonicalizer`
|
||
- `thiserror`
|
||
|
||
## Forbidden dependencies
|
||
|
||
- `kebab-source-fs`, `kebab-parse-md`, `kebab-normalize` (consumes `CanonicalDocument` only via `kebab-core`), `kebab-store-*`, `kebab-embed*`, `kebab-search`, `kebab-llm*`, `kebab-rag`, `kebab-tui`, `kebab-desktop`
|
||
|
||
## Inputs
|
||
|
||
| input | type | source |
|
||
|-------|------|--------|
|
||
| `CanonicalDocument` | `kebab_core::CanonicalDocument` | p1-4 |
|
||
| `ChunkPolicy` | `kebab_core::ChunkPolicy` | `kebab-app` from config |
|
||
|
||
## Outputs
|
||
|
||
| output | type | downstream |
|
||
|--------|------|------------|
|
||
| `Vec<Chunk>` | `kebab_core::Chunk` | `kebab-store-sqlite` (p1-6), `kebab-embed*` (P3) |
|
||
|
||
## Public surface (signatures only — no new types)
|
||
|
||
```rust
|
||
pub struct MdHeadingV1Chunker;
|
||
|
||
impl kebab_core::Chunker for MdHeadingV1Chunker {
|
||
fn chunker_version(&self) -> kebab_core::ChunkerVersion;
|
||
fn policy_hash(&self, policy: &kebab_core::ChunkPolicy) -> String;
|
||
fn chunk(&self, doc: &kebab_core::CanonicalDocument, policy: &kebab_core::ChunkPolicy) -> anyhow::Result<Vec<kebab_core::Chunk>>;
|
||
}
|
||
```
|
||
|
||
`policy_hash` = `blake3(canonical_json(policy))` hex truncated to 16 chars.
|
||
|
||
## Behavior contract
|
||
|
||
- Priority order (per design §0 / report §14):
|
||
1. heading boundary first
|
||
2. never split a code block
|
||
3. table stays in a single chunk if possible
|
||
4. long sections split by paragraph
|
||
5. propagate `heading_path` from blocks
|
||
6. carry merged `source_spans` (each chunk lists every contributing block's span)
|
||
7. record `chunker_version = "md-heading-v1"` and `policy_hash`
|
||
- `target_tokens` and `overlap_tokens` from `ChunkPolicy`. Token estimate is byte-based proxy until a real tokenizer is introduced (note in `Chunk.token_estimate`).
|
||
- `chunk_id` per design §4.2: tagged tuple of `(doc_id, chunker_version, block_ids, policy_hash)`.
|
||
- `block_ids` listed in document order (significant — affects ID).
|
||
- ImageRef / AudioRef blocks are emitted as their own chunks (text portion = alt + caption preview if present, else empty string with `token_estimate=0`). They still receive `chunk_id` so future image/audio search can locate them.
|
||
|
||
## Storage / wire effects
|
||
|
||
- None directly. Outputs feed p1-6.
|
||
|
||
## Test plan
|
||
|
||
| kind | description | fixture / data |
|
||
|------|-------------|----------------|
|
||
| unit | heading boundary respected (no chunk crosses H2 → H2) | inline |
|
||
| unit | code block of 800 tokens stays in one chunk even when target=500 | inline |
|
||
| unit | table block stays single chunk if size < 2× target | inline |
|
||
| unit | long paragraph split with overlap_tokens applied | inline |
|
||
| unit | ImageRefBlock produces a chunk with token_estimate=0 | inline |
|
||
| determinism | identical input + identical policy → identical chunk_ids | inline |
|
||
| snapshot | `fixtures/markdown/long-section.md` → Vec<Chunk> JSON stable | fixture |
|
||
|
||
All tests under `cargo test -p kebab-chunk`.
|
||
|
||
## Definition of Done
|
||
|
||
- [ ] `cargo check -p kebab-chunk` passes
|
||
- [ ] `cargo test -p kebab-chunk` passes
|
||
- [ ] Snapshot stable across two runs
|
||
- [ ] No imports outside Allowed dependencies
|
||
- [ ] PR links design §3.5, §4.2
|
||
|
||
## Out of scope
|
||
|
||
- DB persistence (p1-6).
|
||
- Embedding (P3).
|
||
- Reranking / hybrid (P3).
|
||
|
||
## Risks / notes
|
||
|
||
- Token estimate proxy: a real tokenizer (e.g., sentencepiece for the embedding model) replaces this in P3. The proxy must err toward overestimation so chunks fit in real tokenizer budget.
|
||
- Changing `chunker_version` invalidates all downstream embedding records. Bump only with PR documenting the migration plan (design §9).
|