Files
kebab/tasks/p1/p1-5-chunk.md
altair823 f9714aa5cb docs(rename): kb → kebab — README, tasks/, docs/, design doc, report
마지막 commit. 모든 .md 안의 `kb` 단어 일괄 갱신.

- 19 개 crate 이름 (`kb-core`, `kb-app`, …) → `kebab-*` (Rust 모듈
  path 표기 `kb_*` → `kebab_*` 포함).
- 미래 component (`kb-tui`, `kb-desktop`, `kb-asr-whisper`, `kb-ocr`,
  `kb-mcp`, `kb-vlm`, `kb-rerank`, `kb-vision-ocr`, `kb-index`,
  `kb-smoke`, `kb-architecture`) → `kebab-*` (P6+ 가 시작될 때
  같은 prefix 사용).
- CLI 명령 예제: `kb ingest` / `kb search` / `kb ask` / `kb init` /
  `kb doctor` / `kb inspect` / `kb list` / `kb eval` →
  `kebab <verb>`. fenced code block + 인라인 backtick 모두.
- XDG paths + env vars + binary 경로 (`target/release/kb` →
  `target/release/kebab`) 동기화.
- design doc / 최초 보고서 / SMOKE / HOTFIXES / phase epic / task
  spec 모든 reference 통일.
- task-decomposition.md 의 `git -c user.name=kb` 는 과거 git history
  기록용 author 정보라 그대로 유지 (실제 git history 의 author 는
  변경 불가).
- `tasks/phase-5-evaluation.md` 의 `status: planned` →
  `completed` 도 같이 (P5-1 + P5-2 PR 머지 후 미반영분).

## 검증

- `grep -rEn "\bkb-[a-z]|\bkb_[a-z]|\.config/kb\b|kb\.sqlite|\bKB_[A-Z]"
   --include="*.md"` 0 hits (task-decomposition.md 의 git author
  제외).
- 모든 file path reference 살아있음 (renamed file 들 모두 새 path
  로 update).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 04:01:55 +00:00

114 lines
4.3 KiB
Markdown
Raw Permalink Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
---
phase: P1
component: kebab-chunk
task_id: p1-5
title: "Markdown heading-aware chunker (md-heading-v1)"
status: completed
depends_on: [p1-4]
unblocks: [p1-6, p2-2, p3-2]
contract_source: ../../docs/superpowers/specs/2026-04-27-kebab-final-form-design.md
contract_sections: [§3.5 Chunk, §4.2 chunk_id recipe, §7.2 Chunker, §0 Q3 citation]
---
# p1-5 — Markdown heading-aware chunker
## Goal
Implement `Chunker` trait emitting `chunker_version = "md-heading-v1"`. Block-aware: respect heading boundaries, never split code/table, propagate `heading_path` and merged `source_spans`.
## Why now / why this size
The first concrete `Chunker`. Establishes how subsequent chunkers (PDF page chunker, audio segment chunker) are scoped: per-medium chunker version label. Independent of any store/embed.
## Allowed dependencies
- `kebab-core`
- `kebab-config`
- `serde`
- `blake3` (policy_hash)
- `serde-json-canonicalizer`
- `thiserror`
## Forbidden dependencies
- `kebab-source-fs`, `kebab-parse-md`, `kebab-normalize` (consumes `CanonicalDocument` only via `kebab-core`), `kebab-store-*`, `kebab-embed*`, `kebab-search`, `kebab-llm*`, `kebab-rag`, `kebab-tui`, `kebab-desktop`
## Inputs
| input | type | source |
|-------|------|--------|
| `CanonicalDocument` | `kebab_core::CanonicalDocument` | p1-4 |
| `ChunkPolicy` | `kebab_core::ChunkPolicy` | `kebab-app` from config |
## Outputs
| output | type | downstream |
|--------|------|------------|
| `Vec<Chunk>` | `kebab_core::Chunk` | `kebab-store-sqlite` (p1-6), `kebab-embed*` (P3) |
## Public surface (signatures only — no new types)
```rust
pub struct MdHeadingV1Chunker;
impl kebab_core::Chunker for MdHeadingV1Chunker {
fn chunker_version(&self) -> kebab_core::ChunkerVersion;
fn policy_hash(&self, policy: &kebab_core::ChunkPolicy) -> String;
fn chunk(&self, doc: &kebab_core::CanonicalDocument, policy: &kebab_core::ChunkPolicy) -> anyhow::Result<Vec<kebab_core::Chunk>>;
}
```
`policy_hash` = `blake3(canonical_json(policy))` hex truncated to 16 chars.
## Behavior contract
- Priority order (per design §0 / report §14):
1. heading boundary first
2. never split a code block
3. table stays in a single chunk if possible
4. long sections split by paragraph
5. propagate `heading_path` from blocks
6. carry merged `source_spans` (each chunk lists every contributing block's span)
7. record `chunker_version = "md-heading-v1"` and `policy_hash`
- `target_tokens` and `overlap_tokens` from `ChunkPolicy`. Token estimate is byte-based proxy until a real tokenizer is introduced (note in `Chunk.token_estimate`).
- `chunk_id` per design §4.2: tagged tuple of `(doc_id, chunker_version, block_ids, policy_hash)`.
- `block_ids` listed in document order (significant — affects ID).
- ImageRef / AudioRef blocks are emitted as their own chunks (text portion = alt + caption preview if present, else empty string with `token_estimate=0`). They still receive `chunk_id` so future image/audio search can locate them.
## Storage / wire effects
- None directly. Outputs feed p1-6.
## Test plan
| kind | description | fixture / data |
|------|-------------|----------------|
| unit | heading boundary respected (no chunk crosses H2 → H2) | inline |
| unit | code block of 800 tokens stays in one chunk even when target=500 | inline |
| unit | table block stays single chunk if size < 2× target | inline |
| unit | long paragraph split with overlap_tokens applied | inline |
| unit | ImageRefBlock produces a chunk with token_estimate=0 | inline |
| determinism | identical input + identical policy → identical chunk_ids | inline |
| snapshot | `fixtures/markdown/long-section.md` → Vec<Chunk> JSON stable | fixture |
All tests under `cargo test -p kebab-chunk`.
## Definition of Done
- [ ] `cargo check -p kebab-chunk` passes
- [ ] `cargo test -p kebab-chunk` passes
- [ ] Snapshot stable across two runs
- [ ] No imports outside Allowed dependencies
- [ ] PR links design §3.5, §4.2
## Out of scope
- DB persistence (p1-6).
- Embedding (P3).
- Reranking / hybrid (P3).
## Risks / notes
- Token estimate proxy: a real tokenizer (e.g., sentencepiece for the embedding model) replaces this in P3. The proxy must err toward overestimation so chunks fit in real tokenizer budget.
- Changing `chunker_version` invalidates all downstream embedding records. Bump only with PR documenting the migration plan (design §9).