Final commit. Bulk update of the word `kb` across all .md files.

- 19 crate names (`kb-core`, `kb-app`, …) → `kebab-*` (including the Rust module path form `kb_*` → `kebab_*`).
- Future components (`kb-tui`, `kb-desktop`, `kb-asr-whisper`, `kb-ocr`, `kb-mcp`, `kb-vlm`, `kb-rerank`, `kb-vision-ocr`, `kb-index`, `kb-smoke`, `kb-architecture`) → `kebab-*` (the same prefix is used once P6+ starts).
- CLI command examples: `kb ingest` / `kb search` / `kb ask` / `kb init` / `kb doctor` / `kb inspect` / `kb list` / `kb eval` → `kebab <verb>`, in both fenced code blocks and inline backticks.
- XDG paths, env vars, and the binary path (`target/release/kb` → `target/release/kebab`) synchronized.
- All references unified across the design doc, initial report, SMOKE, HOTFIXES, phase epics, and task specs.
- `git -c user.name=kb` in task-decomposition.md is author information recorded from past git history, so it is kept as-is (the author in the actual git history cannot be changed).
- `status: planned` → `completed` in `tasks/phase-5-evaluation.md` as well (left unreflected after the P5-1 + P5-2 PRs were merged).

## Verification

- `grep -rEn "\bkb-[a-z]|\bkb_[a-z]|\.config/kb\b|kb\.sqlite|\bKB_[A-Z]" --include="*.md"` gives 0 hits (excluding the git author in task-decomposition.md).
- All file path references resolve (every renamed file is updated to its new path).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
109 lines
4.8 KiB
Markdown
---
phase: P1
component: kebab-parse-md (blocks submodule)
task_id: p1-3
title: "Markdown body → Block tree with line spans"
status: completed
depends_on: [p0-1]
unblocks: [p1-4]
contract_source: ../../docs/superpowers/specs/2026-04-27-kebab-final-form-design.md
contract_sections: [§3.4 Block, §3.4 SourceSpan, §3.7b kebab-parse-types, §0 Q3 citation]
---

# p1-3 — Markdown body → Block tree

## Goal

Parse Markdown body bytes into a flat `Vec<kebab_parse_types::ParsedBlock>` with heading paths and line ranges preserved, ready for `kebab-normalize` to lift into `CanonicalDocument`.

## Why now / why this size

This is the heaviest part of the P1 parser. Separating it from frontmatter and from normalization keeps each piece tractable. The determinism of line ranges directly determines citation quality (design §0 Q3 / §3.4 SourceSpan::Line).

## Allowed dependencies

- `kebab-core`
- `kebab-parse-types` (defines `ParsedBlock`, `ParsedPayload`, `Warning`)
- `pulldown-cmark` (CommonMark with source-map; GFM tables enabled via feature)
- `serde`
- `thiserror`

## Forbidden dependencies

- `kebab-store-*`, `kebab-llm*`, `kebab-rag`, `kebab-embed*`, `kebab-search`, `kebab-source-fs`, `kebab-chunk`, `kebab-normalize`, `kebab-tui`, `kebab-desktop`, `comrak` (alternative parser; pick one)

## Inputs

| input | type | source |
|-------|------|--------|
| Markdown body bytes | `&[u8]` | extractor (after frontmatter stripped) |
| `body_offset_lines` | `u32` | extractor (so line ranges are reported relative to original file) |

## Outputs

| output | type | downstream |
|--------|------|------------|
| `Vec<kebab_parse_types::ParsedBlock>` | shared type from `kebab-parse-types` | `kebab-normalize` |
| `Vec<kebab_parse_types::Warning>` | shared type | propagated into Provenance |

## Public surface (signatures only — no new types)

```rust
pub fn parse_blocks(
    body: &[u8],
    body_offset_lines: u32,
) -> anyhow::Result<(Vec<kebab_parse_types::ParsedBlock>, Vec<kebab_parse_types::Warning>)>;
```

`ParsedBlock` is defined in `kebab-parse-types` (design §3.7b). `kebab-parse-md` does NOT define its own; it consumes the shared type. Lift to `kebab_core::Block` (with `BlockId` assignment) is `kebab-normalize`'s job (p1-4).

## Behavior contract

- Source-map: each `ParsedBlock` carries `SourceSpan::Line { start, end }` relative to the original file (i.e., add `body_offset_lines`).
- Heading tree: every block records its ancestor heading texts in order (e.g., `["아키텍처", "Chunking 정책"]`).
- Code blocks: `ParsedPayload::Code { lang: Some("rust"), code }`; fenced content is not split.
- Tables: GFM tables produce `ParsedPayload::Table { headers, rows }`; if a table cell is malformed, fall back to `ParsedPayload::Paragraph` + `Warning::MalformedTable`.
- Image references: `![alt](src)` produces `ParsedPayload::ImageRef { src, alt }`. `AssetId` resolution happens later in `kebab-normalize` (when the image src can be matched to a workspace asset).
- Lists: ordered/unordered preserved via `ParsedPayload::List { ordered, items }`; nested list items are flattened so each `items[i]` is a `Vec<kebab_core::Inline>` for one top-level item.
- Inline elements: only `Text`, `Code`, `Link`, `Strong`, `Emph` (the variants of `kebab_core::Inline`, design §3.4). Drop other inlines silently.
- Malformed input never panics. Worst case: empty `Vec<ParsedBlock>` + `Warning::ExtractFailed`.

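The heading-tree rule in the behavior contract can be sketched as a stack keyed by heading level. This is a minimal, std-only sketch rather than the crate's implementation: `HeadingTracker`, `enter`, and `path` are hypothetical names, and whether a heading block includes its own text in its path is left open by this spec.

```rust
/// Hypothetical helper (not defined by this spec): tracks the ancestor
/// heading texts that apply to the block currently being parsed.
#[derive(Default)]
struct HeadingTracker {
    // (level, text) pairs; levels strictly increase from bottom to top.
    stack: Vec<(u8, String)>,
}

impl HeadingTracker {
    /// Record a heading of `level` (1..=6). Any open heading at the same
    /// or a deeper level is closed first.
    fn enter(&mut self, level: u8, text: &str) {
        while self.stack.last().map_or(false, |(l, _)| *l >= level) {
            self.stack.pop();
        }
        self.stack.push((level, text.to_string()));
    }

    /// Heading texts from outermost to innermost.
    fn path(&self) -> Vec<String> {
        self.stack.iter().map(|(_, t)| t.clone()).collect()
    }
}

fn main() {
    let mut t = HeadingTracker::default();
    t.enter(1, "아키텍처");
    t.enter(2, "Chunking 정책");
    assert_eq!(t.path(), vec!["아키텍처", "Chunking 정책"]);

    // A sibling at level 2 replaces the previous level-2 heading.
    t.enter(2, "Storage");
    assert_eq!(t.path(), vec!["아키텍처", "Storage"]);

    // A new level-1 heading closes everything beneath the old one.
    t.enter(1, "Appendix");
    assert_eq!(t.path(), vec!["Appendix"]);
}
```

Popping on `>=` rather than `>` is what makes sibling headings replace each other instead of nesting; skipped levels (an h3 directly under an h1) simply stack without a placeholder.
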
## Storage / wire effects

- None.

## Test plan

| kind | description | fixture / data |
|------|-------------|----------------|
| unit | heading tree depth + heading_path correctness | inline |
| unit | code block lang tag preserved | inline |
| unit | GFM table parses; malformed table degrades to paragraph + warning | inline |
| unit | line range correct under various line-ending styles (LF / CRLF) | inline |
| unit | image ref captured with src/alt | inline |
| unit | nested list flattens correctly | inline |
| unit | malformed input does not panic | inline (random byte slices) |
| snapshot | `fixtures/markdown/nested-headings.md` → ParsedBlock JSON stable | fixture |
| snapshot | `fixtures/markdown/code-and-table.md` → JSON stable | fixture |

All tests under `cargo test -p kebab-parse-md --lib blocks`.

## Definition of Done

- [ ] `cargo check -p kebab-parse-md` passes
- [ ] `cargo test -p kebab-parse-md blocks` passes
- [ ] Snapshot tests stable across two runs
- [ ] No imports outside Allowed dependencies
- [ ] PR links design §3.4

## Out of scope

- Frontmatter (p1-2).
- Lifting `kebab_parse_types::ParsedBlock` → `kebab_core::Block` with `BlockId` (p1-4 normalize).
- Chunking (p1-5).

## Risks / notes

- `pulldown-cmark` source-map may not include exact byte ranges for all event kinds; line ranges are the binding contract per design (line-range citation is the primary form for Markdown).
- CRLF normalization: convert internally to LF for span math but report line numbers from the original byte stream.
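
The line-range and CRLF notes can be illustrated with a small byte-offset-to-line index. This is a sketch under stated assumptions, not the crate's code: `LineIndex` and its methods are hypothetical, lines are assumed to be 1-based, and `body_offset_lines` is assumed to be the number of lines that precede the body in the original file. Instead of rewriting CRLF to LF, it counts only `\n` bytes in the original stream, which yields the same line numbers for both line-ending styles.

```rust
use std::ops::Range;

/// Hypothetical helper: maps byte offsets in the body to 1-based line
/// numbers in the original file.
struct LineIndex {
    // Byte offset at which each body line starts.
    line_starts: Vec<usize>,
    // Assumed meaning: number of original-file lines before the body.
    body_offset_lines: u32,
}

impl LineIndex {
    fn new(body: &[u8], body_offset_lines: u32) -> Self {
        let mut line_starts = vec![0];
        for (i, b) in body.iter().enumerate() {
            // Only `\n` ends a line, so "\r\n" and "\n" count the same.
            if *b == b'\n' {
                line_starts.push(i + 1);
            }
        }
        Self { line_starts, body_offset_lines }
    }

    /// 1-based line in the original file that contains `offset`.
    fn line_of(&self, offset: usize) -> u32 {
        // Number of line starts at or before `offset`.
        let idx = self.line_starts.partition_point(|&s| s <= offset);
        idx as u32 + self.body_offset_lines
    }

    /// Inclusive (start, end) lines for a half-open byte range.
    fn span(&self, range: Range<usize>) -> (u32, u32) {
        // Use the last byte inside the range so a range that ends right
        // after a newline does not spill onto the following line.
        let last = range.end.saturating_sub(1).max(range.start);
        (self.line_of(range.start), self.line_of(last))
    }
}

fn main() {
    // Three frontmatter lines were stripped before the body.
    let crlf = LineIndex::new(b"# T\r\n\r\npara one\r\npara two\r\n", 3);
    assert_eq!(crlf.span(0..5), (4, 4)); // heading
    assert_eq!(crlf.span(7..27), (6, 7)); // two-line paragraph

    // The same document with LF endings reports the same lines.
    let lf = LineIndex::new(b"# T\n\npara one\npara two\n", 3);
    assert_eq!(lf.span(0..4), (4, 4));
    assert_eq!(lf.span(5..23), (6, 7));
}
```

A lone `\r` is not treated as a line break in this sketch; if such input needs its own line numbering, that would be a separate decision.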