Files
kebab/tasks/p1/p1-3-parse-md-blocks.md
altair823 f9714aa5cb docs(rename): kb → kebab — README, tasks/, docs/, design doc, report
마지막 commit. 모든 .md 안의 `kb` 단어 일괄 갱신.

- 19 개 crate 이름 (`kb-core`, `kb-app`, …) → `kebab-*` (Rust 모듈
  path 표기 `kb_*` → `kebab_*` 포함).
- 미래 component (`kb-tui`, `kb-desktop`, `kb-asr-whisper`, `kb-ocr`,
  `kb-mcp`, `kb-vlm`, `kb-rerank`, `kb-vision-ocr`, `kb-index`,
  `kb-smoke`, `kb-architecture`) → `kebab-*` (P6+ 가 시작될 때
  같은 prefix 사용).
- CLI 명령 예제: `kb ingest` / `kb search` / `kb ask` / `kb init` /
  `kb doctor` / `kb inspect` / `kb list` / `kb eval` →
  `kebab <verb>`. fenced code block + 인라인 backtick 모두.
- XDG paths + env vars + binary 경로 (`target/release/kb` →
  `target/release/kebab`) 동기화.
- design doc / 최초 보고서 / SMOKE / HOTFIXES / phase epic / task
  spec 모든 reference 통일.
- task-decomposition.md 의 `git -c user.name=kb` 는 과거 git history
  기록용 author 정보라 그대로 유지 (실제 git history 의 author 는
  변경 불가).
- `tasks/phase-5-evaluation.md` 의 `status: planned` →
  `completed` 도 같이 (P5-1 + P5-2 PR 머지 후 미반영분).

## 검증

- `grep -rEn "\bkb-[a-z]|\bkb_[a-z]|\.config/kb\b|kb\.sqlite|\bKB_[A-Z]"
   --include="*.md"` 0 hits (task-decomposition.md 의 git author
  제외).
- 모든 file path reference 살아있음 (renamed file 들 모두 새 path
  로 update).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 04:01:55 +00:00

109 lines
4.8 KiB
Markdown

---
phase: P1
component: kebab-parse-md (blocks submodule)
task_id: p1-3
title: "Markdown body → Block tree with line spans"
status: completed
depends_on: [p0-1]
unblocks: [p1-4]
contract_source: ../../docs/superpowers/specs/2026-04-27-kebab-final-form-design.md
contract_sections: [§3.4 Block, §3.4 SourceSpan, §3.7b kebab-parse-types, §0 Q3 citation]
---
# p1-3 — Markdown body → Block tree
## Goal
Parse Markdown body bytes into a flat `Vec<kebab_parse_types::ParsedBlock>` with heading paths and line ranges preserved, ready for `kebab-normalize` to lift into `CanonicalDocument`.
## Why now / why this size
This is the heaviest part of P1 parser. Separating it from frontmatter and from normalization keeps each piece tractable. Determinism of line ranges directly determines citation quality (design §0 Q3 / §3.4 SourceSpan::Line).
## Allowed dependencies
- `kebab-core`
- `kebab-parse-types` (defines `ParsedBlock`, `ParsedPayload`, `Warning`)
- `pulldown-cmark` (CommonMark with source-map; GFM tables enabled via feature)
- `serde`
- `thiserror`
## Forbidden dependencies
- `kebab-store-*`, `kebab-llm*`, `kebab-rag`, `kebab-embed*`, `kebab-search`, `kebab-source-fs`, `kebab-chunk`, `kebab-normalize`, `kebab-tui`, `kebab-desktop`, `comrak` (alternative parser; pick one)
## Inputs
| input | type | source |
|-------|------|--------|
| Markdown body bytes | `&[u8]` | extractor (after frontmatter stripped) |
| `body_offset_lines` | `u32` | extractor (so line ranges are reported relative to original file) |
## Outputs
| output | type | downstream |
|--------|------|------------|
| `Vec<kebab_parse_types::ParsedBlock>` | shared type from `kebab-parse-types` | `kebab-normalize` |
| `Vec<kebab_parse_types::Warning>` | shared type | propagated into Provenance |
## Public surface (signatures only — no new types)
```rust
pub fn parse_blocks(
body: &[u8],
body_offset_lines: u32,
) -> anyhow::Result<(Vec<kebab_parse_types::ParsedBlock>, Vec<kebab_parse_types::Warning>)>;
```
`ParsedBlock` is defined in `kebab-parse-types` (design §3.7b). `kebab-parse-md` does NOT define its own; it consumes the shared type. Lift to `kebab_core::Block` (with `BlockId` assignment) is `kebab-normalize`'s job (p1-4).
## Behavior contract
- Source-map: each `ParsedBlock` carries `SourceSpan::Line { start, end }` relative to the original file (i.e., add `body_offset_lines`).
- Heading tree: every block records its ancestor heading texts in order (e.g., `["아키텍처", "Chunking 정책"]`).
- Code blocks: `ParsedPayload::Code { lang: Some("rust"), code }` — fenced content not split.
- Tables: GFM tables produce `ParsedPayload::Table { headers, rows }`; if a table cell is malformed, fall back to `ParsedPayload::Paragraph` + `Warning::MalformedTable`.
- Image references: `![alt](src)` produces `ParsedPayload::ImageRef { src, alt }`. `AssetId` resolution happens later in `kebab-normalize` (when image src can be matched to a workspace asset).
- Lists: ordered/unordered preserved via `ParsedPayload::List { ordered, items }`; nested list items flattened so each `items[i]` is a `Vec<kebab_core::Inline>` for one top-level item.
- Inline elements: only `Text`, `Code`, `Link`, `Strong`, `Emph` (per `kebab_core::Inline` per design §3.4). Drop other inlines silently.
- Malformed input never panics. Worst case: empty `Vec<ParsedBlock>` + `Warning::ExtractFailed`.
## Storage / wire effects
- None.
## Test plan
| kind | description | fixture / data |
|------|-------------|----------------|
| unit | heading tree depth + heading_path correctness | inline |
| unit | code block lang tag preserved | inline |
| unit | GFM table parses; malformed table degrades to paragraph + warning | inline |
| unit | line range correct under various line-ending styles (LF / CRLF) | inline |
| unit | image ref captured with src/alt | inline |
| unit | nested list flattens correctly | inline |
| unit | malformed input does not panic | inline (random byte slices) |
| snapshot | `fixtures/markdown/nested-headings.md` → ParsedBlock JSON stable | fixture |
| snapshot | `fixtures/markdown/code-and-table.md` → JSON stable | fixture |
All tests under `cargo test -p kebab-parse-md --lib blocks`.
## Definition of Done
- [ ] `cargo check -p kebab-parse-md` passes
- [ ] `cargo test -p kebab-parse-md blocks` passes
- [ ] Snapshot tests stable across two runs
- [ ] No imports outside Allowed dependencies
- [ ] PR links design §3.4
## Out of scope
- Frontmatter (p1-2).
- Lifting `kebab_parse_types::ParsedBlock``kebab_core::Block` with `BlockId` (p1-4 normalize).
- Chunking (p1-5).
## Risks / notes
- `pulldown-cmark` source-map may not include exact byte ranges for all event kinds; line ranges are the binding contract per design (line-range citation is the primary form for Markdown).
- CRLF normalization: convert internally to LF for span math but report line numbers from the original byte stream.