feat(p1-3): kb-parse-md blocks (Markdown body → ParsedBlock tree) #8

Merged
altair823 merged 10 commits from feat/p1-3-parse-md-blocks into main 2026-04-30 15:03:26 +00:00
Owner

Summary

Implements the P1-3 task spec. The `blocks` submodule of the `kb-parse-md` crate (a sibling of the P1-2 frontmatter module) parses Markdown body bytes into `Vec<kb_parse_types::ParsedBlock>`. Heading paths, line spans, and the inline filter are deterministic. The output is the input that P1-4 normalize will lift into a `CanonicalDocument`.

Based on: tasks/p1/p1-3-parse-md-blocks.md, docs/superpowers/specs/2026-04-27-kb-final-form-design.md (§3.4 / §3.7b / §0 Q3).

Implementation

`crates/kb-parse-md/src/blocks.rs`:

  • `parse_blocks(body, body_offset_lines) -> Result<(Vec<ParsedBlock>, Vec<Warning>)>`: a walker built on `pulldown_cmark::Parser::new_ext` + `into_offset_iter`
  • `WalkState` + `Frame` enum: a frame stack covering paragraph/code/list/listitem/table/quote/heading
  • `LineIndex`: `\n`-based byte→line mapping (CRLF-compatible)
  • `panic::catch_unwind` wrapper: adversarial-input fallback `(vec![], vec![Warning::ExtractFailed])`
  • Verified behavior:
    • Heading path is ancestors-only, resets at H1, and skips empty headings
    • Code/image inside a list item is flattened into the parent item's inline buffer (so the block cannot escape the list)
    • Image refs are captured from `Tag::Image { dest_url }` events (handles title leak and angle brackets)
    • The inline filter passes only Text/Code/Link/Strong/Emph
    • `body_offset_lines` overflow is detected with `checked_add` and yields an `ExtractFailed` warning plus empty output
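
The `\n`-based byte→line mapping listed above can be sketched roughly as follows. This is a minimal illustration and not the crate's actual `LineIndex`: the field name `line_starts` and the method `line_of` are assumptions made for the example. Because only `\n` terminates a line, a CRLF file maps to the same line numbers as its LF twin.

```rust
/// Hypothetical sketch of a byte-offset -> line-number index.
/// Only `\n` is treated as a line terminator, so the `\r` of a CRLF pair
/// simply belongs to the line it ends.
pub struct LineIndex {
    /// Byte offset at which each line starts; `line_starts[0]` is always 0.
    line_starts: Vec<usize>,
}

impl LineIndex {
    pub fn new(body: &[u8]) -> Self {
        let mut line_starts = vec![0];
        for (i, b) in body.iter().enumerate() {
            if *b == b'\n' {
                // The next line begins right after the newline byte.
                line_starts.push(i + 1);
            }
        }
        Self { line_starts }
    }

    /// 1-indexed line that contains the byte at `offset`.
    pub fn line_of(&self, offset: usize) -> u32 {
        // The number of line starts at or before `offset` is exactly
        // the 1-indexed line number.
        self.line_starts.partition_point(|&start| start <= offset) as u32
    }
}
```

A binary search over precomputed line starts keeps each lookup logarithmic, which matters when every block span needs two lookups.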

`crates/kb-parse-md/src/lib.rs`: adds a `parse_blocks` re-export
`crates/kb-parse-md/Cargo.toml`: `pulldown-cmark` v0.13 (default-features = false; GFM tables enabled at runtime)
`crates/kb-parse-md/tests/blocks_snapshots.rs`: 3 snapshot + 1 regenerator
`fixtures/markdown/{nested-headings,code-and-table}.blocks.snapshot.json`: baselines

Verification

  • cargo check / build (RUSTFLAGS=-D warnings) / clippy --all-targets -D warnings: clean
  • `cargo test -p kb-parse-md` → 57 unit + 6 integration tests pass
  • `cargo test --workspace`: all green
  • `cargo tree -p kb-parse-md --depth 1` → only allowed deps (kb-core, kb-parse-types, pulldown-cmark, anyhow, serde, serde_json, tracing + P1-2 shared)

Review trail

  • spec compliance: PASS-WITH-NICE-TO-FIX → 3 self-reported concerns verified, 0 must-fix
  • code quality round 1: CHANGES-REQUESTED, with Critical 3 (C1 overflow, C2 list escape, C3 empty heading) + Important 2 (I1 image byte-scan, I2 link space)
  • addressed in 7 commits (de91648, 0050cf3, d49dbc1, 73040ca, 23ff4d6, 2b6d9ab, 80123e9)
  • code quality round 2 (re-review): APPROVED, 0 must-fix-now, no impact on the snapshot baselines

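Follow-ups expected to resolve naturally (non-blocking)

  • `kb_core::Inline` serialization bug (P0-1 schema): the combination of an internally tagged enum and newtype variants makes serde fail for 4 of the 5 variants. The snapshot tests work around this with a `BlockView` projection. P1-4 normalize is expected to hotfix it through the §9 schema migration at the point where it serializes `CanonicalDocument`.
  • `ParsedPayload::Quote` cannot preserve list/code/table children (a schema limitation). A TODO comment and a pin test are in place.
  • Snapshot byte-equality, a richer adversarial corpus, and clippy pedantic are left for later hardening.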

Next

After the merge, branch `feat/p1-4-normalize` and proceed with P1-4 (kb-normalize). The `Inline` schema fix lands together with P1-4.

altair823 added 10 commits 2026-04-30 14:49:45 +00:00
Implements `kb_parse_md::parse_blocks(body, body_offset_lines)` returning a
flat `Vec<ParsedBlock>` plus warnings. Walks pulldown-cmark events through a
small frame-based state machine that tracks heading paths, accumulates
inline buffers (Text/Code/Link/Strong/Emph only — design §3.4), and
reports SourceSpan::Line spans in 1-indexed file-line coordinates.

Covers headings, paragraphs, code blocks (lang from info string), GFM
tables (with malformed fallback to paragraph + MalformedTable warning),
lists (nested sub-lists flattened into parent item), and block-level image
references. Inline images are dropped silently per the inline filter.
Adversarial inputs are caught with `catch_unwind` and degrade to an empty
output + ExtractFailed warning.

15 unit tests cover heading-path correctness, code lang, table parsing,
malformed-table fallback (driven via synthetic events since pulldown-cmark
auto-normalizes table widths), LF/CRLF line-range parity, image refs,
nested-list flattening, inline filter, and 100-iteration random-bytes plus
hand-crafted adversarial-input no-panic guards.
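
The `catch_unwind` fallback described in this commit could look roughly like the sketch below. It is an illustration under assumptions: `guarded` is a hypothetical helper name, and `Warning::ExtractFailed` is modelled here as a unit variant, although the real variant may carry data.

```rust
use std::panic::{self, AssertUnwindSafe};

/// Hypothetical stand-in for the crate's warning type.
#[derive(Debug, PartialEq, Eq)]
pub enum Warning {
    ExtractFailed,
}

/// Run `walk`; if it panics, degrade to an empty output plus one warning
/// instead of propagating the panic to the caller.
pub fn guarded<T, F>(walk: F) -> (Vec<T>, Vec<Warning>)
where
    F: FnOnce() -> Vec<T>,
{
    // AssertUnwindSafe is acceptable in this sketch because nothing the
    // closure touched is reused after a panic; partial results are dropped.
    match panic::catch_unwind(AssertUnwindSafe(walk)) {
        Ok(blocks) => (blocks, Vec::new()),
        Err(_) => (Vec::new(), vec![Warning::ExtractFailed]),
    }
}
```

Note that this only works when the binary is built with the default `panic = "unwind"` strategy; with `panic = "abort"` the process exits before the fallback can run.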
Adds two snapshot tests (`nested-headings.md`, `code-and-table.md`) under
crates/kb-parse-md/tests/blocks_snapshots.rs, with matching baseline JSON
next to each fixture. The snapshot view projects `kb_core::Inline` to
flat strings — `Inline` carries `serde(tag = "kind")` which is
incompatible with newtype variants holding a primitive (`Text(String)`),
so direct serialization of `ParsedBlock` would fail today. The view
preserves the contract that matters for P1-3 (heading paths, source
spans, payload kinds, payload text/code/table content) and will keep
working once kb-core fixes the Inline schema in a later task.

Also tightens `level_to_use >= 1 && <= 6` into `(1..=6).contains(&_)` to
satisfy `clippy::manual_range_contains`.
Removes the leftover `let _ = level_u8;` and `let _ = header_count;`
discards. The Heading frame already carries the canonical level (we
populated it from `Tag::Heading` at Start), so we destructure that
directly and ignore the redundant `TagEnd::Heading(level)` payload.
The header_count helper was dead — Frame::Table tracks `cols`
internally and we never consumed `header_count`.
`span_for` previously used `u32 + u32` directly, so callers passing a
large `body_offset_lines` could panic (debug, then masked by
`catch_unwind` and the entire body discarded) or wrap to an inverted
span with `start > end` (release).

Switch to `checked_add`; on overflow flag the walk state and at the
end of `parse_blocks_inner` discard accumulated blocks and surface a
single `Warning::ExtractFailed` carrying the offending body line.
This degrades cleanly without panicking and without emitting a
silently-broken span.

Also extend `random_bytes_do_not_panic` to mix u32::MAX-style offsets
across the fuzz iterations so the overflow path is exercised by the
randomized corpus.

New tests:
- body_offset_lines_max_returns_extract_failed
- body_offset_lines_zero_at_max_minus_one_no_overflow

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
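
The `checked_add` switch described in the commit above can be sketched as follows. `LineSpan` is a hypothetical stand-in for the real span type, and the exact signature of `span_for` in the crate is not shown in this PR, so treat both as assumptions.

```rust
/// Hypothetical stand-in for the parser's line span (1-indexed, inclusive).
#[derive(Debug, PartialEq, Eq)]
pub struct LineSpan {
    pub start: u32,
    pub end: u32,
}

/// Shift body-relative lines into file coordinates.
/// Returns `None` on u32 overflow, so the caller can flag the walk state
/// instead of panicking (debug) or wrapping to an inverted span (release).
pub fn span_for(body_start: u32, body_end: u32, body_offset_lines: u32) -> Option<LineSpan> {
    let start = body_start.checked_add(body_offset_lines)?;
    let end = body_end.checked_add(body_offset_lines)?;
    Some(LineSpan { start, end })
}
```

For example, `span_for(1, 2, 10)` yields lines 11 to 12, while `span_for(1, 2, u32::MAX)` yields `None`, which is the signal to discard the accumulated blocks and emit the warning.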
`emit_block` previously walked the frame stack looking only for a Quote
container, falling back to top-level on miss. That caused any block
emitted inside a list item — code blocks, images, tables, headings —
to escape the list and appear at the top of `blocks`, after the
entire list and out of source order. `ParsedPayload::List { items:
Vec<Vec<Inline>> }` cannot represent a child block structurally, so
the choice is between dropping content and flattening.

Extend the reverse-walk to also recognize `Frame::ListItem` and route
the block into a textual rendering appended to the item's inline
buffer (`flatten_block_into_item`):

- Code → fenced text approximation, preserving lang hint + body
- Image → `![alt](src)` text
- Audio → `[audio](src)` text
- Heading → leading hashes + text
- Quote → `> text`
- Nested List → same rendering as `nested_in_item` flatten
- Table → pipe-table approximation

Document order is preserved because flattening happens inside the
item's frame, before the item closes.

New tests:
- code_block_inside_list_item_flattens_into_parent
- image_inside_list_item_flattens_into_parent
- block_content_in_list_preserves_document_order

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
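
A reduced sketch of the flattening idea from the commit above follows. The `Block` enum here is a hypothetical simplification (the real `ParsedPayload` has more variants and carries inlines rather than strings), and the exact text rendering in the crate may differ.

```rust
/// Hypothetical, reduced block model for illustration only.
pub enum Block {
    Code { lang: Option<String>, body: String },
    Image { alt: String, src: String },
    Heading { level: usize, text: String },
    Quote { text: String },
}

/// Append a textual rendering of `block` to the enclosing item's buffer,
/// so the block stays inside the list item and keeps document order.
pub fn flatten_block_into_item(item: &mut String, block: &Block) {
    // Start the flattened block on its own line within the item.
    if !item.is_empty() && !item.ends_with('\n') {
        item.push('\n');
    }
    match block {
        Block::Code { lang, body } => {
            // Three backticks, built at runtime so this example stays fence-safe.
            let fence = "`".repeat(3);
            item.push_str(&fence);
            item.push_str(lang.as_deref().unwrap_or(""));
            item.push('\n');
            item.push_str(body);
            if !body.ends_with('\n') {
                item.push('\n');
            }
            item.push_str(&fence);
        }
        Block::Image { alt, src } => item.push_str(&format!("![{}]({})", alt, src)),
        Block::Heading { level, text } => {
            item.push_str(&"#".repeat(*level));
            item.push(' ');
            item.push_str(text);
        }
        Block::Quote { text } => {
            item.push_str("> ");
            item.push_str(text);
        }
    }
}
```

The trade-off is the one the commit names: the list payload cannot hold child blocks structurally, so a lossy text rendering is preferred over dropping the content.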
`# ` (a heading with no following text) used to seed the heading
stack with `Some("")`, which then propagated into every child block's
`heading_path` as a `""` segment — visibly polluting the path that
downstream consumers index by.

Filter empty entries from both `heading_path()` and the in-line
ancestor collection at heading-end. We deliberately keep `Some("")`
in the stack rather than skipping the assignment so the slot remains
occupied and a subsequent deeper heading is still positioned
correctly relative to its level — only the visible path drops the
empty.

New tests:
- empty_heading_does_not_pollute_path
- empty_h1_then_h2_does_not_break_stack

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
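
The "keep the slot, hide the empty segment" rule from the commit above can be illustrated with a small heading stack. `HeadingStack`, `slots`, and `set` are hypothetical names for this sketch. Only `heading_path()` is named in the commit, and its real signature is not shown.

```rust
/// Hypothetical heading stack: slot `i` holds the text of the most recent H(i+1).
#[derive(Default)]
pub struct HeadingStack {
    slots: [Option<String>; 6],
}

impl HeadingStack {
    /// Record a heading at `level` (1..=6) and clear every deeper slot.
    /// An empty heading still occupies its slot as `Some("")`.
    pub fn set(&mut self, level: usize, text: &str) {
        assert!((1..=6).contains(&level));
        self.slots[level - 1] = Some(text.to_string());
        for slot in &mut self.slots[level..] {
            *slot = None;
        }
    }

    /// Visible path for blocks under the current headings:
    /// occupied slots in order, with empty headings filtered out.
    pub fn heading_path(&self) -> Vec<String> {
        self.slots
            .iter()
            .filter_map(|slot| slot.as_deref())
            .filter(|text| !text.is_empty())
            .map(str::to_string)
            .collect()
    }
}
```

With this shape, an empty H1 followed by an H2 named "Real" gives the path `["Real"]`, and a later H1 resets the deeper slots.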
The previous block-level image detector scanned paragraph source bytes
for the literal `![alt](src)` shape. That was fragile in three ways:

- `![alt](src "title")` leaked the title into `src` (`src "title"`)
- `![alt](<https://x.com/a b>)` kept the angle brackets verbatim
- `![]()` had undefined behavior

Replace the byte-scan with state on `Frame::Paragraph` that observes
the actual `Tag::Image` events from pulldown-cmark:

- `image_count` increments on each `Start(Tag::Image)` and `image_src`
  captures `dest_url` (which already strips angle brackets and excludes
  the title).
- Text events seen while `image_depth > 0` are routed into `image_alt`
  and suppressed from the inline buffer.
- Strong/Emph/Link starts and any non-image text outside the image
  flag `non_image_text_seen`.

At `End(Paragraph)`, the paragraph is lifted to `ImageRef` iff
`image_count == 1 && !non_image_text_seen`. The byte-scanner
`match_block_image` is removed.

New tests:
- image_with_title_attribute (title dropped, no leak into src)
- image_with_angle_bracketed_url (brackets stripped)
- empty_image_alt_and_src (`![]()` pins to empty/empty)

Existing image tests (`image_ref_block_captures_src_and_alt`,
`inline_image_inside_paragraph_is_dropped`) continue to pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
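
The paragraph-level image state from the commit above can be sketched over a reduced event stream. The `Ev` and `Lifted` enums and `classify_paragraph` are hypothetical simplifications for this example. The real walker consumes pulldown-cmark events, where `dest_url` already excludes the title and angle brackets.

```rust
/// Hypothetical reduced event stream for one paragraph.
pub enum Ev<'a> {
    ImageStart { dest_url: &'a str },
    ImageEnd,
    Text(&'a str),
    StrongStart,
}

#[derive(Debug, PartialEq, Eq)]
pub enum Lifted {
    ImageRef { src: String, alt: String },
    Paragraph,
}

/// Decide at end-of-paragraph whether the paragraph is a block-level image.
pub fn classify_paragraph(events: &[Ev]) -> Lifted {
    let mut image_count = 0u32;
    let mut image_depth = 0u32;
    let mut image_src = String::new();
    let mut image_alt = String::new();
    let mut non_image_text_seen = false;

    for ev in events {
        match ev {
            Ev::ImageStart { dest_url } => {
                image_count += 1;
                image_depth += 1;
                image_src = dest_url.to_string();
            }
            Ev::ImageEnd => image_depth = image_depth.saturating_sub(1),
            // Text inside an image is alt text, not paragraph content.
            Ev::Text(t) if image_depth > 0 => image_alt.push_str(t),
            // Anything else means the paragraph is more than a lone image.
            Ev::Text(_) | Ev::StrongStart => non_image_text_seen = true,
        }
    }

    if image_count == 1 && !non_image_text_seen {
        Lifted::ImageRef { src: image_src, alt: image_alt }
    } else {
        Lifted::Paragraph
    }
}
```

Relying on parser events rather than re-scanning source bytes means the `![]()` case needs no special handling: it is one image with empty alt and empty src.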
`[multi\nline](http://x)` produced `Inline::Link.text = "multiline"`
because the SoftBreak/HardBreak handler called `push_text(" ")` —
which updates `paragraph.text` and the inline buffer, but NOT the
open link frame's flattened text accumulator. Text events flowed
through `push_link_text`; line breaks didn't.

Add `push_link_text(" ")` alongside the existing `push_text(" ")` in
the break handler so a line break inside `[ ... ](href)` collapses
to a visible space rather than disappearing.

New tests:
- link_with_soft_break_preserves_space_in_text
- link_with_hard_break_preserves_space_in_text

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
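
The fix above amounts to keeping two accumulators in sync on line breaks. The sketch below assumes a simplified `Buffers` struct with hypothetical field names. Only `push_text` and `push_link_text` are named in the commit, and their real signatures are not shown.

```rust
/// Hypothetical accumulator pair: the paragraph text plus the flattened
/// text of the currently open link, if any.
#[derive(Default)]
pub struct Buffers {
    pub paragraph_text: String,
    pub open_link_text: Option<String>,
}

impl Buffers {
    pub fn push_text(&mut self, s: &str) {
        self.paragraph_text.push_str(s);
    }

    /// No-op when no link is open.
    pub fn push_link_text(&mut self, s: &str) {
        if let Some(link) = self.open_link_text.as_mut() {
            link.push_str(s);
        }
    }

    /// Text events feed both accumulators.
    pub fn on_text(&mut self, s: &str) {
        self.push_text(s);
        self.push_link_text(s);
    }

    /// SoftBreak/HardBreak: collapse to a space in BOTH accumulators.
    /// Before the fix, only `push_text(" ")` was called here.
    pub fn on_break(&mut self) {
        self.push_text(" ");
        self.push_link_text(" ");
    }
}
```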
`ParsedPayload::Quote { text, inlines }` cannot represent block-level
children (lists, code, tables, images), so the BlockQuote end handler
silently drops them when assembling the Quote payload. This matches
§3.4 for now but is non-obvious and easy to regress without an
explicit pin.

Add a TODO(P1-future) comment near the Quote emission code and a
regression test (`quote_with_list_inside_drops_list`) that fixes the
current shape: a `> - item` blockquote produces a Quote with empty
text and empty inlines.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The quality reviewer named three specific input probes for the C1/C2/
C3 fixes. Encode each as a verbatim test so future regressions on
those exact inputs surface immediately:

- probe_overflow: parse_blocks(b"# h\nbody\n", u32::MAX) → empty +
  Warning::ExtractFailed.
- probe_list_escape: list with embedded code block → single List
  block, two items.
- probe_empty_heading: `# \n# Real\nbody\n` → body's heading_path is
  `["Real"]`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
claude-reviewer-01 reviewed 2026-04-30 14:50:07 +00:00
claude-reviewer-01 left a comment
Member

Review: all items resolved, APPROVE recommended (registered as COMMENT per gate policy)

Three internal review loops completed. Final reviewer verdict: APPROVED, 0 must-fix-now items.

Review trail

| Round | Verdict | Commits applied |
|-------|---------|-----------------|
| Spec compliance | PASS-WITH-NICE-TO-FIX (3 concerns verified, 0 must-fix) | - |
| Quality round 1 | CHANGES-REQUESTED (Critical 3 + Important 2) | `de91648`, `0050cf3`, `d49dbc1`, `73040ca`, `23ff4d6`, `2b6d9ab`, `80123e9` |
| Quality round 2 | APPROVED | - |

Key verification points

| Item | Status |
|------|--------|
| C1 span overflow → `checked_add` + `Warning::ExtractFailed` | ✅ `body_offset_lines_max_returns_extract_failed` + `u32::MAX-N` boundary verified |
| C2 code/image inside list item → flattened into the parent item | ✅ `flatten_block_into_item` reverse-walk, document order preserved |
| C3 empty heading polluting the path | ✅ `filter_map` skips empty strings in the path; the stack slot is kept so deeper levels stay positioned correctly |
| I1 image byte-scan fragility | ✅ replaced with `Tag::Image { dest_url }` event capture; title / angle-bracket / empty cases all fixed |
| I2 whitespace lost in link text (SoftBreak/HardBreak) | ✅ `push_text` + `push_link_text` updated together |
| I5 Quote schema limitation | 🟡 documented + pin test (awaiting P1-future schema migration) |

Gates passed

  • cargo check / build (RUSTFLAGS=-D warnings) / clippy --all-targets -D warnings: clean
  • `cargo test -p kb-parse-md` → 57 unit + 6 integration pass (was 41 / 6, +16 new)
  • Snapshot baselines unchanged (parser output is stable)

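Follow-ups expected to resolve naturally (non-blocking)

  • `kb_core::Inline` serialization bug: the P0-1 schema combines `#[serde(tag="kind")]` with newtype variants, so 4 of the 5 variants fail at runtime. The P1-3 snapshots work around this with the `BlockView` projection. Recommend that the §9 schema migration accompany P1-4 when it starts serializing `CanonicalDocument` to SQLite.
  • Snapshot byte-equality, a richer adversarial corpus, and clippy pedantic are left for later hardening.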

Conclusion

Nothing was papered over. Under the self-review gate policy this comment is registered as COMMENT; the author is asked to APPROVE manually and merge.

altair823 merged commit 8ce44af95a into main 2026-04-30 15:03:26 +00:00
altair823 deleted branch feat/p1-3-parse-md-blocks 2026-04-30 15:03:27 +00:00
Reference: altair823-org/kebab#8