The quality reviewer named three specific input probes for the C1/C2/
C3 fixes. Encode each as a verbatim test so future regressions on
those exact inputs surface immediately:
- probe_overflow: parse_blocks(b"# h\nbody\n", u32::MAX) → empty +
Warning::ExtractFailed.
- probe_list_escape: list with embedded code block → single List
block, two items.
- probe_empty_heading: `# \n# Real\nbody\n` → body's heading_path is
`["Real"]`.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`ParsedPayload::Quote { text, inlines }` cannot represent block-level
children (lists, code, tables, images), so the BlockQuote end handler
silently drops them when assembling the Quote payload. This matches
§3.4 for now but is non-obvious and easy to regress without an
explicit pin.
Add a TODO(P1-future) comment near the Quote emission code and a
regression test (`quote_with_list_inside_drops_list`) that fixes the
current shape: a `> - item` blockquote produces a Quote with empty
text and empty inlines.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`[multi\nline](http://x)` produced `Inline::Link.text = "multiline"`
because the SoftBreak/HardBreak handler called `push_text(" ")` —
which updates `paragraph.text` and the inline buffer, but NOT the
open link frame's flattened text accumulator. Text events flowed
through `push_link_text`; line breaks didn't.
Add `push_link_text(" ")` alongside the existing `push_text(" ")` in
the break handler so a line break inside `[ ... ](href)` collapses
to a visible space rather than disappearing.
New tests:
- link_with_soft_break_preserves_space_in_text
- link_with_hard_break_preserves_space_in_text
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous block-level image detector scanned paragraph source bytes
for the literal `` shape. That was fragile in three ways:
- `` leaked the title into `src` (`src "title"`)
- `` kept the angle brackets verbatim
- `![]()` had undefined behavior
Replace the byte-scan with state on `Frame::Paragraph` that observes
the actual `Tag::Image` events from pulldown-cmark:
- `image_count` increments on each `Start(Tag::Image)` and `image_src`
captures `dest_url` (which already strips angle brackets and excludes
the title).
- Text events seen while `image_depth > 0` are routed into `image_alt`
and suppressed from the inline buffer.
- Strong/Emph/Link starts and any non-image text outside the image
flag `non_image_text_seen`.
At `End(Paragraph)`, the paragraph is lifted to `ImageRef` iff
`image_count == 1 && !non_image_text_seen`. The byte-scanner
`match_block_image` is removed.
New tests:
- image_with_title_attribute (title dropped, no leak into src)
- image_with_angle_bracketed_url (brackets stripped)
- empty_image_alt_and_src (`![]()` pins to empty/empty)
Existing image tests (`image_ref_block_captures_src_and_alt`,
`inline_image_inside_paragraph_is_dropped`) continue to pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`# ` (a heading with no following text) used to seed the heading
stack with `Some("")`, which then propagated into every child block's
`heading_path` as a `""` segment — visibly polluting the path that
downstream consumers index by.
Filter empty entries from both `heading_path()` and the in-line
ancestor collection at heading-end. We deliberately keep `Some("")`
in the stack rather than skipping the assignment so the slot remains
occupied and a subsequent deeper heading is still positioned
correctly relative to its level — only the visible path drops the
empty.
New tests:
- empty_heading_does_not_pollute_path
- empty_h1_then_h2_does_not_break_stack
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`emit_block` previously walked the frame stack looking only for a Quote
container, falling back to top-level on miss. That caused any block
emitted inside a list item — code blocks, images, tables, headings —
to escape the list and appear at the top of `blocks`, after the
entire list and out of source order. `ParsedPayload::List { items:
Vec<Vec<Inline>> }` cannot represent a child block structurally, so
the choice is between dropping content and flattening.
Extend the reverse-walk to also recognize `Frame::ListItem` and route
the block into a textual rendering appended to the item's inline
buffer (`flatten_block_into_item`):
- Code → fenced text approximation, preserving lang hint + body
- Image → `` text
- Audio → `[audio](src)` text
- Heading → leading hashes + text
- Quote → `> text`
- Nested List → same rendering as `nested_in_item` flatten
- Table → pipe-table approximation
Document order is preserved because flattening happens inside the
item's frame, before the item closes.
New tests:
- code_block_inside_list_item_flattens_into_parent
- image_inside_list_item_flattens_into_parent
- block_content_in_list_preserves_document_order
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`span_for` previously used `u32 + u32` directly, so callers passing a
large `body_offset_lines` could panic (debug, then masked by
`catch_unwind` and the entire body discarded) or wrap to an inverted
span with `start > end` (release).
Switch to `checked_add`; on overflow flag the walk state and at the
end of `parse_blocks_inner` discard accumulated blocks and surface a
single `Warning::ExtractFailed` carrying the offending body line.
This degrades cleanly without panicking and without emitting a
silently-broken span.
Also extend `random_bytes_do_not_panic` to mix u32::MAX-style offsets
across the fuzz iterations so the overflow path is exercised by the
randomized corpus.
New tests:
- body_offset_lines_max_returns_extract_failed
- body_offset_lines_zero_at_max_minus_one_no_overflow
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Removes the leftover `let _ = level_u8;` and `let _ = header_count;`
discards. The Heading frame already carries the canonical level (we
populated it from `Tag::Heading` at Start), so we destructure that
directly and ignore the redundant `TagEnd::Heading(level)` payload.
The header_count helper was dead — Frame::Table tracks `cols`
internally and we never consumed `header_count`.
Adds two snapshot tests (`nested-headings.md`, `code-and-table.md`) under
crates/kb-parse-md/tests/blocks_snapshots.rs, with matching baseline JSON
next to each fixture. The snapshot view projects `kb_core::Inline` to
flat strings — `Inline` carries `serde(tag = "kind")` which is
incompatible with newtype variants holding a primitive (`Text(String)`),
so direct serialization of `ParsedBlock` would fail today. The view
preserves the contract that matters for P1-3 (heading paths, source
spans, payload kinds, payload text/code/table content) and will keep
working once kb-core fixes the Inline schema in a later task.
Also tightens `level_to_use >= 1 && <= 6` into `(1..=6).contains(&_)` to
satisfy `clippy::manual_range_contains`.
Implements `kb_parse_md::parse_blocks(body, body_offset_lines)` returning a
flat `Vec<ParsedBlock>` plus warnings. Walks pulldown-cmark events through a
small frame-based state machine that tracks heading paths, accumulates
inline buffers (Text/Code/Link/Strong/Emph only — design §3.4), and
reports SourceSpan::Line spans in 1-indexed file-line coordinates.
Covers headings, paragraphs, code blocks (lang from info string), GFM
tables (with malformed fallback to paragraph + MalformedTable warning),
lists (nested sub-lists flattened into parent item), and block-level image
references. Inline images are dropped silently per the inline filter.
Adversarial inputs are caught with `catch_unwind` and degrade to an empty
output + ExtractFailed warning.
15 unit tests cover heading-path correctness, code lang, table parsing,
malformed-table fallback (driven via synthetic events since pulldown-cmark
auto-normalizes table widths), LF/CRLF line-range parity, image refs,
nested-list flattening, inline filter, and 100-iteration random-bytes plus
hand-crafted adversarial-input no-panic guards.