17 Commits

Author SHA1 Message Date
cfccb3687d p1-3: update kb-parse-md callers + drop BlockView projection in snapshots
Mechanical sweep over `Inline::Text(_)` / `Code(_)` / `Strong(_)` / `Emph(_)`
construction and match sites under the new struct-variant shape introduced
in the previous commit. `Inline::Link { text, href }` is unchanged.

The snapshot test in `tests/blocks_snapshots.rs` previously projected
`ParsedBlock` into a `BlockView`/`PayloadView` shim because the old
`Inline` could not serialize. With the schema fix in place we now
serialize `ParsedBlock` directly through serde — the shim and its
`flatten_inline` helper are removed. Inlines surface as structured
objects (`{"kind":"text","text":"…"}` etc.). Regenerated
`nested-headings.blocks.snapshot.json` to reflect the new shape via
the existing `--ignored` emitter; `code-and-table.blocks.snapshot.json`
has no inlines and is unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 15:10:54 +00:00
80123e9e27 p1-3: pin reviewer probe inputs as regression tests
The quality reviewer named three specific input probes for the C1/C2/
C3 fixes. Encode each as a verbatim test so future regressions on
those exact inputs surface immediately:

- probe_overflow: parse_blocks(b"# h\nbody\n", u32::MAX) → empty +
  Warning::ExtractFailed.
- probe_list_escape: list with embedded code block → single List
  block, two items.
- probe_empty_heading: `# \n# Real\nbody\n` → body's heading_path is
  `["Real"]`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 14:42:21 +00:00
2b6d9abc0f p1-3: doc-comment + test pin Quote drops non-text children
`ParsedPayload::Quote { text, inlines }` cannot represent block-level
children (lists, code, tables, images), so the BlockQuote end handler
silently drops them when assembling the Quote payload. This matches
§3.4 for now but is non-obvious and easy to regress without an
explicit pin.

Add a TODO(P1-future) comment near the Quote emission code and a
regression test (`quote_with_list_inside_drops_list`) that pins the
current shape: a `> - item` blockquote produces a Quote with empty
text and empty inlines.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 14:41:47 +00:00
23ff4d68af p1-3: preserve whitespace in link text across SoftBreak/HardBreak
`[multi\nline](http://x)` produced `Inline::Link.text = "multiline"`
because the SoftBreak/HardBreak handler called `push_text(" ")` —
which updates `paragraph.text` and the inline buffer, but NOT the
open link frame's flattened text accumulator. Text events flowed
through `push_link_text`; line breaks didn't.

Add `push_link_text(" ")` alongside the existing `push_text(" ")` in
the break handler so a line break inside `[ ... ](href)` collapses
to a visible space rather than disappearing.
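
The fix can be illustrated with a minimal sketch. `ParagraphState` and its
fields are hypothetical stand-ins for the walker's real frame state; only the
method names `push_text` and `push_link_text` come from this message.

```rust
/// Hypothetical stand-in for the walker's paragraph state.
#[derive(Default)]
struct ParagraphState {
    /// Flattened paragraph text.
    text: String,
    /// Flattened text of the currently open link, if any.
    open_link_text: Option<String>,
}

impl ParagraphState {
    fn start_link(&mut self) {
        self.open_link_text = Some(String::new());
    }

    fn push_text(&mut self, s: &str) {
        self.text.push_str(s);
    }

    /// No-op when no link is open.
    fn push_link_text(&mut self, s: &str) {
        if let Some(buf) = self.open_link_text.as_mut() {
            buf.push_str(s);
        }
    }

    /// Text event: feeds both accumulators.
    fn on_text(&mut self, s: &str) {
        self.push_text(s);
        self.push_link_text(s);
    }

    /// SoftBreak / HardBreak: the second call is the fix.
    fn on_break(&mut self) {
        self.push_text(" ");
        self.push_link_text(" ");
    }

    fn end_link(&mut self) -> String {
        self.open_link_text.take().unwrap_or_default()
    }
}

/// Replays the events for `[multi\nline](http://x)` and returns
/// `(paragraph text, link text)`.
fn replay_multi_line_link() -> (String, String) {
    let mut p = ParagraphState::default();
    p.start_link();
    p.on_text("multi");
    p.on_break();
    p.on_text("line");
    let link_text = p.end_link();
    (p.text, link_text)
}
```

Without the `push_link_text(" ")` call in `on_break`, the link text in this
sketch would come out as `multiline` while the paragraph text kept its space.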

New tests:
- link_with_soft_break_preserves_space_in_text
- link_with_hard_break_preserves_space_in_text

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 14:40:42 +00:00
73040cab30 p1-3: capture image refs from pulldown-cmark Tag::Image events
The previous block-level image detector scanned paragraph source bytes
for the literal `![alt](src)` shape. That was fragile in three ways:

- `![alt](src "title")` leaked the title into `src` (`src "title"`)
- `![alt](<https://x.com/a b>)` kept the angle brackets verbatim
- the result for `![]()` was unspecified

Replace the byte-scan with state on `Frame::Paragraph` that observes
the actual `Tag::Image` events from pulldown-cmark:

- `image_count` increments on each `Start(Tag::Image)` and `image_src`
  captures `dest_url` (which already strips angle brackets and excludes
  the title).
- Text events seen while `image_depth > 0` are routed into `image_alt`
  and suppressed from the inline buffer.
- Strong/Emph/Link starts and any non-image text outside the image
  set `non_image_text_seen`.

At `End(Paragraph)`, the paragraph is lifted to `ImageRef` iff
`image_count == 1 && !non_image_text_seen`. The byte-scanner
`match_block_image` is removed.
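
A rough sketch of the lifting rule follows. The `Ev` enum is a hypothetical,
simplified substitute for pulldown-cmark events, and `ParagraphImageState` is
an illustrative struct, not the real `Frame::Paragraph`; only the field names
and the `image_count == 1 && !non_image_text_seen` condition are taken from
this message.

```rust
/// Hypothetical, simplified inline events (the real walker consumes
/// pulldown-cmark events instead).
enum Ev<'a> {
    ImageStart { dest_url: &'a str },
    ImageEnd,
    Text(&'a str),
    /// Stands in for a Strong / Emph / Link start.
    OtherInlineStart,
}

#[derive(Default)]
struct ParagraphImageState {
    image_count: u32,
    image_depth: u32,
    image_src: String,
    image_alt: String,
    non_image_text_seen: bool,
}

impl ParagraphImageState {
    fn observe(&mut self, ev: &Ev) {
        match ev {
            Ev::ImageStart { dest_url } => {
                self.image_count += 1;
                self.image_depth += 1;
                self.image_src = dest_url.to_string();
            }
            Ev::ImageEnd => self.image_depth = self.image_depth.saturating_sub(1),
            // Text inside an image is alt text, not paragraph content.
            Ev::Text(t) if self.image_depth > 0 => self.image_alt.push_str(t),
            Ev::Text(_) | Ev::OtherInlineStart => self.non_image_text_seen = true,
        }
    }

    /// Returns `(src, alt)` when the paragraph qualifies as an image block.
    fn lift(&self) -> Option<(String, String)> {
        if self.image_count == 1 && !self.non_image_text_seen {
            Some((self.image_src.clone(), self.image_alt.clone()))
        } else {
            None
        }
    }
}

/// Feeds one paragraph's events through the state and applies the rule.
fn lift_paragraph(events: &[Ev]) -> Option<(String, String)> {
    let mut st = ParagraphImageState::default();
    for ev in events {
        st.observe(ev);
    }
    st.lift()
}
```

Because the decision is made from counters rather than from source bytes, a
title attribute or angle-bracketed URL never reaches this logic: the sketch
only sees whatever destination string the event carries.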

New tests:
- image_with_title_attribute (title dropped, no leak into src)
- image_with_angle_bracketed_url (brackets stripped)
- empty_image_alt_and_src (`![]()` pins to empty/empty)

Existing image tests (`image_ref_block_captures_src_and_alt`,
`inline_image_inside_paragraph_is_dropped`) continue to pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 14:40:04 +00:00
d49dbc1926 p1-3: skip empty heading text when building heading_path
`# ` (a heading with no following text) used to seed the heading
stack with `Some("")`, which then propagated into every child block's
`heading_path` as a `""` segment — visibly polluting the path that
downstream consumers index by.

Filter empty entries from both `heading_path()` and the in-line
ancestor collection at heading-end. We deliberately keep `Some("")`
in the stack rather than skipping the assignment so the slot remains
occupied and a subsequent deeper heading is still positioned
correctly relative to its level — only the visible path drops the
empty.
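
As an illustration of the "keep the slot, hide the segment" idea, here is a
small sketch. `HeadingStack` and `path_after` are hypothetical; the actual
stack layout in the walker may differ.

```rust
/// Hypothetical heading stack: slot `i` holds the text of the open heading
/// at level `i + 1`, or `None` if that level was skipped.
#[derive(Default)]
struct HeadingStack {
    slots: Vec<Option<String>>,
}

impl HeadingStack {
    /// Records a heading at `level` (clamped to 1..=6) and closes any
    /// deeper levels. An empty heading still occupies its slot.
    fn set(&mut self, level: usize, text: &str) {
        let idx = level.clamp(1, 6) - 1;
        self.slots.truncate(idx);
        self.slots.resize(idx, None);
        self.slots.push(Some(text.to_string()));
    }

    /// Visible path: skipped levels and empty headings are filtered out.
    fn heading_path(&self) -> Vec<String> {
        self.slots
            .iter()
            .flatten()
            .filter(|s| !s.is_empty())
            .cloned()
            .collect()
    }
}

/// Applies `(level, text)` headings in order and returns the final path.
fn path_after(headings: &[(usize, &str)]) -> Vec<String> {
    let mut stack = HeadingStack::default();
    for (level, text) in headings {
        stack.set(*level, text);
    }
    stack.heading_path()
}
```

In this sketch an empty H1 followed by an H2 yields a one-segment path, and the
H2 still sits in the second slot, so a later H3 nests under it as expected.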

New tests:
- empty_heading_does_not_pollute_path
- empty_h1_then_h2_does_not_break_stack

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 14:37:40 +00:00
0050cf32ea p1-3: route block-level content inside list items into parent inline buffer
`emit_block` previously walked the frame stack looking only for a Quote
container, falling back to top-level on miss. That caused any block
emitted inside a list item — code blocks, images, tables, headings —
to escape the list and appear at the top of `blocks`, after the
entire list and out of source order. `ParsedPayload::List { items:
Vec<Vec<Inline>> }` cannot represent a child block structurally, so
the choice is between dropping content and flattening.

Extend the reverse-walk to also recognize `Frame::ListItem` and route
the block into a textual rendering appended to the item's inline
buffer (`flatten_block_into_item`):

- Code → fenced text approximation, preserving lang hint + body
- Image → `![alt](src)` text
- Audio → `[audio](src)` text
- Heading → leading hashes + text
- Quote → `> text`
- Nested List → same rendering as `nested_in_item` flatten
- Table → pipe-table approximation

Document order is preserved because flattening happens inside the
item's frame, before the item closes.
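
A sketch of the flattening for a few of the cases listed above. The `Block`
enum and `flatten_block` are hypothetical; the exact output format of
`flatten_block_into_item` is not given here, so the strings produced below are
assumptions that follow the bullet descriptions.

```rust
/// Hypothetical block shapes; the real `ParsedPayload` variants differ.
enum Block {
    Code { lang: Option<String>, body: String },
    Image { alt: String, src: String },
    Audio { src: String },
    Heading { level: usize, text: String },
    Quote { text: String },
}

/// Renders a block as plain text suitable for appending to a list item's
/// inline buffer. The exact formats are assumed, not taken from the source.
fn flatten_block(block: &Block) -> String {
    match block {
        Block::Code { lang, body } => {
            // Build the fence at runtime to keep this example readable.
            let fence = "`".repeat(3);
            let lang = lang.as_deref().unwrap_or("");
            format!("{fence}{lang}\n{}\n{fence}", body.trim_end_matches('\n'))
        }
        Block::Image { alt, src } => format!("![{}]({})", alt, src),
        Block::Audio { src } => format!("[audio]({})", src),
        Block::Heading { level, text } => format!("{} {}", "#".repeat(*level), text),
        Block::Quote { text } => format!("> {}", text),
    }
}
```

The trade-off is that the child block loses its structure but keeps its
position: the text lands in the item that contained it rather than after the
whole list.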

New tests:
- code_block_inside_list_item_flattens_into_parent
- image_inside_list_item_flattens_into_parent
- block_content_in_list_preserves_document_order

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 14:36:27 +00:00
de9164802b p1-3: fix span arithmetic overflow + body_offset_lines fuzz
`span_for` previously used `u32 + u32` directly, so a caller passing a
large `body_offset_lines` could trigger a panic in debug builds (caught
by `catch_unwind`, which then discarded the entire body) or, in release
builds, a wrapped, inverted span with `start > end`.

Switch to `checked_add`; on overflow flag the walk state and at the
end of `parse_blocks_inner` discard accumulated blocks and surface a
single `Warning::ExtractFailed` carrying the offending body line.
This degrades cleanly without panicking and without emitting a
silently-broken span.
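
The overflow-safe arithmetic can be sketched as follows. `LineSpan` and this
signature for `span_for` are hypothetical simplifications; the real function
flags the walk state instead of returning an `Option`.

```rust
/// Hypothetical 1-indexed, inclusive line span.
#[derive(Debug, PartialEq)]
struct LineSpan {
    start: u32,
    end: u32,
}

/// Shifts body-relative line numbers by `body_offset_lines`.
/// Returns `None` if either bound would overflow `u32`, so the caller can
/// discard the result instead of emitting a wrapped span.
fn span_for(body_start: u32, body_end: u32, body_offset_lines: u32) -> Option<LineSpan> {
    let start = body_start.checked_add(body_offset_lines)?;
    let end = body_end.checked_add(body_offset_lines)?;
    Some(LineSpan { start, end })
}
```

`checked_add` behaves identically in debug and release builds, which removes
the panic-versus-wrap split described above.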

Also extend `random_bytes_do_not_panic` to mix u32::MAX-style offsets
across the fuzz iterations so the overflow path is exercised by the
randomized corpus.

New tests:
- body_offset_lines_max_returns_extract_failed
- body_offset_lines_zero_at_max_minus_one_no_overflow

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 14:34:58 +00:00
2ac0f56105 p1-3: drop dead bindings in heading + table end handlers
Removes the leftover `let _ = level_u8;` and `let _ = header_count;`
discards. The Heading frame already carries the canonical level (we
populated it from `Tag::Heading` at Start), so we destructure that
directly and ignore the redundant `TagEnd::Heading(level)` payload.
The header_count helper was dead — Frame::Table tracks `cols`
internally and we never consumed `header_count`.
2026-04-30 14:18:54 +00:00
f604a381df p1-3: snapshot tests + clippy fix
Adds two snapshot tests (`nested-headings.md`, `code-and-table.md`) under
crates/kb-parse-md/tests/blocks_snapshots.rs, with matching baseline JSON
next to each fixture. The snapshot view projects `kb_core::Inline` to
flat strings — `Inline` carries `serde(tag = "kind")` which is
incompatible with newtype variants holding a primitive (`Text(String)`),
so direct serialization of `ParsedBlock` would fail today. The view
preserves the contract that matters for P1-3 (heading paths, source
spans, payload kinds, payload text/code/table content) and will keep
working once kb-core fixes the Inline schema in a later task.

Also tightens `level_to_use >= 1 && level_to_use <= 6` into
`(1..=6).contains(&level_to_use)` to satisfy
`clippy::manual_range_contains`.
2026-04-30 14:17:41 +00:00
4e7e9cad87 p1-3: add parse_blocks (pulldown-cmark walker) submodule
Implements `kb_parse_md::parse_blocks(body, body_offset_lines)` returning a
flat `Vec<ParsedBlock>` plus warnings. Walks pulldown-cmark events through a
small frame-based state machine that tracks heading paths, accumulates
inline buffers (Text/Code/Link/Strong/Emph only — design §3.4), and
reports SourceSpan::Line spans in 1-indexed file-line coordinates.

Covers headings, paragraphs, code blocks (lang from info string), GFM
tables (with malformed fallback to paragraph + MalformedTable warning),
lists (nested sub-lists flattened into parent item), and block-level image
references. Inline images are dropped silently per the inline filter.
Adversarial inputs are caught with `catch_unwind` and degrade to an empty
output + ExtractFailed warning.
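
A minimal sketch of the panic guard, assuming a hypothetical `parse_guarded`
wrapper and a simplified unit-variant `Warning` (the real warning type may
carry data). `std::panic::catch_unwind` and `AssertUnwindSafe` are standard
library items.

```rust
use std::panic::{self, AssertUnwindSafe};

/// Simplified, hypothetical warning type.
#[derive(Debug, PartialEq)]
enum Warning {
    ExtractFailed,
}

/// Runs `inner` and converts a panic into an empty result plus a warning.
/// Blocks are plain strings here only to keep the example short.
fn parse_guarded<F>(inner: F) -> (Vec<String>, Vec<Warning>)
where
    F: FnOnce() -> Vec<String>,
{
    match panic::catch_unwind(AssertUnwindSafe(inner)) {
        Ok(blocks) => (blocks, Vec::new()),
        // Any partial output is lost with the panic, so none is returned.
        Err(_) => (Vec::new(), vec![Warning::ExtractFailed]),
    }
}
```

This only works when panics unwind; a build configured with `panic = "abort"`
would terminate the process instead of reaching the `Err` arm.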

15 unit tests cover heading-path correctness, code lang, table parsing,
malformed-table fallback (driven via synthetic events since pulldown-cmark
auto-normalizes table widths), LF/CRLF line-range parity, image refs,
nested-list flattening, inline filter, and 100-iteration random-bytes plus
hand-crafted adversarial-input no-panic guards.
2026-04-30 14:14:34 +00:00
5850bfcf7a p1-2: address review minors (FrontmatterSpan doc, parse_frontmatter rustdoc, YAML library note)
M1: Reword the FrontmatterSpan doc-comment from "technically meant to be
crate-internal" to a forward-looking note about P1-3 / P1-4 callers using
bytes[span.end..] for body slicing.

M3: Add an explicit `# Errors` section to parse_frontmatter's rustdoc.
The current implementation never returns Err — all recoverable problems
are downgraded to warnings — but the Result is kept on the signature so
future hard-fail conditions can be added without breaking callers.

M4: Mention serde_yml in the library-choice rationale alongside
serde_yaml_ng, with a one-line note on why _ng was preferred (stricter
adherence to original serde_yaml semantics around null / tagged enums).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 13:13:16 +00:00
6a4db624b6 p1-2: fix CRLF / trailing whitespace / BOM in frontmatter delimiter detection
C1: detect_delimiters now returns (DelimKind, FrontmatterSpan, Range<usize>)
where the inner range is the YAML/TOML payload byte range — derived in one
place rather than recomputed by the parser via fixed-width opening_len /
closing_len constants that wrongly assumed LF endings. CRLF input now parses
correctly end-to-end; the originally-failing reviewer probe
"---\r\ntitle: Doc\r\n---\r\nbody\r\n" now yields title="Doc" with no
warnings.

I1: Trailing horizontal whitespace (spaces / tabs) on either delimiter
line is now accepted, matching Hugo / Jekyll. Editors that auto-trim
trailing whitespace no longer silently break otherwise-valid frontmatter.

I2: A leading UTF-8 BOM (EF BB BF, byte 0 only) is tolerated and skipped
before delimiter scanning. The returned span.start accounts for the BOM
(=3) so callers using bytes[span.end..] for body slicing still get the
correct range without further bookkeeping. Mid-input BOMs are not stripped.

M2: Drop the now-dead DelimKind::opening_len / closing_len constants —
the inner range is encoded once at detection time.

12 new tests covering CRLF (YAML / TOML / mixed-EOL / end-to-end),
trailing whitespace on opener / closer / tabs, leading BOM (detection +
full pipeline), and mid-input BOM non-stripping.
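
The delimiter handling described in C1, I1, and I2 can be sketched for the
YAML case as below. `detect_yaml` and `trim_line` are hypothetical and return
a simpler shape than the real `detect_delimiters`: just the payload byte range
and the offset where the body starts.

```rust
use std::ops::Range;

const BOM: &[u8] = b"\xEF\xBB\xBF";

/// Strips one trailing "\n" or "\r\n", then trailing spaces and tabs.
fn trim_line(line: &[u8]) -> &[u8] {
    let mut end = line.len();
    if end > 0 && line[end - 1] == b'\n' {
        end -= 1;
    }
    if end > 0 && line[end - 1] == b'\r' {
        end -= 1;
    }
    while end > 0 && (line[end - 1] == b' ' || line[end - 1] == b'\t') {
        end -= 1;
    }
    &line[..end]
}

/// Returns `(payload_range, body_start)` for a leading `---` block,
/// or `None` if there is no terminated frontmatter.
fn detect_yaml(bytes: &[u8]) -> Option<(Range<usize>, usize)> {
    // A BOM is skipped only at byte 0.
    let mut pos = if bytes.starts_with(BOM) { BOM.len() } else { 0 };
    let mut payload_start = None;
    while pos < bytes.len() {
        // `line_end` points just past the line terminator (or at EOF).
        let line_end = bytes[pos..]
            .iter()
            .position(|&b| b == b'\n')
            .map_or(bytes.len(), |i| pos + i + 1);
        let is_delim = trim_line(&bytes[pos..line_end]) == b"---";
        match payload_start {
            None if is_delim => payload_start = Some(line_end),
            None => return None, // first line is not a delimiter
            Some(start) if is_delim => return Some((start..pos, line_end)),
            Some(_) => {}
        }
        pos = line_end;
    }
    None // unterminated block
}
```

Both offsets are derived from actual line boundaries, so nothing in the sketch
assumes a fixed delimiter width; that is the property the C1 fix relies on.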

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 13:12:34 +00:00
1fab6b0207 p1-2: address spec review (drop user_id_alias mirror in user map)
Spec §"Behavior contract" line 74 says `id:` is captured into
`metadata.user_id_alias` only. Remove the redundant `user.insert`
that was also writing it into the user map, and update the snapshot
baseline accordingly.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-30 13:02:28 +00:00
42a7d53e5d p1-2: fixtures + snapshot tests for frontmatter parser
Two markdown fixtures with hand-authored JSON baselines that pin the
§0 Q9 derive output across runs:

- frontmatter-only.md exercises the YAML happy path with most fields,
  unknown keys, an `id:` field, and a non-UTC created_at (so the
  baseline shows original_timestamps preservation).
- mixed-lang.md is body-only with no `lang:` field; baseline pins the
  lingua autodetect result for our enabled language set.

A separate `emit_snapshots` test (marked `#[ignore]`) regenerates the
baselines from the current parser output. A determinism test parses
the fixture twice and asserts equality so any non-determinism (e.g.
key ordering, lingua nondeterminism) fails fast.
2026-04-30 12:56:19 +00:00
cc8f7dad3f p1-2: parse_frontmatter + §0 Q9 derive table
Implement the frontmatter submodule:

- detect_delimiters scans for a leading YAML (---) or TOML (+++) block at
  byte 0. Strict per §0 Q9: no leading whitespace / BOM, no chars on the
  delimiter line. Closing must be its own line. Unterminated → no FM.
- parse_raw deserializes into RawFrontmatter, a serde-flatten struct that
  catches unknown keys into a serde_json::Map for verbatim preservation
  in metadata.user.
- derive_metadata implements the §0 Q9 fallback chain:
    title       → frontmatter | BodyHints.first_h1 | (filename: caller)
    aliases/tags→ frontmatter | []
    lang        → frontmatter | lingua autodetect on first 4 KB | hints
                  | "und"
    created_at  → frontmatter (RFC 3339, normalized to UTC) | fs_ctime
    updated_at  → frontmatter | fs_mtime
    source_type → frontmatter | "markdown"
    trust_level → frontmatter | "primary"
    id          → user_id_alias only — never a doc_id factor (§4.2)
- Non-UTC offsets are normalized to UTC; the original string is preserved
  in user.original_timestamps[field] per §0 Q9.
- Warnings are emitted for: malformed YAML/TOML, unknown enum values,
  malformed timestamps. Unknown keys are silent.
- lingua detector is cached in a OnceLock — first build is heavy.
- 15 unit tests cover every row of the derive table + delimiter edge
  cases + an explicit pin that `id:` does not feed id_for_doc.
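
Two rows of the derive table, sketched with `Option` chaining. The `Sources`
struct and both functions are hypothetical; in particular, language detection
is represented by a precomputed `detected_lang` field rather than a call into
lingua.

```rust
/// Hypothetical inputs to the derive step; field names are illustrative.
#[derive(Default)]
struct Sources<'a> {
    frontmatter_title: Option<&'a str>,
    first_h1: Option<&'a str>,
    frontmatter_lang: Option<&'a str>,
    /// Result of autodetection, if it produced one.
    detected_lang: Option<&'a str>,
    hint_lang: Option<&'a str>,
}

/// title: frontmatter, then first H1. `None` leaves the filename
/// fallback to the caller, as in the table above.
fn derive_title(s: &Sources) -> Option<String> {
    s.frontmatter_title.or(s.first_h1).map(str::to_string)
}

/// lang: frontmatter, then autodetect, then hints, then "und".
fn derive_lang(s: &Sources) -> String {
    s.frontmatter_lang
        .or(s.detected_lang)
        .or(s.hint_lang)
        .unwrap_or("und")
        .to_string()
}
```

Each `.or(...)` is consulted only when everything before it is `None`, which
mirrors reading a table row left to right.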
2026-04-30 12:56:02 +00:00
a86b463fc4 p1-2: scaffold kb-parse-md crate
Add the workspace member with the dep allow-list pinned by design §0 Q9
and the task spec. P1-2 will land the frontmatter submodule in the next
commit; P1-3 will add the block parser as a sibling.

Notable choice: serde_yaml (dtolnay) was archived as unmaintained in 2024
so we use serde_yaml_ng, the maintained fork. lingua's per-language
features are explicitly enabled (default-features=false) to keep build
time + binary size sane — only the languages we need at parse time.
2026-04-30 12:55:20 +00:00