kebab

Author	SHA1	Message	Date
altair823	cfccb3687d	p1-3: update kb-parse-md callers + drop BlockView projection in snapshots Mechanical sweep over `Inline::Text(_)` / `Code(_)` / `Strong(_)` / `Emph(_)` construction and match sites under the new struct-variant shape introduced in the previous commit. `Inline::Link { text, href }` is unchanged. The snapshot test in `tests/blocks_snapshots.rs` previously projected `ParsedBlock` into a `BlockView`/`PayloadView` shim because the old `Inline` could not serialize. With the schema fix in place we now serialize `ParsedBlock` directly through serde — the shim and its `flatten_inline` helper are removed. Inlines surface as structured objects (`{"kind":"text","text":"…"}` etc.). Regenerated `nested-headings.blocks.snapshot.json` to reflect the new shape via the existing `--ignored` emitter; `code-and-table.blocks.snapshot.json` has no inlines and is unchanged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 15:10:54 +00:00
altair823	80123e9e27	p1-3: pin reviewer probe inputs as regression tests The quality reviewer named three specific input probes for the C1/C2/C3 fixes. Encode each as a verbatim test so future regressions on those exact inputs surface immediately: - probe_overflow: parse_blocks(b"# h\nbody\n", u32::MAX) → empty + Warning::ExtractFailed. - probe_list_escape: list with embedded code block → single List block, two items. - probe_empty_heading: `# \n# Real\nbody\n` → body's heading_path is `["Real"]`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 14:42:21 +00:00
altair823	2b6d9abc0f	p1-3: doc-comment + test pin Quote drops non-text children `ParsedPayload::Quote { text, inlines }` cannot represent block-level children (lists, code, tables, images), so the BlockQuote end handler silently drops them when assembling the Quote payload. This matches §3.4 for now but is non-obvious and easy to regress without an explicit pin. Add a TODO(P1-future) comment near the Quote emission code and a regression test (`quote_with_list_inside_drops_list`) that fixes the current shape: a `> - item` blockquote produces a Quote with empty text and empty inlines. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 14:41:47 +00:00
altair823	23ff4d68af	p1-3: preserve whitespace in link text across SoftBreak/HardBreak `[multi\nline](http://x)` produced `Inline::Link.text = "multiline"` because the SoftBreak/HardBreak handler called `push_text(" ")` — which updates `paragraph.text` and the inline buffer, but NOT the open link frame's flattened text accumulator. Text events flowed through `push_link_text`; line breaks didn't. Add `push_link_text(" ")` alongside the existing `push_text(" ")` in the break handler so a line break inside `[ ... ](href)` collapses to a visible space rather than disappearing. New tests: - link_with_soft_break_preserves_space_in_text - link_with_hard_break_preserves_space_in_text Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 14:40:42 +00:00
altair823	73040cab30	p1-3: capture image refs from pulldown-cmark Tag::Image events The previous block-level image detector scanned paragraph source bytes for the literal `![alt](src)` shape. That was fragile in three ways: - `![alt](src "title")` leaked the title into `src` (`src "title"`) - `![alt](<https://x.com/a b>)` kept the angle brackets verbatim - `![]()` had undefined behavior Replace the byte-scan with state on `Frame::Paragraph` that observes the actual `Tag::Image` events from pulldown-cmark: - `image_count` increments on each `Start(Tag::Image)` and `image_src` captures `dest_url` (which already strips angle brackets and excludes the title). - Text events seen while `image_depth > 0` are routed into `image_alt` and suppressed from the inline buffer. - Strong/Emph/Link starts and any non-image text outside the image flag `non_image_text_seen`. At `End(Paragraph)`, the paragraph is lifted to `ImageRef` iff `image_count == 1 && !non_image_text_seen`. The byte-scanner `match_block_image` is removed. New tests: - image_with_title_attribute (title dropped, no leak into src) - image_with_angle_bracketed_url (brackets stripped) - empty_image_alt_and_src (`![]()` pins to empty/empty) Existing image tests (`image_ref_block_captures_src_and_alt`, `inline_image_inside_paragraph_is_dropped`) continue to pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 14:40:04 +00:00
altair823	d49dbc1926	p1-3: skip empty heading text when building heading_path `# ` (a heading with no following text) used to seed the heading stack with `Some("")`, which then propagated into every child block's `heading_path` as a `""` segment — visibly polluting the path that downstream consumers index by. Filter empty entries from both `heading_path()` and the in-line ancestor collection at heading-end. We deliberately keep `Some("")` in the stack rather than skipping the assignment so the slot remains occupied and a subsequent deeper heading is still positioned correctly relative to its level — only the visible path drops the empty. New tests: - empty_heading_does_not_pollute_path - empty_h1_then_h2_does_not_break_stack Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 14:37:40 +00:00
altair823	0050cf32ea	p1-3: route block-level content inside list items into parent inline buffer `emit_block` previously walked the frame stack looking only for a Quote container, falling back to top-level on miss. That caused any block emitted inside a list item — code blocks, images, tables, headings — to escape the list and appear at the top of `blocks`, after the entire list and out of source order. `ParsedPayload::List { items: Vec<Vec<Inline>> }` cannot represent a child block structurally, so the choice is between dropping content and flattening. Extend the reverse-walk to also recognize `Frame::ListItem` and route the block into a textual rendering appended to the item's inline buffer (`flatten_block_into_item`): - Code → fenced text approximation, preserving lang hint + body - Image → `![alt](src)` text - Audio → `[audio](src)` text - Heading → leading hashes + text - Quote → `> text` - Nested List → same rendering as `nested_in_item` flatten - Table → pipe-table approximation Document order is preserved because flattening happens inside the item's frame, before the item closes. New tests: - code_block_inside_list_item_flattens_into_parent - image_inside_list_item_flattens_into_parent - block_content_in_list_preserves_document_order Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 14:36:27 +00:00
altair823	de9164802b	p1-3: fix span arithmetic overflow + body_offset_lines fuzz `span_for` previously used `u32 + u32` directly, so callers passing a large `body_offset_lines` could panic (debug, then masked by `catch_unwind` and the entire body discarded) or wrap to an inverted span with `start > end` (release). Switch to `checked_add`; on overflow flag the walk state and at the end of `parse_blocks_inner` discard accumulated blocks and surface a single `Warning::ExtractFailed` carrying the offending body line. This degrades cleanly without panicking and without emitting a silently-broken span. Also extend `random_bytes_do_not_panic` to mix u32::MAX-style offsets across the fuzz iterations so the overflow path is exercised by the randomized corpus. New tests: - body_offset_lines_max_returns_extract_failed - body_offset_lines_zero_at_max_minus_one_no_overflow Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 14:34:58 +00:00
altair823	2ac0f56105	p1-3: drop dead bindings in heading + table end handlers Removes the leftover `let _ = level_u8;` and `let _ = header_count;` discards. The Heading frame already carries the canonical level (we populated it from `Tag::Heading` at Start), so we destructure that directly and ignore the redundant `TagEnd::Heading(level)` payload. The header_count helper was dead — Frame::Table tracks `cols` internally and we never consumed `header_count`.	2026-04-30 14:18:54 +00:00
altair823	f604a381df	p1-3: snapshot tests + clippy fix Adds two snapshot tests (`nested-headings.md`, `code-and-table.md`) under crates/kb-parse-md/tests/blocks_snapshots.rs, with matching baseline JSON next to each fixture. The snapshot view projects `kb_core::Inline` to flat strings — `Inline` carries `serde(tag = "kind")` which is incompatible with newtype variants holding a primitive (`Text(String)`), so direct serialization of `ParsedBlock` would fail today. The view preserves the contract that matters for P1-3 (heading paths, source spans, payload kinds, payload text/code/table content) and will keep working once kb-core fixes the Inline schema in a later task. Also tightens `level_to_use >= 1 && <= 6` into `(1..=6).contains(&_)` to satisfy `clippy::manual_range_contains`.	2026-04-30 14:17:41 +00:00
altair823	4e7e9cad87	p1-3: add parse_blocks (pulldown-cmark walker) submodule Implements `kb_parse_md::parse_blocks(body, body_offset_lines)` returning a flat `Vec<ParsedBlock>` plus warnings. Walks pulldown-cmark events through a small frame-based state machine that tracks heading paths, accumulates inline buffers (Text/Code/Link/Strong/Emph only — design §3.4), and reports SourceSpan::Line spans in 1-indexed file-line coordinates. Covers headings, paragraphs, code blocks (lang from info string), GFM tables (with malformed fallback to paragraph + MalformedTable warning), lists (nested sub-lists flattened into parent item), and block-level image references. Inline images are dropped silently per the inline filter. Adversarial inputs are caught with `catch_unwind` and degrade to an empty output + ExtractFailed warning. 15 unit tests cover heading-path correctness, code lang, table parsing, malformed-table fallback (driven via synthetic events since pulldown-cmark auto-normalizes table widths), LF/CRLF line-range parity, image refs, nested-list flattening, inline filter, and 100-iteration random-bytes plus hand-crafted adversarial-input no-panic guards.	2026-04-30 14:14:34 +00:00
altair823	5850bfcf7a	p1-2: address review minors (FrontmatterSpan doc, parse_frontmatter rustdoc, YAML library note) M1: Reword the FrontmatterSpan doc-comment from "technically meant to be crate-internal" to a forward-looking note about P1-3 / P1-4 callers using bytes[span.end..] for body slicing. M3: Add an explicit `# Errors` section to parse_frontmatter's rustdoc. The current implementation never returns Err — all recoverable problems are downgraded to warnings — but the Result is kept on the signature so future hard-fail conditions can be added without breaking callers. M4: Mention serde_yml in the library-choice rationale alongside serde_yaml_ng, with a one-line note on why _ng was preferred (stricter adherence to original serde_yaml semantics around null / tagged enums). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 13:13:16 +00:00
altair823	6a4db624b6	p1-2: fix CRLF / trailing whitespace / BOM in frontmatter delimiter detection C1: detect_delimiters now returns (DelimKind, FrontmatterSpan, Range<usize>) where the inner range is the YAML/TOML payload byte range — derived in one place rather than recomputed by the parser via fixed-width opening_len / closing_len constants that wrongly assumed LF endings. CRLF input now parses correctly end-to-end; the originally-failing reviewer probe "---\r\ntitle: Doc\r\n---\r\nbody\r\n" now yields title="Doc" with no warnings. I1: Trailing horizontal whitespace (spaces / tabs) on either delimiter line is now accepted, matching Hugo / Jekyll. Editors that auto-trim trailing whitespace no longer silently break otherwise-valid frontmatter. I2: A leading UTF-8 BOM (EF BB BF, byte 0 only) is tolerated and skipped before delimiter scanning. The returned span.start accounts for the BOM (=3) so callers using bytes[span.end..] for body slicing still get the correct range without further bookkeeping. Mid-input BOMs are not stripped. M2: Drop the now-dead DelimKind::opening_len / closing_len constants — the inner range is encoded once at detection time. 12 new tests covering CRLF (YAML / TOML / mixed-EOL / end-to-end), trailing whitespace on opener / closer / tabs, leading BOM (detection + full pipeline), and mid-input BOM non-stripping. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 13:12:34 +00:00
altair823	1fab6b0207	p1-2: address spec review (drop user_id_alias mirror in user map) Spec §"Behavior contract" line 74 says `id:` is captured into `metadata.user_id_alias` only. Remove the redundant `user.insert` that was also writing it into the user map, and update the snapshot baseline accordingly. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-30 13:02:28 +00:00
altair823	42a7d53e5d	p1-2: fixtures + snapshot tests for frontmatter parser Two markdown fixtures with hand-authored JSON baselines that pin the §0 Q9 derive output across runs: - frontmatter-only.md exercises the YAML happy path with most fields, unknown keys, an `id:` field, and a non-UTC created_at (so the baseline shows original_timestamps preservation). - mixed-lang.md is body-only with no `lang:` field; baseline pins the lingua autodetect result for our enabled language set. A separate `emit_snapshots` test (marked `#[ignore]`) regenerates the baselines from the current parser output. A determinism test parses the fixture twice and asserts equality so any non-determinism (e.g. key ordering, lingua nondeterminism) fails fast.	2026-04-30 12:56:19 +00:00
altair823	cc8f7dad3f	p1-2: parse_frontmatter + §0 Q9 derive table Implement the frontmatter submodule: - detect_delimiters scans for a leading YAML (---) or TOML (+++) block at byte 0. Strict per §0 Q9: no leading whitespace / BOM, no chars on the delimiter line. Closing must be its own line. Unterminated → no FM. - parse_raw deserializes into RawFrontmatter, a serde-flatten struct that catches unknown keys into a serde_json::Map for verbatim preservation in metadata.user. - derive_metadata implements the §0 Q9 fallback chain: title → frontmatter | BodyHints.first_h1 | (filename: caller); aliases/tags → frontmatter | []; lang → frontmatter | lingua autodetect on first 4 KB | hints | "und"; created_at → frontmatter (RFC 3339, normalized to UTC) | fs_ctime; updated_at → frontmatter | fs_mtime; source_type → frontmatter | "markdown"; trust_level → frontmatter | "primary"; id → user_id_alias only, never a doc_id factor (§4.2). - Non-UTC offsets are normalized to UTC; the original string is preserved in user.original_timestamps[field] per §0 Q9. - Warnings are emitted for: malformed YAML/TOML, unknown enum values, malformed timestamps. Unknown keys are silent. - lingua detector is cached in a OnceLock, since the first build is heavy. - 15 unit tests cover every row of the derive table + delimiter edge cases + an explicit pin that `id:` does not feed id_for_doc.	2026-04-30 12:56:02 +00:00
altair823	a86b463fc4	p1-2: scaffold kb-parse-md crate Add the workspace member with the dep allow-list pinned by design §0 Q9 and the task spec. P1-2 will land the frontmatter submodule in the next commit; P1-3 will add the block parser as a sibling. Notable choice: serde_yaml (dtolnay) was archived as unmaintained in 2024 so we use serde_yaml_ng, the maintained fork. lingua's per-language features are explicitly enabled (default-features=false) to keep build time + binary size sane — only the languages we need at parse time.	2026-04-30 12:55:20 +00:00
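
Commit d49dbc1926 describes keeping an empty heading in the stack so deeper headings stay positioned correctly, while dropping it from the visible path. A minimal sketch of that filtering follows. The `Vec<Option<String>>` stack layout and this `heading_path` signature are assumptions made for illustration; the crate's actual types are not shown in the log.

```rust
/// Build the visible heading path from a per-level heading stack.
///
/// Assumed layout (hypothetical): slot `i` holds the text of the most recent
/// heading at level `i + 1`, or `None` if that level is unset.
fn heading_path(stack: &[Option<String>]) -> Vec<String> {
    stack
        .iter()
        .flatten() // skip levels that have no heading at all
        .filter(|text| !text.is_empty()) // skip headings with empty text, e.g. `# `
        .cloned()
        .collect()
}
```

With a stack of `[Some(""), Some("Real"), None]`, the empty slot stays occupied in the stack but the returned path contains only `"Real"`.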

17 Commits
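
Commit de9164802b replaces a bare `u32 + u32` in `span_for` with `checked_add`, so a large `body_offset_lines` neither panics nor wraps into an inverted span. The sketch below shows that technique in isolation. `LineSpan` and this `span_for` signature are hypothetical stand-ins, and returning `Option` is a simplification of the commit's approach, which flags the walk state and later surfaces `Warning::ExtractFailed`.

```rust
/// Hypothetical stand-in for the crate's line-based source span.
#[derive(Debug, PartialEq, Eq)]
struct LineSpan {
    start: u32,
    end: u32,
}

/// Shift body-relative line numbers into file-line coordinates.
///
/// Returns `None` if either bound would overflow `u32`, instead of
/// panicking (debug builds) or wrapping around (release builds).
fn span_for(body_offset_lines: u32, start: u32, end: u32) -> Option<LineSpan> {
    Some(LineSpan {
        start: body_offset_lines.checked_add(start)?,
        end: body_offset_lines.checked_add(end)?,
    })
}
```

A caller can treat `None` as the signal to discard the accumulated blocks and report a single extraction failure, rather than emitting a span with `start > end`.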