M1: Reword the FrontmatterSpan doc-comment from "technically meant to be
crate-internal" to a forward-looking note about P1-3 / P1-4 callers using
bytes[span.end..] for body slicing.
M3: Add an explicit `# Errors` section to parse_frontmatter's rustdoc.
The current implementation never returns Err — all recoverable problems
are downgraded to warnings — but the Result is kept on the signature so
future hard-fail conditions can be added without breaking callers.
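A minimal sketch of what such an `# Errors` section could look like. The `Parsed` and `FrontmatterError` types and the toy line-based parser body are hypothetical stand-ins for illustration, not the crate's real definitions:

```rust
/// Hypothetical output type: a parsed title plus collected warnings.
#[derive(Debug, Default, PartialEq)]
pub struct Parsed {
    pub title: Option<String>,
    pub warnings: Vec<String>,
}

/// Hypothetical error type. It has no variants yet, so no `Err` can be built.
#[derive(Debug)]
pub enum FrontmatterError {}

/// Parses `key: value` lines (toy parser, for illustration only).
///
/// # Errors
///
/// Currently never returns `Err`: every recoverable problem is downgraded
/// to an entry in `Parsed::warnings`. The `Result` stays on the signature
/// so hard-fail conditions can be added later without breaking callers.
pub fn parse_frontmatter(input: &str) -> Result<Parsed, FrontmatterError> {
    let mut out = Parsed::default();
    for line in input.lines() {
        match line.split_once(':') {
            Some(("title", v)) => out.title = Some(v.trim().to_string()),
            // Other keys are ignored in this sketch.
            Some(_) => {}
            None if line.trim().is_empty() => {}
            // A malformed line becomes a warning, not an error.
            None => out.warnings.push(format!("ignored malformed line: {line}")),
        }
    }
    Ok(out)
}
```

Callers written against the `Result` today (for example with `?`) keep compiling if variants are added to the error type later.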
M4: Mention serde_yml in the library-choice rationale alongside
serde_yaml_ng, with a one-line note on why _ng was preferred (stricter
adherence to original serde_yaml semantics around null / tagged enums).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
C1: detect_delimiters now returns (DelimKind, FrontmatterSpan, Range<usize>)
where the inner range is the YAML/TOML payload byte range — derived in one
place rather than recomputed by the parser via fixed-width opening_len /
closing_len constants that wrongly assumed LF endings. CRLF input now parses
correctly end-to-end; the originally-failing reviewer probe
"---\r\ntitle: Doc\r\n---\r\nbody\r\n" now yields title="Doc" with no
warnings.
I1: Trailing horizontal whitespace (spaces / tabs) on either delimiter
line is now accepted, matching Hugo / Jekyll. Editors that auto-trim
trailing whitespace no longer silently break otherwise-valid frontmatter.
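The detection changes above (C1 and I1) can be sketched roughly as follows. This is an illustration under assumptions, not the actual implementation: the `FrontmatterSpan` field layout and the helpers `next_line` and `trim_hws` are hypothetical, and BOM handling is left out.

```rust
use std::ops::Range;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum DelimKind {
    Yaml,
    Toml,
}

/// Hypothetical layout: byte offsets of the whole block, delimiters included.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct FrontmatterSpan {
    pub start: usize,
    pub end: usize,
}

/// Returns the line at `start` without its LF / CRLF terminator, plus the
/// offset of the first byte after the terminator.
fn next_line(bytes: &[u8], start: usize) -> Option<(&[u8], usize)> {
    if start >= bytes.len() {
        return None;
    }
    let (mut end, next) = match bytes[start..].iter().position(|&b| b == b'\n') {
        Some(i) => (start + i, start + i + 1),
        None => (bytes.len(), bytes.len()),
    };
    if end > start && bytes[end - 1] == b'\r' {
        end -= 1; // drop the CR of a CRLF pair
    }
    Some((&bytes[start..end], next))
}

/// Strips trailing spaces and tabs.
fn trim_hws(line: &[u8]) -> &[u8] {
    let mut end = line.len();
    while end > 0 && (line[end - 1] == b' ' || line[end - 1] == b'\t') {
        end -= 1;
    }
    &line[..end]
}

/// Returns the kind, the full block span, and the inner payload range.
/// The inner range is derived from real line boundaries, so no fixed
/// delimiter width is assumed.
pub fn detect_delimiters(bytes: &[u8]) -> Option<(DelimKind, FrontmatterSpan, Range<usize>)> {
    let (opener, inner_start) = next_line(bytes, 0)?;
    let marker = trim_hws(opener);
    let kind = match marker {
        b"---" => DelimKind::Yaml,
        b"+++" => DelimKind::Toml,
        _ => return None,
    };
    let mut pos = inner_start;
    while let Some((line, next)) = next_line(bytes, pos) {
        if trim_hws(line) == marker {
            let span = FrontmatterSpan { start: 0, end: next };
            return Some((kind, span, inner_start..pos));
        }
        pos = next;
    }
    None // unterminated block: no frontmatter
}
```

With this shape, the parser slices `bytes[inner]` for the payload and callers slice `bytes[span.end..]` for the body, regardless of whether the input uses LF or CRLF.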
I2: A leading UTF-8 BOM (EF BB BF, byte 0 only) is tolerated and skipped
before delimiter scanning. The returned span.start accounts for the BOM
(=3) so callers using bytes[span.end..] for body slicing still get the
correct range without further bookkeeping. Mid-input BOMs are not stripped.
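A small sketch of the BOM bookkeeping described in I2. The `scan_start` and `detect_yaml` names are hypothetical, and the detector is deliberately reduced to LF-only YAML to keep the offsets easy to follow:

```rust
const UTF8_BOM: &[u8] = b"\xEF\xBB\xBF";

/// Hypothetical layout: byte offsets of the whole frontmatter block.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct FrontmatterSpan {
    pub start: usize,
    pub end: usize,
}

/// Offset at which delimiter scanning begins: 3 if the input starts with a
/// UTF-8 BOM, otherwise 0. Only byte 0 is checked.
pub fn scan_start(bytes: &[u8]) -> usize {
    if bytes.starts_with(UTF8_BOM) {
        UTF8_BOM.len()
    } else {
        0
    }
}

/// Simplified LF-only YAML detector, used only to demonstrate the offsets.
pub fn detect_yaml(bytes: &[u8]) -> Option<FrontmatterSpan> {
    let start = scan_start(bytes);
    let rest = &bytes[start..];
    if !rest.starts_with(b"---\n") {
        return None;
    }
    // Search from the opener's own newline so an empty block also matches.
    let close = rest[3..].windows(5).position(|w| w == b"\n---\n")?;
    // Offsets are relative to the original buffer, BOM included.
    Some(FrontmatterSpan {
        start,
        end: start + 3 + close + 5,
    })
}
```

Because `span.end` is expressed in original-buffer offsets, `bytes[span.end..]` yields the body with no extra adjustment for the BOM.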
M2: Drop the now-dead DelimKind::opening_len / closing_len constants —
the inner range is encoded once at detection time.
12 new tests covering CRLF (YAML / TOML / mixed-EOL / end-to-end),
trailing whitespace on opener / closer / tabs, leading BOM (detection +
full pipeline), and mid-input BOM non-stripping.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Spec §"Behavior contract" line 74 says `id:` is captured into
`metadata.user_id_alias` only. Remove the redundant `user.insert`
that was also writing it into the user map, and update the snapshot
baseline accordingly.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Two markdown fixtures with hand-authored JSON baselines that pin the
§0 Q9 derive output across runs:
- frontmatter-only.md exercises the YAML happy path with most fields,
unknown keys, an `id:` field, and a non-UTC created_at (so the
baseline shows original_timestamps preservation).
- mixed-lang.md is body-only with no `lang:` field; baseline pins the
lingua autodetect result for our enabled language set.
A separate `emit_snapshots` test (marked `#[ignore]`) regenerates the
baselines from the current parser output. A determinism test parses
the fixture twice and asserts equality so any non-determinism (e.g.
key ordering, lingua nondeterminism) fails fast.
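The determinism check can be sketched as a generic helper. `parse_stub` is a hypothetical stand-in for the real parser, and `assert_deterministic` is an illustrative name, not the test in the repository:

```rust
use std::collections::BTreeMap;
use std::fmt::Debug;

/// Hypothetical stand-in parser: collects `key: value` lines.
/// A BTreeMap iterates in sorted key order, so output order is stable.
pub fn parse_stub(src: &str) -> BTreeMap<String, String> {
    src.lines()
        .filter_map(|l| l.split_once(':'))
        .map(|(k, v)| (k.trim().to_string(), v.trim().to_string()))
        .collect()
}

/// Parses `src` twice and panics if the two results differ.
pub fn assert_deterministic<T, F>(parse: F, src: &str)
where
    T: PartialEq + Debug,
    F: Fn(&str) -> T,
{
    let first = parse(src);
    let second = parse(src);
    assert_eq!(first, second, "parser output changed between runs");
}
```

A fixture test would then call `assert_deterministic(parse_stub, fixture_text)` with the real parser substituted for the stub.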
Implement the frontmatter submodule:
- detect_delimiters scans for a leading YAML (---) or TOML (+++) block at
byte 0. Strict per §0 Q9: no leading whitespace or BOM, and no other
characters on the delimiter line. The closing delimiter must be on its own
line. An unterminated block is treated as no frontmatter.
- parse_raw deserializes into RawFrontmatter, a serde-flatten struct that
catches unknown keys into a serde_json::Map for verbatim preservation
in metadata.user.
- derive_metadata implements the §0 Q9 fallback chain:
    title        → frontmatter | BodyHints.first_h1 | (filename: caller)
    aliases/tags → frontmatter | []
    lang         → frontmatter | lingua autodetect on first 4 KB | hints | "und"
    created_at   → frontmatter (RFC 3339, normalized to UTC) | fs_ctime
    updated_at   → frontmatter | fs_mtime
    source_type  → frontmatter | "markdown"
    trust_level  → frontmatter | "primary"
    id           → user_id_alias only; never a doc_id factor (§4.2)
- Non-UTC offsets are normalized to UTC; the original string is preserved
in user.original_timestamps[field] per §0 Q9.
- Warnings are emitted for: malformed YAML/TOML, unknown enum values,
malformed timestamps. Unknown keys are silent.
- The lingua detector is cached in a OnceLock because its first build is heavy.
- 15 unit tests cover every row of the derive table + delimiter edge
cases + an explicit pin that `id:` does not feed id_for_doc.
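The fallback chain in the derive table maps naturally onto `Option::or_else`. The sketch below covers only the title, lang, and source_type rows. The struct shapes and the `autodetect` closure (standing in for the lingua call) are assumptions made for illustration:

```rust
/// Hypothetical frontmatter fields; the real struct has more.
#[derive(Default)]
pub struct Frontmatter {
    pub title: Option<String>,
    pub lang: Option<String>,
    pub source_type: Option<String>,
}

/// Hypothetical hints extracted from the document body.
#[derive(Default)]
pub struct BodyHints {
    pub first_h1: Option<String>,
    pub lang: Option<String>,
}

#[derive(Debug, PartialEq)]
pub struct Derived {
    pub title: Option<String>,
    pub lang: String,
    pub source_type: String,
}

pub fn derive_metadata(
    fm: &Frontmatter,
    hints: &BodyHints,
    autodetect: impl Fn() -> Option<String>,
) -> Derived {
    Derived {
        // title: frontmatter, then first H1. None leaves the filename
        // fallback to the caller.
        title: fm.title.clone().or_else(|| hints.first_h1.clone()),
        // lang: frontmatter, then autodetect, then hints, then "und".
        lang: fm
            .lang
            .clone()
            .or_else(|| autodetect())
            .or_else(|| hints.lang.clone())
            .unwrap_or_else(|| "und".to_string()),
        // source_type: frontmatter, then the "markdown" default.
        source_type: fm
            .source_type
            .clone()
            .unwrap_or_else(|| "markdown".to_string()),
    }
}
```

Each `or_else` closure runs only when the earlier sources produced nothing, so the autodetect step is skipped whenever the frontmatter already names a language.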
Add the workspace member with the dep allow-list pinned by design §0 Q9
and the task spec. P1-2 will land the frontmatter submodule in the next
commit; P1-3 will add the block parser as a sibling.
Notable choice: serde_yaml (dtolnay) was archived as unmaintained in 2024
so we use serde_yaml_ng, the maintained fork. lingua's per-language
features are explicitly enabled (default-features=false) to keep build
time + binary size sane — only the languages we need at parse time.