kebab/tasks/p1/p1-2-parse-md-frontmatter.md
kb bc1b3147cd refactor(spec): cleanup pass over component specs
Address 8 issues found in spec audit (post PR #2):

1. §refs label: distinguish design vs report sections in p3-1 / p3-2 / p4-2 /
   p9-1 / p9-5 contract_sections (e.g., "report §11.2 Ollama" not "§11.2").
2. mock feature gate: gate MockEmbedder (p3-1) and MockLanguageModel (p4-1)
   behind `mock` cargo feature, default OFF; add CI symbol-scan as DoD item.
3. Warning type unification: p1-2 frontmatter now emits
   `kb_parse_types::Warning` (matches p1-3 / p1-4); drops crate-internal type.
4. p4-3 streaming thread: explicitly single-threaded inside RagPipeline::ask;
   collection + sink.send share the calling thread, no race. UI concurrency
   is the caller's responsibility (TUI worker thread pattern in p9-3).
5. p6-2 tesseract version: noted that `tesseract` 0.13 has no stable Rust
   `version()` accessor; use TessVersion FFI or shell-out + cache approach.
6. p9-* App struct extensions: introduce `kb_tui::{Library,Search,Ask,Inspect}State`
   slots in p9-1 forward-decl form; p9-2/3/4 fill bodies in their own crate
   without editing `App`. Parallel-safety contract added.
7. p3-3 cosine score: shift `(sim+1)/2` instead of clamp; preserve ranking
   signal between unrelated and opposite vectors. Clamp reserved for NaN.
8. fixtures/ root: p0-1 DoD now creates all fixture subdirs with .gitkeep so
   downstream tasks have a stable target path.
2026-04-27 23:38:13 +00:00

phase: P1
component: kb-parse-md (frontmatter submodule)
task_id: p1-2
title: Markdown frontmatter parsing → Metadata
status: planned
depends_on: p0-1
unblocks: p1-4
contract_source: ../../docs/superpowers/specs/2026-04-27-kb-final-form-design.md
contract_sections:
  • design §3.6 Metadata
  • design §3.7b kb-parse-types (Warning)
  • design §0 Q9 frontmatter
  • design §10 errors

p1-2 — Markdown frontmatter parsing

Goal

Parse YAML/TOML frontmatter from Markdown bytes into kb_core::Metadata, with auto-derive defaults and unknown-key preservation in metadata.user.

Why now / why this size

Frontmatter is small but contractually load-bearing (Q9 spec). Isolating it from block parsing keeps both halves of kb-parse-md simple and lets us reach 100% test coverage on the rules in design §0 Q9.

Allowed dependencies

  • kb-core
  • kb-parse-types (provides shared Warning + WarningKind per design §3.7b)
  • serde
  • serde_yaml (or yaml-rust2) for YAML
  • toml for TOML
  • time
  • lingua (lang auto-detect — accept feature-gate if heavy)
  • thiserror

Forbidden dependencies

  • kb-store-*, kb-llm*, kb-rag, kb-embed*, kb-search, kb-tui, kb-desktop, kb-source-fs, kb-chunk, kb-normalize, pulldown-cmark (block parser is a sibling task)

Inputs

| input | type | source |
| --- | --- | --- |
| Markdown bytes | &[u8] | extractor |
| body fallbacks | BodyHints { first_h1: Option<String>, fs_ctime: OffsetDateTime, fs_mtime: OffsetDateTime, fallback_lang: Option<String> } | caller |

Outputs

| output | type | downstream |
| --- | --- | --- |
| (Metadata, Option<FrontmatterSpan>, Vec<kb_parse_types::Warning>) | tuple | kb-normalize → CanonicalDocument |

Public surface (signatures only — no new types)

pub fn parse_frontmatter(
    bytes: &[u8],
    hints: &BodyHints,
) -> anyhow::Result<(kb_core::Metadata, Option<FrontmatterSpan>, Vec<kb_parse_types::Warning>)>;

Warning / WarningKind come from kb-parse-types (shared with p1-3 blocks parser and downstream kb-normalize). FrontmatterSpan is crate-internal; if any new public type is needed, STOP and update the frozen design doc first.
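
As a rough illustration of how a crate-internal FrontmatterSpan might be located before any YAML/TOML parsing happens, the sketch below scans for a leading fence pair. The `---` (YAML) and `+++` (TOML) fences, the fields on FrontmatterSpan, and the helper name find_frontmatter are assumptions made for this example; neither this task nor the design sections cited here fix them.

```rust
use std::ops::Range;

// Which parser the payload should be handed to.
#[derive(Debug, PartialEq, Clone, Copy)]
enum FrontmatterFormat {
    Yaml,
    Toml,
}

// Hypothetical shape for the crate-internal span: the byte range of the
// payload between the fences, and the offset where the Markdown body begins.
#[derive(Debug, PartialEq)]
struct FrontmatterSpan {
    format: FrontmatterFormat,
    payload: Range<usize>,
    body_start: usize,
}

// Locate a leading frontmatter block, or return None if there is none.
// Assumption: `---` fences mean YAML and `+++` fences mean TOML.
fn find_frontmatter(bytes: &[u8]) -> Option<FrontmatterSpan> {
    let text = std::str::from_utf8(bytes).ok()?;
    let (fence, format) = if text.starts_with("---") {
        ("---", FrontmatterFormat::Yaml)
    } else if text.starts_with("+++") {
        ("+++", FrontmatterFormat::Toml)
    } else {
        return None;
    };

    // The opening fence must be alone on the first line (CRLF tolerated).
    let first_nl = text.find('\n')?;
    if text[..first_nl].trim_end() != fence {
        return None;
    }

    // Walk the following lines until the closing fence shows up.
    let payload_start = first_nl + 1;
    let mut offset = payload_start;
    for line in text[payload_start..].split_inclusive('\n') {
        if line.trim_end() == fence {
            return Some(FrontmatterSpan {
                format,
                payload: payload_start..offset,
                body_start: offset + line.len(),
            });
        }
        offset += line.len();
    }

    // Unterminated fence: this sketch treats it as "no frontmatter".
    None
}
```

In this sketch an unterminated fence is treated as a document without frontmatter; whether that case should emit a MalformedFrontmatter warning instead is a decision for the design doc, not this example.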

Behavior contract

  • All Metadata fields are optional in input. Missing fields populated per design §0 Q9 derive table:
    • title ← first H1 (from BodyHints.first_h1) → filename without extension if no H1.
    • lang ← lingua auto-detect on first 4 KB of body → fallback BodyHints.fallback_lang or "und".
    • created_at / updated_at ← BodyHints.fs_ctime / fs_mtime if missing.
    • source_type default markdown; trust_level default primary.
    • aliases, tags default empty.
  • Unknown keys → metadata.user (serde_json::Map), preserved verbatim, no warning.
  • Unknown enum value (e.g. trust_level: weird) → emit kb_parse_types::Warning { kind: WarningKind::MalformedFrontmatter, note: "unknown trust_level=weird, defaulted to primary" } + ingest continues with default.
  • Malformed YAML → frontmatter discarded, body still parsed, Warning { kind: WarningKind::MalformedFrontmatter, note: "<error msg>" } emitted.
  • No frontmatter at all → defaults applied silently.
  • id: field captured into metadata.user_id_alias (alias only — does NOT influence doc_id per design §4.2).
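
Two of the rules above, the unknown-enum fallback and the unknown-key passthrough, can be sketched with the standard library alone. Everything here is a simplified stand-in: Warning and WarningKind mimic the kb-parse-types names without reproducing their real definitions, the Secondary variant and the KNOWN_KEYS list are hypothetical, and a string map replaces both the parsed YAML/TOML value and the serde_json::Map used for metadata.user.

```rust
use std::collections::BTreeMap;

// Stand-ins for kb_parse_types::{Warning, WarningKind}.
#[derive(Debug, PartialEq)]
enum WarningKind {
    MalformedFrontmatter,
}

#[derive(Debug, PartialEq)]
struct Warning {
    kind: WarningKind,
    note: String,
}

// Only `primary` is named in this task; `Secondary` is a hypothetical variant.
#[derive(Debug, PartialEq, Clone, Copy)]
enum TrustLevel {
    Primary,
    Secondary,
}

// Hypothetical list of keys that map onto typed Metadata fields.
const KNOWN_KEYS: &[&str] = &[
    "title", "lang", "created_at", "updated_at",
    "source_type", "trust_level", "aliases", "tags", "id",
];

// Missing value: default silently. Unknown value: default plus a warning.
fn resolve_trust_level(raw: Option<&str>) -> (TrustLevel, Option<Warning>) {
    match raw {
        None | Some("primary") => (TrustLevel::Primary, None),
        Some("secondary") => (TrustLevel::Secondary, None),
        Some(other) => (
            TrustLevel::Primary,
            Some(Warning {
                kind: WarningKind::MalformedFrontmatter,
                note: format!("unknown trust_level={}, defaulted to primary", other),
            }),
        ),
    }
}

// Split parsed keys into (known, user). Unknown keys are moved as-is and
// produce no warning.
fn split_user_keys(
    parsed: BTreeMap<String, String>,
) -> (BTreeMap<String, String>, BTreeMap<String, String>) {
    parsed
        .into_iter()
        .partition(|(key, _)| KNOWN_KEYS.contains(&key.as_str()))
}
```

Returning the warning alongside the resolved value, rather than an error, keeps ingest going with the default, which is the behavior the contract asks for.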

Storage / wire effects

  • None. Pure function.

Test plan

| kind | description | fixture / data |
| --- | --- | --- |
| unit | YAML frontmatter happy path → Metadata fields | inline |
| unit | TOML frontmatter happy path | inline |
| unit | unknown keys preserved in metadata.user | inline |
| unit | unknown enum value → warning + default | inline |
| unit | malformed YAML → empty Metadata + warning | inline |
| unit | no frontmatter → derive from BodyHints | inline |
| unit | id: field becomes user_id_alias, not doc_id factor | inline + assert via §4.2 recipe stub |
| snapshot | fixtures/markdown/frontmatter-only.md produces stable JSON | fixture |
| snapshot | mixed-language body with no lang: detects ko or en | fixtures/markdown/mixed-lang.md |

All tests under cargo test -p kb-parse-md --lib frontmatter.

Definition of Done

  • cargo check -p kb-parse-md passes
  • cargo test -p kb-parse-md frontmatter passes
  • No pulldown-cmark import in this submodule
  • Snapshot tests stable across two consecutive runs
  • PR links design §0 Q9, §3.6

Out of scope

  • Block parsing (p1-3).
  • Building CanonicalDocument (p1-4).
  • Persisting metadata (p1-6).

Risks / notes

  • lingua model load is heavy on first call; tests should reuse a static instance.
  • Timezone normalization: parse created_at/updated_at to UTC; preserve the original offset in metadata.user.original_timestamps only if it is present and non-UTC.
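
The static-instance note for lingua can be sketched with std::sync::OnceLock. The Detector type below is a toy stand-in and deliberately does not imitate lingua's API; BUILD_COUNT exists only to make the reuse observable, and the helper names head and detect_lang are hypothetical. The 4096-byte window and the "und" fallback follow the behavior contract above.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::OnceLock;

// Counts constructor runs; present only so reuse can be asserted in tests.
static BUILD_COUNT: AtomicUsize = AtomicUsize::new(0);

// Stand-in for a detector that is expensive to build. NOT lingua's API.
struct Detector;

impl Detector {
    fn build() -> Self {
        BUILD_COUNT.fetch_add(1, Ordering::SeqCst);
        Detector
    }

    // Toy heuristic: any Hangul syllable means "ko", otherwise any ASCII
    // letter means "en", otherwise no decision.
    fn detect(&self, text: &str) -> Option<&'static str> {
        if text.chars().any(|c| ('\u{AC00}'..='\u{D7A3}').contains(&c)) {
            Some("ko")
        } else if text.chars().any(|c| c.is_ascii_alphabetic()) {
            Some("en")
        } else {
            None
        }
    }
}

// Build the detector on first use and hand out the same instance afterwards.
fn shared_detector() -> &'static Detector {
    static INSTANCE: OnceLock<Detector> = OnceLock::new();
    INSTANCE.get_or_init(Detector::build)
}

// Take at most `max_bytes` of the body without splitting a UTF-8 character.
fn head(body: &str, max_bytes: usize) -> &str {
    if body.len() <= max_bytes {
        return body;
    }
    let mut end = max_bytes;
    while !body.is_char_boundary(end) {
        end -= 1;
    }
    &body[..end]
}

// Detect on the first 4 KB, then fall back to the hint, then to "und".
fn detect_lang(body: &str, fallback_lang: Option<&str>) -> String {
    shared_detector()
        .detect(head(body, 4096))
        .or(fallback_lang)
        .unwrap_or("und")
        .to_string()
}
```

Backing off to a character boundary matters for Korean and other multi-byte text, where a fixed 4096-byte cut can land in the middle of a character.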