Files
kebab/tasks/p1/p1-2-parse-md-frontmatter.md
kb bc1b3147cd refactor(spec): cleanup pass over component specs
Address 8 issues found in spec audit (post PR #2):

1. §refs label: distinguish design vs report sections in p3-1 / p3-2 / p4-2 /
   p9-1 / p9-5 contract_sections (e.g., "report §11.2 Ollama" not "§11.2").
2. mock feature gate: gate MockEmbedder (p3-1) and MockLanguageModel (p4-1)
   behind `mock` cargo feature, default OFF; add CI symbol-scan as DoD item.
3. Warning type unification: p1-2 frontmatter now emits
   `kb_parse_types::Warning` (matches p1-3 / p1-4); drops crate-internal type.
4. p4-3 streaming thread: explicitly single-threaded inside RagPipeline::ask;
   collection + sink.send share the calling thread, no race. UI concurrency
   is callers responsibility (TUI worker thread pattern in p9-3).
5. p6-2 tesseract version: noted that `tesseract` 0.13 has no stable Rust
   `version()` accessor; use TessVersion FFI or shell-out + cache approach.
6. p9-* App struct extensions: introduce `kb_tui::{Library,Search,Ask,Inspect}State`
   slots in p9-1 forward-decl form; p9-2/3/4 fill bodies in their own crate
   without editing `App`. Parallel-safety contract added.
7. p3-3 cosine score: shift `(sim+1)/2` instead of clamp; preserve ranking
   signal between unrelated and opposite vectors. Clamp reserved for NaN.
8. fixtures/ root: p0-1 DoD now creates all fixture subdirs with .gitkeep so
   downstream tasks have a stable target path.
2026-04-27 23:38:13 +00:00

114 lines
4.8 KiB
Markdown

---
phase: P1
component: kb-parse-md (frontmatter submodule)
task_id: p1-2
title: "Markdown frontmatter parsing → Metadata"
status: planned
depends_on: [p0-1]
unblocks: [p1-4]
contract_source: ../../docs/superpowers/specs/2026-04-27-kb-final-form-design.md
contract_sections: [design §3.6 Metadata, design §3.7b kb-parse-types (Warning), design §0 Q9 frontmatter, design §10 errors]
---
# p1-2 — Markdown frontmatter parsing
## Goal
Parse YAML/TOML frontmatter from Markdown bytes into `kb_core::Metadata`, with auto-derive defaults and unknown-key preservation in `metadata.user`.
## Why now / why this size
Frontmatter is small but contractually load-bearing (Q9 spec). Isolating it from block parsing keeps both halves of `kb-parse-md` simple and lets us reach 100% test coverage on the rules in design §0 Q9.
## Allowed dependencies
- `kb-core`
- `kb-parse-types` (provides shared `Warning` + `WarningKind` per design §3.7b)
- `serde`
- `serde_yaml` (or `yaml-rust2`) for YAML
- `toml` for TOML
- `time`
- `lingua` (lang auto-detect — accept feature-gate if heavy)
- `thiserror`
## Forbidden dependencies
- `kb-store-*`, `kb-llm*`, `kb-rag`, `kb-embed*`, `kb-search`, `kb-tui`, `kb-desktop`, `kb-source-fs`, `kb-chunk`, `kb-normalize`, `pulldown-cmark` (block parser is a sibling task)
## Inputs
| input | type | source |
|-------|------|--------|
| Markdown bytes | `&[u8]` | extractor |
| body fallbacks | `BodyHints { first_h1: Option<String>, fs_ctime: OffsetDateTime, fs_mtime: OffsetDateTime, fallback_lang: Option<String> }` | caller |
## Outputs
| output | type | downstream |
|--------|------|------------|
| `(Metadata, Option<FrontmatterSpan>, Vec<kb_parse_types::Warning>)` | tuple | `kb-normalize` → CanonicalDocument |
## Public surface (signatures only — no new types)
```rust
pub fn parse_frontmatter(
bytes: &[u8],
hints: &BodyHints,
) -> anyhow::Result<(kb_core::Metadata, Option<FrontmatterSpan>, Vec<kb_parse_types::Warning>)>;
```
`Warning` / `WarningKind` come from `kb-parse-types` (shared with `p1-3` blocks parser and downstream `kb-normalize`). `FrontmatterSpan` is crate-internal; if any new public type is needed, STOP and update the frozen design doc first.
## Behavior contract
- All Metadata fields are optional in input. Missing fields populated per design §0 Q9 derive table:
- `title` ← first H1 (from `BodyHints.first_h1`) → filename without extension if no H1.
- `lang` ← lingua auto-detect on first 4 KB of body → fallback `BodyHints.fallback_lang` or `"und"`.
- `created_at` / `updated_at``BodyHints.fs_ctime` / `fs_mtime` if missing.
- `source_type` default `markdown`; `trust_level` default `primary`.
- `aliases`, `tags` default empty.
- Unknown keys → `metadata.user` (`serde_json::Map`), preserved verbatim, no warning.
- Unknown enum value (e.g. `trust_level: weird`) → emit `kb_parse_types::Warning { kind: WarningKind::MalformedFrontmatter, note: "unknown trust_level=weird, defaulted to primary" }` + ingest continues with default.
- Malformed YAML → frontmatter discarded, body still parsed, `Warning { kind: WarningKind::MalformedFrontmatter, note: "<error msg>" }` emitted.
- No frontmatter at all → defaults applied silently.
- `id:` field captured into `metadata.user_id_alias` (alias only — does NOT influence `doc_id` per design §4.2).
## Storage / wire effects
- None. Pure function.
## Test plan
| kind | description | fixture / data |
|------|-------------|----------------|
| unit | YAML frontmatter happy path → Metadata fields | inline |
| unit | TOML frontmatter happy path | inline |
| unit | unknown keys preserved in `metadata.user` | inline |
| unit | unknown enum value → warning + default | inline |
| unit | malformed YAML → empty Metadata + warning | inline |
| unit | no frontmatter → derive from BodyHints | inline |
| unit | `id:` field becomes `user_id_alias`, not `doc_id` factor | inline + assert via §4.2 recipe stub |
| snapshot | `fixtures/markdown/frontmatter-only.md` produces stable JSON | fixture |
| snapshot | mixed-language body with no `lang:` detects `ko` or `en` | `fixtures/markdown/mixed-lang.md` |
All tests under `cargo test -p kb-parse-md --lib frontmatter`.
## Definition of Done
- [ ] `cargo check -p kb-parse-md` passes
- [ ] `cargo test -p kb-parse-md frontmatter` passes
- [ ] No `pulldown-cmark` import in this submodule
- [ ] Snapshot tests stable across two consecutive runs
- [ ] PR links design §0 Q9, §3.6
## Out of scope
- Block parsing (p1-3).
- Building `CanonicalDocument` (p1-4).
- Persisting metadata (p1-6).
## Risks / notes
- `lingua` model load is heavy on first call; tests should reuse a static instance.
- timezone normalization: parse `created_at`/`updated_at` to UTC; preserve original offset only in `metadata.user.original_timestamps` if present and non-UTC.