Last commit. Bulk update of the word `kb` in all .md files.

- 19 crate names (`kb-core`, `kb-app`, …) → `kebab-*` (including the Rust module path notation `kb_*` → `kebab_*`).
- Future components (`kb-tui`, `kb-desktop`, `kb-asr-whisper`, `kb-ocr`, `kb-mcp`, `kb-vlm`, `kb-rerank`, `kb-vision-ocr`, `kb-index`, `kb-smoke`, `kb-architecture`) → `kebab-*` (the same prefix is used once P6+ starts).
- CLI command examples: `kb ingest` / `kb search` / `kb ask` / `kb init` / `kb doctor` / `kb inspect` / `kb list` / `kb eval` → `kebab <verb>`, in both fenced code blocks and inline backticks.
- XDG paths, env vars, and the binary path (`target/release/kb` → `target/release/kebab`) are synchronized.
- All references are unified across the design doc, the initial report, SMOKE, HOTFIXES, phase epics, and task specs.
- `git -c user.name=kb` in task-decomposition.md is left as is, because it records author information from past git history (the author in the actual git history cannot be changed).
- `tasks/phase-5-evaluation.md`: `status: planned` → `completed` as well (this had not been updated after the P5-1 + P5-2 PRs were merged).

## Verification

- `grep -rEn "\bkb-[a-z]|\bkb_[a-z]|\.config/kb\b|kb\.sqlite|\bKB_[A-Z]" --include="*.md"` returns 0 hits (excluding the git author in task-decomposition.md).
- All file path references resolve (every renamed file is updated to its new path).

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
114 lines
4.8 KiB
Markdown
---
phase: P1
component: kebab-parse-md (frontmatter submodule)
task_id: p1-2
title: "Markdown frontmatter parsing → Metadata"
status: completed
depends_on: [p0-1]
unblocks: [p1-4]
contract_source: ../../docs/superpowers/specs/2026-04-27-kebab-final-form-design.md
contract_sections: [design §3.6 Metadata, design §3.7b kebab-parse-types (Warning), design §0 Q9 frontmatter, design §10 errors]
---

# p1-2 — Markdown frontmatter parsing

## Goal

Parse YAML/TOML frontmatter from Markdown bytes into `kebab_core::Metadata`, with auto-derive defaults and unknown-key preservation in `metadata.user`.

## Why now / why this size

Frontmatter is small but contractually load-bearing (Q9 spec). Isolating it from block parsing keeps both halves of `kebab-parse-md` simple and lets us reach 100% test coverage on the rules in design §0 Q9.

## Allowed dependencies

- `kebab-core`
- `kebab-parse-types` (provides shared `Warning` + `WarningKind` per design §3.7b)
- `serde`
- `serde_yaml` (or `yaml-rust2`) for YAML
- `toml` for TOML
- `time`
- `lingua` (lang auto-detect; accept feature-gate if heavy)
- `thiserror`

## Forbidden dependencies

- `kebab-store-*`, `kebab-llm*`, `kebab-rag`, `kebab-embed*`, `kebab-search`, `kebab-tui`, `kebab-desktop`, `kebab-source-fs`, `kebab-chunk`, `kebab-normalize`, `pulldown-cmark` (block parser is a sibling task)

## Inputs

| input | type | source |
|-------|------|--------|
| Markdown bytes | `&[u8]` | extractor |
| body fallbacks | `BodyHints { first_h1: Option<String>, fs_ctime: OffsetDateTime, fs_mtime: OffsetDateTime, fallback_lang: Option<String> }` | caller |

## Outputs

| output | type | downstream |
|--------|------|------------|
| `(Metadata, Option<FrontmatterSpan>, Vec<kebab_parse_types::Warning>)` | tuple | `kebab-normalize` → CanonicalDocument |

## Public surface (signatures only, no new types)

```rust
pub fn parse_frontmatter(
    bytes: &[u8],
    hints: &BodyHints,
) -> anyhow::Result<(kebab_core::Metadata, Option<FrontmatterSpan>, Vec<kebab_parse_types::Warning>)>;
```

`Warning` / `WarningKind` come from `kebab-parse-types` (shared with `p1-3` blocks parser and downstream `kebab-normalize`). `FrontmatterSpan` is crate-internal; if any new public type is needed, STOP and update the frozen design doc first.

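Before any YAML or TOML parser runs, the frontmatter block has to be located in the raw bytes. The following is a minimal, std-only sketch of that step, not the crate's actual implementation. It assumes YAML frontmatter is fenced by `---` and TOML by `+++` (a common convention that this spec does not state), and it treats an unclosed fence as "no frontmatter". `FrontmatterFormat`, `SpanSketch`, and `split_frontmatter` are hypothetical names; `SpanSketch` only stands in for the crate-internal `FrontmatterSpan`, whose real shape is defined elsewhere.

```rust
/// Frontmatter syntax, inferred from the opening fence line.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum FrontmatterFormat {
    Yaml, // fenced by `---`
    Toml, // fenced by `+++` (assumed convention, not stated in the spec)
}

/// Hypothetical stand-in for the crate-internal `FrontmatterSpan`:
/// byte offsets into the original input.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub struct SpanSketch {
    pub format: FrontmatterFormat,
    /// First payload byte (just after the opening fence line).
    pub payload_start: usize,
    /// End of the payload (start of the closing fence line).
    pub payload_end: usize,
    /// First body byte (just after the closing fence line).
    pub body_start: usize,
}

/// Strips one trailing `\r` so CRLF input is handled like LF input.
fn trim_cr(line: &[u8]) -> &[u8] {
    match line.split_last() {
        Some((b'\r', head)) => head,
        _ => line,
    }
}

/// Returns `None` when the input does not open with a fence line
/// or when the fence is never closed.
pub fn split_frontmatter(bytes: &[u8]) -> Option<SpanSketch> {
    let format = if bytes.starts_with(b"---") {
        FrontmatterFormat::Yaml
    } else if bytes.starts_with(b"+++") {
        FrontmatterFormat::Toml
    } else {
        return None;
    };
    let fence: &[u8] = match format {
        FrontmatterFormat::Yaml => b"---",
        FrontmatterFormat::Toml => b"+++",
    };

    // The opening fence must be alone on the first line.
    let first_nl = bytes.iter().position(|&b| b == b'\n')?;
    if trim_cr(&bytes[..first_nl]) != fence {
        return None;
    }
    let payload_start = first_nl + 1;

    // Walk line by line until a line equal to the fence closes the block.
    let mut line_start = payload_start;
    loop {
        let rest = &bytes[line_start..];
        let line_len = rest.iter().position(|&b| b == b'\n').unwrap_or(rest.len());
        if trim_cr(&rest[..line_len]) == fence {
            let body_start = (line_start + line_len + 1).min(bytes.len());
            return Some(SpanSketch {
                format,
                payload_start,
                payload_end: line_start,
                body_start,
            });
        }
        if line_len == rest.len() {
            return None; // reached end of input without a closing fence
        }
        line_start += line_len + 1;
    }
}
```

In this sketch the payload slice `bytes[payload_start..payload_end]` would be handed to the YAML or TOML parser, and `bytes[body_start..]` to the sibling block parser. Working on byte offsets keeps the function independent of any parser crate.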
## Behavior contract

- All Metadata fields are optional in input. Missing fields are populated per the design §0 Q9 derive table:
  - `title` ← first H1 (from `BodyHints.first_h1`) → filename without extension if no H1.
  - `lang` ← lingua auto-detect on first 4 KB of body → fallback `BodyHints.fallback_lang` or `"und"`.
  - `created_at` / `updated_at` ← `BodyHints.fs_ctime` / `fs_mtime` if missing.
  - `source_type` default `markdown`; `trust_level` default `primary`.
  - `aliases`, `tags` default empty.
- Unknown keys → `metadata.user` (`serde_json::Map`), preserved verbatim, no warning.
- Unknown enum value (e.g. `trust_level: weird`) → emit `kebab_parse_types::Warning { kind: WarningKind::MalformedFrontmatter, note: "unknown trust_level=weird, defaulted to primary" }` + ingest continues with default.
- Malformed YAML → frontmatter discarded, body still parsed, `Warning { kind: WarningKind::MalformedFrontmatter, note: "<error msg>" }` emitted.
- No frontmatter at all → defaults applied silently.
- `id:` field captured into `metadata.user_id_alias` (alias only; it does NOT influence `doc_id` per design §4.2).

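The fallback chains and the "warn, then default" rule above can be illustrated with a small std-only sketch. It is not the real implementation: `Warning` and `WarningKind` are local stand-ins for the `kebab_parse_types` definitions (which may carry more fields), `TrustLevel::Secondary` is a hypothetical variant added only so the enum has more than one value, `HintsSketch` is a trimmed `BodyHints` without timestamps, and the `file_name` parameter is an assumption, since the spec does not say where the filename comes from.

```rust
use std::path::Path;

/// Local stand-in for `kebab_parse_types::WarningKind` (design §3.7b).
#[derive(Debug, PartialEq, Eq)]
pub enum WarningKind {
    MalformedFrontmatter,
}

/// Local stand-in for `kebab_parse_types::Warning`.
#[derive(Debug, PartialEq, Eq)]
pub struct Warning {
    pub kind: WarningKind,
    pub note: String,
}

/// `Primary` is the documented default; `Secondary` is hypothetical.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum TrustLevel {
    Primary,
    Secondary,
}

/// Trimmed-down `BodyHints` (timestamps omitted to stay std-only).
pub struct HintsSketch {
    pub first_h1: Option<String>,
    pub fallback_lang: Option<String>,
}

/// Unknown values produce a warning and fall back to the default,
/// so ingest continues instead of failing.
pub fn resolve_trust_level(raw: Option<&str>, warnings: &mut Vec<Warning>) -> TrustLevel {
    match raw {
        None | Some("primary") => TrustLevel::Primary,
        Some("secondary") => TrustLevel::Secondary,
        Some(other) => {
            warnings.push(Warning {
                kind: WarningKind::MalformedFrontmatter,
                note: format!("unknown trust_level={other}, defaulted to primary"),
            });
            TrustLevel::Primary
        }
    }
}

/// Explicit `title:` wins, then the first H1, then the file stem.
pub fn resolve_title(explicit: Option<&str>, hints: &HintsSketch, file_name: &str) -> String {
    if let Some(title) = explicit {
        return title.to_string();
    }
    if let Some(h1) = &hints.first_h1 {
        return h1.clone();
    }
    Path::new(file_name)
        .file_stem()
        .and_then(|stem| stem.to_str())
        .unwrap_or(file_name)
        .to_string()
}

/// Explicit `lang:` wins, then the detector result, then the caller's
/// fallback, then `"und"`.
pub fn resolve_lang(explicit: Option<&str>, detected: Option<&str>, hints: &HintsSketch) -> String {
    explicit
        .or(detected)
        .or(hints.fallback_lang.as_deref())
        .unwrap_or("und")
        .to_string()
}
```

Passing the warnings vector by mutable reference lets every resolver append to one shared list, which matches the `Vec<Warning>` returned by `parse_frontmatter`.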
## Storage / wire effects

- None. Pure function.

## Test plan

| kind | description | fixture / data |
|------|-------------|----------------|
| unit | YAML frontmatter happy path → Metadata fields | inline |
| unit | TOML frontmatter happy path | inline |
| unit | unknown keys preserved in `metadata.user` | inline |
| unit | unknown enum value → warning + default | inline |
| unit | malformed YAML → defaults-only Metadata + warning | inline |
| unit | no frontmatter → derive from BodyHints | inline |
| unit | `id:` field becomes `user_id_alias`, not `doc_id` factor | inline + assert via §4.2 recipe stub |
| snapshot | `fixtures/markdown/frontmatter-only.md` produces stable JSON | fixture |
| snapshot | mixed-language body with no `lang:` detects `ko` or `en` | `fixtures/markdown/mixed-lang.md` |

All tests under `cargo test -p kebab-parse-md --lib frontmatter`.

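As an illustration of the "unknown keys preserved in `metadata.user`" row, the sketch below splits already-parsed top-level keys into recognized fields and a pass-through map. It is a simplified assumption rather than the real code: values are plain `String`s instead of `serde_json::Value`, and `KNOWN_KEYS` and `partition_keys` are hypothetical names. The key list is taken from the fields named in the behavior contract.

```rust
use std::collections::BTreeMap;

/// Keys that map onto dedicated `Metadata` fields (from the behavior contract).
const KNOWN_KEYS: &[&str] = &[
    "title",
    "lang",
    "created_at",
    "updated_at",
    "source_type",
    "trust_level",
    "aliases",
    "tags",
    "id",
];

/// Splits parsed frontmatter into `(known, user)`.
/// Unknown keys land in `user` verbatim and produce no warning.
pub fn partition_keys(
    raw: BTreeMap<String, String>,
) -> (BTreeMap<String, String>, BTreeMap<String, String>) {
    raw.into_iter()
        .partition(|(key, _)| KNOWN_KEYS.contains(&key.as_str()))
}
```

A unit test for this row would then insert one known and one unknown key, call `partition_keys`, and assert that the unknown key and its value appear unchanged in the second map.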
## Definition of Done

- [ ] `cargo check -p kebab-parse-md` passes
- [ ] `cargo test -p kebab-parse-md frontmatter` passes
- [ ] No `pulldown-cmark` import in this submodule
- [ ] Snapshot tests stable across two consecutive runs
- [ ] PR links design §0 Q9, §3.6

## Out of scope

- Block parsing (p1-3).
- Building `CanonicalDocument` (p1-4).
- Persisting metadata (p1-6).

## Risks / notes

- `lingua` model load is heavy on first call; tests should reuse a static instance.
- Timezone normalization: parse `created_at`/`updated_at` to UTC; preserve the original offset only in `metadata.user.original_timestamps`, and only if it is present and non-UTC.
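
One way to reuse a single detector across tests is a process-wide lazily initialized static. The sketch below uses `std::sync::OnceLock` with a toy detector; `DetectorSketch`, `shared_detector`, and `build_count` are hypothetical names, and the Hangul check merely stands in for a real `lingua` detector, whose construction is the expensive part this pattern is meant to amortize.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::OnceLock;

/// Counts how many times the (pretend-expensive) detector was built.
static BUILD_COUNT: AtomicUsize = AtomicUsize::new(0);

/// Holds the single shared instance; initialized on first use.
static DETECTOR: OnceLock<DetectorSketch> = OnceLock::new();

/// Hypothetical stand-in for an expensive language detector.
pub struct DetectorSketch;

impl DetectorSketch {
    /// Stands in for the heavy model load.
    fn build() -> Self {
        BUILD_COUNT.fetch_add(1, Ordering::SeqCst);
        DetectorSketch
    }

    /// Toy heuristic: any Hangul syllable means "ko", otherwise "en".
    pub fn detect(&self, text: &str) -> &'static str {
        if text.chars().any(|c| ('\u{AC00}'..='\u{D7A3}').contains(&c)) {
            "ko"
        } else {
            "en"
        }
    }
}

/// Returns the shared detector, building it at most once per process.
pub fn shared_detector() -> &'static DetectorSketch {
    DETECTOR.get_or_init(DetectorSketch::build)
}

/// Exposes the build counter so tests can assert single initialization.
pub fn build_count() -> usize {
    BUILD_COUNT.load(Ordering::SeqCst)
}
```

Because `OnceLock::get_or_init` runs the initializer at most once even under concurrent callers, parallel test threads would share the same instance instead of each paying the load cost.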