feat(p1-2): kb-parse-md frontmatter (YAML/TOML → Metadata) #7

Merged
altair823 merged 6 commits from feat/p1-2-parse-md-frontmatter into main 2026-04-30 14:06:56 +00:00
Owner

Summary

Implements the P1-2 task spec. Adds the frontmatter submodule of the new crate kb-parse-md: it parses YAML/TOML frontmatter from Markdown bytes and produces kb_core::Metadata + Vec<Warning> + Option<FrontmatterSpan>. The design §0 Q9 derive table is applied 1:1. The P1-3 block parser will be added to the same crate as a sibling submodule.

Based on: tasks/p1/p1-2-parse-md-frontmatter.md, docs/superpowers/specs/2026-04-27-kb-final-form-design.md (§0 Q9 / §3.6 Metadata / §3.7b Warning / §10).

Implementation

crates/kb-parse-md:

  • lib.rs: exposes only parse_frontmatter, BodyHints, FrontmatterSpan
  • frontmatter.rs: delimiter detection / YAML (serde_yaml_ng) + TOML parsers / §0 Q9 derive chain / lingua language detection (ko/en/ja/zh, cached in a OnceLock)
  • tests/frontmatter_snapshots.rs: 3 snapshot tests + 1 regenerator
  • fixtures/markdown/frontmatter-only.md + baseline JSON
  • fixtures/markdown/mixed-lang.md + baseline JSON

Behavior contract highlights

  • §0 Q9 derive: fallback chains for title/lang/created_at/updated_at/source_type/trust_level/aliases/tags
  • Unknown key → preserved in metadata.user, no warning
  • Unknown enum value → MalformedFrontmatter warning + default + continue
  • Malformed YAML/TOML → frontmatter is discarded + warning (quoting the parser error) + body processing continues; the span is still returned
  • Non-UTC timezone → converted to UTC; the original string is preserved in metadata.user.original_timestamps[field]
  • id: field → stored only in Metadata.user_id_alias (no effect on doc_id; pinned by a recipe sentinel test)
  • lang/title live outside Metadata (they are direct fields of CanonicalDocument per design §3.4) → held temporarily in metadata.user["lang"] / metadata.user["title"]; P1-4 normalize lifts them into CanonicalDocument
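
The "unknown enum value → warning + default + continue" rule can be sketched roughly as follows. This is a minimal std-only illustration, not the crate's code: the `Secondary` variant, the shape of `Warning::MalformedFrontmatter`, and the function name `derive_trust_level` are assumptions. Only the `"primary"` default and the warning's name come from this PR.

```rust
/// Trust level of a document. `Primary` is the default from the derive table;
/// `Secondary` is a hypothetical second variant used only for illustration.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TrustLevel {
    Primary,
    Secondary,
}

/// Assumed warning shape; the real `Warning` type lives elsewhere in the workspace.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Warning {
    MalformedFrontmatter { field: String, message: String },
}

/// Maps the raw frontmatter string to a `TrustLevel`.
/// An unknown value never aborts parsing: it records a warning and
/// falls back to the default so the rest of the document is still processed.
pub fn derive_trust_level(raw: Option<&str>, warnings: &mut Vec<Warning>) -> TrustLevel {
    match raw {
        // Missing field: silent default.
        None | Some("primary") => TrustLevel::Primary,
        Some("secondary") => TrustLevel::Secondary,
        Some(other) => {
            warnings.push(Warning::MalformedFrontmatter {
                field: "trust_level".to_string(),
                message: format!("unknown value {other:?}, falling back to \"primary\""),
            });
            TrustLevel::Primary
        }
    }
}
```

Passing the warning list by mutable reference keeps the function infallible, which matches the "continue" part of the contract: the caller always gets a usable value back.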

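CRLF / BOM / trim robustness

  • LF/CRLF/mixed line endings are all supported (per-line EOL handling; the opening_len constant is removed)
  • A leading UTF-8 BOM is stripped, and its offset is added back so the returned span is exact
  • Trailing spaces/tabs on delimiter lines are allowed (prevents silent loss of frontmatter)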
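
The three robustness rules above can be sketched in one std-only function. This is an illustrative reconstruction under assumptions, not the merged implementation: the names `detect_delimiters`, `DelimKind`, and `FrontmatterSpan` appear in this PR, but the field layout of `FrontmatterSpan` and the helpers `line_bounds` and `is_delim_line` are hypothetical.

```rust
use std::ops::Range;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum DelimKind {
    Yaml, // fenced by `---`
    Toml, // fenced by `+++`
}

/// Byte range of the whole frontmatter block in the original input,
/// including both delimiter lines and the closing line's EOL (assumed layout).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct FrontmatterSpan {
    pub start: usize,
    pub end: usize,
}

const BOM: &[u8] = b"\xEF\xBB\xBF";
const YAML_MARK: &[u8] = b"---";
const TOML_MARK: &[u8] = b"+++";

/// Returns (content_end, next_line_start) for the line starting at `from`.
/// `content_end` excludes the EOL, which may be "\n", "\r\n", or absent at EOF.
fn line_bounds(bytes: &[u8], from: usize) -> (usize, usize) {
    match bytes[from..].iter().position(|&b| b == b'\n') {
        Some(i) => {
            let nl = from + i;
            let content_end = if nl > from && bytes[nl - 1] == b'\r' { nl - 1 } else { nl };
            (content_end, nl + 1)
        }
        None => (bytes.len(), bytes.len()),
    }
}

/// True if `line` is exactly `marker` followed only by spaces or tabs.
fn is_delim_line(line: &[u8], marker: &[u8]) -> bool {
    line.strip_prefix(marker)
        .map_or(false, |rest| rest.iter().all(|&b| b == b' ' || b == b'\t'))
}

/// Returns the delimiter kind, the full block span, and the payload byte range.
pub fn detect_delimiters(bytes: &[u8]) -> Option<(DelimKind, FrontmatterSpan, Range<usize>)> {
    // Skip a BOM at byte 0 only; all offsets stay relative to `bytes`.
    let start = if bytes.starts_with(BOM) { BOM.len() } else { 0 };

    let (open_end, inner_start) = line_bounds(bytes, start);
    let opener = &bytes[start..open_end];
    let (kind, marker) = if is_delim_line(opener, YAML_MARK) {
        (DelimKind::Yaml, YAML_MARK)
    } else if is_delim_line(opener, TOML_MARK) {
        (DelimKind::Toml, TOML_MARK)
    } else {
        return None;
    };

    // Walk line by line so each line's EOL is handled independently.
    let mut pos = inner_start;
    while pos < bytes.len() {
        let (content_end, next) = line_bounds(bytes, pos);
        if is_delim_line(&bytes[pos..content_end], marker) {
            let span = FrontmatterSpan { start, end: next };
            return Some((kind, span, inner_start..pos));
        }
        pos = next;
    }
    None // unterminated block: treated as "no frontmatter"
}
```

With this shape, a caller can take the body as `&bytes[span.end..]` regardless of whether the input started with a BOM or used CRLF, because every offset is computed against the original byte slice rather than from fixed delimiter widths.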

Verification

  • cargo check / build (RUSTFLAGS=-D warnings) / clippy --all-targets -D warnings clean
  • cargo test --workspace → kb-parse-md 26 unit + 3 snapshot tests, whole workspace green
  • cargo tree -p kb-parse-md --depth 1 → allowed deps only (kb-core, kb-parse-types, anyhow, serde, serde_json, serde_yaml_ng, time, toml, lingua)

Review

  • spec compliance: PASS-WITH-NICE-TO-FIX → the 1 must-fix (duplicated user_id_alias) was applied immediately (1fab6b0)
  • code quality round 1: CHANGES-REQUESTED. Critical CRLF + Important I1 (trailing whitespace) + I2 (BOM) + 4 minor items
  • code quality round 2 (re-review): APPROVED. All 4 EOL combinations (LF/LF, CRLF/CRLF, LF/CRLF, CRLF/LF) verified by hand-trace, BOM offset correct, 0 must-fix-now items

Next

After merge, branch off feat/p1-3-parse-md-blocks → proceed with P1-3 (sibling submodule).

altair823 added 6 commits 2026-04-30 13:18:02 +00:00
Add the workspace member with the dep allow-list pinned by design §0 Q9
and the task spec. P1-2 will land the frontmatter submodule in the next
commit; P1-3 will add the block parser as a sibling.

Notable choice: serde_yaml (dtolnay) was archived as unmaintained in 2024
so we use serde_yaml_ng, the maintained fork. lingua's per-language
features are explicitly enabled (default-features=false) to keep build
time + binary size sane — only the languages we need at parse time.
Implement the frontmatter submodule:

- detect_delimiters scans for a leading YAML (---) or TOML (+++) block at
  byte 0. Strict per §0 Q9: no leading whitespace / BOM, no chars on the
  delimiter line. Closing must be its own line. Unterminated → no FM.
- parse_raw deserializes into RawFrontmatter, a serde-flatten struct that
  catches unknown keys into a serde_json::Map for verbatim preservation
  in metadata.user.
- derive_metadata implements the §0 Q9 fallback chain:
    title       → frontmatter | BodyHints.first_h1 | (filename: caller)
    aliases/tags→ frontmatter | []
    lang        → frontmatter | lingua autodetect on first 4 KB | hints
                  | "und"
    created_at  → frontmatter (RFC 3339, normalized to UTC) | fs_ctime
    updated_at  → frontmatter | fs_mtime
    source_type → frontmatter | "markdown"
    trust_level → frontmatter | "primary"
    id          → user_id_alias only — never a doc_id factor (§4.2)
- Non-UTC offsets are normalized to UTC; the original string is preserved
  in user.original_timestamps[field] per §0 Q9.
- Warnings are emitted for: malformed YAML/TOML, unknown enum values,
  malformed timestamps. Unknown keys are silent.
- lingua detector is cached in a OnceLock — first build is heavy.
- 15 unit tests cover every row of the derive table + delimiter edge
  cases + an explicit pin that `id:` does not feed id_for_doc.
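
The title and lang rows of the fallback chain listed above map naturally onto `Option::or_else`. The following is a hedged std-only sketch: `BodyHints.first_h1` is named in the commit message, but the `lang` hint field, the function names, and the `detect` closure (standing in for the lingua call) are assumptions.

```rust
/// Hints gathered from the body. `first_h1` is named in the commit message;
/// the `lang` field is an assumed name for the language hint.
#[derive(Debug, Default)]
pub struct BodyHints {
    pub first_h1: Option<String>,
    pub lang: Option<String>,
}

/// title: frontmatter | first H1. `None` means the caller falls back to the filename.
pub fn derive_title(frontmatter: Option<String>, hints: &BodyHints) -> Option<String> {
    frontmatter.or_else(|| hints.first_h1.clone())
}

/// Longest prefix of `body` that is at most 4096 bytes and ends on a char boundary.
fn prefix_4k(body: &str) -> &str {
    let mut end = body.len().min(4096);
    while !body.is_char_boundary(end) {
        end -= 1;
    }
    &body[..end]
}

/// lang: frontmatter | autodetect on the first 4 KB | hints | "und".
/// `detect` is a stand-in for the real detector and is only called
/// when the frontmatter did not supply a language.
pub fn derive_lang(
    frontmatter: Option<String>,
    body: &str,
    hints: &BodyHints,
    detect: impl Fn(&str) -> Option<String>,
) -> String {
    frontmatter
        .or_else(|| detect(prefix_4k(body)))
        .or_else(|| hints.lang.clone())
        .unwrap_or_else(|| "und".to_string())
}
```

Because `or_else` is lazy, the comparatively expensive detection step runs only when the cheaper source is missing, which is the ordering the derive table implies.
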
Two markdown fixtures with hand-authored JSON baselines that pin the
§0 Q9 derive output across runs:

- frontmatter-only.md exercises the YAML happy path with most fields,
  unknown keys, an `id:` field, and a non-UTC created_at (so the
  baseline shows original_timestamps preservation).
- mixed-lang.md is body-only with no `lang:` field; baseline pins the
  lingua autodetect result for our enabled language set.

A separate `emit_snapshots` test (marked `#[ignore]`) regenerates the
baselines from the current parser output. A determinism test parses
the fixture twice and asserts equality so any non-determinism (e.g.
key ordering, lingua nondeterminism) fails fast.
Spec §"Behavior contract" line 74 says `id:` is captured into
`metadata.user_id_alias` only. Remove the redundant `user.insert`
that was also writing it into the user map, and update the snapshot
baseline accordingly.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
C1: detect_delimiters now returns (DelimKind, FrontmatterSpan, Range<usize>)
where the inner range is the YAML/TOML payload byte range — derived in one
place rather than recomputed by the parser via fixed-width opening_len /
closing_len constants that wrongly assumed LF endings. CRLF input now parses
correctly end-to-end; the originally-failing reviewer probe
"---\r\ntitle: Doc\r\n---\r\nbody\r\n" now yields title="Doc" with no
warnings.

I1: Trailing horizontal whitespace (spaces / tabs) on either delimiter
line is now accepted, matching Hugo / Jekyll. Editors that auto-trim
trailing whitespace no longer silently break otherwise-valid frontmatter.

I2: A leading UTF-8 BOM (EF BB BF, byte 0 only) is tolerated and skipped
before delimiter scanning. The returned span.start accounts for the BOM
(=3) so callers using bytes[span.end..] for body slicing still get the
correct range without further bookkeeping. Mid-input BOMs are not stripped.

M2: Drop the now-dead DelimKind::opening_len / closing_len constants —
the inner range is encoded once at detection time.

12 new tests covering CRLF (YAML / TOML / mixed-EOL / end-to-end),
trailing whitespace on opener / closer / tabs, leading BOM (detection +
full pipeline), and mid-input BOM non-stripping.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
M1: Reword the FrontmatterSpan doc-comment from "technically meant to be
crate-internal" to a forward-looking note about P1-3 / P1-4 callers using
bytes[span.end..] for body slicing.

M3: Add an explicit `# Errors` section to parse_frontmatter's rustdoc.
The current implementation never returns Err — all recoverable problems
are downgraded to warnings — but the Result is kept on the signature so
future hard-fail conditions can be added without breaking callers.
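
A rough illustration of the pattern M3 documents: a `Result`-returning signature whose current body downgrades every problem to a warning. All names here (`Parsed`, `ParseError`, `parse_frontmatter_sketch`) and the toy `key: value` parsing are hypothetical; the real function's types are not shown in this PR.

```rust
/// Hypothetical output type: a parsed title plus any collected warnings.
#[derive(Debug, Default, PartialEq)]
pub struct Parsed {
    pub title: Option<String>,
    pub warnings: Vec<String>,
}

/// Hypothetical error type, reserved for future hard-fail conditions.
#[derive(Debug)]
pub struct ParseError(pub String);

/// Parses a toy `key: value` payload.
///
/// # Errors
///
/// This sketch never returns `Err`: malformed lines are recorded in
/// `Parsed::warnings` instead. The `Result` stays on the signature so a
/// hard-fail condition can be added later without breaking callers.
pub fn parse_frontmatter_sketch(payload: &str) -> Result<Parsed, ParseError> {
    let mut out = Parsed::default();
    for line in payload.lines().filter(|l| !l.trim().is_empty()) {
        match line.split_once(':') {
            Some((key, value)) if key.trim() == "title" => {
                out.title = Some(value.trim().to_string());
            }
            Some(_) => {} // other keys are ignored in this sketch
            None => out.warnings.push(format!("malformed line: {line:?}")),
        }
    }
    Ok(out)
}
```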

M4: Mention serde_yml in the library-choice rationale alongside
serde_yaml_ng, with a one-line note on why _ng was preferred (stricter
adherence to original serde_yaml semantics around null / tagged enums).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
claude-reviewer-01 reviewed 2026-04-30 13:18:23 +00:00
claude-reviewer-01 left a comment
Member

Review: all items resolved, APPROVE recommended (registered as COMMENT per gate policy)

Three internal review loops (spec → quality round 1 → quality round 2) are complete. Final reviewer verdict: APPROVED, 0 must-fix-now items.

Review trail

| Round | Verdict | Fix commit |
|-------|---------|------------|
| Spec compliance | PASS-WITH-NICE-TO-FIX (duplicated user_id_alias, must-fix) | 1fab6b0 |
| Quality round 1 | CHANGES-REQUESTED (C1 CRLF / I1 trailing ws / I2 BOM / M1-M4) | 6a4db62, 5850bfc |
| Quality round 2 | APPROVED | - |

Key verification points

  • CRLF probe b"---\r\ntitle: Doc\r\n---\r\nbody\r\n" → title=Some("Doc"), warns=[], span correct. All 4 EOL combinations pass a hand-trace
  • BOM offset correct: span.start / span.end / inner range are all adjusted by +3
  • Trailing spaces and tabs on delimiter lines are allowed (prevents silent loss of frontmatter)
  • serde(flatten) partition is consistent: the id_field_does_not_feed_doc_id sentinel test passes
  • lingua OnceLock lazy init, 4-language feature gate (default-features = false)
  • The determinism test still passes after the refactor

Gates passed

  • cargo check / build (RUSTFLAGS=-D warnings) / clippy --all-targets -D warnings clean
  • cargo test -p kb-parse-md → 26 unit + 3 snapshot tests pass
  • cargo test --workspace → all green

Follow-ups that resolve naturally (non-blocking)

  • P1-4 normalize: lift metadata.user["lang"] / metadata.user["title"] into CanonicalDocument
  • Silent-None policy for unclosed frontmatter: if users report problems later, consider adding a heuristic (key:value pattern)

Conclusion

Nothing was papered over. Under the self-review gate policy this comment is registered as COMMENT; the author is asked to manually APPROVE and merge.

altair823 merged commit ff37ea5927 into main 2026-04-30 14:06:56 +00:00
altair823 deleted branch feat/p1-2-parse-md-frontmatter 2026-04-30 14:07:01 +00:00

Reference: altair823-org/kebab#7