PR #1 review left a design-debt note: ParsedBlock landing in kb-core would
(a) force every crate to recompile on parser-internal changes, and
(b) cause namespace pollution when P6/P7/P8 parsers add their own variants.
Resolution: add a thin crate, kb-parse-types, between kb-core and the parsers.
It owns ParsedBlock, ParsedPayload, and Warning, plus forward references for the
image/pdf/audio parser intermediates. It depends on kb-core only (for SourceSpan / Inline).
Updates:
- design §3.7b: add new section defining kb-parse-types
- design §8: add kb-parse-types to module-boundary diagram + forbidden list
- design §3.4: Inline stays in kb-core; kb-parse-types references it (no duplication)
- p0-1 skeleton: workspace + Cargo deps + public surface block
- p1-3 parse-md-blocks: outputs Vec<kb_parse_types::ParsedBlock> directly
- p1-4 normalize: Allowed gains kb-parse-types, drops cross-coupling note
- INDEX + phase-0 epic: list kb-parse-types in P0 deliverables
Combine Metadata (p1-2), Vec<kb_parse_types::ParsedBlock> (p1-3), and RawAsset (p1-1) into a kb_core::CanonicalDocument with a deterministic doc_id and deterministic block_ids, following the design §4 recipe.
Why now / why this size
Single responsibility: ID generation + struct assembly. Keeps kb-parse-md purely a parser and isolates the (security-critical) deterministic ID logic in one crate.
serde-json-canonicalizer (canonical JSON for ID hashing)
blake3
unicode-normalization (NFC)
time
thiserror
Forbidden dependencies
kb-source-fs, kb-parse-md (consumed via shared kb-parse-types only — kb-parse-md must NOT appear in this crate's cargo tree), kb-parse-pdf, kb-parse-image, kb-parse-audio, kb-chunk, kb-store-*, kb-embed*, kb-search, kb-llm*, kb-rag, kb-tui, kb-desktop
ID generation strictly follows design §4.2 (canonical JSON of tagged tuple, blake3 hex truncated to 32 chars).
block_id ordinal: per (heading_path, kind) group, 0-based, in document order.
All input strings normalized to NFC before hashing.
POSIX path normalization applied to workspace_path.
Unicode line endings normalized internally; SourceSpan::Line indices preserved as-is from p1-3.
Provenance built with one event per pipeline stage encountered: Discovered, Parsed, Normalized. Warnings appended as ProvenanceKind::Warning with note.
Determinism property test: same inputs → byte-identical CanonicalDocument JSON, including ID stability across runs.
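As an illustration of the block_id ordinal rule above, here is a minimal sketch, not the crate's actual code. `BlockKind`, `assign_ordinals`, and the tuple representation of a block are hypothetical stand-ins; the real types come from kb-parse-types and kb-core. It counts blocks separately per (heading_path, kind) group, starting at 0, in document order.

```rust
use std::collections::HashMap;

// Hypothetical stand-in for the real block kind type.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
enum BlockKind {
    Paragraph,
    Code,
}

/// Returns one ordinal per input block. Ordinals are 0-based and counted
/// independently for each (heading_path, kind) group, in document order.
fn assign_ordinals(blocks: &[(Vec<String>, BlockKind)]) -> Vec<u32> {
    // Next free ordinal for each group seen so far.
    let mut next: HashMap<(&[String], BlockKind), u32> = HashMap::new();
    blocks
        .iter()
        .map(|(heading_path, kind)| {
            let counter = next.entry((heading_path.as_slice(), *kind)).or_insert(0);
            let ordinal = *counter;
            *counter += 1;
            ordinal
        })
        .collect()
}
```

For the block sequence Intro/Paragraph, Intro/Code, Intro/Paragraph, Usage/Paragraph, Intro/Code, this sketch yields the ordinals 0, 0, 1, 0, 1. Because the ordinal depends only on earlier blocks in the same group, inserting a block of a different kind or under a different heading does not shift it.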
Storage / wire effects
None.
Test plan
| kind | description | fixture / data |
|---|---|---|
| unit | id_for_doc deterministic across 1000 runs | inline |
| unit | NFC vs NFD Korean inputs produce identical IDs | inline |
| unit | POSIX path with ./ and // collapse to same doc_id | inline |
| unit | block ordinal numbering inside same heading_path is correct | inline |
| unit | provenance contains Discovered/Parsed/Normalized in order | inline |
| snapshot | fixtures/markdown/code-and-table.md → CanonicalDocument JSON stable (incl. all IDs) | fixture |
All tests under cargo test -p kb-normalize.
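The path-collapsing case in the test plan could be exercised against a helper along these lines. This is a sketch under assumptions: `normalize_workspace_path` is a hypothetical name, workspace_path is assumed to be workspace-relative and already `/`-separated, and `..` segments are passed through unchanged because the spec above does not say how they should be handled.

```rust
/// Collapses empty segments (from `//`) and current-directory segments (`.`)
/// in a `/`-separated, workspace-relative path.
/// Assumption: `..` segments are kept as-is; resolving them is out of scope here.
fn normalize_workspace_path(raw: &str) -> String {
    raw.split('/')
        .filter(|segment| !segment.is_empty() && *segment != ".")
        .collect::<Vec<_>>()
        .join("/")
}
```

With this helper, `./notes//a.md`, `notes/./a.md`, and `notes/a.md` all normalize to `notes/a.md`, so any ID derived from the normalized string is the same for all three spellings.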
Definition of Done
cargo check -p kb-normalize passes
cargo test -p kb-normalize passes
Determinism test runs ≥ 1000 iterations in under 1 second
No kb-parse-md (or any other parser crate) appears in cargo tree -p kb-normalize — input types come from kb-parse-types only
PR links design §3.7b, §4.2, §4.3, §8
Out of scope
Chunking (p1-5).
DB writes (p1-6).
Block validation beyond what is needed to assign IDs (e.g., we do NOT verify image src exists on disk here).
Risks / notes
If the ID recipe changes, all dependent records become stale. Treat any change to id_for_doc/id_for_block as a parser_version bump (design §9).