
kb 9fa38543a8 refactor(spec): introduce kb-parse-types thin crate

PR #1 review left a design-debt note: ParsedBlock landing in kb-core would
(a) force every crate to recompile on parser-internal changes, and
(b) cause namespace pollution when P6/P7/P8 parsers add their own variants.

Resolution: a new thin crate kb-parse-types sits between kb-core and parsers.
Owns ParsedBlock + ParsedPayload + Warning + forward-refs for image/pdf/audio
parser intermediates. Depends on kb-core only (for SourceSpan / Inline).

Updates:
- design §3.7b: add new section defining kb-parse-types
- design §8: add kb-parse-types to module-boundary diagram + forbidden list
- design §3.4 Inline stays in kb-core; kb-parse-types references it (no duplication)
- p0-1 skeleton: workspace + Cargo deps + public surface block
- p1-3 parse-md-blocks: outputs Vec<kb_parse_types::ParsedBlock> directly
- p1-4 normalize: Allowed gains kb-parse-types, drops cross-coupling note
- INDEX + phase-0 epic: list kb-parse-types in P0 deliverables

2026-04-27 20:41:35 +00:00

4.6 KiB


field	value
phase	
component	kb-normalize
task_id	p1-4
title	Lift parser output → CanonicalDocument with deterministic IDs
status	planned
depends_on	p1-2, p1-3
unblocks	p1-5, p1-6
contract_source	../../docs/superpowers/specs/2026-04-27-kb-final-form-design.md
contract_sections	§3.4; §3.7b kb-parse-types; §4 ID recipe; §3.6 Provenance; §8 module boundaries

p1-4 — Lift to CanonicalDocument

Goal

Combine Metadata (p1-2) + Vec<kb_parse_types::ParsedBlock> (p1-3) + RawAsset (p1-1) into a kb_core::CanonicalDocument with deterministic doc_id and block_ids per design §4 recipe.

Why now / why this size

Single responsibility: ID generation + struct assembly. Keeps kb-parse-md purely a parser and isolates the (security-critical) deterministic ID logic in one crate.

Allowed dependencies

kb-core
kb-parse-types (input shapes — ParsedBlock, ParsedPayload, Warning)
kb-config
serde
serde-json-canonicalizer (canonical JSON for ID hashing)
blake3
unicode-normalization (NFC)
time
thiserror

Forbidden dependencies

kb-source-fs, kb-parse-md (consumed via shared kb-parse-types only — kb-parse-md must NOT appear in this crate's cargo tree), kb-parse-pdf, kb-parse-image, kb-parse-audio, kb-chunk, kb-store-*, kb-embed*, kb-search, kb-llm*, kb-rag, kb-tui, kb-desktop

Inputs

input	type	source
`RawAsset`	`kb_core::RawAsset`	p1-1
`Metadata` + frontmatter span + warnings	`(kb_core::Metadata, Option<FrontmatterSpan>, Vec<kb_parse_types::Warning>)`	p1-2
`Vec<ParsedBlock>` + warnings	`(Vec<kb_parse_types::ParsedBlock>, Vec<kb_parse_types::Warning>)`	p1-3
`parser_version`	`kb_core::ParserVersion`	constant in `kb-parse-md`

Outputs

output	type	downstream
`CanonicalDocument`	`kb_core::CanonicalDocument`	`kb-chunk`, `kb-store-sqlite`

Public surface (signatures only — no new types)

pub fn build_canonical_document(
    asset: &kb_core::RawAsset,
    metadata: kb_core::Metadata,
    blocks: Vec<kb_parse_types::ParsedBlock>,
    parser_version: &kb_core::ParserVersion,
    warnings: Vec<kb_parse_types::Warning>,
) -> anyhow::Result<kb_core::CanonicalDocument>;

pub fn id_for_doc(workspace_path: &kb_core::WorkspacePath, asset: &kb_core::AssetId, parser_version: &kb_core::ParserVersion) -> kb_core::DocumentId;
pub fn id_for_block(doc: &kb_core::DocumentId, kind: &str, heading_path: &[String], ordinal: u32, span: &kb_core::SourceSpan) -> kb_core::BlockId;
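
To illustrate the ID width implied by the behavior contract (blake3 hex truncated to 32 chars), the sketch below hex-encodes a digest and keeps the first 32 characters, that is, the leading 16 bytes (128 bits). The helper name `truncated_hex` and the placeholder digest are hypothetical. A real implementation would presumably take the 32-byte digest from blake3 over the canonical JSON defined in design §4.2, which is not reproduced here.

```rust
use std::fmt::Write;

/// Hex-encode `digest` in lowercase and keep at most `chars` characters.
/// Hypothetical helper; the real digest would come from blake3.
fn truncated_hex(digest: &[u8], chars: usize) -> String {
    let mut out = String::with_capacity(digest.len() * 2);
    for byte in digest {
        // Two zero-padded lowercase hex digits per byte.
        write!(out, "{:02x}", byte).expect("writing to a String cannot fail");
    }
    // The output is pure ASCII, so truncating at any index is safe.
    out.truncate(chars);
    out
}

fn main() {
    // Placeholder 32-byte digest (0x00..=0x1f); NOT a real blake3 output.
    let digest: Vec<u8> = (0u8..32).collect();
    let id = truncated_hex(&digest, 32);
    assert_eq!(id.len(), 32);
    println!("{id}");
}
```

Truncating the hex string rather than the raw bytes keeps the helper independent of the hash library's output type.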

Behavior contract

ID generation strictly follows design §4.2 (canonical JSON of tagged tuple, blake3 hex truncated to 32 chars).
block_id ordinal: per (heading_path, kind) group, 0-based, in document order.
All input strings normalized to NFC before hashing.
POSIX path normalization applied to workspace_path.
Unicode line endings normalized internally; SourceSpan::Line indices preserved as-is from p1-3.
Provenance built with one event per pipeline stage encountered: Discovered, Parsed, Normalized. Warnings appended as ProvenanceKind::Warning with note.
Determinism property test: same inputs → byte-identical CanonicalDocument JSON, including ID stability across runs.
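
The ordinal rule above (0-based, counted per (heading_path, kind) group, in document order) can be sketched as follows. `BlockStub`, `stub` and `assign_ordinals` are hypothetical stand-ins for illustration; the real input type is `kb_parse_types::ParsedBlock`, whose fields are not shown in this document.

```rust
use std::collections::HashMap;

/// Minimal stand-in for a parsed block (assumed shape, not the real type).
#[derive(Debug, Clone)]
struct BlockStub {
    heading_path: Vec<String>,
    kind: String,
}

/// Assign a 0-based ordinal to each block, counted separately for every
/// (heading_path, kind) group, walking the blocks in document order.
fn assign_ordinals(blocks: &[BlockStub]) -> Vec<u32> {
    let mut next: HashMap<(Vec<String>, String), u32> = HashMap::new();
    blocks
        .iter()
        .map(|b| {
            let counter = next
                .entry((b.heading_path.clone(), b.kind.clone()))
                .or_insert(0);
            let ordinal = *counter;
            *counter += 1;
            ordinal
        })
        .collect()
}

/// Convenience constructor for the example.
fn stub(path: &[&str], kind: &str) -> BlockStub {
    BlockStub {
        heading_path: path.iter().map(|s| s.to_string()).collect(),
        kind: kind.to_string(),
    }
}

fn main() {
    let blocks = vec![
        stub(&["Intro"], "paragraph"),
        stub(&["Intro"], "code"),
        stub(&["Intro"], "paragraph"),
        stub(&["Intro", "Details"], "paragraph"),
    ];
    // The second paragraph under "Intro" gets ordinal 1; every other
    // block is the first of its (heading_path, kind) group.
    assert_eq!(assign_ordinals(&blocks), vec![0, 0, 1, 0]);
}
```

Because the counter is keyed on both the heading path and the kind, inserting a code block between two paragraphs does not shift the paragraphs' ordinals.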

Storage / wire effects

None.

Test plan

kind	description	fixture / data
unit	id_for_doc deterministic across 1000 runs	inline
unit	NFC vs NFD Korean inputs produce identical IDs	inline
unit	POSIX path with `./` and `//` collapse to same `doc_id`	inline
unit	block ordinal numbering inside same heading_path is correct	inline
unit	provenance contains Discovered/Parsed/Normalized in order	inline
snapshot	`fixtures/markdown/code-and-table.md` → CanonicalDocument JSON stable (incl. all IDs)	fixture

All tests under cargo test -p kb-normalize.
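
As a rough sketch of the `./` and `//` row in the test plan, the helper below performs a purely lexical normalization of a workspace-relative POSIX path. The name `normalize_posix` is hypothetical, and the exact rules (including how `..` and leading slashes are treated) are assumptions here; the design document is authoritative.

```rust
/// Lexically normalize a workspace-relative POSIX path: drop empty segments
/// (produced by `//`) and `.` segments. `..` is left untouched because this
/// document does not say how it should be handled.
fn normalize_posix(path: &str) -> String {
    path.split('/')
        .filter(|seg| !seg.is_empty() && *seg != ".")
        .collect::<Vec<_>>()
        .join("/")
}

fn main() {
    let a = normalize_posix("./notes//daily/2026.md");
    let b = normalize_posix("notes/daily/./2026.md");
    // Both spellings collapse to the same string, so an ID derived from the
    // normalized path would be identical for both.
    assert_eq!(a, b);
    println!("{a}");
}
```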

Definition of Done

cargo check -p kb-normalize passes
cargo test -p kb-normalize passes
Determinism test completes ≥ 1000 iterations in under 1 second
No kb-parse-md (or any other parser crate) appears in cargo tree -p kb-normalize — input types come from kb-parse-types only
PR links design §3.7b, §4.2, §4.3, §8

Out of scope

Chunking (p1-5).
DB writes (p1-6).
Block validation beyond what is needed to assign IDs (e.g., we do NOT verify image src exists on disk here).

Risks / notes

If the ID recipe changes, all dependent records become stale. Treat any change to id_for_doc or id_for_block as a parser_version bump (design §9).
