Merge pull request 'refactor(spec): introduce kb-parse-types thin crate' (#2) from refactor/parse-types-crate into main

This commit was merged in pull request #2.
This commit is contained in:
2026-04-27 20:43:21 +00:00
6 changed files with 144 additions and 33 deletions

View File

@@ -581,7 +581,7 @@ pub struct RetrievalDetail {
### 3.7a Forward-declared types
`Block::ImageRef` / `AudioRef` variant 은 v1 부터 존재하나, 그 안의 `ocr` / `caption` / `transcript` 필드는 P1 에선 항상 `None`. 다음 타입은 `kb-core` 에 stub 으로 둠:
`Block::ImageRef` / `AudioRef` variant 은 v1 부터 존재하나, 그 안의 `ocr` / `caption` / `transcript` 필드는 P1 에선 항상 `None`. 다음 타입은 `kb-core` 에 stub 으로 둠 (최종 도메인 모델 슬롯):
```rust
pub struct OcrText { pub joined: String, pub regions: Vec<OcrRegion>, pub engine: String, pub engine_version: String }
@@ -600,6 +600,69 @@ pub enum AudioType { M4a, Mp3, Wav, Flac, Ogg, Other(String) }
`OffsetDateTime``time::OffsetDateTime`, `Result` 는 crate-local alias.
### 3.7b Parser intermediate types — `kb-parse-types`
Parser 의 *중간* 표현 (`ParsedBlock` 류) 은 `kb-core` 가 아니라 별도의 thin crate **`kb-parse-types`** 에 둔다. 이유: `kb-normalize` 는 medium-agnostic 한 ID/Provenance lift 를 책임지고 어떤 parser 도 직접 import 하면 안 된다. 그러나 normalize 에 들어오는 입력 타입이 어딘가에 정의되어야 하는데, 그것을 `kb-core` 에 박으면 (a) parser-별 ParsedBlock 변종 (`ParsedImageRegion`, `ParsedPdfPage`, `ParsedAudioSegment`) 이 향후 합류할 때 core 의 namespace 가 폭발하고, (b) parser 의 의미 변경이 core 변경이 되어 모든 의존자가 영향을 받는다.
`kb-parse-types` 는 이 둘 사이의 **유일한 layer** 다. 의존 그래프:
```text
kb-core (도메인 모델 — Block, Chunk, SourceSpan, IDs, …)
kb-parse-types (parser 중간 표현 — ParsedBlock, ParsedImageRegion[P+], ParsedPdfPage[P+], ParsedAudioSegment[P+], Inline)
▲ ▲
│ │
kb-parse-md, kb-parse-pdf, kb-normalize
kb-parse-image, kb-parse-audio
```
`kb-parse-types` 는:
- `kb-core` 에만 의존 (`Block`, `SourceSpan`, `Lang` 등 도메인 타입 사용).
- 다른 어떤 `kb-*` 에도 의존하지 않는다.
- 어떤 parser 의 구체 라이브러리 (`pulldown-cmark`, `pdf-extract`, `image`, `whisper-rs`) 에도 의존하지 않는다.
- serde + thiserror 정도의 외부 의존만 가진다.
P1 에서 정의되는 타입:
```rust
// kb-parse-types — depends on kb-core only.
pub struct ParsedBlock {
pub kind: ParsedBlockKind,
pub heading_path: Vec<String>,
pub source_span: kb_core::SourceSpan,
pub payload: ParsedPayload,
}
pub enum ParsedBlockKind { Heading, Paragraph, List, Code, Table, Quote, ImageRef, AudioRef }
pub enum ParsedPayload {
Heading { level: u8, text: String },
Paragraph { text: String, inlines: Vec<kb_core::Inline> },
List { ordered: bool, items: Vec<Vec<kb_core::Inline>> },
Code { lang: Option<String>, code: String },
Table { headers: Vec<String>, rows: Vec<Vec<String>> },
Quote { text: String, inlines: Vec<kb_core::Inline> },
ImageRef { src: String, alt: String },
AudioRef { src: String }, // duration_ms filled by extractor before chunking
}
pub struct Warning { pub kind: WarningKind, pub note: String }
pub enum WarningKind { MalformedFrontmatter, MalformedTable, EncodingFallback, ExtractFailed }
```
`Inline``kb-core` (§3.4) 에 있는 도메인 타입. `kb-parse-types` 는 그것을 *참조* 만 한다 — 같은 의미를 두 crate 에 중복 정의하지 않는다 (그러면 normalize 가 identity-conversion 을 해야 해서 무의미).
P6/P7/P8 에서 추가될 타입 (forward-ref):
```rust
pub struct ParsedImageRegion { /* OCR/EXIF 추출 전 단계 */ }
pub struct ParsedPdfPage { pub page: u32, pub text: String }
pub struct ParsedAudioSegment { pub start_ms: u64, pub end_ms: u64, pub text: String }
```
→ 새 medium 추가 시 `kb-core::Block` 변종은 변하지 않고, `kb-parse-types` 만 확장된다.
### 3.8 Answer / RAG types
```rust
@@ -1151,7 +1214,9 @@ kb-cli, kb-tui, kb-desktop
└─> kb-app
├─> kb-source-fs
├─> kb-parse-md / kb-parse-pdf / kb-parse-image / kb-parse-audio
│ └─> kb-parse-types (parser intermediate)
├─> kb-normalize
│ └─> kb-parse-types
├─> kb-chunk
├─> kb-store-sqlite (DocumentStore, JobRepo, Retriever[lexical])
├─> kb-store-vector (VectorStore)
@@ -1164,11 +1229,15 @@ kb-cli, kb-tui, kb-desktop
└─> kb-core (모두 의존)
```
`kb-parse-types``kb-core` 와 parsers/normalize 사이의 thin layer (§3.7b 참조). parser-별 중간 표현 (`ParsedBlock`, `ParsedImageRegion`, `ParsedPdfPage`, `ParsedAudioSegment`, `Inline`) 을 한 곳에 모아 (a) `kb-core` 의 namespace 폭발을 막고 (b) `kb-normalize` 가 parser 를 직접 import 하지 않게 한다.
핵심 금지:
- UI → store/llm/parse 직접 의존 ✗
- parse-* → store/llm/embed ✗
- parse-* → kb-normalize ✗ (단방향: parsers → kb-parse-types ← normalize)
- chunk → llm/embed ✗
- normalize → store ✗
- normalize → store / parse-*
- kb-parse-types → 어떤 parser/normalize/store/llm/embed/search/rag/ui ✗ (`kb-core` 만 의존)
- 다른 store 와 cross-write ✗
`cargo deny` + workspace deny.toml + CI 체크로 강제.

View File

@@ -25,7 +25,7 @@ P0~P5 는 직렬. P6~P9 는 P5 이후 병렬 가능.
| # | 코드 | 제목 | 핵심 산출 crate | 선행 |
|---|------|------|----------------|------|
| P0 | [phase-0-skeleton.md](phase-0-skeleton.md) | Workspace 뼈대 + 도메인 계약 | kb-core, kb-config, kb-app, kb-cli | |
| P0 | [phase-0-skeleton.md](phase-0-skeleton.md) | Workspace 뼈대 + 도메인 계약 | kb-core, kb-parse-types, kb-config, kb-app, kb-cli | |
| P1 | [phase-1-markdown-ingestion.md](phase-1-markdown-ingestion.md) | Markdown ingestion 파이프라인 | kb-source-fs, kb-parse-md, kb-normalize, kb-chunk, kb-store-sqlite | P0 |
| P2 | [phase-2-lexical-search.md](phase-2-lexical-search.md) | SQLite FTS5 lexical 검색 + citation | kb-search (lexical) | P1 |
| P3 | [phase-3-vector-hybrid.md](phase-3-vector-hybrid.md) | Local embedding + LanceDB + hybrid | kb-embed, kb-embed-local, kb-store-vector, kb-search | P2 |

View File

@@ -14,7 +14,7 @@ contract_sections: [§3 (all), §4, §5.1 schema_meta+migrations, §6 (config +
## Goal
Stand up the Cargo workspace (Rust 2024, resolver=3) with `kb-core`, `kb-config`, `kb-app`, `kb-cli` crates. Freeze every domain type, trait, ID recipe, error type, and CLI entry shape per the frozen design doc so that all subsequent component tasks compile against stable contracts.
Stand up the Cargo workspace (Rust 2024, resolver=3) with `kb-core`, `kb-parse-types`, `kb-config`, `kb-app`, `kb-cli` crates. Freeze every domain type, trait, ID recipe, error type, and CLI entry shape per the frozen design doc so that all subsequent component tasks compile against stable contracts.
## Why now / why this size
@@ -25,6 +25,7 @@ Every other task imports `kb-core`. If types or trait signatures wobble after th
- workspace `[workspace.dependencies]`: `anyhow = "1"`, `thiserror = "2"`, `serde = { version = "1", features = ["derive"] }`, `serde_json = "1"`, `time = { version = "0.3", features = ["serde", "macros"] }`, `uuid = { version = "1", features = ["v7", "serde"] }`, `blake3 = "1"`, `tracing = "0.1"`
- per crate:
- `kb-core`: workspace deps + `serde_json::Map`, `serde-json-canonicalizer`, `unicode-normalization`
- `kb-parse-types`: workspace deps + `kb-core` ONLY (no parsers, no stores, no normalize). Defines parser intermediate representations per design §3.7b.
- `kb-config`: workspace deps + `toml = "0.8"`, `dirs = "5"` (XDG paths)
- `kb-app`: workspace deps + `kb-core`, `kb-config`, `tracing-subscriber`, `tracing-appender`
- `kb-cli`: workspace deps + `kb-core`, `kb-config`, `kb-app`, `clap = { version = "4", features = ["derive"] }`
@@ -32,6 +33,7 @@ Every other task imports `kb-core`. If types or trait signatures wobble after th
## Forbidden dependencies
- `kb-core` MUST NOT depend on any other `kb-*` crate.
- `kb-parse-types` MUST depend ONLY on `kb-core`. No parser libraries (`pulldown-cmark`, `pdf-extract`, `image`, `whisper-rs`, …), no other `kb-*` crate.
- `kb-config` MUST NOT depend on `kb-app`, `kb-cli`, parsers, stores, embedders, search, llm, rag, tui, desktop.
- `kb-app` MUST NOT yet depend on parsers/stores/embedders/search/llm/rag (those crates do not exist yet — facade methods stub out and return `unimplemented!()` or `anyhow::bail!("not yet wired (Pn-i)")`).
- `kb-cli` MUST NOT call any non-`kb-app` crate directly.
@@ -112,8 +114,7 @@ pub struct AudioRefBlock { /* per §3.4 */ }
pub enum Inline { /* per §3.4 */ }
pub enum SourceSpan { /* per §3.4 */ }
// ParsedBlock (intermediate, exposed via core for normalize — see p1-4 spec)
pub struct ParsedBlock { /* mirror of Block without BlockId */ }
// (ParsedBlock + parser intermediates live in kb-parse-types per design §3.7b — NOT in kb-core.)
// Chunk + Citation (§3.5)
pub struct Chunk { /* per §3.5 */ }
@@ -230,6 +231,42 @@ pub fn to_posix(path: &std::path::Path) -> WorkspacePath; // §6.6
pub fn nfc(input: &str) -> String; // §4.1
```
```rust
// ── kb-parse-types ──────────────────────────────────────────────────────────
// Per design §3.7b. Defines parser intermediate representations consumed by
// kb-normalize. Depends on kb-core only — never on parser libraries.
pub struct ParsedBlock {
pub kind: ParsedBlockKind,
pub heading_path: Vec<String>,
pub source_span: kb_core::SourceSpan,
pub payload: ParsedPayload,
}
pub enum ParsedBlockKind { Heading, Paragraph, List, Code, Table, Quote, ImageRef, AudioRef }
pub enum ParsedPayload {
Heading { level: u8, text: String },
Paragraph { text: String, inlines: Vec<kb_core::Inline> },
List { ordered: bool, items: Vec<Vec<kb_core::Inline>> },
Code { lang: Option<String>, code: String },
Table { headers: Vec<String>, rows: Vec<Vec<String>> },
Quote { text: String, inlines: Vec<kb_core::Inline> },
ImageRef { src: String, alt: String },
AudioRef { src: String },
}
// `Inline` itself lives in kb-core (§3.4) — parse-types references it, never duplicates it.
pub struct Warning { pub kind: WarningKind, pub note: String }
pub enum WarningKind { MalformedFrontmatter, MalformedTable, EncodingFallback, ExtractFailed }
// Forward-ref for P6/P7/P8 — defined when those phases land.
pub struct ParsedImageRegion;
pub struct ParsedPdfPage;
pub struct ParsedAudioSegment;
```
```rust
// ── kb-config ───────────────────────────────────────────────────────────────
pub struct Config { /* full schema per §6.4 */ }
@@ -306,16 +343,17 @@ All tests must run with no network, no Ollama, no models.
## Definition of Done
- [ ] `Cargo.toml` workspace lists `kb-core`, `kb-config`, `kb-app`, `kb-cli` and resolver=3, edition 2024
- [ ] `Cargo.toml` workspace lists `kb-core`, `kb-parse-types`, `kb-config`, `kb-app`, `kb-cli` and resolver=3, edition 2024
- [ ] `cargo check --workspace` passes
- [ ] `cargo test --workspace` passes
- [ ] `kb-parse-types` `cargo tree` shows ONLY `kb-core` + `serde`/`thiserror` style deps (no parser libs, no other `kb-*`)
- [ ] `kb --help` prints subcommands
- [ ] `kb init` creates XDG dirs idempotently and writes `config.toml`
- [ ] `kb doctor` returns wire JSON conforming to `doctor.v1` (in `--json` mode)
- [ ] `docs/wire-schema/v1/*.schema.json` stubs exist (7 files: citation, search_hit, answer, ingest_report, doc_summary, chunk_inspection, doctor)
- [ ] `docs/spec/` stubs exist linking to the frozen design (one file per: domain-model, ids, canonical-document, chunk-policy, citation-policy, module-boundaries, ai-generation-guidelines)
- [ ] No imports outside Allowed dependencies (CI deny check)
- [ ] PR body links design §3, §4, §6, §7, §8, §9, §10
- [ ] PR body links design §3, §3.7b, §4, §6, §7, §8, §9, §10
## Out of scope

View File

@@ -7,14 +7,14 @@ status: planned
depends_on: [p0-1]
unblocks: [p1-4]
contract_source: ../../docs/superpowers/specs/2026-04-27-kb-final-form-design.md
contract_sections: [§3.4 Block, §3.4 SourceSpan, §0 Q3 citation]
contract_sections: [§3.4 Block, §3.4 SourceSpan, §3.7b kb-parse-types, §0 Q3 citation]
---
# p1-3 — Markdown body → Block tree
## Goal
Parse Markdown body bytes into a flat `Vec<ParsedBlock>` (intermediate, crate-private) with heading paths and line ranges preserved, ready for `kb-normalize` to lift into `CanonicalDocument`.
Parse Markdown body bytes into a flat `Vec<kb_parse_types::ParsedBlock>` with heading paths and line ranges preserved, ready for `kb-normalize` to lift into `CanonicalDocument`.
## Why now / why this size
@@ -23,6 +23,7 @@ This is the heaviest part of P1 parser. Separating it from frontmatter and from
## Allowed dependencies
- `kb-core`
- `kb-parse-types` (defines `ParsedBlock`, `ParsedPayload`, `Warning`)
- `pulldown-cmark` (CommonMark with source-map; GFM tables enabled via feature)
- `serde`
- `thiserror`
@@ -42,27 +43,30 @@ This is the heaviest part of P1 parser. Separating it from frontmatter and from
| output | type | downstream |
|--------|------|------------|
| `Vec<ParsedBlock>` (intermediate type, crate-private) | | `kb-normalize` |
| `Vec<Warning>` | | propagated into Provenance |
| `Vec<kb_parse_types::ParsedBlock>` | shared type from `kb-parse-types` | `kb-normalize` |
| `Vec<kb_parse_types::Warning>` | shared type | propagated into Provenance |
## Public surface (signatures only — no new types)
```rust
pub fn parse_blocks(body: &[u8], body_offset_lines: u32) -> anyhow::Result<(Vec<ParsedBlock>, Vec<Warning>)>;
pub fn parse_blocks(
body: &[u8],
body_offset_lines: u32,
) -> anyhow::Result<(Vec<kb_parse_types::ParsedBlock>, Vec<kb_parse_types::Warning>)>;
```
`ParsedBlock` is a crate-internal mirror that maps 1:1 to `kb_core::Block` variants once `kb-normalize` assigns `BlockId`s.
`ParsedBlock` is defined in `kb-parse-types` (design §3.7b). `kb-parse-md` does NOT define its own; it consumes the shared type. Lift to `kb_core::Block` (with `BlockId` assignment) is `kb-normalize`'s job (p1-4).
## Behavior contract
- Source-map: each `ParsedBlock` carries `SourceSpan::Line { start, end }` relative to the original file (i.e., add `body_offset_lines`).
- Heading tree: every block records its ancestor heading texts in order (e.g., `["아키텍처", "Chunking 정책"]`).
- Code blocks: language tag preserved (` ```rust ``Some("rust")`), fenced content not split.
- Tables: GFM tables produce `TableBlock` with header row + body rows; if a table cell is malformed, fall back to a `Paragraph` block + warning.
- Image references: `![alt](src)` produces `ImageRefBlock` with `asset_id = None`, `src = "..."`, `alt = "..."`. Resolution to `AssetId` happens later in `kb-normalize`.
- Lists: ordered/unordered preserved; nested list items flattened into one `ListBlock` with each top-level item's text.
- Inline elements: only `Text`, `Code`, `Link`, `Strong`, `Emph` (per design §3.4). Drop other inlines silently.
- Malformed input never panics. Worst case: empty `Vec<ParsedBlock>` + warning.
- Code blocks: `ParsedPayload::Code { lang: Some("rust"), code }` fenced content not split.
- Tables: GFM tables produce `ParsedPayload::Table { headers, rows }`; if a table cell is malformed, fall back to `ParsedPayload::Paragraph` + `Warning::MalformedTable`.
- Image references: `![alt](src)` produces `ParsedPayload::ImageRef { src, alt }`. `AssetId` resolution happens later in `kb-normalize` (when image src can be matched to a workspace asset).
- Lists: ordered/unordered preserved via `ParsedPayload::List { ordered, items }`; nested list items flattened so each `items[i]` is a `Vec<kb_core::Inline>` for one top-level item.
- Inline elements: only `Text`, `Code`, `Link`, `Strong`, `Emph` (per `kb_core::Inline` per design §3.4). Drop other inlines silently.
- Malformed input never panics. Worst case: empty `Vec<ParsedBlock>` + `Warning::ExtractFailed`.
## Storage / wire effects
@@ -95,7 +99,7 @@ All tests under `cargo test -p kb-parse-md --lib blocks`.
## Out of scope
- Frontmatter (p1-2).
- Lifting `ParsedBlock``kb_core::Block` with `BlockId` (p1-4).
- Lifting `kb_parse_types::ParsedBlock``kb_core::Block` with `BlockId` (p1-4 normalize).
- Chunking (p1-5).
## Risks / notes

View File

@@ -7,14 +7,14 @@ status: planned
depends_on: [p1-2, p1-3]
unblocks: [p1-5, p1-6]
contract_source: ../../docs/superpowers/specs/2026-04-27-kb-final-form-design.md
contract_sections: [§3.4, §4 ID recipe, §3.6 Provenance]
contract_sections: [§3.4, §3.7b kb-parse-types, §4 ID recipe, §3.6 Provenance, §8 module boundaries]
---
# p1-4 — Lift to CanonicalDocument
## Goal
Combine `Metadata` (p1-2) + `Vec<ParsedBlock>` (p1-3) + `RawAsset` (p1-1) into a `CanonicalDocument` with deterministic `doc_id` and `block_id`s per design §4 recipe.
Combine `Metadata` (p1-2) + `Vec<kb_parse_types::ParsedBlock>` (p1-3) + `RawAsset` (p1-1) into a `kb_core::CanonicalDocument` with deterministic `doc_id` and `block_id`s per design §4 recipe.
## Why now / why this size
@@ -23,6 +23,7 @@ Single responsibility: ID generation + struct assembly. Keeps `kb-parse-md` pure
## Allowed dependencies
- `kb-core`
- `kb-parse-types` (input shapes — `ParsedBlock`, `ParsedPayload`, `Warning`)
- `kb-config`
- `serde`
- `serde-json-canonicalizer` (canonical JSON for ID hashing)
@@ -33,17 +34,15 @@ Single responsibility: ID generation + struct assembly. Keeps `kb-parse-md` pure
## Forbidden dependencies
- `kb-source-fs`, `kb-parse-md` (consumed via plain types only — must not couple back), `kb-chunk`, `kb-store-*`, `kb-embed*`, `kb-search`, `kb-llm*`, `kb-rag`, `kb-tui`, `kb-desktop`
Note: this crate accepts `ParsedBlock` from `kb-parse-md` either by (a) exposing `ParsedBlock` as a `kb-core` type, or (b) `kb-parse-md` re-exporting via a public DTO. Pick (a): move `ParsedBlock` into `kb-core` so this task does not import `kb-parse-md`.
- `kb-source-fs`, `kb-parse-md` (consumed via shared `kb-parse-types` only — `kb-parse-md` must NOT appear in this crate's `cargo tree`), `kb-parse-pdf`, `kb-parse-image`, `kb-parse-audio`, `kb-chunk`, `kb-store-*`, `kb-embed*`, `kb-search`, `kb-llm*`, `kb-rag`, `kb-tui`, `kb-desktop`
## Inputs
| input | type | source |
|-------|------|--------|
| `RawAsset` | `kb_core::RawAsset` | p1-1 |
| `Metadata` + frontmatter span + warnings | from p1-2 | parser caller |
| `Vec<ParsedBlock>` + warnings | from p1-3 | parser caller |
| `Metadata` + frontmatter span + warnings | `(kb_core::Metadata, Option<FrontmatterSpan>, Vec<kb_parse_types::Warning>)` | p1-2 |
| `Vec<ParsedBlock>` + warnings | `(Vec<kb_parse_types::ParsedBlock>, Vec<kb_parse_types::Warning>)` | p1-3 |
| `parser_version` | `kb_core::ParserVersion` | constant in `kb-parse-md` |
## Outputs
@@ -58,9 +57,9 @@ Note: this crate accepts `ParsedBlock` from `kb-parse-md` either by (a) exposing
pub fn build_canonical_document(
asset: &kb_core::RawAsset,
metadata: kb_core::Metadata,
blocks: Vec<kb_core::ParsedBlock>,
blocks: Vec<kb_parse_types::ParsedBlock>,
parser_version: &kb_core::ParserVersion,
warnings: Vec<Warning>,
warnings: Vec<kb_parse_types::Warning>,
) -> anyhow::Result<kb_core::CanonicalDocument>;
pub fn id_for_doc(workspace_path: &kb_core::WorkspacePath, asset: &kb_core::AssetId, parser_version: &kb_core::ParserVersion) -> kb_core::DocumentId;
@@ -99,8 +98,8 @@ All tests under `cargo test -p kb-normalize`.
- [ ] `cargo check -p kb-normalize` passes
- [ ] `cargo test -p kb-normalize` passes
- [ ] Determinism test runs ≥ 1000 iterations under 1 second
- [ ] No `kb-parse-md` import (consumed via `kb-core::ParsedBlock`)
- [ ] PR links design §4.2, §4.3
- [ ] No `kb-parse-md` (or any other parser crate) appears in `cargo tree -p kb-normalize` — input types come from `kb-parse-types` only
- [ ] PR links design §3.7b, §4.2, §4.3, §8
## Out of scope

View File

@@ -17,6 +17,7 @@ compile 되는 Rust 2024 workspace 와 domain spec 확정. 이후 모든 phase
| crate | 역할 |
|-------|------|
| `kb-core` | domain type, trait, error, ID 규칙 |
| `kb-parse-types` | parser 중간 표현 (`ParsedBlock` 등) — `kb-core` 와 parsers/normalize 사이 thin layer (design §3.7b) |
| `kb-config` | config 로딩, 기본값, 경로 확장 |
| `kb-app` | facade. CLI/TUI/desktop 공통 진입점 |
| `kb-cli` | `kb` 바이너리 skeleton (`--help`만 동작) |
@@ -28,7 +29,7 @@ compile 되는 Rust 2024 workspace 와 domain spec 확정. 이후 모든 phase
```toml
[workspace]
resolver = "3"
members = ["crates/kb-core", "crates/kb-config", "crates/kb-app", "crates/kb-cli"]
members = ["crates/kb-core", "crates/kb-parse-types", "crates/kb-config", "crates/kb-app", "crates/kb-cli"]
[workspace.package]
edition = "2024"