Commit Graph

3 Commits

Author SHA1 Message Date
a86b463fc4 p1-2: scaffold kb-parse-md crate
Add the workspace member with the dep allow-list pinned by design §0 Q9
and the task spec. P1-2 will land the frontmatter submodule in the next
commit; P1-3 will add the block parser as a sibling.

Notable choice: serde_yaml (dtolnay) was archived as unmaintained in 2024
so we use serde_yaml_ng, the maintained fork. lingua's per-language
features are explicitly enabled (default-features=false) to keep build
time + binary size sane — only the languages we need at parse time.
2026-04-30 12:55:20 +00:00
7c75e10b2c p1-1: scaffold kb-source-fs crate (FsSourceConnector)
Walk config.workspace.root, apply gitignore-style filters
(config.workspace.exclude ∪ .kbignore ∪ baked-in defaults for
.DS_Store / ._*), stream BLAKE3 over each file, and emit a
deterministic Vec<RawAsset> sorted by workspace_path.

Modules:
  - hash:      streaming blake3::Hasher + 64 KiB read buffer (no whole-file
               loads); pinned digests for empty input and "hello world".
  - media:     extension → MediaType (markdown/pdf/image/audio/other).
  - walker:    ignore::OverrideBuilder for filter union; walkdir with
               manual visited-set cycle protection on top of follow_links.
  - connector: public FsSourceConnector::new(&Config) +
               SourceConnector::scan(&SourceScope) impl. Uses
               kb_core::to_posix for WorkspacePath construction (carries
               P0-1 # rejection through unchanged) and kb_core::id_for_asset
               for AssetId derivation. Storage variant signals intent only;
               actual byte copy is P1-6's responsibility.

Per design §3.3, §6.2, §6.6, §7.1, §7.2, §8.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 12:27:34 +00:00
f86df99fe9 p0-1: workspace + kb-core domain types, traits, and ID recipe
Stand up the Cargo workspace (Rust 2024 / resolver=3) with the kb-core
crate per the frozen design (§3, §4, §7, §10). kb-core has zero
deps on other kb-* crates and exposes:

- Newtype IDs (AssetId / DocumentId / BlockId / ChunkId / EmbeddingId /
  IndexId) with Display + FromStr that reject anything but 32 lower-hex.
- id_from + id_for_{asset,doc,block,chunk,embedding,index} per §4.2;
  pinned hex test values computed via an independent JCS+blake3 tool.
- CanonicalDocument, Block (8 variants), SourceSpan, Inline (§3.4).
- Citation (5 variants) with W3C Media Fragments to_uri / parse;
  round-trip property holds for every variant.
- Metadata + Provenance (§3.6); SearchQuery / SearchHit / RetrievalDetail
  (§3.7); DocFilter / DocSummary mirrors of wire §2.5.
- Answer / AnswerCitation / RefusalReason / ModelRef (§3.8).
- IngestReport, JobRepo support types, VectorRecord / VectorHit.
- Component traits (SourceConnector / Extractor / Chunker / Embedder /
  Retriever / LanguageModel / DocumentStore / VectorStore / JobRepo)
  plus their input helpers (SourceScope / ExtractContext / ChunkPolicy
  / EmbeddingInput / GenerateRequest / TokenChunk / FinishReason).
- CoreError (§10).
- nfc + to_posix helpers (§4.1, §6.6).

20 unit tests cover ID determinism (1000-run regression), key-order
invariance, two pinned hex values, newtype rejection of bad input,
Citation round-trip for all 5 variants, and to_posix collapsing +
Korean NFC.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 05:16:37 +00:00