feat(p0-1): workspace skeleton + frozen contracts #5

Merged
altair823 merged 11 commits from feat/p0-1-skeleton into main 2026-04-30 12:16:50 +00:00

11 Commits

Author SHA1 Message Date
d2c8728095 p0-1: address review (drop unused thiserror dep, document kb-core reserve)
- Cargo.toml: remove `thiserror` from kb-config, kb-parse-types, kb-app
  (unused — none of those crates' src trees reference thiserror; CoreError
  in kb-core is the only consumer).
- kb-config keeps the `kb-core` dep with a one-line comment marking
  CoreError reserved for P1-* config-error wiring per the review thread.
- ids.rs: switch `validate_hex32` from a hand-rolled `matches!` byte range
  to `is_ascii_hexdigit()` so the hex check is the canonical idiom (and
  satisfies `clippy::manual_is_ascii_check` under `-D warnings`).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 08:55:39 +00:00
d91b60325e p0-1: address review (apply_env full schema map, drop dead Option in logging::init)
- kb-config::apply_env now covers every leaf key in `Config` via an
  explicit grep-friendly match block (one arm per leaf), keyed
  `KB_<SECTION>_<KEY>`. Booleans flow through a shared `parse_bool` helper.
  Numeric leaves silently keep their prior value on parse failure so a
  malformed env entry can't crash startup.
- New tests: env_unknown_key_is_ignored,
  env_overrides_chunking_target_tokens,
  env_overrides_models_llm_endpoint_and_temperature,
  env_overrides_indexing_watch_filesystem_bool.
- kb-app::logging::init now returns `Result<WorkerGuard>` instead of
  `Result<Option<WorkerGuard>>` — the inner `Option` was always `Some` so
  the wrapper was dead. kb-cli/main.rs collapses the call from
  `.ok().flatten()` to `.ok()`, preserving fail-soft semantics on logging
  init.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 08:53:59 +00:00
ca0eb2f9cb p0-1: address review (reject # in WorkspacePath, parse error context)
- WorkspacePath: add `WorkspacePath::new(s)` validating constructor that
  rejects any string containing `#` (collides with the W3C-Media-Fragments
  separator that Citation URIs depend on). Doc-comment on the type now
  explains the invariant.
- normalize::to_posix changes signature to `Result<WorkspacePath, CoreError>`
  and now flows through `WorkspacePath::new`, so a path with `#` is rejected
  at construction rather than at every reader. Only one caller existed
  outside tests, so the signature change is contained.
- Citation::parse uses `WorkspacePath::new` on the path side. With multiple
  `#` separators in the input, `rsplit_once` would otherwise leave a `#` on
  the path; the new constructor closes the hole.
- Citation::parse + parse_hms_ms: every `bail!` / `anyhow!` site now quotes
  the offending substring (and the full input where useful) so error
  messages identify what went wrong without re-deriving from context.
- New tests: `to_posix_rejects_hash_in_path`,
  `parse_path_with_hash_rejected_at_to_posix_layer`. Existing
  `to_posix(...).0` call sites updated for the Result signature.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 08:52:15 +00:00
286e62734d p0-1: address review (lowercase-hex normalize, more pinned ID tests)
- ids.rs: validate_hex32 now accepts upper+lower hex; FromStr lowercases
  the stored representation so equality and hashing stay canonical.
- Renamed test newtype_rejects_uppercase →
  newtype_accepts_uppercase_normalizes_to_lowercase and added
  newtype_rejects_invalid_chars_after_uppercase_pass.
- Added pinned-hex independence tests for id_for_block / id_for_chunk /
  id_for_embedding / id_for_index. Each test also asserts that the bytes
  serde-json-canonicalizer emits for the tuple match the literal JSON we
  hashed externally with b3sum, so a future field-rename can't silently
  drift the hash without flagging the JSON layer first.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 08:49:48 +00:00
5af07c174d p0-1: address quality review (wire convention, IngestItemKind re-export, clippy)
Three follow-ups from the code-quality review pass on P0-1:

- Re-export `IngestItemKind` from `kb-core` so downstream tasks
  constructing `IngestItem` don't need `kb_core::ingest::IngestItemKind`.
- Document the `--json` wire-schema convention by introducing
  `kb-cli/src/wire.rs` with `wire_*` helpers paralleling the existing
  inline `wire_ingest`. Each Ok-path `--json` branch now routes through
  these helpers so future P1-5/P3/P4/P5 implementations slot the
  `schema_version` envelope in automatically. `DoctorReport` keeps its
  struct-field `schema_version` (the documented exception), and the
  helper round-trips it idempotently. Records the convention in
  `kb-app/src/lib.rs`'s top docstring.
- Fix clippy `single_char_add_str` in `kb_core::normalize` (replace
  `out.push_str(".")` with `out.push('.')`).

Verified: `cargo check`, `cargo test` (5 new wire-helper tests),
`cargo clippy -D warnings`, and `RUSTFLAGS=-D warnings cargo build` all
clean. Smoke-tested `kb doctor --json` still emits
`{"schema_version":"doctor.v1",...}`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 05:33:31 +00:00
a166b7051c p0-1: wire-schema stubs, doc/spec stubs, V001 migration, fixtures
- docs/wire-schema/v1/ ships 7 schema stubs (citation, search_hit,
  answer, ingest_report, doc_summary, chunk_inspection, doctor) that
  pin schema_version + required fields per design §2. Full property
  validation lands in later phases.
- docs/spec/ ships 7 markdown stubs each linking to the canonical
  frozen design (domain-model, ids, canonical-document, chunk-policy,
  citation-policy, module-boundaries, ai-generation-guidelines).
- migrations/V001__init.sql contains only schema_meta + migrations
  tables per design §5.1; data tables ship in P1-6/P2-1/P3-3.
- fixtures/ has the 11 subdirectories every downstream task references
  (markdown, source-fs, search/{lexical,hybrid}, embed, vector, rag,
  eval, image, pdf, audio). Empty subdirs use .gitkeep so they track.
  fixtures/markdown/ ships the 3 phase-0 fixtures: simple-note.md,
  nested-headings.md, code-and-table.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 05:17:32 +00:00
ec8a4ddb1b p0-1: kb-cli clap entry with §10 exit-code mapping
Adds the kb binary with clap v4 derive subcommands mapping 1:1 to
kb-app facade functions:

  init | ingest | list docs | inspect (doc|chunk) | search | ask
  | doctor | eval run

Global flags: --config, --verbose, --debug, --json. On --json, output
conforms to wire schema v1 (e.g. doctor.v1 emitted by kb-app::doctor).

Exit-code mapping per design §10:
  0 success
  1 RefusalSignal / NoHitSignal (kb ask refusal, kb search no-hit)
  2 any other anyhow::Error
  3 DoctorUnhealthy

Tracing initialized at startup with the file appender from kb-app.

Verified via:
  XDG_*=… cargo run -p kb-cli -- init    → idempotent
  XDG_*=… cargo run -p kb-cli -- doctor --json
    → {"schema_version":"doctor.v1","ok":true,…} exit 0
  XDG_*=… cargo run -p kb-cli -- doctor   (human form, ✓ marks)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 05:17:18 +00:00
237ada6e21 p0-1: kb-app facade stubs + tracing init helper
Adds the kb-app crate (§7) as the single facade between UI crates
(kb-cli / kb-tui / kb-desktop) and the rest of the workspace. Public
surface mirrors the task spec exactly:

- init_workspace(force) — XDG dir creation + config.toml seed; idempotent
  unless force=true. Honors XDG envs and tilde-expands the workspace
  root to $HOME/KnowledgeBase.
- doctor() — emits a doctor.v1 report with config_loaded +
  data_dir_writable checks; downstream checks land in later phases.
- ingest / list_docs / inspect_doc / inspect_chunk / search / ask —
  bail!("not yet wired (P<n>-<i>)") so kb-cli surfaces exit code 2
  cleanly per §10.
- AskOpts + DoctorReport + DoctorCheck.
- doctor_signal::{DoctorUnhealthy, RefusalSignal, NoHitSignal} —
  signal types the CLI downcasts on for §10 exit-code mapping.
- logging::init() — daily-rolling file appender at
  $XDG_STATE_HOME/kb/logs/kb.log, plus stderr-fallback EnvFilter.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 05:17:11 +00:00
76a860296e p0-1: kb-config schema + XDG path resolution
Adds the kb-config crate per design §6. Provides the frozen Config
schema (§6.4) with serde + toml round-trip, defaults() that exactly
match the reference values (e.g. score_gate=0.30, target_tokens=500,
embedding.dimensions=384, rrf_k=60), and XDG path resolvers that honor
XDG_CONFIG_HOME / XDG_DATA_HOME / XDG_CACHE_HOME / XDG_STATE_HOME.

Layer order in load(): defaults → file → env (KB_<SECTION>_<KEY>);
CLI overrides apply later in kb-cli. Env mapping covers the keys
needed by P0 smoke tests; the rest land as their config sections wire
up.

5 unit tests cover serde round-trip, defaults pinned to design,
KB_RAG_SCORE_GATE / KB_SEARCH_DEFAULT_K env override, and
XDG_CONFIG_HOME handling.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 05:16:57 +00:00
030986b37c p0-1: kb-parse-types thin parser-intermediate crate
Adds the kb-parse-types crate per design §3.7b. Depends only on kb-core
+ serde/thiserror — never on parser libraries. Defines:

- ParsedBlock + ParsedBlockKind + ParsedPayload (8 variants matching
  Block variants in kb-core).
- Warning + WarningKind for parser diagnostics.
- Forward-declared ParsedImageRegion / ParsedPdfPage / ParsedAudioSegment
  shells for P6/P7/P8.

`cargo tree -p kb-parse-types` shows only kb-core, serde, and thiserror.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 05:16:50 +00:00
f86df99fe9 p0-1: workspace + kb-core domain types, traits, and ID recipe
Stand up the Cargo workspace (Rust 2024 / resolver=3) with the kb-core
crate per the frozen design (§3, §4, §7, §10). kb-core has zero
deps on other kb-* crates and exposes:

- Newtype IDs (AssetId / DocumentId / BlockId / ChunkId / EmbeddingId /
  IndexId) with Display + FromStr that reject anything but 32 lower-hex.
- id_from + id_for_{asset,doc,block,chunk,embedding,index} per §4.2;
  pinned hex test values computed via an independent JCS+blake3 tool.
- CanonicalDocument, Block (8 variants), SourceSpan, Inline (§3.4).
- Citation (5 variants) with W3C Media Fragments to_uri / parse;
  round-trip property holds for every variant.
- Metadata + Provenance (§3.6); SearchQuery / SearchHit / RetrievalDetail
  (§3.7); DocFilter / DocSummary mirrors of wire §2.5.
- Answer / AnswerCitation / RefusalReason / ModelRef (§3.8).
- IngestReport, JobRepo support types, VectorRecord / VectorHit.
- Component traits (SourceConnector / Extractor / Chunker / Embedder /
  Retriever / LanguageModel / DocumentStore / VectorStore / JobRepo)
  plus their input helpers (SourceScope / ExtractContext / ChunkPolicy
  / EmbeddingInput / GenerateRequest / TokenChunk / FinishReason).
- CoreError (§10).
- nfc + to_posix helpers (§4.1, §6.6).

20 unit tests cover ID determinism (1000-run regression), key-order
invariance, two pinned hex values, newtype rejection of bad input,
Citation round-trip for all 5 variants, and to_posix collapsing +
Korean NFC.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 05:16:37 +00:00