Implement the frontmatter submodule:
- detect_delimiters scans for a leading YAML (---) or TOML (+++) block at
byte 0. Strict per §0 Q9: no leading whitespace / BOM, no chars on the
delimiter line. Closing must be its own line. Unterminated → no FM.
- parse_raw deserializes into RawFrontmatter, a serde-flatten struct that
catches unknown keys into a serde_json::Map for verbatim preservation
in metadata.user.
- derive_metadata implements the §0 Q9 fallback chain:
    title        → frontmatter | BodyHints.first_h1 | (filename: caller)
    aliases/tags → frontmatter | []
    lang         → frontmatter | lingua autodetect on first 4 KB | hints
                   | "und"
    created_at   → frontmatter (RFC 3339, normalized to UTC) | fs_ctime
    updated_at   → frontmatter | fs_mtime
    source_type  → frontmatter | "markdown"
    trust_level  → frontmatter | "primary"
    id           → user_id_alias only; never a doc_id factor (§4.2)
- Non-UTC offsets are normalized to UTC; the original string is preserved
in user.original_timestamps[field] per §0 Q9.
- Warnings are emitted for: malformed YAML/TOML, unknown enum values,
malformed timestamps. Unknown keys are silent.
- lingua detector is cached in a OnceLock — first build is heavy.
- 15 unit tests cover every row of the derive table + delimiter edge
cases + an explicit pin that `id:` does not feed id_for_doc.
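
The strict delimiter rules above can be sketched with std only. This is a minimal illustration, not the code in this commit: `FmKind`, `FmSpan`, and its field names are hypothetical, and treating only `\n` as a line ending (so a trailing `\r` counts as a stray character) is an assumption about how strict §0 Q9 is meant to be.

```rust
/// Which frontmatter syntax was found.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum FmKind {
    Yaml,
    Toml,
}

/// Byte offsets of a detected frontmatter block (hypothetical shape).
#[derive(Debug, PartialEq, Eq)]
pub struct FmSpan {
    pub kind: FmKind,
    /// First byte of the payload, just after the opening delimiter line.
    pub body_start: usize,
    /// End of the payload, i.e. start of the closing delimiter line.
    pub body_end: usize,
    /// First byte of the document body after the closing line.
    pub rest_start: usize,
}

pub fn detect_delimiters(src: &str) -> Option<FmSpan> {
    // The delimiter must sit at byte 0: a BOM or leading whitespace fails here.
    let (kind, delim) = if src.starts_with("---") {
        (FmKind::Yaml, "---")
    } else if src.starts_with("+++") {
        (FmKind::Toml, "+++")
    } else {
        return None;
    };

    // Nothing may follow the delimiter on its line.
    if !src[delim.len()..].starts_with('\n') {
        return None;
    }
    let body_start = delim.len() + 1;

    // The closing delimiter must be a line of its own.
    let mut offset = body_start;
    for line in src[body_start..].split_inclusive('\n') {
        let content = line.strip_suffix('\n').unwrap_or(line);
        if content == delim {
            return Some(FmSpan {
                kind,
                body_start,
                body_end: offset,
                rest_start: offset + line.len(),
            });
        }
        offset += line.len();
    }

    // Unterminated block: treat the document as having no frontmatter.
    None
}
```

Slicing `src[span.body_start..span.body_end]` yields the raw YAML/TOML text, and `src[span.rest_start..]` the Markdown body.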
Add the workspace member with the dep allow-list pinned by design §0 Q9
and the task spec. P1-2 will land the frontmatter submodule in the next
commit; P1-3 will add the block parser as a sibling.
Notable choice: serde_yaml (dtolnay) was archived as unmaintained in 2024
so we use serde_yaml_ng, the maintained fork. lingua's per-language
features are explicitly enabled (default-features=false) to keep build
time + binary size sane — only the languages we need at parse time.
- walker.rs: document why we pick walkdir over ignore::WalkBuilder
(explicit canonical-path comparison for sibling-subtree symlinks).
- walker.rs: log canonicalize failures via tracing::debug! (was a silent
`Err(_) => continue`) so broken/permission-denied symlink targets are
observable at debug verbosity.
- connector.rs: TODO marker on the scope.include debug-log noting the
filter belongs at the extractor router (P1-2/P1-3).
- connector.rs: TODO marker on expand_tilde to hoist tilde + ${VAR}
expansion into a kb-config helper once available.
- connector.rs: comment on the .kbignore read documenting the
re-read-on-every-scan() contract.
- connector.rs test: tighten the `.kbignore`-itself ADR comment and
upgrade the assertion to actively pin "`.kbignore` IS emitted" instead
of "either is fine"; future drift will now fail the test.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Document AssetStorage::Copied / Reference path semantics so P1-6 (asset
writer) knows that at scan time `Copied.path` is the SOURCE path and the
writer is responsible for both copying bytes AND overwriting `path`
with the destination.
- Rename the dangling-symlink test to make its scope explicit
(`dangling_symlink_pseudo_cycle_does_not_crash`); the prior name implied
a real two-step directory cycle but the targets were broken links.
- Add `two_step_directory_cycle_visited_set_breaks_loop`: builds
`a/loop -> ../b` and `b/loop -> ../a` over real directories with real
files, asserting scan terminates with a finite, deterministic asset
list — exercises the canonical-path visited-set in walker::walk_files.
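
For readers unfamiliar with the technique, a canonical-path visited set can be sketched with std alone. This is an illustrative stand-in, not the walkdir-based `walker::walk_files` in the crate: it has no ignore filters, and it skips canonicalize failures silently where the real walker logs them.

```rust
use std::collections::HashSet;
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// List regular files under `root`, following symlinks but entering each
/// canonical directory at most once, so symlink cycles terminate.
pub fn walk_files(root: &Path) -> io::Result<Vec<PathBuf>> {
    let mut visited: HashSet<PathBuf> = HashSet::new();
    let mut files = Vec::new();
    let mut stack = vec![root.to_path_buf()];

    while let Some(dir) = stack.pop() {
        // canonicalize resolves symlinks; broken or unreadable targets are
        // skipped here (the real walker reports them at debug verbosity).
        let canon = match fs::canonicalize(&dir) {
            Ok(p) => p,
            Err(_) => continue,
        };
        // `insert` returns false when this directory was already walked.
        if !visited.insert(canon.clone()) {
            continue;
        }
        for entry in fs::read_dir(&canon)? {
            let path = entry?.path();
            // fs::metadata follows symlinks, so a link to a directory is
            // queued like a directory and checked against `visited` later.
            match fs::metadata(&path) {
                Ok(m) if m.is_dir() => stack.push(path),
                Ok(m) if m.is_file() => files.push(path),
                _ => {} // dangling link or special file
            }
        }
    }

    // Sort so the same filesystem state always yields the same list.
    files.sort();
    Ok(files)
}
```

With `a/loop -> ../b` and `b/loop -> ../a`, each of `a` and `b` is entered once under its canonical path, whichever link reaches it first.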
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
fixtures/source-fs/tree-1/:
    README.md
    notes/alpha.md
    notes/beta.md
    ignored/skip.tmp   (excluded by .kbignore *.tmp)
    .kbignore          ("*.tmp")
    .DS_Store          (implicitly excluded by FsSourceConnector)
The committed baseline (tree-1.snapshot.json) has discovered_at,
source_uri.value, and stored.path replaced with "<stripped>" so the
JSON is portable across checkout locations and CI runs. The test
applies the same stripping to scan output before comparing.
The determinism test runs scan twice and asserts byte-identical
serialized JSON (post-strip) — same filesystem state must yield the
same Vec<RawAsset>.
Regenerate baseline with `KB_REGEN_SNAPSHOT=1 cargo test -p kb-source-fs
--test snapshot_tree1 -- tree_1_snapshot_matches_baseline`.
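
The stripping step can be pictured roughly as below. `AssetRecord` is a hypothetical, flattened stand-in for `RawAsset` (the real test works on serialized JSON, and `content_hash` is an assumed field name); only the three stripped fields and the "<stripped>" placeholder come from this commit.

```rust
/// Hypothetical, flattened stand-in for the real `RawAsset`.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct AssetRecord {
    pub workspace_path: String,
    pub content_hash: String,
    // Fields below vary with checkout location or wall-clock time.
    pub discovered_at: String,
    pub source_uri_value: String,
    pub stored_path: String,
}

pub const STRIPPED: &str = "<stripped>";

/// Replace machine- and time-dependent fields with a fixed placeholder so
/// two scans of the same tree compare equal regardless of where they ran.
pub fn strip_volatile(mut assets: Vec<AssetRecord>) -> Vec<AssetRecord> {
    for asset in &mut assets {
        asset.discovered_at = STRIPPED.to_string();
        asset.source_uri_value = STRIPPED.to_string();
        asset.stored_path = STRIPPED.to_string();
    }
    assets
}
```

Applying the same function to both the baseline and the fresh scan output keeps the comparison symmetric: stable fields such as the workspace path still have to match exactly.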
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two cases:
- root/notes -> root (single-link cycle through workspace root).
- root/a -> b, root/b -> a (two-step cycle of dangling symlinks).
Both must complete within seconds and surface alpha.md exactly once.
Proves walker::walk_files's visited-set guard catches realistic cycle
shapes via the public SourceConnector API.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Walk config.workspace.root, apply gitignore-style filters
(config.workspace.exclude ∪ .kbignore ∪ baked-in defaults for
.DS_Store / ._*), stream BLAKE3 over each file, and emit a
deterministic Vec<RawAsset> sorted by workspace_path.
Modules:
- hash: streaming blake3::Hasher + 64 KiB read buffer (no whole-file
loads); pinned digests for empty input and "hello world".
- media: extension → MediaType (markdown/pdf/image/audio/other).
- walker: ignore::OverrideBuilder for filter union; walkdir with
manual visited-set cycle protection on top of follow_links.
- connector: public FsSourceConnector::new(&Config) +
SourceConnector::scan(&SourceScope) impl. Uses
kb_core::to_posix for WorkspacePath construction (carries
P0-1 # rejection through unchanged) and kb_core::id_for_asset
for AssetId derivation. Storage variant signals intent only;
actual byte copy is P1-6's responsibility.
Per design §3.3, §6.2, §6.6, §7.1, §7.2, §8.
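
A rough sketch of the streaming read loop behind the hash module, written against std only. `stream_chunks` and the closure sink are illustrative names; in the real module the sink would presumably be `blake3::Hasher::update`, which is left out here to keep the example dependency-free.

```rust
use std::io::{self, Read};

/// Read buffer size: 64 KiB, as in the hash module description.
pub const BUF_SIZE: usize = 64 * 1024;

/// Feed `reader` to `sink` in chunks of at most `BUF_SIZE` bytes and return
/// the total number of bytes seen. The file is never held in memory whole.
pub fn stream_chunks<R: Read>(
    mut reader: R,
    mut sink: impl FnMut(&[u8]),
) -> io::Result<u64> {
    let mut buf = vec![0u8; BUF_SIZE];
    let mut total = 0u64;
    loop {
        let n = match reader.read(&mut buf) {
            Ok(0) => break, // end of input
            Ok(n) => n,
            // A read interrupted by a signal is safe to retry.
            Err(e) if e.kind() == io::ErrorKind::Interrupted => continue,
            Err(e) => return Err(e),
        };
        sink(&buf[..n]);
        total += n as u64;
    }
    Ok(total)
}
```

With the blake3 crate available, the call site would look something like `stream_chunks(file, |chunk| { hasher.update(chunk); })` followed by `hasher.finalize()`.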
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Cargo.toml: remove `thiserror` from kb-config, kb-parse-types, kb-app
(unused — none of those crates' src trees reference thiserror; CoreError
in kb-core is the only consumer).
- kb-config keeps the `kb-core` dep with a one-line comment marking
CoreError reserved for P1-* config-error wiring per the review thread.
- ids.rs: switch `validate_hex32` from a hand-rolled `matches!` byte range
to `is_ascii_hexdigit()` so the hex check is the canonical idiom (and
satisfies `clippy::manual_is_ascii_check` under `-D warnings`).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- kb-config::apply_env now covers every leaf key in `Config` via an
explicit grep-friendly match block (one arm per leaf), keyed
`KB_<SECTION>_<KEY>`. Booleans flow through a shared `parse_bool` helper.
Numeric leaves silently keep their prior value on parse failure so a
malformed env entry can't crash startup.
- New tests: env_unknown_key_is_ignored,
env_overrides_chunking_target_tokens,
env_overrides_models_llm_endpoint_and_temperature,
env_overrides_indexing_watch_filesystem_bool.
- kb-app::logging::init now returns `Result<WorkerGuard>` instead of
`Result<Option<WorkerGuard>>` — the inner `Option` was always `Some` so
the wrapper was dead. kb-cli/main.rs collapses the call from
`.ok().flatten()` to `.ok()`, preserving fail-soft semantics on logging
init.
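
The one-arm-per-leaf shape of `apply_env` might look like the sketch below. The three-field `Config`, the accepted boolean spellings, and the `watch_filesystem` default are assumptions for illustration; the key names follow the `KB_<SECTION>_<KEY>` convention and the test names listed above.

```rust
/// Hypothetical three-leaf config; the real `Config` has many more fields.
#[derive(Debug, Clone, PartialEq)]
pub struct Config {
    pub chunking_target_tokens: u32,
    pub rag_score_gate: f32,
    pub indexing_watch_filesystem: bool,
}

impl Default for Config {
    fn default() -> Self {
        Config {
            chunking_target_tokens: 500,
            rag_score_gate: 0.30,
            indexing_watch_filesystem: false, // assumed default
        }
    }
}

/// Shared boolean parser; the accepted spellings are an assumption.
fn parse_bool(raw: &str) -> Option<bool> {
    match raw.trim().to_ascii_lowercase().as_str() {
        "1" | "true" | "yes" | "on" => Some(true),
        "0" | "false" | "no" | "off" => Some(false),
        _ => None,
    }
}

/// Apply env-style overrides. Unknown keys are ignored, and a value that
/// fails to parse leaves the previous setting untouched.
pub fn apply_env<I>(cfg: &mut Config, vars: I)
where
    I: IntoIterator<Item = (String, String)>,
{
    for (key, value) in vars {
        match key.as_str() {
            "KB_CHUNKING_TARGET_TOKENS" => {
                if let Ok(v) = value.trim().parse::<u32>() {
                    cfg.chunking_target_tokens = v;
                }
            }
            "KB_RAG_SCORE_GATE" => {
                if let Ok(v) = value.trim().parse::<f32>() {
                    cfg.rag_score_gate = v;
                }
            }
            "KB_INDEXING_WATCH_FILESYSTEM" => {
                if let Some(v) = parse_bool(&value) {
                    cfg.indexing_watch_filesystem = v;
                }
            }
            _ => {} // unknown key: ignored
        }
    }
}
```

Taking an iterator rather than reading the process environment directly keeps the function testable; production code can pass `std::env::vars()`.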
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- WorkspacePath: add `WorkspacePath::new(s)` validating constructor that
rejects any string containing `#` (collides with the W3C-Media-Fragments
separator that Citation URIs depend on). Doc-comment on the type now
explains the invariant.
- normalize::to_posix changes signature to `Result<WorkspacePath, CoreError>`
and now flows through `WorkspacePath::new`, so a path with `#` is rejected
at construction rather than at every reader. Only one caller existed
outside tests, so the signature change is contained.
- Citation::parse uses `WorkspacePath::new` on the path side. With multiple
`#` separators in the input, `rsplit_once` would otherwise leave a `#` on
the path; the new constructor closes the hole.
- Citation::parse + parse_hms_ms: every `bail!` / `anyhow!` site now quotes
the offending substring (and the full input where useful) so error
messages identify what went wrong without re-deriving from context.
- New tests: `to_posix_rejects_hash_in_path`,
`parse_path_with_hash_rejected_at_to_posix_layer`. Existing
`to_posix(...).0` call sites updated for the Result signature.
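
To make the `rsplit_once` hole concrete, here is a minimal sketch. `HashInPath` stands in for `CoreError`, `split_citation` is a hypothetical reduction of `Citation::parse`, and the `t=10,20` fragment is only an example value.

```rust
/// Workspace-relative path that is guaranteed not to contain '#'.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct WorkspacePath(String);

/// Stand-in for the real `CoreError` variant; carries the rejected input.
#[derive(Debug, PartialEq, Eq)]
pub struct HashInPath(pub String);

impl WorkspacePath {
    /// Validating constructor: '#' would collide with the fragment
    /// separator used in citation URIs, so it is rejected up front.
    pub fn new(s: &str) -> Result<Self, HashInPath> {
        if s.contains('#') {
            Err(HashInPath(s.to_string()))
        } else {
            Ok(WorkspacePath(s.to_string()))
        }
    }

    pub fn as_str(&self) -> &str {
        &self.0
    }
}

/// Split `path#fragment` at the LAST '#'. With more than one '#' in the
/// input, the left side still contains one, which `new` then rejects.
pub fn split_citation(input: &str) -> Result<(WorkspacePath, &str), String> {
    let (path, fragment) = input
        .rsplit_once('#')
        .ok_or_else(|| format!("citation {input:?} has no '#' fragment"))?;
    let path = WorkspacePath::new(path)
        .map_err(|HashInPath(p)| format!("path {p:?} in citation {input:?} contains '#'"))?;
    Ok((path, fragment))
}
```

For `notes/a#b.md#t=10`, `rsplit_once` yields `notes/a#b.md` as the path side, and the constructor turns that into an error that quotes both the path and the full input.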
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- ids.rs: validate_hex32 now accepts upper+lower hex; FromStr lowercases
the stored representation so equality and hashing stay canonical.
- Renamed test newtype_rejects_uppercase →
newtype_accepts_uppercase_normalizes_to_lowercase and added
newtype_rejects_invalid_chars_after_uppercase_pass.
- Added pinned-hex independence tests for id_for_block / id_for_chunk /
id_for_embedding / id_for_index. Each test also asserts that the bytes
serde-json-canonicalizer emits for the tuple match the literal JSON we
hashed externally with b3sum, so a future field-rename can't silently
drift the hash without flagging the JSON layer first.
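
The accept-then-normalize behaviour can be illustrated as follows. `DocId` is a hypothetical newtype, and reading "hex32" as exactly 32 hex characters is an assumption; the source does not spell out the length.

```rust
use std::str::FromStr;

/// True when `s` is exactly 32 ASCII hex digits, upper or lower case.
/// (The length of 32 characters is an assumption about what "hex32" means.)
pub fn validate_hex32(s: &str) -> bool {
    s.len() == 32 && s.bytes().all(|b| b.is_ascii_hexdigit())
}

/// Hypothetical id newtype; always stores lowercase hex.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct DocId(String);

impl FromStr for DocId {
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        if validate_hex32(s) {
            // Lowercase on the way in so equality and hashing see a single
            // canonical form, whatever case the caller supplied.
            Ok(DocId(s.to_ascii_lowercase()))
        } else {
            Err(format!("invalid id {s:?}: expected 32 hex characters"))
        }
    }
}
```

Because normalization happens in `FromStr`, two ids that differ only in case compare equal and land in the same `HashMap` bucket.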
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three follow-ups from the code-quality review pass on P0-1:
- Re-export `IngestItemKind` from `kb-core` so downstream tasks
constructing `IngestItem` don't need `kb_core::ingest::IngestItemKind`.
- Document the `--json` wire-schema convention by introducing
`kb-cli/src/wire.rs` with `wire_*` helpers paralleling the existing
inline `wire_ingest`. Each Ok-path `--json` branch now routes through
these helpers so future P1-5/P3/P4/P5 implementations slot the
`schema_version` envelope in automatically. `DoctorReport` keeps its
struct-field `schema_version` (the documented exception), and the
helper round-trips it idempotently. The convention is recorded in the
top-level doc comment of `kb-app/src/lib.rs`.
- Fix clippy `single_char_add_str` in `kb_core::normalize` (replace
`out.push_str(".")` with `out.push('.')`).
Verified: `cargo check`, `cargo test` (5 new wire-helper tests),
`cargo clippy -D warnings`, and `RUSTFLAGS=-D warnings cargo build` all
clean. Smoke-tested `kb doctor --json` still emits
`{"schema_version":"doctor.v1",...}`.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds the kb-app crate (§7) as the single facade between UI crates
(kb-cli / kb-tui / kb-desktop) and the rest of the workspace. Public
surface mirrors the task spec exactly:
- init_workspace(force) — XDG dir creation + config.toml seed; idempotent
unless force=true. Honors XDG envs and tilde-expands the workspace
root to $HOME/KnowledgeBase.
- doctor() — emits a doctor.v1 report with config_loaded +
data_dir_writable checks; downstream checks land in later phases.
- ingest / list_docs / inspect_doc / inspect_chunk / search / ask —
bail!("not yet wired (P<n>-<i>)") so kb-cli surfaces exit code 2
cleanly per §10.
- AskOpts + DoctorReport + DoctorCheck.
- doctor_signal::{DoctorUnhealthy, RefusalSignal, NoHitSignal} —
signal types the CLI downcasts on for §10 exit-code mapping.
- logging::init() — daily-rolling file appender at
$XDG_STATE_HOME/kb/logs/kb.log, plus stderr-fallback EnvFilter.
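
How a CLI might downcast on these signal types is sketched below with `std::error::Error` instead of anyhow, to stay dependency-free. Only exit code 2 for plain errors comes from this message; the other three codes are placeholders, since the real mapping lives in design §10.

```rust
use std::error::Error;
use std::fmt;

// Declares a unit-struct signal type that implements `Error`.
macro_rules! signal {
    ($name:ident, $msg:literal) => {
        #[derive(Debug)]
        pub struct $name;
        impl fmt::Display for $name {
            fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
                f.write_str($msg)
            }
        }
        impl Error for $name {}
    };
}

signal!(DoctorUnhealthy, "doctor reported an unhealthy check");
signal!(RefusalSignal, "answer refused");
signal!(NoHitSignal, "no search hits");

/// Generic failure, e.g. a "not yet wired" stub (from the commit text).
pub const EXIT_GENERIC: i32 = 2;
// Placeholder values: the real codes are defined by design §10.
pub const EXIT_DOCTOR_UNHEALTHY: i32 = 10;
pub const EXIT_REFUSAL: i32 = 11;
pub const EXIT_NO_HIT: i32 = 12;

/// Map an error to a process exit code by downcasting to known signals.
pub fn exit_code(err: &(dyn Error + 'static)) -> i32 {
    if err.downcast_ref::<DoctorUnhealthy>().is_some() {
        EXIT_DOCTOR_UNHEALTHY
    } else if err.downcast_ref::<RefusalSignal>().is_some() {
        EXIT_REFUSAL
    } else if err.downcast_ref::<NoHitSignal>().is_some() {
        EXIT_NO_HIT
    } else {
        EXIT_GENERIC
    }
}
```

Keeping the signals as distinct types means the facade never needs to know about exit codes; only the CLI layer does the mapping.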
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds the kb-config crate per design §6. Provides the frozen Config
schema (§6.4) with serde + toml round-trip, defaults() that exactly
match the reference values (e.g. score_gate=0.30, target_tokens=500,
embedding.dimensions=384, rrf_k=60), and XDG path resolvers that honor
XDG_CONFIG_HOME / XDG_DATA_HOME / XDG_CACHE_HOME / XDG_STATE_HOME.
Layer order in load(): defaults → file → env (KB_<SECTION>_<KEY>);
CLI overrides apply later in kb-cli. Env mapping covers the keys
needed by P0 smoke tests; the rest land as their config sections wire
up.
5 unit tests cover serde round-trip, defaults pinned to design,
KB_RAG_SCORE_GATE / KB_SEARCH_DEFAULT_K env override, and
XDG_CONFIG_HOME handling.
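
One way to picture the XDG_CONFIG_HOME resolver is shown below. It is a sketch under assumptions: the function name, the injected `get` lookup, and the `kb` subdirectory for config (inferred from the `kb` directory used for logs) are not confirmed by this commit. The `$HOME/.config` fallback is the XDG Base Directory default.

```rust
use std::path::PathBuf;

/// Resolve the config directory: `$XDG_CONFIG_HOME/kb` when the variable is
/// set and non-empty, else `$HOME/.config/kb`. Returns `None` when neither
/// variable is available.
///
/// `get` abstracts the environment lookup so tests need not mutate the
/// process environment.
pub fn config_dir(get: impl Fn(&str) -> Option<String>) -> Option<PathBuf> {
    let base = match get("XDG_CONFIG_HOME").filter(|v| !v.is_empty()) {
        Some(v) => PathBuf::from(v),
        None => PathBuf::from(get("HOME")?).join(".config"),
    };
    Some(base.join("kb"))
}
```

In production the lookup would be `config_dir(|k| std::env::var(k).ok())`; the data, cache, and state resolvers would follow the same pattern with their own variables and XDG defaults.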
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds the kb-parse-types crate per design §3.7b. Depends only on kb-core
+ serde/thiserror — never on parser libraries. Defines:
- ParsedBlock + ParsedBlockKind + ParsedPayload (8 variants matching
Block variants in kb-core).
- Warning + WarningKind for parser diagnostics.
- Forward-declared ParsedImageRegion / ParsedPdfPage / ParsedAudioSegment
shells for P6/P7/P8.
`cargo tree -p kb-parse-types` shows only kb-core, serde, and thiserror.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>