Adds the new workspace member with the bare Chunker impl shape:
chunker_version() returns "md-heading-v1"; policy_hash() blake3-hashes
canonical JSON of ChunkPolicy and truncates to 16 hex chars; chunk()
is an empty stub the next commits fill in.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
M8: kb-parse-md frontmatter doc-comment claimed filename fallback was
P1-4's job; P1-4 spec did not include it. Reconcile: defer to a later
phase (P1-7 / kb-app integration) where the workspace_path filename is
known to the caller. Updated comment in build_metadata().
M9: kb-parse-md tests use the #[ignore] regenerator pattern, while
kb-normalize's integration test uses an UPDATE_SNAPSHOTS=1 env-var.
Migrating kb-parse-md is out of scope; one-line note added to
blocks_snapshots.rs mod doc-comment to flag the intentional split.
M11, M12: doc-only comments in lift_block (already added in the
previous commit) — list-item shared block_id rationale and the
intentional camel-case Debug-format for WarningKind in Provenance
notes.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
I1: warning_agent maps ExtractFailed → "kb-parse-md" (the panic-recovery
emitter in kb-parse-md/src/blocks.rs). Lift-stage warnings from
build_canonical_document are tracked separately and attributed to
"kb-normalize", so the I1 mapping change does not lie about
kb-normalize-originated drops.
I2: ParsedPayload::AudioRef no longer synthesizes Block::AudioRef with
an invalid empty AssetId (which would violate AssetId::from_str's 32-hex
invariant). The block is dropped and a Warning that mentions the src
surfaces in Provenance, attributed to kb-normalize (a lift-stage
decision). A TODO(P8) comment marks this as a placeholder until the
audio extractor lands.
I3: NFC-normalize each heading_path string in lift_block before feeding
into id_for_block AND into CommonBlock.heading_path. pulldown-cmark does
not NFC heading text and serde_json_canonicalizer v0.3 does not either,
so canonically-equivalent NFD/NFC inputs would produce different
block_ids without this normalization. Mirrors the existing doc_id NFC
handling via to_posix.
Minors:
- M4: trim Cargo.toml — drop kb-config, serde_json_canonicalizer,
blake3 (unused); keep tracing (now wired) + unicode-normalization
(now used by I3).
- M5: determinism_1000_iterations_under_1s now uses the same 5-block
fixture as block_ordinals_scoped_per_heading_and_kind (extracted into
fixture_blocks_five helper) so the determinism property is exercised
on a real lift_block path, not just an empty Vec. Still < 1s.
- M6: snapshot integration test now passes BodyHints { first_h1:
Some("Code And Table"), .. } and asserts doc.title == "Code And Table"
end-to-end. Baseline JSON updated.
- M7: title/lang edge-case unit tests pin policy: empty string lifts to
empty string; non-stringy values silently drop. Rustdoc updated.
- M10: provenance_contains_stage_events_in_order asserts events[1].at
== events[2].at to pin the shared-now_utc invariant.
New tests (unit, kb-normalize):
- provenance_with_extract_failed_warning_attributes_to_kb_parse_md (I1)
- audio_ref_block_skipped_with_warning (I2)
- nfc_nfd_korean_heading_path_same_block_id (I3)
- title_empty_string_in_user_map_falls_back_to_default (M7)
- title_non_string_in_user_map_silently_drops (M7)
- lang_invalid_shape_silently_drops (M7)
kb-normalize unit tests: 9 → 14. Integration snapshot: 1 (unchanged).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add the integration snapshot test pinning the full `CanonicalDocument`
JSON for `fixtures/markdown/code-and-table.md` (run through the real
`kb-parse-md::parse_frontmatter` + `parse_blocks`, dev-dep only).
Non-deterministic `provenance.events[*].at` for the Parsed and
Normalized events is stripped before comparison; the Discovered
event's `at` is pinned by constructing the test `RawAsset` with a
fixed `discovered_at`. Run with `UPDATE_SNAPSHOTS=1` to regenerate.
Add the 1000-iteration determinism property: same inputs ⇒ byte-
identical JSON (modulo the same stripped timestamps), in under one
second of wall-clock time. A regression in canonical JSON, BLAKE3
hashing, ordinal counting, or any other deterministic field would
surface here immediately.
The integration test depends on `kb-parse-md` only as a dev-dep, so
`cargo tree -p kb-normalize --depth 1 --edges normal` confirms no
parser implementation appears in the production dep tree per design
§8.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Build a `Provenance` with one event per pipeline stage (`Discovered`
sourced from `RawAsset.discovered_at`, then `Parsed` and `Normalized`
stamped with one shared `now_utc()` reading), plus one `Warning` event
per upstream warning. Sharing `now` between Parsed and Normalized
bounds intra-call timestamp jitter — event ordering is preserved by
`Vec` position regardless. Warning agents are routed back to the
upstream component (`kb-parse-md` for parse warnings,
`kb-normalize` for `ExtractFailed`).
Lift `metadata.user["title"]` and `metadata.user["lang"]` (where P1-2
stashes them since the `Metadata` struct itself does not carry those
fields) into `CanonicalDocument.title` / `CanonicalDocument.lang`.
Both keys are removed from the user map after lifting so the wire
form does not duplicate the data; missing keys default to empty
string / empty `Lang`. Other user-map keys survive.
Tests pin the event ordering, the warning routing, and the lift
behavior (including non-duplication in `metadata.user`).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Implement the §4.3 ordinal rule and §3.4 block lift. Each `ParsedBlock`
maps to a `kb_core::Block` variant carrying a `CommonBlock` whose
`block_id = id_for_block(doc_id, payload_kind, heading_path, ordinal,
source_span)`. Ordinals are scoped to `(heading_path, payload_kind)`,
0-based, in document order — three paragraphs under one H1 get 0/1/2,
a code block under the same H1 starts fresh at 0, a paragraph under a
different H1 also starts at 0.
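
A minimal sketch of the ordinal scoping described above, using only the
standard library. `OrdinalCounter` and `assign` are hypothetical names for
illustration, not the crate's actual types:

```rust
use std::collections::HashMap;

/// Hypothetical counter: one running ordinal per (heading_path, payload_kind) scope.
#[derive(Default)]
struct OrdinalCounter {
    next: HashMap<(Vec<String>, String), u32>,
}

impl OrdinalCounter {
    /// Returns the 0-based ordinal for the next block in this scope,
    /// then advances the counter for that scope only.
    fn assign(&mut self, heading_path: &[String], payload_kind: &str) -> u32 {
        let key = (heading_path.to_vec(), payload_kind.to_string());
        let slot = self.next.entry(key).or_insert(0);
        let ordinal = *slot;
        *slot += 1;
        ordinal
    }
}
```

Calling `assign` in document order yields 0/1/2 for three paragraphs under
one heading path, while a code block under the same path, or a paragraph
under a different path, starts again at 0.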
`payload_kind` is the lowercase-no-spaces convention from §4.2:
"heading", "paragraph", "list", "code", "table", "quote", "imageref",
"audioref".
`ListBlock.items` re-uses the parent list's `CommonBlock` per §3.4 (no
per-item BlockId is allocated). `AudioRefBlock` placeholder fields
(`asset_id`, `duration_ms`) are filled in by P8 — for now we synthesize
the minimal record so the document is well-typed.
Tests pin the four §4.4 ID properties (1000-iteration determinism, NFC
≡ NFD Korean path, `./a/b.md` ≡ `a/b.md`, ordinal grouping). Provenance
and title/lang lift land in the next commit.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add the workspace member, `Cargo.toml` with the §8-allowed dep set
(kb-core, kb-parse-types, kb-config, serde, serde_json_canonicalizer,
blake3, unicode-normalization, time, anyhow, tracing) and a stubbed
`build_canonical_document` that pins the public signature plus
`doc_id` derivation. `kb-parse-md` is permitted only as a *dev*-dep so
the integration snapshot test (added later in this series) can drive
a fixture through the real parser without violating the production
boundary — `cargo tree -p kb-normalize --depth 1 --edges normal`
confirms no parser implementation appears in the regular dep tree.
`id_for_doc` and `id_for_block` are re-exported from kb-core (which
holds the canonical recipe per §4.2); kb-normalize is the canonical
*entry point* per design §8.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Mechanical sweep over `Inline::Text(_)` / `Code(_)` / `Strong(_)` / `Emph(_)`
construction and match sites under the new struct-variant shape introduced
in the previous commit. `Inline::Link { text, href }` is unchanged.
The snapshot test in `tests/blocks_snapshots.rs` previously projected
`ParsedBlock` into a `BlockView`/`PayloadView` shim because the old
`Inline` could not serialize. With the schema fix in place we now
serialize `ParsedBlock` directly through serde — the shim and its
`flatten_inline` helper are removed. Inlines surface as structured
objects (`{"kind":"text","text":"…"}` etc.). Regenerated
`nested-headings.blocks.snapshot.json` to reflect the new shape via
the existing `--ignored` emitter; `code-and-table.blocks.snapshot.json`
has no inlines and is unchanged.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`#[serde(tag = "kind")]` rejects newtype variants whose payload is not a
struct, so 4 of 5 `Inline` variants (`Text(String)`, `Code(String)`,
`Strong(Vec<…>)`, `Emph(Vec<…>)`) failed to serialize at runtime — only
`Link { text, href }` worked. Convert every variant to struct form so the
internally-tagged shape is well-formed and round-trips through JSON.
Add `inline_serde_round_trip` covering all five variants. Per design §9,
this is a wire-schema migration; no `docs/wire-schema/v1/*.json` change
required since `Inline` is not directly referenced there. Callers in
kb-parse-md follow in the next commit.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The quality reviewer named three specific input probes for the C1/C2/
C3 fixes. Encode each as a verbatim test so future regressions on
those exact inputs surface immediately:
- probe_overflow: parse_blocks(b"# h\nbody\n", u32::MAX) → empty +
Warning::ExtractFailed.
- probe_list_escape: list with embedded code block → single List
block, two items.
- probe_empty_heading: `# \n# Real\nbody\n` → body's heading_path is
`["Real"]`.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`ParsedPayload::Quote { text, inlines }` cannot represent block-level
children (lists, code, tables, images), so the BlockQuote end handler
silently drops them when assembling the Quote payload. This matches
§3.4 for now but is non-obvious and easy to regress without an
explicit pin.
Add a TODO(P1-future) comment near the Quote emission code and a
regression test (`quote_with_list_inside_drops_list`) that fixes the
current shape: a `> - item` blockquote produces a Quote with empty
text and empty inlines.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`[multi\nline](http://x)` produced `Inline::Link.text = "multiline"`
because the SoftBreak/HardBreak handler called `push_text(" ")` —
which updates `paragraph.text` and the inline buffer, but NOT the
open link frame's flattened text accumulator. Text events flowed
through `push_link_text`; line breaks didn't.
Add `push_link_text(" ")` alongside the existing `push_text(" ")` in
the break handler so a line break inside `[ ... ](href)` collapses
to a visible space rather than disappearing.
New tests:
- link_with_soft_break_preserves_space_in_text
- link_with_hard_break_preserves_space_in_text
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous block-level image detector scanned paragraph source bytes
for the literal `![alt](src)` shape. That was fragile in three ways:
- `![alt](src "title")` leaked the title into `src` (`src "title"`)
- `![alt](<src>)` kept the angle brackets verbatim
- `![]()` had undefined behavior
Replace the byte-scan with state on `Frame::Paragraph` that observes
the actual `Tag::Image` events from pulldown-cmark:
- `image_count` increments on each `Start(Tag::Image)` and `image_src`
captures `dest_url` (which already strips angle brackets and excludes
the title).
- Text events seen while `image_depth > 0` are routed into `image_alt`
and suppressed from the inline buffer.
- Strong/Emph/Link starts and any non-image text outside the image
flag `non_image_text_seen`.
At `End(Paragraph)`, the paragraph is lifted to `ImageRef` iff
`image_count == 1 && !non_image_text_seen`. The byte-scanner
`match_block_image` is removed.
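
The lift rule can be sketched roughly as follows. The `Event` enum here is
a hypothetical, simplified stand-in for the pulldown-cmark event stream,
and `lift_image_ref` is an illustrative name; only the
`image_count == 1 && !non_image_text_seen` condition comes from the
description above:

```rust
/// Hypothetical, simplified stand-in for the parser's inline events.
enum Event<'a> {
    ImageStart { dest_url: &'a str },
    ImageEnd,
    Text(&'a str),
    /// Start of a Strong / Emph / Link span.
    FormattingStart,
}

#[derive(Default)]
struct ParagraphState {
    image_count: u32,
    image_depth: u32,
    image_src: String,
    image_alt: String,
    non_image_text_seen: bool,
}

/// Returns Some((src, alt)) when the paragraph is exactly one image and nothing else.
fn lift_image_ref(events: &[Event<'_>]) -> Option<(String, String)> {
    let mut st = ParagraphState::default();
    for ev in events {
        match ev {
            Event::ImageStart { dest_url } => {
                st.image_count += 1;
                st.image_depth += 1;
                st.image_src = dest_url.to_string();
            }
            Event::ImageEnd => st.image_depth = st.image_depth.saturating_sub(1),
            // Text inside an open image is alt text, not paragraph text.
            Event::Text(t) if st.image_depth > 0 => st.image_alt.push_str(t),
            Event::Text(_) | Event::FormattingStart => st.non_image_text_seen = true,
        }
    }
    if st.image_count == 1 && !st.non_image_text_seen {
        Some((st.image_src, st.image_alt))
    } else {
        None
    }
}
```

Because the source and alt text come from events rather than raw bytes,
an empty image yields an empty pair instead of depending on how a byte
scanner happens to match.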
New tests:
- image_with_title_attribute (title dropped, no leak into src)
- image_with_angle_bracketed_url (brackets stripped)
- empty_image_alt_and_src (`![]()` pins to empty/empty)
Existing image tests (`image_ref_block_captures_src_and_alt`,
`inline_image_inside_paragraph_is_dropped`) continue to pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`# ` (a heading with no following text) used to seed the heading
stack with `Some("")`, which then propagated into every child block's
`heading_path` as a `""` segment — visibly polluting the path that
downstream consumers index by.
Filter empty entries from both `heading_path()` and the in-line
ancestor collection at heading-end. We deliberately keep `Some("")`
in the stack rather than skipping the assignment so the slot remains
occupied and a subsequent deeper heading is still positioned
correctly relative to its level — only the visible path drops the
empty.
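
A rough sketch of the "keep the slot, hide the segment" idea, assuming a
fixed six-slot stack indexed by heading level. `HeadingStack` and `set`
are hypothetical names, not the parser's actual frame types:

```rust
/// Hypothetical heading stack: slot `i` holds the open heading at level `i + 1`.
struct HeadingStack {
    slots: Vec<Option<String>>,
}

impl HeadingStack {
    fn new() -> Self {
        Self { slots: vec![None; 6] }
    }

    /// Records a heading at `level` (1..=6) and clears all deeper levels.
    /// An empty heading still occupies its slot as `Some("")`.
    fn set(&mut self, level: usize, text: &str) {
        assert!((1..=6).contains(&level), "heading level must be 1..=6");
        self.slots[level - 1] = Some(text.to_string());
        for slot in &mut self.slots[level..] {
            *slot = None;
        }
    }

    /// Visible path: occupied slots in level order, with empty segments dropped.
    fn heading_path(&self) -> Vec<String> {
        self.slots
            .iter()
            .flatten()
            .filter(|s| !s.is_empty())
            .cloned()
            .collect()
    }
}
```

With this shape, an empty H1 followed by an H2 named `Real` gives a
visible path of `["Real"]`, while the H1 slot stays occupied for level
bookkeeping.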
New tests:
- empty_heading_does_not_pollute_path
- empty_h1_then_h2_does_not_break_stack
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`emit_block` previously walked the frame stack looking only for a Quote
container, falling back to top-level on miss. That caused any block
emitted inside a list item — code blocks, images, tables, headings —
to escape the list and appear at the top of `blocks`, after the
entire list and out of source order. `ParsedPayload::List { items:
Vec<Vec<Inline>> }` cannot represent a child block structurally, so
the choice is between dropping content and flattening.
Extend the reverse-walk to also recognize `Frame::ListItem` and route
the block into a textual rendering appended to the item's inline
buffer (`flatten_block_into_item`):
- Code → fenced text approximation, preserving lang hint + body
- Image → `![alt](src)` text
- Audio → `[audio](src)` text
- Heading → leading hashes + text
- Quote → `> text`
- Nested List → same rendering as `nested_in_item` flatten
- Table → pipe-table approximation
Document order is preserved because flattening happens inside the
item's frame, before the item closes.
New tests:
- code_block_inside_list_item_flattens_into_parent
- image_inside_list_item_flattens_into_parent
- block_content_in_list_preserves_document_order
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`span_for` previously used `u32 + u32` directly, so callers passing a
large `body_offset_lines` could either panic in debug builds (masked by
`catch_unwind`, which discards the entire body) or wrap in release
builds to an inverted span with `start > end`.
Switch to `checked_add`; on overflow flag the walk state and at the
end of `parse_blocks_inner` discard accumulated blocks and surface a
single `Warning::ExtractFailed` carrying the offending body line.
This degrades cleanly without panicking and without emitting a
silently-broken span.
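
The overflow guard can be sketched like this. `LineSpan` and this
simplified `span_for` signature are assumptions for illustration; only
the use of `checked_add` is taken from the description above:

```rust
/// Hypothetical simplified span in file-line coordinates (inclusive).
#[derive(Debug, PartialEq)]
struct LineSpan {
    start: u32,
    end: u32,
}

/// Shifts a body-relative line range by `body_offset_lines`.
/// Returns `None` if either bound would overflow `u32`, so the caller
/// can flag the walk instead of panicking or emitting a wrapped span.
fn span_for(body_start: u32, body_end: u32, body_offset_lines: u32) -> Option<LineSpan> {
    let start = body_start.checked_add(body_offset_lines)?;
    let end = body_end.checked_add(body_offset_lines)?;
    Some(LineSpan { start, end })
}
```

A caller would treat `None` as the signal to discard accumulated blocks
and emit the single `ExtractFailed` warning.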
Also extend `random_bytes_do_not_panic` to mix u32::MAX-style offsets
across the fuzz iterations so the overflow path is exercised by the
randomized corpus.
New tests:
- body_offset_lines_max_returns_extract_failed
- body_offset_lines_zero_at_max_minus_one_no_overflow
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Removes the leftover `let _ = level_u8;` and `let _ = header_count;`
discards. The Heading frame already carries the canonical level (we
populated it from `Tag::Heading` at Start), so we destructure that
directly and ignore the redundant `TagEnd::Heading(level)` payload.
The header_count helper was dead — Frame::Table tracks `cols`
internally and we never consumed `header_count`.
Adds two snapshot tests (`nested-headings.md`, `code-and-table.md`) under
crates/kb-parse-md/tests/blocks_snapshots.rs, with matching baseline JSON
next to each fixture. The snapshot view projects `kb_core::Inline` to
flat strings — `Inline` carries `serde(tag = "kind")` which is
incompatible with newtype variants holding a primitive (`Text(String)`),
so direct serialization of `ParsedBlock` would fail today. The view
preserves the contract that matters for P1-3 (heading paths, source
spans, payload kinds, payload text/code/table content) and will keep
working once kb-core fixes the Inline schema in a later task.
Also tightens `level_to_use >= 1 && level_to_use <= 6` into
`(1..=6).contains(&level_to_use)` to satisfy
`clippy::manual_range_contains`.
Implements `kb_parse_md::parse_blocks(body, body_offset_lines)` returning a
flat `Vec<ParsedBlock>` plus warnings. Walks pulldown-cmark events through a
small frame-based state machine that tracks heading paths, accumulates
inline buffers (Text/Code/Link/Strong/Emph only — design §3.4), and
reports SourceSpan::Line spans in 1-indexed file-line coordinates.
Covers headings, paragraphs, code blocks (lang from info string), GFM
tables (with malformed fallback to paragraph + MalformedTable warning),
lists (nested sub-lists flattened into parent item), and block-level image
references. Inline images are dropped silently per the inline filter.
Adversarial inputs are caught with `catch_unwind` and degrade to an empty
output + ExtractFailed warning.
15 unit tests cover heading-path correctness, code lang, table parsing,
malformed-table fallback (driven via synthetic events since pulldown-cmark
auto-normalizes table widths), LF/CRLF line-range parity, image refs,
nested-list flattening, inline filter, and 100-iteration random-bytes plus
hand-crafted adversarial-input no-panic guards.
M1: Reword the FrontmatterSpan doc-comment from "technically meant to be
crate-internal" to a forward-looking note about P1-3 / P1-4 callers using
bytes[span.end..] for body slicing.
M3: Add an explicit `# Errors` section to parse_frontmatter's rustdoc.
The current implementation never returns Err — all recoverable problems
are downgraded to warnings — but the Result is kept on the signature so
future hard-fail conditions can be added without breaking callers.
M4: Mention serde_yml in the library-choice rationale alongside
serde_yaml_ng, with a one-line note on why _ng was preferred (stricter
adherence to original serde_yaml semantics around null / tagged enums).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
C1: detect_delimiters now returns (DelimKind, FrontmatterSpan, Range<usize>)
where the inner range is the YAML/TOML payload byte range — derived in one
place rather than recomputed by the parser via fixed-width opening_len /
closing_len constants that wrongly assumed LF endings. CRLF input now parses
correctly end-to-end; the originally-failing reviewer probe
"---\r\ntitle: Doc\r\n---\r\nbody\r\n" now yields title="Doc" with no
warnings.
I1: Trailing horizontal whitespace (spaces / tabs) on either delimiter
line is now accepted, matching Hugo / Jekyll. Editors that auto-trim
trailing whitespace no longer silently break otherwise-valid frontmatter.
I2: A leading UTF-8 BOM (EF BB BF, byte 0 only) is tolerated and skipped
before delimiter scanning. The returned span.start accounts for the BOM
(=3) so callers using bytes[span.end..] for body slicing still get the
correct range without further bookkeeping. Mid-input BOMs are not stripped.
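
A minimal sketch of the leading-BOM handling, assuming a helper that
returns both the remaining bytes and the number of bytes skipped.
`skip_leading_bom` is a hypothetical name:

```rust
/// The UTF-8 byte order mark.
const UTF8_BOM: &[u8] = &[0xEF, 0xBB, 0xBF];

/// Skips a BOM only when it sits at byte 0.
/// Returns the remaining bytes and how many bytes were skipped (0 or 3),
/// so span offsets can be reported relative to the original input.
fn skip_leading_bom(input: &[u8]) -> (&[u8], usize) {
    match input.strip_prefix(UTF8_BOM) {
        Some(rest) => (rest, UTF8_BOM.len()),
        None => (input, 0),
    }
}
```

Adding the skipped count back onto any detected offsets keeps
`bytes[span.end..]` valid against the original buffer, and a BOM that
appears later in the input is left untouched.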
M2: Drop the now-dead DelimKind::opening_len / closing_len constants —
the inner range is encoded once at detection time.
12 new tests covering CRLF (YAML / TOML / mixed-EOL / end-to-end),
trailing whitespace on opener / closer / tabs, leading BOM (detection +
full pipeline), and mid-input BOM non-stripping.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Spec §"Behavior contract" line 74 says `id:` is captured into
`metadata.user_id_alias` only. Remove the redundant `user.insert`
that was also writing it into the user map, and update the snapshot
baseline accordingly.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Two markdown fixtures with hand-authored JSON baselines that pin the
§0 Q9 derive output across runs:
- frontmatter-only.md exercises the YAML happy path with most fields,
unknown keys, an `id:` field, and a non-UTC created_at (so the
baseline shows original_timestamps preservation).
- mixed-lang.md is body-only with no `lang:` field; baseline pins the
lingua autodetect result for our enabled language set.
A separate `emit_snapshots` test (marked `#[ignore]`) regenerates the
baselines from the current parser output. A determinism test parses
the fixture twice and asserts equality so any non-determinism (e.g.
key ordering, lingua nondeterminism) fails fast.
Implement the frontmatter submodule:
- detect_delimiters scans for a leading YAML (---) or TOML (+++) block at
byte 0. Strict per §0 Q9: no leading whitespace / BOM, no chars on the
delimiter line. Closing must be its own line. Unterminated → no FM.
- parse_raw deserializes into RawFrontmatter, a serde-flatten struct that
catches unknown keys into a serde_json::Map for verbatim preservation
in metadata.user.
- derive_metadata implements the §0 Q9 fallback chain:
    title        → frontmatter | BodyHints.first_h1 | (filename: caller)
    aliases/tags → frontmatter | []
    lang         → frontmatter | lingua autodetect on first 4 KB | hints
                   | "und"
    created_at   → frontmatter (RFC 3339, normalized to UTC) | fs_ctime
    updated_at   → frontmatter | fs_mtime
    source_type  → frontmatter | "markdown"
    trust_level  → frontmatter | "primary"
    id           → user_id_alias only, never a doc_id factor (§4.2)
- Non-UTC offsets are normalized to UTC; the original string is preserved
in user.original_timestamps[field] per §0 Q9.
- Warnings are emitted for: malformed YAML/TOML, unknown enum values,
malformed timestamps. Unknown keys are silent.
- lingua detector is cached in a OnceLock — first build is heavy.
- 15 unit tests cover every row of the derive table + delimiter edge
cases + an explicit pin that `id:` does not feed id_for_doc.
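
Two rows of the fallback table above, sketched with `Option` chaining.
The function names and parameters are hypothetical; the real
`derive_metadata` works on richer types and runs language detection
itself:

```rust
/// title: frontmatter, else the first H1 from body hints.
/// `None` means the caller must supply the filename fallback.
fn derive_title(frontmatter: Option<&str>, first_h1: Option<&str>) -> Option<String> {
    frontmatter.or(first_h1).map(str::to_string)
}

/// lang: frontmatter, else the autodetect result, else a caller hint, else "und".
/// `detected` stands in for the detector's output in this sketch.
fn derive_lang(frontmatter: Option<&str>, detected: Option<&str>, hint: Option<&str>) -> String {
    frontmatter.or(detected).or(hint).unwrap_or("und").to_string()
}
```

Each step only runs when every earlier source is absent, which is the
property the derive-table unit tests pin row by row.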
Add the workspace member with the dep allow-list pinned by design §0 Q9
and the task spec. P1-2 will land the frontmatter submodule in the next
commit; P1-3 will add the block parser as a sibling.
Notable choice: serde_yaml (dtolnay) was archived as unmaintained in 2024
so we use serde_yaml_ng, the maintained fork. lingua's per-language
features are explicitly enabled (default-features=false) to keep build
time + binary size sane — only the languages we need at parse time.
- walker.rs: document why we pick walkdir over ignore::WalkBuilder
(explicit canonical-path comparison for sibling-subtree symlinks).
- walker.rs: log canonicalize failures via tracing::debug! (was a silent
`Err(_) => continue`) so broken/permission-denied symlink targets are
observable at debug verbosity.
- connector.rs: TODO marker on the scope.include debug-log noting the
filter belongs at the extractor router (P1-2/P1-3).
- connector.rs: TODO marker on expand_tilde to hoist tilde + ${VAR}
expansion into a kb-config helper once available.
- connector.rs: comment on the .kbignore read documenting the
re-read-on-every-scan() contract.
- connector.rs test: tighten the `.kbignore`-itself ADR comment and
upgrade the assertion to actively pin "`.kbignore` IS emitted" instead
of "either is fine"; future drift will now fail the test.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Document AssetStorage::Copied / Reference path semantics so P1-6 (asset
writer) knows that at scan time `Copied.path` is the SOURCE path and the
writer is responsible for both copying bytes AND overwriting `path`
with the destination.
- Rename the dangling-symlink test to make its scope explicit
(`dangling_symlink_pseudo_cycle_does_not_crash`); the prior name implied
a real two-step directory cycle but the targets were broken links.
- Add `two_step_directory_cycle_visited_set_breaks_loop`: builds
`a/loop -> ../b` and `b/loop -> ../a` over real directories with real
files, asserting scan terminates with a finite, deterministic asset
list — exercises the canonical-path visited-set in walker::walk_files.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
fixtures/source-fs/tree-1/:
  README.md
  notes/alpha.md
  notes/beta.md
  ignored/skip.tmp   (excluded by .kbignore *.tmp)
  .kbignore          ("*.tmp")
  .DS_Store          (implicitly excluded by FsSourceConnector)
The committed baseline (tree-1.snapshot.json) has discovered_at,
source_uri.value, and stored.path replaced with "<stripped>" so the
JSON is portable across checkout locations and CI runs. The test
applies the same stripping to scan output before comparing.
The determinism test runs scan twice and asserts byte-identical
serialized JSON (post-strip) — same filesystem state must yield the
same Vec<RawAsset>.
Regenerate baseline with `KB_REGEN_SNAPSHOT=1 cargo test -p kb-source-fs
--test snapshot_tree1 -- tree_1_snapshot_matches_baseline`.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two cases:
- root/notes -> root (single-link cycle through workspace root).
- root/a -> b, root/b -> a (two-step cycle of dangling symlinks).
Both must complete within seconds and surface alpha.md exactly once.
Proves walker::walk_files's visited-set guard catches realistic cycle
shapes via the public SourceConnector API.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Walk config.workspace.root, apply gitignore-style filters
(config.workspace.exclude ∪ .kbignore ∪ baked-in defaults for
.DS_Store / ._*), stream BLAKE3 over each file, and emit a
deterministic Vec<RawAsset> sorted by workspace_path.
Modules:
- hash: streaming blake3::Hasher + 64 KiB read buffer (no whole-file
loads); pinned digests for empty input and "hello world".
- media: extension → MediaType (markdown/pdf/image/audio/other).
- walker: ignore::OverrideBuilder for filter union; walkdir with
manual visited-set cycle protection on top of follow_links.
- connector: public FsSourceConnector::new(&Config) +
SourceConnector::scan(&SourceScope) impl. Uses
kb_core::to_posix for WorkspacePath construction (carries
P0-1 # rejection through unchanged) and kb_core::id_for_asset
for AssetId derivation. Storage variant signals intent only;
actual byte copy is P1-6's responsibility.
Per design §3.3, §6.2, §6.6, §7.1, §7.2, §8.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Cargo.toml: remove `thiserror` from kb-config, kb-parse-types, kb-app
(unused — none of those crates' src trees reference thiserror; CoreError
in kb-core is the only consumer).
- kb-config keeps the `kb-core` dep with a one-line comment marking
CoreError reserved for P1-* config-error wiring per the review thread.
- ids.rs: switch `validate_hex32` from a hand-rolled `matches!` byte range
to `is_ascii_hexdigit()` so the hex check is the canonical idiom (and
satisfies `clippy::manual_is_ascii_check` under `-D warnings`).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- kb-config::apply_env now covers every leaf key in `Config` via an
explicit grep-friendly match block (one arm per leaf), keyed
`KB_<SECTION>_<KEY>`. Booleans flow through a shared `parse_bool` helper.
Numeric leaves silently keep their prior value on parse failure so a
malformed env entry can't crash startup.
- New tests: env_unknown_key_is_ignored,
env_overrides_chunking_target_tokens,
env_overrides_models_llm_endpoint_and_temperature,
env_overrides_indexing_watch_filesystem_bool.
- kb-app::logging::init now returns `Result<WorkerGuard>` instead of
`Result<Option<WorkerGuard>>` — the inner `Option` was always `Some` so
the wrapper was dead. kb-cli/main.rs collapses the call from
`.ok().flatten()` to `.ok()`, preserving fail-soft semantics on logging
init.
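
The env-override behavior from the first bullet can be sketched as below.
This three-leaf `Config`, the `apply_env_entry` function, the accepted
boolean spellings, and the `KB_INDEXING_WATCH_FILESYSTEM` key name are
all assumptions for illustration; keeping the prior value on a failed
boolean parse is also an assumption, since the source states that rule
only for numeric leaves:

```rust
/// Hypothetical three-leaf slice of the config.
#[derive(Debug, PartialEq)]
struct Config {
    score_gate: f32,
    default_k: usize,
    watch_filesystem: bool,
}

/// Assumed spellings; the commit only says booleans share one helper.
fn parse_bool(raw: &str) -> Option<bool> {
    match raw.trim().to_ascii_lowercase().as_str() {
        "1" | "true" | "yes" | "on" => Some(true),
        "0" | "false" | "no" | "off" => Some(false),
        _ => None,
    }
}

/// One explicit arm per leaf; malformed values keep the prior setting,
/// and unknown keys are ignored.
fn apply_env_entry(cfg: &mut Config, key: &str, value: &str) {
    match key {
        "KB_RAG_SCORE_GATE" => {
            if let Ok(v) = value.parse::<f32>() {
                cfg.score_gate = v;
            }
        }
        "KB_SEARCH_DEFAULT_K" => {
            if let Ok(v) = value.parse::<usize>() {
                cfg.default_k = v;
            }
        }
        "KB_INDEXING_WATCH_FILESYSTEM" => {
            if let Some(v) = parse_bool(value) {
                cfg.watch_filesystem = v;
            }
        }
        _ => {}
    }
}
```

Feeding each `(key, value)` pair from `std::env::vars()` through such a
function would apply overrides without letting a bad entry abort startup.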
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- WorkspacePath: add `WorkspacePath::new(s)` validating constructor that
rejects any string containing `#` (collides with the W3C-Media-Fragments
separator that Citation URIs depend on). Doc-comment on the type now
explains the invariant.
- normalize::to_posix changes signature to `Result<WorkspacePath, CoreError>`
and now flows through `WorkspacePath::new`, so a path with `#` is rejected
at construction rather than at every reader. Only one caller existed
outside tests, so the signature change is contained.
- Citation::parse uses `WorkspacePath::new` on the path side. With multiple
`#` separators in the input, `rsplit_once` would otherwise leave a `#` on
the path; the new constructor closes the hole.
- Citation::parse + parse_hms_ms: every `bail!` / `anyhow!` site now quotes
the offending substring (and the full input where useful) so error
messages identify what went wrong without re-deriving from context.
- New tests: `to_posix_rejects_hash_in_path`,
`parse_path_with_hash_rejected_at_to_posix_layer`. Existing
`to_posix(...).0` call sites updated for the Result signature.
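
A simplified sketch of the `#` invariant and why `rsplit_once` alone is
not enough. The `String` error type, the `split_citation` helper, and the
example fragment are hypothetical; the real code returns `CoreError` and
parses the fragment further:

```rust
/// Simplified stand-in for the validated path newtype.
#[derive(Debug, PartialEq)]
struct WorkspacePath(String);

impl WorkspacePath {
    /// Rejects any path containing '#', the fragment separator.
    fn new(s: &str) -> Result<Self, String> {
        if s.contains('#') {
            return Err(format!("workspace path must not contain '#': {s:?}"));
        }
        Ok(WorkspacePath(s.to_string()))
    }
}

/// Splits "path#fragment" at the LAST '#'. With several '#' in the input,
/// the left side still contains one, and the constructor rejects it.
fn split_citation(input: &str) -> Result<(WorkspacePath, &str), String> {
    let (path, fragment) = input
        .rsplit_once('#')
        .ok_or_else(|| format!("citation has no '#' fragment: {input:?}"))?;
    Ok((WorkspacePath::new(path)?, fragment))
}
```

Routing the path side through the constructor means the invariant is
checked once, at construction, rather than by every reader.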
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- ids.rs: validate_hex32 now accepts upper+lower hex; FromStr lowercases
the stored representation so equality and hashing stay canonical.
- Renamed test newtype_rejects_uppercase →
newtype_accepts_uppercase_normalizes_to_lowercase and added
newtype_rejects_invalid_chars_after_uppercase_pass.
- Added pinned-hex independence tests for id_for_block / id_for_chunk /
id_for_embedding / id_for_index. Each test also asserts that the bytes
serde-json-canonicalizer emits for the tuple match the literal JSON we
hashed externally with b3sum, so a future field-rename can't silently
drift the hash without flagging the JSON layer first.
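
The accept-then-normalize behavior from the first bullet, as a minimal
sketch. `HexId` is a hypothetical stand-in for the real ID newtypes, and
the `String` error type is an assumption:

```rust
use std::str::FromStr;

/// Hypothetical 32-hex-char identifier, stored lowercase.
#[derive(Debug, PartialEq, Eq, Hash)]
struct HexId(String);

impl FromStr for HexId {
    type Err = String;

    /// Accepts upper- or lowercase hex, but always stores lowercase so
    /// equality and hashing do not depend on the input's casing.
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        if s.len() != 32 || !s.bytes().all(|b| b.is_ascii_hexdigit()) {
            return Err(format!("expected 32 hex chars, got {s:?}"));
        }
        Ok(HexId(s.to_ascii_lowercase()))
    }
}
```

Two inputs that differ only in case therefore parse to equal values,
while a non-hex character is rejected regardless of the surrounding
casing.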
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three follow-ups from the code-quality review pass on P0-1:
- Re-export `IngestItemKind` from `kb-core` so downstream tasks
constructing `IngestItem` don't need `kb_core::ingest::IngestItemKind`.
- Document the `--json` wire-schema convention by introducing
`kb-cli/src/wire.rs` with `wire_*` helpers paralleling the existing
inline `wire_ingest`. Each Ok-path `--json` branch now routes through
these helpers so future P1-5/P3/P4/P5 implementations slot the
`schema_version` envelope in automatically. `DoctorReport` keeps its
struct-field `schema_version` (the documented exception), and the
helper round-trips it idempotently. Records the convention in
`kb-app/src/lib.rs`'s top docstring.
- Fix clippy `single_char_add_str` in `kb_core::normalize` (replace
`out.push_str(".")` with `out.push('.')`).
Verified: `cargo check`, `cargo test` (5 new wire-helper tests),
`cargo clippy -D warnings`, and `RUSTFLAGS=-D warnings cargo build` all
clean. Smoke-tested `kb doctor --json` still emits
`{"schema_version":"doctor.v1",...}`.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds the kb-app crate (§7) as the single facade between UI crates
(kb-cli / kb-tui / kb-desktop) and the rest of the workspace. Public
surface mirrors the task spec exactly:
- init_workspace(force) — XDG dir creation + config.toml seed; idempotent
unless force=true. Honors XDG envs and tilde-expands the workspace
root to $HOME/KnowledgeBase.
- doctor() — emits a doctor.v1 report with config_loaded +
data_dir_writable checks; downstream checks land in later phases.
- ingest / list_docs / inspect_doc / inspect_chunk / search / ask —
bail!("not yet wired (P<n>-<i>)") so kb-cli surfaces exit code 2
cleanly per §10.
- AskOpts + DoctorReport + DoctorCheck.
- doctor_signal::{DoctorUnhealthy, RefusalSignal, NoHitSignal} —
signal types the CLI downcasts on for §10 exit-code mapping.
- logging::init() — daily-rolling file appender at
$XDG_STATE_HOME/kb/logs/kb.log, plus stderr-fallback EnvFilter.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds the kb-config crate per design §6. Provides the frozen Config
schema (§6.4) with serde + toml round-trip, defaults() that exactly
match the reference values (e.g. score_gate=0.30, target_tokens=500,
embedding.dimensions=384, rrf_k=60), and XDG path resolvers that honor
XDG_CONFIG_HOME / XDG_DATA_HOME / XDG_CACHE_HOME / XDG_STATE_HOME.
Layer order in load(): defaults → file → env (KB_<SECTION>_<KEY>);
CLI overrides apply later in kb-cli. Env mapping covers the keys
needed by P0 smoke tests; the rest land as their config sections wire
up.
5 unit tests cover serde round-trip, defaults pinned to design,
KB_RAG_SCORE_GATE / KB_SEARCH_DEFAULT_K env override, and
XDG_CONFIG_HOME handling.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds the kb-parse-types crate per design §3.7b. Depends only on kb-core
+ serde/thiserror — never on parser libraries. Defines:
- ParsedBlock + ParsedBlockKind + ParsedPayload (8 variants matching
Block variants in kb-core).
- Warning + WarningKind for parser diagnostics.
- Forward-declared ParsedImageRegion / ParsedPdfPage / ParsedAudioSegment
shells for P6/P7/P8.
`cargo tree -p kb-parse-types` shows only kb-core, serde, and thiserror.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Address 8 issues found in spec audit (post PR #2):
1. §refs label: distinguish design vs report sections in p3-1 / p3-2 / p4-2 /
p9-1 / p9-5 contract_sections (e.g., "report §11.2 Ollama" not "§11.2").
2. mock feature gate: gate MockEmbedder (p3-1) and MockLanguageModel (p4-1)
behind `mock` cargo feature, default OFF; add CI symbol-scan as DoD item.
3. Warning type unification: p1-2 frontmatter now emits
`kb_parse_types::Warning` (matches p1-3 / p1-4); drops crate-internal type.
4. p4-3 streaming thread: explicitly single-threaded inside RagPipeline::ask;
collection + sink.send share the calling thread, no race. UI concurrency
is the caller's responsibility (TUI worker thread pattern in p9-3).
5. p6-2 tesseract version: noted that `tesseract` 0.13 has no stable Rust
`version()` accessor; use TessVersion FFI or shell-out + cache approach.
6. p9-* App struct extensions: introduce `kb_tui::{Library,Search,Ask,Inspect}State`
slots in p9-1 forward-decl form; p9-2/3/4 fill bodies in their own crate
without editing `App`. Parallel-safety contract added.
7. p3-3 cosine score: shift `(sim+1)/2` instead of clamp; preserve ranking
signal between unrelated and opposite vectors. Clamp reserved for NaN.
8. fixtures/ root: p0-1 DoD now creates all fixture subdirs with .gitkeep so
downstream tasks have a stable target path.
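
The score shift in item 7 can be sketched as follows. `cosine_to_score`
is a hypothetical name, and mapping NaN to 0.0 is an assumption about
what "clamp reserved for NaN" means in practice:

```rust
/// Maps cosine similarity in [-1, 1] onto a score in [0, 1] by shifting,
/// so opposite vectors (-1 -> 0.0) still rank below unrelated ones (0 -> 0.5).
/// Clamping negatives to 0 would collapse both cases to the same score.
fn cosine_to_score(sim: f32) -> f32 {
    if sim.is_nan() {
        // Assumption: NaN is treated as the lowest possible score.
        return 0.0;
    }
    (sim + 1.0) / 2.0
}
```

The mapping is strictly increasing, so it preserves the ranking order of
the raw similarities.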