1125 Commits

Author SHA1 Message Date
8142449eb7 p1-5: scaffold kb-chunk crate with MdHeadingV1Chunker skeleton
Adds the new workspace member with the bare Chunker impl shape:
chunker_version() returns "md-heading-v1"; policy_hash() blake3-hashes
canonical JSON of ChunkPolicy and truncates to 16 hex chars; chunk()
is an empty stub the next commits fill in.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 16:27:42 +00:00
4665910370 Merge pull request 'feat(p1-4): kb-normalize + kb-core Inline schema hotfix' (#9) from feat/p1-4-normalize into main
Reviewed-on: altair823-org/kb#9
2026-04-30 16:23:16 +00:00
557275c04e p1-4: doc-only follow-ups for deferred review minors (M8, M9, M11, M12)
M8: kb-parse-md frontmatter doc-comment claimed filename fallback was
P1-4's job; P1-4 spec did not include it. Reconcile: defer to a later
phase (P1-7 / kb-app integration) where the workspace_path filename is
known to the caller. Updated comment in build_metadata().

M9: kb-parse-md tests use the #[ignore] regenerator pattern, while
kb-normalize's integration test uses an UPDATE_SNAPSHOTS=1 env-var.
Migrating kb-parse-md is out of scope; one-line note added to
blocks_snapshots.rs mod doc-comment to flag the intentional split.

M11, M12: doc-only comments in lift_block (already added in the
previous commit) — list-item shared block_id rationale and the
intentional camel-case Debug-format for WarningKind in Provenance
notes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 15:42:02 +00:00
e0df42984e p1-4: address review I1-I3 + minors (extract attribution, audio-ref skip, NFC heading_path)
I1: warning_agent maps ExtractFailed → "kb-parse-md" (the panic-recovery
emitter in kb-parse-md/src/blocks.rs). Lift-stage warnings from
build_canonical_document are tracked separately and attributed to
"kb-normalize", so the I1 mapping change does not lie about
kb-normalize-originated drops.

I2: ParsedPayload::AudioRef no longer synthesizes Block::AudioRef with
an invalid empty AssetId (would violate AssetId::from_str's 32-hex
invariant). Block is dropped, Warning surfaces in Provenance with src
mention, attributed to kb-normalize (lift-stage decision). TODO(P8)
comment marks this as a placeholder until the audio extractor lands.

I3: NFC-normalize each heading_path string in lift_block before feeding
into id_for_block AND into CommonBlock.heading_path. pulldown-cmark does
not NFC-normalize heading text, and neither does serde_json_canonicalizer v0.3,
so canonically-equivalent NFD/NFC inputs would produce different
block_ids without this normalization. Mirrors the existing doc_id NFC
handling via to_posix.

Minors:
- M4: trim Cargo.toml — drop kb-config, serde_json_canonicalizer,
  blake3 (unused); keep tracing (now wired) + unicode-normalization
  (now used by I3).
- M5: determinism_1000_iterations_under_1s now uses the same 5-block
  fixture as block_ordinals_scoped_per_heading_and_kind (extracted into
  fixture_blocks_five helper) so the determinism property is exercised
  on a real lift_block path, not just an empty Vec. Still < 1s.
- M6: snapshot integration test now passes BodyHints { first_h1:
  Some("Code And Table"), .. } and asserts doc.title == "Code And Table"
  end-to-end. Baseline JSON updated.
- M7: title/lang edge-case unit tests pin policy: empty string lifts to
  empty string; non-stringy values silently drop. Rustdoc updated.
- M10: provenance_contains_stage_events_in_order asserts events[1].at
  == events[2].at to pin the shared-now_utc invariant.

New tests (unit, kb-normalize):
- provenance_with_extract_failed_warning_attributes_to_kb_parse_md (I1)
- audio_ref_block_skipped_with_warning (I2)
- nfc_nfd_korean_heading_path_same_block_id (I3)
- title_empty_string_in_user_map_falls_back_to_default (M7)
- title_non_string_in_user_map_silently_drops (M7)
- lang_invalid_shape_silently_drops (M7)

kb-normalize unit tests: 9 → 14. Integration snapshot: 1 (unchanged).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 15:41:50 +00:00
5352bede5c p1-4: snapshot + determinism tests
Add the integration snapshot test pinning the full `CanonicalDocument`
JSON for `fixtures/markdown/code-and-table.md` (run through the real
`kb-parse-md::parse_frontmatter` + `parse_blocks`, dev-dep only).
Non-deterministic `provenance.events[*].at` for the Parsed and
Normalized events is stripped before comparison; the Discovered
event's `at` is pinned by constructing the test `RawAsset` with a
fixed `discovered_at`. Run with `UPDATE_SNAPSHOTS=1` to regenerate.

Add the 1000-iteration determinism property: same inputs ⇒ byte-
identical JSON (modulo the same stripped timestamps), in under one
second of wall-clock time. A regression in canonical JSON, BLAKE3
hashing, ordinal counting, or any other deterministic field would
surface here immediately.

The integration test depends on `kb-parse-md` only as a dev-dep, so
`cargo tree -p kb-normalize --depth 1 --edges normal` confirms no
parser implementation appears in the production dep tree per design
§8.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 15:20:18 +00:00
1cc0ba9f37 p1-4: provenance + title/lang lift
Build a `Provenance` with one event per pipeline stage (`Discovered`
sourced from `RawAsset.discovered_at`, then `Parsed` and `Normalized`
stamped with one shared `now_utc()` reading), plus one `Warning` event
per upstream warning. Sharing `now` between Parsed and Normalized
bounds intra-call timestamp jitter — event ordering is preserved by
`Vec` position regardless. Warning agents are routed back to the
upstream component (`kb-parse-md` for parse warnings,
`kb-normalize` for `ExtractFailed`).

Lift `metadata.user["title"]` and `metadata.user["lang"]` (where P1-2
stashes them since the `Metadata` struct itself does not carry those
fields) into `CanonicalDocument.title` / `CanonicalDocument.lang`.
Both keys are removed from the user map after lifting so the wire
form does not duplicate the data; missing keys default to empty
string / empty `Lang`. Other user-map keys survive.

Tests pin the event ordering, the warning routing, and the lift
behavior (including non-duplication in `metadata.user`).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 15:19:33 +00:00
fc05f3a2be p1-4: build_canonical_document core + ID assignment
Implement the §4.3 ordinal rule and §3.4 block lift. Each `ParsedBlock`
maps to a `kb_core::Block` variant carrying a `CommonBlock` whose
`block_id = id_for_block(doc_id, payload_kind, heading_path, ordinal,
source_span)`. Ordinals are scoped to `(heading_path, payload_kind)`,
0-based, in document order — three paragraphs under one H1 get 0/1/2,
a code block under the same H1 starts fresh at 0, a paragraph under a
different H1 also starts at 0.
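
The scoping rule above can be sketched with a per-key counter. This is a minimal illustration, not the crate's code: `assign_ordinals` and the `(heading_path, payload_kind)` tuple view of a block are hypothetical names chosen for the example.

```rust
use std::collections::HashMap;

/// Assigns 0-based ordinals scoped to (heading_path, payload_kind), in
/// document order. The tuple input is a hypothetical simplified view of
/// a parsed block, not the real `ParsedBlock` type.
fn assign_ordinals(blocks: &[(Vec<String>, &str)]) -> Vec<u32> {
    let mut counters: HashMap<(Vec<String>, String), u32> = HashMap::new();
    blocks
        .iter()
        .map(|(path, kind)| {
            // Each distinct (path, kind) pair gets its own counter starting at 0.
            let slot = counters
                .entry((path.clone(), (*kind).to_string()))
                .or_insert(0);
            let ordinal = *slot;
            *slot += 1;
            ordinal
        })
        .collect()
}
```

Three paragraphs and one code block under the same H1, followed by a paragraph under a different H1, come out as 0, 1, 2, 0, 0.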

`payload_kind` is the lowercase-no-spaces convention from §4.2:
"heading", "paragraph", "list", "code", "table", "quote", "imageref",
"audioref".

`ListBlock.items` re-uses the parent list's `CommonBlock` per §3.4 (no
per-item BlockId is allocated). `AudioRefBlock` placeholder fields
(`asset_id`, `duration_ms`) are filled in by P8 — for now we synthesize
the minimal record so the document is well-typed.

Tests pin the four §4.4 ID properties (1000-iteration determinism, NFC
≡ NFD Korean path, `./a/b.md` ≡ `a/b.md`, ordinal grouping). Provenance
and title/lang lift land in the next commit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 15:18:19 +00:00
c0096ce44b p1-4: scaffold kb-normalize crate
Add the workspace member, `Cargo.toml` with the §8-allowed dep set
(kb-core, kb-parse-types, kb-config, serde, serde_json_canonicalizer,
blake3, unicode-normalization, time, anyhow, tracing) and a stubbed
`build_canonical_document` that pins the public signature plus
`doc_id` derivation. `kb-parse-md` is permitted only as a *dev*-dep so
the integration snapshot test (added later in this series) can drive
a fixture through the real parser without violating the production
boundary — `cargo tree -p kb-normalize --depth 1 --edges normal`
confirms no parser implementation appears in the regular dep tree.

`id_for_doc` and `id_for_block` are re-exported from kb-core (which
holds the canonical recipe per §4.2); kb-normalize is the canonical
*entry point* per design §8.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 15:16:53 +00:00
cfccb3687d p1-3: update kb-parse-md callers + drop BlockView projection in snapshots
Mechanical sweep over `Inline::Text(_)` / `Code(_)` / `Strong(_)` / `Emph(_)`
construction and match sites under the new struct-variant shape introduced
in the previous commit. `Inline::Link { text, href }` is unchanged.

The snapshot test in `tests/blocks_snapshots.rs` previously projected
`ParsedBlock` into a `BlockView`/`PayloadView` shim because the old
`Inline` could not serialize. With the schema fix in place we now
serialize `ParsedBlock` directly through serde — the shim and its
`flatten_inline` helper are removed. Inlines surface as structured
objects (`{"kind":"text","text":"…"}` etc.). Regenerated
`nested-headings.blocks.snapshot.json` to reflect the new shape via
the existing `--ignored` emitter; `code-and-table.blocks.snapshot.json`
has no inlines and is unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 15:10:54 +00:00
606ce1cf66 kb-core: hotfix Inline serde schema (struct variants)
`#[serde(tag = "kind")]` rejects newtype variants whose payload is not a
struct, so 4 of 5 `Inline` variants (`Text(String)`, `Code(String)`,
`Strong(Vec<…>)`, `Emph(Vec<…>)`) failed to serialize at runtime — only
`Link { text, href }` worked. Convert every variant to struct form so the
internally-tagged shape is well-formed and round-trips through JSON.

Add `inline_serde_round_trip` covering all five variants. Per design §9,
this is a wire-schema migration; no `docs/wire-schema/v1/*.json` change
required since `Inline` is not directly referenced there. Callers in
kb-parse-md follow in the next commit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 15:10:40 +00:00
8ce44af95a Merge pull request 'feat(p1-3): kb-parse-md blocks (Markdown body → ParsedBlock tree)' (#8) from feat/p1-3-parse-md-blocks into main
Reviewed-on: altair823-org/kb#8
2026-04-30 15:03:24 +00:00
80123e9e27 p1-3: pin reviewer probe inputs as regression tests
The quality reviewer named three specific input probes for the C1/C2/
C3 fixes. Encode each as a verbatim test so future regressions on
those exact inputs surface immediately:

- probe_overflow: parse_blocks(b"# h\nbody\n", u32::MAX) → empty +
  Warning::ExtractFailed.
- probe_list_escape: list with embedded code block → single List
  block, two items.
- probe_empty_heading: `# \n# Real\nbody\n` → body's heading_path is
  `["Real"]`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 14:42:21 +00:00
2b6d9abc0f p1-3: doc-comment + test pin Quote drops non-text children
`ParsedPayload::Quote { text, inlines }` cannot represent block-level
children (lists, code, tables, images), so the BlockQuote end handler
silently drops them when assembling the Quote payload. This matches
§3.4 for now but is non-obvious and easy to regress without an
explicit pin.

Add a TODO(P1-future) comment near the Quote emission code and a
regression test (`quote_with_list_inside_drops_list`) that fixes the
current shape: a `> - item` blockquote produces a Quote with empty
text and empty inlines.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 14:41:47 +00:00
23ff4d68af p1-3: preserve whitespace in link text across SoftBreak/HardBreak
`[multi\nline](http://x)` produced `Inline::Link.text = "multiline"`
because the SoftBreak/HardBreak handler called `push_text(" ")` —
which updates `paragraph.text` and the inline buffer, but NOT the
open link frame's flattened text accumulator. Text events flowed
through `push_link_text`; line breaks didn't.

Add `push_link_text(" ")` alongside the existing `push_text(" ")` in
the break handler so a line break inside `[ ... ](href)` collapses
to a visible space rather than disappearing.

New tests:
- link_with_soft_break_preserves_space_in_text
- link_with_hard_break_preserves_space_in_text

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 14:40:42 +00:00
73040cab30 p1-3: capture image refs from pulldown-cmark Tag::Image events
The previous block-level image detector scanned paragraph source bytes
for the literal `![alt](src)` shape. That was fragile in three ways:

- `![alt](src "title")` leaked the title into `src` (`src "title"`)
- `![alt](<https://x.com/a b>)` kept the angle brackets verbatim
- `![]()` had undefined behavior

Replace the byte-scan with state on `Frame::Paragraph` that observes
the actual `Tag::Image` events from pulldown-cmark:

- `image_count` increments on each `Start(Tag::Image)` and `image_src`
  captures `dest_url` (which already strips angle brackets and excludes
  the title).
- Text events seen while `image_depth > 0` are routed into `image_alt`
  and suppressed from the inline buffer.
- Strong/Emph/Link starts and any non-image text outside the image
  flag `non_image_text_seen`.

At `End(Paragraph)`, the paragraph is lifted to `ImageRef` iff
`image_count == 1 && !non_image_text_seen`. The byte-scanner
`match_block_image` is removed.

New tests:
- image_with_title_attribute (title dropped, no leak into src)
- image_with_angle_bracketed_url (brackets stripped)
- empty_image_alt_and_src (`![]()` pins to empty/empty)

Existing image tests (`image_ref_block_captures_src_and_alt`,
`inline_image_inside_paragraph_is_dropped`) continue to pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 14:40:04 +00:00
d49dbc1926 p1-3: skip empty heading text when building heading_path
`# ` (a heading with no following text) used to seed the heading
stack with `Some("")`, which then propagated into every child block's
`heading_path` as a `""` segment — visibly polluting the path that
downstream consumers index by.

Filter empty entries from both `heading_path()` and the in-line
ancestor collection at heading-end. We deliberately keep `Some("")`
in the stack rather than skipping the assignment so the slot remains
occupied and a subsequent deeper heading is still positioned
correctly relative to its level — only the visible path drops the
empty.
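
A rough sketch of the idea, assuming a fixed six-slot stack indexed by heading level. `HeadingStack`, `enter`, and the slot layout are assumptions made for illustration; only the "keep the slot, filter the visible path" behavior comes from the description above.

```rust
/// Hypothetical heading stack: slot i holds the text of the open heading
/// at level i + 1. An empty heading still occupies its slot.
struct HeadingStack {
    slots: [Option<String>; 6],
}

impl HeadingStack {
    fn new() -> Self {
        HeadingStack { slots: Default::default() }
    }

    /// Record a heading at `level` (1..=6) and clear every deeper slot.
    fn enter(&mut self, level: usize, text: &str) {
        assert!((1..=6).contains(&level));
        // Stored even when `text` is empty, so the slot stays occupied.
        self.slots[level - 1] = Some(text.to_string());
        for slot in &mut self.slots[level..] {
            *slot = None;
        }
    }

    /// Visible path: occupied slots in level order, with empty texts dropped.
    fn heading_path(&self) -> Vec<String> {
        self.slots
            .iter()
            .flatten()
            .filter(|text| !text.is_empty())
            .cloned()
            .collect()
    }
}
```

With this shape, an empty H1 followed by an H2 named "Child" yields the path `["Child"]` while slot 0 remains `Some("")`.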

New tests:
- empty_heading_does_not_pollute_path
- empty_h1_then_h2_does_not_break_stack

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 14:37:40 +00:00
0050cf32ea p1-3: route block-level content inside list items into parent inline buffer
`emit_block` previously walked the frame stack looking only for a Quote
container, falling back to top-level on miss. That caused any block
emitted inside a list item — code blocks, images, tables, headings —
to escape the list and appear at the top of `blocks`, after the
entire list and out of source order. `ParsedPayload::List { items:
Vec<Vec<Inline>> }` cannot represent a child block structurally, so
the choice is between dropping content and flattening.

Extend the reverse-walk to also recognize `Frame::ListItem` and route
the block into a textual rendering appended to the item's inline
buffer (`flatten_block_into_item`):

- Code → fenced text approximation, preserving lang hint + body
- Image → `![alt](src)` text
- Audio → `[audio](src)` text
- Heading → leading hashes + text
- Quote → `> text`
- Nested List → same rendering as `nested_in_item` flatten
- Table → pipe-table approximation

Document order is preserved because flattening happens inside the
item's frame, before the item closes.

New tests:
- code_block_inside_list_item_flattens_into_parent
- image_inside_list_item_flattens_into_parent
- block_content_in_list_preserves_document_order

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 14:36:27 +00:00
de9164802b p1-3: fix span arithmetic overflow + body_offset_lines fuzz
`span_for` previously used `u32 + u32` directly, so a caller passing a
large `body_offset_lines` could either panic in debug builds (the panic
was masked by `catch_unwind` and the entire body discarded) or, in
release builds, wrap to an inverted span with `start > end`.

Switch to `checked_add`; on overflow flag the walk state and at the
end of `parse_blocks_inner` discard accumulated blocks and surface a
single `Warning::ExtractFailed` carrying the offending body line.
This degrades cleanly without panicking and without emitting a
silently-broken span.
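
The switch to `checked_add` might look roughly like the sketch below. `LineSpan` and this `span_for` signature are simplified stand-ins, not the crate's actual types; the caller is assumed to turn `None` into the single `Warning::ExtractFailed`.

```rust
/// Hypothetical stand-in for the span type: inclusive file-line coordinates.
#[derive(Debug, PartialEq)]
struct LineSpan {
    start: u32,
    end: u32,
}

/// Translate body-relative lines into file lines. Returns `None` on
/// overflow instead of panicking (debug) or wrapping (release).
fn span_for(body_start: u32, body_end: u32, body_offset_lines: u32) -> Option<LineSpan> {
    let start = body_start.checked_add(body_offset_lines)?;
    let end = body_end.checked_add(body_offset_lines)?;
    Some(LineSpan { start, end })
}
```

An offset of `u32::MAX` with any 1-indexed line returns `None`, while `u32::MAX - 1` with line 1 still fits exactly.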

Also extend `random_bytes_do_not_panic` to mix u32::MAX-style offsets
across the fuzz iterations so the overflow path is exercised by the
randomized corpus.

New tests:
- body_offset_lines_max_returns_extract_failed
- body_offset_lines_zero_at_max_minus_one_no_overflow

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 14:34:58 +00:00
2ac0f56105 p1-3: drop dead bindings in heading + table end handlers
Removes the leftover `let _ = level_u8;` and `let _ = header_count;`
discards. The Heading frame already carries the canonical level (we
populated it from `Tag::Heading` at Start), so we destructure that
directly and ignore the redundant `TagEnd::Heading(level)` payload.
The header_count helper was dead — Frame::Table tracks `cols`
internally and we never consumed `header_count`.
2026-04-30 14:18:54 +00:00
f604a381df p1-3: snapshot tests + clippy fix
Adds two snapshot tests (`nested-headings.md`, `code-and-table.md`) under
crates/kb-parse-md/tests/blocks_snapshots.rs, with matching baseline JSON
next to each fixture. The snapshot view projects `kb_core::Inline` to
flat strings — `Inline` carries `serde(tag = "kind")` which is
incompatible with newtype variants holding a primitive (`Text(String)`),
so direct serialization of `ParsedBlock` would fail today. The view
preserves the contract that matters for P1-3 (heading paths, source
spans, payload kinds, payload text/code/table content) and will keep
working once kb-core fixes the Inline schema in a later task.

Also tightens `level_to_use >= 1 && level_to_use <= 6` into
`(1..=6).contains(&level_to_use)` to satisfy
`clippy::manual_range_contains`.
2026-04-30 14:17:41 +00:00
4e7e9cad87 p1-3: add parse_blocks (pulldown-cmark walker) submodule
Implements `kb_parse_md::parse_blocks(body, body_offset_lines)` returning a
flat `Vec<ParsedBlock>` plus warnings. Walks pulldown-cmark events through a
small frame-based state machine that tracks heading paths, accumulates
inline buffers (Text/Code/Link/Strong/Emph only — design §3.4), and
reports SourceSpan::Line spans in 1-indexed file-line coordinates.

Covers headings, paragraphs, code blocks (lang from info string), GFM
tables (with malformed fallback to paragraph + MalformedTable warning),
lists (nested sub-lists flattened into parent item), and block-level image
references. Inline images are dropped silently per the inline filter.
Adversarial inputs are caught with `catch_unwind` and degrade to an empty
output + ExtractFailed warning.

15 unit tests cover heading-path correctness, code lang, table parsing,
malformed-table fallback (driven via synthetic events since pulldown-cmark
auto-normalizes table widths), LF/CRLF line-range parity, image refs,
nested-list flattening, inline filter, and 100-iteration random-bytes plus
hand-crafted adversarial-input no-panic guards.
2026-04-30 14:14:34 +00:00
ff37ea5927 Merge pull request 'feat(p1-2): kb-parse-md frontmatter (YAML/TOML → Metadata)' (#7) from feat/p1-2-parse-md-frontmatter into main
Reviewed-on: altair823-org/kb#7
2026-04-30 14:06:54 +00:00
5850bfcf7a p1-2: address review minors (FrontmatterSpan doc, parse_frontmatter rustdoc, YAML library note)
M1: Reword the FrontmatterSpan doc-comment from "technically meant to be
crate-internal" to a forward-looking note about P1-3 / P1-4 callers using
bytes[span.end..] for body slicing.

M3: Add an explicit `# Errors` section to parse_frontmatter's rustdoc.
The current implementation never returns Err — all recoverable problems
are downgraded to warnings — but the Result is kept on the signature so
future hard-fail conditions can be added without breaking callers.

M4: Mention serde_yml in the library-choice rationale alongside
serde_yaml_ng, with a one-line note on why _ng was preferred (stricter
adherence to original serde_yaml semantics around null / tagged enums).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 13:13:16 +00:00
6a4db624b6 p1-2: fix CRLF / trailing whitespace / BOM in frontmatter delimiter detection
C1: detect_delimiters now returns (DelimKind, FrontmatterSpan, Range<usize>)
where the inner range is the YAML/TOML payload byte range — derived in one
place rather than recomputed by the parser via fixed-width opening_len /
closing_len constants that wrongly assumed LF endings. CRLF input now parses
correctly end-to-end; the originally-failing reviewer probe
"---\r\ntitle: Doc\r\n---\r\nbody\r\n" now yields title="Doc" with no
warnings.

I1: Trailing horizontal whitespace (spaces / tabs) on either delimiter
line is now accepted, matching Hugo / Jekyll. Editors that auto-trim
trailing whitespace no longer silently break otherwise-valid frontmatter.

I2: A leading UTF-8 BOM (EF BB BF, byte 0 only) is tolerated and skipped
before delimiter scanning. The returned span.start accounts for the BOM
(=3) so callers using bytes[span.end..] for body slicing still get the
correct range without further bookkeeping. Mid-input BOMs are not stripped.

M2: Drop the now-dead DelimKind::opening_len / closing_len constants —
the inner range is encoded once at detection time.

12 new tests covering CRLF (YAML / TOML / mixed-EOL / end-to-end),
trailing whitespace on opener / closer / tabs, leading BOM (detection +
full pipeline), and mid-input BOM non-stripping.
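
As an illustration of the CRLF, trailing-whitespace, and BOM handling, here is a simplified YAML-only sketch. `detect_frontmatter` and `is_delimiter` are hypothetical names, and the return type is reduced to two byte ranges (span, inner) rather than the real `(DelimKind, FrontmatterSpan, Range<usize>)` tuple.

```rust
use std::ops::Range;

/// True when `line` (line ending already removed) is `---` followed only
/// by optional trailing spaces or tabs.
fn is_delimiter(line: &[u8]) -> bool {
    let trimmed_len = line
        .iter()
        .rposition(|b| *b != b' ' && *b != b'\t')
        .map_or(0, |i| i + 1);
    line[..trimmed_len] == b"---"[..]
}

/// Returns (span, inner). `span` runs from the end of an optional leading
/// BOM through the closing delimiter line, including its line ending;
/// `inner` is the payload between the two delimiter lines.
fn detect_frontmatter(bytes: &[u8]) -> Option<(Range<usize>, Range<usize>)> {
    // Tolerate a UTF-8 BOM at byte 0 only; mid-input BOMs are untouched.
    let start = if bytes.starts_with(&[0xEF, 0xBB, 0xBF]) { 3 } else { 0 };
    let mut pos = start;
    let mut inner_start = None;
    while pos < bytes.len() {
        // `next` is where the following line begins.
        let (line_end, next) = match bytes[pos..].iter().position(|b| *b == b'\n') {
            Some(i) => (pos + i, pos + i + 1),
            None => (bytes.len(), bytes.len()),
        };
        // Drop the CR of a CRLF ending before comparing.
        let content_end = if line_end > pos && bytes[line_end - 1] == b'\r' {
            line_end - 1
        } else {
            line_end
        };
        let is_delim = is_delimiter(&bytes[pos..content_end]);
        match inner_start {
            None if is_delim => inner_start = Some(next),
            None => return None, // first line is not an opener
            Some(s) if is_delim => return Some((start..next, s..pos)),
            Some(_) => {}
        }
        pos = next;
    }
    None // unterminated frontmatter
}
```

Because the inner range is computed from the actual line boundaries, no fixed-width opening or closing length is needed, and `bytes[span.end..]` gives the body for both LF and CRLF input.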

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 13:12:34 +00:00
1fab6b0207 p1-2: address spec review (drop user_id_alias mirror in user map)
Spec §"Behavior contract" line 74 says `id:` is captured into
`metadata.user_id_alias` only. Remove the redundant `user.insert`
that was also writing it into the user map, and update the snapshot
baseline accordingly.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-30 13:02:28 +00:00
42a7d53e5d p1-2: fixtures + snapshot tests for frontmatter parser
Two markdown fixtures with hand-authored JSON baselines that pin the
§0 Q9 derive output across runs:

- frontmatter-only.md exercises the YAML happy path with most fields,
  unknown keys, an `id:` field, and a non-UTC created_at (so the
  baseline shows original_timestamps preservation).
- mixed-lang.md is body-only with no `lang:` field; baseline pins the
  lingua autodetect result for our enabled language set.

A separate `emit_snapshots` test (marked `#[ignore]`) regenerates the
baselines from the current parser output. A determinism test parses
the fixture twice and asserts equality so any non-determinism (e.g.
key ordering, lingua nondeterminism) fails fast.
2026-04-30 12:56:19 +00:00
cc8f7dad3f p1-2: parse_frontmatter + §0 Q9 derive table
Implement the frontmatter submodule:

- detect_delimiters scans for a leading YAML (---) or TOML (+++) block at
  byte 0. Strict per §0 Q9: no leading whitespace / BOM, no chars on the
  delimiter line. Closing must be its own line. Unterminated → no FM.
- parse_raw deserializes into RawFrontmatter, a serde-flatten struct that
  catches unknown keys into a serde_json::Map for verbatim preservation
  in metadata.user.
- derive_metadata implements the §0 Q9 fallback chain:
    title       → frontmatter | BodyHints.first_h1 | (filename: caller)
    aliases/tags→ frontmatter | []
    lang        → frontmatter | lingua autodetect on first 4 KB | hints
                  | "und"
    created_at  → frontmatter (RFC 3339, normalized to UTC) | fs_ctime
    updated_at  → frontmatter | fs_mtime
    source_type → frontmatter | "markdown"
    trust_level → frontmatter | "primary"
    id          → user_id_alias only — never a doc_id factor (§4.2)
- Non-UTC offsets are normalized to UTC; the original string is preserved
  in user.original_timestamps[field] per §0 Q9.
- Warnings are emitted for: malformed YAML/TOML, unknown enum values,
  malformed timestamps. Unknown keys are silent.
- lingua detector is cached in a OnceLock — first build is heavy.
- 15 unit tests cover every row of the derive table + delimiter edge
  cases + an explicit pin that `id:` does not feed id_for_doc.
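
Two rows of the derive table can be sketched with `Option::or` chains. The `Sources` struct and its field names are hypothetical; language detection is represented as an already-computed optional value rather than a real lingua call.

```rust
/// Hypothetical inputs to the fallback chain; every field name is illustrative.
struct Sources<'a> {
    frontmatter_title: Option<&'a str>,
    first_h1: Option<&'a str>,
    frontmatter_lang: Option<&'a str>,
    detected_lang: Option<&'a str>,
    hint_lang: Option<&'a str>,
}

/// title: frontmatter, then first H1. `None` leaves the filename
/// fallback to the caller.
fn derive_title(s: &Sources) -> Option<String> {
    s.frontmatter_title.or(s.first_h1).map(str::to_string)
}

/// lang: frontmatter, then autodetect, then hints, then "und".
fn derive_lang(s: &Sources) -> String {
    s.frontmatter_lang
        .or(s.detected_lang)
        .or(s.hint_lang)
        .unwrap_or("und")
        .to_string()
}
```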
2026-04-30 12:56:02 +00:00
a86b463fc4 p1-2: scaffold kb-parse-md crate
Add the workspace member with the dep allow-list pinned by design §0 Q9
and the task spec. P1-2 will land the frontmatter submodule in the next
commit; P1-3 will add the block parser as a sibling.

Notable choice: serde_yaml (dtolnay) was archived as unmaintained in 2024
so we use serde_yaml_ng, the maintained fork. lingua's per-language
features are explicitly enabled (default-features=false) to keep build
time + binary size sane — only the languages we need at parse time.
2026-04-30 12:55:20 +00:00
69a0dbb79d Merge pull request 'feat(p1-1): kb-source-fs filesystem source connector' (#6) from feat/p1-1-source-fs into main
Reviewed-on: altair823-org/kb#6
2026-04-30 12:45:45 +00:00
967a6a62c5 p1-1: address review (walker module doc, TODO markers, .kbignore ADR)
- walker.rs: document why we pick walkdir over ignore::WalkBuilder
  (explicit canonical-path comparison for sibling-subtree symlinks).
- walker.rs: log canonicalize failures via tracing::debug! (was a silent
  `Err(_) => continue`) so broken/permission-denied symlink targets are
  observable at debug verbosity.
- connector.rs: TODO marker on the scope.include debug-log noting the
  filter belongs at the extractor router (P1-2/P1-3).
- connector.rs: TODO marker on expand_tilde to hoist tilde + ${VAR}
  expansion into a kb-config helper once available.
- connector.rs: comment on the .kbignore read documenting the
  re-read-on-every-scan() contract.
- connector.rs test: tighten the `.kbignore`-itself ADR comment and
  upgrade the assertion to actively pin "`.kbignore` IS emitted" instead
  of "either is fine"; future drift will now fail the test.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 12:42:25 +00:00
f8d00bdaf6 p1-1: address review (AssetStorage doc-comment, real directory-cycle test)
- Document AssetStorage::Copied / Reference path semantics so P1-6 (asset
  writer) knows that at scan time `Copied.path` is the SOURCE path and the
  writer is responsible for both copying bytes AND overwriting `path`
  with the destination.
- Rename the dangling-symlink test to make its scope explicit
  (`dangling_symlink_pseudo_cycle_does_not_crash`); the prior name implied
  a real two-step directory cycle but the targets were broken links.
- Add `two_step_directory_cycle_visited_set_breaks_loop`: builds
  `a/loop -> ../b` and `b/loop -> ../a` over real directories with real
  files, asserting scan terminates with a finite, deterministic asset
  list — exercises the canonical-path visited-set in walker::walk_files.
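
The visited-set guard can be illustrated without touching the filesystem. The sketch below models directories as an in-memory map keyed by canonical path, with symlinks already resolved into `children`; `Dir` and this `walk_files` signature are assumptions for the example and do not mirror the real walker.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// In-memory stand-in for one directory, keyed elsewhere by canonical path.
/// `children` holds canonical paths, i.e. symlinks are already resolved.
struct Dir {
    files: Vec<String>,
    children: Vec<String>,
}

/// Depth-first walk that visits each canonical directory at most once,
/// so a cycle such as a/loop -> ../b and b/loop -> ../a terminates.
fn walk_files(tree: &BTreeMap<String, Dir>, root: &str) -> Vec<String> {
    let mut visited: BTreeSet<String> = BTreeSet::new();
    let mut stack = vec![root.to_string()];
    let mut out = Vec::new();
    while let Some(dir) = stack.pop() {
        // `insert` returns false when the path was already seen.
        if !visited.insert(dir.clone()) {
            continue;
        }
        if let Some(entry) = tree.get(&dir) {
            for file in &entry.files {
                out.push(format!("{}/{}", dir, file));
            }
            stack.extend(entry.children.iter().cloned());
        }
    }
    out.sort(); // deterministic output order
    out
}
```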

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 12:42:12 +00:00
0699220d5d p1-1: tree-1 fixture + snapshot/determinism tests
fixtures/source-fs/tree-1/:
  README.md
  notes/alpha.md
  notes/beta.md
  ignored/skip.tmp        (excluded by .kbignore *.tmp)
  .kbignore               ("*.tmp")
  .DS_Store               (implicitly excluded by FsSourceConnector)

The committed baseline (tree-1.snapshot.json) has discovered_at,
source_uri.value, and stored.path replaced with "<stripped>" so the
JSON is portable across checkout locations and CI runs. The test
applies the same stripping to scan output before comparing.

The determinism test runs scan twice and asserts byte-identical
serialized JSON (post-strip) — same filesystem state must yield the
same Vec<RawAsset>.

Regenerate baseline with `KB_REGEN_SNAPSHOT=1 cargo test -p kb-source-fs
--test snapshot_tree1 -- tree_1_snapshot_matches_baseline`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 12:28:00 +00:00
3d4c485415 p1-1: integration test — symlink cycles do not loop
Two cases:
  - root/notes -> root  (single-link cycle through workspace root).
  - root/a -> b, root/b -> a  (two-step cycle of dangling symlinks).

Both must complete within seconds and surface alpha.md exactly once.
Proves walker::walk_files's visited-set guard catches realistic cycle
shapes via the public SourceConnector API.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 12:27:44 +00:00
7c75e10b2c p1-1: scaffold kb-source-fs crate (FsSourceConnector)
Walk config.workspace.root, apply gitignore-style filters
(config.workspace.exclude ∪ .kbignore ∪ baked-in defaults for
.DS_Store / ._*), stream BLAKE3 over each file, and emit a
deterministic Vec<RawAsset> sorted by workspace_path.

Modules:
  - hash:      streaming blake3::Hasher + 64 KiB read buffer (no whole-file
               loads); pinned digests for empty input and "hello world".
  - media:     extension → MediaType (markdown/pdf/image/audio/other).
  - walker:    ignore::OverrideBuilder for filter union; walkdir with
               manual visited-set cycle protection on top of follow_links.
  - connector: public FsSourceConnector::new(&Config) +
               SourceConnector::scan(&SourceScope) impl. Uses
               kb_core::to_posix for WorkspacePath construction (carries
               P0-1 # rejection through unchanged) and kb_core::id_for_asset
               for AssetId derivation. Storage variant signals intent only;
               actual byte copy is P1-6's responsibility.

Per design §3.3, §6.2, §6.6, §7.1, §7.2, §8.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 12:27:34 +00:00
98280c6249 Merge pull request 'feat(p0-1): workspace skeleton + frozen contracts' (#5) from feat/p0-1-skeleton into main
Reviewed-on: altair823-org/kb#5
2026-04-30 12:16:46 +00:00
d2c8728095 p0-1: address review (drop unused thiserror dep, document kb-core reserve)
- Cargo.toml: remove `thiserror` from kb-config, kb-parse-types, kb-app
  (unused — none of those crates' src trees reference thiserror; CoreError
  in kb-core is the only consumer).
- kb-config keeps the `kb-core` dep with a one-line comment marking
  CoreError reserved for P1-* config-error wiring per the review thread.
- ids.rs: switch `validate_hex32` from a hand-rolled `matches!` byte range
  to `is_ascii_hexdigit()` so the hex check is the canonical idiom (and
  satisfies `clippy::manual_is_ascii_check` under `-D warnings`).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 08:55:39 +00:00
d91b60325e p0-1: address review (apply_env full schema map, drop dead Option in logging::init)
- kb-config::apply_env now covers every leaf key in `Config` via an
  explicit grep-friendly match block (one arm per leaf), keyed
  `KB_<SECTION>_<KEY>`. Booleans flow through a shared `parse_bool` helper.
  Numeric leaves silently keep their prior value on parse failure so a
  malformed env entry can't crash startup.
- New tests: env_unknown_key_is_ignored,
  env_overrides_chunking_target_tokens,
  env_overrides_models_llm_endpoint_and_temperature,
  env_overrides_indexing_watch_filesystem_bool.
- kb-app::logging::init now returns `Result<WorkerGuard>` instead of
  `Result<Option<WorkerGuard>>` — the inner `Option` was always `Some` so
  the wrapper was dead. kb-cli/main.rs collapses the call from
  `.ok().flatten()` to `.ok()`, preserving fail-soft semantics on logging
  init.
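
A minimal sketch of the one-arm-per-leaf override, assuming a two-field `Config` and taking the environment as a map so it can be exercised without mutating process state. The accepted boolean spellings and the choice to keep the prior value on an unparseable boolean are assumptions; the description above only specifies that behavior for numeric leaves.

```rust
use std::collections::HashMap;

/// Hypothetical two-leaf slice of the config, for illustration only.
#[derive(Debug, PartialEq)]
struct Config {
    chunking_target_tokens: u32,
    indexing_watch_filesystem: bool,
}

/// Shared boolean parser. The accepted spellings are an assumption.
fn parse_bool(raw: &str) -> Option<bool> {
    match raw.trim().to_ascii_lowercase().as_str() {
        "1" | "true" | "yes" | "on" => Some(true),
        "0" | "false" | "no" | "off" => Some(false),
        _ => None,
    }
}

/// One explicit match arm per leaf, keyed KB_<SECTION>_<KEY>.
/// Unknown keys are ignored; malformed values keep the prior setting.
fn apply_env(cfg: &mut Config, env: &HashMap<String, String>) {
    for (key, value) in env {
        match key.as_str() {
            "KB_CHUNKING_TARGET_TOKENS" => {
                if let Ok(n) = value.trim().parse::<u32>() {
                    cfg.chunking_target_tokens = n;
                }
            }
            "KB_INDEXING_WATCH_FILESYSTEM" => {
                if let Some(b) = parse_bool(value) {
                    cfg.indexing_watch_filesystem = b;
                }
            }
            _ => {}
        }
    }
}
```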

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 08:53:59 +00:00
ca0eb2f9cb p0-1: address review (reject # in WorkspacePath, parse error context)
- WorkspacePath: add `WorkspacePath::new(s)` validating constructor that
  rejects any string containing `#` (collides with the W3C-Media-Fragments
  separator that Citation URIs depend on). Doc-comment on the type now
  explains the invariant.
- normalize::to_posix changes signature to `Result<WorkspacePath, CoreError>`
  and now flows through `WorkspacePath::new`, so a path with `#` is rejected
  at construction rather than at every reader. Only one caller existed
  outside tests, so the signature change is contained.
- Citation::parse uses `WorkspacePath::new` on the path side. With multiple
  `#` separators in the input, `rsplit_once` would otherwise leave a `#` on
  the path; the new constructor closes the hole.
- Citation::parse + parse_hms_ms: every `bail!` / `anyhow!` site now quotes
  the offending substring (and the full input where useful) so error
  messages identify what went wrong without re-deriving from context.
- New tests: `to_posix_rejects_hash_in_path`,
  `parse_path_with_hash_rejected_at_to_posix_layer`. Existing
  `to_posix(...).0` call sites updated for the Result signature.
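
To illustrate why the validating constructor closes the multiple-`#` hole, here is a small sketch. `PathError`, `as_str`, and `split_citation` are hypothetical names; the real code returns `CoreError` and parses the fragment into a `Citation` variant.

```rust
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct WorkspacePath(String);

/// Hypothetical error type standing in for `CoreError`.
#[derive(Debug, PartialEq, Eq)]
pub enum PathError {
    ContainsHash(String),
}

impl WorkspacePath {
    /// Rejects any path containing `#`, the fragment separator.
    pub fn new(s: &str) -> Result<Self, PathError> {
        if s.contains('#') {
            Err(PathError::ContainsHash(s.to_owned()))
        } else {
            Ok(Self(s.to_owned()))
        }
    }

    pub fn as_str(&self) -> &str {
        &self.0
    }
}

/// Splits at the LAST `#`. With several `#` in the input, the left
/// side still contains one, and `WorkspacePath::new` rejects it.
pub fn split_citation(input: &str) -> Result<(WorkspacePath, &str), PathError> {
    let (path, fragment) = input.rsplit_once('#').unwrap_or((input, ""));
    Ok((WorkspacePath::new(path)?, fragment))
}
```

With this shape, `notes/a.md#t=10` splits cleanly, while `notes/a#b.md#t=10` fails at construction instead of leaking a `#` into the stored path.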

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 08:52:15 +00:00
286e62734d p0-1: address review (lowercase-hex normalize, more pinned ID tests)
- ids.rs: validate_hex32 now accepts upper+lower hex; FromStr lowercases
  the stored representation so equality and hashing stay canonical.
- Renamed test newtype_rejects_uppercase →
  newtype_accepts_uppercase_normalizes_to_lowercase and added
  newtype_rejects_invalid_chars_after_uppercase_pass.
- Added pinned-hex independence tests for id_for_block / id_for_chunk /
  id_for_embedding / id_for_index. Each test also asserts that the bytes
  serde-json-canonicalizer emits for the tuple match the literal JSON we
  hashed externally with b3sum, so a future field-rename can't silently
  drift the hash without flagging the JSON layer first.
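
The accept-then-normalize behaviour can be sketched as follows. `DocumentId` is one of the newtypes named in the log, but the `InvalidId` error type and the exact impl shape here are assumptions.

```rust
use std::fmt;
use std::str::FromStr;

#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct DocumentId(String);

/// Hypothetical error type for illustration.
#[derive(Debug, PartialEq, Eq)]
pub struct InvalidId;

impl FromStr for DocumentId {
    type Err = InvalidId;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        // Accept upper and lower hex on input...
        if s.len() == 32 && s.bytes().all(|b| b.is_ascii_hexdigit()) {
            // ...but store lowercase so Eq and Hash stay canonical.
            Ok(Self(s.to_ascii_lowercase()))
        } else {
            Err(InvalidId)
        }
    }
}

impl fmt::Display for DocumentId {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        f.write_str(&self.0)
    }
}
```

Because normalization happens once in `FromStr`, derived `PartialEq` and `Hash` need no case-insensitive logic.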

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 08:49:48 +00:00
5af07c174d p0-1: address quality review (wire convention, IngestItemKind re-export, clippy)
Three follow-ups from the code-quality review pass on P0-1:

- Re-export `IngestItemKind` from `kb-core` so downstream tasks
  constructing `IngestItem` don't need `kb_core::ingest::IngestItemKind`.
- Document the `--json` wire-schema convention by introducing
  `kb-cli/src/wire.rs` with `wire_*` helpers paralleling the existing
  inline `wire_ingest`. Each Ok-path `--json` branch now routes through
  these helpers so future P1-5/P3/P4/P5 implementations slot the
  `schema_version` envelope in automatically. `DoctorReport` keeps its
  struct-field `schema_version` (the documented exception), and the
  helper round-trips it idempotently. Records the convention in
  `kb-app/src/lib.rs`'s top docstring.
- Fix clippy `single_char_add_str` in `kb_core::normalize` (replace
  `out.push_str(".")` with `out.push('.')`).

Verified: `cargo check`, `cargo test` (5 new wire-helper tests),
`cargo clippy -D warnings`, and `RUSTFLAGS=-D warnings cargo build` all
clean. Smoke-tested `kb doctor --json` still emits
`{"schema_version":"doctor.v1",...}`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 05:33:31 +00:00
a166b7051c p0-1: wire-schema stubs, doc/spec stubs, V001 migration, fixtures
- docs/wire-schema/v1/ ships 7 schema stubs (citation, search_hit,
  answer, ingest_report, doc_summary, chunk_inspection, doctor) that
  pin schema_version + required fields per design §2. Full property
  validation lands in later phases.
- docs/spec/ ships 7 markdown stubs each linking to the canonical
  frozen design (domain-model, ids, canonical-document, chunk-policy,
  citation-policy, module-boundaries, ai-generation-guidelines).
- migrations/V001__init.sql contains only schema_meta + migrations
  tables per design §5.1; data tables ship in P1-6/P2-1/P3-3.
- fixtures/ has the 11 subdirectories every downstream task references
  (markdown, source-fs, search/{lexical,hybrid}, embed, vector, rag,
  eval, image, pdf, audio). Empty subdirs use .gitkeep so they track.
  fixtures/markdown/ ships the 3 phase-0 fixtures: simple-note.md,
  nested-headings.md, code-and-table.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 05:17:32 +00:00
ec8a4ddb1b p0-1: kb-cli clap entry with §10 exit-code mapping
Adds the kb binary with clap v4 derive subcommands mapping 1:1 to
kb-app facade functions:

  init | ingest | list docs | inspect (doc|chunk) | search | ask
  | doctor | eval run

Global flags: --config, --verbose, --debug, --json. On --json, output
conforms to wire schema v1 (e.g. doctor.v1 emitted by kb-app::doctor).

Exit-code mapping per design §10:
  0 success
  1 RefusalSignal / NoHitSignal (kb ask refusal, kb search no-hit)
  2 any other anyhow::Error
  3 DoctorUnhealthy
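
A rough sketch of how the mapping above can be implemented by downcasting on signal types. The log says the CLI downcasts `anyhow::Error`; this version uses `Box<dyn std::error::Error>` as a dependency-free stand-in, and the `signal!` macro and `exit_code` function are hypothetical.

```rust
use std::error::Error;
use std::fmt;

// Declares a unit-struct signal type that implements `Error`.
macro_rules! signal {
    ($name:ident, $msg:literal) => {
        #[derive(Debug)]
        pub struct $name;
        impl fmt::Display for $name {
            fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
                f.write_str($msg)
            }
        }
        impl Error for $name {}
    };
}

signal!(RefusalSignal, "refused");
signal!(NoHitSignal, "no hits");
signal!(DoctorUnhealthy, "doctor unhealthy");

/// Maps a command result to a process exit code.
pub fn exit_code(result: &Result<(), Box<dyn Error>>) -> i32 {
    match result {
        Ok(()) => 0,
        Err(e) if e.is::<RefusalSignal>() || e.is::<NoHitSignal>() => 1,
        Err(e) if e.is::<DoctorUnhealthy>() => 3,
        Err(_) => 2, // any other error
    }
}
```

The signal arms are checked before the catch-all so that only unrecognized errors fall through to exit code 2.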

Tracing initialized at startup with the file appender from kb-app.

Verified via:
  XDG_*=… cargo run -p kb-cli -- init    → idempotent
  XDG_*=… cargo run -p kb-cli -- doctor --json
    → {"schema_version":"doctor.v1","ok":true,…} exit 0
  XDG_*=… cargo run -p kb-cli -- doctor   (human form, ✓ marks)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 05:17:18 +00:00
237ada6e21 p0-1: kb-app facade stubs + tracing init helper
Adds the kb-app crate (§7) as the single facade between UI crates
(kb-cli / kb-tui / kb-desktop) and the rest of the workspace. Public
surface mirrors the task spec exactly:

- init_workspace(force) — XDG dir creation + config.toml seed; idempotent
  unless force=true. Honors XDG envs and tilde-expands the workspace
  root to $HOME/KnowledgeBase.
- doctor() — emits a doctor.v1 report with config_loaded +
  data_dir_writable checks; downstream checks land in later phases.
- ingest / list_docs / inspect_doc / inspect_chunk / search / ask —
  bail!("not yet wired (P<n>-<i>)") so kb-cli surfaces exit code 2
  cleanly per §10.
- AskOpts + DoctorReport + DoctorCheck.
- doctor_signal::{DoctorUnhealthy, RefusalSignal, NoHitSignal} —
  signal types the CLI downcasts on for §10 exit-code mapping.
- logging::init() — daily-rolling file appender at
  $XDG_STATE_HOME/kb/logs/kb.log, plus stderr-fallback EnvFilter.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 05:17:11 +00:00
76a860296e p0-1: kb-config schema + XDG path resolution
Adds the kb-config crate per design §6. Provides the frozen Config
schema (§6.4) with serde + toml round-trip, defaults() that exactly
match the reference values (e.g. score_gate=0.30, target_tokens=500,
embedding.dimensions=384, rrf_k=60), and XDG path resolvers that honor
XDG_CONFIG_HOME / XDG_DATA_HOME / XDG_CACHE_HOME / XDG_STATE_HOME.

Layer order in load(): defaults → file → env (KB_<SECTION>_<KEY>);
CLI overrides apply later in kb-cli. Env mapping covers the keys
needed by P0 smoke tests; the rest land as their config sections wire
up.

5 unit tests cover serde round-trip, defaults pinned to design,
KB_RAG_SCORE_GATE / KB_SEARCH_DEFAULT_K env override, and
XDG_CONFIG_HOME handling.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 05:16:57 +00:00
030986b37c p0-1: kb-parse-types thin parser-intermediate crate
Adds the kb-parse-types crate per design §3.7b. Depends only on kb-core
+ serde/thiserror — never on parser libraries. Defines:

- ParsedBlock + ParsedBlockKind + ParsedPayload (8 variants matching
  Block variants in kb-core).
- Warning + WarningKind for parser diagnostics.
- Forward-declared ParsedImageRegion / ParsedPdfPage / ParsedAudioSegment
  shells for P6/P7/P8.

`cargo tree -p kb-parse-types` shows only kb-core, serde, and thiserror.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 05:16:50 +00:00
f86df99fe9 p0-1: workspace + kb-core domain types, traits, and ID recipe
Stand up the Cargo workspace (Rust 2024 / resolver=3) with the kb-core
crate per the frozen design (§3, §4, §7, §10). kb-core has zero
deps on other kb-* crates and exposes:

- Newtype IDs (AssetId / DocumentId / BlockId / ChunkId / EmbeddingId /
  IndexId) with Display + FromStr that reject anything but 32 lower-hex.
- id_from + id_for_{asset,doc,block,chunk,embedding,index} per §4.2;
  pinned hex test values computed via an independent JCS+blake3 tool.
- CanonicalDocument, Block (8 variants), SourceSpan, Inline (§3.4).
- Citation (5 variants) with W3C Media Fragments to_uri / parse;
  round-trip property holds for every variant.
- Metadata + Provenance (§3.6); SearchQuery / SearchHit / RetrievalDetail
  (§3.7); DocFilter / DocSummary mirrors of wire §2.5.
- Answer / AnswerCitation / RefusalReason / ModelRef (§3.8).
- IngestReport, JobRepo support types, VectorRecord / VectorHit.
- Component traits (SourceConnector / Extractor / Chunker / Embedder /
  Retriever / LanguageModel / DocumentStore / VectorStore / JobRepo)
  plus their input helpers (SourceScope / ExtractContext / ChunkPolicy
  / EmbeddingInput / GenerateRequest / TokenChunk / FinishReason).
- CoreError (§10).
- nfc + to_posix helpers (§4.1, §6.6).

20 unit tests cover ID determinism (1000-run regression), key-order
invariance, two pinned hex values, newtype rejection of bad input,
Citation round-trip for all 5 variants, and to_posix collapsing +
Korean NFC.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 05:16:37 +00:00
d3cb06f60d Merge pull request 'docs: add Korean README' (#4) from docs/readme into main 2026-04-27 23:51:43 +00:00
bf2179b9c4 docs: add Korean README
- Status (spec frozen, P0 not started)
- Command summary (init/ingest/search/ask/inspect/doctor/eval)
- Locked decisions table (Rust 2024, SQLite+FTS5, LanceDB, fastembed, Ollama, whisper, Tauri)
- Dependency graph + module-boundary references
- Phase roadmap (P0..P9) with crates per phase
- Directory layout (current + post-P0)
- Build/run snippet (post-P0)
- Non-goals
- External AI integration paths (skill / MCP / HTTP)
- Contribution workflow + license note
- Cross-links to frozen design / plan / INDEX
2026-04-27 23:50:39 +00:00
ec9c571475 Merge pull request 'refactor(spec): cleanup audit pass — §refs / mock gate / Warning unification / streaming threading / cosine shift / fixtures' (#3) from refactor/spec-cleanup into main 2026-04-27 23:39:55 +00:00
bc1b3147cd refactor(spec): cleanup pass over component specs
Address 8 issues found in spec audit (post PR #2):

1. §refs label: distinguish design vs report sections in p3-1 / p3-2 / p4-2 /
   p9-1 / p9-5 contract_sections (e.g., "report §11.2 Ollama" not "§11.2").
2. mock feature gate: gate MockEmbedder (p3-1) and MockLanguageModel (p4-1)
   behind `mock` cargo feature, default OFF; add CI symbol-scan as DoD item.
3. Warning type unification: p1-2 frontmatter now emits
   `kb_parse_types::Warning` (matches p1-3 / p1-4); drops crate-internal type.
4. p4-3 streaming thread: explicitly single-threaded inside RagPipeline::ask;
   collection + sink.send share the calling thread, no race. UI concurrency
   is the caller's responsibility (TUI worker thread pattern in p9-3).
5. p6-2 tesseract version: noted that `tesseract` 0.13 has no stable Rust
   `version()` accessor; use TessVersion FFI or shell-out + cache approach.
6. p9-* App struct extensions: introduce `kb_tui::{Library,Search,Ask,Inspect}State`
   slots in p9-1 forward-decl form; p9-2/3/4 fill bodies in their own crate
   without editing `App`. Parallel-safety contract added.
7. p3-3 cosine score: shift `(sim+1)/2` instead of clamp; preserve ranking
   signal between unrelated and opposite vectors. Clamp reserved for NaN.
8. fixtures/ root: p0-1 DoD now creates all fixture subdirs with .gitkeep so
   downstream tasks have a stable target path.
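
Item 7 is easiest to see in code. The sketch below contrasts the two mappings; the function names are hypothetical, and mapping NaN to 0.0 is an assumption (the log only says clamping is reserved for NaN).

```rust
/// Shift mapping from item 7: cosine similarity in [-1, 1] -> score in [0, 1].
/// NaN maps to 0.0 (assumed fallback).
pub fn shift_score(sim: f32) -> f32 {
    if sim.is_nan() {
        return 0.0;
    }
    (sim + 1.0) / 2.0
}

/// The rejected alternative: clamping negatives to zero.
pub fn clamp_score(sim: f32) -> f32 {
    if sim.is_nan() {
        return 0.0;
    }
    sim.clamp(0.0, 1.0)
}
```

Under clamping, an unrelated vector (similarity 0) and an opposite vector (similarity -1) both score 0.0. Under the shift they score 0.5 and 0.0, so the ranking signal between them survives.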
2026-04-27 23:38:13 +00:00