Commit Graph

56 Commits

Author SHA1 Message Date
f604a381df p1-3: snapshot tests + clippy fix
Adds two snapshot tests (`nested-headings.md`, `code-and-table.md`) under
crates/kb-parse-md/tests/blocks_snapshots.rs, with matching baseline JSON
next to each fixture. The snapshot view projects `kb_core::Inline` to
flat strings — `Inline` carries `serde(tag = "kind")` which is
incompatible with newtype variants holding a primitive (`Text(String)`),
so direct serialization of `ParsedBlock` would fail today. The view
preserves the contract that matters for P1-3 (heading paths, source
spans, payload kinds, payload text/code/table content) and will keep
working once kb-core fixes the Inline schema in a later task.

Also tightens `level_to_use >= 1 && level_to_use <= 6` into
`(1..=6).contains(&level_to_use)` to satisfy `clippy::manual_range_contains`.
2026-04-30 14:17:41 +00:00
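The clippy fix in f604a381df can be illustrated with a minimal sketch. The function name `is_valid_heading_level` and the `u8` type are assumptions made for this example; only the range check itself comes from the commit message.

```rust
/// Returns true when `level` is a valid Markdown heading level (1 through 6).
/// Hypothetical helper; the real call site in kb-parse-md may look different.
fn is_valid_heading_level(level: u8) -> bool {
    // Same truth table as `level >= 1 && level <= 6`, written in the form
    // that `clippy::manual_range_contains` suggests.
    (1..=6).contains(&level)
}
```

The inclusive range reads as a single bounds check and avoids repeating the variable on both sides of `&&`.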
4e7e9cad87 p1-3: add parse_blocks (pulldown-cmark walker) submodule
Implements `kb_parse_md::parse_blocks(body, body_offset_lines)` returning a
flat `Vec<ParsedBlock>` plus warnings. Walks pulldown-cmark events through a
small frame-based state machine that tracks heading paths, accumulates
inline buffers (Text/Code/Link/Strong/Emph only — design §3.4), and
reports SourceSpan::Line spans in 1-indexed file-line coordinates.

Covers headings, paragraphs, code blocks (lang from info string), GFM
tables (with malformed fallback to paragraph + MalformedTable warning),
lists (nested sub-lists flattened into parent item), and block-level image
references. Inline images are dropped silently per the inline filter.
Adversarial inputs are caught with `catch_unwind` and degrade to an empty
output + ExtractFailed warning.

15 unit tests cover heading-path correctness, code lang, table parsing,
malformed-table fallback (driven via synthetic events since pulldown-cmark
auto-normalizes table widths), LF/CRLF line-range parity, image refs,
nested-list flattening, inline filter, and 100-iteration random-bytes plus
hand-crafted adversarial-input no-panic guards.
2026-04-30 14:14:34 +00:00
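The `catch_unwind` guard described in 4e7e9cad87 could look roughly like the sketch below. It assumes blocks are plain strings and models the warning as a local `WarningKind` enum; the real `ParsedBlock` and `Warning` types live in kb-parse-types, and `parse_guarded` is a hypothetical name.

```rust
use std::panic::{self, AssertUnwindSafe};

/// Stand-in for the parser warning kind named in the commit message.
#[derive(Debug, PartialEq)]
enum WarningKind {
    ExtractFailed,
}

/// Runs `walk` over `body`. If the walker panics, the panic is caught and the
/// result degrades to an empty block list plus one ExtractFailed warning.
fn parse_guarded<F>(body: &str, walk: F) -> (Vec<String>, Vec<WarningKind>)
where
    F: FnOnce(&str) -> Vec<String>,
{
    // AssertUnwindSafe is acceptable here because nothing captured by the
    // closure is observed again after a panic.
    match panic::catch_unwind(AssertUnwindSafe(|| walk(body))) {
        Ok(blocks) => (blocks, Vec::new()),
        Err(_) => (Vec::new(), vec![WarningKind::ExtractFailed]),
    }
}
```

Note that `catch_unwind` only helps when the binary is built with unwinding panics; under `panic = "abort"` the process still terminates.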
ff37ea5927 Merge pull request 'feat(p1-2): kb-parse-md frontmatter (YAML/TOML → Metadata)' (#7) from feat/p1-2-parse-md-frontmatter into main
Reviewed-on: altair823-org/kb#7
2026-04-30 14:06:54 +00:00
5850bfcf7a p1-2: address review minors (FrontmatterSpan doc, parse_frontmatter rustdoc, YAML library note)
M1: Reword the FrontmatterSpan doc-comment from "technically meant to be
crate-internal" to a forward-looking note about P1-3 / P1-4 callers using
bytes[span.end..] for body slicing.

M3: Add an explicit `# Errors` section to parse_frontmatter's rustdoc.
The current implementation never returns Err — all recoverable problems
are downgraded to warnings — but the Result is kept on the signature so
future hard-fail conditions can be added without breaking callers.

M4: Mention serde_yml in the library-choice rationale alongside
serde_yaml_ng, with a one-line note on why _ng was preferred (stricter
adherence to original serde_yaml semantics around null / tagged enums).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 13:13:16 +00:00
6a4db624b6 p1-2: fix CRLF / trailing whitespace / BOM in frontmatter delimiter detection
C1: detect_delimiters now returns (DelimKind, FrontmatterSpan, Range<usize>)
where the inner range is the YAML/TOML payload byte range — derived in one
place rather than recomputed by the parser via fixed-width opening_len /
closing_len constants that wrongly assumed LF endings. CRLF input now parses
correctly end-to-end; the originally-failing reviewer probe
"---\r\ntitle: Doc\r\n---\r\nbody\r\n" now yields title="Doc" with no
warnings.

I1: Trailing horizontal whitespace (spaces / tabs) on either delimiter
line is now accepted, matching Hugo / Jekyll. Editors that auto-trim
trailing whitespace no longer silently break otherwise-valid frontmatter.

I2: A leading UTF-8 BOM (EF BB BF, byte 0 only) is tolerated and skipped
before delimiter scanning. The returned span.start accounts for the BOM
(=3) so callers using bytes[span.end..] for body slicing still get the
correct range without further bookkeeping. Mid-input BOMs are not stripped.

M2: Drop the now-dead DelimKind::opening_len / closing_len constants —
the inner range is encoded once at detection time.

12 new tests covering CRLF (YAML / TOML / mixed-EOL / end-to-end),
trailing whitespace on opener / closer / tabs, leading BOM (detection +
full pipeline), and mid-input BOM non-stripping.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 13:12:34 +00:00
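The delimiter handling fixed in 6a4db624b6 can be sketched in std-only Rust. This is an illustration under assumptions, not the crate's code: `FrontmatterSpan` is modeled as a plain `Range<usize>`, and `read_line` is a hypothetical helper. It shows the idea of deriving the payload range once at detection time, so CRLF endings, trailing blanks, and a leading BOM need no fixed-width constants.

```rust
use std::ops::Range;

#[derive(Debug, Clone, Copy, PartialEq)]
enum DelimKind {
    Yaml,
    Toml,
}

const BOM: &[u8] = &[0xEF, 0xBB, 0xBF];

/// Returns the line starting at `pos` with its EOL and trailing spaces/tabs
/// removed, plus the byte offset where the next line begins.
fn read_line(bytes: &[u8], pos: usize) -> (&[u8], usize) {
    let rest = &bytes[pos..];
    let (mut line, next) = match rest.iter().position(|&b| b == b'\n') {
        Some(nl) => (&rest[..nl], pos + nl + 1),
        None => (rest, bytes.len()),
    };
    while let Some((&last, head)) = line.split_last() {
        if last == b'\r' || last == b' ' || last == b'\t' {
            line = head;
        } else {
            break;
        }
    }
    (line, next)
}

/// Returns (kind, whole frontmatter span, inner payload range), or None when
/// there is no terminated frontmatter block at the start of the input.
fn detect_delimiters(bytes: &[u8]) -> Option<(DelimKind, Range<usize>, Range<usize>)> {
    // A BOM is tolerated at byte 0 only; the span then starts at 3.
    let start = if bytes.starts_with(BOM) { BOM.len() } else { 0 };
    let (opener, inner_start) = read_line(bytes, start);
    let kind = if opener == b"---" {
        DelimKind::Yaml
    } else if opener == b"+++" {
        DelimKind::Toml
    } else {
        return None;
    };
    let mut pos = inner_start;
    while pos < bytes.len() {
        let (line, next) = read_line(bytes, pos);
        if line == opener {
            // Closing delimiter found: the body begins at `next`.
            return Some((kind, start..next, inner_start..pos));
        }
        pos = next;
    }
    None // unterminated block: treated as no frontmatter
}
```

A caller would slice the body with `&bytes[span.end..]`, as the commit message describes.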
1fab6b0207 p1-2: address spec review (drop user_id_alias mirror in user map)
Spec §"Behavior contract" line 74 says `id:` is captured into
`metadata.user_id_alias` only. Remove the redundant `user.insert`
that was also writing it into the user map, and update the snapshot
baseline accordingly.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-30 13:02:28 +00:00
42a7d53e5d p1-2: fixtures + snapshot tests for frontmatter parser
Two markdown fixtures with hand-authored JSON baselines that pin the
§0 Q9 derive output across runs:

- frontmatter-only.md exercises the YAML happy path with most fields,
  unknown keys, an `id:` field, and a non-UTC created_at (so the
  baseline shows original_timestamps preservation).
- mixed-lang.md is body-only with no `lang:` field; baseline pins the
  lingua autodetect result for our enabled language set.

A separate `emit_snapshots` test (marked `#[ignore]`) regenerates the
baselines from the current parser output. A determinism test parses
the fixture twice and asserts equality so any non-determinism (e.g.
key ordering, lingua nondeterminism) fails fast.
2026-04-30 12:56:19 +00:00
cc8f7dad3f p1-2: parse_frontmatter + §0 Q9 derive table
Implement the frontmatter submodule:

- detect_delimiters scans for a leading YAML (---) or TOML (+++) block at
  byte 0. Strict per §0 Q9: no leading whitespace / BOM, no chars on the
  delimiter line. Closing must be its own line. Unterminated → no FM.
- parse_raw deserializes into RawFrontmatter, a serde-flatten struct that
  catches unknown keys into a serde_json::Map for verbatim preservation
  in metadata.user.
- derive_metadata implements the §0 Q9 fallback chain:
    title       → frontmatter | BodyHints.first_h1 | (filename: caller)
    aliases/tags→ frontmatter | []
    lang        → frontmatter | lingua autodetect on first 4 KB | hints
                  | "und"
    created_at  → frontmatter (RFC 3339, normalized to UTC) | fs_ctime
    updated_at  → frontmatter | fs_mtime
    source_type → frontmatter | "markdown"
    trust_level → frontmatter | "primary"
    id          → user_id_alias only — never a doc_id factor (§4.2)
- Non-UTC offsets are normalized to UTC; the original string is preserved
  in user.original_timestamps[field] per §0 Q9.
- Warnings are emitted for: malformed YAML/TOML, unknown enum values,
  malformed timestamps. Unknown keys are silent.
- lingua detector is cached in a OnceLock — first build is heavy.
- 15 unit tests cover every row of the derive table + delimiter edge
  cases + an explicit pin that `id:` does not feed id_for_doc.
2026-04-30 12:56:02 +00:00
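Two rows of the derive table in cc8f7dad3f can be sketched as `Option` chains. The function names and the plain `&str` inputs are assumptions for illustration; the real `derive_metadata` works on `RawFrontmatter` and `BodyHints`, and the lingua detection result is passed in here as an already-computed `Option`.

```rust
/// title: frontmatter, then the first H1 from the body. A `None` result
/// leaves the filename fallback to the caller.
fn derive_title(frontmatter: Option<&str>, first_h1: Option<&str>) -> Option<String> {
    frontmatter.or(first_h1).map(str::to_string)
}

/// lang: frontmatter, then autodetect, then hints, then "und".
fn derive_lang(frontmatter: Option<&str>, detected: Option<&str>, hint: Option<&str>) -> String {
    frontmatter
        .or(detected)
        .or(hint)
        .unwrap_or("und")
        .to_string()
}
```

Each `.or(...)` step only takes effect when everything before it is `None`, which matches the left-to-right priority in the table.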
a86b463fc4 p1-2: scaffold kb-parse-md crate
Add the workspace member with the dep allow-list pinned by design §0 Q9
and the task spec. P1-2 will land the frontmatter submodule in the next
commit; P1-3 will add the block parser as a sibling.

Notable choice: serde_yaml (dtolnay) was archived as unmaintained in 2024
so we use serde_yaml_ng, the maintained fork. lingua's per-language
features are explicitly enabled (default-features=false) to keep build
time + binary size sane — only the languages we need at parse time.
2026-04-30 12:55:20 +00:00
69a0dbb79d Merge pull request 'feat(p1-1): kb-source-fs filesystem source connector' (#6) from feat/p1-1-source-fs into main
Reviewed-on: altair823-org/kb#6
2026-04-30 12:45:45 +00:00
967a6a62c5 p1-1: address review (walker module doc, TODO markers, .kbignore ADR)
- walker.rs: document why we pick walkdir over ignore::WalkBuilder
  (explicit canonical-path comparison for sibling-subtree symlinks).
- walker.rs: log canonicalize failures via tracing::debug! (was a silent
  `Err(_) => continue`) so broken/permission-denied symlink targets are
  observable at debug verbosity.
- connector.rs: TODO marker on the scope.include debug-log noting the
  filter belongs at the extractor router (P1-2/P1-3).
- connector.rs: TODO marker on expand_tilde to hoist tilde + ${VAR}
  expansion into a kb-config helper once available.
- connector.rs: comment on the .kbignore read documenting the
  re-read-on-every-scan() contract.
- connector.rs test: tighten the `.kbignore`-itself ADR comment and
  upgrade the assertion to actively pin "`.kbignore` IS emitted" instead
  of "either is fine"; future drift will now fail the test.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 12:42:25 +00:00
f8d00bdaf6 p1-1: address review (AssetStorage doc-comment, real directory-cycle test)
- Document AssetStorage::Copied / Reference path semantics so P1-6 (asset
  writer) knows that at scan time `Copied.path` is the SOURCE path and the
  writer is responsible for both copying bytes AND overwriting `path`
  with the destination.
- Rename the dangling-symlink test to make its scope explicit
  (`dangling_symlink_pseudo_cycle_does_not_crash`); the prior name implied
  a real two-step directory cycle but the targets were broken links.
- Add `two_step_directory_cycle_visited_set_breaks_loop`: builds
  `a/loop -> ../b` and `b/loop -> ../a` over real directories with real
  files, asserting scan terminates with a finite, deterministic asset
  list — exercises the canonical-path visited-set in walker::walk_files.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 12:42:12 +00:00
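The canonical-path visited-set that these tests exercise can be sketched with the standard library alone. The real `walker::walk_files` is built on walkdir and the ignore crate; the version below is a simplified stand-in with an assumed signature, and it shows only why a two-step directory cycle terminates.

```rust
use std::collections::HashSet;
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Collects regular files under `root`, following symlinks but descending
/// into each canonical directory at most once so that link cycles terminate.
fn walk_files(root: &Path) -> io::Result<Vec<PathBuf>> {
    let mut visited: HashSet<PathBuf> = HashSet::new();
    let mut files = Vec::new();
    let mut stack = vec![root.to_path_buf()];
    while let Some(dir) = stack.pop() {
        // Resolve symlinks; dangling or unreadable targets are skipped.
        let canonical = match fs::canonicalize(&dir) {
            Ok(path) => path,
            Err(_) => continue,
        };
        // `insert` returns false when this directory was already walked.
        if !visited.insert(canonical.clone()) {
            continue;
        }
        for entry in fs::read_dir(&canonical)? {
            let path = entry?.path();
            // `fs::metadata` follows symlinks, unlike `symlink_metadata`.
            match fs::metadata(&path) {
                Ok(meta) if meta.is_dir() => stack.push(path),
                Ok(meta) if meta.is_file() => files.push(path),
                _ => {}
            }
        }
    }
    files.sort(); // deterministic output order
    Ok(files)
}
```

With `a/loop -> ../b` and `b/loop -> ../a`, the second visit to either directory resolves to a canonical path that is already in the set, so the walk stops there.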
0699220d5d p1-1: tree-1 fixture + snapshot/determinism tests
fixtures/source-fs/tree-1/:
  README.md
  notes/alpha.md
  notes/beta.md
  ignored/skip.tmp        (excluded by .kbignore *.tmp)
  .kbignore               ("*.tmp")
  .DS_Store               (implicitly excluded by FsSourceConnector)

The committed baseline (tree-1.snapshot.json) has discovered_at,
source_uri.value, and stored.path replaced with "<stripped>" so the
JSON is portable across checkout locations and CI runs. The test
applies the same stripping to scan output before comparing.

The determinism test runs scan twice and asserts byte-identical
serialized JSON (post-strip) — same filesystem state must yield the
same Vec<RawAsset>.

Regenerate baseline with `KB_REGEN_SNAPSHOT=1 cargo test -p kb-source-fs
--test snapshot_tree1 -- tree_1_snapshot_matches_baseline`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 12:28:00 +00:00
3d4c485415 p1-1: integration test — symlink cycles do not loop
Two cases:
  - root/notes -> root  (single-link cycle through workspace root).
  - root/a -> b, root/b -> a  (two-step cycle of dangling symlinks).

Both must complete in O(seconds) and surface alpha.md exactly once.
Proves walker::walk_files's visited-set guard catches realistic cycle
shapes via the public SourceConnector API.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 12:27:44 +00:00
7c75e10b2c p1-1: scaffold kb-source-fs crate (FsSourceConnector)
Walk config.workspace.root, apply gitignore-style filters
(config.workspace.exclude ∪ .kbignore ∪ baked-in defaults for
.DS_Store / ._*), stream BLAKE3 over each file, and emit a
deterministic Vec<RawAsset> sorted by workspace_path.

Modules:
  - hash:      streaming blake3::Hasher + 64 KiB read buffer (no whole-file
               loads); pinned digests for empty input and "hello world".
  - media:     extension → MediaType (markdown/pdf/image/audio/other).
  - walker:    ignore::OverrideBuilder for filter union; walkdir with
               manual visited-set cycle protection on top of follow_links.
  - connector: public FsSourceConnector::new(&Config) +
               SourceConnector::scan(&SourceScope) impl. Uses
               kb_core::to_posix for WorkspacePath construction (carries
               P0-1 # rejection through unchanged) and kb_core::id_for_asset
               for AssetId derivation. Storage variant signals intent only;
               actual byte copy is P1-6's responsibility.

Per design §3.3, §6.2, §6.6, §7.1, §7.2, §8.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 12:27:34 +00:00
98280c6249 Merge pull request 'feat(p0-1): workspace skeleton + frozen contracts' (#5) from feat/p0-1-skeleton into main
Reviewed-on: altair823-org/kb#5
2026-04-30 12:16:46 +00:00
d2c8728095 p0-1: address review (drop unused thiserror dep, document kb-core reserve)
- Cargo.toml: remove `thiserror` from kb-config, kb-parse-types, kb-app
  (unused — none of those crates' src trees reference thiserror; CoreError
  in kb-core is the only consumer).
- kb-config keeps the `kb-core` dep with a one-line comment marking
  CoreError reserved for P1-* config-error wiring per the review thread.
- ids.rs: switch `validate_hex32` from a hand-rolled `matches!` byte range
  to `is_ascii_hexdigit()` so the hex check is the canonical idiom (and
  satisfies `clippy::manual_is_ascii_check` under `-D warnings`).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 08:55:39 +00:00
d91b60325e p0-1: address review (apply_env full schema map, drop dead Option in logging::init)
- kb-config::apply_env now covers every leaf key in `Config` via an
  explicit grep-friendly match block (one arm per leaf), keyed
  `KB_<SECTION>_<KEY>`. Booleans flow through a shared `parse_bool` helper.
  Numeric leaves silently keep their prior value on parse failure so a
  malformed env entry can't crash startup.
- New tests: env_unknown_key_is_ignored,
  env_overrides_chunking_target_tokens,
  env_overrides_models_llm_endpoint_and_temperature,
  env_overrides_indexing_watch_filesystem_bool.
- kb-app::logging::init now returns `Result<WorkerGuard>` instead of
  `Result<Option<WorkerGuard>>` — the inner `Option` was always `Some` so
  the wrapper was dead. kb-cli/main.rs collapses the call from
  `.ok().flatten()` to `.ok()`, preserving fail-soft semantics on logging
  init.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 08:53:59 +00:00
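The `apply_env` behavior described in d91b60325e might be sketched as follows. The three-field `Config`, the field types, the `KB_INDEXING_WATCH_FILESYSTEM` key, and the boolean spellings accepted by `parse_bool` are all assumptions for this example. Treating a malformed boolean like a malformed number is also an assumption, since the commit message only states the rule for numeric leaves.

```rust
/// Reduced stand-in for the real `Config`, with one leaf per type.
#[derive(Debug, Clone, PartialEq)]
struct Config {
    rag_score_gate: f32,
    search_default_k: usize,
    indexing_watch_filesystem: bool,
}

/// Shared boolean parser; the accepted spellings here are an assumption.
fn parse_bool(raw: &str) -> Option<bool> {
    match raw.trim().to_ascii_lowercase().as_str() {
        "1" | "true" | "yes" | "on" => Some(true),
        "0" | "false" | "no" | "off" => Some(false),
        _ => None,
    }
}

/// Applies `KB_<SECTION>_<KEY>` overrides, one match arm per leaf.
/// Values that fail to parse leave the prior value in place.
fn apply_env<I>(cfg: &mut Config, vars: I)
where
    I: IntoIterator<Item = (String, String)>,
{
    for (key, value) in vars {
        match key.as_str() {
            "KB_RAG_SCORE_GATE" => {
                if let Ok(v) = value.parse() {
                    cfg.rag_score_gate = v;
                }
            }
            "KB_SEARCH_DEFAULT_K" => {
                if let Ok(v) = value.parse() {
                    cfg.search_default_k = v;
                }
            }
            "KB_INDEXING_WATCH_FILESYSTEM" => {
                if let Some(v) = parse_bool(&value) {
                    cfg.indexing_watch_filesystem = v;
                }
            }
            _ => {} // unknown keys are ignored
        }
    }
}
```

In a real binary the iterator would typically be `std::env::vars()`; taking it as a parameter keeps the function testable without mutating the process environment.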
ca0eb2f9cb p0-1: address review (reject # in WorkspacePath, parse error context)
- WorkspacePath: add `WorkspacePath::new(s)` validating constructor that
  rejects any string containing `#` (collides with the W3C-Media-Fragments
  separator that Citation URIs depend on). Doc-comment on the type now
  explains the invariant.
- normalize::to_posix changes signature to `Result<WorkspacePath, CoreError>`
  and now flows through `WorkspacePath::new`, so a path with `#` is rejected
  at construction rather than at every reader. Only one caller existed
  outside tests, so the signature change is contained.
- Citation::parse uses `WorkspacePath::new` on the path side. With multiple
  `#` separators in the input, `rsplit_once` would otherwise leave a `#` on
  the path; the new constructor closes the hole.
- Citation::parse + parse_hms_ms: every `bail!` / `anyhow!` site now quotes
  the offending substring (and the full input where useful) so error
  messages identify what went wrong without re-deriving from context.
- New tests: `to_posix_rejects_hash_in_path`,
  `parse_path_with_hash_rejected_at_to_posix_layer`. Existing
  `to_posix(...).0` call sites updated for the Result signature.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 08:52:15 +00:00
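The `#` invariant from ca0eb2f9cb can be sketched as below. The error type is a plain `String` here rather than `CoreError`, and `split_citation` is a hypothetical helper that stands in for the path-splitting step of `Citation::parse`.

```rust
/// Workspace-relative path that is guaranteed not to contain '#'.
#[derive(Debug, Clone, PartialEq, Eq)]
struct WorkspacePath(String);

impl WorkspacePath {
    /// Validating constructor: rejects any string containing '#', which
    /// would collide with the fragment separator in citation URIs.
    fn new(s: &str) -> Result<Self, String> {
        if s.contains('#') {
            return Err(format!("workspace path must not contain '#': {s:?}"));
        }
        Ok(Self(s.to_string()))
    }
}

/// Splits `path#fragment` at the last '#'. With several '#' in the input,
/// `rsplit_once` leaves one on the path side, and the constructor rejects it.
fn split_citation(input: &str) -> Result<(WorkspacePath, &str), String> {
    let (path, fragment) = input
        .rsplit_once('#')
        .ok_or_else(|| format!("citation has no '#' fragment: {input:?}"))?;
    Ok((WorkspacePath::new(path)?, fragment))
}
```

Both error messages quote the offending input, in line with the error-context change in the same commit.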
286e62734d p0-1: address review (lowercase-hex normalize, more pinned ID tests)
- ids.rs: validate_hex32 now accepts upper+lower hex; FromStr lowercases
  the stored representation so equality and hashing stay canonical.
- Renamed test newtype_rejects_uppercase →
  newtype_accepts_uppercase_normalizes_to_lowercase and added
  newtype_rejects_invalid_chars_after_uppercase_pass.
- Added pinned-hex independence tests for id_for_block / id_for_chunk /
  id_for_embedding / id_for_index. Each test also asserts that the bytes
  serde-json-canonicalizer emits for the tuple match the literal JSON we
  hashed externally with b3sum, so a future field-rename can't silently
  drift the hash without flagging the JSON layer first.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 08:49:48 +00:00
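The accept-then-normalize behavior in 286e62734d might look like this sketch. `DocumentId` stands in for any of the newtype IDs, the error type is simplified to `String`, and the hex check uses `is_ascii_hexdigit()`, the idiom a later commit (d2c8728095) switches to.

```rust
use std::str::FromStr;

/// Stand-in for one of the 32-hex-character newtype IDs.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
struct DocumentId(String);

/// Accepts exactly 32 ASCII hex digits, upper or lower case.
fn validate_hex32(s: &str) -> bool {
    s.len() == 32 && s.bytes().all(|b| b.is_ascii_hexdigit())
}

impl FromStr for DocumentId {
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        if !validate_hex32(s) {
            return Err(format!("expected 32 hex characters, got {s:?}"));
        }
        // Store lowercase so equality and hashing stay canonical.
        Ok(DocumentId(s.to_ascii_lowercase()))
    }
}
```

Because normalization happens once at construction, two spellings of the same ID compare equal and hash identically afterwards.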
5af07c174d p0-1: address quality review (wire convention, IngestItemKind re-export, clippy)
Three follow-ups from the code-quality review pass on P0-1:

- Re-export `IngestItemKind` from `kb-core` so downstream tasks
  constructing `IngestItem` don't need `kb_core::ingest::IngestItemKind`.
- Document the `--json` wire-schema convention by introducing
  `kb-cli/src/wire.rs` with `wire_*` helpers paralleling the existing
  inline `wire_ingest`. Each Ok-path `--json` branch now routes through
  these helpers so future P1-5/P3/P4/P5 implementations slot the
  `schema_version` envelope in automatically. `DoctorReport` keeps its
  struct-field `schema_version` (the documented exception), and the
  helper round-trips it idempotently. Records the convention in
  `kb-app/src/lib.rs`'s top docstring.
- Fix clippy `single_char_add_str` in `kb_core::normalize` (replace
  `out.push_str(".")` with `out.push('.')`).

Verified: `cargo check`, `cargo test` (5 new wire-helper tests),
`cargo clippy -D warnings`, and `RUSTFLAGS=-D warnings cargo build` all
clean. Smoke-tested `kb doctor --json` still emits
`{"schema_version":"doctor.v1",...}`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 05:33:31 +00:00
a166b7051c p0-1: wire-schema stubs, doc/spec stubs, V001 migration, fixtures
- docs/wire-schema/v1/ ships 7 schema stubs (citation, search_hit,
  answer, ingest_report, doc_summary, chunk_inspection, doctor) that
  pin schema_version + required fields per design §2. Full property
  validation lands in later phases.
- docs/spec/ ships 7 markdown stubs each linking to the canonical
  frozen design (domain-model, ids, canonical-document, chunk-policy,
  citation-policy, module-boundaries, ai-generation-guidelines).
- migrations/V001__init.sql contains only schema_meta + migrations
  tables per design §5.1; data tables ship in P1-6/P2-1/P3-3.
- fixtures/ has the 11 subdirectories every downstream task references
  (markdown, source-fs, search/{lexical,hybrid}, embed, vector, rag,
  eval, image, pdf, audio). Empty subdirs use .gitkeep so they track.
  fixtures/markdown/ ships the 3 phase-0 fixtures: simple-note.md,
  nested-headings.md, code-and-table.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 05:17:32 +00:00
ec8a4ddb1b p0-1: kb-cli clap entry with §10 exit-code mapping
Adds the kb binary with clap v4 derive subcommands mapping 1:1 to
kb-app facade functions:

  init | ingest | list docs | inspect (doc|chunk) | search | ask
  | doctor | eval run

Global flags: --config, --verbose, --debug, --json. On --json, output
conforms to wire schema v1 (e.g. doctor.v1 emitted by kb-app::doctor).

Exit-code mapping per design §10:
  0 success
  1 RefusalSignal / NoHitSignal (kb ask refusal, kb search no-hit)
  2 any other anyhow::Error
  3 DoctorUnhealthy

Tracing initialized at startup with the file appender from kb-app.

Verified via:
  XDG_*=… cargo run -p kb-cli -- init    → idempotent
  XDG_*=… cargo run -p kb-cli -- doctor --json
    → {"schema_version":"doctor.v1","ok":true,…} exit 0
  XDG_*=… cargo run -p kb-cli -- doctor   (human form, ✓ marks)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 05:17:18 +00:00
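The exit-code table in ec8a4ddb1b can be expressed as a small mapping. The real CLI downcasts an `anyhow::Error` to the signal types; the `Outcome` enum below is a std-only stand-in introduced for this sketch.

```rust
/// Simplified stand-in for "Ok, or an error that may downcast to a signal".
#[derive(Debug)]
enum Outcome {
    Success,
    RefusalSignal,
    NoHitSignal,
    DoctorUnhealthy,
    Other(String),
}

/// Maps an outcome to the process exit code listed in the commit message.
fn exit_code(outcome: &Outcome) -> i32 {
    match outcome {
        Outcome::Success => 0,
        Outcome::RefusalSignal | Outcome::NoHitSignal => 1,
        Outcome::Other(_) => 2,
        Outcome::DoctorUnhealthy => 3,
    }
}
```

A `main` function could pass the result to `std::process::exit`.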
237ada6e21 p0-1: kb-app facade stubs + tracing init helper
Adds the kb-app crate (§7) as the single facade between UI crates
(kb-cli / kb-tui / kb-desktop) and the rest of the workspace. Public
surface mirrors the task spec exactly:

- init_workspace(force) — XDG dir creation + config.toml seed; idempotent
  unless force=true. Honors XDG envs and tilde-expands the workspace
  root to $HOME/KnowledgeBase.
- doctor() — emits a doctor.v1 report with config_loaded +
  data_dir_writable checks; downstream checks land in later phases.
- ingest / list_docs / inspect_doc / inspect_chunk / search / ask —
  bail!("not yet wired (P<n>-<i>)") so kb-cli surfaces exit code 2
  cleanly per §10.
- AskOpts + DoctorReport + DoctorCheck.
- doctor_signal::{DoctorUnhealthy, RefusalSignal, NoHitSignal} —
  signal types the CLI downcasts on for §10 exit-code mapping.
- logging::init() — daily-rolling file appender at
  $XDG_STATE_HOME/kb/logs/kb.log, plus stderr-fallback EnvFilter.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 05:17:11 +00:00
76a860296e p0-1: kb-config schema + XDG path resolution
Adds the kb-config crate per design §6. Provides the frozen Config
schema (§6.4) with serde + toml round-trip, defaults() that exactly
match the reference values (e.g. score_gate=0.30, target_tokens=500,
embedding.dimensions=384, rrf_k=60), and XDG path resolvers that honor
XDG_CONFIG_HOME / XDG_DATA_HOME / XDG_CACHE_HOME / XDG_STATE_HOME.

Layer order in load(): defaults → file → env (KB_<SECTION>_<KEY>);
CLI overrides apply later in kb-cli. Env mapping covers the keys
needed by P0 smoke tests; the rest land as their config sections wire
up.

5 unit tests cover serde round-trip, defaults pinned to design,
KB_RAG_SCORE_GATE / KB_SEARCH_DEFAULT_K env override, and
XDG_CONFIG_HOME handling.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 05:16:57 +00:00
030986b37c p0-1: kb-parse-types thin parser-intermediate crate
Adds the kb-parse-types crate per design §3.7b. Depends only on kb-core
+ serde/thiserror — never on parser libraries. Defines:

- ParsedBlock + ParsedBlockKind + ParsedPayload (8 variants matching
  Block variants in kb-core).
- Warning + WarningKind for parser diagnostics.
- Forward-declared ParsedImageRegion / ParsedPdfPage / ParsedAudioSegment
  shells for P6/P7/P8.

`cargo tree -p kb-parse-types` shows only kb-core, serde, and thiserror.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 05:16:50 +00:00
f86df99fe9 p0-1: workspace + kb-core domain types, traits, and ID recipe
Stand up the Cargo workspace (Rust 2024 / resolver=3) with the kb-core
crate per the frozen design (§3, §4, §7, §10). kb-core has zero
deps on other kb-* crates and exposes:

- Newtype IDs (AssetId / DocumentId / BlockId / ChunkId / EmbeddingId /
  IndexId) with Display + FromStr that reject anything but 32 lower-hex.
- id_from + id_for_{asset,doc,block,chunk,embedding,index} per §4.2;
  pinned hex test values computed via an independent JCS+blake3 tool.
- CanonicalDocument, Block (8 variants), SourceSpan, Inline (§3.4).
- Citation (5 variants) with W3C Media Fragments to_uri / parse;
  round-trip property holds for every variant.
- Metadata + Provenance (§3.6); SearchQuery / SearchHit / RetrievalDetail
  (§3.7); DocFilter / DocSummary mirrors of wire §2.5.
- Answer / AnswerCitation / RefusalReason / ModelRef (§3.8).
- IngestReport, JobRepo support types, VectorRecord / VectorHit.
- Component traits (SourceConnector / Extractor / Chunker / Embedder /
  Retriever / LanguageModel / DocumentStore / VectorStore / JobRepo)
  plus their input helpers (SourceScope / ExtractContext / ChunkPolicy
  / EmbeddingInput / GenerateRequest / TokenChunk / FinishReason).
- CoreError (§10).
- nfc + to_posix helpers (§4.1, §6.6).

20 unit tests cover ID determinism (1000-run regression), key-order
invariance, two pinned hex values, newtype rejection of bad input,
Citation round-trip for all 5 variants, and to_posix collapsing +
Korean NFC.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 05:16:37 +00:00
d3cb06f60d Merge pull request 'docs: add Korean README' (#4) from docs/readme into main 2026-04-27 23:51:43 +00:00
kb
bf2179b9c4 docs: add Korean README
- Status (spec frozen, P0 not started)
- Command summary (init/ingest/search/ask/inspect/doctor/eval)
- Locked decisions table (Rust 2024, SQLite+FTS5, LanceDB, fastembed, Ollama, whisper, Tauri)
- Dependency graph + module-boundary references
- Phase roadmap (P0..P9) with crates per phase
- Directory layout (current + post-P0)
- Build/run snippet (post-P0)
- Non-goals
- External AI integration paths (skill / MCP / HTTP)
- Contribution workflow + license note
- Cross-links to frozen design / plan / INDEX
2026-04-27 23:50:39 +00:00
ec9c571475 Merge pull request 'refactor(spec): cleanup audit pass — §refs / mock gate / Warning unification / streaming threading / cosine shift / fixtures' (#3) from refactor/spec-cleanup into main 2026-04-27 23:39:55 +00:00
kb
bc1b3147cd refactor(spec): cleanup pass over component specs
Address 8 issues found in spec audit (post PR #2):

1. §refs label: distinguish design vs report sections in p3-1 / p3-2 / p4-2 /
   p9-1 / p9-5 contract_sections (e.g., "report §11.2 Ollama" not "§11.2").
2. mock feature gate: gate MockEmbedder (p3-1) and MockLanguageModel (p4-1)
   behind `mock` cargo feature, default OFF; add CI symbol-scan as DoD item.
3. Warning type unification: p1-2 frontmatter now emits
   `kb_parse_types::Warning` (matches p1-3 / p1-4); drops crate-internal type.
4. p4-3 streaming thread: explicitly single-threaded inside RagPipeline::ask;
   collection + sink.send share the calling thread, no race. UI concurrency
   is the caller's responsibility (TUI worker thread pattern in p9-3).
5. p6-2 tesseract version: noted that `tesseract` 0.13 has no stable Rust
   `version()` accessor; use TessVersion FFI or shell-out + cache approach.
6. p9-* App struct extensions: introduce `kb_tui::{Library,Search,Ask,Inspect}State`
   slots in p9-1 forward-decl form; p9-2/3/4 fill bodies in their own crate
   without editing `App`. Parallel-safety contract added.
7. p3-3 cosine score: shift `(sim+1)/2` instead of clamp; preserve ranking
   signal between unrelated and opposite vectors. Clamp reserved for NaN.
8. fixtures/ root: p0-1 DoD now creates all fixture subdirs with .gitkeep so
   downstream tasks have a stable target path.
2026-04-27 23:38:13 +00:00
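Item 7 above (the cosine shift) can be illustrated with a short sketch. The function name is hypothetical, and mapping NaN to 0.0 is an assumption about what "clamp reserved for NaN" means in the spec.

```rust
/// Maps a cosine similarity in [-1, 1] to a score in [0, 1] by shifting
/// rather than clamping, so opposite vectors (-1) still rank below
/// unrelated ones (0).
fn cosine_to_score(sim: f32) -> f32 {
    if sim.is_nan() {
        // Assumed NaN handling: treat as the lowest possible score.
        return 0.0;
    }
    (sim + 1.0) / 2.0
}
```

Clamping negative similarities to zero would give unrelated and opposite vectors the same score; the shift keeps them apart (0.5 versus 0.0).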
c29ccc7925 Merge pull request 'refactor(spec): introduce kb-parse-types thin crate' (#2) from refactor/parse-types-crate into main 2026-04-27 20:43:21 +00:00
kb
9fa38543a8 refactor(spec): introduce kb-parse-types thin crate
PR #1 review left a design-debt note: ParsedBlock landing in kb-core would
(a) force every crate to recompile on parser-internal changes, and
(b) cause namespace pollution when P6/P7/P8 parsers add their own variants.

Resolution: a new thin crate kb-parse-types sits between kb-core and parsers.
Owns ParsedBlock + ParsedPayload + Warning + forward-refs for image/pdf/audio
parser intermediates. Depends on kb-core only (for SourceSpan / Inline).

Updates:
- design §3.7b: add new section defining kb-parse-types
- design §8: add kb-parse-types to module-boundary diagram + forbidden list
- design §3.4 Inline stays in kb-core; kb-parse-types references it (no duplication)
- p0-1 skeleton: workspace + Cargo deps + public surface block
- p1-3 parse-md-blocks: outputs Vec<kb_parse_types::ParsedBlock> directly
- p1-4 normalize: Allowed gains kb-parse-types, drops cross-coupling note
- INDEX + phase-0 epic: list kb-parse-types in P0 deliverables
2026-04-27 20:41:35 +00:00
2de0572625 Merge pull request 'feat(spec): frozen design v1 + 30 component task specs' (#1) from feat/spec-and-task-decomposition into main 2026-04-27 13:19:21 +00:00
kb
b999a12ab5 tasks: address PR #1 review
- p3-3: SQLite-first/Lance-second + status marker (V003__embedding_status); drop "best-effort 2PC" misnomer
- p4-3: replace print_stream FnMut closure with mpsc::Sender<String> (RagPipeline stays Send+Sync)
- p4-3: tighten citation regex to strict [#<n>] only — reject [n]/prose/code-block false positives
- p5-2: compare_runs across chunker_version is graceful (doc + span overlap fallback) with chunker_version_match audit field; --strict-chunker-version restores refusal
- p7-1: per-page text via lopdf (pdf-extract has no per-page Rust API); use char count for spans
- p8-1: explicit rubato (FftFixedIn) for 16 kHz mono resample; symphonia decode only
- p9-5: drop cmd_read_pdf_page + pdfium native dep; cmd_read_file_bytes + frontend pdfjs; add traversal tests
2026-04-27 13:10:31 +00:00
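The strict `[#<n>]` citation rule from the p4-3 item above can be sketched without a regex dependency. This is a hand-rolled scanner, not the regex the spec calls for, and `extract_citation_indices` is a hypothetical name. It also does not skip fenced code blocks, which the spec additionally requires.

```rust
/// Returns every `n` from markers of the exact form `[#<digits>]`.
/// Bare `[n]`, empty `[#]`, and mixed forms such as `[#1a]` are ignored.
fn extract_citation_indices(text: &str) -> Vec<u32> {
    let bytes = text.as_bytes();
    let mut out = Vec::new();
    let mut i = 0;
    while i + 1 < bytes.len() {
        if bytes[i] == b'[' && bytes[i + 1] == b'#' {
            let digits_start = i + 2;
            let mut j = digits_start;
            while j < bytes.len() && bytes[j].is_ascii_digit() {
                j += 1;
            }
            // Require at least one digit and an immediate closing bracket.
            if j > digits_start && j < bytes.len() && bytes[j] == b']' {
                // Overlong digit runs that overflow u32 are dropped.
                if let Ok(n) = text[digits_start..j].parse() {
                    out.push(n);
                }
                i = j + 1;
                continue;
            }
        }
        i += 1;
    }
    out
}
```

The byte indices used for slicing always fall on ASCII characters, so the slice boundaries are valid even when the surrounding text contains multi-byte UTF-8.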
kb
eedaeccff2 tasks: update INDEX with full component-task tree (30 specs) 2026-04-27 12:14:58 +00:00
kb
f8b9f51d94 tasks: add P9 component specs (tui x4, desktop) 2026-04-27 12:14:16 +00:00
kb
7c10b15ad7 tasks: add P8 component specs (whisper, audio-chunker) 2026-04-27 12:09:51 +00:00
kb
d96d9cc56c tasks: add P7 component specs (pdf-extractor, pdf-chunker) 2026-04-27 12:07:56 +00:00
kb
c84ab03404 tasks: add P6 component specs (image-exif, ocr, caption) 2026-04-27 12:06:20 +00:00
kb
597a848af9 tasks: add P5 component specs (runner, metrics) 2026-04-27 12:04:06 +00:00
kb
ab7f6f110e tasks: add P4 component specs (llm-trait, ollama, rag-pipeline) 2026-04-27 12:02:18 +00:00
kb
5b813ce39e tasks: add P3 component specs (embedder, fastembed, lancedb, hybrid) 2026-04-27 11:59:46 +00:00
kb
c044b97a34 tasks: add P2 component specs (fts-schema, lexical-retriever) 2026-04-27 11:57:00 +00:00
kb
1cffed25ff tasks: add p0-1 skeleton component spec 2026-04-27 11:55:26 +00:00
kb
46f146584f tasks: add p1-6 store-sqlite component spec 2026-04-27 11:47:49 +00:00
kb
b711cfe5fd tasks: add p1-5 chunk component spec 2026-04-27 11:46:43 +00:00
kb
d4315dc602 tasks: add p1-4 normalize component spec 2026-04-27 11:45:53 +00:00
kb
4c0f2df44f tasks: add p1-3 parse-md blocks component spec 2026-04-27 11:44:59 +00:00
kb
7ae21424ca tasks: add p1-2 parse-md frontmatter component spec 2026-04-27 11:44:09 +00:00