kebab

Author	SHA1	Message	Date
altair823	bcbe2b8531	feat(p3-2): kb-embed-local crate — fastembed adapter for multilingual-e5-small First real Embedder implementation. Wraps fastembed-rs (ONNX runtime) with the e5 prefix convention, batching, and {data_dir}/${XDG_DATA_HOME} template expansion so model files land under config.storage.model_dir/fastembed/ without polluting kb-config's public API. Public surface: - pub struct FastembedEmbedder - pub fn new(config: &Config) -> Result<Self> - impl kb_core::Embedder (via kb-embed re-export) Behavior: - Default model multilingual-e5-small (384 dims). model_id and model_version come from config.models.embedding.{model,version}. - Pre-load dim check via TextEmbedding::get_model_info: dim mismatch bails before paying the ~470MB ONNX init cost. - e5 prefix applied BEFORE tokenization: "passage: " for EmbeddingKind::Document, "query: " for EmbeddingKind::Query. Pinned by prefix_input unit tests. - Batches inputs into chunks of config.models.embedding.batch_size, concatenates results in input order. - L2 normalization is performed by fastembed 4.9's default transformer pipeline (verified at fastembed/src/text_embedding/output.rs:43); we skip re-normalization. Integration test pins ‖v‖ ≈ 1.0 ± 1e-3 so a future fastembed bump that drops this invariant fails loudly. - Synchronous (no async runtime). Mutex serializes calls into the underlying ONNX session — conservative; ORT Session is Send+Sync but callers (kb-app indexer) batch sequentially anyway. Revisit if profiling shows contention. - First-run model download surfaces via tracing::info before/after TextEmbedding::try_new — users no longer stare at a silent 30-60s pause during the 470MB pull. Tests: - 11 default-lane tests covering: check_dim match/mismatch (no model load), prefix_input Document/Query/empty, resolve_model known/unknown, expand_path substitution + no-op + XDG_DATA_HOME set + XDG_DATA_HOME unset (falls back to ~/.local/share with recursive ~ expansion). 
XDG tests serialize on a Mutex + RAII guard since edition 2024 makes set_var/remove_var unsafe. - 7 #[ignore] integration tests covering: full construction with default config, dim-mismatch belt-and-braces, Document vs Query cosine differential, L2 unit norm, byte-equal determinism, batch-64 performance under 5s, snapshot-hash stability over a 5-sentence multilingual fixture. - Snapshot test fails LOUDLY when SNAPSHOT_HASH_BASELINE is 0 — prints the captured hash and panics with paste-back instructions, so first --ignored run forces the maintainer to pin the baseline rather than silently passing. - Workspace: 222 tests pass (default lane); clippy clean. Allowed deps respected: kb-config, kb-embed (re-exports kb-core trait surface), fastembed = "4.9", tracing, anyhow. tokenizers and ort enter transitively through fastembed; reqwest/hyper/hf-hub also transitive (model download is fastembed's responsibility per spec carve-out). No direct kb-core dep needed — re-exports cover it. Pinned to fastembed 4.x rather than the recent 5.x to limit blast radius; consider bump when p3-3 (lancedb-store) consumes the embedder output shape. Out of scope: reranker (P+), Ollama embedding endpoint, candle adapter, image embeddings (P6). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 08:39:38 +00:00
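The e5 prefix convention and order-preserving batching described in commit bcbe2b8531 can be sketched with the standard library alone. This is a minimal illustration, not the crate's code: the `EmbeddingKind` enum is redeclared locally, `embed_in_batches` and its `embed_batch` callback are hypothetical stand-ins for the fastembed call, and the behavior for empty text (prefix only) is an assumption.

```rust
/// Local stand-in for the `EmbeddingKind` named in the commit message.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum EmbeddingKind {
    Document,
    Query,
}

/// Prepend the e5 role prefix before the text reaches the tokenizer.
pub fn prefix_input(kind: EmbeddingKind, text: &str) -> String {
    let prefix = match kind {
        EmbeddingKind::Document => "passage: ",
        EmbeddingKind::Query => "query: ",
    };
    format!("{prefix}{text}")
}

/// Embed `inputs` in chunks of `batch_size` and concatenate the results
/// in input order. `embed_batch` is a hypothetical model call that must
/// return one vector per input string, in the same order.
pub fn embed_in_batches<F>(
    inputs: &[(EmbeddingKind, &str)],
    batch_size: usize,
    mut embed_batch: F,
) -> Vec<Vec<f32>>
where
    F: FnMut(&[String]) -> Vec<Vec<f32>>,
{
    let mut out = Vec::with_capacity(inputs.len());
    // `max(1)` guards against a zero batch size, which `chunks` rejects.
    for chunk in inputs.chunks(batch_size.max(1)) {
        let prefixed: Vec<String> = chunk
            .iter()
            .map(|(kind, text)| prefix_input(*kind, text))
            .collect();
        out.extend(embed_batch(&prefixed));
    }
    out
}
```

Because each batch is appended as soon as it returns, output position `i` always corresponds to input position `i`, whatever the batch size.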
altair823	9f2afc73dc	Merge pull request 'feat(p3-1): embedder-trait — kb-embed crate + MockEmbedder' (#14) from feat/p3-1-embedder-trait into main Reviewed-on: altair823-org/kb#14	2026-05-01 08:21:37 +00:00
altair823	2e3eb8f437	feat(p3-1): kb-embed crate — Embedder trait re-export + MockEmbedder Establishes the kb-embed trait crate so concrete embedding adapters (p3-2 fastembed, future ollama-embed/candle) target a stable surface. Pure re-export of kb_core::{Embedder, EmbeddingInput, EmbeddingKind, EmbeddingModelId, EmbeddingVersion} plus a feature-gated deterministic mock for downstream tests. MockEmbedder (cfg(feature = "mock"), default OFF): - Per-component hash recipe: blake3(seed_le8 || kind_byte || text_len_le8 || text || i_le8). Length-prefixed text avoids the domain-separation ambiguity where two (text, i) pairs could shift bytes between text tail and the i field. - Document = 0u8, Query = 1u8 — same text different kind yields different vectors (mirrors e5 prefix behaviour). - Per component: blake3 first 8 bytes → u64 → reinterpret as i64 → f64/i64::MAX → f32. i64::MIN gives -1.0000000000000002 which f32 rounds to -1.0; range [-1, 1] holds. - L2 unit-normalised. Norm sums in f64 (avoid catastrophic precision loss) before f32 cast. Zero-norm guard skips the divide. - with_seed(...) constructor lets two embedders share identity but produce different vectors — useful for downstream parametric tests. Helpers: - assert_vector_shape(vecs, dims) — len + finite check. - assert_unit_norm(vecs, tolerance) — caller-supplied tolerance; 5e-4 documented as safe for dims=384 under f32 epsilon × √dims. Tests: - cargo test -p kb-embed (no features): 2 reexport/dyn-dispatch tests. - cargo test -p kb-embed --features mock: 7 tests including 100-case proptest asserting len == dims, all finite, ‖v‖ ≈ 1.0 within tolerance, Doc(text) byte-equal Doc(text), Doc(text) ≠ Query(text), Doc(text1) ≠ Doc(text2). - All 220 workspace tests pass; clippy clean for both default and mock-on feature configurations. Symbol gating: nm on the release rlib confirms zero MockEmbedder symbols under default features; three trait impl symbols under --features mock. 
Spec invariant "release builds MUST NOT include MockEmbedder" verified at the symbol level. Allowed deps respected: kb-core, kb-config, serde, thiserror, tracing, plus anyhow (forced by trait return type) and blake3 (justified by the determinism contract; already in workspace lockfile via kb-core). No fastembed/ort/tokenizers anywhere. Out of scope: real adapter (p3-2), reranker traits (P+). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 08:15:44 +00:00
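The per-component mapping and normalization steps listed for MockEmbedder in commit 2e3eb8f437 can be sketched without the blake3 dependency. This is only an illustration under assumptions: the function names are hypothetical, the 8 input bytes stand in for the first 8 bytes of the hash, and little-endian decoding is assumed because the commit does not state the byte order.

```rust
/// Map 8 bytes to a value in [-1, 1]:
/// bytes -> u64 -> reinterpret as i64 -> f64 / i64::MAX -> f32.
/// Little-endian decoding is an assumption of this sketch.
pub fn component_from_bytes(bytes: [u8; 8]) -> f32 {
    let raw = u64::from_le_bytes(bytes) as i64;
    (raw as f64 / i64::MAX as f64) as f32
}

/// Scale `v` to unit L2 norm in place. The sum of squares is accumulated
/// in f64 and only the final quotient is cast back to f32.
pub fn l2_normalize(v: &mut [f32]) {
    let sum_sq: f64 = v.iter().map(|&x| f64::from(x) * f64::from(x)).sum();
    let norm = sum_sq.sqrt();
    // Zero-norm guard: leave an all-zero vector untouched instead of dividing by 0.
    if norm == 0.0 {
        return;
    }
    for x in v.iter_mut() {
        *x = (f64::from(*x) / norm) as f32;
    }
}
```

Summing in f64 keeps the rounding error of the norm well below f32 precision for a few hundred components, which is what makes a tight unit-norm tolerance practical in tests.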
altair823	ed6a595672	Merge pull request 'feat(p2-2): lexical-retriever — kb-search crate + LexicalRetriever (FTS5 + bm25)' (#13) from feat/p2-2-lexical-retriever into main Reviewed-on: altair823-org/kb#13	2026-05-01 08:03:22 +00:00
altair823	b335151d18	feat(p2-2): kb-search crate + LexicalRetriever (FTS5 + bm25) Adds the first concrete kb_core::Retriever, exercising chunks_fts (P2-1) to answer SearchMode::Lexical queries. Returns Vec<SearchHit> with bm25-derived ranking, snippet() previews, and W3C-fragment-style Citation built from the chunk's first source_spans entry. New crate kb-search: - LexicalRetriever::new(Arc<SqliteStore>, IndexVersion). - search() builds an FTS5 MATCH expression by escaping every whitespace token into a quoted literal (inner " doubled); single-quote-wrapped text passes through verbatim as raw FTS5 syntax. Empty query short-circuits to Ok(vec![]). - bm25 normalization: score = -bm25 / (1 + |bm25|), bounded (0, 1] for any FTS5-returned negative bm25. - Snippet via snippet(chunks_fts, 3, '', '', '…', word_budget) where word_budget = snippet_chars / 4 clamped to [1, 64]; trim_snippet enforces the char cap on the way out (chars per design §6.4 — accepts the combining-mark trade-off). - Citation from chunks.source_spans_json first span: Line / Page / Region / Time forwarded; Byte / empty array fall back to Line{1,1} with a tracing::warn so forward-compat regressions surface. - Filters: tags_any (subquery on document_tags), lang (= column), trust_min (CASE-rank in SQL) all applied at SQL level. path_glob uses globset with literal_separator(true) — guarantees '*' does not cross '/' per spec Risks/notes — applied as Rust post-filter with +128 row over-fetch when set, then rank reassigned 1..k contiguously. - Determinism: ORDER BY score, f.chunk_id (lexicographic blake3 hex tiebreaker on identical bm25). Tested explicitly with two chunks of identical text content. - RetrievalDetail: method=Lexical, both lexical_score and fusion_score set, vector_* None. kb-store-sqlite: - Adds pub fn read_conn(&self) -> MutexGuard<'_, Connection>. Read-only contract is doc-only — kb-search needs MutexGuard for prepare_cached + iter, which a closure-scoped wrapper would awkwardly constrain. 
Closure variant left as a P3 follow-up. Tests (26 new): empty corpus, empty query, single hit + citation round-trip, snippet length cap, tags_any exclusion, lang+trust composition, path_glob with '*' not crossing '/', citation line round-trip, bm25 top-1 ∈ (0, 1], determinism (varied scores AND identical-score tiebreaker), index_version passthrough, snapshot (crates/kb-search/tests/fixtures/search/lexical/run-1.json — stable under bundled SQLite; KB_UPDATE_SNAPSHOTS=1 to regenerate). Workspace: 211 tests pass, cargo clippy --workspace --all-targets -D warnings clean. Allowed deps respected: kb-core, kb-config, kb-store-sqlite, rusqlite, tracing, thiserror, anyhow (forced by trait return type), serde_json (parses *_json TEXT columns), globset (path_glob '*' boundary). Out of scope (deferred): vector retriever (p3-3), hybrid fusion (p3-4), reranker (P+), Korean morphological tokenizer (P+). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 05:20:35 +00:00
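The token quoting and score normalization described in commit b335151d18 are small enough to sketch directly. The function names below are hypothetical, and the single-quote raw pass-through mentioned in the commit is deliberately left out, so this is a partial illustration rather than the retriever's actual code.

```rust
/// Build an FTS5 MATCH expression from free text: every whitespace-separated
/// token becomes a quoted string literal with inner `"` doubled.
/// Returns `None` for an empty query so the caller can short-circuit.
pub fn build_match_expr(query: &str) -> Option<String> {
    let quoted: Vec<String> = query
        .split_whitespace()
        .map(|tok| format!("\"{}\"", tok.replace('"', "\"\"")))
        .collect();
    if quoted.is_empty() {
        None
    } else {
        Some(quoted.join(" "))
    }
}

/// Normalize a raw FTS5 bm25 value (negative, more negative is better)
/// with score = -bm25 / (1 + |bm25|), so better matches score higher.
pub fn normalize_bm25(bm25: f64) -> f64 {
    -bm25 / (1.0 + bm25.abs())
}
```

Quoting every token means operators such as `AND`, `NEAR`, or a stray `*` in user input are matched as literal text instead of being parsed as FTS5 syntax.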
altair823	5aef478b96	Merge pull request 'feat(p2-1): fts-schema — chunks_fts + sync triggers + V002 migration' (#12) from feat/p2-1-fts-schema into main Reviewed-on: altair823-org/kb#12	2026-05-01 04:58:55 +00:00
altair823	94bfc50efd	feat(p2-1): chunks_fts virtual table + sync triggers (V002 migration) Adds FTS5 lexical index for chunks per design §5.5: chunks_fts virtual table (unicode61 remove_diacritics 2 tokenizer, contentless w/ UNINDEXED chunk_id+doc_id) plus chunks_ai/chunks_ad/chunks_au triggers that mirror every chunks mutation into chunks_fts inside the host transaction. V002 ships the verbatim §5.5 SQL block plus a one-shot backfill INSERT so existing P1 databases gain searchability without re-ingest. Refinery bookkeeping makes double-apply naturally idempotent. Adds rebuild_chunks_fts(&Connection) escape hatch for kb index --rebuild-fts (CLI wiring deferred to a later task). Uses SAVEPOINT instead of Transaction so callers can invoke from inside an outer transaction; WAL serializes writers so no DELETE/INSERT race vs. concurrent chunks mutators is possible. Tests (10): V001-only → V002 cold-upgrade backfill (literal path), chunks_ai/ad/au trigger sync, MATCH-token verification, rebuild idempotency, drift recovery, double-run no-op, V002 ↔ design §5.5 verbatim diff guard (anchored extraction from both files), WAL/SHM release on store drop. All 185 workspace tests pass. Allowed deps respected (kb-core, kb-config, rusqlite, refinery — no new deps). FTS query implementation deferred to p2-2 (lexical-retriever). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 04:42:15 +00:00
altair823	fd11dd054b	Merge pull request 'feat(p1-6): kb-store-sqlite — P1 terminal task' (#11) from feat/p1-6-store-sqlite into main Reviewed-on: altair823-org/kb#11	2026-04-30 17:47:45 +00:00
altair823	b7367dedfe	p1-6: doc-only TODO markers (section_label, doc_version invariant) M9: add a `TODO(P2/P3)` comment near the NULL persistence at documents.rs (put_chunks). The `section_label` column exists in the §5.5 DDL but neither the in-memory Chunk struct nor the §2.6 wire schema carries the field, so NULL is the correct canonical value today — flag the future-bump intent in-line rather than leaving it implicit. M10: add a one-line invariant comment near the i64 -> u32 narrowing for `doc_version` in `get_document`. The invariant is documented at the write site (UPSERT bumps by 1 per re-ingest) — restate it at the read site so the cast is not silently load-bearing. No behaviour change. No tests touched. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 17:34:17 +00:00
altair823	15b4d80efc	p1-6: rename StoreError::Sqlx -> Sqlite, drop dead assets_root helper M1: `Sqlx` is a misleading leftover — this crate uses `rusqlite`, not sqlx. Rename the variant (and the doc reference to it) to `Sqlite`. No external pattern matches; the variant is reached only via `#[from]`. M11: `assets_root` was an `#[allow(dead_code)]` helper introduced for a test that never landed. Delete it so the dead-code allow goes with it. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 17:33:50 +00:00
altair823	e41279de96	p1-6: harden store boundary (atomic asset write, poison-tolerant mutex, AssetId validation) Three Important review fixes on the kb-store-sqlite write path: I1. Atomic asset write. put_asset_with_bytes now stages bytes to `<final>.tmp.<pid>.<n>`, fsyncs, UPSERTs the row, then `rename`s into place (atomic on POSIX same-fs). On any failure between staging and rename we best-effort `remove_file` the temp so the previous orphan risk on UPSERT failure is gone. Reference mode is unchanged (no I/O, no orphan risk). I2. Poison-tolerant mutex. New `lock_conn` helper does `.lock().unwrap_or_else(|p| p.into_inner())`, so a single panic mid-transaction no longer poisons every subsequent store call. The rusqlite Transaction Drop already rolls back on panic, leaving the Connection state safe to reuse. All 13 prior `.expect("sqlite mutex poisoned")` sites in store.rs / documents.rs / jobs.rs now route through `lock_conn`. I3. AssetId shape validation. `kb_core::AssetId(pub String)` lets a hand-construction bypass the `FromStr` 32-hex invariant. Added `validate_asset_id` (32 ASCII hex chars) at every store entry that turns an AssetId into a path: `put_asset_with_bytes` and `DocumentStore::put_asset`. This shuts a potential path-traversal via `assets_path_for`'s `&id[..2]` shard slice. Tests: - `put_asset_with_bytes_orphan_cleanup_on_upsert_failure` — pre-seeds a row that takes the same `workspace_path` (UNIQUE), so the UPSERT trips a constraint not covered by `ON CONFLICT(asset_id)`. Asserts no final file and no leaked `.tmp.`. - `put_asset_with_bytes_rejects_invalid_asset_id` — passes `AssetId("../etc/passwd_padded_to_xx_xxxxx")` (32 chars, contains `/`). Asserts error and zero filesystem artifacts under `data_dir/assets/`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 17:33:19 +00:00
altair823	efdb71b1c3	p1-6: list_documents filter coverage test Round-trip three docs (en/ko, varied tags, varied trust) and exercise each DocFilter axis: default (all), lang, path_glob (workspace_path GLOB), tags_any (intersection via document_tags subquery + per-row tag hydration), and trust_min (Primary > Secondary > Generated rank gate).	2026-04-30 17:14:17 +00:00
altair823	111f40ddf0	p1-6: kb-store-sqlite test suite (8 categories) All 8 test categories from the task plan, plus a JobRepo subset: migration — tests/migration.rs: fresh DB after run_migrations exposes every required §5 table + index. unit (copy) — tests/asset_writer.rs: copy mode writes file with mode 0o644 + correct bytes. unit (ref) — tests/asset_writer.rs: reference mode does not write file; row records source path. unit (cs) — tests/asset_writer.rs: tampered checksum returns a Conflict-flavoured anyhow error. unit (idem) — tests/idempotency.rs: same put_document twice → 1 row, doc_version 1→2; tags re-derived. unit (rb) — tests/idempotency.rs: put_blocks with FK violation rolls back; pre-existing rows unchanged. contract — tests/contract_roundtrip.rs: drives kb-parse-md + kb-normalize + kb-chunk on fixtures/markdown/code-and-table.md, persists, then reloads via DocumentStore::get_document / get_chunk and asserts byte-equal round-trip. snapshot — tests/ingest_report_snapshot.rs + snapshots/ingest_report.snapshot.json: pin the wire JSON form of kb_core::IngestReport for an inline fixture run. jobs — tests/jobs.rs: create → progress → finish flow; error message round-trip; list filters on status/kind. Drops the unused `serde` direct dep from Cargo.toml; serde_json brings its own. Dev-deps confirmed via `cargo tree -p kb-store-sqlite --depth 1` to live only in the dev tree.	2026-04-30 17:13:03 +00:00
altair823	a3390d5171	p1-6: scaffold kb-store-sqlite crate + V001 full §5 DDL New workspace member crate `kb-store-sqlite` (allowed deps only: kb-core, kb-config, rusqlite[bundled], refinery, serde, serde_json, time, blake3, tracing, anyhow, thiserror; dev-deps add kb-parse-md / kb-normalize / kb-chunk for the contract round-trip test). Migration V001 replaces the P0-1 stub with the full §5 DDL (assets, documents, document_tags, blocks, chunks with policy_hash, embedding_records, jobs, ingest_runs, answers, eval_runs, eval_query_results) plus the §5 indexes. FTS5 virtual table + triggers remain deferred to V002 (P2-1). Public surface per task spec: SqliteStore::open / run_migrations / put_asset_with_bytes impl DocumentStore for SqliteStore (7 trait methods) impl JobRepo for SqliteStore (4 trait methods) StoreError { Sqlx, Migration, Conflict } Behavior: - Pragmas at open: foreign_keys=ON, journal_mode=WAL, synchronous=NORMAL, temp_store=MEMORY. - Asset writer: byte_len ≤ copy_threshold_mb * 1MiB → copy to data_dir/assets/<aa>/<asset_id> (mode 0o644 on Unix), else reference. blake3(bytes) verified against asset.checksum; mismatch → Conflict. - Idempotency: put_document UPSERTs and bumps doc_version + 1 on conflict; put_blocks / put_chunks DELETE-then-INSERT; document_tags re-derived per put_document. - get_document rehydrates blocks via payload_json ordered by stream ordinal. - list_documents builds dynamic WHERE from DocFilter (lang / trust_min / path_glob via GLOB / tags_any via document_tags subquery). - JobRepo: jobs.kind/status are stored as lowercase enum tags; create mints a 32-hex JobId via blake3(kind || payload || nanos). Tests follow in subsequent commits.	2026-04-30 17:08:36 +00:00
altair823	207a0ff61e	kb-chunk: regenerate long-section.chunks.snapshot.json baseline The snapshot now includes the policy_hash field on every Chunk per the preceding kb-core schema change. chunk_ids are unchanged because the chunk_id recipe (§4.2) already incorporated policy_hash via the chunker — the field is simply now visible in the wire form. Regenerated via: UPDATE_SNAPSHOTS=1 cargo test -p kb-chunk long_section_chunks_snapshot	2026-04-30 17:02:53 +00:00
altair823	094c4641ba	kb-chunk: populate Chunk.policy_hash field Set the new policy_hash field on every emitted Chunk to the same hex already computed for the chunk_id recipe (§4.2). No recipe / chunk_id change — only the field on the struct is now populated. Pairs with the kb-core hotfix (preceding commit) and unblocks P1-6's DocumentStore::put_chunks to read chunk.policy_hash directly per §5.5.	2026-04-30 17:02:17 +00:00
altair823	16b2a5c150	kb-core: add policy_hash field to Chunk struct (P1-6 schema reconcile) Add policy_hash: String to kb_core::Chunk to align with the §5.5 SQLite schema (chunks.policy_hash NOT NULL), so kb-store-sqlite persistence is a straight field copy rather than a recompute. This is a §9 schema migration: - §5.5 (the persistence schema) is authoritative. - §3.5 (the domain model) must accommodate. The chunker already computed policy_hash for the chunk_id recipe (§4.2); P1-5 stored it implicitly. We now hold it explicitly on the Chunk so any DocumentStore::put_chunks impl can read it directly. Follow-up commits update kb-chunk to populate the field and refresh the P1-5 snapshot baseline accordingly.	2026-04-30 17:02:11 +00:00
altair823	b46e69b9c0	Merge pull request 'feat(p1-5): kb-chunk md-heading-v1 chunker' (#10) from feat/p1-5-chunk into main Reviewed-on: altair823-org/kb#10	2026-04-30 16:58:33 +00:00
altair823	ceeac9f974	p1-5: doc rationale for respect_markdown_headings, policy_hash panic, overlap accounting Doc-only follow-ups for review minors I1, M3, M4. No behavior change. * I1: rustdoc on `MdHeadingV1Chunker` now records that `policy.respect_markdown_headings` flows into `policy_hash` only; the `md-heading-v1` variant unconditionally treats headings as boundaries by name. To disable heading awareness, ship a different `chunker_version` (none in P1-5). * M3: `# Panics` rustdoc on `policy_hash` documents the unreachable-in-practice failure mode of `serde_json_canonicalizer::to_vec` and explains why the `expect` is retained as future-proofing. * M4: Inline comment at the `would_exceed` decision noting that `acc.text_tokens` already includes the prior chunk's overlap seed, and that the I3 clamp guarantees a flush here never produces a chunk smaller than the seed budget. * Heading-path bullet in the behavior contract updated to reflect the I2 fix wording. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 16:52:18 +00:00
altair823	a81460f9d0	p1-5: clamp overlap seed budget to target_tokens / 2 A pathological `ChunkPolicy { overlap_tokens >= target_tokens }` caused `md-heading-v1` to degenerate into 1-block-per-chunk: the seeded `acc.text_tokens` already exceeded `target_tokens` before any fresh content landed, so the next paragraph immediately tripped the `would_exceed` flush. Clamp the seed budget in `collect_overlap_seed` to `min(overlap_tokens, target_tokens / 2)`. This guarantees that after seeding, the chunk has at least `target/2` worth of room for new content before flushing, restoring the intended paragraph-overlap behavior on every reasonable and unreasonable policy. Adds a regression test pinning a 50/200 (overlap = 4× target) policy and asserting every emitted chunk holds ≥2 blocks. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 16:51:00 +00:00
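The clamp from commit a81460f9d0 reduces to a single `min`, and a small sketch shows why it prevents the 1-block-per-chunk degeneration. Only the formula `min(overlap_tokens, target_tokens / 2)` comes from the commit; `seed_indices` is a hypothetical model of tail-block selection, not the crate's `collect_overlap_seed`.

```rust
/// Seed budget after the clamp: never more than half the target size.
pub fn clamped_seed_budget(overlap_tokens: usize, target_tokens: usize) -> usize {
    overlap_tokens.min(target_tokens / 2)
}

/// Hypothetical tail selection: walk the previous chunk's blocks from the
/// end and keep those whose token estimates fit in the clamped budget.
/// Returns the selected block indices in document order.
pub fn seed_indices(
    prev_block_tokens: &[usize],
    overlap_tokens: usize,
    target_tokens: usize,
) -> Vec<usize> {
    let budget = clamped_seed_budget(overlap_tokens, target_tokens);
    let mut used = 0;
    let mut picked = Vec::new();
    for (i, &tokens) in prev_block_tokens.iter().enumerate().rev() {
        if used + tokens > budget {
            break;
        }
        used += tokens;
        picked.push(i);
    }
    picked.reverse();
    picked
}
```

With the pathological policy from the regression test (target 50, overlap 200) the budget becomes 25, so at least half of each new chunk remains available for fresh content.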
altair823	f780c71ce0	p1-5: heading-only chunk carries self in heading_path When a Heading block opens a chunk and is followed by another Heading or an atomic block (Code, Table, ImageRef, AudioRef) with no intervening prose, the prior fallback used `common.heading_path` from the heading itself — which per kb-normalize convention does NOT include the heading's own text. Result: heading-only and heading-led chunks for `# Alpha\n## Beta\n...` patterns landed with `heading_path = []`, losing citation context. Synthesize the leading heading into the chunk's heading_path when blocks[0] is a Heading: parent path + heading.text. The first non-Heading branch (existing logic for normal mid-section chunks) is unchanged. `chunk_id` recipe is `(doc_id, chunker_version, block_ids, policy_hash)` — `heading_path` is not in the recipe, so this fix does NOT shift chunk_ids. Snapshot baseline `long-section.chunks.snapshot.json` also unchanged because every heading in that fixture is followed by a paragraph (the bug only triggers on direct heading→heading or heading→atomic adjacency). Adds `heading_with_parents` test helper and a regression test pinning the `# Alpha\n## Beta\n[code]` pattern. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 16:50:12 +00:00
altair823	58f7b8573d	p1-5: add long-section fixture + Vec<Chunk> snapshot test Bakes the chunker output for fixtures/markdown/long-section.md (3 H1s, nested H2 under Alpha, a 50-line code block, a 3-col x 4-row table, and a multi-paragraph Gamma section) into the JSON snapshot baseline. Confirms the priority rules end-to-end: - Heading boundaries hold across H1 → H2 → H1 transitions - The code block emits one chunk at 427 tokens > target=200 - The table stays single-chunk - Gamma's paragraph stream splits with one block of overlap seed A second test runs the full parse → normalize → chunk pipeline 5 times and asserts identical chunk_ids each pass. Drops the unused `kb-config` and `serde` from regular dependencies — they were declared but no source path imports them; `serde` flows in transitively via `kb-core` as a public API requirement, and `ChunkingCfg` lives in `kb-config` but the chunker takes `ChunkPolicy` directly. Production deps are now exactly the allowed set actually used: anyhow, blake3, kb-core, serde_json_canonicalizer, tracing. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 16:33:29 +00:00
altair823	0237022d0e	p1-5: implement md-heading-v1 chunking rules Fills in MdHeadingV1Chunker::chunk with the priority-ordered ruleset from the design (§0 / §14): 1. Heading is a hard boundary; the heading itself starts and is included in its chunk so heading text is retrievable. 2. Code blocks never split, regardless of size. 3. Tables stay single-chunk (row-split deferred per task spec). 4. Long sections split at target_tokens with paragraph-level overlap_tokens worth of seeded tail blocks. 5. ImageRef / AudioRef each become their own chunk with token_estimate = 0. 6. heading_path lifts from the first contributing non-Heading block; source_spans concatenate in document order. 7. chunk_id derives from id_for_chunk(doc_id, chunker_version, block_ids, policy_hash) per §4.2. Covers the unit + determinism rows of the P1-5 test plan: heading boundary respected, 800-token code block stays single, small table stays single, long paragraph chain splits with overlap, ImageRef chunk has token_estimate=0, and 1000-iter chunk_id determinism. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 16:30:26 +00:00
altair823	8142449eb7	p1-5: scaffold kb-chunk crate with MdHeadingV1Chunker skeleton Adds the new workspace member with the bare Chunker impl shape: chunker_version() returns "md-heading-v1"; policy_hash() blake3-hashes canonical JSON of ChunkPolicy and truncates to 16 hex chars; chunk() is an empty stub the next commits fill in. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 16:27:42 +00:00
altair823	4665910370	Merge pull request 'feat(p1-4): kb-normalize + kb-core Inline schema hotfix' (#9) from feat/p1-4-normalize into main Reviewed-on: altair823-org/kb#9	2026-04-30 16:23:16 +00:00
altair823	557275c04e	p1-4: doc-only follow-ups for deferred review minors (M8, M9, M11, M12) M8: kb-parse-md frontmatter doc-comment claimed filename fallback was P1-4's job; P1-4 spec did not include it. Reconcile: defer to a later phase (P1-7 / kb-app integration) where the workspace_path filename is known to the caller. Updated comment in build_metadata(). M9: kb-parse-md tests use the #[ignore] regenerator pattern, while kb-normalize's integration test uses an UPDATE_SNAPSHOTS=1 env-var. Migrating kb-parse-md is out of scope; one-line note added to blocks_snapshots.rs mod doc-comment to flag the intentional split. M11, M12: doc-only comments in lift_block (already added in the previous commit) — list-item shared block_id rationale and the intentional camel-case Debug-format for WarningKind in Provenance notes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 15:42:02 +00:00
altair823	e0df42984e	p1-4: address review I1-I3 + minors (extract attribution, audio-ref skip, NFC heading_path) I1: warning_agent maps ExtractFailed → "kb-parse-md" (the panic-recovery emitter in kb-parse-md/src/blocks.rs). Lift-stage warnings from build_canonical_document are tracked separately and attributed to "kb-normalize", so the I1 mapping change does not lie about kb-normalize-originated drops. I2: ParsedPayload::AudioRef no longer synthesizes Block::AudioRef with an invalid empty AssetId (would violate AssetId::from_str's 32-hex invariant). Block is dropped, Warning surfaces in Provenance with src mention, attributed to kb-normalize (lift-stage decision). TODO(P8) comment marks this as a placeholder until the audio extractor lands. I3: NFC-normalize each heading_path string in lift_block before feeding into id_for_block AND into CommonBlock.heading_path. pulldown-cmark does not NFC heading text and serde_json_canonicalizer v0.3 does not either, so canonically-equivalent NFD/NFC inputs would produce different block_ids without this normalization. Mirrors the existing doc_id NFC handling via to_posix. Minors: - M4: trim Cargo.toml — drop kb-config, serde_json_canonicalizer, blake3 (unused); keep tracing (now wired) + unicode-normalization (now used by I3). - M5: determinism_1000_iterations_under_1s now uses the same 5-block fixture as block_ordinals_scoped_per_heading_and_kind (extracted into fixture_blocks_five helper) so the determinism property is exercised on a real lift_block path, not just an empty Vec. Still < 1s. - M6: snapshot integration test now passes BodyHints { first_h1: Some("Code And Table"), .. } and asserts doc.title == "Code And Table" end-to-end. Baseline JSON updated. - M7: title/lang edge-case unit tests pin policy: empty string lifts to empty string; non-stringy values silently drop. Rustdoc updated. - M10: provenance_contains_stage_events_in_order asserts events[1].at == events[2].at to pin the shared-now_utc invariant. 
New tests (unit, kb-normalize): - provenance_with_extract_failed_warning_attributes_to_kb_parse_md (I1) - audio_ref_block_skipped_with_warning (I2) - nfc_nfd_korean_heading_path_same_block_id (I3) - title_empty_string_in_user_map_falls_back_to_default (M7) - title_non_string_in_user_map_silently_drops (M7) - lang_invalid_shape_silently_drops (M7) kb-normalize unit tests: 9 → 14. Integration snapshot: 1 (unchanged). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 15:41:50 +00:00
altair823	5352bede5c	p1-4: snapshot + determinism tests Add the integration snapshot test pinning the full `CanonicalDocument` JSON for `fixtures/markdown/code-and-table.md` (run through the real `kb-parse-md::parse_frontmatter` + `parse_blocks`, dev-dep only). Non-deterministic `provenance.events[*].at` for the Parsed and Normalized events is stripped before comparison; the Discovered event's `at` is pinned by constructing the test `RawAsset` with a fixed `discovered_at`. Run with `UPDATE_SNAPSHOTS=1` to regenerate. Add the 1000-iteration determinism property: same inputs ⇒ byte-identical JSON (modulo the same stripped timestamps), in under one second of wall-clock time. A regression in canonical JSON, BLAKE3 hashing, ordinal counting, or any other deterministic field would surface here immediately. The integration test depends on `kb-parse-md` only as a dev-dep, so `cargo tree -p kb-normalize --depth 1 --edges normal` confirms no parser implementation appears in the production dep tree per design §8. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 15:20:18 +00:00
altair823	1cc0ba9f37	p1-4: provenance + title/lang lift Build a `Provenance` with one event per pipeline stage (`Discovered` sourced from `RawAsset.discovered_at`, then `Parsed` and `Normalized` stamped with one shared `now_utc()` reading), plus one `Warning` event per upstream warning. Sharing `now` between Parsed and Normalized bounds intra-call timestamp jitter — event ordering is preserved by `Vec` position regardless. Warning agents are routed back to the upstream component (`kb-parse-md` for parse warnings, `kb-normalize` for `ExtractFailed`). Lift `metadata.user["title"]` and `metadata.user["lang"]` (where P1-2 stashes them since the `Metadata` struct itself does not carry those fields) into `CanonicalDocument.title` / `CanonicalDocument.lang`. Both keys are removed from the user map after lifting so the wire form does not duplicate the data; missing keys default to empty string / empty `Lang`. Other user-map keys survive. Tests pin the event ordering, the warning routing, and the lift behavior (including non-duplication in `metadata.user`). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 15:19:33 +00:00
altair823	fc05f3a2be	p1-4: build_canonical_document core + ID assignment Implement the §4.3 ordinal rule and §3.4 block lift. Each `ParsedBlock` maps to a `kb_core::Block` variant carrying a `CommonBlock` whose `block_id = id_for_block(doc_id, payload_kind, heading_path, ordinal, source_span)`. Ordinals are scoped to `(heading_path, payload_kind)`, 0-based, in document order — three paragraphs under one H1 get 0/1/2, a code block under the same H1 starts fresh at 0, a paragraph under a different H1 also starts at 0. `payload_kind` is the lowercase-no-spaces convention from §4.2: "heading", "paragraph", "list", "code", "table", "quote", "imageref", "audioref". `ListBlock.items` re-uses the parent list's `CommonBlock` per §3.4 (no per-item BlockId is allocated). `AudioRefBlock` placeholder fields (`asset_id`, `duration_ms`) are filled in by P8 — for now we synthesize the minimal record so the document is well-typed. Tests pin the four §4.4 ID properties (1000-iteration determinism, NFC ≡ NFD Korean path, `./a/b.md` ≡ `a/b.md`, ordinal grouping). Provenance and title/lang lift land in the next commit. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 15:18:19 +00:00
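The ordinal rule described in commit fc05f3a2be (0-based counters scoped to `(heading_path, payload_kind)`, in document order) can be sketched with a `HashMap`. The function name and the tuple representation of a block are hypothetical; the real lift works on `ParsedBlock` values and feeds the ordinal into `id_for_block`.

```rust
use std::collections::HashMap;

/// Assign each block a 0-based ordinal within its (heading_path, payload_kind)
/// scope, visiting blocks in document order.
pub fn assign_ordinals(blocks: &[(Vec<String>, &str)]) -> Vec<u32> {
    // Next free ordinal per scope; a scope starts at 0 the first time it is seen.
    let mut next: HashMap<(Vec<String>, String), u32> = HashMap::new();
    blocks
        .iter()
        .map(|(path, kind)| {
            let slot = next.entry((path.clone(), kind.to_string())).or_insert(0);
            let ordinal = *slot;
            *slot += 1;
            ordinal
        })
        .collect()
}
```

For three paragraphs and one code block under `Alpha` plus one paragraph under `Beta`, the paragraphs under `Alpha` receive 0, 1, 2, while the code block and the `Beta` paragraph each start again at 0, matching the example in the commit message.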
altair823	c0096ce44b	p1-4: scaffold kb-normalize crate Add the workspace member, `Cargo.toml` with the §8-allowed dep set (kb-core, kb-parse-types, kb-config, serde, serde_json_canonicalizer, blake3, unicode-normalization, time, anyhow, tracing) and a stubbed `build_canonical_document` that pins the public signature plus `doc_id` derivation. `kb-parse-md` is permitted only as a dev-dep so the integration snapshot test (added later in this series) can drive a fixture through the real parser without violating the production boundary — `cargo tree -p kb-normalize --depth 1 --edges normal` confirms no parser implementation appears in the regular dep tree. `id_for_doc` and `id_for_block` are re-exported from kb-core (which holds the canonical recipe per §4.2); kb-normalize is the canonical entry point per design §8. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 15:16:53 +00:00
altair823	cfccb3687d	p1-3: update kb-parse-md callers + drop BlockView projection in snapshots Mechanical sweep over `Inline::Text(_)` / `Code(_)` / `Strong(_)` / `Emph(_)` construction and match sites under the new struct-variant shape introduced in the previous commit. `Inline::Link { text, href }` is unchanged. The snapshot test in `tests/blocks_snapshots.rs` previously projected `ParsedBlock` into a `BlockView`/`PayloadView` shim because the old `Inline` could not serialize. With the schema fix in place we now serialize `ParsedBlock` directly through serde — the shim and its `flatten_inline` helper are removed. Inlines surface as structured objects (`{"kind":"text","text":"…"}` etc.). Regenerated `nested-headings.blocks.snapshot.json` to reflect the new shape via the existing `--ignored` emitter; `code-and-table.blocks.snapshot.json` has no inlines and is unchanged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 15:10:54 +00:00
altair823	606ce1cf66	kb-core: hotfix Inline serde schema (struct variants) `#[serde(tag = "kind")]` rejects newtype variants whose payload is not a struct, so 4 of 5 `Inline` variants (`Text(String)`, `Code(String)`, `Strong(Vec<…>)`, `Emph(Vec<…>)`) failed to serialize at runtime — only `Link { text, href }` worked. Convert every variant to struct form so the internally-tagged shape is well-formed and round-trips through JSON. Add `inline_serde_round_trip` covering all five variants. Per design §9, this is a wire-schema migration; no `docs/wire-schema/v1/*.json` change required since `Inline` is not directly referenced there. Callers in kb-parse-md follow in the next commit. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 15:10:40 +00:00
altair823	8ce44af95a	Merge pull request 'feat(p1-3): kb-parse-md blocks (Markdown body → ParsedBlock tree)' (#8) from feat/p1-3-parse-md-blocks into main Reviewed-on: altair823-org/kb#8	2026-04-30 15:03:24 +00:00
altair823	80123e9e27	p1-3: pin reviewer probe inputs as regression tests The quality reviewer named three specific input probes for the C1/C2/C3 fixes. Encode each as a verbatim test so future regressions on those exact inputs surface immediately: - probe_overflow: parse_blocks(b"# h\nbody\n", u32::MAX) → empty + Warning::ExtractFailed. - probe_list_escape: list with embedded code block → single List block, two items. - probe_empty_heading: `# \n# Real\nbody\n` → body's heading_path is `["Real"]`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 14:42:21 +00:00
altair823	2b6d9abc0f	p1-3: doc-comment + test pin Quote drops non-text children `ParsedPayload::Quote { text, inlines }` cannot represent block-level children (lists, code, tables, images), so the BlockQuote end handler silently drops them when assembling the Quote payload. This matches §3.4 for now but is non-obvious and easy to regress without an explicit pin. Add a TODO(P1-future) comment near the Quote emission code and a regression test (`quote_with_list_inside_drops_list`) that fixes the current shape: a `> - item` blockquote produces a Quote with empty text and empty inlines. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 14:41:47 +00:00
altair823	23ff4d68af	p1-3: preserve whitespace in link text across SoftBreak/HardBreak `[multi\nline](http://x)` produced `Inline::Link.text = "multiline"` because the SoftBreak/HardBreak handler called `push_text(" ")` — which updates `paragraph.text` and the inline buffer, but NOT the open link frame's flattened text accumulator. Text events flowed through `push_link_text`; line breaks didn't. Add `push_link_text(" ")` alongside the existing `push_text(" ")` in the break handler so a line break inside `[ ... ](href)` collapses to a visible space rather than disappearing. New tests: - link_with_soft_break_preserves_space_in_text - link_with_hard_break_preserves_space_in_text Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 14:40:42 +00:00
altair823	73040cab30	p1-3: capture image refs from pulldown-cmark Tag::Image events The previous block-level image detector scanned paragraph source bytes for the literal `![alt](src)` shape. That was fragile in three ways: - `![alt](src "title")` leaked the title into `src` (`src "title"`) - `![alt](<https://x.com/a b>)` kept the angle brackets verbatim - `![]()` had undefined behavior Replace the byte-scan with state on `Frame::Paragraph` that observes the actual `Tag::Image` events from pulldown-cmark: - `image_count` increments on each `Start(Tag::Image)` and `image_src` captures `dest_url` (which already strips angle brackets and excludes the title). - Text events seen while `image_depth > 0` are routed into `image_alt` and suppressed from the inline buffer. - Strong/Emph/Link starts and any non-image text outside the image flag `non_image_text_seen`. At `End(Paragraph)`, the paragraph is lifted to `ImageRef` iff `image_count == 1 && !non_image_text_seen`. The byte-scanner `match_block_image` is removed. New tests: - image_with_title_attribute (title dropped, no leak into src) - image_with_angle_bracketed_url (brackets stripped) - empty_image_alt_and_src (`![]()` pins to empty/empty) Existing image tests (`image_ref_block_captures_src_and_alt`, `inline_image_inside_paragraph_is_dropped`) continue to pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 14:40:04 +00:00
altair823	d49dbc1926	p1-3: skip empty heading text when building heading_path `# ` (a heading with no following text) used to seed the heading stack with `Some("")`, which then propagated into every child block's `heading_path` as a `""` segment — visibly polluting the path that downstream consumers index by. Filter empty entries from both `heading_path()` and the in-line ancestor collection at heading-end. We deliberately keep `Some("")` in the stack rather than skipping the assignment so the slot remains occupied and a subsequent deeper heading is still positioned correctly relative to its level — only the visible path drops the empty. New tests: - empty_heading_does_not_pollute_path - empty_h1_then_h2_does_not_break_stack Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 14:37:40 +00:00
altair823	0050cf32ea	p1-3: route block-level content inside list items into parent inline buffer `emit_block` previously walked the frame stack looking only for a Quote container, falling back to top-level on miss. That caused any block emitted inside a list item — code blocks, images, tables, headings — to escape the list and appear at the top of `blocks`, after the entire list and out of source order. `ParsedPayload::List { items: Vec<Vec<Inline>> }` cannot represent a child block structurally, so the choice is between dropping content and flattening. Extend the reverse-walk to also recognize `Frame::ListItem` and route the block into a textual rendering appended to the item's inline buffer (`flatten_block_into_item`): - Code → fenced text approximation, preserving lang hint + body - Image → `![alt](src)` text - Audio → `[audio](src)` text - Heading → leading hashes + text - Quote → `> text` - Nested List → same rendering as `nested_in_item` flatten - Table → pipe-table approximation Document order is preserved because flattening happens inside the item's frame, before the item closes. New tests: - code_block_inside_list_item_flattens_into_parent - image_inside_list_item_flattens_into_parent - block_content_in_list_preserves_document_order Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 14:36:27 +00:00
altair823	de9164802b	p1-3: fix span arithmetic overflow + body_offset_lines fuzz `span_for` previously used `u32 + u32` directly, so callers passing a large `body_offset_lines` could panic (debug, then masked by `catch_unwind` and the entire body discarded) or wrap to an inverted span with `start > end` (release). Switch to `checked_add`; on overflow flag the walk state and at the end of `parse_blocks_inner` discard accumulated blocks and surface a single `Warning::ExtractFailed` carrying the offending body line. This degrades cleanly without panicking and without emitting a silently-broken span. Also extend `random_bytes_do_not_panic` to mix u32::MAX-style offsets across the fuzz iterations so the overflow path is exercised by the randomized corpus. New tests: - body_offset_lines_max_returns_extract_failed - body_offset_lines_zero_at_max_minus_one_no_overflow Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 14:34:58 +00:00
altair823	2ac0f56105	p1-3: drop dead bindings in heading + table end handlers Removes the leftover `let _ = level_u8;` and `let _ = header_count;` discards. The Heading frame already carries the canonical level (we populated it from `Tag::Heading` at Start), so we destructure that directly and ignore the redundant `TagEnd::Heading(level)` payload. The header_count helper was dead — Frame::Table tracks `cols` internally and we never consumed `header_count`.	2026-04-30 14:18:54 +00:00
altair823	f604a381df	p1-3: snapshot tests + clippy fix Adds two snapshot tests (`nested-headings.md`, `code-and-table.md`) under crates/kb-parse-md/tests/blocks_snapshots.rs, with matching baseline JSON next to each fixture. The snapshot view projects `kb_core::Inline` to flat strings — `Inline` carries `serde(tag = "kind")` which is incompatible with newtype variants holding a primitive (`Text(String)`), so direct serialization of `ParsedBlock` would fail today. The view preserves the contract that matters for P1-3 (heading paths, source spans, payload kinds, payload text/code/table content) and will keep working once kb-core fixes the Inline schema in a later task. Also tightens `level_to_use >= 1 && <= 6` into `(1..=6).contains(&_)` to satisfy `clippy::manual_range_contains`.	2026-04-30 14:17:41 +00:00
altair823	4e7e9cad87	p1-3: add parse_blocks (pulldown-cmark walker) submodule Implements `kb_parse_md::parse_blocks(body, body_offset_lines)` returning a flat `Vec<ParsedBlock>` plus warnings. Walks pulldown-cmark events through a small frame-based state machine that tracks heading paths, accumulates inline buffers (Text/Code/Link/Strong/Emph only — design §3.4), and reports SourceSpan::Line spans in 1-indexed file-line coordinates. Covers headings, paragraphs, code blocks (lang from info string), GFM tables (with malformed fallback to paragraph + MalformedTable warning), lists (nested sub-lists flattened into parent item), and block-level image references. Inline images are dropped silently per the inline filter. Adversarial inputs are caught with `catch_unwind` and degrade to an empty output + ExtractFailed warning. 15 unit tests cover heading-path correctness, code lang, table parsing, malformed-table fallback (driven via synthetic events since pulldown-cmark auto-normalizes table widths), LF/CRLF line-range parity, image refs, nested-list flattening, inline filter, and 100-iteration random-bytes plus hand-crafted adversarial-input no-panic guards.	2026-04-30 14:14:34 +00:00
altair823	ff37ea5927	Merge pull request 'feat(p1-2): kb-parse-md frontmatter (YAML/TOML → Metadata)' (#7) from feat/p1-2-parse-md-frontmatter into main Reviewed-on: altair823-org/kb#7	2026-04-30 14:06:54 +00:00
altair823	5850bfcf7a	p1-2: address review minors (FrontmatterSpan doc, parse_frontmatter rustdoc, YAML library note) M1: Reword the FrontmatterSpan doc-comment from "technically meant to be crate-internal" to a forward-looking note about P1-3 / P1-4 callers using bytes[span.end..] for body slicing. M3: Add an explicit `# Errors` section to parse_frontmatter's rustdoc. The current implementation never returns Err — all recoverable problems are downgraded to warnings — but the Result is kept on the signature so future hard-fail conditions can be added without breaking callers. M4: Mention serde_yml in the library-choice rationale alongside serde_yaml_ng, with a one-line note on why _ng was preferred (stricter adherence to original serde_yaml semantics around null / tagged enums). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 13:13:16 +00:00
altair823	6a4db624b6	p1-2: fix CRLF / trailing whitespace / BOM in frontmatter delimiter detection C1: detect_delimiters now returns (DelimKind, FrontmatterSpan, Range<usize>) where the inner range is the YAML/TOML payload byte range — derived in one place rather than recomputed by the parser via fixed-width opening_len / closing_len constants that wrongly assumed LF endings. CRLF input now parses correctly end-to-end; the originally-failing reviewer probe "---\r\ntitle: Doc\r\n---\r\nbody\r\n" now yields title="Doc" with no warnings. I1: Trailing horizontal whitespace (spaces / tabs) on either delimiter line is now accepted, matching Hugo / Jekyll. Editors that auto-trim trailing whitespace no longer silently break otherwise-valid frontmatter. I2: A leading UTF-8 BOM (EF BB BF, byte 0 only) is tolerated and skipped before delimiter scanning. The returned span.start accounts for the BOM (=3) so callers using bytes[span.end..] for body slicing still get the correct range without further bookkeeping. Mid-input BOMs are not stripped. M2: Drop the now-dead DelimKind::opening_len / closing_len constants — the inner range is encoded once at detection time. 12 new tests covering CRLF (YAML / TOML / mixed-EOL / end-to-end), trailing whitespace on opener / closer / tabs, leading BOM (detection + full pipeline), and mid-input BOM non-stripping. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 13:12:34 +00:00
altair823	1fab6b0207	p1-2: address spec review (drop user_id_alias mirror in user map) Spec §"Behavior contract" line 74 says `id:` is captured into `metadata.user_id_alias` only. Remove the redundant `user.insert` that was also writing it into the user map, and update the snapshot baseline accordingly. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-30 13:02:28 +00:00
altair823	42a7d53e5d	p1-2: fixtures + snapshot tests for frontmatter parser Two markdown fixtures with hand-authored JSON baselines that pin the §0 Q9 derive output across runs: - frontmatter-only.md exercises the YAML happy path with most fields, unknown keys, an `id:` field, and a non-UTC created_at (so the baseline shows original_timestamps preservation). - mixed-lang.md is body-only with no `lang:` field; baseline pins the lingua autodetect result for our enabled language set. A separate `emit_snapshots` test (marked `#[ignore]`) regenerates the baselines from the current parser output. A determinism test parses the fixture twice and asserts equality so any non-determinism (e.g. key ordering, lingua nondeterminism) fails fast.	2026-04-30 12:56:19 +00:00
altair823	cc8f7dad3f	p1-2: parse_frontmatter + §0 Q9 derive table Implement the frontmatter submodule: - detect_delimiters scans for a leading YAML (---) or TOML (+++) block at byte 0. Strict per §0 Q9: no leading whitespace / BOM, no chars on the delimiter line. Closing must be its own line. Unterminated → no FM. - parse_raw deserializes into RawFrontmatter, a serde-flatten struct that catches unknown keys into a serde_json::Map for verbatim preservation in metadata.user. - derive_metadata implements the §0 Q9 fallback chain: title → frontmatter | BodyHints.first_h1 | (filename: caller); aliases/tags → frontmatter | []; lang → frontmatter | lingua autodetect on first 4 KB | hints | "und"; created_at → frontmatter (RFC 3339, normalized to UTC) | fs_ctime; updated_at → frontmatter | fs_mtime; source_type → frontmatter | "markdown"; trust_level → frontmatter | "primary"; id → user_id_alias only, never a doc_id factor (§4.2) - Non-UTC offsets are normalized to UTC; the original string is preserved in user.original_timestamps[field] per §0 Q9. - Warnings are emitted for: malformed YAML/TOML, unknown enum values, malformed timestamps. Unknown keys are silent. - lingua detector is cached in a OnceLock; first build is heavy. - 15 unit tests cover every row of the derive table + delimiter edge cases + an explicit pin that `id:` does not feed id_for_doc.	2026-04-30 12:56:02 +00:00
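
The strict delimiter rule described in cc8f7dad3f (block must start at byte 0, nothing else on the delimiter line, closing delimiter on its own line, unterminated input means no frontmatter) can be sketched roughly as below. This is an illustrative reconstruction and not the crate's code: the return shape (kind plus payload byte range) is an assumption, the variant names of `DelimKind` are guesses, and only LF line endings are handled, matching the pre-6a4db624b6 behavior.

```rust
use std::ops::Range;

/// Which delimiter opened the block. Variant names are assumptions.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum DelimKind {
    Yaml, // opened and closed by a line that is exactly `---`
    Toml, // opened and closed by a line that is exactly `+++`
}

/// Hypothetical strict detector: returns the delimiter kind and the byte
/// range of the payload between the delimiter lines, or `None` when the
/// input has no well-formed frontmatter block.
fn detect_delimiters(input: &[u8]) -> Option<(DelimKind, Range<usize>)> {
    // The opener must sit at byte 0 and be followed immediately by LF,
    // so leading whitespace, a BOM, or trailing characters all reject.
    let kind = if input.starts_with(b"---\n") {
        DelimKind::Yaml
    } else if input.starts_with(b"+++\n") {
        DelimKind::Toml
    } else {
        return None;
    };
    let marker: &[u8] = match kind {
        DelimKind::Yaml => &b"---"[..],
        DelimKind::Toml => &b"+++"[..],
    };

    let payload_start = marker.len() + 1; // marker plus the LF
    let mut line_start = payload_start;
    loop {
        // Find the end of the current line (exclusive of the LF).
        let line_end = input[line_start..]
            .iter()
            .position(|&b| b == b'\n')
            .map(|i| line_start + i)
            .unwrap_or(input.len());

        // The closer must be the whole line, byte for byte.
        if &input[line_start..line_end] == marker {
            return Some((kind, payload_start..line_start));
        }
        if line_end == input.len() {
            return None; // ran out of input: unterminated, so no frontmatter
        }
        line_start = line_end + 1;
    }
}
```

Under these assumptions, `detect_delimiters(b"---\ntitle: Doc\n---\nbody\n")` yields the YAML kind with payload range `4..15`, while the CRLF probe from 6a4db624b6 (`"---\r\ntitle: Doc\r\n---\r\nbody\r\n"`) is rejected, which is the gap that later commit closes.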

98 Commits