11 Commits

Author SHA1 Message Date
53ec9b4dc5 test(chunk): regenerate AST + long-section snapshots for V009 chunk field
S3 의 Chunk struct 갱신 (kebab-core 의 tokenized_korean_text:
Option<String> field 추가) 가 모든 chunk snapshot JSON 의 serde
serialize 결과를 변경시킴. 10 snapshot fixture (9 AST chunker +
markdown long-section) 의 baseline 을 V009 형태로 regenerate.

각 snapshot 의 변경 = chunk JSON 마다 `"tokenized_korean_text":
null` field 추가 (대부분의 fixture 가 영어 코드라 lindera 의 None
fallback). 동작 변경 없음 — serde representation 의 cascade만.

Spec: docs/superpowers/specs/2026-05-28-v0.20.x-korean-morphological-tokenizer-spec.md §6.2
Plan: docs/superpowers/plans/2026-05-28-v0.20.x-korean-morphological-tokenizer-plan.md (S3 follow-up via S11 sanity)
2026-05-28 12:27:37 +00:00
2fd13db209 docs(p9-fb-23): README + HANDOFF + HOTFIXES + INDEX + per-task spec
Also fixes snapshot drift in code-and-table.canonical.snapshot.json
introduced by task 2 (CanonicalDocument gains last_chunker_version +
last_embedding_version fields).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 18:27:05 +00:00
207a0ff61e kb-chunk: regenerate long-section.chunks.snapshot.json baseline
The snapshot now includes the policy_hash field on every Chunk per the
preceding kb-core schema change. chunk_ids are unchanged because the
chunk_id recipe (§4.2) already incorporated policy_hash via the chunker —
the field is simply now visible in the wire form.

Regenerated via:
  UPDATE_SNAPSHOTS=1 cargo test -p kb-chunk long_section_chunks_snapshot
2026-04-30 17:02:53 +00:00
58f7b8573d p1-5: add long-section fixture + Vec<Chunk> snapshot test
Bakes the chunker output for fixtures/markdown/long-section.md (3 H1s,
nested H2 under Alpha, a 50-line code block, a 3-col x 4-row table,
and a multi-paragraph Gamma section) into the JSON snapshot baseline.
Confirms the priority rules end-to-end:

  - Heading boundaries hold across H1 → H2 → H1 transitions
  - The code block emits one chunk at 427 tokens > target=200
  - The table stays single-chunk
  - Gamma's paragraph stream splits with one block of overlap seed

A second test runs the full parse → normalize → chunk pipeline 5
times and asserts identical chunk_ids each pass.

Drops the unused `kb-config` and `serde` from regular dependencies —
they were declared but no source path imports them; `serde` flows in
transitively via `kb-core` as a public API requirement, and
`ChunkingCfg` lives in `kb-config` but the chunker takes
`ChunkPolicy` directly. Production deps are now exactly the allowed
set actually used: anyhow, blake3, kb-core, serde_json_canonicalizer,
tracing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 16:33:29 +00:00
e0df42984e p1-4: address review I1-I3 + minors (extract attribution, audio-ref skip, NFC heading_path)
I1: warning_agent maps ExtractFailed → "kb-parse-md" (the panic-recovery
emitter in kb-parse-md/src/blocks.rs). Lift-stage warnings from
build_canonical_document are tracked separately and attributed to
"kb-normalize", so the I1 mapping change does not lie about
kb-normalize-originated drops.

I2: ParsedPayload::AudioRef no longer synthesizes Block::AudioRef with
an invalid empty AssetId (would violate AssetId::from_str's 32-hex
invariant). Block is dropped, Warning surfaces in Provenance with src
mention, attributed to kb-normalize (lift-stage decision). TODO(P8)
comment marks this as a placeholder until the audio extractor lands.

I3: NFC-normalize each heading_path string in lift_block before feeding
into id_for_block AND into CommonBlock.heading_path. pulldown-cmark does
not NFC heading text and serde_json_canonicalizer v0.3 does not either,
so canonically-equivalent NFD/NFC inputs would produce different
block_ids without this normalization. Mirrors the existing doc_id NFC
handling via to_posix.

Minors:
- M4: trim Cargo.toml — drop kb-config, serde_json_canonicalizer,
  blake3 (unused); keep tracing (now wired) + unicode-normalization
  (now used by I3).
- M5: determinism_1000_iterations_under_1s now uses the same 5-block
  fixture as block_ordinals_scoped_per_heading_and_kind (extracted into
  fixture_blocks_five helper) so the determinism property is exercised
  on a real lift_block path, not just an empty Vec. Still < 1s.
- M6: snapshot integration test now passes BodyHints { first_h1:
  Some("Code And Table"), .. } and asserts doc.title == "Code And Table"
  end-to-end. Baseline JSON updated.
- M7: title/lang edge-case unit tests pin policy: empty string lifts to
  empty string; non-stringy values silently drop. Rustdoc updated.
- M10: provenance_contains_stage_events_in_order asserts events[1].at
  == events[2].at to pin the shared-now_utc invariant.

New tests (unit, kb-normalize):
- provenance_with_extract_failed_warning_attributes_to_kb_parse_md (I1)
- audio_ref_block_skipped_with_warning (I2)
- nfc_nfd_korean_heading_path_same_block_id (I3)
- title_empty_string_in_user_map_falls_back_to_default (M7)
- title_non_string_in_user_map_silently_drops (M7)
- lang_invalid_shape_silently_drops (M7)

kb-normalize unit tests: 9 → 14. Integration snapshot: 1 (unchanged).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 15:41:50 +00:00
5352bede5c p1-4: snapshot + determinism tests
Add the integration snapshot test pinning the full `CanonicalDocument`
JSON for `fixtures/markdown/code-and-table.md` (run through the real
`kb-parse-md::parse_frontmatter` + `parse_blocks`, dev-dep only).
Non-deterministic `provenance.events[*].at` for the Parsed and
Normalized events is stripped before comparison; the Discovered
event's `at` is pinned by constructing the test `RawAsset` with a
fixed `discovered_at`. Run with `UPDATE_SNAPSHOTS=1` to regenerate.

Add the 1000-iteration determinism property: same inputs ⇒ byte-
identical JSON (modulo the same stripped timestamps), in under one
second of wall-clock time. A regression in canonical JSON, BLAKE3
hashing, ordinal counting, or any other deterministic field would
surface here immediately.

The integration test depends on `kb-parse-md` only as a dev-dep, so
`cargo tree -p kb-normalize --depth 1 --edges normal` confirms no
parser implementation appears in the production dep tree per design
§8.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 15:20:18 +00:00
cfccb3687d p1-3: update kb-parse-md callers + drop BlockView projection in snapshots
Mechanical sweep over `Inline::Text(_)` / `Code(_)` / `Strong(_)` / `Emph(_)`
construction and match sites under the new struct-variant shape introduced
in the previous commit. `Inline::Link { text, href }` is unchanged.

The snapshot test in `tests/blocks_snapshots.rs` previously projected
`ParsedBlock` into a `BlockView`/`PayloadView` shim because the old
`Inline` could not serialize. With the schema fix in place we now
serialize `ParsedBlock` directly through serde — the shim and its
`flatten_inline` helper are removed. Inlines surface as structured
objects (`{"kind":"text","text":"…"}` etc.). Regenerated
`nested-headings.blocks.snapshot.json` to reflect the new shape via
the existing `--ignored` emitter; `code-and-table.blocks.snapshot.json`
has no inlines and is unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 15:10:54 +00:00
f604a381df p1-3: snapshot tests + clippy fix
Adds two snapshot tests (`nested-headings.md`, `code-and-table.md`) under
crates/kb-parse-md/tests/blocks_snapshots.rs, with matching baseline JSON
next to each fixture. The snapshot view projects `kb_core::Inline` to
flat strings — `Inline` carries `serde(tag = "kind")` which is
incompatible with newtype variants holding a primitive (`Text(String)`),
so direct serialization of `ParsedBlock` would fail today. The view
preserves the contract that matters for P1-3 (heading paths, source
spans, payload kinds, payload text/code/table content) and will keep
working once kb-core fixes the Inline schema in a later task.

Also tightens `level_to_use >= 1 && <= 6` into `(1..=6).contains(&_)` to
satisfy `clippy::manual_range_contains`.
2026-04-30 14:17:41 +00:00
1fab6b0207 p1-2: address spec review (drop user_id_alias mirror in user map)
Spec §"Behavior contract" line 74 says `id:` is captured into
`metadata.user_id_alias` only. Remove the redundant `user.insert`
that was also writing it into the user map, and update the snapshot
baseline accordingly.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-30 13:02:28 +00:00
42a7d53e5d p1-2: fixtures + snapshot tests for frontmatter parser
Two markdown fixtures with hand-authored JSON baselines that pin the
§0 Q9 derive output across runs:

- frontmatter-only.md exercises the YAML happy path with most fields,
  unknown keys, an `id:` field, and a non-UTC created_at (so the
  baseline shows original_timestamps preservation).
- mixed-lang.md is body-only with no `lang:` field; baseline pins the
  lingua autodetect result for our enabled language set.

A separate `emit_snapshots` test (marked `#[ignore]`) regenerates the
baselines from the current parser output. A determinism test parses
the fixture twice and asserts equality so any non-determinism (e.g.
key ordering, lingua nondeterminism) fails fast.
2026-04-30 12:56:19 +00:00
a166b7051c p0-1: wire-schema stubs, doc/spec stubs, V001 migration, fixtures
- docs/wire-schema/v1/ ships 7 schema stubs (citation, search_hit,
  answer, ingest_report, doc_summary, chunk_inspection, doctor) that
  pin schema_version + required fields per design §2. Full property
  validation lands in later phases.
- docs/spec/ ships 7 markdown stubs each linking to the canonical
  frozen design (domain-model, ids, canonical-document, chunk-policy,
  citation-policy, module-boundaries, ai-generation-guidelines).
- migrations/V001__init.sql contains only schema_meta + migrations
  tables per design §5.1; data tables ship in P1-6/P2-1/P3-3.
- fixtures/ has the 11 subdirectories every downstream task references
  (markdown, source-fs, search/{lexical,hybrid}, embed, vector, rag,
  eval, image, pdf, audio). Empty subdirs use .gitkeep so they track.
  fixtures/markdown/ ships the 3 phase-0 fixtures: simple-note.md,
  nested-headings.md, code-and-table.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 05:17:32 +00:00