feat(p1-5): kb-chunk md-heading-v1 chunker #10
Summary
Implements the P1-5 task spec. Adds a new crate, `kb-chunk`: the first concrete `Chunker` implementation, labeled `md-heading-v1`. It takes a `kb_core::CanonicalDocument` and produces a `Vec<kb_core::Chunk>` using priority-ordered rules (heading boundaries first / code and tables atomic / paragraph splitting with overlap). `chunk_id` is deterministic, following the design §4.2 recipe `(doc_id, chunker_version, block_ids, policy_hash)`, and `policy_hash` is 16 hex characters of `blake3(canonical_json(ChunkPolicy))`.
Reference documents: tasks/p1/p1-5-chunk.md, docs/superpowers/specs/2026-04-27-kb-final-form-design.md (§3.5 Chunk / §4.2 chunk_id / §7.2 Chunker / §0 Q3 / §14).
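As a rough illustration of why a recipe like `(doc_id, chunker_version, block_ids, policy_hash)` yields deterministic, order-sensitive IDs, here is a minimal sketch. It is not the PR's `id_for_chunk`: the function name `sketch_chunk_id`, the separator bytes, and the 64-bit FNV-1a hash are all assumptions, with FNV-1a swapped in for blake3 only so the example builds with the standard library alone.

```rust
/// Stand-in 64-bit FNV-1a hash. The PR uses blake3; FNV-1a is swapped in
/// here only so the sketch needs no external crates.
fn fnv1a(bytes: &[u8]) -> u64 {
    let mut h: u64 = 0xcbf2_9ce4_8422_2325; // FNV offset basis
    for b in bytes {
        h ^= u64::from(*b);
        h = h.wrapping_mul(0x0000_0100_0000_01b3); // FNV prime
    }
    h
}

/// Hypothetical recipe: feed the four inputs to the hash in a fixed order,
/// with separator bytes so adjacent fields cannot blur into each other.
fn sketch_chunk_id(
    doc_id: &str,
    chunker_version: &str,
    block_ids: &[&str],
    policy_hash: &str,
) -> String {
    let mut buf: Vec<u8> = Vec::new();
    buf.extend_from_slice(doc_id.as_bytes());
    buf.push(0x1f);
    buf.extend_from_slice(chunker_version.as_bytes());
    buf.push(0x1f);
    // block_ids are hashed in document order, so reordering changes the ID.
    for id in block_ids {
        buf.extend_from_slice(id.as_bytes());
        buf.push(0x1e);
    }
    buf.extend_from_slice(policy_hash.as_bytes());
    format!("{:016x}", fnv1a(&buf))
}
```

Calling `sketch_chunk_id` twice with the same arguments returns the same 16-character hex string, while swapping two entries of `block_ids` or changing `policy_hash` is expected to produce a different one.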
Implementation
- `crates/kb-chunk`:
  - `lib.rs`: exposes only `MdHeadingV1Chunker`.
  - `md_heading_v1.rs`: the `Chunker` trait implementation. Contains the priority logic, the token-estimate proxy (`bytes / 3`, an over-estimation that stays safe for Korean text), `policy_hash`, and overlap seeding.
  - `tests/long_section_snapshot.rs`: `long-section.md` fixture → `Vec<Chunk>` JSON deterministic round-trip + 5-iteration determinism.
- `fixtures/markdown/long-section.md` (~3KB; a mix of 3 H1 / 1 H2 / code block / table / ImageRef) + `long-section.chunks.snapshot.json` baseline.
Core behavior contract
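- […] `token_estimate=0`) → long paragraphs carry `overlap_tokens` worth of the prior tail as a seed → `chunker_version` / `policy_hash` are recorded.
- When propagating `heading_path`, if the first block is a `Block::Heading`, the path is synthesized as `parent_path + [heading.text]`, so a heading-only chunk does not lose its citation context (I2 fix).
- The overlap seed budget is clamped to `min(overlap_tokens, target_tokens / 2)`, which blocks the 1-block-per-chunk pattern under pathological policies (overlap >= target) (I3 fix).
- `policy_hash` is folded only into the ID: the `Chunk` struct in design §3.5 has no `policy_hash` field, so this is consistent. Filling the P1-6 SQLite `chunks.policy_hash` column is the caller's responsibility, by recomputing it via `Chunker::policy_hash(policy)`.
- `MdHeadingV1Chunker` is zero-sized and stateless: nothing is cached between trait method calls.
Verification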
- `cargo check` / `build` (`RUSTFLAGS="-D warnings"`) / `clippy --all-targets -D warnings`: all clean
- `cargo test -p kb-chunk` → 11 unit + 2 integration tests pass
- `cargo test --workspace` → the entire workspace is green
- `cargo tree -p kb-chunk --depth 1` → only the allowed deps (`kb-core`, `anyhow`, `serde_json_canonicalizer`, `blake3`, `tracing`). `kb-parse-md` / `kb-normalize` appear only under `[dev-dependencies]` (§8 boundary respected).
- `deterministic_chunk_ids_1000`: 1000 runs with byte-identical `chunk_id`; `long_section_chunks_are_deterministic`: 5-iteration integration determinism; snapshot baseline stable.
Review trail
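- […] the `Chunk` struct has no `policy_hash` field): design §3.5 is authoritative, so the implementer's interpretation is correct.
- […] `respect_markdown_headings` ignored / I2 heading-only chunk loses its path / I3 overlap seed double-count) + 2 Minor items (M3 document the `policy_hash` panic / M4 comment on `would_exceed` accounting).
- […] `f780c71`, `a81460f`, `ceeac9f`): heading_path synthesis logic + overlap clamp + documentation. Snapshot baseline unchanged.
Follow-ups that resolve naturally (non-blocking)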
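- […] the `put_chunks(...)` signature needs an added `policy_hash` or `policy: &ChunkPolicy` argument (the §5.5 column is `NOT NULL`). This is P1-6's responsibility.
- […] `doc_markdown` backticks, `missing_errors_doc`, etc.) are left for a separate hardening pass.
Next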
After merge, branch off `feat/p1-6-store-sqlite` and proceed to P1-6 (`kb-store-sqlite`), the last P1 task.

Fills in MdHeadingV1Chunker::chunk with the priority-ordered ruleset from the design (§0 / §14):
1. Heading is a hard boundary; the heading itself starts and is included in its chunk so heading text is retrievable.
2. Code blocks never split, regardless of size.
3. Tables stay single-chunk (row-split deferred per task spec).
4. Long sections split at target_tokens with paragraph-level overlap_tokens worth of seeded tail blocks.
5. ImageRef / AudioRef each become their own chunk with token_estimate = 0.
6. heading_path lifts from the first contributing non-Heading block; source_spans concatenate in document order.
7. chunk_id derives from id_for_chunk(doc_id, chunker_version, block_ids, policy_hash) per §4.2.

Covers the unit + determinism rows of the P1-5 test plan: heading boundary respected, 800-token code block stays single, small table stays single, long paragraph chain splits with overlap, ImageRef chunk has token_estimate=0, and 1000-iter chunk_id determinism.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

A pathological `ChunkPolicy { overlap_tokens >= target_tokens }` caused `md-heading-v1` to degenerate into 1-block-per-chunk: the seeded `acc.text_tokens` already exceeded `target_tokens` before any fresh content landed, so the next paragraph immediately tripped the `would_exceed` flush.

Clamp the seed budget in `collect_overlap_seed` to `min(overlap_tokens, target_tokens / 2)`. This guarantees that after seeding, the chunk has at least `target/2` worth of room for new content before flushing, restoring the intended paragraph-overlap behavior on every reasonable and unreasonable policy.

Adds a regression test pinning a 50/200 (overlap = 4× target) policy and asserting every emitted chunk holds ≥2 blocks.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Review: all items resolved. APPROVE recommended (registered as COMMENT per gate policy)
The internal review loop ran three times. Final reviewer verdict: APPROVED, with 0 must-fix-now items.
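For reference, the I3 clamp signed off on in this review can be sketched as follows. This is a minimal illustration under assumptions: token counts are plain `usize` values, and `collect_overlap_seed_sketch` is a hypothetical stand-in that returns block indices, not the PR's actual `collect_overlap_seed`.

```rust
/// Seed budget for overlap, clamped as in the commit message:
/// min(overlap_tokens, target_tokens / 2).
fn clamped_overlap_budget(overlap_tokens: usize, target_tokens: usize) -> usize {
    overlap_tokens.min(target_tokens / 2)
}

/// Hypothetical helper: walk the previous chunk's blocks from the tail and
/// keep whole blocks while they still fit in the clamped budget.
/// Returns the indices of the seeded blocks in document order.
fn collect_overlap_seed_sketch(
    prev_block_tokens: &[usize],
    overlap_tokens: usize,
    target_tokens: usize,
) -> Vec<usize> {
    let budget = clamped_overlap_budget(overlap_tokens, target_tokens);
    let mut used = 0;
    let mut seed = Vec::new();
    for (idx, &tokens) in prev_block_tokens.iter().enumerate().rev() {
        if used + tokens > budget {
            break; // the next block would overshoot the budget
        }
        used += tokens;
        seed.push(idx);
    }
    seed.reverse(); // restore document order
    seed
}
```

With overlap 40 and target 200 the budget is `min(40, 100) = 40`, so the clamp is a no-op. With overlap 200 and target 50 (the 4× case from the regression test) the budget drops to 25, which leaves at least half of the target for fresh content.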
Review trail
`f780c71`, `a81460f`, `ceeac9f`
Key verification
- `chunker_version="md-heading-v1"` + 16-hex `policy_hash` + `chunk` determinism
- `chunk_id` recipe per §4.2 (`block_ids` are in document order; order significance verified)
- `# A\n## B\n[code]` → `[A]`, `[A,B]`, `[A,B]`
- `min(overlap_tokens, target_tokens / 2)` (I3)
- The `Chunk` struct has no `policy_hash` field → folded only into the ID (consistent with design §3.5)
Gates passed
- `cargo check` / `build` (`RUSTFLAGS=-D warnings`) / `clippy --all-targets -D warnings`: clean
- `cargo test -p kb-chunk` → 11 unit + 2 integration tests pass (was 9+2; +2 new: `heading_only_chunk_carries_self_in_path`, `overlap_clamped_when_overlap_exceeds_target`)
- `cargo test --workspace`: all green
- `long-section.chunks.snapshot.json` is unchanged: the I2 fix does not affect `chunk_id` (`heading_path` is not part of the recipe), and the I3 clamp has no effect under a normal policy (`min(40, 100)=40`).
Follow-ups that resolve naturally (non-blocking)
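Filling the `chunks.policy_hash NOT NULL` column requires adjusting the `put_chunks` signature. The natural approach is for the caller to recompute the value via `Chunker::policy_hash(policy)` and pass it in.
Conclusion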
The root-cause fixes are complete, with nothing papered over. Under the self-review gate policy this comment is registered as COMMENT; the author is asked to manually APPROVE and merge.
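For the record, the I2 behavior verified above (`# A\n## B\n[code]` → `[A]`, `[A,B]`, `[A,B]`) can be sketched with a heading stack. This is only an illustration under assumptions: the two-variant `Block` enum and the `heading_paths` function are hypothetical simplifications, and each block is treated as its own chunk as in that example; the real `kb_core` types and the chunker's logic may differ.

```rust
/// Simplified stand-in for the document block type (an assumption; the real
/// type in kb_core has more variants and fields).
#[derive(Debug, Clone, PartialEq)]
enum Block {
    Heading { level: usize, text: String },
    Other,
}

/// Hypothetical sketch: compute the heading path seen by each block.
/// A heading contributes `parent_path + [heading.text]`, so a heading-only
/// chunk still carries its own text in the path.
fn heading_paths(blocks: &[Block]) -> Vec<Vec<String>> {
    let mut stack: Vec<(usize, String)> = Vec::new();
    let mut out = Vec::new();
    for block in blocks {
        if let Block::Heading { level, text } = block {
            // Drop siblings and deeper headings; what remains is the parent path.
            while stack.last().map_or(false, |(l, _)| *l >= *level) {
                stack.pop();
            }
            stack.push((*level, text.clone()));
        }
        out.push(stack.iter().map(|(_, t)| t.clone()).collect());
    }
    out
}
```

Feeding it an H1 `A`, an H2 `B`, and one non-heading block yields the paths `[A]`, `[A, B]`, `[A, B]`, matching the case listed under key verification.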