Implement Chunker with chunker_version = "audio-segment-v1". Groups consecutive transcript segments into chunks that approach target_tokens while respecting speaker-turn boundaries (when present).
Why now / why this size
Per-medium chunker. Tiny but versioned — chunk_id depends on chunker_version so labeling matters.
policy_hash = blake3(canonical_json(policy)) truncated to 16 hex chars.
Behavior contract
Operates only on documents whose first block is Block::AudioRef with Some(transcript). Any other document returns an anyhow error (e.g. via anyhow::bail!) with the message "AudioSegmentV1Chunker only handles audio docs".
Iterate transcript.segments (already in chronological order):
Greedily group adjacent segments until estimated token budget approaches policy.target_tokens (bytes / 4 proxy on segment text).
Force a split when segment[i].speaker != segment[i-1].speaker (only if speaker info present), even if budget not met.
No overlap across chunks (audio chunk overlap is rarely useful for retrieval).
block_ids = [audio_ref_block.block_id] (always one block per chunk).
token_estimate = byte_len / 4.
Empty transcript (segments.is_empty()) → Vec::new() (no chunks).
Speaker label for citation: if all segments in a chunk share a speaker, the chunk's Citation::Time { speaker: Some(...) } (constructed downstream by retrieval) preserves it. This task's responsibility ends at populating source_spans. Two options were considered: (a) retrieval-side citation construction reads transcript.segments from the DB to attach the speaker, or (b) this chunker serializes the speaker into a small extension JSON in chunk.heading_path. Chosen approach: (a). Speaker propagation is left to the retriever, NOT the chunker, because serializing the speaker into the chunk would couple speaker labels into chunk_id.
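The grouping rules in the behavior contract can be sketched as follows. This is a minimal illustration, not the final implementation: the Segment struct, estimate_tokens, and group_segments are hypothetical names, and two details are assumptions where the contract is not explicit. First, "approaches target_tokens" is read as "close the chunk before adding a segment that would push the estimate past the target". Second, "speaker info present" is read as "both adjacent segments carry Some(speaker)".

```rust
/// Hypothetical minimal segment shape; the real transcript type may differ.
#[derive(Debug, Clone)]
pub struct Segment {
    pub text: String,
    pub speaker: Option<String>,
}

/// Byte-length proxy for token count (bytes / 4), per the contract.
pub fn estimate_tokens(byte_len: usize) -> usize {
    byte_len / 4
}

/// Returns half-open index ranges `[start, end)` into `segments`, one per chunk.
/// No overlap: every segment belongs to exactly one range.
pub fn group_segments(segments: &[Segment], target_tokens: usize) -> Vec<(usize, usize)> {
    let mut chunks = Vec::new();
    let mut start = 0usize; // index of the first segment in the current chunk
    let mut bytes = 0usize; // accumulated text bytes in the current chunk

    for (i, seg) in segments.iter().enumerate() {
        if i > start {
            let prev = &segments[i - 1];
            // Forced split (assumption): only when both speakers are known and differ.
            let speaker_changed = match (&prev.speaker, &seg.speaker) {
                (Some(a), Some(b)) => a != b,
                _ => false,
            };
            // Budget split (assumption): adding this segment would exceed the target.
            let over_budget = estimate_tokens(bytes + seg.text.len()) > target_tokens;
            if speaker_changed || over_budget {
                chunks.push((start, i));
                start = i;
                bytes = 0;
            }
        }
        bytes += seg.text.len();
    }

    // Flush the trailing chunk; an empty transcript yields no chunks.
    if start < segments.len() {
        chunks.push((start, segments.len()));
    }
    chunks
}
```

Under these assumptions a single oversized segment still forms its own chunk, since a chunk is never closed while empty. Each returned range would map to one chunk with block_ids = [audio_ref_block.block_id] and token_estimate computed from the summed byte length.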
Time-overlap chunks (intentionally not supported in v1).
Real tokenizer integration (P+ replaces byte proxy across all chunkers).
Risks / notes
Speaker-boundary forcing can create very small chunks when speakers alternate quickly (e.g., interview Q/A). Document a policy.min_segments_per_chunk knob (default 1) that optionally suppresses force-splits while the current chunk is below the floor. Whether to actually add the config knob is the implementer's call, if metric pressure demands it.
Citation speaker inference at retrieval time needs a DB lookup of transcript segments (for example from a transcript_segments table, which does not exist yet). For v1, surface speaker info via the wire Citation::Time.speaker only when the retriever can confidently attach it; otherwise leave None. This task does not block on that decision.
Bumping chunker_version invalidates downstream embeddings; treat as a versioning event per §9.
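If the policy.min_segments_per_chunk knob mentioned above is added, the force-split decision could look roughly like this. The helper name and signature are hypothetical, and the exact floor semantics (counting segments already in the current chunk) are an assumption rather than something the spec fixes.

```rust
/// Hypothetical helper: decide whether a speaker change should force a split.
/// With the default floor of 1, any non-empty chunk is split on a speaker
/// change, which matches the v1 behavior contract.
pub fn should_force_split(
    speaker_changed: bool,
    segments_in_chunk: usize,
    min_segments_per_chunk: usize,
) -> bool {
    // Suppress the forced split while the current chunk is below the floor.
    speaker_changed && segments_in_chunk >= min_segments_per_chunk
}
```

With a floor of 2, a rapid Q/A exchange would keep a one-segment question together with the answer that follows it instead of emitting two tiny chunks. The token-budget split is unaffected by this knob.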