Implement an Extractor for MediaType::Audio(_), a Transcriber trait, and a whisper.cpp adapter built on the whisper-rs Rust bindings. The extractor produces a CanonicalDocument whose body is a single AudioRefBlock populated with Transcript { segments, language, engine, engine_version }.
Why now / why this size
Audio stays behind a single, replaceable engine boundary (the Transcriber trait). The extractor and the adapter ship together because the extractor is essentially a thin shell over the transcriber.
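A minimal sketch of what that boundary could look like. Everything here is an assumption for illustration: the Segment field names, the transcribe signature, the String error type (the real crate would return anyhow::Error), and the FixedTranscriber test double are hypothetical, and the actual Transcript type lives in kebab-core.

```rust
/// One timed span of recognized speech (field names are illustrative).
#[derive(Debug, Clone, PartialEq)]
pub struct Segment {
    pub start_ms: u64,
    pub end_ms: u64,
    pub text: String,
}

/// Mirrors the Transcript { segments, language, engine, engine_version } shape.
#[derive(Debug, Clone, PartialEq)]
pub struct Transcript {
    pub segments: Vec<Segment>,
    pub language: String,
    pub engine: String,
    pub engine_version: String,
}

/// The single engine boundary: 16 kHz mono f32 PCM in, Transcript out.
/// String stands in for anyhow::Error to keep the sketch dependency-free.
pub trait Transcriber {
    fn transcribe(
        &self,
        pcm_16k_mono: &[f32],
        language_hint: Option<&str>,
    ) -> Result<Transcript, String>;
}

/// Hypothetical test double that needs no model file.
pub struct FixedTranscriber {
    pub text: String,
}

impl Transcriber for FixedTranscriber {
    fn transcribe(
        &self,
        pcm_16k_mono: &[f32],
        language_hint: Option<&str>,
    ) -> Result<Transcript, String> {
        if pcm_16k_mono.is_empty() {
            return Err("no audio samples".to_string());
        }
        // Input is assumed to be 16 kHz mono, so length maps directly to time.
        let end_ms = pcm_16k_mono.len() as u64 * 1000 / 16_000;
        Ok(Transcript {
            segments: vec![Segment { start_ms: 0, end_ms, text: self.text.clone() }],
            language: language_hint.unwrap_or("en").to_string(),
            engine: "fixed".to_string(),
            engine_version: "0".to_string(),
        })
    }
}
```

With a boundary of this shape, the extractor can be exercised against the test double, and only the whisper-rs adapter needs a model file on disk.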
Allowed dependencies
kebab-core
kebab-config
whisper-rs = "0.13" (or current stable)
symphonia = { version = "0.5", features = ["all"] } — decode .m4a/.mp3/.wav/.flac/.ogg to interleaved f32 PCM at the source's native sample rate / channel layout. Symphonia does NOT resample; that is rubato's job.
rubato = "0.15" — sample-rate conversion to 16 kHz mono f32 (the input shape whisper.cpp expects). Use rubato::FftFixedIn::new(input_sample_rate, 16_000, frames_per_chunk, sub_chunks, 1 /* channels after downmix */) for fixed-input streaming; pre-mix multi-channel to mono via simple averaging before the resampler.
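The pre-mix step can be sketched without any crate. This assumes the decoder output has already been copied into an interleaved f32 slice; the function name downmix_to_mono is hypothetical, and dropping a trailing partial frame is a choice made for the sketch.

```rust
/// Average interleaved multi-channel samples into a mono signal.
/// `channels` is the channel count of the interleaved input.
/// A trailing partial frame (fewer than `channels` samples) is dropped.
pub fn downmix_to_mono(interleaved: &[f32], channels: usize) -> Vec<f32> {
    assert!(channels > 0, "channel count must be non-zero");
    interleaved
        .chunks_exact(channels)
        .map(|frame| frame.iter().sum::<f32>() / channels as f32)
        .collect()
}
```

The mono output is what the resampler would then receive with a channel count of 1.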
Determinism: greedy sampling + fixed model + identical PCM → identical transcript text and segment timestamps. Tests use base.en (small fast model) for speed.
Model file missing → anyhow::Error with the hint "download a whisper.cpp model and set audio.model_path".
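One way the missing-model check might look, using only the standard library. The ModelMissing type and check_model_path function are hypothetical names; because the type implements std::error::Error, it can be converted into anyhow::Error with `?` in the real crate.

```rust
use std::fmt;
use std::path::{Path, PathBuf};

/// Hypothetical error carrying the path that was looked up.
#[derive(Debug)]
pub struct ModelMissing {
    pub path: PathBuf,
}

impl fmt::Display for ModelMissing {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(
            f,
            "whisper model not found at {}: download a whisper.cpp model and set audio.model_path",
            self.path.display()
        )
    }
}

impl std::error::Error for ModelMissing {}

/// Fail early, before any engine is constructed, if the model file is absent.
pub fn check_model_path(path: &Path) -> Result<(), ModelMissing> {
    if path.is_file() {
        Ok(())
    } else {
        Err(ModelMissing { path: path.to_path_buf() })
    }
}
```

Checking up front keeps the hint under the crate's control instead of relying on whatever message the binding reports when loading fails.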
Storage / wire effects
Reads: config.audio.model_path (model file).
Otherwise none directly.
Test plan
| kind | description | fixture / data |
| --- | --- | --- |
| unit | 3-second WAV containing "hello world" → segments[0].text contains "hello world" (using base.en model, downloaded once for CI) | fixtures/audio/hello.wav |
| unit | duration_ms matches actual audio length within ±50 ms | inline |
| unit | corrupt audio → error | fixtures/audio/corrupt.wav |
| unit | model file missing → error with helpful hint | inline |
| unit | language hint passed to whisper changes detected language | inline |
| determinism | identical input → identical Transcript twice | inline |
| #[ignore] integration | 30-second Korean audio → segments_count > 1, language = "ko" | requires large-v3 model |
| snapshot | CanonicalDocument JSON stable for short fixture | fixtures/audio/hello.wav |
All tests under cargo test -p kebab-parse-audio. Mark slow/large-model tests #[ignore].
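The duration check in the test plan reduces to simple arithmetic on the decoded sample count. A sketch, assuming duration is derived from mono frame count and sample rate; both helper names are hypothetical.

```rust
/// Duration in milliseconds of `frame_count` mono frames at `sample_rate_hz`.
/// Integer division truncates toward zero.
pub fn duration_ms(frame_count: usize, sample_rate_hz: u32) -> u64 {
    assert!(sample_rate_hz > 0, "sample rate must be non-zero");
    (frame_count as u64 * 1000) / sample_rate_hz as u64
}

/// True when `actual_ms` is within `tolerance_ms` of `expected_ms`.
pub fn within_tolerance(actual_ms: u64, expected_ms: u64, tolerance_ms: u64) -> bool {
    actual_ms.abs_diff(expected_ms) <= tolerance_ms
}
```

A test for the ±50 ms row could then assert within_tolerance(reported, duration_ms(frames, rate), 50) for an inline buffer of known length.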
Definition of Done
cargo check -p kebab-parse-audio passes
cargo test -p kebab-parse-audio passes (excluding #[ignore])
No imports outside Allowed dependencies (resampler crate may be added — record in PR)
First-run model download path documented (NOT performed by code; user responsibility)
PR links design §3.4, §3.7a, §9.3
Out of scope
Diarization (P+).
Real-time / streaming transcription (P+).
Voice activity detection beyond what whisper.cpp offers internally.
Lossless re-encoding of source audio.
Risks / notes
whisper.cpp model files are large (1+ GB for large-v3). Tests must default to base.en (~150 MB) and ship a 3-second fixture.
macOS Metal acceleration: ensure whisper-rs feature flags align with M-series builds; document any required env vars.
Decoding errors for variable-bitrate .m4a are common. symphonia is the most reliable Rust option, but expect an occasional unsupported codec; fail cleanly with an error rather than panicking.
Resampling: rubato::FftFixedIn is the v1 default. Its quality is high enough that resampling should not limit whisper.cpp recognition accuracy, and it is fast enough that decode + resample stays under real-time on M-series. If a regression appears, switch to SincFixedIn in a PR and record the change in engine_version, since transcript stability depends on the resampler.
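One possible way to make the resampler visible in engine_version. The string layout and the function name below are assumptions, not an agreed format; the point is only that swapping the resampler yields a different version string.

```rust
/// Compose an engine_version string that changes whenever the binding,
/// model, or resampler changes. The layout is illustrative only.
pub fn engine_version(binding: &str, model: &str, resampler: &str) -> String {
    format!("whisper-rs {}; model={}; resampler={}", binding, model, resampler)
}
```

Snapshot tests that include engine_version would then fail visibly when the resampler is swapped, which is the signal the note above asks for.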