Multi-engine speech-to-text library for Rust. Supports Parakeet, Canary, Moonshine, SenseVoice, GigaAM, Whisper, Whisperfile, and OpenAI.
Version 0.3.0 changes the SpeechModel trait. If you need the old API, pin to version = "=0.2.9".
- transcribe() and transcribe_file() now take &TranscribeOptions instead of Option<&str> for language
- SpeechModel requires Send, enabling Box<dyn SpeechModel + Send> across threads
- TranscribeOptions includes a translate field for Whisper/Whisperfile translation support
- WhisperEngine::capabilities() now returns actual model language support (English-only vs multilingual) instead of always reporting 99 languages
Note: 0.3.0 is a large migration. We believe correctness is preserved for all engines, but expect potential issues as this stabilizes. Please report any problems on GitHub.
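To illustrate what the Send bound makes possible, here is a minimal, self-contained sketch. The trait, the options struct, and EchoModel are simplified stand-ins defined locally for illustration, not the crate's actual definitions. Only the names SpeechModel, TranscribeOptions, and translate come from the notes above; the field types and the String result are assumptions.

```rust
use std::thread;

// Simplified stand-in for the crate's options type (field types are assumed).
#[derive(Debug, Default, Clone)]
pub struct TranscribeOptions {
    pub language: Option<String>,
    pub translate: bool,
}

// Simplified stand-in trait; the real trait has more methods and a richer result type.
pub trait SpeechModel: Send {
    fn transcribe(
        &mut self,
        samples: &[f32],
        options: &TranscribeOptions,
    ) -> Result<String, String>;
}

// Hypothetical model that only reports what it was asked to do.
pub struct EchoModel;

impl SpeechModel for EchoModel {
    fn transcribe(
        &mut self,
        samples: &[f32],
        options: &TranscribeOptions,
    ) -> Result<String, String> {
        let lang = options.language.as_deref().unwrap_or("auto");
        Ok(format!(
            "{} samples, language={}, translate={}",
            samples.len(),
            lang,
            options.translate
        ))
    }
}

// Because the trait requires Send, a boxed model can be moved into a worker thread.
pub fn transcribe_on_worker(
    mut model: Box<dyn SpeechModel + Send>,
    samples: Vec<f32>,
    options: TranscribeOptions,
) -> Result<String, String> {
    thread::spawn(move || model.transcribe(&samples, &options))
        .join()
        .map_err(|_| "worker thread panicked".to_string())?
}
```

Calling transcribe_on_worker(Box::new(EchoModel), samples, TranscribeOptions::default()) moves the boxed model into the spawned thread, which would not compile if the trait object were not Send.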
[dependencies]
transcribe-rs = { version = "0.3", features = ["onnx"] }

No features are enabled by default. Pick the engines you need:
| Feature | Engines |
|---|---|
| onnx | Parakeet, Canary, Moonshine, SenseVoice, GigaAM (via ONNX Runtime) |
| whisper-cpp | Whisper (local, GGML via whisper.cpp with Metal/Vulkan) |
| whisperfile | Whisperfile (local server wrapper) |
| openai | OpenAI API (remote, async) |
| all | Everything above |
GPU accelerator features for ORT engines:
| Feature | Backend |
|---|---|
| ort-cuda | NVIDIA CUDA |
| ort-rocm | AMD ROCm |
| ort-directml | Microsoft DirectML (Windows) |
use transcribe_rs::onnx::parakeet::{ParakeetModel, ParakeetParams, TimestampGranularity};
use transcribe_rs::onnx::Quantization;
use std::path::PathBuf;

let mut model = ParakeetModel::load(
    &PathBuf::from("models/parakeet-tdt-0.6b-v3-int8"),
    &Quantization::Int8,
)?;
let samples = transcribe_rs::audio::read_wav_samples(&PathBuf::from("audio.wav"))?;
let result = model.transcribe_with(
    &samples,
    &ParakeetParams {
        timestamp_granularity: Some(TimestampGranularity::Segment),
        ..Default::default()
    },
)?;
println!("{}", result.text);

All local engines implement the SpeechModel trait. Remote engines (OpenAI) implement RemoteTranscriptionEngine separately because they are async and file-based.
By default, engines use CPU. To enable GPU acceleration, enable the appropriate feature and set the accelerator preference before loading any models:
use transcribe_rs::{set_ort_accelerator, OrtAccelerator};
// Use CUDA for all ORT engines (SenseVoice, GigaAM, Parakeet, Moonshine)
set_ort_accelerator(OrtAccelerator::Cuda);
// Or auto-detect the best available GPU
set_ort_accelerator(OrtAccelerator::Auto);

For whisper.cpp, the GPU backend (Metal, Vulkan) is selected at compile time. You can control whether the GPU is used at runtime:
use transcribe_rs::{set_whisper_accelerator, WhisperAccelerator};
set_whisper_accelerator(WhisperAccelerator::CpuOnly); // force CPU

DirectML note: DirectML requires special ORT session settings (parallel_execution(false), memory_pattern(false)) that would hurt performance on other backends. Because of this, Auto mode does not include DirectML; you must select it explicitly with OrtAccelerator::DirectMl.
Query which ORT accelerators are compiled in with OrtAccelerator::available().
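The Auto behavior described above can be pictured with a small, self-contained sketch. The Accel enum and resolve_auto function are hypothetical stand-ins, not the crate's OrtAccelerator API, and the CUDA-before-ROCm preference order is an assumption made only for illustration. The one property taken from the text is that DirectML is never chosen automatically.

```rust
// Hypothetical stand-in for the set of ORT backends; not the crate's enum.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Accel {
    Cpu,
    Cuda,
    Rocm,
    DirectMl,
}

/// Picks a backend the way an "auto" mode might: the first compiled-in GPU
/// backend from a fixed candidate list, falling back to CPU.
pub fn resolve_auto(compiled_in: &[Accel]) -> Accel {
    // Assumed preference order; DirectML is deliberately absent from the
    // candidates because it needs session settings that hurt other backends.
    [Accel::Cuda, Accel::Rocm]
        .iter()
        .copied()
        .find(|candidate| compiled_in.contains(candidate))
        .unwrap_or(Accel::Cpu)
}
```

With only DirectML compiled in, resolve_auto(&[Accel::DirectMl]) falls back to Accel::Cpu, which mirrors why DirectML has to be requested explicitly.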
use transcribe_rs::onnx::canary::{CanaryModel, CanaryParams};
use transcribe_rs::onnx::Quantization;
use std::path::PathBuf;

let mut model = CanaryModel::load(
    &PathBuf::from("models/canary-1b-v2"),
    &Quantization::Int8,
)?;
let samples = transcribe_rs::audio::read_wav_samples(&PathBuf::from("audio.wav"))?;
let result = model.transcribe_with(
    &samples,
    &CanaryParams {
        language: Some("en".to_string()),
        ..Default::default()
    },
)?;

Canary supports translation via target_language:
let result = model.transcribe_with(
    &samples,
    &CanaryParams {
        language: Some("de".to_string()),
        target_language: Some("en".to_string()),
        ..Default::default()
    },
)?;

Model variant (Flash vs V2) is auto-detected from vocabulary size. Flash models support en/de/es/fr; V2 supports 25 languages.
Features:
- PnC (punctuation and capitalization): enabled by default. When on, the model adds proper punctuation and capitalization. Set use_pnc: false for raw output.
- ITN (inverse text normalization): enabled by default. Converts spoken numbers to written form (e.g. "one hundred twenty three" becomes "123"). Set use_itn: false to disable. Only supported on V2 models; silently ignored on Flash.
- Translation: set target_language to translate between supported languages.
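Because ITN is silently ignored on Flash and Flash covers only four languages, a caller may want a check before transcribing. The sketch below is a hypothetical caller-side helper, not part of the crate: CanaryVariant and preflight_warnings are names invented here, and the only facts used are the Flash language set and the ITN limitation stated above. V2 requests are not checked because the 25 supported languages are not listed here.

```rust
// Hypothetical caller-side mirror of the two model variants.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum CanaryVariant {
    Flash,
    V2,
}

// Languages the text above lists for Flash models.
const FLASH_LANGUAGES: [&str; 4] = ["en", "de", "es", "fr"];

/// Returns human-readable warnings for settings that a Flash model would
/// reject or silently ignore. Returns an empty list for V2.
pub fn preflight_warnings(variant: CanaryVariant, language: &str, use_itn: bool) -> Vec<String> {
    let mut warnings = Vec::new();
    if variant == CanaryVariant::Flash {
        if !FLASH_LANGUAGES.contains(&language) {
            warnings.push(format!(
                "language '{}' is not in the Flash set en/de/es/fr",
                language
            ));
        }
        if use_itn {
            warnings.push("use_itn has no effect on Flash models".to_string());
        }
    }
    warnings
}
```

For example, preflight_warnings(CanaryVariant::Flash, "ja", true) yields two warnings, while the same request against V2 yields none.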
use transcribe_rs::onnx::sense_voice::{SenseVoiceModel, SenseVoiceParams};
use transcribe_rs::onnx::Quantization;
use std::path::PathBuf;

let mut model = SenseVoiceModel::load(
    &PathBuf::from("models/sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17"),
    &Quantization::Int8,
)?;
let samples = transcribe_rs::audio::read_wav_samples(&PathBuf::from("audio.wav"))?;
let result = model.transcribe_with(
    &samples,
    &SenseVoiceParams {
        language: Some("en".to_string()),
        ..Default::default()
    },
)?;

use transcribe_rs::onnx::moonshine::{MoonshineModel, MoonshineVariant};
use transcribe_rs::onnx::Quantization;
use transcribe_rs::SpeechModel;
use std::path::PathBuf;

let mut model = MoonshineModel::load(
    &PathBuf::from("models/moonshine-base"),
    MoonshineVariant::Base,
    &Quantization::default(),
)?;
let result = model.transcribe_file(&PathBuf::from("audio.wav"), &transcribe_rs::TranscribeOptions::default())?;

Streaming variant:
use transcribe_rs::onnx::moonshine::StreamingModel;
use transcribe_rs::onnx::Quantization;
use transcribe_rs::SpeechModel;
use std::path::PathBuf;

let mut model = StreamingModel::load(
    &PathBuf::from("models/moonshine-streaming/moonshine-tiny-streaming-en"),
    4, // threads
    &Quantization::default(),
)?;
let result = model.transcribe_file(&PathBuf::from("audio.wav"), &transcribe_rs::TranscribeOptions::default())?;

use transcribe_rs::onnx::gigaam::GigaAMModel;
use transcribe_rs::onnx::Quantization;
use transcribe_rs::SpeechModel;
use std::path::PathBuf;

let mut model = GigaAMModel::load(
    &PathBuf::from("models/giga-am-v3"),
    &Quantization::default(),
)?;
let result = model.transcribe_file(&PathBuf::from("audio.wav"), &transcribe_rs::TranscribeOptions::default())?;

use transcribe_rs::whisper_cpp::{WhisperEngine, WhisperInferenceParams};
use std::path::PathBuf;

let mut engine = WhisperEngine::load(&PathBuf::from("models/whisper-medium-q4_1.bin"))?;
let samples = transcribe_rs::audio::read_wav_samples(&PathBuf::from("audio.wav"))?;
let result = engine.transcribe_with(
    &samples,
    &WhisperInferenceParams {
        initial_prompt: Some("Context prompt here.".to_string()),
        ..Default::default()
    },
)?;

use transcribe_rs::whisperfile::{
    WhisperfileEngine, WhisperfileInferenceParams, WhisperfileLoadParams,
};
use std::path::PathBuf;

let mut engine = WhisperfileEngine::load_with_params(
    &PathBuf::from("models/whisperfile-0.9.3"),
    &PathBuf::from("models/ggml-small.bin"),
    WhisperfileLoadParams {
        port: 8080,
        startup_timeout_secs: 60,
        ..Default::default()
    },
)?;
let samples = transcribe_rs::audio::read_wav_samples(&PathBuf::from("audio.wav"))?;
let result = engine.transcribe_with(
    &samples,
    &WhisperfileInferenceParams {
        language: Some("en".to_string()),
        ..Default::default()
    },
)?;
// Server shuts down automatically when engine is dropped.

use transcribe_rs::remote::openai::{self, OpenAIModel, OpenAIRequestParams};
use transcribe_rs::{remote, RemoteTranscriptionEngine};
use std::path::PathBuf;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let engine = openai::default_engine();
    let result = engine
        .transcribe_file(
            &PathBuf::from("audio.wav"),
            OpenAIRequestParams::builder()
                .model(OpenAIModel::Gpt4oMiniTranscribe)
                .timestamp_granularity(remote::openai::OpenAITimestampGranularity::Segment)
                .build()?,
        )
        .await?;
    println!("{}", result.text);
    Ok(())
}

All audio input must be 16 kHz, mono, 16-bit PCM WAV.
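To check a file against the 16 kHz, mono, 16-bit PCM requirement before handing it to an engine, the WAV header can be inspected directly. The sketch below is a standalone helper using only the standard library; WavFormat, parse_wav_format, and is_supported_input are hypothetical names and not part of the crate. It reads the standard RIFF "fmt " chunk and does not handle every WAV variant (for example WAVE_FORMAT_EXTENSIBLE files report a different format tag).

```rust
/// Fields of interest from a WAV file's "fmt " chunk.
#[derive(Debug, PartialEq, Eq)]
pub struct WavFormat {
    pub audio_format: u16, // 1 = integer PCM
    pub channels: u16,
    pub sample_rate: u32,
    pub bits_per_sample: u16,
}

/// Walks the RIFF chunks in `bytes` and returns the parsed "fmt " chunk,
/// or None if the data is not a WAV file or the chunk is missing/truncated.
pub fn parse_wav_format(bytes: &[u8]) -> Option<WavFormat> {
    if bytes.len() < 12 || &bytes[0..4] != b"RIFF" || &bytes[8..12] != b"WAVE" {
        return None;
    }
    let mut pos = 12usize;
    while pos + 8 <= bytes.len() {
        let id = &bytes[pos..pos + 4];
        let size = u32::from_le_bytes([
            bytes[pos + 4],
            bytes[pos + 5],
            bytes[pos + 6],
            bytes[pos + 7],
        ]) as usize;
        let body = pos + 8;
        if id == b"fmt " {
            if size < 16 || body + 16 > bytes.len() {
                return None;
            }
            let b = &bytes[body..body + 16];
            return Some(WavFormat {
                audio_format: u16::from_le_bytes([b[0], b[1]]),
                channels: u16::from_le_bytes([b[2], b[3]]),
                sample_rate: u32::from_le_bytes([b[4], b[5], b[6], b[7]]),
                bits_per_sample: u16::from_le_bytes([b[14], b[15]]),
            });
        }
        // Skip this chunk; chunk bodies are padded to an even byte count.
        pos = body.checked_add(size)?.checked_add(size & 1)?;
    }
    None
}

/// True when the format matches the input requirement stated above.
pub fn is_supported_input(fmt: &WavFormat) -> bool {
    fmt.audio_format == 1
        && fmt.channels == 1
        && fmt.sample_rate == 16_000
        && fmt.bits_per_sample == 16
}
```

A typical use would be to read the first few kilobytes of the file with std::fs::read, call parse_wav_format, and resample or downmix with an external tool when is_supported_input returns false.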
| Engine | Download |
|---|---|
| Parakeet (int8) | blob.handy.computer / HuggingFace |
| Canary 180M Flash | HuggingFace |
| Canary 1B Flash | HuggingFace |
| Canary 1B v2 | HuggingFace |
| SenseVoice (int8) | blob.handy.computer / sherpa-onnx |
| Moonshine | HuggingFace |
| GigaAM | HuggingFace |
| Whisper (GGML) | HuggingFace |
| Whisperfile binary | GitHub |
Parakeet (directory):
models/parakeet-tdt-0.6b-v3-int8/
├── encoder-model.int8.onnx
├── decoder_joint-model.int8.onnx
├── nemo128.onnx
└── vocab.txt
Canary (directory):
models/canary-1b-v2/
├── encoder-model.int8.onnx
├── decoder-model.int8.onnx
├── nemo128.onnx
└── vocab.txt
SenseVoice (directory):
models/sense-voice/
├── model.int8.onnx
└── tokens.txt
Moonshine (directory):
models/moonshine-base/
├── encoder_model.onnx
├── decoder_model_merged.onnx
└── tokenizer.json
Moonshine Streaming (directory):
models/moonshine-streaming/moonshine-tiny-streaming-en/
├── encoder.onnx
├── decoder.onnx
├── streaming_config.json
└── tokenizer.json
GigaAM (directory):
models/giga-am-v3/
├── model.onnx (or model.int8.onnx)
└── vocab.txt
Whisper: single file (e.g. whisper-medium-q4_1.bin).
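A quick way to catch an incomplete download is to compare a model directory against the layouts listed above before loading. This is a minimal standard-library sketch; missing_files and PARAKEET_INT8_FILES are hypothetical helpers, not crate APIs, and the file names are copied from the Parakeet layout above.

```rust
use std::path::Path;

// File names taken from the Parakeet int8 directory layout listed above.
pub const PARAKEET_INT8_FILES: [&str; 4] = [
    "encoder-model.int8.onnx",
    "decoder_joint-model.int8.onnx",
    "nemo128.onnx",
    "vocab.txt",
];

/// Returns the entries of `required` that are not regular files in `model_dir`,
/// in the same order as `required`.
pub fn missing_files<'a>(model_dir: &Path, required: &[&'a str]) -> Vec<&'a str> {
    required
        .iter()
        .copied()
        .filter(|name| !model_dir.join(name).is_file())
        .collect()
}
```

Calling missing_files(Path::new("models/parakeet-tdt-0.6b-v3-int8"), &PARAKEET_INT8_FILES) returns an empty vector when the directory is complete; the same function works for the other layouts with a different list of names.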
Moonshine variants (MoonshineVariant):
| Variant | Language |
|---|---|
| Tiny | English |
| TinyAr | Arabic |
| TinyZh | Chinese |
| TinyJa | Japanese |
| TinyKo | Korean |
| TinyUk | Ukrainian |
| TinyVi | Vietnamese |
| Base | English |
| BaseEs | Spanish |
Each engine has an example in examples/. Run with the appropriate feature flag:
cargo run --example parakeet --features onnx
cargo run --example canary --features onnx
cargo run --example sense_voice --features onnx
cargo run --example moonshine --features onnx
cargo run --example moonshine_streaming --features onnx
cargo run --example gigaam --features onnx
cargo run --example whisper --features whisper-cpp
cargo run --example whisperfile --features whisperfile
cargo run --example openai --features openai

Tests are also feature-gated. Models must be present locally; tests skip gracefully if not found.
cargo test --features onnx
cargo test --features whisper-cpp
cargo test --features whisperfile
cargo test --all-features

Whisperfile tests look for the binary at models/whisperfile-0.9.3 (override with WHISPERFILE_BIN) and the model at models/ggml-small.bin (override with WHISPERFILE_MODEL). GigaAM tests require samples/russian.wav.
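The override-then-skip pattern described above can be sketched with the standard library alone. The helper names resolve_path and locate_or_skip are hypothetical and this is not the crate's actual test code; it only shows one way an environment variable can take precedence over a default path, with a missing file leading to an early return instead of a failure.

```rust
use std::env;
use std::path::PathBuf;

/// Returns the path stored in `env_key` if that variable is set,
/// otherwise `default_path`.
pub fn resolve_path(env_key: &str, default_path: &str) -> PathBuf {
    env::var_os(env_key)
        .map(PathBuf::from)
        .unwrap_or_else(|| PathBuf::from(default_path))
}

/// Returns the resolved path only if it exists on disk. A test can use the
/// None case to return early, so a missing model is a skip, not a failure.
pub fn locate_or_skip(env_key: &str, default_path: &str) -> Option<PathBuf> {
    let path = resolve_path(env_key, default_path);
    if path.exists() {
        Some(path)
    } else {
        eprintln!("skipping: {} not found", path.display());
        None
    }
}
```

Inside a test this might read as: let Some(bin) = locate_or_skip("WHISPERFILE_BIN", "models/whisperfile-0.9.3") else { return; };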
Development aliases from .cargo/config.toml:
cargo check-all # cargo check --all-features
cargo build-all # cargo build --all-features
cargo test-all    # cargo test --all-features

Parakeet int8 benchmarks:
| Platform | Speed |
|---|---|
| MBP M4 Max | ~30x real-time |
| Zen 3 (5700X) | ~20x real-time |
| Skylake (i5-6500) | ~5x real-time |
| Jetson Nano CPU | ~5x real-time |
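To read the speed column: "Nx real-time" means N seconds of audio are processed per second of wall-clock time, so processing time is audio duration divided by N. The helper below is a hypothetical illustration of that arithmetic, not a crate function. A one-hour recording at ~30x takes about 2 minutes; at ~5x it takes about 12 minutes.

```rust
/// Estimated wall-clock seconds needed to transcribe `audio_secs` of audio
/// at `speed_factor` times real-time. Assumes `speed_factor` is positive.
pub fn estimated_processing_secs(audio_secs: f64, speed_factor: f64) -> f64 {
    audio_secs / speed_factor
}
```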
- istupakov for the ONNX Parakeet, Canary, and GigaAM exports
- NVIDIA for Parakeet and Canary
- whisper.cpp
- jart / Mozilla AI for llamafile and Whisperfile
- UsefulSensors for Moonshine
- FunASR / sherpa-onnx for SenseVoice
- SberDevices for GigaAM