Is your feature request related to a problem? Please describe.
Voice AI and voice-agent products have exploded over the past year: phone agents, IVR replacements, in-app voice assistants, real-time captioning. For India, that wave runs into a hard wall: there is no strong, openly loadable Hindi (or Hindi+English) streaming ASR baseline that teams can pick up from NeMo and ship.
The pattern we keep seeing at companies trying to build Hindi voice agents:
- They start with nemotron-speech-streaming-en-0.6b because it's the best streaming option in NeMo, then realize it's English-only.
- They fall back to commercial APIs (Deepgram / Sarvam / ElevenLabs), which solves the prototype but blows up unit economics at agent scale and gives them no path to domain-tune.
- They try to finetune the English Nemotron themselves and hit the real cost: curating ~3,000h of Hindi, building a bilingual tokenizer, deciding on ITN conventions, running multi-day training, and finally hosting weights somewhere. That is work most product teams can't justify.
The result is that the open Hindi ASR baseline available via NeMo's from_pretrained lags well behind what's possible, and every team ends up redoing the same finetune in private. We'd like to close that gap by contributing a ready-to-load Hindi streaming RNN-T to NeMo's pretrained registry.
Describe the solution you'd like
Register SkunkWorkLabs/varuna-stt — a 0.6B Hindi + English streaming RNN-T finetuned from nvidia/nemotron-speech-streaming-en-0.6b — as a pretrained checkpoint loadable via the standard NeMo wrapper (EncDecRNNTBPEModel), so anyone building a Hindi or Hindi+English voice agent can do:
from nemo.collections.asr.models import EncDecRNNTBPEModel
asr = EncDecRNNTBPEModel.from_pretrained("SkunkWorkLabs/varuna-stt")
print(asr.transcribe(["hindi_clip_16k.wav"]))
# -> "दो लाख पचास हजार रुपये का प्रोजेक्ट है।" (already ITN-formatted)
Model summary
- Architecture: Conformer encoder + RNN-T (EncDecRNNTBPEModel), 0.6B params, 16 kHz mono, streaming-capable (inherited from base).
- Languages: Hindi and English (handled separately — the model transcribes Hindi-only or English-only utterances; in-utterance code-switch is out of scope, see Limitations).
- Tokenizer: bilingual EN-1024 + HI-512 BPE (1,536 tokens) — English vocabulary inherited from the base model is preserved.
- Training data: ~3,000h Hindi (Shrutilipi, IndicVoices, IndicVoices-R, Kathbath, Gramvaani, Vaani, Lahaja, IndicTTS, short-form domain) plus the English data carried through from the base finetune.
- Output: ITN-style Hindi (digits, 1st/3rd, lakh/crore commas, ।/,/?/!), with labels normalized via a Gemma-4 pass following NVIDIA Riva Hindi ITN conventions. This is directly useful for agent pipelines that feed downstream into LLMs and TTS.
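To make the lakh/crore comma convention concrete, here is a minimal sketch of Indian-style digit grouping (last three digits, then pairs). The function name `format_indian_number` is hypothetical and purely illustrative. It is not part of the model or of the Gemma-4 / Riva normalization pass, which produces the formatted text directly.

```python
def format_indian_number(n: int) -> str:
    """Group digits Indian-style: the last three digits, then pairs (lakh, crore)."""
    sign = "-" if n < 0 else ""
    digits = str(abs(n))
    if len(digits) <= 3:
        return sign + digits
    head, tail = digits[:-3], digits[-3:]
    # Split everything before the last three digits into groups of two,
    # working from the right.
    groups = []
    while len(head) > 2:
        groups.insert(0, head[-2:])
        head = head[:-2]
    groups.insert(0, head)
    return sign + ",".join(groups + [tail])


print(format_indian_number(250000))    # 2,50,000
print(format_indian_number(12345678))  # 1,23,45,678
```

Under this convention, an utterance like "दो लाख पचास हजार" would be written as 2,50,000 rather than 250,000.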
Evaluation
Vistaar-style normalized comparison on SkunkWorkLabs/hindi-asr-benchmark.
WER (%) — Varuna vs. commercial APIs
| subset | n | Varuna | ElevenLabs Scribe v1 | Deepgram Nova-2 | Sarvam Saarika v2.5 |
|---|---|---|---|---|---|
| indictts | 98 | 9.75 | 13.20 | 15.41 | 14.71 |
| fleurs (test) | 417 | 17.29 | 11.93 | 21.22 | 15.74 |
| kathbath | 1,929 | 16.82 | 13.32 | 20.55 | 16.62 |
| kathbath_noisy | 1,929 | 19.06 | 13.16 | 21.98 | 17.75 |
| commonvoice | 1,727 | 24.16 | 17.02 | 28.34 | 19.32 |
| mucs | 3,897 | 24.60 | 10.97 | 20.54 | 12.72 |
CER (%)
| subset | Varuna | ElevenLabs | Deepgram | Sarvam |
|---|---|---|---|---|
| indictts | 2.75 | 4.16 | 8.53 | 6.51 |
| fleurs (test) | 7.20 | 5.68 | 16.74 | 7.08 |
| kathbath | 6.36 | 6.50 | 13.53 | 7.42 |
| kathbath_noisy | 8.00 | 5.87 | 14.75 | 7.82 |
| commonvoice | 10.72 | 8.96 | 20.25 | 9.87 |
| mucs | 10.75 | 3.94 | 9.94 | 4.79 |
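For readers less familiar with the metrics, WER and CER are edit-distance ratios: total substitutions, insertions, and deletions divided by the total number of reference units. The sketch below shows that computation in plain Python. It is not the benchmark's scoring script: the Vistaar-style text normalization is not reproduced, and two choices here are assumptions, namely that CER counts Unicode code points including single spaces, and that scores are pooled at corpus level rather than averaged per utterance.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(refs, hyps, unit="word"):
    """Corpus-level error rate in percent: total edits / total reference units."""
    def tokenize(text):
        words = text.split()
        # Assumption: characters are code points of the whitespace-collapsed string.
        return words if unit == "word" else list(" ".join(words))

    edits = total = 0
    for ref, hyp in zip(refs, hyps):
        ref_units, hyp_units = tokenize(ref), tokenize(hyp)
        edits += edit_distance(ref_units, hyp_units)
        total += len(ref_units)
    return 100.0 * edits / total


refs = ["the cat sat"]
hyps = ["the cat sit"]
print(round(error_rate(refs, hyps, unit="word"), 2))  # 33.33
print(round(error_rate(refs, hyps, unit="char"), 2))  # 9.09
```

Because a single wrong matra in Devanagari counts as one character edit but invalidates the whole word, a model can rank differently on CER than on WER for the same subset.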
Inference (H100 PCIe, batch=1, greedy_batch RNN-T): RTFx ≈ 25×, p50 latency 175 ms, p90 362 ms — i.e. fits inside a real-time voice agent's per-turn budget on commodity GPU infra.
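As a reference for how figures of this kind are typically derived, the sketch below computes RTFx and latency percentiles from per-utterance timings. The timing values are made up for illustration and are not the measurements reported above. The aggregate definition of RTFx (total audio seconds over total processing seconds) and the nearest-rank percentile method are assumptions about the measurement procedure.

```python
import math


def rtfx(audio_seconds, processing_seconds):
    """Inverse real-time factor: seconds of audio processed per second of compute."""
    return sum(audio_seconds) / sum(processing_seconds)


def percentile(values, p):
    """Nearest-rank percentile (p in 0..100) of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


# Hypothetical timings, for illustration only.
audio_s = [4.0, 6.0]
processing_s = [0.25, 0.25]
latencies_ms = [120, 150, 180, 240, 400]

print(rtfx(audio_s, processing_s))   # 20.0
print(percentile(latencies_ms, 50))  # 180
print(percentile(latencies_ms, 90))  # 400
```

An RTFx above 1 means the model transcribes faster than real time; the p90 latency is usually the more relevant number for a voice agent's per-turn budget, since it bounds the slow turns rather than the typical one.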
Describe alternatives you've considered
- Loading via from_pretrained("SkunkWorkLabs/varuna-stt") directly from HF already works without any NeMo change, but the model isn't surfaced in list_available_models() or the NeMo docs, so most people building Hindi voice agents on NeMo never find it.
- Existing NeMo Hindi checkpoints (older Conformer-CTC variants) are non-streaming and don't produce ITN output, so they're a poor fit for agent pipelines that need low-latency partials and LLM-friendly text.
Additional context
- Limitations: supports Hindi and English separately, but not in-utterance code-switch (Hindi-English mixed mid-sentence may produce transliteration artifacts); 16 kHz mono only; weaker on codec-degraded telephony audio (see model card).
- Training-data licensing: each source corpus retains its own license; this contribution is weights + tokenizer only, no audio redistribution.
- Reproduction: happy to add a finetune recipe under examples/asr/ and a short load-and-transcribe tutorial showing a "Hindi voice-agent slice" (live transcription → LLM → TTS) if maintainers want them bundled with the registration.
- Maintainer: SkunkWorks Labs (contact: harshris2314@gmail.com).
- Motivation / acknowledgement: this PR exists because @KunalDhawan, in the HF discussion on nvidia/nemotron-speech-streaming-en-0.6b (discussion #2), explicitly invited the community to raise a PR at NeMo to contribute finetuned variants back. Huge thanks to Kunal for the nudge; without that comment this would probably have stayed an internal checkpoint.