
Feature Request : Add Varuna STT (Hindi 0.6B streaming RNN-T) to the NeMo pretrained model registry #15664

@harsh2ai

Is your feature request related to a problem? Please describe.

Voice AI and voice-agent products have exploded over the past year: phone agents, IVR replacements, in-app voice assistants, real-time captioning. For India, that wave runs into a hard wall: there is no strong, openly loadable Hindi (or Hindi+English) streaming ASR baseline that teams can pick up from NeMo and ship.

The pattern we keep seeing at companies trying to build Hindi voice agents:

  1. They start with nemotron-speech-streaming-en-0.6b because it's the best streaming option in NeMo, then realize it's English-only.
  2. They fall back to commercial APIs (Deepgram / Sarvam / ElevenLabs), which solves the prototype but blows up unit economics at agent scale and gives them no path to domain-tune.
  3. They try to finetune the English Nemotron themselves and hit the real cost: curating ~3,000h of Hindi, building a bilingual tokenizer, deciding on ITN conventions, running multi-day training, and finally hosting weights somewhere. That is work most product teams can't justify.

The result is that the open Hindi ASR baseline available via NeMo's from_pretrained lags well behind what's possible, and every team ends up redoing the same finetune in private. We'd like to close that gap by contributing a ready-to-load Hindi streaming RNN-T to NeMo's pretrained registry.

Describe the solution you'd like

Register SkunkWorkLabs/varuna-stt — a 0.6B Hindi + English streaming RNN-T finetuned from nvidia/nemotron-speech-streaming-en-0.6b — as a pretrained checkpoint loadable via the standard NeMo wrapper (EncDecRNNTBPEModel), so anyone building a Hindi or Hindi+English voice agent can do:

from nemo.collections.asr.models import EncDecRNNTBPEModel

asr = EncDecRNNTBPEModel.from_pretrained("SkunkWorkLabs/varuna-stt")
print(asr.transcribe(["hindi_clip_16k.wav"]))
# -> "दो लाख पचास हजार रुपये का प्रोजेक्ट है।"   (already ITN-formatted)

Model summary

  • Architecture: Conformer encoder + RNN-T (EncDecRNNTBPEModel), 0.6B params, 16 kHz mono, streaming-capable (inherited from base).
  • Languages: Hindi and English (handled separately — the model transcribes Hindi-only or English-only utterances; in-utterance code-switch is out of scope, see Limitations).
  • Tokenizer: bilingual EN-1024 + HI-512 BPE (1,536 tokens) — English vocabulary inherited from the base model is preserved.
  • Training data: ~3,000h Hindi (Shrutilipi, IndicVoices, IndicVoices-R, Kathbath, Gramvaani, Vaani, Lahaja, IndicTTS, short-form domain) plus the English data carried through from the base finetune.
  • Output: ITN-style Hindi (digits, 1st/3rd, lakh/crore commas, /,/?/!), labels normalized via a Gemma-4 pass following NVIDIA Riva Hindi ITN conventions — directly useful for agent pipelines that downstream into LLMs and TTS.
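
To make the "lakh/crore commas" convention concrete, here is a minimal sketch of Indian-style digit grouping: the last three digits form one group, and every group to the left of it has two digits. This is an illustration only and not the model's ITN pipeline; `group_indian` is a hypothetical helper name.

```python
def group_indian(n: int) -> str:
    """Format a non-negative integer with lakh/crore comma grouping.

    Hypothetical helper for illustration; not part of NeMo or the model.
    """
    if n < 0:
        raise ValueError("expected a non-negative integer")
    digits = str(n)
    if len(digits) <= 3:
        return digits
    head, tail = digits[:-3], digits[-3:]
    # Split everything left of the last three digits into groups of two,
    # working from the right so a leftover single digit ends up in front.
    groups = []
    while len(head) > 2:
        groups.insert(0, head[-2:])
        head = head[:-2]
    groups.insert(0, head)
    return ",".join(groups + [tail])


print(group_indian(250000))    # 2,50,000  (two lakh fifty thousand)
print(group_indian(12345678))  # 1,23,45,678
```

A downstream LLM or TTS stage that expects this grouping can use the same rule to validate or re-render numbers in the transcript.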

Evaluation

Vistaar-style normalized comparison on SkunkWorkLabs/hindi-asr-benchmark.

WER (%) — Varuna vs. commercial APIs

| Subset | n | Varuna | ElevenLabs Scribe v1 | Deepgram Nova-2 | Sarvam Saarika v2.5 |
|---|---|---|---|---|---|
| indictts | 98 | 9.75 | 13.20 | 15.41 | 14.71 |
| fleurs (test) | 417 | 17.29 | 11.93 | 21.22 | 15.74 |
| kathbath | 1,929 | 16.82 | 13.32 | 20.55 | 16.62 |
| kathbath_noisy | 1,929 | 19.06 | 13.16 | 21.98 | 17.75 |
| commonvoice | 1,727 | 24.16 | 17.02 | 28.34 | 19.32 |
| mucs | 3,897 | 24.60 | 10.97 | 20.54 | 12.72 |

CER (%)

| Subset | Varuna | ElevenLabs | Deepgram | Sarvam |
|---|---|---|---|---|
| indictts | 2.75 | 4.16 | 8.53 | 6.51 |
| fleurs (test) | 7.20 | 5.68 | 16.74 | 7.08 |
| kathbath | 6.36 | 6.50 | 13.53 | 7.42 |
| kathbath_noisy | 8.00 | 5.87 | 14.75 | 7.82 |
| commonvoice | 10.72 | 8.96 | 20.25 | 9.87 |
| mucs | 10.75 | 3.94 | 9.94 | 4.79 |
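
For readers who want to sanity-check numbers like these on their own clips, the sketch below shows how corpus-level WER and CER are conventionally computed: total edit distance divided by total reference length, over words or characters. It assumes both sides have already been normalized; the Vistaar-style normalization used for the tables above is not reproduced here. The names `edit_distance` and `error_rate` are illustrative, and CER here counts Unicode code points rather than grapheme clusters, which may differ from the benchmark's convention.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (row-by-row DP)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(pairs, tokenize):
    """Corpus-level error rate in percent over (reference, hypothesis) pairs."""
    errors = sum(edit_distance(tokenize(r), tokenize(h)) for r, h in pairs)
    total = sum(len(tokenize(r)) for r, _ in pairs)
    if total == 0:
        raise ValueError("references contain no tokens")
    return 100.0 * errors / total


# Made-up example pairs, not taken from the benchmark.
pairs = [
    ("यह एक परीक्षण है", "यह परीक्षण है"),
    ("नमस्ते दुनिया", "नमस्ते दुनिया"),
]
print("WER %:", round(error_rate(pairs, str.split), 2))
print("CER %:", round(error_rate(pairs, list), 2))
```

Summing errors before dividing weights long utterances more heavily than averaging per-utterance rates would, which is the usual choice for benchmark tables.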

Inference (H100 PCIe, batch=1, greedy_batch RNN-T): RTFx ≈ 25×, p50 latency 175 ms, p90 362 ms — i.e. fits inside a real-time voice agent's per-turn budget on commodity GPU infra.
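
As a reference for how figures of this kind are usually derived, here is a minimal sketch, assuming RTFx is total audio duration divided by total processing time and that p50/p90 are nearest-rank percentiles over per-request latencies. The exact measurement procedure behind the numbers above is not specified here, and the sample values in the code are made up.

```python
import math


def rtfx(audio_seconds, processing_seconds):
    """Inverse real-time factor: seconds of audio handled per second of compute."""
    return sum(audio_seconds) / sum(processing_seconds)


def percentile(values, p):
    """Nearest-rank percentile for 0 < p <= 100."""
    if not values:
        raise ValueError("no values given")
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Made-up measurements for illustration only.
audio = [5.0, 10.0, 7.5, 2.5]          # clip durations in seconds
compute = [0.25, 0.5, 0.125, 0.125]    # wall-clock processing time in seconds
latencies_ms = [110, 140, 160, 210, 300]

print("RTFx:", rtfx(audio, compute))            # 25.0
print("p50 ms:", percentile(latencies_ms, 50))  # 160
print("p90 ms:", percentile(latencies_ms, 90))  # 300
```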

Describe alternatives you've considered

  • Loading via from_pretrained("SkunkWorkLabs/varuna-stt") directly from HF already works without any NeMo change, but the model isn't surfaced in list_available_models() or NeMo docs, so most people building Hindi voice agents on NeMo never find it.
  • Existing NeMo Hindi checkpoints (older Conformer-CTC variants) are non-streaming and don't produce ITN output, so they're a poor fit for agent pipelines that need low-latency partials and LLM-friendly text.

Additional context

  • Limitations: supports Hindi and English separately, but not in-utterance code-switch (Hindi-English mixed mid-sentence may produce transliteration artifacts); 16 kHz mono only; weaker on codec-degraded telephony audio (see model card).
  • Training-data licensing: each source corpus retains its own license; this contribution is weights + tokenizer only, no audio redistribution.
  • Reproduction: happy to add a finetune recipe under examples/asr/ and a short load-and-transcribe tutorial showing a "Hindi voice-agent slice" (live transcription → LLM → TTS) if maintainers want them bundled with the registration.
  • Maintainer: SkunkWorks Labs (contact: harshris2314@gmail.com).
  • Motivation / acknowledgement: this request exists because @KunalDhawan, in the HF discussion on nvidia/nemotron-speech-streaming-en-0.6b (discussion #2), explicitly invited the community to raise a PR at NeMo to contribute finetuned variants back. Huge thanks to Kunal for the nudge; without that comment this would probably have stayed an internal checkpoint.
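
Since the limitations above list 16 kHz mono as the only supported input, callers may want to check files before transcription. The sketch below does that for PCM WAV files using only Python's standard `wave` module; `is_16k_mono` and `write_silence` are hypothetical helper names, and resampling non-conforming audio is left out.

```python
import os
import tempfile
import wave


def is_16k_mono(path: str) -> bool:
    """Return True if the PCM WAV file at `path` is 16 kHz, single channel."""
    with wave.open(path, "rb") as wav:
        return wav.getframerate() == 16000 and wav.getnchannels() == 1


def write_silence(path: str, rate: int, channels: int, seconds: float = 0.1) -> None:
    """Write a short 16-bit silent WAV file; used here only to build test inputs."""
    n_frames = int(rate * seconds)
    with wave.open(path, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(2)  # 16-bit samples
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * n_frames * channels)


with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, "good_16k_mono.wav")
    bad = os.path.join(tmp, "bad_44k_stereo.wav")
    write_silence(good, rate=16000, channels=1)
    write_silence(bad, rate=44100, channels=2)
    print(is_16k_mono(good))  # True
    print(is_16k_mono(bad))   # False
```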
