Feature Request: Add Varuna STT (Hindi 0.6B streaming RNN-T) to the NeMo pretrained model registry


## Is your feature request related to a problem? Please describe.

Voice AI and voice-agent products have exploded over the past year: phone agents, IVR replacements, in-app voice assistants, real-time captioning. For India, that wave runs into a hard wall: there is no strong, openly loadable Hindi (or Hindi+English) streaming ASR baseline that teams can pick up from NeMo and ship.

The pattern we keep seeing at companies trying to build Hindi voice agents:

1. They start with `nemotron-speech-streaming-en-0.6b` because it's the best streaming option in NeMo, then realize it's English-only.
2. They fall back to commercial APIs (Deepgram / Sarvam / ElevenLabs), which solves the prototype but blows up unit economics at agent scale and gives them no path to domain-tune.
3. They try to finetune the English Nemotron themselves and hit the real cost: curating ~3,000h of Hindi, building a bilingual tokenizer, deciding on ITN conventions, running multi-day training, and finally hosting weights somewhere, which is work most product teams can't justify.

The result is that the *open* Hindi ASR baseline available via NeMo's `from_pretrained` lags well behind what's possible, and every team ends up redoing the same finetune in private. We'd like to close that gap by contributing a ready-to-load Hindi streaming RNN-T to NeMo's pretrained registry.

## Describe the solution you'd like

Register `SkunkWorkLabs/varuna-stt` — a 0.6B Hindi + English streaming RNN-T finetuned from `nvidia/nemotron-speech-streaming-en-0.6b` — as a pretrained checkpoint loadable via the standard NeMo wrapper (`EncDecRNNTBPEModel`), so anyone building a Hindi or Hindi+English voice agent can do:

```python
from nemo.collections.asr.models import EncDecRNNTBPEModel

asr = EncDecRNNTBPEModel.from_pretrained("SkunkWorkLabs/varuna-stt")
print(asr.transcribe(["hindi_clip_16k.wav"]))
# -> "दो लाख पचास हजार रुपये का प्रोजेक्ट है।"   (already ITN-formatted)
```

- **Model card:** https://huggingface.co/SkunkWorkLabs/varuna-stt
- **Live demo (HF Space):** https://huggingface.co/spaces/SkunkWorkLabs/varuna-stt-demo
- **Benchmark dataset:** https://huggingface.co/datasets/SkunkWorkLabs/hindi-asr-benchmark
- **License:** MIT

### Model summary

- **Architecture:** Conformer encoder + RNN-T (`EncDecRNNTBPEModel`), 0.6B params, 16 kHz mono, streaming-capable (inherited from base).
- **Languages:** Hindi and English (handled separately — the model transcribes Hindi-only or English-only utterances; in-utterance code-switch is out of scope, see Limitations).
- **Tokenizer:** bilingual EN-1024 + HI-512 BPE (1,536 tokens) — English vocabulary inherited from the base model is preserved.
- **Training data:** ~3,000h Hindi (Shrutilipi, IndicVoices, IndicVoices-R, Kathbath, Gramvaani, Vaani, Lahaja, IndicTTS, short-form domain) plus the English data carried through from the base finetune.
- **Output:** ITN-style Hindi (digits, `1st`/`3rd`, lakh/crore commas, `।`/`,`/`?`/`!`). Training labels were normalized via a Gemma-4 pass following NVIDIA Riva Hindi ITN conventions, so the output is directly usable in agent pipelines that feed LLMs and TTS.
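
To make the "lakh/crore commas" convention concrete, here is a minimal sketch of Indian digit grouping (last three digits, then groups of two). The function name `indian_grouping` is hypothetical and this is only an illustration of the number format; it is not the model's, Gemma's, or Riva's ITN implementation.

```python
def indian_grouping(n: int) -> str:
    """Format a non-negative integer with lakh/crore comma grouping.

    Illustrative helper only (hypothetical name); it shows the target
    number format, not how the model or any ITN toolkit produces it.
    """
    if n < 0:
        raise ValueError("expected a non-negative integer")
    digits = str(n)
    if len(digits) <= 3:
        return digits
    # The rightmost group has three digits; every group to its left has two.
    head, tail = digits[:-3], digits[-3:]
    groups = []
    while len(head) > 2:
        groups.insert(0, head[-2:])
        head = head[:-2]
    groups.insert(0, head)
    return ",".join(groups + [tail])


print(indian_grouping(250000))    # two lakh fifty thousand
print(indian_grouping(10000000))  # one crore
```

A downstream LLM or TTS stage that expects Western grouping (`250,000`) would need to be aware of this difference.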

### Evaluation

Vistaar-style normalized comparison on [`SkunkWorkLabs/hindi-asr-benchmark`](https://huggingface.co/datasets/SkunkWorkLabs/hindi-asr-benchmark).

**WER (%) — Varuna vs. commercial APIs**

| subset           | n     | Varuna    | ElevenLabs Scribe v1 | Deepgram Nova-2 | Sarvam Saarika v2.5 |
|------------------|-------|-----------|----------------------|------------------|----------------------|
| indictts         |    98 | **9.75**  | 13.20                | 15.41            | 14.71                |
| fleurs (test)    |   417 | 17.29     | **11.93**            | 21.22            | 15.74                |
| kathbath         | 1,929 | 16.82     | **13.32**            | 20.55            | 16.62                |
| kathbath_noisy   | 1,929 | 19.06     | **13.16**            | 21.98            | 17.75                |
| commonvoice      | 1,727 | 24.16     | **17.02**            | 28.34            | 19.32                |
| mucs             | 3,897 | 24.60     | **10.97**            | 20.54            | 12.72                |

**CER (%)**

| subset           | Varuna   | ElevenLabs | Deepgram | Sarvam   |
|------------------|----------|------------|----------|----------|
| indictts         | **2.75** | 4.16       | 8.53     | 6.51     |
| fleurs (test)    | 7.20     | **5.68**   | 16.74    | 7.08     |
| kathbath         | **6.36** | 6.50       | 13.53    | 7.42     |
| kathbath_noisy   | 8.00     | **5.87**   | 14.75    | 7.82     |
| commonvoice      | 10.72    | **8.96**   | 20.25    | 9.87     |
| mucs             | 10.75    | **3.94**   | 9.94     | 4.79     |
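
For readers who want to sanity-check numbers like the ones above, the following is a minimal sketch of corpus-level WER/CER. Everything in it is an assumption for illustration: the names `normalize`, `edit_distance`, and `error_rate` are hypothetical, the normalization is a simplified stand-in (NFC, lowercasing, stripping the punctuation marks listed in the model summary) rather than the actual Vistaar normalizer, and errors are pooled across utterances, which may differ from how the tables were computed.

```python
import re
import unicodedata

# Simplified stand-in for a real normalizer: drops the punctuation marks
# mentioned in the model summary, plus the ASCII period.
PUNCT = re.compile(r"[।,?!.]")


def normalize(text: str) -> str:
    text = unicodedata.normalize("NFC", text)
    text = PUNCT.sub(" ", text)
    return " ".join(text.lower().split())


def edit_distance(ref, hyp) -> int:
    """Levenshtein distance between two sequences (words or characters)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(refs, hyps, unit="word") -> float:
    """Pooled error rate in percent; unit is 'word' (WER) or 'char' (CER)."""
    errors = total = 0
    for ref, hyp in zip(refs, hyps):
        ref, hyp = normalize(ref), normalize(hyp)
        r = ref.split() if unit == "word" else list(ref)
        h = hyp.split() if unit == "word" else list(hyp)
        errors += edit_distance(r, h)
        total += len(r)
    if total == 0:
        raise ValueError("references contain no tokens")
    return 100.0 * errors / total


refs = ["यह एक परीक्षण है।"]
hyps = ["यह एक परीक्षा है"]
print(error_rate(refs, hyps, unit="word"))
```

Note that the character-level variant here counts Unicode code points (including spaces), so Devanagari combining marks are scored individually; a grapheme-aware CER would give different values.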

**Inference (H100 PCIe, batch=1, greedy_batch RNN-T):** RTFx ≈ 25×, p50 latency 175 ms, p90 362 ms — i.e. fits inside a real-time voice agent's per-turn budget on commodity GPU infra.
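
As a rough guide to how such figures are typically derived from per-clip timings, here is a small sketch. The helper names `rtfx` and `percentile` are hypothetical, the timings are made-up placeholders rather than measurements of Varuna, and two choices are assumptions: RTFx is taken as total audio seconds over total wall-clock seconds, and percentiles use the nearest-rank method. The benchmark above may have used different definitions.

```python
import math
from typing import Sequence


def rtfx(audio_seconds: Sequence[float], wall_seconds: Sequence[float]) -> float:
    """Inverse real-time factor: seconds of audio processed per second of compute."""
    return sum(audio_seconds) / sum(wall_seconds)


def percentile(values: Sequence[float], q: float) -> float:
    """Nearest-rank percentile for q in (0, 100]."""
    if not values:
        raise ValueError("no values given")
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]


# Placeholder timings for illustration only; not measurements of any model.
audio = [10.0, 15.0]                      # clip durations in seconds
wall = [0.5, 0.5]                         # wall-clock transcription time in seconds
latencies_ms = [100, 150, 175, 200, 400]  # per-request latencies in milliseconds

print(rtfx(audio, wall))
print(percentile(latencies_ms, 50), percentile(latencies_ms, 90))
```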

## Describe alternatives you've considered

- **Loading via `from_pretrained("SkunkWorkLabs/varuna-stt")` directly from HF** already works without any NeMo change, but the model isn't surfaced in `list_available_models()` or NeMo docs, so most people building Hindi voice agents on NeMo never find it.
- **Existing NeMo Hindi checkpoints** (older Conformer-CTC variants) are non-streaming and don't produce ITN output, so they're a poor fit for agent pipelines that need low-latency partials and LLM-friendly text.

## Additional context

- **Limitations:** supports Hindi and English separately, but **not in-utterance code-switch** (Hindi-English mixed mid-sentence may produce transliteration artifacts); 16 kHz mono only; weaker on codec-degraded telephony audio (see model card).
- **Training-data licensing:** each source corpus retains its own license; this contribution is weights + tokenizer only, no audio redistribution.
- **Reproduction:** happy to add a finetune recipe under `examples/asr/` and a short load-and-transcribe tutorial showing a "Hindi voice-agent slice" (live transcription → LLM → TTS) if maintainers want them bundled with the registration.
- **Maintainer:** SkunkWorks Labs (contact: harshris2314@gmail.com).
- **Motivation / acknowledgement:** this request exists because @KunalDhawan, in the HF discussion on `nvidia/nemotron-speech-streaming-en-0.6b` ([discussion #2](https://huggingface.co/nvidia/nemotron-speech-streaming-en-0.6b/discussions/2#6964a6ebeacb0d73bdce046f)), explicitly invited the community to raise a PR against NeMo contributing finetuned variants back. Huge thanks to Kunal for the nudge; without that comment this would probably have stayed an internal checkpoint.
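
Regarding the 16 kHz mono constraint listed under Limitations, callers may want to validate input before transcription. A minimal sketch using only the Python standard library follows; `is_16k_mono` and `write_silence` are hypothetical helper names, and the check only covers PCM WAV files that the `wave` module can open.

```python
import os
import tempfile
import wave


def is_16k_mono(path: str) -> bool:
    """Return True if the WAV file at `path` is 16 kHz, single-channel."""
    with wave.open(path, "rb") as wav:
        return wav.getframerate() == 16000 and wav.getnchannels() == 1


def write_silence(path: str, sample_rate: int, channels: int, seconds: float = 0.1) -> None:
    """Write a short silent 16-bit PCM WAV file (demo helper only)."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        wav.setframerate(sample_rate)
        wav.writeframes(b"\x00\x00" * channels * int(sample_rate * seconds))


with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, "good.wav")
    bad = os.path.join(tmp, "bad.wav")
    write_silence(good, 16000, 1)
    write_silence(bad, 44100, 2)
    print(is_16k_mono(good), is_16k_mono(bad))
```

Files that fail the check would need resampling and downmixing with an audio tool of your choice before being passed to the model.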

