Instructions to use Tonykip/nemotron-3.5-swahili-streaming-asr with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- NeMo
How to use Tonykip/nemotron-3.5-swahili-streaming-asr with NeMo:
import nemo.collections.asr as nemo_asr asr_model = nemo_asr.models.ASRModel.from_pretrained("Tonykip/nemotron-3.5-swahili-streaming-asr") transcriptions = asr_model.transcribe(["file.wav"]) - Notebooks
- Google Colab
- Kaggle
Nemotron 3.5 ASR — Swahili (streaming)
A low-latency streaming Swahili ASR model: NVIDIA's cache-aware FastConformer-RNNT
(nvidia/nemotron-3.5-asr-streaming-0.6b, 0.6B) fine-tuned to add Swahili, a language the
base model does not officially support. Unlike Whisper (a 30-second-window batch model), this
model transcribes incrementally with sub-100 ms latency — for live captioning, voice agents,
and dictation.
This checkpoint is a weight-average ("soup") of the 5 best fine-tuning checkpoints, each selected on the held-out FLEURS Swahili test set (not the training-time dev metric).
Results — FLEURS sw_ke test (n=487, identical streaming pipeline, lowercase+punct-strip)
| attention context (lookahead) | WER | CER |
|---|---|---|
[56,13] (~480 ms) |
0.298 | 0.106 |
[56,0] (~0 ms, lowest latency) |
0.322 | 0.115 |
base model [56,13] (no Swahili FT) |
1.147 | 0.860 |
The base model is effectively unusable for Swahili (100% WER, degenerate 30% WER / ~11% CER** — within the range NVIDIA reports for its
officially supported languages. For reference, offline (non-streaming) Whisper-large-v3 (1.5B)
on the same test set is in a similar WER range but cannot stream.⁇ loops);
fine-tuning brings it to **
Intended use & limitations
Good for: streaming/live Swahili transcription, voice agents (with an LLM downstream), live captioning, dictation aids, gisting. Not for: unattended verbatim transcription (legal/medical) without review.
Limitations:
- Training data is formal, image-prompt-elicited standard ("Sanifu") Kenyan Swahili with noun-level English code-switching — not Sheng / spontaneous speech. Expect higher WER on Sheng-heavy or noisy spontaneous audio. FLEURS (read news) is a relatively favorable test.
- 0.6B streaming model: offline larger models are more accurate; the value here is latency.
- Pick
att_context_sizefor your latency/accuracy budget: more lookahead ([56,13]) is more accurate;[56,0]is lowest latency.
Usage (NeMo)
import nemo.collections.asr as nemo_asr
m = nemo_asr.models.ASRModel.restore_from("nemotron-3.5-swahili-streaming-0.6b.nemo")
# Streaming inference: use NeMo's
# examples/asr/asr_cache_aware_streaming/speech_to_text_cache_aware_streaming_infer.py
# with target_lang=sw-KE, att_context_size=[56,13] (or [56,0]), decoder_type=rnnt.
Training
- Base:
nvidia/nemotron-3.5-asr-streaming-0.6b(FastConformer-RNNT, cache-aware streaming). - Data: ~382 h Kenyan Swahili from
badrex/swahili-speech-400hr(Afrivoice / DigitalUmuganda, CC-BY-4.0); speaker-disjoint dev split. A 30 h clean subset is atTonykip/kenyan-swahili-asr-clean. - Recipe: full fine-tune (all 637M params), AdamW, bf16, Noam schedule (warmup 2000), fused RNNT joint; pretrained tokenizer reused (vocab 13087). Best checkpoint = top-5 weight-average, selected on FLEURS.
ONNX / edge deployment
ONNX exports are in onnx/ for CPU / desktop / edge streaming via ONNX Runtime
(Rust ort crate; parakeet-rs implements the streaming chunk + RNNT greedy loop):
encoder-swahili.int8.onnx(+_data, ~650 MB) — int8 encoder; the macOS/edge build (4× smaller than fp32)encoder-swahili.onnx(+_data, ~2.4 GB) — fp32 masterdecoder_joint-swahili.onnx— RNNT predictor + joint (keep fp32)prompt_kernel.pt+prompt_info.json— the Swahili language prompt (sw-KE:48), applied host-side (the exported encoder is prompt-less)onnx/INTEGRATION.md— cache I/O + prompt + streaming-loop spec
Graphs are NeMo check_trace-verified (ONNX≈PyTorch). End-to-end streaming WER is validated in the
runtime. GGUF is not applicable (ggml/whisper.cpp is Whisper-only) — ONNX Runtime is the CPU/edge path.
License & attribution
Derivative of nvidia/nemotron-3.5-asr-streaming-0.6b under OpenMDW-1.1 — see the base model
for license terms. Training data: badrex/swahili-speech-400hr (CC-BY-4.0, Afrivoice/DigitalUmuganda).
Please attribute both NVIDIA and the data authors.
- Downloads last month
- 33
Model tree for Tonykip/nemotron-3.5-swahili-streaming-asr
Base model
nvidia/nemotron-3.5-asr-streaming-0.6b