
Nemotron 3.5 ASR — Swahili (streaming)

A low-latency streaming Swahili ASR model: NVIDIA's cache-aware FastConformer-RNNT (nvidia/nemotron-3.5-asr-streaming-0.6b, 0.6B) fine-tuned to add Swahili, a language the base model does not officially support. Unlike Whisper, a batch model that processes audio in 30-second windows, this model transcribes incrementally with sub-100 ms latency, which suits live captioning, voice agents, and dictation.

This checkpoint is a weight-average ("soup") of the 5 best fine-tuning checkpoints, each selected on the held-out FLEURS Swahili test set (not the training-time dev metric).
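The averaging step can be sketched as below. This is a minimal illustration and not the author's script: it assumes each checkpoint is a mapping from parameter name to a flat list of floats (a real NeMo checkpoint holds tensors), and `average_checkpoints` is a hypothetical helper name.

```python
from typing import Dict, List

# Assumption: each parameter is flattened to a list of floats for illustration.
StateDict = Dict[str, List[float]]


def average_checkpoints(checkpoints: List[StateDict]) -> StateDict:
    """Return the element-wise mean of checkpoints that share names and shapes."""
    if not checkpoints:
        raise ValueError("need at least one checkpoint")
    keys = checkpoints[0].keys()
    for ckpt in checkpoints[1:]:
        if ckpt.keys() != keys:
            raise ValueError("checkpoints must share the same parameter names")

    averaged: StateDict = {}
    for name in keys:
        vectors = [ckpt[name] for ckpt in checkpoints]
        if len({len(v) for v in vectors}) != 1:
            raise ValueError(f"shape mismatch for parameter {name!r}")
        # Uniform weights: every checkpoint contributes equally to the soup.
        averaged[name] = [sum(vals) / len(vectors) for vals in zip(*vectors)]
    return averaged
```

With five checkpoints, each parameter in the result is the plain mean of its five values; no checkpoint is weighted above the others in this sketch.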

Results — FLEURS sw_ke test (n=487, identical streaming pipeline, lowercase+punct-strip)

| attention context (lookahead) | WER | CER |
|---|---|---|
| [56,13] (~480 ms) | 0.298 | 0.106 |
| [56,0] (~0 ms, lowest latency) | 0.322 | 0.115 |
| base model [56,13] (no Swahili FT) | 1.147 | 0.860 |

The base model is effectively unusable for Swahili (WER above 100%, degenerate loops); fine-tuning brings it to about 30% WER / ~11% CER at [56,13], within the range NVIDIA reports for its officially supported languages. For reference, offline (non-streaming) Whisper-large-v3 (1.5B) on the same test set is in a similar WER range but cannot stream.
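For readers who want to reproduce the metric, here is a minimal sketch of WER with lowercase and punctuation-strip normalization. It is not the evaluation script used for this card. Two assumptions: the score is pooled over the whole corpus (total errors divided by total reference words), and punctuation means Python's `string.punctuation`, which also removes apostrophes.

```python
import string
from typing import List

# Assumption: "punct-strip" removes ASCII punctuation, apostrophes included.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def normalize(text: str) -> List[str]:
    """Lowercase, strip punctuation, and split on whitespace."""
    return text.lower().translate(_PUNCT_TABLE).split()


def edit_distance(ref: List[str], hyp: List[str]) -> int:
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def corpus_wer(references: List[str], hypotheses: List[str]) -> float:
    """Total word errors divided by total reference words."""
    errors = 0
    ref_words = 0
    for ref_text, hyp_text in zip(references, hypotheses):
        ref = normalize(ref_text)
        errors += edit_distance(ref, normalize(hyp_text))
        ref_words += len(ref)
    return errors / ref_words
```

Because insertions count as errors, WER can exceed 1.0 when a hypothesis is much longer than its reference, which is consistent with a looping model scoring above 100%.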

Intended use & limitations

Good for: streaming/live Swahili transcription, voice agents (with an LLM downstream), live captioning, dictation aids, gisting. Not for: unattended verbatim transcription (legal/medical) without review.

Limitations:

  • Training data is formal, image-prompt-elicited standard ("Sanifu") Kenyan Swahili with noun-level English code-switching — not Sheng / spontaneous speech. Expect higher WER on Sheng-heavy or noisy spontaneous audio. FLEURS (read news) is a relatively favorable test.
  • 0.6B streaming model: offline larger models are more accurate; the value here is latency.
  • Pick att_context_size for your latency/accuracy budget: more lookahead ([56,13]) is more accurate; [56,0] is lowest latency.
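The trade-off in the last bullet can be written as a small selection helper. This is a sketch that uses only the two operating points reported in this card; `pick_att_context` and the `CONTEXTS` table are hypothetical names, and lookahead is only one component of end-to-end latency.

```python
from typing import List, Tuple

# (att_context_size, approximate lookahead in ms, FLEURS WER), from this card.
CONTEXTS: List[Tuple[List[int], float, float]] = [
    ([56, 13], 480.0, 0.298),
    ([56, 0], 0.0, 0.322),
]


def pick_att_context(max_lookahead_ms: float) -> List[int]:
    """Return the lowest-WER context whose lookahead fits the given budget."""
    candidates = [c for c in CONTEXTS if c[1] <= max_lookahead_ms]
    if not candidates:
        raise ValueError("no attention context fits the lookahead budget")
    return min(candidates, key=lambda c: c[2])[0]
```

For example, a 500 ms budget selects [56,13], while a 100 ms budget falls back to [56,0].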

Usage (NeMo)

import nemo.collections.asr as nemo_asr

# Load the fine-tuned checkpoint from a local .nemo file.
model = nemo_asr.models.ASRModel.restore_from(restore_path="nemotron-3.5-swahili-streaming-0.6b.nemo")
model.eval()  # switch to inference mode

# Streaming inference: use NeMo's
# examples/asr/asr_cache_aware_streaming/speech_to_text_cache_aware_streaming_infer.py
# with target_lang=sw-KE, att_context_size=[56,13] (or [56,0]), decoder_type=rnnt.

Training

  • Base: nvidia/nemotron-3.5-asr-streaming-0.6b (FastConformer-RNNT, cache-aware streaming).
  • Data: ~382 h Kenyan Swahili from badrex/swahili-speech-400hr (Afrivoice / DigitalUmuganda, CC-BY-4.0); speaker-disjoint dev split. A 30 h clean subset is at Tonykip/kenyan-swahili-asr-clean.
  • Recipe: full fine-tune (all 637M params), AdamW, bf16, Noam schedule (warmup 2000), fused RNNT joint; pretrained tokenizer reused (vocab 13087). Best checkpoint = top-5 weight-average, selected on FLEURS.
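The Noam schedule named in the recipe ramps the learning rate linearly during warmup and then decays it with the inverse square root of the step. The sketch below uses the standard formula with the card's warmup of 2000. The `d_model` and `scale` values are assumptions for illustration; the card does not state them.

```python
def noam_lr(step: int, warmup_steps: int = 2000,
            d_model: int = 1024, scale: float = 1.0) -> float:
    """Standard Noam learning rate: linear warmup, then 1/sqrt(step) decay.

    d_model and scale are illustrative assumptions, not values from the card.
    """
    if step < 1:
        raise ValueError("step must be >= 1")
    warmup_term = step * warmup_steps ** -1.5  # grows linearly with step
    decay_term = step ** -0.5                  # inverse square-root decay
    return scale * d_model ** -0.5 * min(warmup_term, decay_term)
```

The two terms meet at `step == warmup_steps`, so the rate peaks at step 2000. It is half the peak at step 1000 (mid-warmup) and again at step 8000 (four times the warmup).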

ONNX / edge deployment

ONNX exports are in onnx/ for CPU / desktop / edge streaming via ONNX Runtime (Rust ort crate; parakeet-rs implements the streaming chunk + RNNT greedy loop):

  • encoder-swahili.int8.onnx (+_data, ~650 MB) — int8 encoder; the macOS/edge build (4× smaller than fp32)
  • encoder-swahili.onnx (+_data, ~2.4 GB) — fp32 master
  • decoder_joint-swahili.onnx — RNNT predictor + joint (keep fp32)
  • prompt_kernel.pt + prompt_info.json — the Swahili language prompt (sw-KE:48), applied host-side (the exported encoder is prompt-less)
  • onnx/INTEGRATION.md — cache I/O + prompt + streaming-loop spec

Graphs are NeMo check_trace-verified (ONNX≈PyTorch). End-to-end streaming WER is validated in the runtime. GGUF is not applicable: whisper.cpp, which is built on ggml, supports only Whisper models, so ONNX Runtime is the CPU/edge path.
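The RNNT greedy loop that a runtime has to implement around these graphs can be sketched in plain Python. This shows the general technique and is not the parakeet-rs code. The `predict` and `joint` callables are hypothetical stand-ins for the decoder_joint graph, whose actual inputs and outputs are specified in onnx/INTEGRATION.md, and using the blank id as the start token is an assumption.

```python
from typing import Any, Callable, List, Optional, Sequence, Tuple

Carry = Tuple[Any, Any]  # (last predictor output, predictor state)


def rnnt_greedy_chunk(
    frames: Sequence[Any],
    predict: Callable[[int, Any], Tuple[Any, Any]],  # hypothetical predictor step
    joint: Callable[[Any, Any], Sequence[float]],    # hypothetical joint network
    blank_id: int,
    carry: Optional[Carry] = None,
    max_symbols_per_frame: int = 10,
) -> Tuple[List[int], Carry]:
    """Greedy-decode one chunk of encoder frames, returning tokens and carry."""
    if carry is None:
        # Assumption: the blank id doubles as the start-of-sequence token.
        dec_out, state = predict(blank_id, None)
    else:
        dec_out, state = carry

    tokens: List[int] = []
    for frame in frames:
        # Emit symbols on this frame until blank wins or the cap is reached.
        for _ in range(max_symbols_per_frame):
            logits = joint(frame, dec_out)
            best = max(range(len(logits)), key=logits.__getitem__)
            if best == blank_id:
                break  # blank: advance to the next encoder frame
            tokens.append(best)
            dec_out, state = predict(best, state)
    return tokens, (dec_out, state)
```

Passing the returned carry into the next call keeps the predictor state continuous across streaming chunks, so a chunk boundary does not reset the label history.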

License & attribution

Derivative of nvidia/nemotron-3.5-asr-streaming-0.6b under OpenMDW-1.1 — see the base model for license terms. Training data: badrex/swahili-speech-400hr (CC-BY-4.0, Afrivoice/DigitalUmuganda). Please attribute both NVIDIA and the data authors.
