Distil-Whisper: Robust Knowledge Distillation via Large-Scale Pseudo Labelling
Abstract
Distil-Whisper, a smaller and faster variant of the Whisper model, achieves nearly the same performance with fewer resources and is optimized for low-latency environments.
As the size of pre-trained speech recognition models increases, running these large models in low-latency or resource-constrained environments becomes challenging. In this work, we leverage pseudo-labelling to assemble a large-scale open-source dataset which we use to distill the Whisper model into a smaller variant, called Distil-Whisper. Using a simple word error rate (WER) heuristic, we select only the highest quality pseudo-labels for training. The distilled model is 5.8 times faster with 51% fewer parameters, while performing to within 1% WER on out-of-distribution test data in a zero-shot transfer setting. Distil-Whisper maintains the robustness of the Whisper model to difficult acoustic conditions, while being less prone to hallucination errors on long-form audio. Distil-Whisper is designed to be paired with Whisper for speculative decoding, yielding a 2 times speed-up while mathematically ensuring the same outputs as the original model. To facilitate further research in this domain, we make our training code, inference code and models publicly accessible.
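The WER-based selection of pseudo-labels mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the details of the heuristic, so everything specific here is an assumption made for illustration: that each sample carries both a ground-truth transcript and a teacher-generated pseudo-label, that both are lightly normalized before comparison, and that a sample is kept when its WER falls at or below a threshold. The names `normalize`, `filter_pseudo_labels`, the dictionary keys, and the `max_wer=0.10` default are hypothetical and are not taken from the released training code.

```python
import re


def normalize(text: str) -> str:
    """Lowercase and strip punctuation (a simplified, assumed normalizer)."""
    return re.sub(r"[^\w\s']", "", text.lower()).strip()


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        return 0.0 if not hyp else 1.0
    # prev[j] is the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def filter_pseudo_labels(samples, max_wer=0.10):
    """Keep samples whose pseudo-label is within max_wer of the ground truth.

    `samples` is a list of dicts with hypothetical keys "ground_truth" and
    "pseudo_label"; the threshold value is an assumption, not a paper setting.
    """
    kept = []
    for sample in samples:
        wer = word_error_rate(
            normalize(sample["ground_truth"]),
            normalize(sample["pseudo_label"]),
        )
        if wer <= max_wer:
            kept.append(sample)
    return kept


samples = [
    {"ground_truth": "the cat sat on the mat", "pseudo_label": "The cat sat on the mat."},
    {"ground_truth": "the cat sat on the mat", "pseudo_label": "a dog sat on a mat"},
]
kept = filter_pseudo_labels(samples)
```

In this toy input the first pseudo-label matches the reference after normalization and is kept, while the second differs in three of six words and is discarded. A production pipeline would more likely rely on an established WER implementation and the same text normalizer used for evaluation.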
