Training Domain Draft Models for Speculative Decoding: Best Practices and Insights
Abstract
Knowledge distillation techniques improve speculative decoding efficiency for domain-specific large language models by training domain-specific draft models, with offline and white-box distillation showing superior performance.
Speculative decoding is an effective method for accelerating inference of large language models (LLMs) by employing a small draft model to predict the output of a target model. However, when adapting speculative decoding to domain-specific target models, the acceptance rate of the generic draft model drops significantly due to domain shift. In this work, we systematically investigate knowledge distillation techniques for training domain draft models to improve their speculation accuracy. We compare white-box and black-box distillation approaches and explore their effectiveness in various data accessibility scenarios, including historical user queries, curated domain data, and synthetically generated alignment data. Our experiments across Function Calling, Biology, and Chinese domains show that offline distillation consistently outperforms online distillation by 11% to 25%, white-box distillation surpasses black-box distillation by 2% to 10%, and data scaling trends hold across domains. Additionally, we find that synthetic data can effectively align draft models and achieve 80% to 93% of the performance of training on historical user queries. These findings provide practical guidelines for training domain-specific draft models to improve speculative decoding efficiency.
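The white-box versus black-box distinction can be illustrated with a minimal sketch for a single token position. The abstract does not state the exact loss functions, so the choices here are assumptions: forward KL divergence over the full vocabulary for the white-box case (which requires the target model's logits) and negative log-likelihood of the target's emitted token for the black-box case (which requires only its output text). The function names are hypothetical. The last function computes the per-token acceptance probability under standard speculative sampling, which equals the sum over the vocabulary of min(p, q) and is the quantity that draft alignment aims to raise.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def white_box_loss(target_logits, draft_logits):
    # Assumed white-box objective: forward KL(target || draft),
    # which needs the target model's full output distribution.
    p = softmax(target_logits)
    q = softmax(draft_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def black_box_loss(target_token, draft_logits):
    # Assumed black-box objective: negative log-likelihood of the token
    # the target actually emitted; no target logits are required.
    q = softmax(draft_logits)
    return -math.log(q[target_token])

def acceptance_probability(target_logits, draft_logits):
    # Probability that one drafted token is accepted under standard
    # speculative sampling: sum over the vocabulary of min(p, q).
    p = softmax(target_logits)
    q = softmax(draft_logits)
    return sum(min(pi, qi) for pi, qi in zip(p, q))

if __name__ == "__main__":
    target = [2.0, 0.5, -1.0, 0.0]   # toy target logits (illustrative only)
    generic = [0.0, 2.0, 0.5, -1.0]  # toy misaligned draft logits
    aligned = [1.8, 0.6, -0.9, 0.1]  # toy draft logits after alignment
    print(acceptance_probability(target, generic))
    print(acceptance_probability(target, aligned))
```

In this toy setup, the draft whose distribution lies closer to the target's has both a lower white-box loss and a higher acceptance probability, which is the link between distillation and speculation accuracy that the abstract describes.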