Kimi-Linear-48B-A3B-DFlash-240k
A DFlash speculative-decoding drafter for moonshotai/Kimi-Linear-48B-A3B-Instruct — the first public Kimi-Linear DFlash drafter.
Headline
| Metric | Value |
|---|---|
| Target | Kimi-Linear-48B-A3B-Instruct (48B total / 3B active MoE, hybrid linear attention: KDA layers interleaved with MLA layers) |
| Drafter | 5-layer softmax DFlash, hidden 2304, target_layer_ids=[3,7,15,19,23], block 16, vocab 163840, mask_token_id 163839 |
| Training data | 240k target-matched math examples (Moonlight556/kimi-linear-48b-a3b-target-matched-math-240k) |
| Training compute | 1 epoch, 59,867 optimizer steps, 12h41m wall on 2× B200 (FSDP-2) |
| Math500 / N=64 offline acceptance length | 5.2106 (greedy, no thinking; block_size=16) |
Comparable benchmarks
Offline mean acceptance length on Math500 (N=64, shuffle seed 0, greedy, no thinking) across the la-draftery training family:
| Drafter | Target | Train data | Offline accept length |
|---|---|---|---|
| la-draftery Phase 1 | Qwen3.5-0.8B | 25k target-matched × 2ep | 3.99 |
| la-draftery Phase 1.1 | Qwen3.5-0.8B | 25k target-matched × 6ep | 5.02 |
| la-draftery Phase 1.2 | Qwen3.5-0.8B | 240k target-matched × 1ep | 5.71 |
| Phase 2 30k proof | Kimi-Linear-48B-A3B | 30k target-matched × 1ep | 2.36 |
| Phase 2 240k (this model) | Kimi-Linear-48B-A3B | 240k target-matched × 1ep | 5.21 |
Offline acceptance length is teacher-forced (draft argmax vs the target's own greedy continuation, parallel block-wise pass). Serving (online) acceptance via SGLang spec-v2 is a separate measurement and may differ — see la-draftery's docs/015 and docs/016 for the Phase 1.2 serving story.
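As a rough illustration of the teacher-forced metric described above, the sketch below compares draft argmax tokens against the target's greedy continuation one block at a time and averages the length of the matching prefix. The function names, the non-overlapping block layout, and the optional `bonus` term are assumptions made for illustration; the actual counting convention used by tools/bench/bench_dflash.py may differ.

```python
from typing import Sequence


def accepted_prefix_len(draft_block: Sequence[int], target_block: Sequence[int]) -> int:
    """Count leading positions where the draft token equals the target token."""
    n = 0
    for d, t in zip(draft_block, target_block):
        if d != t:
            break  # the first mismatch ends the accepted run
        n += 1
    return n


def mean_acceptance_length(
    draft_tokens: Sequence[int],
    target_tokens: Sequence[int],
    block_size: int = 16,
    bonus: int = 0,  # hypothetical knob: set to 1 if the convention counts a bonus token
) -> float:
    """Average accepted prefix length over consecutive, non-overlapping blocks."""
    lengths = []
    for start in range(0, len(target_tokens), block_size):
        d = draft_tokens[start:start + block_size]
        t = target_tokens[start:start + block_size]
        lengths.append(accepted_prefix_len(d, t) + bonus)
    if not lengths:
        raise ValueError("target_tokens must not be empty")
    return sum(lengths) / len(lengths)


# Toy example: two blocks of four tokens each.
target = [0, 1, 2, 3, 4, 5, 6, 7]
draft = [0, 1, 9, 3, 4, 5, 6, 7]  # block 1 matches 2 tokens, block 2 matches all 4
print(mean_acceptance_length(draft, target, block_size=4))  # 3.0
```

Because every block is scored against the target's own continuation, a mismatch in one block does not change what the drafter sees in the next. That is one reason the offline number can differ from online serving acceptance.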
Architecture & infrastructure
The Kimi-Linear target needs an architectural bridge that the Qwen3.5 path doesn't: its custom forward returns outputs.hidden_states=None, so DFlash can't grab per-layer hidden states the standard way. We capture them via forward hooks instead — see la-draftery/specforge/modeling/target/dflash_target_model_kda.py and the full reproducibility recipe at docs/017.
The drafter architecture itself (5-layer softmax DFlash, causal_head=False) is unchanged from Phase 1.2; only the target-side capture is different.
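The forward-hook capture described above can be sketched with PyTorch's `register_forward_hook`. This is a minimal illustration on a toy stack of linear layers, not the code in dflash_target_model_kda.py. `ToyTarget`, `capture_hidden_states`, and the `model.layers` attribute are hypothetical names, and the layer count of 24 is chosen only so that the listed layer ids are valid.

```python
import torch
from torch import nn


class ToyTarget(nn.Module):
    """Stand-in for a target whose forward returns only the final output."""

    def __init__(self, num_layers: int = 24, hidden: int = 8):
        super().__init__()
        self.layers = nn.ModuleList(nn.Linear(hidden, hidden) for _ in range(num_layers))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for layer in self.layers:
            x = layer(x)
        return x  # no per-layer hidden states are exposed


def capture_hidden_states(model: nn.Module, layer_ids, inputs: torch.Tensor):
    """Run one forward pass and return (final_output, [hidden state per layer id])."""
    captured = {}
    handles = []

    def make_hook(idx):
        def hook(module, args, output):
            # Real decoder layers often return a tuple; keep the first element.
            hs = output[0] if isinstance(output, tuple) else output
            captured[idx] = hs.detach()
        return hook

    for idx in layer_ids:
        handles.append(model.layers[idx].register_forward_hook(make_hook(idx)))
    try:
        with torch.no_grad():
            final = model(inputs)
    finally:
        for h in handles:
            h.remove()  # always detach hooks so later passes are unaffected
    return final, [captured[i] for i in layer_ids]


torch.manual_seed(0)
model = ToyTarget()
layer_ids = [3, 7, 15, 19, 23]
x = torch.randn(2, 5, 8)  # (batch, sequence, hidden)
final, states = capture_hidden_states(model, layer_ids, x)
print(len(states), states[0].shape)
```

Removing the hooks in a `finally` block keeps the target model clean if the forward pass raises, which matters when the same target instance is reused across training steps.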
Files
| File | Purpose |
|---|---|
| config.json | DFlash drafter config |
| dflash.py | Drafter modeling code (matches Phase 1.2) |
| model.safetensors | Drafter weights (902 MB) |
Usage
See la-draftery/recipes/train_phase2_kimi_30k_proof.sh for the training launcher (the 240k run uses the same recipe with DATA_PATH pointing at the full 240k jsonl), and tools/bench/bench_dflash.py for offline benchmarking.
License
Apache 2.0.