Kimi-Linear-48B-A3B-DFlash-240k

A DFlash speculative-decoding drafter for moonshotai/Kimi-Linear-48B-A3B-Instruct — the first public Kimi-Linear DFlash drafter.

Headline

| Metric | Value |
| --- | --- |
| Target | Kimi-Linear-48B-A3B-Instruct (48B total / 3B active MoE; hybrid linear attention, KDA layers interleaved with MLA) |
| Drafter | 5-layer softmax DFlash, hidden 2304, target_layer_ids=[3,7,15,19,23], block 16, vocab 163840, mask_token_id 163839 |
| Training data | 240k target-matched math examples (Moonlight556/kimi-linear-48b-a3b-target-matched-math-240k) |
| Training compute | 1 epoch, 59,867 optimizer steps, 12h41m wall-clock on 2× B200 (FSDP-2) |
| Math500 / N=64 offline acceptance length | 5.2106 (greedy, no thinking; block_size=16) |
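
As a rough sanity check on the training-compute row, the quoted step count and wall-clock time can be turned into per-step figures. This is only back-of-the-envelope arithmetic: it assumes "240k" means exactly 240,000 examples and that every example was consumed once in the single epoch, neither of which is stated above.

```python
# Figures quoted in the Headline table.
optimizer_steps = 59_867
wall_seconds = 12 * 3600 + 41 * 60  # 12h41m of wall-clock time

# Assumption (not stated in the card): "240k" is exactly 240,000 examples,
# each seen once during the single epoch.
num_examples = 240_000

seconds_per_step = wall_seconds / optimizer_steps
examples_per_step = num_examples / optimizer_steps

print(f"{seconds_per_step:.3f} s per optimizer step")
print(f"{examples_per_step:.2f} examples per optimizer step")
```

Under those assumptions the run averages roughly 0.76 s per optimizer step and about 4 examples per step, which would correspond to a small global batch spread across the two GPUs.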

Comparable benchmarks

Offline mean-acceptance-length on Math500 (N=64, shuffle seed 0, greedy, no thinking) across the la-draftery training family:

| Drafter | Target | Train data | Offline accept length |
| --- | --- | --- | --- |
| la-draftery Phase 1 | Qwen3.5-0.8B | 25k target-matched × 2ep | 3.99 |
| la-draftery Phase 1.1 | Qwen3.5-0.8B | 25k target-matched × 6ep | 5.02 |
| la-draftery Phase 1.2 | Qwen3.5-0.8B | 240k target-matched × 1ep | 5.71 |
| Phase 2 30k proof | Kimi-Linear-48B-A3B | 30k target-matched × 1ep | 2.36 |
| Phase 2 240k (this model) | Kimi-Linear-48B-A3B | 240k target-matched × 1ep | 5.21 |

Offline acceptance length is teacher-forced: the drafter's argmax is compared against the target's own greedy continuation in a parallel block-wise pass. Serving (online) acceptance via SGLang spec-v2 is a separate measurement and may differ; see la-draftery's docs/015 and docs/016 for the Phase 1.2 serving story.
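
To make the teacher-forced metric concrete, here is a minimal sketch of how a mean acceptance length could be computed from two token sequences. It is not the code in tools/bench/bench_dflash.py; the function names, the non-overlapping block layout, and the `bonus` parameter are all assumptions for illustration. Conventions differ on whether the one token the target always contributes per verification step is counted, so `bonus` is left configurable and defaults to 0.

```python
from typing import Sequence


def block_accept_length(draft: Sequence[int], target: Sequence[int]) -> int:
    """Count how many leading draft tokens match the target tokens."""
    matched = 0
    for d, t in zip(draft, target):
        if d != t:
            break
        matched += 1
    return matched


def mean_accept_length(
    draft_tokens: Sequence[int],
    target_tokens: Sequence[int],
    block_size: int = 16,
    bonus: int = 0,
) -> float:
    """Average accepted-prefix length over non-overlapping blocks.

    draft_tokens:  drafter argmax at each position (teacher-forced).
    target_tokens: the target's own greedy continuation.
    bonus:         hypothetical knob; set to 1 to also count the token the
                   target itself supplies at each verification step.
    """
    if len(draft_tokens) != len(target_tokens):
        raise ValueError("draft and target must have the same length")
    if not target_tokens:
        raise ValueError("need at least one token")

    lengths = []
    for start in range(0, len(target_tokens), block_size):
        end = start + block_size
        accepted = block_accept_length(
            draft_tokens[start:end], target_tokens[start:end]
        )
        lengths.append(accepted + bonus)
    return sum(lengths) / len(lengths)


# Toy example with block_size=4: the first block diverges at its 4th token,
# the second block matches completely.
draft = [1, 2, 3, 9, 5, 6, 7, 8]
target = [1, 2, 3, 4, 5, 6, 7, 8]
print(mean_accept_length(draft, target, block_size=4))
```

Because every block is scored against the target's own continuation rather than against tokens the drafter itself produced, a mismatch in one block does not shift the inputs of the next block, which is one reason the offline number can differ from what a serving stack observes.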

Architecture & infrastructure

The Kimi-Linear target needs an architectural bridge that the Qwen3.5 path doesn't: its custom forward returns outputs.hidden_states=None, so DFlash can't read per-layer hidden states from the model output the standard way. We capture them via forward hooks instead; see la-draftery/specforge/modeling/target/dflash_target_model_kda.py for the implementation and docs/017 for the full reproducibility recipe.
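
The general hook-based capture pattern can be sketched with PyTorch's `register_forward_hook`. This is an illustration under stated assumptions, not the contents of dflash_target_model_kda.py: `ToyLayer`, `ToyTarget`, and `capture_hidden_states` are hypothetical stand-ins, the toy model exposes its layers as `model.layers`, and each layer id is taken to mean the output of that layer.

```python
import torch
from torch import nn


class ToyLayer(nn.Module):
    """Stand-in decoder layer that returns a tuple, as many real layers do."""

    def __init__(self, hidden: int) -> None:
        super().__init__()
        self.proj = nn.Linear(hidden, hidden)

    def forward(self, x: torch.Tensor):
        return (x + self.proj(x),)


class ToyTarget(nn.Module):
    """Stand-in target whose forward exposes only the final hidden state."""

    def __init__(self, num_layers: int = 24, hidden: int = 8) -> None:
        super().__init__()
        self.layers = nn.ModuleList([ToyLayer(hidden) for _ in range(num_layers)])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for layer in self.layers:
            x = layer(x)[0]
        return x


def capture_hidden_states(model: ToyTarget, layer_ids, x: torch.Tensor):
    """Run one forward pass and collect the outputs of the selected layers."""
    captured = {}
    handles = []

    def make_hook(idx: int):
        def hook(module, args, output):
            # Unwrap tuple outputs; keep only the hidden-state tensor.
            hidden = output[0] if isinstance(output, tuple) else output
            captured[idx] = hidden.detach()

        return hook

    for idx in layer_ids:
        handles.append(model.layers[idx].register_forward_hook(make_hook(idx)))
    try:
        with torch.no_grad():
            final = model(x)
    finally:
        # Always remove hooks so later forward passes are unaffected.
        for handle in handles:
            handle.remove()
    return final, captured


torch.manual_seed(0)
model = ToyTarget()
inputs = torch.randn(2, 5, 8)  # (batch, sequence, hidden)
final, captured = capture_hidden_states(model, [3, 7, 15, 19, 23], inputs)
print(sorted(captured), captured[3].shape)
```

Removing the hook handles in a `finally` block keeps the target model unchanged for any caller that runs it afterwards, even if the forward pass raises.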

The drafter architecture itself (5-layer softmax DFlash, causal_head=False) is unchanged from Phase 1.2; only the target-side capture is different.

Files

| File | Purpose |
| --- | --- |
| config.json | DFlash drafter config |
| dflash.py | Drafter modeling code (matches Phase 1.2) |
| model.safetensors | Drafter weights (902 MB) |

Usage

See la-draftery/recipes/train_phase2_kimi_30k_proof.sh for the training launcher (the 240k run uses the same recipe with DATA_PATH pointing at the full 240k jsonl), and tools/bench/bench_dflash.py for offline benchmarking.

License

Apache 2.0.
