[Bug] Three stability failure modes on RTX 5090 (consumer Blackwell, SM120) — illegal instruction, CUDA-graph-replay crash, and silent engine hang
[Bug] Three stability failure modes on RTX 5090 (consumer Blackwell, SM120) — illegal instruction, CUDA-graph-replay crash, and silent engine hang
Summary
Running nvidia/Qwen3.6-35B-A3B-NVFP4 (ModelOpt MIXED_PRECISION) on a single RTX 5090 (consumer Blackwell, SM120 / compute capability 12.0) with vLLM, the model loads and serves correctly, but exhibits three distinct stability failures under real/concurrent load. Two have working workarounds; the third (a silent engine hang) is only mitigated, not fixed. All three appear specific to the SM120 NVFP4-MoE + FlashInfer + CUDA-graph stack. Filing here so the kernels/recipe can be hardened for consumer Blackwell.
Environment
- GPU: NVIDIA GeForce RTX 5090, 32 GB, SM120 (sm_120a/f), single GPU (TP=1)
- Driver: 580.159.03 (Open Kernel), CUDA 13
- vLLM: nightly
0.22.1rc1.dev26+g4721bb3aa(Dockervllm/vllm-openai:nightly) - PyTorch: 2.11.0+cu130
- Model:
nvidia/Qwen3.6-35B-A3B-NVFP4,--quantization modelopt_mixed - MoE backend selected: MARLIN (FlashInfer NvFp4 MoE backends report "configuration not supported" for this modelopt scheme)
- Attention: TRITON_ATTN; FP8 attention via FlashInferFP8ScaledMMLinearKernel
- KV:
--kv-cache-dtype fp8 --no-calculate-kv-scales
Failure mode 1 — cudaErrorIllegalInstruction (FIXED with a workaround)
Under long-context requests (~13k prompt tokens) the engine dies with:
torch.AcceleratorError: CUDA error: an illegal instruction was encountered (cudaErrorIllegalInstruction)
Root cause: FlashInfer auto-detects SM major ≥ 9 and compiles compute_120a. The 120a arch does not support the TMA-WS grouped-GEMM tactics used by the NVFP4 path, so the kernel hits an illegal instruction at runtime on SM120.
Workaround that fixes it: export FLASHINFER_CUDA_ARCH_LIST=12.0f (force the full-feature 120f target, available with CUDA 13). 0 occurrences after applying.
Ask: FlashInfer/vLLM should auto-select 120f (not 120a) on consumer Blackwell, so users don't need this env var.
Failure mode 2 — EngineDeadError at CUDA-graph replay (mitigated)
Under concurrent structured-output extraction load (xgrammar, ~24 concurrent, mixed prompt lengths), the engine dies. Stack trace root:
File "<unknown>", in cudaGraphLaunch
File "<unknown>", in at::cuda::CUDAGraph::replay()
→ vllm.v1.engine.exceptions.EngineDeadError
This is not the illegal instruction (mode 1 was already fixed); it surfaces at CUDA graph replay. Reproduced ~13× over ~2h of concurrent load.
Mitigation: --max-cudagraph-capture-size 64 (the full 1–192 capture range was unstable). Survives the same load that previously crashed.
Ask: investigate why captured graphs (sizes > 64) fail at replay on SM120 NVFP4-MoE.
Failure mode 3 — Silent engine hang / deadlock (only mitigated — the real blocker)
Under sustained load, the engine silently deadlocks:
/healthkeeps returning 200 (misleading — the endpoint doesn't probe generation)- All in-flight requests stay in
num_requests_running(e.g. 18) and never complete vllm:generation_tokens_totalis frozen- GPU shows 100% utilization but only ~120 W (a dead spin-loop; a healthy load draws ~350 W) — i.e. a stuck/spinning kernel, not a crash
- No error is logged — the process does not die, it just stops producing tokens
- Recovery requires a manual
systemctl restart/ container restart
Observed at least twice/day under light beta load. This is the same SM120 deadlock class reported elsewhere (vLLM forum: "Deadlock on RTX 5090 Blackwell", Blackwell PCIe worker-init hangs). Community reports indicate --enforce-eager (disabling CUDA graphs entirely) avoids it — strongly implying the CUDA-graph machinery is the culprit — but that costs ~40× single-stream throughput on this hybrid GDN+MoE architecture (≈7 tok/s vs ≈280 tok/s with graphs), which is not viable.
Currently testing: --cudagraph-mode PIECEWISE (drop the full end-to-end decode graph, keep piecewise) as a speed-preserving middle ground.
Ask (the important one): please harden the CUDA-graph replay path for SM120 NVFP4-MoE so the model is stable with graphs, OR document a stable graph configuration. --enforce-eager should not be the only stable option, because the hybrid GDN+MoE arch is pathologically slow without graphs.
Minimal repro
- Serve
nvidia/Qwen3.6-35B-A3B-NVFP4on an RTX 5090 with vLLM nightly,--quantization modelopt_mixed, CUDA graphs ON (default),--kv-cache-dtype fp8. - (If not set, mode 1 fires first.) Set
FLASHINFER_CUDA_ARCH_LIST=12.0f. - Drive sustained concurrent load (e.g. ~24 concurrent structured-output requests, mixed lengths, several minutes).
- Observe: EngineDeadError at
cudaGraphLaunch(mode 2) and/or a silent hang (mode 3,/health=200 + frozengeneration_tokens_total+ GPU 100%/~120 W).
What would help
- Native SM120 NVFP4-MoE kernels (so MARLIN fallback isn't required), or a supported FlashInfer/CUTLASS path for the modelopt MIXED_PRECISION scheme on
120f. - A CUDA-graph replay path validated on SM120 under concurrency.
- An official "stable on RTX 5090" serving recipe in the model card.
Happy to provide full logs, the exact compose, and repro scripts.