[Bug] Three stability failure modes on RTX 5090 (consumer Blackwell, SM120) — illegal instruction, CUDA-graph-replay crash, and silent engine hang

#9
by YoRandom - opened

Summary

nvidia/Qwen3.6-35B-A3B-NVFP4 (ModelOpt MIXED_PRECISION) loads and serves correctly with vLLM on a single RTX 5090 (consumer Blackwell, SM120 / compute capability 12.0), but shows three distinct stability failures under real concurrent load. Two have working workarounds; the third (a silent engine hang) is only mitigated, not fixed. All three appear specific to the SM120 NVFP4-MoE + FlashInfer + CUDA-graph stack. Filing here so the kernels/recipe can be hardened for consumer Blackwell.

Environment

  • GPU: NVIDIA GeForce RTX 5090, 32 GB, SM120 (sm_120a/f), single GPU (TP=1)
  • Driver: 580.159.03 (Open Kernel), CUDA 13
  • vLLM: nightly 0.22.1rc1.dev26+g4721bb3aa (Docker vllm/vllm-openai:nightly)
  • PyTorch: 2.11.0+cu130
  • Model: nvidia/Qwen3.6-35B-A3B-NVFP4, --quantization modelopt_mixed
  • MoE backend selected: MARLIN (FlashInfer NvFp4 MoE backends report "configuration not supported" for this modelopt scheme)
  • Attention: TRITON_ATTN; FP8 attention via FlashInferFP8ScaledMMLinearKernel
  • KV: --kv-cache-dtype fp8 --no-calculate-kv-scales

Failure mode 1 — cudaErrorIllegalInstruction (FIXED with a workaround)

Under long-context requests (~13k prompt tokens) the engine dies with:

torch.AcceleratorError: CUDA error: an illegal instruction was encountered (cudaErrorIllegalInstruction)

Root cause: FlashInfer auto-detects SM major ≥ 9 and compiles compute_120a. The 120a arch does not support the TMA-WS grouped-GEMM tactics used by the NVFP4 path, so the kernel hits an illegal instruction at runtime on SM120.
Workaround that fixes it: export FLASHINFER_CUDA_ARCH_LIST=12.0f (forces the full-feature 120f target, available with CUDA 13). The error has not recurred since applying it.
Ask: FlashInfer/vLLM should auto-select 120f (not 120a) on consumer Blackwell, so users don't need this env var.
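Until that auto-selection exists, a launcher can apply the workaround conditionally. The following is a minimal sketch, not FlashInfer's or vLLM's own logic: the helper name flashinfer_env is hypothetical, and it assumes the caller passes in the device's compute capability as a (major, minor) tuple.

```python
import os


def flashinfer_env(capability, base_env=None):
    """Return a copy of the environment with the SM120 workaround applied.

    `capability` is a (major, minor) compute-capability tuple; the
    RTX 5090 in this report is (12, 0). Hypothetical helper, not part
    of FlashInfer or vLLM.
    """
    env = dict(os.environ if base_env is None else base_env)
    if tuple(capability) == (12, 0):
        # setdefault leaves an explicit user override untouched.
        env.setdefault("FLASHINFER_CUDA_ARCH_LIST", "12.0f")
    return env
```

The capability can be read with torch.cuda.get_device_capability(0), and the returned dict passed as env= to subprocess when starting the server, so the variable is set before FlashInfer compiles anything.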

Failure mode 2 — EngineDeadError at CUDA-graph replay (mitigated)

Under concurrent structured-output extraction load (xgrammar, ~24 concurrent, mixed prompt lengths), the engine dies. Stack trace root:

File "<unknown>", in cudaGraphLaunch
File "<unknown>", in at::cuda::CUDAGraph::replay()
→ vllm.v1.engine.exceptions.EngineDeadError

This is distinct from the illegal instruction in mode 1, whose workaround was already applied; it surfaces at CUDA graph replay. Reproduced ~13× over ~2h of concurrent load.
Mitigation: --max-cudagraph-capture-size 64 (the full 1–192 capture range was unstable). With this cap, the engine survives the same load that previously crashed it.
Ask: investigate why captured graphs (sizes > 64) fail at replay on SM120 NVFP4-MoE.
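For anyone reproducing this, the serving flags quoted in this report can be assembled in one place. This is a rough sketch that only builds the argument list; build_serve_args is a hypothetical helper, and the flags are copied from this report rather than checked against every vLLM version.

```python
def build_serve_args(model="nvidia/Qwen3.6-35B-A3B-NVFP4", max_capture_size=64):
    """Build a `vllm serve` argument list from the flags in this report.

    Pass max_capture_size=None to keep the default capture range
    (the configuration that crashed at graph replay here).
    """
    args = [
        "vllm", "serve", model,
        "--quantization", "modelopt_mixed",
        "--kv-cache-dtype", "fp8",
        "--no-calculate-kv-scales",
    ]
    if max_capture_size is not None:
        # Cap the largest captured CUDA graph size (the mode 2 mitigation).
        args += ["--max-cudagraph-capture-size", str(max_capture_size)]
    return args
```

The list can be handed to subprocess.run(build_serve_args()). Toggling max_capture_size between 64 and None gives an A/B comparison of the mitigated and crashing configurations.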

Failure mode 3 — Silent engine hang / deadlock (only mitigated — the real blocker)

Under sustained load, the engine silently deadlocks:

  • /health keeps returning 200 (misleading — the endpoint doesn't probe generation)
  • All in-flight requests stay in num_requests_running (e.g. 18) and never complete
  • vllm:generation_tokens_total is frozen
  • GPU shows 100% utilization but draws only ~120 W (a healthy load draws ~350 W), which points to a stuck or spinning kernel rather than a crash
  • No error is logged — the process does not die, it just stops producing tokens
  • Recovery requires a manual systemctl restart / container restart
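Because /health stays green, the signature above has to be detected from the metrics themselves. Below is a minimal watchdog sketch under stated assumptions: it takes the text of the /metrics endpoint, assumes the gauge is exported as vllm:num_requests_running alongside vllm:generation_tokens_total, and uses a deliberately simple line parser. HangDetector and read_metric are hypothetical names, and the 60-second threshold is arbitrary.

```python
import re

_SAMPLE = re.compile(r"^([A-Za-z_:][A-Za-z0-9_:]*)(\{[^}]*\})?\s+(\S+)")


def read_metric(metrics_text, name):
    """Sum all samples of one metric in Prometheus text format.

    Simplified parser: assumes label values do not contain '}'.
    Returns None if the metric is absent.
    """
    total, found = 0.0, False
    for line in metrics_text.splitlines():
        if line.startswith("#"):
            continue  # skip HELP/TYPE comment lines
        m = _SAMPLE.match(line)
        if m and m.group(1) == name:
            total += float(m.group(3))
            found = True
    return total if found else None


class HangDetector:
    """Flags a hang when requests are running but no tokens are produced."""

    def __init__(self, stall_seconds=60.0):
        self.stall_seconds = stall_seconds
        self._last_tokens = None
        self._last_progress = None

    def update(self, now, metrics_text):
        """Feed one scrape; return True once the stall threshold is reached."""
        tokens = read_metric(metrics_text, "vllm:generation_tokens_total")
        running = read_metric(metrics_text, "vllm:num_requests_running")
        if tokens is None or running is None:
            return False  # cannot judge without both metrics
        progressed = self._last_tokens is None or tokens != self._last_tokens
        if progressed or running == 0:
            # Either tokens advanced or the engine is legitimately idle.
            self._last_tokens = tokens
            self._last_progress = now
            return False
        return (now - self._last_progress) >= self.stall_seconds
```

A supervisor would scrape /metrics every few seconds, call update(time.monotonic(), text), and trigger the restart when it returns True.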

Observed at least twice a day under light beta load. This is the same SM120 deadlock class reported elsewhere (vLLM forum: "Deadlock on RTX 5090 Blackwell", Blackwell PCIe worker-init hangs). Community reports indicate that --enforce-eager (disabling CUDA graphs entirely) avoids it, which strongly implies the CUDA-graph machinery is the culprit. However, it cuts single-stream throughput by ~40× on this hybrid GDN+MoE architecture (≈7 tok/s vs ≈280 tok/s with graphs), which is not viable.
Currently testing: --cudagraph-mode PIECEWISE (drop the full end-to-end decode graph, keep piecewise) as a speed-preserving middle ground.
Ask (the important one): please harden the CUDA-graph replay path for SM120 NVFP4-MoE so the model is stable with graphs, OR document a stable graph configuration. --enforce-eager should not be the only stable option, because the hybrid GDN+MoE arch is pathologically slow without graphs.

Minimal repro

  1. Serve nvidia/Qwen3.6-35B-A3B-NVFP4 on an RTX 5090 with vLLM nightly, --quantization modelopt_mixed, CUDA graphs ON (default), --kv-cache-dtype fp8.
  2. Set FLASHINFER_CUDA_ARCH_LIST=12.0f in the server's environment before it starts; without it, mode 1 fires first.
  3. Drive sustained concurrent load (e.g. ~24 concurrent structured-output requests, mixed lengths, several minutes).
  4. Observe: EngineDeadError at cudaGraphLaunch (mode 2) and/or a silent hang (mode 3, /health=200 + frozen generation_tokens_total + GPU 100%/~120 W).
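The load in step 3 can be approximated with a bounded-concurrency driver. This is a minimal sketch, not the repro script used for this report: drive_load is a hypothetical helper, and send stands for any async callable that posts one request to the server (for example via an async HTTP client) and returns its result.

```python
import asyncio


async def drive_load(send, payloads, concurrency=24):
    """Run `send(payload)` for every payload with at most `concurrency` in flight.

    Results come back in payload order; a failed request yields its
    exception object instead of aborting the whole run.
    """
    sem = asyncio.Semaphore(concurrency)

    async def one(payload):
        async with sem:  # blocks while `concurrency` requests are in flight
            try:
                return await send(payload)
            except Exception as exc:
                return exc

    return await asyncio.gather(*(one(p) for p in payloads))
```

Calling asyncio.run(drive_load(send, payloads)) in a loop for several minutes gives the sustained pressure described above. Wrapping send in asyncio.wait_for adds a per-request timeout, so a mode 3 hang shows up as timeouts instead of blocking the driver forever.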

What would help

  • Native SM120 NVFP4-MoE kernels (so MARLIN fallback isn't required), or a supported FlashInfer/CUTLASS path for the modelopt MIXED_PRECISION scheme on 120f.
  • A CUDA-graph replay path validated on SM120 under concurrency.
  • An official "stable on RTX 5090" serving recipe in the model card.

Happy to provide full logs, the exact compose, and repro scripts.
