[Bug] Three stability failure modes on RTX 5090 (consumer Blackwell, SM120) — illegal instruction, CUDA-graph-replay crash, and silent engine hang

by YoRandom - opened 5 days ago

[Bug] Three stability failure modes on RTX 5090 (consumer Blackwell, SM120) — illegal instruction, CUDA-graph-replay crash, and silent engine hang

Summary

Running nvidia/Qwen3.6-35B-A3B-NVFP4 (ModelOpt MIXED_PRECISION) on a single RTX 5090 (consumer Blackwell, SM120 / compute capability 12.0) with vLLM, the model loads and serves correctly, but exhibits three distinct stability failures under real/concurrent load. Two have working workarounds; the third (a silent engine hang) is only mitigated, not fixed. All three appear specific to the SM120 NVFP4-MoE + FlashInfer + CUDA-graph stack. Filing here so the kernels/recipe can be hardened for consumer Blackwell.

Environment

GPU: NVIDIA GeForce RTX 5090, 32 GB, SM120 (sm_120a/f), single GPU (TP=1)
Driver: 580.159.03 (Open Kernel), CUDA 13
vLLM: nightly 0.22.1rc1.dev26+g4721bb3aa (Docker vllm/vllm-openai:nightly)
PyTorch: 2.11.0+cu130
Model: nvidia/Qwen3.6-35B-A3B-NVFP4, --quantization modelopt_mixed
MoE backend selected: MARLIN (FlashInfer NvFp4 MoE backends report "configuration not supported" for this modelopt scheme)
Attention: TRITON_ATTN; FP8 attention via FlashInferFP8ScaledMMLinearKernel
KV: --kv-cache-dtype fp8 --no-calculate-kv-scales

Failure mode 1 — `cudaErrorIllegalInstruction` (FIXED with a workaround)

Under long-context requests (~13k prompt tokens) the engine dies with:

torch.AcceleratorError: CUDA error: an illegal instruction was encountered (cudaErrorIllegalInstruction)

Root cause: FlashInfer auto-detects SM major ≥ 9 and compiles compute_120a. The 120a arch does not support the TMA-WS grouped-GEMM tactics used by the NVFP4 path, so the kernel hits an illegal instruction at runtime on SM120.
Workaround that fixes it: export FLASHINFER_CUDA_ARCH_LIST=12.0f (force the full-feature 120f target, available with CUDA 13). 0 occurrences after applying.
Ask: FlashInfer/vLLM should auto-select 120f (not 120a) on consumer Blackwell, so users don't need this env var.

Failure mode 2 — `EngineDeadError` at CUDA-graph replay (mitigated)

Under concurrent structured-output extraction load (xgrammar, ~24 concurrent, mixed prompt lengths), the engine dies. Stack trace root:

File "<unknown>", in cudaGraphLaunch
File "<unknown>", in at::cuda::CUDAGraph::replay()
→ vllm.v1.engine.exceptions.EngineDeadError

This is not the illegal instruction (mode 1 was already fixed); it surfaces at CUDA graph replay. Reproduced ~13× over ~2h of concurrent load.
Mitigation: --max-cudagraph-capture-size 64 (the full 1–192 capture range was unstable). Survives the same load that previously crashed.
Ask: investigate why captured graphs (sizes > 64) fail at replay on SM120 NVFP4-MoE.

Failure mode 3 — Silent engine hang / deadlock (only mitigated — the real blocker)

Under sustained load, the engine silently deadlocks:

/health keeps returning 200 (misleading — the endpoint doesn't probe generation)
All in-flight requests stay in num_requests_running (e.g. 18) and never complete
vllm:generation_tokens_total is frozen
GPU shows 100% utilization but only ~120 W (a dead spin-loop; a healthy load draws ~350 W) — i.e. a stuck/spinning kernel, not a crash
No error is logged — the process does not die, it just stops producing tokens
Recovery requires a manual systemctl restart / container restart

Observed at least twice/day under light beta load. This is the same SM120 deadlock class reported elsewhere (vLLM forum: "Deadlock on RTX 5090 Blackwell", Blackwell PCIe worker-init hangs). Community reports indicate --enforce-eager (disabling CUDA graphs entirely) avoids it — strongly implying the CUDA-graph machinery is the culprit — but that costs ~40× single-stream throughput on this hybrid GDN+MoE architecture (≈7 tok/s vs ≈280 tok/s with graphs), which is not viable.
Currently testing: --cudagraph-mode PIECEWISE (drop the full end-to-end decode graph, keep piecewise) as a speed-preserving middle ground.
Ask (the important one): please harden the CUDA-graph replay path for SM120 NVFP4-MoE so the model is stable with graphs, OR document a stable graph configuration. --enforce-eager should not be the only stable option, because the hybrid GDN+MoE arch is pathologically slow without graphs.

Minimal repro

Serve nvidia/Qwen3.6-35B-A3B-NVFP4 on an RTX 5090 with vLLM nightly, --quantization modelopt_mixed, CUDA graphs ON (default), --kv-cache-dtype fp8.
(If not set, mode 1 fires first.) Set FLASHINFER_CUDA_ARCH_LIST=12.0f.
Drive sustained concurrent load (e.g. ~24 concurrent structured-output requests, mixed lengths, several minutes).
Observe: EngineDeadError at cudaGraphLaunch (mode 2) and/or a silent hang (mode 3, /health=200 + frozen generation_tokens_total + GPU 100%/~120 W).

What would help

Native SM120 NVFP4-MoE kernels (so MARLIN fallback isn't required), or a supported FlashInfer/CUTLASS path for the modelopt MIXED_PRECISION scheme on 120f.
A CUDA-graph replay path validated on SM120 under concurrency.
An official "stable on RTX 5090" serving recipe in the model card.

Happy to provide full logs, the exact compose, and repro scripts.

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment