Nemotron-3-Ultra-550B-A55B NVFP4 on 4x ASUS GX10 — bench
Date: 2026-06-13
Hardware: 4x ASUS Ascent GX10 (NVIDIA GB10 Blackwell, 128 GiB unified memory each, ConnectX-7 200 GbE fabric), TP=4 + expert-parallel + Ray
Container image: vllm-node:latest (base image, not the vllm-node-mimo variant) — see Reproducibility — Docker image below for the exact build steps, pinned versions, and the mods that have to be patched in
vLLM: 0.22.1rc1.dev124+gace95c9cf.d20260603.cu132 (local wheel, image id 75429f413d11, built 2026-06-05)
Model: nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-NVFP4 (HF), local path /root/.cache/huggingface/nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-NVFP4, ~329 GiB per node
Architecture: LatentMoE + Mamba-2 + attention hybrid, 512 routed experts top-22
MTP: nemotron_h_mtp, num_speculative_tokens=3 (multi-head MTP — unlike Qwen3.5's single-head MTP, MTP-3 is comfortable here)
KV cache: fp8; max_model_len=262144; max_num_seqs=6; max_num_batched_tokens=8192
GPU memory budget: 108 GiB/node (GX10 unified memory firm cap is 110 GiB; mod enforces 0.5 GiB margin against the tightest node's 111.27 GiB)
API: http://<head-node>:8000
Bench client: ../resources/bench.py, streaming via requests + iter_lines() with stream_options={"include_usage": True}
Field notes
Nemotron-3-Ultra was the "somebody said this should not fit here, so I had to try" run. I had read that this was not a model you should expect to run on a 4x GX10 cluster, and the satisfying part was getting it to boot anyway: TP=4, expert parallel, Ray, fp8 KV, the Nemotron parser patches, and a tight raw-GiB memory budget all lining up well enough to serve a 550B-class NVFP4 model from the Spark nodes.
In day-to-day use, though, it does not obviously feel better than the
Qwen3.5-397B baseline. Tool calling was fine, and the 200K retrieval test
was clean, but the model has different tradeoffs rather than a clear
quality win. The LatentMoE + Mamba-2 hybrid spends memory and compute on
state that helps the architecture scale and keep long-context behavior
stable, while this Spark setup is decode-throughput constrained and caps
max_num_seqs at 6. Qwen397's working MTP-2 head gives it a large
interactive-speed advantage, so Nemotron can be technically impressive
without feeling like the better default assistant on this hardware.
For full cluster + image setup details shared with the Qwen3.5-397B post: see
../resources/INFRA.md.Companion files: launch script, recipe yaml, mod patches, bench harness, and wheel provenance are checked into
../resources/so each script + patch is reproducible as a file you can copy.
Provenance
Prior art — where the launch flags came from
Nemotron-3 Super (released 2026-03-11) and Ultra (released 2026-06-04) are the same architecture family (LatentMoE Mamba-2 + MoE + attention hybrid with MTP). We didn't write the launch flags from the Ultra model card alone — we started from the Super deployment docs and adjusted for the Ultra's larger expert count and tighter per-rank memory budget. Authoritative sources used:
- vLLM blog — Nemotron-3-Super launch post (2026-03-11): https://vllm.ai/blog/2026-03-11-nemotron-3-super — explains LatentMoE's compressed-dim routing (d=4096 → ℓ=1024, ~4× less all-to-all traffic), the
--mamba-backend flashinferrequirement, thenemotron_h_mtpspeculator, and--reasoning-parser nemotron_v3/--tool-call-parser qwen3_coderwiring. - NVIDIA Spark Deployment Guide — Nemotron-3-Super: https://docs.nvidia.com/nemotron/nightly/usage-cookbook/Nemotron-3-Super/SparkDeploymentGuide/README.html — the canonical GB10 vLLM launch flags. Closest official doc to our 4x GX10 cluster; our recipe is essentially the Super single-node command extended to TP=4 + EP across Ray.
- NVIDIA Developer Forums — Nemotron-3-Super NVFP4 on Acer GB10: https://forums.developer.nvidia.com/t/nemotron-3-super-120b-a12b-nvfp4-with-vllm-v0-22-0-on-1x-acer-gb10-with-495-71-05-driver-container-is-ubuntu-24-04-cuda-13-2-1-gxx11/371853 — community thread on single-node GB10 troubleshooting (same SM121 as ours); useful sanity check on the vLLM 0.22 + cu132 driver/container combination.
- NVIDIA vLLM release notes: https://docs.nvidia.com/deeplearning/frameworks/vllm-release-notes/index.html — for verifying which release added each Nemotron-specific parser / backend (
--reasoning-parser nemotron_v3,--mamba-backend flashinfer,--speculative-config nemotron_h_mtp).
Our spark-vllm-docker/mods/nemotron-ultra mod is a fork of the
mods/nemotron-super mod with the arch registration switched to the
Ultra config — Super was already running on the cluster when Ultra
shipped, so the recipe was derived rather than rebuilt from scratch.
Sampling for both tests: temperature=0.0. Prompts are tokenized via vLLM's /tokenize endpoint so the input token count is exact.
Benchmarks
1. Concurrency sweep — 10k in / 1024 out
Each request: ~9 818 input tokens (built from varied English filler) + an instruction to continue the narrative; max_tokens=1024. All concurrent requests fire simultaneously (synchronized via a threading.Barrier). Per-request and aggregate throughput reported.
Note on N choice: this run sweeps N ∈ {1, 2, 4, 6, 8}. The Qwen blog sweeps up to N=16; Nemotron's
max_num_seqs=6makes N>6 informative as a queue-saturation point rather than additional throughput.
Aggregate throughput (single-run, fp8 KV, MTP-3 active):
| N | Wall (s) | Agg prefill (t/s) | Agg decode (t/s) | Median TTFT (s) | Median per-req decode (t/s) |
|---|---|---|---|---|---|
| 1 | 67.3 | 701 | 19.3 | 14.28 | 19.3 |
| 2 | 72.9 | 3 889 | 28.8 | 5.14 | 15.4 |
| 4 | 82.4 | 8 568 | 52.6 | 4.67 | 13.5 |
| 6 | 96.2 | 9 477 | 68.3 | 6.33 | 11.7 |
| 8 | 169.1 | 810 ⚠ | 50.8 | 7.97 | 11.5 |
⚠ N=8 exceeds max_num_seqs=6; two of the eight requests had to wait for a free slot (their TTFT jumps to 92-99 s — see the N=8 table below). That inflates the wall window and collapses the apparent "aggregate prefill" rate. Aggregate decode also regresses vs N=6 because the queued requests stall the wall window while contributing no decode tokens until they're admitted.
Per-request totals (input always 10 006 tokens, output always 1 024 tokens):
N=1
| req | TTFT (s) | duration (s) | prefill (t/s) | decode (t/s) |
|---|---|---|---|---|
| 0 | 14.28 | 67.33 | 701 | 19.3 |
N=2
| req | TTFT (s) | duration (s) | prefill (t/s) | decode (t/s) |
|---|---|---|---|---|
| 0 | 5.14 | 71.75 | 1 945 | 15.4 |
| 1 | 1.98 | 72.92 | 5 056 | 14.4 |
N=4
| req | TTFT (s) | duration (s) | prefill (t/s) | decode (t/s) |
|---|---|---|---|---|
| 0 | 4.67 | 82.44 | 2 143 | 13.2 |
| 1 | 4.67 | 80.44 | 2 143 | 13.5 |
| 2 | 4.67 | 74.77 | 2 144 | 14.6 |
| 3 | 4.67 | 80.44 | 2 142 | 13.5 |
N=6
| req | TTFT (s) | duration (s) | prefill (t/s) | decode (t/s) |
|---|---|---|---|---|
| 0 | 6.33 | 96.21 | 1 580 | 11.4 |
| 1 | 6.33 | 93.80 | 1 580 | 11.7 |
| 2 | 6.33 | 94.90 | 1 581 | 11.6 |
| 3 | 6.33 | 87.11 | 1 580 | 12.7 |
| 4 | 6.33 | 96.20 | 1 581 | 11.4 |
| 5 | 6.33 | 94.09 | 1 580 | 11.7 |
N=8 (over-subscribed, max_num_seqs=6)
| req | TTFT (s) | duration (s) | prefill (t/s) | decode (t/s) |
|---|---|---|---|---|
| 0 | 7.97 | 104.76 | 1 255 | 10.6 |
| 1 | 92.46 ⚠ | 163.09 | 108 | 14.5 |
| 2 | 7.97 | 103.90 | 1 255 | 10.7 |
| 3 | 7.97 | 90.22 | 1 256 | 12.4 |
| 4 | 7.98 | 96.68 | 1 255 | 11.5 |
| 5 | 7.97 | 105.84 | 1 255 | 10.5 |
| 6 | 7.97 | 96.67 | 1 255 | 11.5 |
| 7 | 98.86 ⚠ | 169.14 | 101 | 14.6 |
Requests 1 and 7 waited ~90 s for a free slot — their decode throughput is normal (the high apparent rate reflects how prefill_tps = in_tokens / TTFT lumps queue wait into the prefill window). When the queueing is excluded, the engine is saturated cleanly between N=4 and N=6.
Notes on the shape
- Per-request decode falls cleanly with N (19.3 → 15.4 → 13.5 → 11.7 → 11.5 t/s) — well-behaved scaling once the engine is busy. The N=1 number is dominated by the per-token Mamba state update on a single sequence; the cluster doesn't get more efficient until multiple sequences amortize the Mamba+attention overhead.
- Aggregate decode peaks at N=6 (68.3 t/s, 3.5× N=1). With
max_num_seqs=6this is the engineered ceiling for short-context concurrent serving; N=8 only hurts because two streams queue. - Prefill scales ~13× from N=1 to N=6 (701 → 9 477 t/s). Each prefill of ~10K tokens fits in a single 8192-batched-token window, so multi-request concurrency benefits the prefill throughput linearly until the chunked-prefill scheduler has to interleave with active decodes.
- TTFT plateaus cleanly at N=2-6 (1.98-6.33 s) — the chunked prefill scheduler admits requests within a single 8K-token window when there's room.
2. Needle-in-a-haystack — 200k prefill
200 000-token filler with a single needle injected at ~50% depth:
IMPORTANT: The secret access code for Project Aurora is BANANA-MOOSE-7421. Remember this exactly.
Question appended at the end: "Based on the text above, what is the secret access code for Project Aurora? Reply with just the code itself, nothing else."
| Metric | Value |
|---|---|
| Input tokens | 199 476 |
| TTFT (≈ prefill) | 163.0 s |
| Prefill speed | 1 224 tokens/s |
| Output tokens | 94 (terse — small reasoning trace) |
| Decode speed | 20.6 t/s |
| Total wall | 167.5 s |
| Needle retrieved | ✅ yes — exact match |
Model's final answer (after the reasoning block):
BANANA-MOOSE-7421
Reasoning trace (full, 296 chars):
The user is asking for the secret access code for Project Aurora,
which was mentioned in the text. Let me find it.
Looking through the text, I see this line: "IMPORTANT: The secret
access code for Project Aurora is BANANA-MOOSE-7421. Remember this
exactly."
So the code is BANANA-MOOSE-7421.
Nemotron-3-Ultra's NIAH retrieval at 200K is materially leaner than Qwen3.5-397B's: 94 output tokens (~7× shorter than Qwen's 669) with the needle copied verbatim in a 3-paragraph trace. Prefill is 1 224 t/s (vs Qwen 1 584 t/s) — the Mamba+attention hybrid is a touch slower per token at this depth, but the model also generates the answer with much less ceremony.
Launch config (attempt 10, validated boot 2026-06-06)
Recipe: ~/spark-vllm-docker/recipes/4x-spark-cluster/nemotron-3-ultra-nvfp4.yaml on head GX10 (copy at ../resources/recipes/nemotron-3-ultra-nvfp4.yaml).
Relaunch wrapper: ~/spark-vllm-docker/relaunch-nemotron3-ultra-nvfp4-tp4.sh (copy at ../resources/scripts/relaunch-nemotron3-ultra-nvfp4-tp4.sh).
Effective vllm serve command:
vllm serve /root/.cache/huggingface/nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-NVFP4 \
--served-model-name nvidia/nemotron-3-ultra \
--host 0.0.0.0 --port 8000 \
--trust-remote-code \
--tensor-parallel-size 4 \
--enable-expert-parallel \
--distributed-executor-backend ray \
--kv-cache-dtype fp8 \
--gpu-memory-utilization-gb 108 \
--max-model-len 262144 \
--max-num-seqs 6 \
--max-num-batched-tokens 8192 \
--enable-chunked-prefill \
--enable-prefix-caching \
--enable-auto-tool-choice \
--reasoning-parser nemotron_v3 \
--tool-call-parser qwen3_coder \
--mamba-ssm-cache-dtype float16 \
--mamba-backend flashinfer \
--enable-mamba-cache-stochastic-rounding \
--mamba-cache-philox-rounds 5 \
--moe-backend flashinfer_cutlass \
--speculative-config '{"method":"nemotron_h_mtp","num_speculative_tokens":3}' \
--model-loader-extra-config '{"enable_multithread_load": true, "num_threads": 16}' \
--compilation-config '{"pass_config": {"fuse_allreduce_rms": false}}' \
--distributed-timeout-seconds 3600
Env on the container:
VLLM_FLASHINFER_ALLREDUCE_BACKEND=trtllm
VLLM_ALLOW_LONG_MAX_MODEL_LEN=0
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
expandable_segments:True lets the CUDA caching allocator return segments after the FlashInfer fp8 autotuner spike so KV-cache alloc can claim the freed memory instead of hitting a fragmented high-water-mark.
Critical config (flag-by-flag)
| Flag | Value | Why |
|---|---|---|
--enable-expert-parallel |
required | 512 routed experts, top-22 — EP across TP ranks dramatically reduces per-rank MoE memory |
--kv-cache-dtype fp8 |
required | KV pool is the bottleneck at long ctx; fp8 halves it. No coherence damage observed (unlike Step-3.7-FP8). |
--mamba-backend flashinfer |
required | Triton Mamba kernel is slower and uses more workspace |
--mamba-ssm-cache-dtype float16 |
required | fp32 SSM state doubles cache footprint |
--enable-mamba-cache-stochastic-rounding |
required | Without it fp16 SSM state quantization degrades long-context output |
--moe-backend flashinfer_cutlass |
required | Throughput-tuned MoE kernels; required ≥ vLLM 0.22 (env var VLLM_FLASHINFER_MOE_BACKEND was removed in v0.23) |
--reasoning-parser nemotron_v3 |
required | Routes thinking blocks into message.reasoning |
--tool-call-parser qwen3_coder |
required | Nemotron-3 emits Qwen3-style tool calls |
--compilation-config '{"pass_config":{"fuse_allreduce_rms":false}}' |
required | Pass interacts badly with Mamba+attention hybrid; OOMs on graph capture |
--speculative-config '{"method":"nemotron_h_mtp","num_speculative_tokens":3}' |
tuned | nemotron_h MTP is multi-head — MTP-3 is the documented sweet spot. MTP-5 was only validated at ctx=65536. |
--max-num-batched-tokens 8192 |
tuned | 16384 OOMs the NVRM driver during 200K prefill (attempt 8). 8192 is the validated ceiling — halving cut the FlashInfer fp8 autotuner workspace high-water-mark which the caching allocator otherwise holds for engine lifetime. |
--max-num-seqs 6 |
tuned | KV holds 6 × 262K slots at gpu_mem=108. Bumped up from attempt 9's 4 after the budget bump. |
--gpu-memory-utilization-gb 108 |
tuned | Firm cap is 110 GiB (mod enforces 0.5 GiB margin against the tightest node's 111.27 GiB). 105 left ~3 GiB on the table; 110 OOMs on first heavy prefill. 108 is the sweet spot. |
Pitfalls (learned across 10 attempts)
-
"1M context" is marketing. The model card may say otherwise, but
config.jsonenforcesmax_position_embeddings=262144. The RoPE base is the standard untuned 10 000 — extrapolating withVLLM_ALLOW_LONG_MAX_MODEL_LEN=1produces NaN-prone attention. Don't raise it. -
The 128 → 119 → 111 GiB Spark memory gap is structural. Spec says 128 GiB unified memory;
free -hshows ~119 GiB (CMA reservation for the iGPU); vLLM sees ~111 GiB (kernel slab + driver overhead). Userland processes contribute <500 MB. Nothing to free. The firm cap for--gpu-memory-utilization-gbis 110 GiB (0.5 GiB safety against the tightest node). -
--gpu-memory-utilization-gb 110OOMs on first heavy inference, not at boot. Attempt 8 booted fine; on a 200K prefill the NVRM driver loggedNV_ERR_NO_MEMORYand the kernel OOM-killer reaped a Ray worker. Activation peak during a chunked-prefill chunk scales withmax_num_batched_tokens, not memory budget alone. -
--max-num-batched-tokens 16384is the killer for big prefill. Halving to 8192 cut the FlashInfer fp8 autotuner workspace high-water-mark. The caching allocator holds onto that high-water-mark for the engine lifetime, so a one-time spike at boot starves later KV allocation. -
Fix order if a future config OOMs: (a) drop MTP first (frees draft head state + activation), (b) then drop
max_num_seqs, (c) only then lower--gpu-memory-utilization-gb(below 105 GiB the KV pool gets too thin for long ctx). -
Container
Up≠ engine alive. Ray launcher PID1 keeps the container running even afterEngineDeadError. Always verify with/healthAND a smoke chat completion. -
ImportError: libtorch_cuda.so: cannot open shared object fileat vllm import = the image hastorch 2.10.0+cpuinstead of2.11.0+cu130. Upstream vllm wheel ships lying metadata; uv installs the CPU build. Rebuild the image with the documentedtorch==2.11.0--force-reinstall --no-depsstep (seerecipes/nemotron-3-ultra-nvfp4.md). -
The Super HF card's 1M-context claim does NOT transfer to Ultra. Super may have extended RoPE; Ultra ships with theta=10000. Treat each variant separately.
-
The relaunch script has an empty-arg bug for workers.
relaunch-nemotron3-ultra-nvfp4-tp4.shdoesdocker rm -f(empty container name) on workers — won't clean up a previous workload's container. Manuallydocker rm -f <name>on each worker before relaunching when switching models. -
329 GiB weights per node + 916 GiB partition is tight. With Qwen397 weights (~379 GiB) as the steady-state primary and
vllm-node-mimo+vllm-nodeimages (~80 GiB combined), each node has ~280-320 GiB headroom. Clear the partial HF hub cache at<spark-model-root>/hub/(it's incomplete and bypassed by the absolute-path recipe) to recover ~70 GiB/node before staging.
Compared to Qwen3.5-397B-A17B-FP8 + MTP-2 on the same cluster
The companion blog post in ../qwen-3.5-397b/README.md benches the cluster's primary workload — Qwen3.5-397B-A17B-FP8 with qwen3_next_mtp (MTP-2). Same hardware, same vLLM wheel, same fp8 KV cache. Useful contrasts:
| Aspect | Qwen3.5-397B-A17B-FP8 | Nemotron-3-Ultra-550B-A55B-NVFP4 |
|---|---|---|
| Image | vllm-node-mimo (torch 2.11.0+cu130) |
vllm-node (torch 2.11.0+cu130) |
| Quant | FP8 (W8A8 + fp8 KV) | NVFP4 (W4A4) + fp8 KV |
| Architecture | Pure MoE attention | LatentMoE + Mamba-2 + attention hybrid |
| MTP head layout | Single MTP head, re-run per spec position | Multi-head (nemotron_h_mtp) |
Working num_spec_tokens |
2 (MTP-5 crashes engine — see ../qwen-3.5-397b/README.md) |
3 (validated at full 262K ctx) |
max_num_seqs |
16 | 6 |
| Native ctx | 262 144 | 262 144 |
| Single-stream decode | ~40 t/s | ~19 t/s |
| Aggregate decode peak | ~186 t/s @ N=16 | ~68 t/s @ N=6 |
| 200K NIAH prefill | 1 584 t/s, 669 out tokens, ✅ found | 1 224 t/s, 94 out tokens, ✅ found |
Reproducibility — Docker image
The container is vllm-node:latest — the base image from
github.com/eugr/spark-vllm-docker,
not the vllm-node-mimo variant the Qwen3.5-397B post uses.
Nemotron-3-Ultra works against the base because its arch + reasoning
parser are added at runtime via in-tree mod patches, not as image
layers. Lineage:
nvidia/cuda:13.2.0-devel-ubuntu24.04
│ + ccache, build tools, libibverbs (RDMA)
│ + vllm wheel from wheels/ (0.22.1rc1.dev124+...)
│ + flashinfer wheels (0.6.12)
│ + torch 2.11.0+cu130 force-reinstall (see pitfall below)
▼
vllm-node:latest ◀── this benchmark used this image
│
└─ at runtime, run-recipe.sh applies two git-patch mods INSIDE
the container before exec'ing vllm serve:
mods/gpu-mem-util-gb (--gpu-memory-utilization-gb flag)
mods/nemotron-ultra (nemotron_h arch + nemotron_v3 parser)
Required pinned versions (do NOT skip)
| Package | Version | Source | Why |
|---|---|---|---|
torch |
2.11.0+cu130 | https://download.pytorch.org/whl/cu130 |
The fresh vLLM wheel ships metadata pinning torch==2.10.0 but the C++ ABI actually needs 2.11.0. If you let uv resolve naturally it lands torch==2.10.0+cpu and vllm dies at import with ImportError: libtorch_cuda.so: cannot open shared object file. The image's final RUN must force-reinstall the cu130 wheel --no-deps --force-reinstall. |
vllm |
0.22.1rc1.dev124+gace95c9cf.d20260603.cu132 |
local wheel wheels/vllm-*.whl |
nemotron_h architecture + LatentMoE require v0.22. Older wheels reject the arch and the mod patches don't apply cleanly. |
flashinfer-python |
0.6.12 |
local wheels wheels/flashinfer_*.whl |
required for FP8 KV + NVFP4 attention + the flashinfer_cutlass MoE backend. |
Build + verify + distribute (~10 min total on the head GX10)
# On head GX10 (<spark-user>@<head-node>)
cd ~/spark-vllm-docker
# 1. Drop wheels in wheels/ — vllm-0.22.1rc1.dev124+... and flashinfer_*-0.6.12
ls wheels/
# vllm-0.22.1rc1.dev124+gace95c9cf.d20260603.cu132-cp312-cp312-linux_aarch64.whl
# flashinfer_python-0.6.12-py3-none-any.whl
# flashinfer_jit_cache-0.6.12-cp39-abi3-manylinux_2_28_aarch64.whl
# flashinfer_cubin-0.6.12-py3-none-any.whl
# 2. Build vllm-node and SCP the tarball to all 3 workers over fabric
./build-and-copy.sh -c
# 3. VERIFY torch immediately BEFORE serving (the upstream wheel's
# metadata regression silently lands torch 2.10.0+cpu on rebuild
# if Dockerfile's force-reinstall step gets dropped)
docker run --rm --entrypoint python3 vllm-node:latest -c \
"import torch; print(torch.__version__, torch.version.cuda)"
# MUST print: 2.11.0+cu130 13.0
# If it prints '2.10.0+cpu' the container will go Up but /health never
# comes up. Patch the Dockerfile to end with:
#
# RUN uv pip install --no-deps --force-reinstall \
# --index-url https://download.pytorch.org/whl/cu130 \
# torch==2.11.0
#
# then rebuild and re-verify before distributing.
Pitfall —
docker image prune -awill silently delete this image when no container is running. Never runprune -ablind; either remove by tag (docker image rm vllm-node:latestif you actually want to) or filter with--filter "until=24h". We've reproduced this failure mode and now keep tar backups at<control-workspace>/docker-images/vllm-node.taron the control node so a lost image can be restored without a wheels rebuild.
Mods that the run script patches into the container
Both mods live under ~/spark-vllm-docker/mods/ as git patches; the
recipe yaml's mods: block applies them inside the container at launch
before vllm serve is exec'd. You don't run these by hand.
Copies of the mod files are in ../resources/mods/
so you can inspect or rebase them without SSH'ing to the head GX10:
| Mod | What it patches | Why required |
|---|---|---|
mods/gpu-mem-util-gb |
adds --gpu-memory-utilization-gb <int> flag to vLLM (raw GiB instead of a 0-1 fraction) |
Spark's unified memory is 128 GiB advertised but only 111 GiB visible to vLLM. The fraction-of-VRAM math overshoots the real ceiling; the raw-GiB budget mode is the only way to set a deterministic 108 GiB cap that respects the 110 GiB firm limit. |
mods/nemotron-ultra |
registers nemotron_h architecture + pulls the nemotron_v3 reasoning parser into vLLM's plugin registry |
Without this, vLLM rejects the Nemotron-3-Ultra config with unknown architecture and there's no parser for the <think> block. |
If a mod patch fails with git apply line-offset rejection after a
vLLM wheel bump, the run script falls back to patch --fuzz=5. This is
a known quirk of upstream vLLM file drift; if both methods fail, rebase
the patch against the current vLLM source in the wheel.
Reproducibility — launch the cluster
The launcher repo and recipe yaml are checked into
~/spark-vllm-docker/ on the head GX10. One-shot:
ssh -i ~/.ssh/<spark-key> <spark-user>@<head-node> \
'cd ~/spark-vllm-docker && ./relaunch-nemotron3-ultra-nvfp4-tp4.sh'
The relaunch script:
- Stops any existing
nemotron3-ultra-nvfp4-tp4container on all 4 nodes - Runs
./run-recipe.sh recipes/4x-spark-cluster/nemotron-3-ultra-nvfp4.yaml -dwhich boots Ray over the fabric and applies both mods inside the container - Exec's
vllm servewith the flags baked into the recipe yaml
The full expanded vllm serve command (after recipe + env + mod
substitution) is:
docker run -d --name nemotron3-ultra-nvfp4-tp4 \
--runtime nvidia --network host --ipc host --shm-size 16g \
-e HF_HUB_OFFLINE=1 -e TRANSFORMERS_OFFLINE=1 \
-e VLLM_FLASHINFER_ALLREDUCE_BACKEND=trtllm \
-e VLLM_ALLOW_LONG_MAX_MODEL_LEN=0 \
-e PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
-v <spark-model-root>:/root/.cache/huggingface \
vllm-node:latest \
vllm serve /root/.cache/huggingface/nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-NVFP4 \
--served-model-name nvidia/nemotron-3-ultra \
--host 0.0.0.0 --port 8000 \
--trust-remote-code \
--tensor-parallel-size 4 \
--enable-expert-parallel \
--distributed-executor-backend ray \
--kv-cache-dtype fp8 \
--gpu-memory-utilization-gb 108 \
--max-model-len 262144 \
--max-num-seqs 6 \
--max-num-batched-tokens 8192 \
--enable-chunked-prefill \
--enable-prefix-caching \
--enable-auto-tool-choice \
--reasoning-parser nemotron_v3 \
--tool-call-parser qwen3_coder \
--mamba-ssm-cache-dtype float16 \
--mamba-backend flashinfer \
--enable-mamba-cache-stochastic-rounding \
--mamba-cache-philox-rounds 5 \
--moe-backend flashinfer_cutlass \
--speculative-config '{"method":"nemotron_h_mtp","num_speculative_tokens":3}' \
--model-loader-extra-config '{"enable_multithread_load": true, "num_threads": 16}' \
--compilation-config '{"pass_config": {"fuse_allreduce_rms": false}}' \
--distributed-timeout-seconds 3600
expandable_segments:True is critical: it lets the CUDA caching
allocator return segments after the FlashInfer fp8 autotuner spike so
KV-cache alloc can claim the freed memory instead of being permanently
starved by the boot-time high-water-mark.
Verify the engine wired up MTP after boot by greping the container logs
for Detected MTP model. Sharing target model embedding weights (one
line per TP rank) and confirming SpeculativeConfig(method='mtp', num_spec_tokens=3) in the boot log — note vLLM's wheel transparently
remaps nemotron_h_mtp to plain mtp in the engine config (the
spec-config flag must still use the full name).
Boot time: ~12-13 min to /health=200. Model loading dominates
(~82 GiB per rank of NVFP4 weights + draft head + MoE expert shards).
Reproducibility — bench the cluster
# On the control node (Python 3.14; needs python3-requests from apt
# because pip/venv are not set up here).
cd <control-workspace>/blog/nemotron-3-ultra
/usr/bin/python3 ../resources/bench.py > results.json 2> bench.log
bench.py builds the prompts via the running engine's /tokenize
endpoint so the input token count is exact, fires N concurrent streams
synchronized on a threading.Barrier, parses streaming delta.content
and delta.reasoning plus the usage block from the trailing chunk,
and reports both per-request and wall-window aggregate throughput.
The shared harness lives at ../resources/bench.py
(same file used for the Qwen3.5-397B post).
A higher-level wrapper orchestrate.sh (in logs/, used during the
original run) handled the full end-to-end: poll weight staging on
head, fan the weights out to workers over the 200 GbE fabric, call the
relaunch script, poll /health, then run bench.py. Kept under
logs/ rather than ../resources/ because it has hardcoded paths and
was bespoke to this particular cold start.
See ../resources/INFRA.md for full cluster +
bench harness details shared with the Qwen3.5-397B blog post, and
<control-workspace>/recipes/nemotron-3-ultra-nvfp4.md for the
full attempt history (10 attempts) and pitfall catalog.
Files in this folder
README.md— this fileresults.json— raw bench JSON (parsed into the tables above)logs/— bench.log + orchestrate.{sh,log,out} + continue.{sh,log,out} from the original cold-start run
Shared scripts/recipes/mods used by this post live in
../resources/.