Nemotron-3-Ultra-550B-A55B NVFP4 on 4x ASUS GX10 — bench

Date: 2026-06-13 Hardware: 4x ASUS Ascent GX10 (NVIDIA GB10 Blackwell, 128 GiB unified memory each, ConnectX-7 200 GbE fabric), TP=4 + expert-parallel + Ray Container image: vllm-node:latest (base image, not the vllm-node-mimo variant) — see Reproducibility — Docker image below for the exact build steps, pinned versions, and the mods that have to be patched in vLLM: 0.22.1rc1.dev124+gace95c9cf.d20260603.cu132 (local wheel, image id 75429f413d11, built 2026-06-05) Model: nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-NVFP4 (HF), local path /root/.cache/huggingface/nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-NVFP4, ~329 GiB per node Architecture: LatentMoE + Mamba-2 + attention hybrid, 512 routed experts top-22 MTP: nemotron_h_mtp, num_speculative_tokens=3 (multi-head MTP — unlike Qwen3.5's single-head MTP, MTP-3 is comfortable here) KV cache: fp8; max_model_len=262144; max_num_seqs=6; max_num_batched_tokens=8192 GPU memory budget: 108 GiB/node (GX10 unified memory firm cap is 110 GiB; mod enforces 0.5 GiB margin against the tightest node's 111.27 GiB) API: http://<head-node>:8000 Bench client: ../resources/bench.py, streaming via requests + iter_lines() with stream_options={"include_usage": True}

Field notes

Nemotron-3-Ultra was the "somebody said this should not fit here, so I had to try" run. I had read that this was not a model you should expect to run on a 4x GX10 cluster, and the satisfying part was getting it to boot anyway: TP=4, expert parallel, Ray, fp8 KV, the Nemotron parser patches, and a tight raw-GiB memory budget all lining up well enough to serve a 550B-class NVFP4 model from the Spark nodes.

In day-to-day use, though, it does not obviously feel better than the Qwen3.5-397B baseline. Tool calling was fine, and the 200K retrieval test was clean, but the model has different tradeoffs rather than a clear quality win. The LatentMoE + Mamba-2 hybrid spends memory and compute on state that helps the architecture scale and keep long-context behavior stable, while this Spark setup is decode-throughput constrained and caps max_num_seqs at 6. Qwen397's working MTP-2 head gives it a large interactive-speed advantage, so Nemotron can be technically impressive without feeling like the better default assistant on this hardware.

For full cluster + image setup details shared with the Qwen3.5-397B post: see ../resources/INFRA.md.

Companion files: launch script, recipe yaml, mod patches, bench harness, and wheel provenance are checked into ../resources/ so each script + patch is reproducible as a file you can copy.

Provenance

Prior art — where the launch flags came from

Nemotron-3 Super (released 2026-03-11) and Ultra (released 2026-06-04) are the same architecture family (LatentMoE Mamba-2 + MoE + attention hybrid with MTP). We didn't write the launch flags from the Ultra model card alone — we started from the Super deployment docs and adjusted for the Ultra's larger expert count and tighter per-rank memory budget. Authoritative sources used:

vLLM blog — Nemotron-3-Super launch post (2026-03-11): https://vllm.ai/blog/2026-03-11-nemotron-3-super — explains LatentMoE's compressed-dim routing (d=4096 → ℓ=1024, ~4× less all-to-all traffic), the --mamba-backend flashinfer requirement, the nemotron_h_mtp speculator, and --reasoning-parser nemotron_v3 / --tool-call-parser qwen3_coder wiring.
NVIDIA Spark Deployment Guide — Nemotron-3-Super: https://docs.nvidia.com/nemotron/nightly/usage-cookbook/Nemotron-3-Super/SparkDeploymentGuide/README.html — the canonical GB10 vLLM launch flags. Closest official doc to our 4x GX10 cluster; our recipe is essentially the Super single-node command extended to TP=4 + EP across Ray.
NVIDIA Developer Forums — Nemotron-3-Super NVFP4 on Acer GB10: https://forums.developer.nvidia.com/t/nemotron-3-super-120b-a12b-nvfp4-with-vllm-v0-22-0-on-1x-acer-gb10-with-495-71-05-driver-container-is-ubuntu-24-04-cuda-13-2-1-gxx11/371853 — community thread on single-node GB10 troubleshooting (same SM121 as ours); useful sanity check on the vLLM 0.22 + cu132 driver/container combination.
NVIDIA vLLM release notes: https://docs.nvidia.com/deeplearning/frameworks/vllm-release-notes/index.html — for verifying which release added each Nemotron-specific parser / backend (--reasoning-parser nemotron_v3, --mamba-backend flashinfer, --speculative-config nemotron_h_mtp).

Our spark-vllm-docker/mods/nemotron-ultra mod is a fork of the mods/nemotron-super mod with the arch registration switched to the Ultra config — Super was already running on the cluster when Ultra shipped, so the recipe was derived rather than rebuilt from scratch.

Sampling for both tests: temperature=0.0. Prompts are tokenized via vLLM's /tokenize endpoint so the input token count is exact.

Benchmarks

1. Concurrency sweep — 10k in / 1024 out

Each request: ~9 818 input tokens (built from varied English filler) + an instruction to continue the narrative; max_tokens=1024. All concurrent requests fire simultaneously (synchronized via a threading.Barrier). Per-request and aggregate throughput reported.

Note on N choice: this run sweeps N ∈ {1, 2, 4, 6, 8}. The Qwen blog sweeps up to N=16; Nemotron's max_num_seqs=6 makes N>6 informative as a queue-saturation point rather than additional throughput.

Aggregate throughput (single-run, fp8 KV, MTP-3 active):

N	Wall (s)	Agg prefill (t/s)	Agg decode (t/s)	Median TTFT (s)	Median per-req decode (t/s)
1	67.3	701	19.3	14.28	19.3
2	72.9	3 889	28.8	5.14	15.4
4	82.4	8 568	52.6	4.67	13.5
6	96.2	9 477	68.3	6.33	11.7
8	169.1	810 ⚠	50.8	7.97	11.5

⚠ N=8 exceeds max_num_seqs=6; two of the eight requests had to wait for a free slot (their TTFT jumps to 92-99 s — see the N=8 table below). That inflates the wall window and collapses the apparent "aggregate prefill" rate. Aggregate decode also regresses vs N=6 because the queued requests stall the wall window while contributing no decode tokens until they're admitted.

Per-request totals (input always 10 006 tokens, output always 1 024 tokens):

N=1

req	TTFT (s)	duration (s)	prefill (t/s)	decode (t/s)
0	14.28	67.33	701	19.3

N=2

req	TTFT (s)	duration (s)	prefill (t/s)	decode (t/s)
0	5.14	71.75	1 945	15.4
1	1.98	72.92	5 056	14.4

N=4

req	TTFT (s)	duration (s)	prefill (t/s)	decode (t/s)
0	4.67	82.44	2 143	13.2
1	4.67	80.44	2 143	13.5
2	4.67	74.77	2 144	14.6
3	4.67	80.44	2 142	13.5

N=6

req	TTFT (s)	duration (s)	prefill (t/s)	decode (t/s)
0	6.33	96.21	1 580	11.4
1	6.33	93.80	1 580	11.7
2	6.33	94.90	1 581	11.6
3	6.33	87.11	1 580	12.7
4	6.33	96.20	1 581	11.4
5	6.33	94.09	1 580	11.7

N=8 (over-subscribed, max_num_seqs=6)

req	TTFT (s)	duration (s)	prefill (t/s)	decode (t/s)
0	7.97	104.76	1 255	10.6
1	92.46 ⚠	163.09	108	14.5
2	7.97	103.90	1 255	10.7
3	7.97	90.22	1 256	12.4
4	7.98	96.68	1 255	11.5
5	7.97	105.84	1 255	10.5
6	7.97	96.67	1 255	11.5
7	98.86 ⚠	169.14	101	14.6

Requests 1 and 7 waited ~90 s for a free slot — their decode throughput is normal (the high apparent rate reflects how prefill_tps = in_tokens / TTFT lumps queue wait into the prefill window). When the queueing is excluded, the engine is saturated cleanly between N=4 and N=6.

Notes on the shape

Per-request decode falls cleanly with N (19.3 → 15.4 → 13.5 → 11.7 → 11.5 t/s) — well-behaved scaling once the engine is busy. The N=1 number is dominated by the per-token Mamba state update on a single sequence; the cluster doesn't get more efficient until multiple sequences amortize the Mamba+attention overhead.
Aggregate decode peaks at N=6 (68.3 t/s, 3.5× N=1). With max_num_seqs=6 this is the engineered ceiling for short-context concurrent serving; N=8 only hurts because two streams queue.
Prefill scales ~13× from N=1 to N=6 (701 → 9 477 t/s). Each prefill of ~10K tokens fits in a single 8192-batched-token window, so multi-request concurrency benefits the prefill throughput linearly until the chunked-prefill scheduler has to interleave with active decodes.
TTFT plateaus cleanly at N=2-6 (1.98-6.33 s) — the chunked prefill scheduler admits requests within a single 8K-token window when there's room.

2. Needle-in-a-haystack — 200k prefill

200 000-token filler with a single needle injected at ~50% depth:

IMPORTANT: The secret access code for Project Aurora is BANANA-MOOSE-7421. Remember this exactly.

Question appended at the end: "Based on the text above, what is the secret access code for Project Aurora? Reply with just the code itself, nothing else."

Metric	Value
Input tokens	199 476
TTFT (≈ prefill)	163.0 s
Prefill speed	1 224 tokens/s
Output tokens	94 (terse — small reasoning trace)
Decode speed	20.6 t/s
Total wall	167.5 s
Needle retrieved	✅ yes — exact match

Model's final answer (after the reasoning block):

BANANA-MOOSE-7421

Reasoning trace (full, 296 chars):

The user is asking for the secret access code for Project Aurora,
which was mentioned in the text. Let me find it.

Looking through the text, I see this line: "IMPORTANT: The secret
access code for Project Aurora is BANANA-MOOSE-7421. Remember this
exactly."

So the code is BANANA-MOOSE-7421.

Nemotron-3-Ultra's NIAH retrieval at 200K is materially leaner than Qwen3.5-397B's: 94 output tokens (~7× shorter than Qwen's 669) with the needle copied verbatim in a 3-paragraph trace. Prefill is 1 224 t/s (vs Qwen 1 584 t/s) — the Mamba+attention hybrid is a touch slower per token at this depth, but the model also generates the answer with much less ceremony.

Launch config (attempt 10, validated boot 2026-06-06)

Recipe: ~/spark-vllm-docker/recipes/4x-spark-cluster/nemotron-3-ultra-nvfp4.yaml on head GX10 (copy at ../resources/recipes/nemotron-3-ultra-nvfp4.yaml). Relaunch wrapper: ~/spark-vllm-docker/relaunch-nemotron3-ultra-nvfp4-tp4.sh (copy at ../resources/scripts/relaunch-nemotron3-ultra-nvfp4-tp4.sh).

Effective vllm serve command:

vllm serve /root/.cache/huggingface/nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-NVFP4 \
  --served-model-name nvidia/nemotron-3-ultra \
  --host 0.0.0.0 --port 8000 \
  --trust-remote-code \
  --tensor-parallel-size 4 \
  --enable-expert-parallel \
  --distributed-executor-backend ray \
  --kv-cache-dtype fp8 \
  --gpu-memory-utilization-gb 108 \
  --max-model-len 262144 \
  --max-num-seqs 6 \
  --max-num-batched-tokens 8192 \
  --enable-chunked-prefill \
  --enable-prefix-caching \
  --enable-auto-tool-choice \
  --reasoning-parser nemotron_v3 \
  --tool-call-parser qwen3_coder \
  --mamba-ssm-cache-dtype float16 \
  --mamba-backend flashinfer \
  --enable-mamba-cache-stochastic-rounding \
  --mamba-cache-philox-rounds 5 \
  --moe-backend flashinfer_cutlass \
  --speculative-config '{"method":"nemotron_h_mtp","num_speculative_tokens":3}' \
  --model-loader-extra-config '{"enable_multithread_load": true, "num_threads": 16}' \
  --compilation-config '{"pass_config": {"fuse_allreduce_rms": false}}' \
  --distributed-timeout-seconds 3600

Env on the container:

VLLM_FLASHINFER_ALLREDUCE_BACKEND=trtllm
VLLM_ALLOW_LONG_MAX_MODEL_LEN=0
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True

expandable_segments:True lets the CUDA caching allocator return segments after the FlashInfer fp8 autotuner spike so KV-cache alloc can claim the freed memory instead of hitting a fragmented high-water-mark.

Critical config (flag-by-flag)

Flag	Value	Why
`--enable-expert-parallel`	required	512 routed experts, top-22 — EP across TP ranks dramatically reduces per-rank MoE memory
`--kv-cache-dtype fp8`	required	KV pool is the bottleneck at long ctx; fp8 halves it. No coherence damage observed (unlike Step-3.7-FP8).
`--mamba-backend flashinfer`	required	Triton Mamba kernel is slower and uses more workspace
`--mamba-ssm-cache-dtype float16`	required	fp32 SSM state doubles cache footprint
`--enable-mamba-cache-stochastic-rounding`	required	Without it fp16 SSM state quantization degrades long-context output
`--moe-backend flashinfer_cutlass`	required	Throughput-tuned MoE kernels; required ≥ vLLM 0.22 (env var `VLLM_FLASHINFER_MOE_BACKEND` was removed in v0.23)
`--reasoning-parser nemotron_v3`	required	Routes thinking blocks into `message.reasoning`
`--tool-call-parser qwen3_coder`	required	Nemotron-3 emits Qwen3-style tool calls
`--compilation-config '{"pass_config":{"fuse_allreduce_rms":false}}'`	required	Pass interacts badly with Mamba+attention hybrid; OOMs on graph capture
`--speculative-config '{"method":"nemotron_h_mtp","num_speculative_tokens":3}'`	tuned	nemotron_h MTP is multi-head — MTP-3 is the documented sweet spot. MTP-5 was only validated at ctx=65536.
`--max-num-batched-tokens 8192`	tuned	16384 OOMs the NVRM driver during 200K prefill (attempt 8). 8192 is the validated ceiling — halving cut the FlashInfer fp8 autotuner workspace high-water-mark which the caching allocator otherwise holds for engine lifetime.
`--max-num-seqs 6`	tuned	KV holds 6 × 262K slots at gpu_mem=108. Bumped up from attempt 9's 4 after the budget bump.
`--gpu-memory-utilization-gb 108`	tuned	Firm cap is 110 GiB (mod enforces 0.5 GiB margin against the tightest node's 111.27 GiB). 105 left ~3 GiB on the table; 110 OOMs on first heavy prefill. 108 is the sweet spot.

Pitfalls (learned across 10 attempts)

"1M context" is marketing. The model card may say otherwise, but config.json enforces max_position_embeddings=262144. The RoPE base is the standard untuned 10 000 — extrapolating with VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 produces NaN-prone attention. Don't raise it.
The 128 → 119 → 111 GiB Spark memory gap is structural. Spec says 128 GiB unified memory; free -h shows ~119 GiB (CMA reservation for the iGPU); vLLM sees ~111 GiB (kernel slab + driver overhead). Userland processes contribute <500 MB. Nothing to free. The firm cap for --gpu-memory-utilization-gb is 110 GiB (0.5 GiB safety against the tightest node).
--gpu-memory-utilization-gb 110 OOMs on first heavy inference, not at boot. Attempt 8 booted fine; on a 200K prefill the NVRM driver logged NV_ERR_NO_MEMORY and the kernel OOM-killer reaped a Ray worker. Activation peak during a chunked-prefill chunk scales with max_num_batched_tokens, not memory budget alone.
--max-num-batched-tokens 16384 is the killer for big prefill. Halving to 8192 cut the FlashInfer fp8 autotuner workspace high-water-mark. The caching allocator holds onto that high-water-mark for the engine lifetime, so a one-time spike at boot starves later KV allocation.
Fix order if a future config OOMs: (a) drop MTP first (frees draft head state + activation), (b) then drop max_num_seqs, (c) only then lower --gpu-memory-utilization-gb (below 105 GiB the KV pool gets too thin for long ctx).
Container Up ≠ engine alive. Ray launcher PID1 keeps the container running even after EngineDeadError. Always verify with /health AND a smoke chat completion.
ImportError: libtorch_cuda.so: cannot open shared object file at vllm import = the image has torch 2.10.0+cpu instead of 2.11.0+cu130. Upstream vllm wheel ships lying metadata; uv installs the CPU build. Rebuild the image with the documented torch==2.11.0 --force-reinstall --no-deps step (see recipes/nemotron-3-ultra-nvfp4.md).
The Super HF card's 1M-context claim does NOT transfer to Ultra. Super may have extended RoPE; Ultra ships with theta=10000. Treat each variant separately.
The relaunch script has an empty-arg bug for workers. relaunch-nemotron3-ultra-nvfp4-tp4.sh does docker rm -f (empty container name) on workers — won't clean up a previous workload's container. Manually docker rm -f <name> on each worker before relaunching when switching models.
329 GiB weights per node + 916 GiB partition is tight. With Qwen397 weights (~379 GiB) as the steady-state primary and vllm-node-mimo + vllm-node images (~80 GiB combined), each node has ~280-320 GiB headroom. Clear the partial HF hub cache at <spark-model-root>/hub/ (it's incomplete and bypassed by the absolute-path recipe) to recover ~70 GiB/node before staging.

Compared to Qwen3.5-397B-A17B-FP8 + MTP-2 on the same cluster

The companion blog post in ../qwen-3.5-397b/README.md benches the cluster's primary workload — Qwen3.5-397B-A17B-FP8 with qwen3_next_mtp (MTP-2). Same hardware, same vLLM wheel, same fp8 KV cache. Useful contrasts:

Aspect	Qwen3.5-397B-A17B-FP8	Nemotron-3-Ultra-550B-A55B-NVFP4
Image	`vllm-node-mimo` (torch 2.11.0+cu130)	`vllm-node` (torch 2.11.0+cu130)
Quant	FP8 (W8A8 + fp8 KV)	NVFP4 (W4A4) + fp8 KV
Architecture	Pure MoE attention	LatentMoE + Mamba-2 + attention hybrid
MTP head layout	Single MTP head, re-run per spec position	Multi-head (nemotron_h_mtp)
Working `num_spec_tokens`	2 (MTP-5 crashes engine — see `../qwen-3.5-397b/README.md`)	3 (validated at full 262K ctx)
`max_num_seqs`	16	6
Native ctx	262 144	262 144
Single-stream decode	~40 t/s	~19 t/s
Aggregate decode peak	~186 t/s @ N=16	~68 t/s @ N=6
200K NIAH prefill	1 584 t/s, 669 out tokens, ✅ found	1 224 t/s, 94 out tokens, ✅ found

Reproducibility — Docker image

The container is vllm-node:latest — the base image from github.com/eugr/spark-vllm-docker, not the vllm-node-mimo variant the Qwen3.5-397B post uses. Nemotron-3-Ultra works against the base because its arch + reasoning parser are added at runtime via in-tree mod patches, not as image layers. Lineage:

nvidia/cuda:13.2.0-devel-ubuntu24.04
   │  + ccache, build tools, libibverbs (RDMA)
   │  + vllm wheel from wheels/ (0.22.1rc1.dev124+...)
   │  + flashinfer wheels (0.6.12)
   │  + torch 2.11.0+cu130 force-reinstall (see pitfall below)
   ▼
vllm-node:latest          ◀── this benchmark used this image
   │
   └─ at runtime, run-recipe.sh applies two git-patch mods INSIDE
      the container before exec'ing vllm serve:
         mods/gpu-mem-util-gb     (--gpu-memory-utilization-gb flag)
         mods/nemotron-ultra      (nemotron_h arch + nemotron_v3 parser)

Required pinned versions (do NOT skip)

Package	Version	Source	Why
`torch`	2.11.0+cu130	`https://download.pytorch.org/whl/cu130`	The fresh vLLM wheel ships metadata pinning `torch==2.10.0` but the C++ ABI actually needs 2.11.0. If you let uv resolve naturally it lands `torch==2.10.0+cpu` and vllm dies at import with `ImportError: libtorch_cuda.so: cannot open shared object file`. The image's final RUN must force-reinstall the cu130 wheel `--no-deps --force-reinstall`.
`vllm`	`0.22.1rc1.dev124+gace95c9cf.d20260603.cu132`	local wheel `wheels/vllm-*.whl`	`nemotron_h` architecture + LatentMoE require v0.22. Older wheels reject the arch and the mod patches don't apply cleanly.
`flashinfer-python`	`0.6.12`	local wheels `wheels/flashinfer_*.whl`	required for FP8 KV + NVFP4 attention + the `flashinfer_cutlass` MoE backend.

Build + verify + distribute (~10 min total on the head GX10)

# On head GX10 (<spark-user>@<head-node>)
cd ~/spark-vllm-docker

# 1. Drop wheels in wheels/ — vllm-0.22.1rc1.dev124+... and flashinfer_*-0.6.12
ls wheels/
#   vllm-0.22.1rc1.dev124+gace95c9cf.d20260603.cu132-cp312-cp312-linux_aarch64.whl
#   flashinfer_python-0.6.12-py3-none-any.whl
#   flashinfer_jit_cache-0.6.12-cp39-abi3-manylinux_2_28_aarch64.whl
#   flashinfer_cubin-0.6.12-py3-none-any.whl

# 2. Build vllm-node and SCP the tarball to all 3 workers over fabric
./build-and-copy.sh -c

# 3. VERIFY torch immediately BEFORE serving (the upstream wheel's
#    metadata regression silently lands torch 2.10.0+cpu on rebuild
#    if Dockerfile's force-reinstall step gets dropped)
docker run --rm --entrypoint python3 vllm-node:latest -c \
  "import torch; print(torch.__version__, torch.version.cuda)"
# MUST print: 2.11.0+cu130 13.0
# If it prints '2.10.0+cpu' the container will go Up but /health never
# comes up. Patch the Dockerfile to end with:
#
#   RUN uv pip install --no-deps --force-reinstall \
#       --index-url https://download.pytorch.org/whl/cu130 \
#       torch==2.11.0
#
# then rebuild and re-verify before distributing.

Pitfall — docker image prune -a will silently delete this image when no container is running. Never run prune -a blind; either remove by tag (docker image rm vllm-node:latest if you actually want to) or filter with --filter "until=24h". We've reproduced this failure mode and now keep tar backups at <control-workspace>/docker-images/vllm-node.tar on the control node so a lost image can be restored without a wheels rebuild.

Mods that the run script patches into the container

Both mods live under ~/spark-vllm-docker/mods/ as git patches; the recipe yaml's mods: block applies them inside the container at launch before vllm serve is exec'd. You don't run these by hand.

Copies of the mod files are in ../resources/mods/ so you can inspect or rebase them without SSH'ing to the head GX10:

mods/gpu-mem-util-gb/run.sh
mods/gpu-mem-util-gb/gpu_mem.patch
mods/nemotron-ultra/run.sh

Mod	What it patches	Why required
`mods/gpu-mem-util-gb`	adds `--gpu-memory-utilization-gb <int>` flag to vLLM (raw GiB instead of a 0-1 fraction)	Spark's unified memory is 128 GiB advertised but only 111 GiB visible to vLLM. The fraction-of-VRAM math overshoots the real ceiling; the raw-GiB budget mode is the only way to set a deterministic 108 GiB cap that respects the 110 GiB firm limit.
`mods/nemotron-ultra`	registers `nemotron_h` architecture + pulls the `nemotron_v3` reasoning parser into vLLM's plugin registry	Without this, vLLM rejects the Nemotron-3-Ultra config with `unknown architecture` and there's no parser for the `<think>` block.

If a mod patch fails with git apply line-offset rejection after a vLLM wheel bump, the run script falls back to patch --fuzz=5. This is a known quirk of upstream vLLM file drift; if both methods fail, rebase the patch against the current vLLM source in the wheel.

Reproducibility — launch the cluster

The launcher repo and recipe yaml are checked into ~/spark-vllm-docker/ on the head GX10. One-shot:

ssh -i ~/.ssh/<spark-key> <spark-user>@<head-node> \
  'cd ~/spark-vllm-docker && ./relaunch-nemotron3-ultra-nvfp4-tp4.sh'

The relaunch script:

Stops any existing nemotron3-ultra-nvfp4-tp4 container on all 4 nodes
Runs ./run-recipe.sh recipes/4x-spark-cluster/nemotron-3-ultra-nvfp4.yaml -d which boots Ray over the fabric and applies both mods inside the container
Exec's vllm serve with the flags baked into the recipe yaml

The full expanded vllm serve command (after recipe + env + mod substitution) is:

docker run -d --name nemotron3-ultra-nvfp4-tp4 \
  --runtime nvidia --network host --ipc host --shm-size 16g \
  -e HF_HUB_OFFLINE=1 -e TRANSFORMERS_OFFLINE=1 \
  -e VLLM_FLASHINFER_ALLREDUCE_BACKEND=trtllm \
  -e VLLM_ALLOW_LONG_MAX_MODEL_LEN=0 \
  -e PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
  -v <spark-model-root>:/root/.cache/huggingface \
  vllm-node:latest \
  vllm serve /root/.cache/huggingface/nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-NVFP4 \
    --served-model-name nvidia/nemotron-3-ultra \
    --host 0.0.0.0 --port 8000 \
    --trust-remote-code \
    --tensor-parallel-size 4 \
    --enable-expert-parallel \
    --distributed-executor-backend ray \
    --kv-cache-dtype fp8 \
    --gpu-memory-utilization-gb 108 \
    --max-model-len 262144 \
    --max-num-seqs 6 \
    --max-num-batched-tokens 8192 \
    --enable-chunked-prefill \
    --enable-prefix-caching \
    --enable-auto-tool-choice \
    --reasoning-parser nemotron_v3 \
    --tool-call-parser qwen3_coder \
    --mamba-ssm-cache-dtype float16 \
    --mamba-backend flashinfer \
    --enable-mamba-cache-stochastic-rounding \
    --mamba-cache-philox-rounds 5 \
    --moe-backend flashinfer_cutlass \
    --speculative-config '{"method":"nemotron_h_mtp","num_speculative_tokens":3}' \
    --model-loader-extra-config '{"enable_multithread_load": true, "num_threads": 16}' \
    --compilation-config '{"pass_config": {"fuse_allreduce_rms": false}}' \
    --distributed-timeout-seconds 3600

expandable_segments:True is critical: it lets the CUDA caching allocator return segments after the FlashInfer fp8 autotuner spike so KV-cache alloc can claim the freed memory instead of being permanently starved by the boot-time high-water-mark.

Verify the engine wired up MTP after boot by greping the container logs for Detected MTP model. Sharing target model embedding weights (one line per TP rank) and confirming SpeculativeConfig(method='mtp', num_spec_tokens=3) in the boot log — note vLLM's wheel transparently remaps nemotron_h_mtp to plain mtp in the engine config (the spec-config flag must still use the full name).

Boot time: ~12-13 min to /health=200. Model loading dominates (~82 GiB per rank of NVFP4 weights + draft head + MoE expert shards).

Reproducibility — bench the cluster

# On the control node (Python 3.14; needs python3-requests from apt
# because pip/venv are not set up here).
cd <control-workspace>/blog/nemotron-3-ultra
/usr/bin/python3 ../resources/bench.py > results.json 2> bench.log

bench.py builds the prompts via the running engine's /tokenize endpoint so the input token count is exact, fires N concurrent streams synchronized on a threading.Barrier, parses streaming delta.content and delta.reasoning plus the usage block from the trailing chunk, and reports both per-request and wall-window aggregate throughput.

The shared harness lives at ../resources/bench.py (same file used for the Qwen3.5-397B post).

A higher-level wrapper orchestrate.sh (in logs/, used during the original run) handled the full end-to-end: poll weight staging on head, fan the weights out to workers over the 200 GbE fabric, call the relaunch script, poll /health, then run bench.py. Kept under logs/ rather than ../resources/ because it has hardcoded paths and was bespoke to this particular cold start.

See ../resources/INFRA.md for full cluster + bench harness details shared with the Qwen3.5-397B blog post, and <control-workspace>/recipes/nemotron-3-ultra-nvfp4.md for the full attempt history (10 attempts) and pitfall catalog.

Files in this folder

README.md — this file
results.json — raw bench JSON (parsed into the tables above)
logs/ — bench.log + orchestrate.{sh,log,out} + continue.{sh,log,out} from the original cold-start run

Shared scripts/recipes/mods used by this post live in ../resources/.