Shared infrastructure — 4x ASUS Ascent GX10 / GB10 vLLM cluster
This file is the canonical reproducibility reference for both blog posts
in this folder (README.md for Qwen3.5-397B-A17B-FP8 + MTP, and
nemotron-3-ultra/README.md for Nemotron-3-Ultra-550B-A55B-NVFP4).
If you only care about one model's numbers, the blog post for that model is self-contained: it inlines the launch command, the image lineage, and the bench script. This file collects the once-per-cluster setup that both posts share, so neither post has to repeat it.
Hardware
- 4x ASUS Ascent GX10 systems: one is the head/API node and three are workers, all Ray-connected.
- Each GX10: NVIDIA GB10 Grace-Blackwell superchip, 128 GiB unified memory (CPU + iGPU share one pool), CX-7 200 GbE NIC.
- Networking: 1 GbE mgmt link between the control box (<control-node>) and the head GX10 (<spark-user>@<head-node>); 200 GbE fabric (<fabric-subnet>) directly between the 4 GX10s via ConnectX-7. Model staging uses control→head over 1 GbE then head→workers in parallel over the 200 GbE fabric, which is ~4× faster than fanning out from the control node naively.
- GX10 unified-memory budget visible to vLLM: ~111 GiB (128 spec → 119 free after CMA reservation → 111 inside the container after kernel slab + driver overhead). Firm cap for --gpu-memory-utilization-gb is 110 GiB (custom mod enforces 0.5 GiB safety margin).
control node head GX10 worker GX10s
<control-node> <head-node> <worker-mgmt-ips>
─── 1 GbE ────────────► │ │
│ ──── 200 GbE ─────► │
│ fabric: <worker-fabric-ip-1>
│ <head-fabric-ip> <worker-fabric-ip-2>
│ <worker-fabric-ip-3>
▼
vllm serve on :8000 (head)
Ray workers on the 3 others
OS detail: each GX10 is headless (default systemd target multi-user.target
— a graphical desktop would burn 2-3 GiB of unified memory we want for KV).
Source repo
The cluster launcher lives at github.com/eugr/spark-vllm-docker
(cloned at ~/spark-vllm-docker/ on the head GX10). The pieces we use:
| File | Purpose |
|---|---|
| Dockerfile | Builds the base vllm-node image from nvidia/cuda:13.2.0-devel-ubuntu24.04. Compiles PyTorch + flashinfer + vLLM from local wheels. Pins torch==2.11.0+cu130. |
| Dockerfile.mimo-runtime | Layered image vllm-node-mimo on top of vllm-node-tf5; overrides transformers ≥5.0 and force-reinstalls torch 2.11.0+cu130. Required for Qwen3.5-MTP. |
| build-and-copy.sh -c | Builds vllm-node locally then scp's the image tarball to the 3 worker GX10s over fabric. |
| launch-cluster.sh | Multi-node Ray + vllm launcher. Spawns the container on each node, wires Ray, runs vllm serve on the head. |
| run-recipe.sh <recipe.yaml> | Resolves a recipe yaml (mods, flags, env) and calls launch-cluster.sh. |
| recipes/4x-spark-cluster/*.yaml | The yamls (per model) describing the launch: flags, env, mods, command template. |
| mods/ | Git-patch directories that modify the in-container vLLM source. Each recipe lists which ones it needs. |
| wheels/ | Pinned vLLM + flashinfer wheels. Bumping vLLM means dropping new wheels here and rebuilding. |
| relaunch-*-tp*.sh | Per-model "one-shot" wrappers: stop the previous container on all 4 nodes, then ./run-recipe.sh -d .... |
Docker image lineage
nvidia/cuda:13.2.0-devel-ubuntu24.04
│ + ccache, build tools, libibverbs (RDMA)
│ + pip install torch==2.11.0+cu130, triton, nvshmem
│
▼
vllm-node:latest ◀── used by Nemotron-3-Ultra recipe
│ Builds vLLM + flashinfer wheels into the image, base for everything below.
│
├──► vllm-node-tf5:latest
│ (older transformers, stable for non-MTP)
│
└──► vllm-node-tf5 + Dockerfile.mimo-runtime
│ ENV overrides + reinstall wheels with transformers≥5.0
│ RUN force-reinstall torch==2.11.0+cu130
▼
vllm-node-mimo:latest ◀── used by Qwen3.5-397B-A17B-FP8 + MTP recipe
The two "live" images on the cluster right now are vllm-node and
vllm-node-mimo. Both pin vLLM 0.22.1rc1.dev124+gace95c9cf.d20260603.
vllm-node-tf5 is the layered base for mimo but is not itself launched.
Pinned versions (do not skip)
| Package | Version | Source | Why |
|---|---|---|---|
| torch | 2.11.0+cu130 | https://download.pytorch.org/whl/cu130 | The fresh vLLM wheel ships metadata pinning torch==2.10.0, but the ABI actually needs 2.11.0. If you let uv resolve naturally it lands torch==2.10.0+cpu and vllm dies at import with ImportError: libtorch_cuda.so: cannot open shared object file. Force-reinstall the cu130 wheel as a separate Dockerfile RUN layer AFTER the vllm install with --no-deps --force-reinstall. |
| transformers | ≥5.0.0 (mimo only) | pip | qwen3_next_mtp config classes live in transformers 5.x; tf5's base pin is 4.x, which mimo overrides. |
| vllm | 0.22.1rc1.dev124+gace95c9cf.d20260603.cu132 | local wheel in wheels/ | qwen3_next_mtp + nemotron_h architectures both need 0.22+ |
| flashinfer-python | 0.6.12 | local wheel in wheels/ (cubin + jit_cache + python) | Required for FP8 KV + NVFP4 attention; both recipes use --moe-backend flashinfer_* or --attention-backend flashinfer. |
Verify after every rebuild (BEFORE redistributing 22-48 GB to workers):
docker run --rm --entrypoint python3 vllm-node:latest -c \
"import torch; print(torch.__version__, torch.version.cuda)"
# MUST print: 2.11.0+cu130 13.0
# If it prints 2.10.0+cpu, the rebuild silently regressed — DO NOT distribute.
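The check above is easy to automate as a gate before any image is redistributed. The following is a minimal sketch, not part of the repo: the function name `torch_build_ok` and the two constants are hypothetical, and it assumes you capture the stdout of the `docker run ... python3 -c` command yourself and pass it in as a string.

```python
# Hypothetical helper: decide whether a rebuilt image is safe to distribute,
# based on the text printed by the torch verify command.
EXPECTED_TORCH = "2.11.0+cu130"  # pinned torch version from this document
EXPECTED_CUDA = "13.0"           # CUDA version the verify command must print


def torch_build_ok(output: str) -> bool:
    """Return True only if the output is exactly the pinned torch/CUDA pair."""
    fields = output.strip().split()
    if len(fields) != 2:
        return False
    torch_version, cuda_version = fields
    return torch_version == EXPECTED_TORCH and cuda_version == EXPECTED_CUDA


print(torch_build_ok("2.11.0+cu130 13.0"))  # True
print(torch_build_ok("2.10.0+cpu None"))    # False (a CPU-only wheel reports no CUDA version)
```

Comparing the full strings, rather than only the major version, is deliberate: the regression described here differs only in the minor version and the `+cpu` suffix.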
Image rebuild + redistribute (~10 min total)
# On head GX10
cd ~/spark-vllm-docker
# 1. Sanity-check the wheels are current
ls wheels/
# flashinfer_cubin-0.6.12-py3-none-any.whl
# flashinfer_jit_cache-0.6.12-cp39-abi3-manylinux_2_28_aarch64.whl
# flashinfer_python-0.6.12-py3-none-any.whl
# vllm-0.22.1rc1.dev124+gace95c9cf.d20260603.cu132-cp312-cp312-linux_aarch64.whl
# 2. Build the base vllm-node image (Nemotron recipe).
# -c means "cluster mode": build locally, then scp the tarball to the 3 workers.
./build-and-copy.sh -c
# 3. For the Qwen MTP recipe, also build the mimo overlay:
docker build -f Dockerfile.mimo-runtime -t vllm-node-mimo:latest .
# 4. VERIFY torch immediately (see above).
# 5. Distribute mimo to the workers over fabric:
docker save vllm-node-mimo:latest > /tmp/vllm-node-mimo.tar
for n in <worker-fabric-ip-1> <worker-fabric-ip-2> <worker-fabric-ip-3>; do
(ssh "$n" 'docker load' < /tmp/vllm-node-mimo.tar) &
done; wait
rm /tmp/vllm-node-mimo.tar
If vllm-node or vllm-node-mimo is ever lost (e.g. a docker image prune -a while no container is running — known pitfall), restore from
the control-node backup tars at
<control-workspace>/docker-images/:
# On control node, push tar back to whichever Spark is missing the image
scp -i ~/.ssh/<spark-key> \
<control-workspace>/docker-images/vllm-node-mimo.tar \
<spark-user>@<head-node>:/tmp/
ssh -i ~/.ssh/<spark-key> <spark-user>@<head-node> \
'cat /tmp/vllm-node-mimo.tar | docker load && rm /tmp/vllm-node-mimo.tar'
Pitfall: docker image prune -a deletes the working image. Never run docker image prune -a while no container is using the image (for example between launches, after the previous container has been removed); the image counts as unused and is removed. Always be explicit: docker image rm <repo:tag>, or filter with docker image prune --filter "until=24h".
Custom mods (vLLM source patches applied at run-recipe time)
Each recipe yaml lists which mods to apply. Each mod is a git-patch dir
under ~/spark-vllm-docker/mods/. run-recipe.sh applies them inside the
running container against the in-container vLLM source tree.
| Mod | Used by | Purpose |
|---|---|---|
| mods/gpu-mem-util-gb | Nemotron | Adds --gpu-memory-utilization-gb (raw GiB budget) on top of the fraction-based flag, because Spark's unified-memory math overshoots when you use the standard 0-1 fraction. |
| mods/nemotron-ultra | Nemotron | Registers the nemotron_h architecture in vLLM and pulls the nemotron_v3 reasoning parser. |
Qwen3.5-397B-A17B-FP8 + MTP needs no mods at the current wheel — the
qwen3_next_mtp speculator path is in upstream vLLM main.
If a mod patch fails with git apply rejection after an image rebuild
(line drift in upstream vLLM), run-recipe.sh falls back to
patch --fuzz=5. This is expected; both modes succeed in practice.
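The fallback described above can be pictured as follows. This is an illustrative sketch only, not the logic of run-recipe.sh (which this document does not show): the function names, the `-p1` strip level, and the assumption that the current directory is the vLLM source tree are all hypothetical. The command runner is injectable so the control flow can be exercised without touching a real tree.

```python
import subprocess
from typing import Callable, Sequence

Runner = Callable[[Sequence[str]], int]


def run(cmd: Sequence[str]) -> int:
    """Run a command in the current directory and return its exit code."""
    return subprocess.run(list(cmd)).returncode


def apply_mod(patch_file: str, runner: Runner = run) -> str:
    """Try a strict `git apply` first; on rejection, retry with fuzzy `patch`.

    Assumption: the patch paths need one leading component stripped (-p1).
    """
    if runner(["git", "apply", patch_file]) == 0:
        return "git-apply"
    # git apply changes nothing when it rejects, so a retry starts from a clean tree.
    if runner(["patch", "-p1", "--fuzz=5", "-i", patch_file]) == 0:
        return "patch-fuzz"
    raise RuntimeError(f"could not apply {patch_file}")
```

The strict attempt goes first so that line drift is only tolerated when an exact match is not available.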
Weights staging
Each model's weights live on the control node under
<control-workspace>/<org>/<model>/. Cluster nodes have a copy at
<spark-model-root>/<org>/<model>/ which the recipe mounts
into the container at /root/.cache/huggingface/<org>/<model>/.
| Model | HF source | Size per node | Path on cluster |
|---|---|---|---|
| Qwen3.5-397B-A17B-FP8 | Qwen/Qwen3.5-397B-A17B-FP8 | 379 GiB | <spark-model-root>/Qwen/Qwen3.5-397B-A17B-FP8/ |
| Nemotron-3-Ultra-550B-A55B-NVFP4 | nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-NVFP4 | 329 GiB | <spark-model-root>/nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-NVFP4/ |
Staging recipe (the documented pattern — much faster than fanning out from the control node directly):
# 1. control → head over 1 GbE (~100 MB/s, ~55 min for 329 GiB)
rsync -a --partial --info=progress2 \
-e "ssh -i ~/.ssh/<spark-key>" \
<control-workspace>/<org>/<model>/ \
<spark-user>@<head-node>:<spark-model-root>/<org>/<model>/
# 2. head → 3 workers in parallel over 200 GbE fabric (~10 min)
ssh -i ~/.ssh/<spark-key> <spark-user>@<head-node> bash <<'REMOTE'
SRC=<spark-model-root>/<org>/<model>
for h in <worker-fabric-ip-1> <worker-fabric-ip-2> <worker-fabric-ip-3>; do
( rsync -a --partial -e "ssh -o StrictHostKeyChecking=no" \
"$SRC/" "$h:$SRC/" ) &
done
wait
REMOTE
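As a rough back-of-envelope check on the timings quoted in the comments above, the sketch below estimates the staged and naive transfer times. It rests on two assumptions that are not from the source: the 1 GbE hop sustains exactly 100 MB/s, and a naive fan-out pushes all four copies through the control node's single 1 GbE link, so it costs four times one hop. The ~10 min fabric figure is taken as given.

```python
GIB = 2**30  # bytes per GiB


def transfer_minutes(size_gib: float, rate_mb_per_s: float) -> float:
    """Minutes to move size_gib at a sustained rate in decimal MB/s."""
    return size_gib * GIB / (rate_mb_per_s * 1e6) / 60


size_gib = 329                           # Nemotron weights per node
hop1 = transfer_minutes(size_gib, 100)   # control -> head over 1 GbE
fabric_fanout = 10                       # minutes, head -> 3 workers (quoted figure)

staged = hop1 + fabric_fanout
naive = 4 * hop1  # assumption: four copies share one 1 GbE uplink

print(f"hop1 ~{hop1:.0f} min, staged ~{staged:.0f} min, "
      f"naive ~{naive:.0f} min, speedup ~{naive / staged:.1f}x")
```

Under these assumptions the first hop comes out just under an hour and the staged path is a little over three times faster than the naive one, which is the same order as the figures quoted in this document; real rsync throughput will shift both numbers.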
The cluster is single-tenant — only one tp=4 workload at a time. Disk
budget per Spark is tight (916 GiB partition; ~280-320 GiB headroom with
Qwen weights resident + both images installed), so swapping models
requires clearing the previous model's weights from at least the head
node first.
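A pre-flight check before staging makes the disk constraint concrete. The sketch below is hypothetical (the helper names and the 20 GiB reserve are illustrative, not from the repo) and uses only `shutil.disk_usage` from the standard library; run it on the node that is about to receive the weights.

```python
import shutil

GIB = 2**30  # bytes per GiB


def free_gib(path: str) -> float:
    """Free space, in GiB, on the filesystem that holds path."""
    return shutil.disk_usage(path).free / GIB


def can_stage(path: str, model_gib: float, reserve_gib: float = 20.0) -> bool:
    """True if the model fits with reserve_gib to spare (reserve is an assumed margin)."""
    return free_gib(path) >= model_gib + reserve_gib


# With Qwen resident, the quoted headroom (at most ~320 GiB) is below the
# 329 GiB Nemotron needs, which is why the old weights must be cleared first.
print(320 >= 329)  # False
```

In practice the `path` argument would be the model root on the node being checked, and `model_gib` the size from the weights table.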
Launching
Each model has a relaunch wrapper that:
- docker rm -f the previous container on all 4 nodes.
- Runs ./run-recipe.sh <recipe>.yaml -d ... which docker run -d the image, applies mods, and execs vllm serve with the recipe's flags substituted in.
- Returns; the engine then loads weights (~10-13 min). Poll http://<head-node>:8000/health until 200.
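The health polling in the last step can be scripted with the standard library alone. This is a minimal sketch under assumptions: the 20-minute deadline and 15-second interval are illustrative defaults chosen to sit above the quoted load time, and the `probe` and `sleep` parameters exist only so the loop can be tested without a live server.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_status(url: str, timeout_s: float = 5.0) -> int:
    """Return the HTTP status of a GET, or 0 if the server is not reachable yet."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status
    except urllib.error.HTTPError as err:  # server answered with a non-2xx code
        return err.code
    except (urllib.error.URLError, OSError):  # refused, timed out, DNS failure
        return 0


def wait_until_healthy(
    url: str,
    deadline_s: float = 20 * 60,
    interval_s: float = 15.0,
    probe: Callable[[str], int] = http_status,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll url until it returns 200; give up once deadline_s of waiting has elapsed."""
    waited = 0.0
    while True:
        if probe(url) == 200:
            return True
        if waited >= deadline_s:
            return False
        sleep(interval_s)
        waited += interval_s
```

A call such as `wait_until_healthy("http://<head-node>:8000/health")` would block until the engine has finished loading weights or the deadline passes.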
| Model | Relaunch wrapper on head | Master port |
|---|---|---|
| Qwen3.5-397B-A17B-FP8 + MTP | ~/spark-vllm-docker/relaunch-qwen397-fp8-mtp-qwen3next-tp4.sh | 29510 |
| Nemotron-3-Ultra | ~/spark-vllm-docker/relaunch-nemotron3-ultra-nvfp4-tp4.sh | 29520 |
The two master ports are distinct so the launchers don't collide if someone forgets to stop the previous one (though single-tenancy means they shouldn't both be running anyway).
Bench harness (shared)
Both blog posts use the same Python bench client at
<control-workspace>/blog/resources/bench.py. Each post's own
folder holds only its results.json + logs/; the harness itself is
shared. The MODEL constant and concurrency-sweep points are wired
through env vars / args, not by editing per-post copies.
Requirements on the control node:
sudo apt install python3-requests
(The control node has Python 3.14 with no pip and a broken venv; apt is the path that actually works — see operator memory entries.)
The harness:
- Builds prompts of a target token count by generating varied English filler and trimming via vLLM's /tokenize endpoint, so the input token count is exact.
- Streams via requests.Session().post(stream=True) + iter_lines(), with stream_options={"include_usage": True} so the final SSE chunk carries the usage block (exact prompt/completion token counts).
- Synchronizes concurrent requests with threading.Barrier(n) so all N requests fire simultaneously; computes both per-request throughput and aggregate (wall-window) throughput.
- Runs NIAH by injecting a fixed needle at ~50% depth into a 200 000-token filler and checking the returned content + reasoning for the needle value.
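The barrier synchronization, usage extraction, and two throughput figures listed above can be sketched as below. This is not the actual bench.py: the function names are hypothetical, `do_request` stands in for the real streaming call, and "aggregate" is assumed here to mean total completion tokens divided by the window from the earliest start to the latest finish.

```python
import json
import threading
import time


def usage_from_sse(lines):
    """Return the last non-empty usage block found in OpenAI-style SSE lines, or None."""
    usage = None
    for raw in lines:
        line = raw.decode() if isinstance(raw, bytes) else raw
        if not line.startswith("data:"):
            continue  # skip blank keep-alive lines and comments
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        if chunk.get("usage"):
            usage = chunk["usage"]
    return usage


def run_synchronized(n, do_request):
    """Start n threads, release them together, record (start, end, tokens) for each."""
    barrier = threading.Barrier(n)
    results = [None] * n

    def worker(i):
        barrier.wait()  # every thread blocks here until all n have arrived
        start = time.perf_counter()
        tokens = do_request(i)  # placeholder: must return the completion token count
        results[i] = (start, time.perf_counter(), tokens)

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


def throughput(results):
    """Return (per-request tok/s list, aggregate tok/s over the wall window)."""
    per_request = [tok / (end - start) for start, end, tok in results]
    window = max(end for _, end, _ in results) - min(start for start, _, _ in results)
    aggregate = sum(tok for _, _, tok in results) / window
    return per_request, aggregate
```

In a real client, `do_request` would post to the chat completions endpoint with streaming enabled, feed `iter_lines()` into `usage_from_sse`, and return `usage["completion_tokens"]`.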
Run the bench:
cd <control-workspace>/blog # or blog/nemotron-3-ultra
/usr/bin/python3 bench.py > results.json 2> bench.log
Sampling is temperature=0.0 so the runs are deterministic modulo MTP
acceptance variance and Ray scheduler ordering.
See also
- recipes/qwen-3.5-397b-fp8-mtp.md: full lore on the Qwen MTP launch (pitfalls, attempt history, smoke-test, fallback non-MTP path).
- recipes/nemotron-3-ultra-nvfp4.md: same for Nemotron-3-Ultra (attempt-history table walking through 10 configs, MoE memory tightrope).
- recipes/README.md: index of every recipe on the cluster.
- gists/qwen397-fp8-mtp/README.md: earlier doc that overlaps with this one; this INFRA.md supersedes it for blog reproducibility purposes.