Shared infrastructure — 4x ASUS Ascent GX10 / GB10 vLLM cluster

This file is the canonical reproducibility reference for both blog posts in this folder (README.md for Qwen3.5-397B-A17B-FP8 + MTP, and nemotron-3-ultra/README.md for Nemotron-3-Ultra-550B-A55B-NVFP4).

If you only care about one model's numbers, the blog post for that model is self-contained — it inlines the launch command, the image lineage, and the bench script. This file collects the once-per-cluster setup that both posts share, so we don't have to repeat it twice.

Hardware

4x ASUS Ascent GX10 systems, one is the head/API node and three are workers, all Ray-connected.
Each GX10: NVIDIA GB10 Blackwell Grace-Blackwell superchip, 128 GiB unified memory (CPU + iGPU share one pool), CX-7 200 GbE NIC.
Networking: 1 GbE mgmt link between the control box (<control-node>) and the head GX10 (<spark-user>@<head-node>); 200 GbE fabric (<fabric-subnet>) directly between the 4 GX10s via ConnectX-7. Model staging uses control→head over 1 GbE then head→workers in parallel over the 200 GbE fabric, which is ~4× faster than fanning out from the control node naively.
GX10 unified-memory budget visible to vLLM: ~111 GiB (128 spec → 119 free after CMA reservation → 111 inside the container after kernel slab
- driver overhead). Firm cap for --gpu-memory-utilization-gb is 110 GiB (custom mod enforces 0.5 GiB safety margin).

control node            head GX10              worker GX10s
<control-node>           <head-node>           <worker-mgmt-ips>
  ─── 1 GbE ────────────►   │                         │
                            │  ──── 200 GbE ─────►   │
                            │  fabric:            <worker-fabric-ip-1>
                            │  <head-fabric-ip>     <worker-fabric-ip-2>
                            │                     <worker-fabric-ip-3>
                            ▼
                        vllm serve on :8000 (head)
                        Ray workers on the 3 others

OS detail: each GX10 is headless (default systemd target multi-user.target — a graphical desktop would burn 2-3 GiB of unified memory we want for KV).

Source repo

The cluster launcher lives at github.com/eugr/spark-vllm-docker (cloned at ~/spark-vllm-docker/ on the head GX10). The pieces we use:

File	Purpose
`Dockerfile`	Builds the base `vllm-node` image from `nvidia/cuda:13.2.0-devel-ubuntu24.04`. Compiles PyTorch + flashinfer + vLLM from local wheels. Pins `torch==2.11.0+cu130`.
`Dockerfile.mimo-runtime`	Layered image `vllm-node-mimo` on top of `vllm-node-tf5` — overrides transformers ≥5.0 and force-reinstalls torch 2.11.0+cu130. Required for Qwen3.5-MTP.
`build-and-copy.sh -c`	Builds `vllm-node` locally then scp's the image tarball to the 3 worker GX10s over fabric.
`launch-cluster.sh`	Multi-node Ray + vllm launcher. Spawns the container on each node, wires Ray, runs `vllm serve` on the head.
`run-recipe.sh <recipe.yaml>`	Resolves a recipe yaml (mods, flags, env) and calls `launch-cluster.sh`.
`recipes/4x-spark-cluster/*.yaml`	The yamls (per model) describing the launch — flags, env, mods, command template.
`mods/`	Git-patch directories that modify the in-container vLLM source. Each recipe lists which ones it needs.
`wheels/`	Pinned vLLM + flashinfer wheels. Bumping vLLM means dropping new wheels here and rebuilding.
`relaunch--tp.sh`	Per-model "one-shot" wrappers: stop the previous container on all 4 nodes, then `./run-recipe.sh -d ...`.

Docker image lineage

nvidia/cuda:13.2.0-devel-ubuntu24.04
    │  + ccache, build tools, libibverbs (RDMA)
    │  + pip install torch==2.11.0+cu130, triton, nvshmem
    │
    ▼
vllm-node:latest          ◀── used by Nemotron-3-Ultra recipe
    │  Builds vLLM + flashinfer wheels into the image, base for everything below.
    │
    ├──► vllm-node-tf5:latest
    │       (older transformers, stable for non-MTP)
    │
    └──► vllm-node-tf5 + Dockerfile.mimo-runtime
            │  ENV overrides + reinstall wheels with transformers≥5.0
            │  RUN force-reinstall torch==2.11.0+cu130
            ▼
        vllm-node-mimo:latest  ◀── used by Qwen3.5-397B-A17B-FP8 + MTP recipe

The two "live" images on the cluster right now are vllm-node and vllm-node-mimo. Both pin vLLM 0.22.1rc1.dev124+gace95c9cf.d20260603. vllm-node-tf5 is the layered base for mimo but is not itself launched.

Pinned versions (do not skip)

Package	Version	Source	Why
`torch`	2.11.0+cu130	`https://download.pytorch.org/whl/cu130`	The fresh vLLM wheel ships metadata pinning `torch==2.10.0` — but the ABI actually needs 2.11.0. If you let uv resolve naturally it lands `torch==2.10.0+cpu` and vllm dies at import with `ImportError: libtorch_cuda.so: cannot open shared object file`. Force-reinstall the cu130 wheel as a separate Dockerfile RUN layer AFTER the vllm install with `--no-deps --force-reinstall`.
`transformers`	`≥5.0.0` (mimo only)	pip	`qwen3_next_mtp` config classes live in transformers 5.x; tf5's base pin is 4.x — mimo overrides.
`vllm`	`0.22.1rc1.dev124+gace95c9cf.d20260603.cu132`	local wheel in `wheels/`	qwen3_next_mtp + nemotron_h architectures both need 0.22+
`flashinfer-python`	`0.6.12`	local wheel in `wheels/` (cubin + jit_cache + python)	required for FP8 KV + NVFP4 attention; both recipes use `--moe-backend flashinfer_*` or `--attention-backend flashinfer`.

Verify after every rebuild (BEFORE redistributing 22-48 GB to workers):

docker run --rm --entrypoint python3 vllm-node:latest -c \
  "import torch; print(torch.__version__, torch.version.cuda)"
# MUST print: 2.11.0+cu130 13.0
# If it prints 2.10.0+cpu, the rebuild silently regressed — DO NOT distribute.

Image rebuild + redistribute (~10 min total)

# On head GX10
cd ~/spark-vllm-docker

# 1. Sanity-check the wheels are current
ls wheels/
#   flashinfer_cubin-0.6.12-py3-none-any.whl
#   flashinfer_jit_cache-0.6.12-cp39-abi3-manylinux_2_28_aarch64.whl
#   flashinfer_python-0.6.12-py3-none-any.whl
#   vllm-0.22.1rc1.dev124+gace95c9cf.d20260603.cu132-cp312-cp312-linux_aarch64.whl

# 2. Build the base vllm-node image (Nemotron recipe).
#    -c means "cluster mode": build locally, then scp the tarball to the 3 workers.
./build-and-copy.sh -c

# 3. For the Qwen MTP recipe, also build the mimo overlay:
docker build -f Dockerfile.mimo-runtime -t vllm-node-mimo:latest .

# 4. VERIFY torch immediately (see above).

# 5. Distribute mimo to the workers over fabric:
docker save vllm-node-mimo:latest > /tmp/vllm-node-mimo.tar
for n in <worker-fabric-ip-1> <worker-fabric-ip-2> <worker-fabric-ip-3>; do
  (cat /tmp/vllm-node-mimo.tar | ssh $n 'docker load') &
done; wait
rm /tmp/vllm-node-mimo.tar

If vllm-node or vllm-node-mimo is ever lost (e.g. a docker image prune -a while no container is running — known pitfall), restore from the control-node backup tars at <control-workspace>/docker-images/:

# On control node, push tar back to whichever Spark is missing the image
scp -i ~/.ssh/<spark-key> \
    <control-workspace>/docker-images/vllm-node-mimo.tar \
    <spark-user>@<head-node>:/tmp/
ssh -i ~/.ssh/<spark-key> <spark-user>@<head-node> \
    'cat /tmp/vllm-node-mimo.tar | docker load && rm /tmp/vllm-node-mimo.tar'

Pitfall — docker image prune -a deletes the working image. Never run docker image prune -a while a container using the image is stopped; the image gets marked unused and removed. Always be explicit: docker image rm <repo:tag> or filter with docker image prune --filter "until=24h".

Custom mods (vLLM source patches applied at run-recipe time)

Each recipe yaml lists which mods to apply. Each mod is a git-patch dir under ~/spark-vllm-docker/mods/. run-recipe.sh applies them inside the running container against the in-container vLLM source tree.

Mod	Used by	Purpose
`mods/gpu-mem-util-gb`	Nemotron	Adds `--gpu-memory-utilization-gb` (raw GiB budget) on top of the fraction-based flag, because Spark's unified-memory math overshoots when you use the standard 0-1 fraction.
`mods/nemotron-ultra`	Nemotron	Registers the `nemotron_h` architecture in vLLM and pulls the `nemotron_v3` reasoning parser.

Qwen3.5-397B-A17B-FP8 + MTP needs no mods at the current wheel — the qwen3_next_mtp speculator path is in upstream vLLM main.

If a mod patch fails with git apply rejection after an image rebuild (line drift in upstream vLLM), run-recipe.sh falls back to patch --fuzz=5. This is expected; both modes succeed in practice.

Weights staging

Each model's weights live on the control node under <control-workspace>/<org>/<model>/. Cluster nodes have a copy at <spark-model-root>/<org>/<model>/ which the recipe mounts into the container at /root/.cache/huggingface/<org>/<model>/.

Model	HF source	Size per node	Path on cluster
Qwen3.5-397B-A17B-FP8	`Qwen/Qwen3.5-397B-A17B-FP8`	379 GiB	`<spark-model-root>/Qwen/Qwen3.5-397B-A17B-FP8/`
Nemotron-3-Ultra-550B-A55B-NVFP4	`nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-NVFP4`	329 GiB	`<spark-model-root>/nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-NVFP4/`

Staging recipe (the documented pattern — much faster than fanning out from the control node directly):

# 1. control → head over 1 GbE (~100 MB/s, ~55 min for 329 GiB)
rsync -a --partial --info=progress2 \
  -e "ssh -i ~/.ssh/<spark-key>" \
  <control-workspace>/<org>/<model>/ \
  <spark-user>@<head-node>:<spark-model-root>/<org>/<model>/

# 2. head → 3 workers in parallel over 200 GbE fabric (~10 min)
ssh -i ~/.ssh/<spark-key> <spark-user>@<head-node> bash <<'REMOTE'
SRC=<spark-model-root>/<org>/<model>
for h in <worker-fabric-ip-1> <worker-fabric-ip-2> <worker-fabric-ip-3>; do
  ( rsync -a --partial -e "ssh -o StrictHostKeyChecking=no" \
      "$SRC/" "$h:$SRC/" ) &
done
wait
REMOTE

The cluster is single-tenant — only one tp=4 workload at a time. Disk budget per Spark is tight (916 GiB partition; ~280-320 GiB headroom with Qwen weights resident + both images installed), so swapping models requires clearing the previous model's weights from at least the head node first.

Launching

Each model has a relaunch wrapper that:

docker rm -f the previous container on all 4 nodes.
Runs ./run-recipe.sh <recipe>.yaml -d ... which docker run -d the image, applies mods, and execs vllm serve with the recipe's flags substituted in.
Returns — the engine then loads weights (~10-13 min). Poll http://<head-node>:8000/health until 200.

Model	Relaunch wrapper on head	Master port
Qwen3.5-397B-A17B-FP8 + MTP	`~/spark-vllm-docker/relaunch-qwen397-fp8-mtp-qwen3next-tp4.sh`	29510
Nemotron-3-Ultra	`~/spark-vllm-docker/relaunch-nemotron3-ultra-nvfp4-tp4.sh`	29520

The two master ports are distinct so the launchers don't collide if someone forgets to stop the previous one (though single-tenancy means they shouldn't both be running anyway).

Bench harness (shared)

Both blog posts use the same Python bench client at <control-workspace>/blog/resources/bench.py. Each post's own folder holds only its results.json + logs/; the harness itself is shared. The MODEL constant and concurrency-sweep points are wired through env vars / args, not by editing per-post copies.

Requirements on the control node:

sudo apt install python3-requests

(The control node has Python 3.14 with no pip and a broken venv; apt is the path that actually works — see operator memory entries.)

The harness:

Builds prompts of a target token count by generating varied English filler and trimming via vLLM's /tokenize endpoint, so the input token count is exact.
Streams via requests.Session().post(stream=True) + iter_lines(), with stream_options={"include_usage": True} so the final SSE chunk carries the usage block (exact prompt/completion token counts).
Synchronizes concurrent requests with threading.Barrier(n) so all N requests fire simultaneously; computes both per-request throughput and aggregate (wall-window) throughput.
Runs NIAH by injecting a fixed needle at ~50% depth into a 200 000-token filler and checking the returned content + reasoning for the needle value.

Run the bench:

cd <control-workspace>/blog            # or blog/nemotron-3-ultra
/usr/bin/python3 bench.py > results.json 2> bench.log

Sampling is temperature=0.0 so the runs are deterministic modulo MTP acceptance variance and Ray scheduler ordering.