Shared infrastructure — 4x ASUS Ascent GX10 / GB10 vLLM cluster

This file is the canonical reproducibility reference for both blog posts in this folder (README.md for Qwen3.5-397B-A17B-FP8 + MTP, and nemotron-3-ultra/README.md for Nemotron-3-Ultra-550B-A55B-NVFP4).

If you only care about one model's numbers, the blog post for that model is self-contained: it inlines the launch command, the image lineage, and the bench script. This file collects the once-per-cluster setup that both posts share, so it does not have to be repeated in each post.


Hardware

control node            head GX10              worker GX10s
<control-node>           <head-node>           <worker-mgmt-ips>
  ─── 1 GbE ────────────►   │                         │
                            │  ──── 200 GbE ─────►   │
                            │  fabric:            <worker-fabric-ip-1>
                            │  <head-fabric-ip>     <worker-fabric-ip-2>
                            │                     <worker-fabric-ip-3>
                            ▼
                        vllm serve on :8000 (head)
                        Ray workers on the 3 others

OS detail: each GX10 is headless (default systemd target multi-user.target — a graphical desktop would burn 2-3 GiB of unified memory we want for KV).
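
One way to confirm the headless setting on a node is to compare the output of systemctl get-default with multi-user.target. The following is a minimal sketch, not part of the launcher repo: the helper name is_headless_target is hypothetical, and the live check assumes a systemd host (it is skipped where systemctl is missing).

```shell
# Hypothetical helper: succeeds only when the given systemd default target
# is the headless one (multi-user.target).
is_headless_target() {
  [ "$1" = "multi-user.target" ]
}

# Run the live check only where systemctl exists (assumption: systemd host).
if command -v systemctl >/dev/null 2>&1; then
  current_target=$(systemctl get-default 2>/dev/null || true)
  if is_headless_target "$current_target"; then
    echo "headless: $current_target"
  else
    echo "not headless: ${current_target:-unknown}"
    echo "switch with: sudo systemctl set-default multi-user.target"
  fi
fi
```

Keeping the comparison in its own function makes it easy to run the same check over ssh on all four nodes.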


Source repo

The cluster launcher lives at github.com/eugr/spark-vllm-docker (cloned at ~/spark-vllm-docker/ on the head GX10). The pieces we use:

| File | Purpose |
|---|---|
| Dockerfile | Builds the base vllm-node image from nvidia/cuda:13.2.0-devel-ubuntu24.04. Compiles PyTorch + flashinfer + vLLM from local wheels. Pins torch==2.11.0+cu130. |
| Dockerfile.mimo-runtime | Layered image vllm-node-mimo on top of vllm-node-tf5: overrides transformers ≥5.0 and force-reinstalls torch 2.11.0+cu130. Required for Qwen3.5-MTP. |
| build-and-copy.sh -c | Builds vllm-node locally, then scp's the image tarball to the 3 worker GX10s over fabric. |
| launch-cluster.sh | Multi-node Ray + vllm launcher. Spawns the container on each node, wires Ray, runs vllm serve on the head. |
| run-recipe.sh <recipe.yaml> | Resolves a recipe yaml (mods, flags, env) and calls launch-cluster.sh. |
| recipes/4x-spark-cluster/*.yaml | The yamls (per model) describing the launch: flags, env, mods, command template. |
| mods/ | Git-patch directories that modify the in-container vLLM source. Each recipe lists which ones it needs. |
| wheels/ | Pinned vLLM + flashinfer wheels. Bumping vLLM means dropping new wheels here and rebuilding. |
| relaunch-*-tp*.sh | Per-model "one-shot" wrappers: stop the previous container on all 4 nodes, then ./run-recipe.sh -d .... |

Docker image lineage

nvidia/cuda:13.2.0-devel-ubuntu24.04
    │  + ccache, build tools, libibverbs (RDMA)
    │  + pip install torch==2.11.0+cu130, triton, nvshmem
    │
    ▼
vllm-node:latest          ◀── used by Nemotron-3-Ultra recipe
    │  Builds vLLM + flashinfer wheels into the image, base for everything below.
    │
    ├──► vllm-node-tf5:latest
    │       (older transformers, stable for non-MTP)
    │
    └──► vllm-node-tf5 + Dockerfile.mimo-runtime
            │  ENV overrides + reinstall wheels with transformers≥5.0
            │  RUN force-reinstall torch==2.11.0+cu130
            ▼
        vllm-node-mimo:latest  ◀── used by Qwen3.5-397B-A17B-FP8 + MTP recipe

The two "live" images on the cluster right now are vllm-node and vllm-node-mimo. Both pin vLLM 0.22.1rc1.dev124+gace95c9cf.d20260603. vllm-node-tf5 is the layered base for mimo but is not itself launched.


Pinned versions (do not skip)

| Package | Version | Source | Why |
|---|---|---|---|
| torch | 2.11.0+cu130 | https://download.pytorch.org/whl/cu130 | The fresh vLLM wheel ships metadata pinning torch==2.10.0, but the ABI actually needs 2.11.0. If you let uv resolve naturally, it lands torch==2.10.0+cpu and vllm dies at import with ImportError: libtorch_cuda.so: cannot open shared object file. After the vllm install, force-reinstall the cu130 wheel with --no-deps --force-reinstall in a separate Dockerfile RUN layer. |
| transformers | ≥5.0.0 (mimo only) | pip | qwen3_next_mtp config classes live in transformers 5.x; tf5's base pin is 4.x, so mimo overrides it. |
| vllm | 0.22.1rc1.dev124+gace95c9cf.d20260603.cu132 | local wheel in wheels/ | The qwen3_next_mtp and nemotron_h architectures both need 0.22+. |
| flashinfer-python | 0.6.12 | local wheel in wheels/ (cubin + jit_cache + python) | Required for FP8 KV + NVFP4 attention; both recipes use --moe-backend flashinfer_* or --attention-backend flashinfer. |

Verify after every rebuild (BEFORE redistributing 22-48 GB to workers):

docker run --rm --entrypoint python3 vllm-node:latest -c \
  "import torch; print(torch.__version__, torch.version.cuda)"
# MUST print: 2.11.0+cu130 13.0
# If it prints 2.10.0+cpu, the rebuild silently regressed — DO NOT distribute.
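
The same check can be scripted so that it covers both live images in one pass. This is a sketch under assumptions: the helper name torch_version_ok is hypothetical, the expected string is taken from the comment above, and --pull never is used so that a missing local image fails instead of triggering a registry pull.

```shell
# Expected output of the torch one-liner, taken from the check above.
EXPECTED_TORCH="2.11.0+cu130 13.0"

# Hypothetical helper: succeeds only if the reported string matches exactly.
torch_version_ok() {
  [ "$1" = "$EXPECTED_TORCH" ]
}

# Check both live images when docker is available on this machine.
if command -v docker >/dev/null 2>&1; then
  for image in vllm-node:latest vllm-node-mimo:latest; do
    reported=$(docker run --rm --pull never --entrypoint python3 "$image" -c \
      "import torch; print(torch.__version__, torch.version.cuda)" 2>/dev/null || true)
    if torch_version_ok "$reported"; then
      echo "OK   $image: $reported"
    else
      echo "FAIL $image: got '${reported:-nothing}', do not distribute"
    fi
  done
fi
```

An exact string match is deliberately strict: a 2.10.0+cpu build and a missing image both show up as FAIL.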

Image rebuild + redistribute (~10 min total)

# On head GX10
cd ~/spark-vllm-docker

# 1. Sanity-check the wheels are current
ls wheels/
#   flashinfer_cubin-0.6.12-py3-none-any.whl
#   flashinfer_jit_cache-0.6.12-cp39-abi3-manylinux_2_28_aarch64.whl
#   flashinfer_python-0.6.12-py3-none-any.whl
#   vllm-0.22.1rc1.dev124+gace95c9cf.d20260603.cu132-cp312-cp312-linux_aarch64.whl

# 2. Build the base vllm-node image (Nemotron recipe).
#    -c means "cluster mode": build locally, then scp the tarball to the 3 workers.
./build-and-copy.sh -c

# 3. For the Qwen MTP recipe, also build the mimo overlay:
docker build -f Dockerfile.mimo-runtime -t vllm-node-mimo:latest .

# 4. VERIFY torch immediately in both images (see above; repeat the check
#    with vllm-node-mimo:latest in place of vllm-node:latest).

# 5. Distribute mimo to the workers over fabric:
docker save -o /tmp/vllm-node-mimo.tar vllm-node-mimo:latest
for n in <worker-fabric-ip-1> <worker-fabric-ip-2> <worker-fabric-ip-3>; do
  ssh "$n" 'docker load' < /tmp/vllm-node-mimo.tar &
done; wait
rm /tmp/vllm-node-mimo.tar

If vllm-node or vllm-node-mimo is ever lost (for example, to a docker image prune -a run while no container exists for the image, a known pitfall), restore it from the control-node backup tars at <control-workspace>/docker-images/:

# On control node, push tar back to whichever Spark is missing the image
scp -i ~/.ssh/<spark-key> \
    <control-workspace>/docker-images/vllm-node-mimo.tar \
    <spark-user>@<head-node>:/tmp/
ssh -i ~/.ssh/<spark-key> <spark-user>@<head-node> \
    'docker load -i /tmp/vllm-node-mimo.tar && rm /tmp/vllm-node-mimo.tar'

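Pitfall: docker image prune -a deletes the working image. Never run docker image prune -a while no container exists for the image (for example, after the relaunch wrapper's docker rm -f and before the new container starts); with no container referencing it, the image counts as unused and is removed. A stopped container that still exists keeps its image, a removed one does not. Always be explicit: use docker image rm <repo:tag>, or filter with docker image prune --filter "until=24h".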
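
Before pruning, it can help to list which containers still reference each image, since images that no existing container references are the ones docker image prune -a removes. This is a minimal sketch: the helper name image_is_referenced is hypothetical, and the lookup relies on the ancestor filter of docker ps.

```shell
# Hypothetical helper: succeeds if the given list of container IDs is
# non-empty, i.e. at least one existing container references the image.
image_is_referenced() {
  [ -n "$1" ]
}

# Inspect both live images when docker is available on this machine.
if command -v docker >/dev/null 2>&1; then
  for image in vllm-node:latest vllm-node-mimo:latest; do
    # -a includes stopped containers; -q prints only container IDs.
    ids=$(docker ps -a -q --filter "ancestor=$image" 2>/dev/null || true)
    if image_is_referenced "$ids"; then
      echo "$image: still referenced by a container"
    else
      echo "$image: no container references it; docker image prune -a would remove it"
    fi
  done
fi
```

If either image reports no references, skip the prune or confirm that the backup tar is in place first.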


Custom mods (vLLM source patches applied at run-recipe time)

Each recipe yaml lists which mods to apply. Each mod is a git-patch dir under ~/spark-vllm-docker/mods/. run-recipe.sh applies them inside the running container against the in-container vLLM source tree.

| Mod | Used by | Purpose |
|---|---|---|
| mods/gpu-mem-util-gb | Nemotron | Adds --gpu-memory-utilization-gb (raw GiB budget) on top of the fraction-based flag, because Spark's unified-memory math overshoots when you use the standard 0-1 fraction. |
| mods/nemotron-ultra | Nemotron | Registers the nemotron_h architecture in vLLM and pulls the nemotron_v3 reasoning parser. |

Qwen3.5-397B-A17B-FP8 + MTP needs no mods at the current wheel — the qwen3_next_mtp speculator path is in upstream vLLM main.

If git apply rejects a mod patch after an image rebuild (line drift in upstream vLLM), run-recipe.sh falls back to patch --fuzz=5. This is expected; both modes succeed in practice.


Weights staging

Each model's weights live on the control node under <control-workspace>/<org>/<model>/. Cluster nodes have a copy at <spark-model-root>/<org>/<model>/ which the recipe mounts into the container at /root/.cache/huggingface/<org>/<model>/.

| Model | HF source | Size per node | Path on cluster |
|---|---|---|---|
| Qwen3.5-397B-A17B-FP8 | Qwen/Qwen3.5-397B-A17B-FP8 | 379 GiB | <spark-model-root>/Qwen/Qwen3.5-397B-A17B-FP8/ |
| Nemotron-3-Ultra-550B-A55B-NVFP4 | nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-NVFP4 | 329 GiB | <spark-model-root>/nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-NVFP4/ |

Staging recipe (the documented pattern — much faster than fanning out from the control node directly):

# 1. control → head over 1 GbE (~100 MB/s, ~55 min for 329 GiB)
rsync -a --partial --info=progress2 \
  -e "ssh -i ~/.ssh/<spark-key>" \
  <control-workspace>/<org>/<model>/ \
  <spark-user>@<head-node>:<spark-model-root>/<org>/<model>/

# 2. head → 3 workers in parallel over 200 GbE fabric (~10 min)
ssh -i ~/.ssh/<spark-key> <spark-user>@<head-node> bash <<'REMOTE'
SRC=<spark-model-root>/<org>/<model>
for h in <worker-fabric-ip-1> <worker-fabric-ip-2> <worker-fabric-ip-3>; do
  ( rsync -a --partial -e "ssh -o StrictHostKeyChecking=no" \
      "$SRC/" "$h:$SRC/" ) &
done
wait
REMOTE
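
The timing quoted for step 1 follows from simple arithmetic. As a rough sketch, assuming the ~100 MB/s figure is read as 100 MiB/s sustained (an assumption, as is the helper name transfer_minutes), the estimate for either model can be reproduced like this:

```shell
# Estimate transfer time in whole minutes (rounded down).
#   $1 = payload size in GiB
#   $2 = sustained throughput in MiB/s (assumed, not measured here)
transfer_minutes() {
  echo $(( $1 * 1024 / $2 / 60 ))
}

transfer_minutes 329 100   # Nemotron weights over 1 GbE, prints 56
transfer_minutes 379 100   # Qwen weights over 1 GbE, prints 64
```

The 329 GiB result of about 56 minutes is in line with the ~55 min noted in the step 1 comment; the 379 GiB Qwen weights take roughly 64 minutes on the same link.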

The cluster is single-tenant: only one tp=4 workload runs at a time. Disk budget per Spark is tight (916 GiB partition; ~280-320 GiB headroom with Qwen weights resident and both images installed). That headroom is smaller than Nemotron's 329 GiB, so swapping models requires clearing the previous model's weights from at least the head node first.
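
Before staging a new model, the free space on a node can be compared against the model size from the table above. This is a minimal sketch, assuming GNU df; MODEL_ROOT is a hypothetical variable standing in for <spark-model-root>, and both helper names are illustrative only.

```shell
# Hypothetical helper: succeeds if a model of $1 GiB fits into $2 GiB free.
fits_in_headroom() {
  [ "$1" -le "$2" ]
}

# Free space on the filesystem holding path $1, as df reports it in
# 1 GiB blocks (assumes GNU df with --output support).
free_gib() {
  df --output=avail -BG "$1" | tail -n 1 | tr -dc '0-9'
}

# MODEL_ROOT stands in for <spark-model-root>; it defaults to / for illustration.
MODEL_ROOT=${MODEL_ROOT:-/}
NEED_GIB=329   # Nemotron-3-Ultra weights, from the table above

avail=$(free_gib "$MODEL_ROOT" 2>/dev/null || true)
if [ -z "$avail" ]; then
  echo "could not read free space for $MODEL_ROOT"
elif fits_in_headroom "$NEED_GIB" "$avail"; then
  echo "ok: ${avail} GiB free, ${NEED_GIB} GiB needed"
else
  echo "clear old weights first: ${avail} GiB free, ${NEED_GIB} GiB needed"
fi
```

Running this on the head node before step 1 of the staging recipe avoids finding out mid-rsync that the partition is full.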


Launching

Each model has a relaunch wrapper that:

  1. Runs docker rm -f on the previous container on all 4 nodes.
  2. Runs ./run-recipe.sh <recipe>.yaml -d ..., which starts the image with docker run -d, applies mods, and execs vllm serve with the recipe's flags substituted in.
  3. Returns. The engine then loads weights (~10-13 min); poll http://<head-node>:8000/health until it returns 200.

| Model | Relaunch wrapper on head | Master port |
|---|---|---|
| Qwen3.5-397B-A17B-FP8 + MTP | ~/spark-vllm-docker/relaunch-qwen397-fp8-mtp-qwen3next-tp4.sh | 29510 |
| Nemotron-3-Ultra | ~/spark-vllm-docker/relaunch-nemotron3-ultra-nvfp4-tp4.sh | 29520 |

The two master ports are distinct so the launchers don't collide if someone forgets to stop the previous one (though single-tenancy means they shouldn't both be running anyway).
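
The health polling after a relaunch can be scripted with a small retry loop. This is a sketch, not part of the relaunch wrappers: poll_until and health_ok are hypothetical names, HEAD_NODE is a hypothetical variable holding the head node's address, and 80 tries at 15 s (20 min) is an assumed budget chosen to cover the ~10-13 min weight load.

```shell
# Generic retry loop: run a probe command up to $1 times, sleeping $2 seconds
# between attempts. Succeeds as soon as the probe succeeds.
poll_until() {
  local tries=$1 delay=$2 attempt=1
  shift 2
  while :; do
    if "$@"; then
      return 0
    fi
    if [ "$attempt" -ge "$tries" ]; then
      return 1
    fi
    attempt=$((attempt + 1))
    sleep "$delay"
  done
}

# Probe for the health endpoint on port 8000 of host $1.
# curl -f turns HTTP error statuses into a non-zero exit code.
health_ok() {
  curl -fs -o /dev/null --max-time 5 "http://$1:8000/health"
}

# Usage (HEAD_NODE is hypothetical; set it to the head node's address):
#   poll_until 80 15 health_ok "$HEAD_NODE" && echo "engine ready"
```

Separating the retry loop from the probe keeps the loop reusable for other readiness checks.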


Bench harness (shared)

Both blog posts use the same Python bench client at <control-workspace>/blog/resources/bench.py. Each post's own folder holds only its results.json + logs/; the harness itself is shared. The MODEL constant and concurrency-sweep points are wired through env vars / args, not by editing per-post copies.

Requirements on the control node:

sudo apt install python3-requests

(The control node has Python 3.14 with no pip and a broken venv; apt is the path that actually works — see operator memory entries.)

The harness source is inlined in each model's blog post.

Run the bench:

cd <control-workspace>/blog            # or blog/nemotron-3-ultra
/usr/bin/python3 <control-workspace>/blog/resources/bench.py > results.json 2> bench.log

Sampling is temperature=0.0 so the runs are deterministic modulo MTP acceptance variance and Ray scheduler ordering.


See also