Qwen3.5-397B-A17B-FP8 with MTP on 4x ASUS GX10 — bench
Date: 2026-06-13
Hardware: 4x ASUS Ascent GX10 (NVIDIA GB10 Blackwell, 128 GiB unified memory each, ConnectX-7 200 GbE fabric), TP=4 + Ray
Container image: vllm-node-mimo:latest, built from Dockerfile.mimo-runtime on top of vllm-node-tf5 — see Reproducibility — Docker image below for the exact build steps and pinned versions
vLLM: 0.22.1rc1.dev124+gace95c9cf.d20260603.cu132 (local wheel)
Model: Qwen/Qwen3.5-397B-A17B-FP8 (HF), local path /root/.cache/huggingface/Qwen/Qwen3.5-397B-A17B-FP8, ~379 GiB per node
MTP: qwen3_next_mtp, num_speculative_tokens=2 (single-head MTP — see §3 for why more breaks)
KV cache: fp8; max_model_len=262144; max_num_seqs=16; max_num_batched_tokens=8192
GPU memory budget: --gpu-memory-utilization 0.90 (the proven MTP ceiling — 0.92 has crashed twice)
API: http://<head-node>:8000
Bench client: ../resources/bench.py, streaming via requests + iter_lines() with stream_options={"include_usage": True}
Field notes
Qwen3.5-397B-A17B-FP8 is still the default model I reach for on this cluster. Once MTP was actually working, the performance moved from "large local model experiment" into "use this as a daily tool" territory: single-stream decode is fast enough to feel responsive, and aggregate throughput scales well enough that OpenClaude, Pi, and OpenCode can sit on top of it without the serving stack feeling fragile.
The other reason it remains the baseline is operational. The same Qwen3.5 family gives a clean migration path down to 2x Spark when using NVFP4, so the work here is not a one-off 4-node stunt. The tradeoff is behavioral rather than infrastructural: the model can get itself into continuous loops, especially in agentic sessions where tool results keep feeding back into the context. That is the main reason I keep comparing newer fine-tunes against it instead of just declaring the search over.
For full cluster + image setup details shared with the Nemotron-3-Ultra post, see ../resources/INFRA.md.
Companion files: launch script, bench harness, and wheel provenance are checked into ../resources/.
Sampling for both tests: temperature=0.0. Prompts are tokenized via vLLM's /tokenize endpoint so the input token count is exact.
Benchmarks
1. Concurrency sweep — 10k in / 1024 out
Each request: ~9 818 input tokens (built from varied English filler) + an instruction to continue the narrative; max_tokens=1024. All concurrent requests fire simultaneously (synchronized via a threading.Barrier). Per‑request and aggregate throughput reported.
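A minimal sketch of the synchronized start described above. This is not the actual bench.py: fire_synchronized, fake_send, and the sleep-based stand-in for the HTTP call are hypothetical names used only for illustration.

```python
import threading
import time


def fire_synchronized(n, send):
    """Run send(i) on n threads that are all released at the same moment.

    `send` is a hypothetical callable standing in for one streaming request.
    Returns (start_times, results), both indexed by request id.
    """
    barrier = threading.Barrier(n)
    starts = [0.0] * n
    results = [None] * n

    def worker(i):
        barrier.wait()                   # block until all n threads are ready
        starts[i] = time.perf_counter()  # timestamp taken right after release
        results[i] = send(i)

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return starts, results


def fake_send(i):
    # Placeholder for the real HTTP call: sleeps briefly and echoes the id.
    time.sleep(0.01)
    return i


starts, results = fire_synchronized(4, fake_send)
spread_ms = (max(starts) - min(starts)) * 1000.0
print(f"start spread across 4 threads: {spread_ms:.2f} ms")
```

The barrier matters because thread startup is staggered; without it the first request would already be prefilling while the last thread is still being created, and the TTFT bands would partly reflect client skew rather than engine scheduling.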
Aggregate throughput (single‑run, fp8 KV, MTP active):
| N | Wall (s) | Agg prefill (t/s) | Agg decode (t/s) | Median TTFT (s) | Median per‑req decode (t/s) |
|---|---|---|---|---|---|
| 1 | 30.8 | 1 830 | 40.3 | 5.37 | 40.3 |
| 2 | 33.9 | 4 684 | 63.6 | 2.96 | 33.5 |
| 4 | 53.3 | 6 209 | 79.4 | 5.13 | 21.4 |
| 8 | 61.9 | 13 761 | 134.5 | 4.35 | 18.0 |
| 16 | 88.7 | 16 369 | 186.4 | 7.53 | 12.7 |
Per‑request totals (input always 9818 tokens, output always 1024 tokens):
N=1
| req | TTFT (s) | duration (s) | prefill (t/s) | decode (t/s) |
|---|---|---|---|---|
| 0 | 5.37 | 30.78 | 1 830 | 40.3 |
N=2
| req | TTFT (s) | duration (s) | prefill (t/s) | decode (t/s) |
|---|---|---|---|---|
| 0 | 1.72 | 33.21 | 5 693 | 32.5 |
| 1 | 4.19 | 33.88 | 2 342 | 34.5 |
N=4
| req | TTFT (s) | duration (s) | prefill (t/s) | decode (t/s) |
|---|---|---|---|---|
| 0 | 5.13 | 52.79 | 1 912 | 21.5 |
| 1 | 5.14 | 53.19 | 1 912 | 21.3 |
| 2 | 6.32 | 53.26 | 1 553 | 21.8 |
| 3 | 1.75 | 52.58 | 5 601 | 20.1 |
N=8
| req | TTFT (s) | duration (s) | prefill (t/s) | decode (t/s) |
|---|---|---|---|---|
| 0 | 4.35 | 61.86 | 2 258 | 17.8 |
| 1 | 4.35 | 58.85 | 2 258 | 18.8 |
| 2 | 4.35 | 60.88 | 2 255 | 18.1 |
| 3 | 4.35 | 61.62 | 2 257 | 17.9 |
| 4 | 5.70 | 61.71 | 1 721 | 18.3 |
| 5 | 1.01 | 59.84 | 9 732 | 17.4 |
| 6 | 5.70 | 59.16 | 1 722 | 19.1 |
| 7 | 4.35 | 61.50 | 2 255 | 17.9 |
N=16
| req | TTFT (s) | duration (s) | prefill (t/s) | decode (t/s) |
|---|---|---|---|---|
| 0 | 7.54 | 88.64 | 1 303 | 12.6 |
| 1 | 9.60 | 87.65 | 1 023 | 13.1 |
| 2 | 4.26 | 86.35 | 2 305 | 12.5 |
| 3 | 7.53 | 88.71 | 1 303 | 12.6 |
| 4 | 7.53 | 87.95 | 1 304 | 12.7 |
| 5 | 9.59 | 88.35 | 1 024 | 13.0 |
| 6 | 4.26 | 88.10 | 2 306 | 12.2 |
| 7 | 9.58 | 87.46 | 1 024 | 13.1 |
| 8 | 4.26 | 88.10 | 2 302 | 12.2 |
| 9 | 9.59 | 85.72 | 1 024 | 13.4 |
| 10 | 7.54 | 86.14 | 1 303 | 13.0 |
| 11 | 4.26 | 88.65 | 2 304 | 12.1 |
| 12 | 7.55 | 87.65 | 1 301 | 12.8 |
| 13 | 7.53 | 84.81 | 1 303 | 13.2 |
| 14 | 0.91 | 86.54 | 10 754 | 11.9 |
| 15 | 4.27 | 86.55 | 2 301 | 12.4 |
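For readers who want to recompute the aggregate row from the per-request rows, here is a sketch using the N=2 data. The definitions are inferred from the published numbers, not copied from bench.py: per-request decode as output tokens over (duration minus TTFT), aggregate prefill as total input tokens over the slowest TTFT, and aggregate decode as total output tokens over the window from the first token to the end of the run. Small differences against the tables come from the two-decimal rounding of the timings.

```python
INPUT_TOKENS = 9818
OUTPUT_TOKENS = 1024

# (TTFT s, duration s) for the two N=2 requests, copied from the table above.
n2_rows = [(1.72, 33.21), (4.19, 33.88)]


def per_request(ttft, duration):
    """Prefill and decode rate for one request (assumed definitions)."""
    prefill = INPUT_TOKENS / ttft
    decode = OUTPUT_TOKENS / (duration - ttft)
    return prefill, decode


def aggregate(rows):
    """Aggregate prefill and decode rate for one concurrency level (assumed definitions)."""
    n = len(rows)
    wall = max(d for _, d in rows)          # the run ends with the slowest request
    first_token = min(t for t, _ in rows)   # decode window opens at the first TTFT
    last_prefill = max(t for t, _ in rows)  # prefill window closes at the last TTFT
    agg_prefill = n * INPUT_TOKENS / last_prefill
    agg_decode = n * OUTPUT_TOKENS / (wall - first_token)
    return agg_prefill, agg_decode


_, decode0 = per_request(*n2_rows[0])
agg_prefill, agg_decode = aggregate(n2_rows)
print(f"req 0 decode {decode0:.1f} t/s, agg prefill {agg_prefill:.0f} t/s, agg decode {agg_decode:.1f} t/s")
```

The same formulas land within rounding of the N=1, 4, 8, and 16 rows as well.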
Notes on the shape
- Prefill is staircased because max_num_batched_tokens=8192. Each prefill needs ~9 818 input tokens, so vLLM admits prefills in micro-batches and TTFT comes in clusters (visible as 4 distinct TTFT bands at N=16: ~0.9, 4.3, 7.5, 9.6 s).
- Per-request decode falls cleanly with N (40 → 33 → 21 → 18 → 13 t/s) as MoE expert dispatch and inter-node Ray collectives become the binding cost.
- Aggregate decode scales sublinearly (40 → 64 → 79 → 134 → 186 t/s — ~4.6× from N=1 to N=16). With MTP enabled, the speculator's ~86% acceptance rate is the headroom that keeps per-stream decode usable even at N=16.
- The aggregate prefill plateau around N=16 (~16 k t/s) reflects KV-block contention plus the 8192 batched-tokens limit; raising --max-num-batched-tokens would shift this but cost KV memory.
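The TTFT bands can be pulled straight out of the N=16 table. A small sketch follows; the 0.5 s gap threshold is an arbitrary choice for illustration, not something the bench harness uses.

```python
# TTFT values (s) for requests 0..15 at N=16, copied from the table above.
ttft_n16 = [7.54, 9.60, 4.26, 7.53, 7.53, 9.59, 4.26, 9.58,
            4.26, 9.59, 7.54, 4.26, 7.55, 7.53, 0.91, 4.27]


def bands(values, gap=0.5):
    """Group sorted values into bands; a new band starts when the jump exceeds `gap` seconds."""
    groups = []
    for v in sorted(values):
        if groups and v - groups[-1][-1] <= gap:
            groups[-1].append(v)
        else:
            groups.append([v])
    return groups


groups = bands(ttft_n16)
summary = [(round(sum(g) / len(g), 1), len(g)) for g in groups]
print(summary)  # [(0.9, 1), (4.3, 5), (7.5, 6), (9.6, 4)]
```

One request gets through almost immediately, then the rest land in groups of 5, 6, and 4, which is the staircase the first bullet describes.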
2. Needle-in-a-haystack — 200k prefill
200 000-token filler (varied English sentence templates) with a single needle injected at ~50% depth:
IMPORTANT: The secret access code for Project Aurora is BANANA-MOOSE-7421. Remember this exactly.
Question appended at the end: "Based on the text above, what is the secret access code for Project Aurora? Reply with just the code itself, nothing else."
| Metric | Value |
|---|---|
| Input tokens | 199 402 |
| TTFT (≈ prefill) | 125.9 s |
| Prefill speed | 1 584 tokens/s |
| Output tokens | 669 (mostly the reasoning trace) |
| Decode speed | 39.5 t/s |
| Total wall | 142.8 s |
| Needle retrieved | ✅ yes — exact match |
Model's final answer (after <think> block):
BANANA-MOOSE-7421
Reasoning trace head (truncated to 400 chars):
Thinking Process:
1. Analyze the Request:
* Input: A very long text containing numerous sentences about various
characters (cat, team, librarian, John, chef, engineer, child, Mary,
researcher, fox, gardener, Dr. Reed, pilot, Captain Ortiz, detective,
astronaut) performing actions (walked, discovered, spoke, decided,
etc.) involving various objects (printing mistake, photograph,
door, l...
Prefill of 199 k tokens in 126 s (~1.58 k t/s) is in the same ballpark as the single-request prefill rate from §1 (~1.8 k t/s). The request also fits comfortably inside the engine-reported KV pool at this configuration (551 076 token-slots, 2.10× concurrency at the full 262 k context), so a single 199 k request has room to spare for the 669-token completion.
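The arithmetic behind those figures, as a quick sanity check. All inputs are copied from the table and the engine log; nothing here is measured independently.

```python
KV_POOL_TOKENS = 551_076   # engine-reported GPU KV cache size
MAX_MODEL_LEN = 262_144
INPUT_TOKENS = 199_402
OUTPUT_TOKENS = 669
TTFT_S = 125.9
DECODE_TPS = 39.5

concurrency = KV_POOL_TOKENS / MAX_MODEL_LEN   # full-context requests that fit at once
prefill_tps = INPUT_TOKENS / TTFT_S            # treats TTFT as pure prefill time
decode_s = OUTPUT_TOKENS / DECODE_TPS
total_wall = TTFT_S + decode_s
slots_left = KV_POOL_TOKENS - (INPUT_TOKENS + OUTPUT_TOKENS)

print(f"concurrency {concurrency:.2f}x, prefill {prefill_tps:.0f} t/s, "
      f"wall {total_wall:.1f} s, free slots {slots_left}")
```

The 2.10× concurrency, 1 584 t/s prefill, and 142.8 s wall all fall out of the raw numbers, so the table is internally consistent.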
Engine-reported config at boot
Pulled from docker logs qwen397-fp8-mtp-tp4 on the head GX10:
- Effective speculative_config = SpeculativeConfig(method='mtp', num_spec_tokens=2). Note that the wheel deprecates qwen3_next_mtp and remaps it to mtp.
- Drafter loaded on all 4 ranks; Detected MTP model. Sharing target model embedding weights with the draft model. is logged once per rank.
- Available KV cache: 4.65–5.18 GiB per rank (tightest 4.65 GiB at TP1)
- GPU KV cache size: 551 076 tokens
- Maximum vLLM-reported concurrency at full 262 144-token request: 2.10×
A live SpecDecoding metrics line from the same log:
SpecDecoding metrics: Mean acceptance length: 2.72,
Accepted: 31 tokens, Drafted: 36 tokens,
Per-position acceptance rate: 0.889, 0.833,
Avg Draft acceptance rate: 86.1%
The ~86% acceptance is what turns the 2-token speculator into the ~1.7× decode speedup visible vs the (older, non-MTP) baseline of ~23.5 t/s on this same cluster.
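How the numbers in that metrics line relate to each other, as a sketch. The per-position accepted counts (16 and 15) are back-computed from the logged rates, and the "1 + accepted / rounds" form of mean acceptance length is my reading of the log line, chosen because it reproduces 2.72 exactly.

```python
NUM_SPEC_TOKENS = 2
accepted, drafted = 31, 36                       # from the log line above

num_rounds = drafted // NUM_SPEC_TOKENS          # 18 speculation rounds
acceptance_rate = accepted / drafted             # share of drafted tokens kept
mean_accept_len = 1 + accepted / num_rounds      # +1 for the token the target always emits

# Inferred counts: 16 of 18 first-position drafts and 15 of 18 second-position drafts kept.
per_position = [16 / num_rounds, 15 / num_rounds]

observed_speedup = 40.3 / 23.5                   # MTP-2 decode vs the older non-MTP baseline

print(f"acceptance {acceptance_rate:.1%}, mean length {mean_accept_len:.2f}, "
      f"per-position {per_position[0]:.3f}/{per_position[1]:.3f}, "
      f"speedup {observed_speedup:.2f}x")
```

Mean acceptance length is an upper bound on the speedup (about 2.7 tokens per target forward pass); the observed ~1.7× is lower because the drafter passes and verification are not free.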
3. MTP=5 retest (2026-06-13, same cluster, same bench)
Relaunched the same recipe with --speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":5}' (raised from 2) to test whether more speculative tokens win on this model. vLLM warned at startup, repeating the warning from the MTP-2 run:
WARNING speculative.py:722 Enabling num_speculative_tokens > 1 will run
multiple times of forward on same MTP layer, which may result in lower
acceptance rate
Engine init confirmed: SpeculativeConfig(method='mtp', num_spec_tokens=5). The bench got partway through then crashed the engine:
| N | Wall (s) | Agg prefill (t/s) | Agg decode (t/s) | vs MTP-2 |
|---|---|---|---|---|
| 1 | 34.6 | 1 066 | 40.3 | tie |
| 2 | 43.5 | 3 324 | 54.4 | −14% vs 63.6 |
| 4 | 76.4 | 6 172 | malformed ⚠️ | request returned 0 completion tokens before scheduler crash |
| 8 | — | — | HTTP 500 | engine dead |
| 16 | — | — | — | not attempted |
At 2026-06-13 16:13:25, mid-N=8 step, all 4 ranks crashed with:
torch.AcceleratorError: CUDA error: an illegal memory access was encountered
[rank0]:[E613 16:13:25 ProcessGroupNCCL.cpp:2119] [PG ID 2 PG GUID 3 Rank 0]
Process group watchdog thread terminated with exception:
CUDA error: an illegal memory access was encountered
... Worker exit type: SYSTEM_ERROR ...
ERROR 06-13 16:14:15 [ray_executor_v2.py:464] RayWorkerProc rank=[0] died
unexpectedly, shutting down executor.
ERROR async_llm.py:704 vllm.v1.engine.exceptions.EngineDeadError: EngineCore
encountered an issue.
The container stayed Up (Ray PID1 persists per the known pitfall) but /health started refusing connections. Service was restored by relaunching the canonical MTP-2 script.
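Because the container reports Up while the engine is dead, container status alone is not a usable liveness signal. A minimal probe sketch that checks /health instead; engine_healthy is a hypothetical helper, and the base URL would be the head node's API address.

```python
import urllib.error
import urllib.request


def engine_healthy(base_url, timeout=3.0):
    """Return True only if GET {base_url}/health answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error status: treat as unhealthy.
        return False


# Nothing should be listening on port 1 locally, so this reports unhealthy.
print(engine_healthy("http://127.0.0.1:1", timeout=1.0))
```

Polling something like this from the control node would have flagged the crash even though docker ps still showed the container as running.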
Verdict
MTP-5 is worse than MTP-2 on this stack and not safe to run.
- Even where MTP-5 returned valid completions, it was slower than MTP-2:
- N=2 aggregate decode 54.4 t/s (MTP-5) vs 63.6 t/s (MTP-2) — −14%.
- N=1 decode was a wash (40.3 t/s in both runs), which makes sense: at single-stream the extra speculator forward passes cost about as much as the rare extra accepted token gains.
- It crashes the engine under concurrency. Illegal memory access is not a recoverable failure mode — it requires a full Ray cluster relaunch.
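- The per-position acceptance decay observed at MTP-2 (0.889 → 0.833, a 5.6 pp drop) projects a sharply worse return for MTP-5: if that decay continued, positions 3–5 would land somewhere in the 0.65–0.78 range with the same speculator being reused, which is consistent with vLLM's warning. This is a projection, not a measurement; the MTP-5 run crashed before a stable acceptance line could be captured.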
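The projection in the last bullet is a straight-line extrapolation of the two measured positions. A sketch, with the loud caveat that the constant per-position drop is an assumption; real decay could be steeper or shallower.

```python
p1, p2 = 0.889, 0.833          # measured per-position acceptance at MTP-2
step = p1 - p2                 # drop from position 1 to position 2

# Assumption: the drop per position stays constant (linear decay).
rates = [p1 - step * i for i in range(5)]

for pos, rate in enumerate(rates, start=1):
    tag = "measured" if pos <= 2 else "projected"
    print(f"position {pos}: {rate:.3f} ({tag})")
```

Under that assumption positions 3, 4, and 5 come out near 0.78, 0.72, and 0.67, while each one still costs a full extra pass through the single MTP layer.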
The likely root cause is Qwen3.5-MoE-MTP's architecture: it has a single MTP head that vLLM re-runs num_spec_tokens times. Architectures with native multi-token MTP heads (DeepSeek-V3 / Nemotron-3-Ultra style) tolerate higher num_speculative_tokens — Qwen3.5 here does not.
Operator memory updated: project-qwen397-mtp5-crash.md — never propose MTP-5 (or higher) on this model with this wheel.
MTP-3 not yet tested
Worth a follow-up to confirm whether MTP-3 is a small win, parity, or already losing. The MTP-5 crash means the test has real risk attached — recommend running it during a window where a relaunch is acceptable.
Reproducibility — Docker image
The container is vllm-node-mimo:latest layered on top of vllm-node-tf5 (which is built from the base vllm-node image in github.com/eugr/spark-vllm-docker). Lineage:
nvidia/cuda:13.2.0-devel-ubuntu24.04
│ + ccache, build tools, libibverbs (RDMA)
│ + pip install torch==2.11.0+cu130, triton, nvshmem
│
▼
vllm-node-tf5:latest (older transformers, stable for non-MTP)
│
└── Dockerfile.mimo-runtime (transformers≥5.0 + reinstall vllm/flashinfer wheels)
│
└── vllm-node-mimo:latest ◀── this benchmark used this image
Required pinned versions (do NOT skip)
| Package | Version | Source | Why |
|---|---|---|---|
| torch | 2.11.0+cu130 | https://download.pytorch.org/whl/cu130 | The fresh vLLM wheel ships metadata pinning torch==2.10.0 but the C++ ABI actually needs 2.11.0. If you let uv resolve naturally it lands torch==2.10.0+cpu and vllm dies at import with ImportError: libtorch_cuda.so: cannot open shared object file. Force-reinstall the cu130 wheel as a separate Dockerfile RUN layer AFTER the vllm install with --no-deps --force-reinstall. |
| transformers | ≥5.0.0 | pip | qwen3_next_mtp config classes live in transformers 5.x; tf5's base pin is 4.x, so mimo overrides it. |
| vllm | 0.22.1rc1.dev124+gace95c9cf.d20260603.cu132 | local wheel wheels/vllm-*.whl | qwen3_next_mtp speculator path. |
| flashinfer-python | 0.6.12 | local wheels wheels/flashinfer_*.whl | required for FP8 KV + the MTP attention path. |
Dockerfile.mimo-runtime (exact contents)
FROM vllm-node-tf5:latest
ENV PIP_BREAK_SYSTEM_PACKAGES=1
ENV UV_SYSTEM_PYTHON=1
ENV UV_BREAK_SYSTEM_PACKAGES=1
ENV UV_LINK_MODE=copy
COPY wheels/*.whl /tmp/mimo-wheels/
RUN printf "%s\n" "transformers>=5.0.0" > /tmp/tf-override.txt \
&& uv pip install /tmp/mimo-wheels/*.whl --override /tmp/tf-override.txt \
&& rm -rf /tmp/mimo-wheels /tmp/tf-override.txt
# Fix torch CPU regression: the fresh vllm wheel ships metadata pinning
# torch==2.10.0 but the C++ ABI needs 2.11.0 from cu130. Force-reinstall.
RUN uv pip install --no-deps --force-reinstall \
--index-url https://download.pytorch.org/whl/cu130 \
torch==2.11.0
Build + verify + distribute (~10 min total on the head GX10)
# On head GX10 (<spark-user>@<head-node>)
cd ~/spark-vllm-docker
# 1. Drop wheels in wheels/ — vllm-0.22.1rc1.dev124+... and flashinfer_*-0.6.12
ls wheels/
# vllm-0.22.1rc1.dev124+gace95c9cf.d20260603.cu132-cp312-cp312-linux_aarch64.whl
# flashinfer_python-0.6.12-py3-none-any.whl
# flashinfer_jit_cache-0.6.12-cp39-abi3-manylinux_2_28_aarch64.whl
# flashinfer_cubin-0.6.12-py3-none-any.whl
# 2. Build (assumes vllm-node-tf5 already exists from build-and-copy.sh)
docker build -f Dockerfile.mimo-runtime -t vllm-node-mimo:latest .
# 3. VERIFY torch immediately BEFORE redistributing
docker run --rm --entrypoint python3 vllm-node-mimo:latest -c \
"import torch; print(torch.__version__, torch.version.cuda)"
# MUST print: 2.11.0+cu130 13.0
# If it prints 2.10.0+cpu, the rebuild silently regressed — do NOT distribute.
# 4. Fan out to the 3 workers over the 200 GbE fabric
docker save vllm-node-mimo:latest > /tmp/vllm-node-mimo.tar
for n in <worker-fabric-ip-1> <worker-fabric-ip-2> <worker-fabric-ip-3>; do
ssh "$n" 'docker load' < /tmp/vllm-node-mimo.tar &
done; wait
rm /tmp/vllm-node-mimo.tar
Pitfall: docker image prune -a will silently delete this image when no container (running or stopped) references it. Never run prune -a blind; either remove by tag (docker image rm vllm-node-mimo:latest, if you actually want to) or filter with --filter "until=24h", which only spares images created within the last 24 hours. We've reproduced this failure mode and now keep tar backups at <control-workspace>/docker-images/vllm-node-mimo.tar on the control node so a lost image can be restored without a wheels rebuild.
Reproducibility — launch the cluster
The launcher repo and recipe yaml are checked into
~/spark-vllm-docker/ on the head GX10 (copy at
../resources/scripts/relaunch-qwen397-fp8-mtp-qwen3next-tp4.sh).
One-shot:
ssh -i ~/.ssh/<spark-key> <spark-user>@<head-node> \
'cd ~/spark-vllm-docker && ./relaunch-qwen397-fp8-mtp-qwen3next-tp4.sh'
The relaunch script bakes in the exact vllm serve flags used for this
bench. The full expanded command (after recipe + env substitution):
docker run -d --name qwen397-fp8-mtp-tp4 \
--runtime nvidia --network host --ipc host --shm-size 16g \
-e HF_HUB_OFFLINE=1 -e TRANSFORMERS_OFFLINE=1 \
-e PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
-e VLLM_USE_DEEP_GEMM=0 -e VLLM_USE_FLASHINFER_MOE_FP16=1 \
-e VLLM_USE_FLASHINFER_SAMPLER=0 -e OMP_NUM_THREADS=4 \
-v <spark-model-root>:/root/.cache/huggingface \
vllm-node-mimo:latest \
vllm serve /root/.cache/huggingface/Qwen/Qwen3.5-397B-A17B-FP8 \
--served-model-name Qwen3.5-397B-A17B-FP8 \
--host 0.0.0.0 --port 8000 \
--max-model-len 262144 \
--gpu-memory-utilization 0.90 \
--load-format safetensors \
--enable-prefix-caching \
--enable-auto-tool-choice \
--tool-call-parser qwen3_coder \
--reasoning-parser qwen3 \
--max-num-batched-tokens 8192 \
--max-num-seqs 16 \
--trust-remote-code \
-tp 4 --distributed-executor-backend ray \
--mm-encoder-tp-mode data \
--kv-cache-dtype fp8 \
--compilation-config.cudagraph_mode none \
--attention-backend flashinfer \
--speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":2}'
Note qwen3_next_mtp is deprecated and transparently remapped to plain
"mtp" on this wheel — older builds rejected "mtp" as non-functional;
the relationship inverted in dev124. Both names work today; once the
alias is removed in a future wheel, flip to "mtp".
Verify the engine actually wired up MTP by grepping the container logs for
Detected MTP model. Sharing target model embedding weights (one line per
TP rank); that is the unambiguous signal.
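The same check as a small script, for use in a post-launch smoke test. mtp_wired is a hypothetical helper and the sample log below is synthetic (the rank prefixes are made up); in practice log_text would be the captured output of docker logs qwen397-fp8-mtp-tp4.

```python
MARKER = "Detected MTP model. Sharing target model embedding weights"


def mtp_wired(log_text, tp_size=4):
    """True if the marker line appears at least once per TP rank."""
    hits = sum(1 for line in log_text.splitlines() if MARKER in line)
    return hits >= tp_size


# Synthetic log for illustration only: four ranks each report the marker.
sample_log = "\n".join(
    f"(rank {r}) INFO {MARKER} with the draft model." for r in range(4)
)

print(mtp_wired(sample_log))                   # all four ranks reported
print(mtp_wired("INFO engine started"))        # no marker at all
```

Counting lines rather than just matching once catches the case where only some ranks loaded the drafter.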
Boot time: ~10-14 min to /health=200 depending on whether weights
are warm in page cache. Model loading dominates (~95 GiB per rank).
Reproducibility — bench the cluster
# On the control node (Python 3.14; needs python3-requests from apt
# because pip/venv are not set up here).
cd <control-workspace>/blog/qwen-3.5-397b
/usr/bin/python3 ../resources/bench.py > results.json 2> bench.log
bench.py builds the prompts via the running engine's /tokenize
endpoint so the input token count is exact, fires N concurrent streams
synchronized on a threading.Barrier, parses streaming delta.content
and delta.reasoning plus the usage block from the trailing chunk,
and reports both per-request and wall-window aggregate throughput.
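A sketch of the stream-parsing step, assuming the OpenAI-style server-sent-event framing (data: {json} lines ending with data: [DONE]). parse_stream is a hypothetical name and not the code in bench.py; the sample chunks are synthetic and trimmed to the fields that matter.

```python
import json


def parse_stream(lines):
    """Accumulate content, reasoning, and usage from OpenAI-style SSE lines."""
    content, reasoning, usage = [], [], None
    for raw in lines:
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        if not line.startswith("data:"):
            continue                       # skip blank separator lines
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        if chunk.get("usage"):
            usage = chunk["usage"]         # trailing chunk when include_usage is set
        for choice in chunk.get("choices", []):
            delta = choice.get("delta", {})
            content.append(delta.get("content") or "")
            reasoning.append(delta.get("reasoning") or "")
    return "".join(content), "".join(reasoning), usage


# Synthetic stream: one reasoning delta, one content delta, then the usage chunk.
sample = [
    b'data: {"choices": [{"delta": {"reasoning": "think "}}]}',
    b"",
    b'data: {"choices": [{"delta": {"content": "BANANA"}}]}',
    b'data: {"choices": [], "usage": {"prompt_tokens": 5, "completion_tokens": 2}}',
    b"data: [DONE]",
]
text, trace, usage = parse_stream(sample)
print(text, "|", trace, "|", usage)
```

Reading both delta fields matters for this model: most of the output tokens in the needle test were reasoning trace, so a parser that only watched delta.content would see a long stall before the first visible token.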
The shared harness lives at ../resources/bench.py
(same file used for the Nemotron-3-Ultra post).
See ../resources/INFRA.md for full cluster +
bench harness details shared with the Nemotron-3-Ultra blog post, and
<control-workspace>/recipes/qwen-3.5-397b-fp8-mtp.md for the
full attempt history and pitfall catalog.
Files in this folder
- README.md: this file
- results.json: full raw output (every per-request timing)
- logs/: bench.log + MTP-5 retest artifacts (bench-mtp5.log, results-mtp5.json) + the historical wait_and_bench.py wrapper (hardcoded pre-restructure paths; kept as a record of the original run, not runnable as-is)
Shared scripts/mods/recipes used by this post live in
../resources/.