Qwen3.5-397B-A17B-FP8 with MTP on 4x ASUS GX10 — bench

Date: 2026-06-13
Hardware: 4x ASUS Ascent GX10 (NVIDIA GB10 Blackwell, 128 GiB unified memory each, ConnectX-7 200 GbE fabric), TP=4 + Ray
Container image: vllm-node-mimo:latest, built from Dockerfile.mimo-runtime on top of vllm-node-tf5 (see the Docker image section under Reproducibility below for the exact build steps and pinned versions)
vLLM: 0.22.1rc1.dev124+gace95c9cf.d20260603.cu132 (local wheel)
Model: Qwen/Qwen3.5-397B-A17B-FP8 (HF), local path /root/.cache/huggingface/Qwen/Qwen3.5-397B-A17B-FP8, ~379 GiB per node
MTP: qwen3_next_mtp, num_speculative_tokens=2 (single-head MTP; see §3 for why more breaks)
KV cache: fp8; max_model_len=262144; max_num_seqs=16; max_num_batched_tokens=8192
GPU memory budget: --gpu-memory-utilization 0.90 (the proven MTP ceiling; 0.92 has crashed twice)
API: http://<head-node>:8000
Bench client: ../resources/bench.py, streaming via requests + iter_lines() with stream_options={"include_usage": True}

Field notes

Qwen3.5-397B-A17B-FP8 is still the default model I reach for on this cluster. Once MTP was actually working, the performance moved from "large local model experiment" into "use this as a daily tool" territory: single-stream decode is fast enough to feel responsive, and aggregate throughput scales well enough that OpenClaude, Pi, and OpenCode can sit on top of it without the serving stack feeling fragile.

The other reason it remains the baseline is operational. The same Qwen3.5 family gives a clean migration path down to 2x Spark when using NVFP4, so the work here is not a one-off 4-node stunt. The tradeoff is behavioral rather than infrastructural: the model can get itself into continuous loops, especially in agentic sessions where tool results keep feeding back into the context. That is the main reason I keep comparing newer fine-tunes against it instead of just declaring the search over.

For full cluster + image setup details shared with the Nemotron-3-Ultra post: see ../resources/INFRA.md.

Companion files: launch script, bench harness, and wheel provenance are checked into ../resources/.

Sampling for both tests: temperature=0.0. Prompts are tokenized via vLLM's /tokenize endpoint so the input token count is exact.
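
As a rough illustration of the exact-count step, the sketch below builds a /tokenize request with the standard library only. It is not the code in bench.py (which uses requests); the helper names are hypothetical, and the `{"model", "prompt"}` request body and `count` response field are assumptions based on vLLM's tokenizer API that should be checked against the wheel in use.

```python
import json
import urllib.request


def build_tokenize_request(base_url: str, model: str, text: str) -> urllib.request.Request:
    """Build a POST to <base_url>/tokenize (hypothetical helper, not from bench.py)."""
    payload = json.dumps({"model": model, "prompt": text}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/tokenize",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def count_tokens(base_url: str, model: str, text: str, timeout: float = 60.0) -> int:
    """Ask the running engine how many tokens `text` encodes to."""
    req = build_tokenize_request(base_url, model, text)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        body = json.load(resp)
    # Assumption: the response carries the token count under "count".
    return int(body["count"])
```

Counting on the server rather than with a local tokenizer keeps the reported input size tied to whatever tokenizer and chat template the engine actually loaded.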


Benchmarks

1. Concurrency sweep — 10k in / 1024 out

Each request: ~9 818 input tokens (built from varied English filler) + an instruction to continue the narrative; max_tokens=1024. All concurrent requests fire simultaneously (synchronized via a threading.Barrier). Per‑request and aggregate throughput are reported.

Aggregate throughput (single‑run, fp8 KV, MTP active):

| N | Wall (s) | Agg prefill (t/s) | Agg decode (t/s) | Median TTFT (s) | Median per‑req decode (t/s) |
|---|---|---|---|---|---|
| 1 | 30.8 | 1 830 | 40.3 | 5.37 | 40.3 |
| 2 | 33.9 | 4 684 | 63.6 | 2.96 | 33.5 |
| 4 | 53.3 | 6 209 | 79.4 | 5.13 | 21.4 |
| 8 | 61.9 | 13 761 | 134.5 | 4.35 | 18.0 |
| 16 | 88.7 | 16 369 | 186.4 | 7.53 | 12.7 |
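
To put the scaling in one place, here is a small sketch that turns the aggregate decode column into a speedup over N=1 and a per-stream efficiency. The throughput values are copied from the table; the `scaling` helper and the "efficiency" definition (speedup divided by N) are my own framing, not something bench.py reports.

```python
# Aggregate decode throughput (t/s) by concurrency N, copied from the table above.
agg_decode = {1: 40.3, 2: 63.6, 4: 79.4, 8: 134.5, 16: 186.4}


def scaling(agg: dict[int, float]) -> dict[int, tuple[float, float]]:
    """Return {N: (speedup vs N=1, speedup / N)} for each concurrency level."""
    base = agg[1]
    return {n: (tps / base, tps / (base * n)) for n, tps in sorted(agg.items())}


for n, (speedup, efficiency) in scaling(agg_decode).items():
    print(f"N={n:>2}  speedup={speedup:.2f}x  efficiency={efficiency:.0%}")
```

Read this way, aggregate throughput keeps rising through N=16 while each individual stream slows down, which is the trade visible in the last column of the table.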

Per‑request totals (input always 9818 tokens, output always 1024 tokens):

N=1

| req | TTFT (s) | duration (s) | prefill (t/s) | decode (t/s) |
|---|---|---|---|---|
| 0 | 5.37 | 30.78 | 1 830 | 40.3 |

N=2

| req | TTFT (s) | duration (s) | prefill (t/s) | decode (t/s) |
|---|---|---|---|---|
| 0 | 1.72 | 33.21 | 5 693 | 32.5 |
| 1 | 4.19 | 33.88 | 2 342 | 34.5 |

N=4

| req | TTFT (s) | duration (s) | prefill (t/s) | decode (t/s) |
|---|---|---|---|---|
| 0 | 5.13 | 52.79 | 1 912 | 21.5 |
| 1 | 5.14 | 53.19 | 1 912 | 21.3 |
| 2 | 6.32 | 53.26 | 1 553 | 21.8 |
| 3 | 1.75 | 52.58 | 5 601 | 20.1 |

N=8

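| req | TTFT (s) | duration (s) | prefill (t/s) | decode (t/s) |
|---|---|---|---|---|
| 0 | 4.35 | 61.86 | 2 258 | 17.8 |
| 1 | 4.35 | 58.85 | 2 258 | 18.8 |
| 2 | 4.35 | 60.88 | 2 255 | 18.1 |
| 3 | 4.35 | 61.62 | 2 257 | 17.9 |
| 4 | 5.70 | 61.71 | 1 721 | 18.3 |
| 5 | 1.01 | 59.84 | 9 732 | 17.4 |
| 6 | 5.70 | 59.16 | 1 722 | 19.1 |
| 7 | 4.35 | 61.50 | 2 255 | 17.9 |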

N=16

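| req | TTFT (s) | duration (s) | prefill (t/s) | decode (t/s) |
|---|---|---|---|---|
| 0 | 7.54 | 88.64 | 1 303 | 12.6 |
| 1 | 9.60 | 87.65 | 1 023 | 13.1 |
| 2 | 4.26 | 86.35 | 2 305 | 12.5 |
| 3 | 7.53 | 88.71 | 1 303 | 12.6 |
| 4 | 7.53 | 87.95 | 1 304 | 12.7 |
| 5 | 9.59 | 88.35 | 1 024 | 13.0 |
| 6 | 4.26 | 88.10 | 2 306 | 12.2 |
| 7 | 9.58 | 87.46 | 1 024 | 13.1 |
| 8 | 4.26 | 88.10 | 2 302 | 12.2 |
| 9 | 9.59 | 85.72 | 1 024 | 13.4 |
| 10 | 7.54 | 86.14 | 1 303 | 13.0 |
| 11 | 4.26 | 88.65 | 2 304 | 12.1 |
| 12 | 7.55 | 87.65 | 1 301 | 12.8 |
| 13 | 7.53 | 84.81 | 1 303 | 13.2 |
| 14 | 0.91 | 86.54 | 10 754 | 11.9 |
| 15 | 4.27 | 86.55 | 2 301 | 12.4 |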
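
For readers who want to recompute the columns, the sketch below shows formulas that reproduce the published numbers to within rounding. They are inferred from the tables, not read out of bench.py, so treat them as an assumption: per-request prefill is input tokens over TTFT, per-request decode is output tokens over the time after the first token, aggregate prefill divides total input by the slowest TTFT, and aggregate decode divides total output by the window from the earliest first token to the last finish.

```python
INPUT_TOKENS = 9818   # per request, from the sweep description
OUTPUT_TOKENS = 1024  # per request (max_tokens)


def per_request(ttft_s: float, duration_s: float) -> tuple[float, float]:
    """Per-request (prefill t/s, decode t/s) from TTFT and total duration."""
    prefill = INPUT_TOKENS / ttft_s
    decode = OUTPUT_TOKENS / (duration_s - ttft_s)
    return prefill, decode


def aggregate(requests: list[tuple[float, float]]) -> tuple[float, float]:
    """Aggregate (prefill t/s, decode t/s) for requests that all start at t=0.

    Each entry is (ttft_s, duration_s).
    """
    n = len(requests)
    first_token = min(ttft for ttft, _ in requests)
    slowest_ttft = max(ttft for ttft, _ in requests)
    last_finish = max(duration for _, duration in requests)
    agg_prefill = n * INPUT_TOKENS / slowest_ttft
    agg_decode = n * OUTPUT_TOKENS / (last_finish - first_token)
    return agg_prefill, agg_decode


# N=2 rows from the table above.
n2 = [(1.72, 33.21), (4.19, 33.88)]
print(per_request(5.37, 30.78))  # compare with the N=1 row
print(aggregate(n2))             # compare with the N=2 aggregate row
```

Small differences against the tables are expected because the TTFT and duration values shown here are already rounded to two decimals.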

Notes on the shape


2. Needle-in-a-haystack — 200k prefill

200 000-token filler (varied English sentence templates) with a single needle injected at ~50% depth:

IMPORTANT: The secret access code for Project Aurora is BANANA-MOOSE-7421. Remember this exactly.

Question appended at the end: "Based on the text above, what is the secret access code for Project Aurora? Reply with just the code itself, nothing else."
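
The prompt construction can be sketched roughly as follows. The needle and question strings are the ones quoted above; the sentence templates, the `build_prompt` and `needle_retrieved` helpers, and placing the needle by sentence index (rather than by token offset) are illustrative assumptions and not the actual bench.py implementation.

```python
import random

NEEDLE = ("IMPORTANT: The secret access code for Project Aurora is "
          "BANANA-MOOSE-7421. Remember this exactly.")
QUESTION = ("Based on the text above, what is the secret access code for "
            "Project Aurora? Reply with just the code itself, nothing else.")

# Hypothetical filler vocabulary; the real harness uses its own templates.
SUBJECTS = ["The cat", "The librarian", "The pilot", "The gardener"]
VERBS = ["walked past", "discovered", "spoke about", "decided on"]
OBJECTS = ["the photograph", "the door", "the printing mistake", "the map"]


def build_prompt(num_sentences: int, depth: float = 0.5, seed: int = 0) -> str:
    """Generate filler sentences and insert the needle at `depth` (0.0 to 1.0)."""
    rng = random.Random(seed)
    sentences = [
        f"{rng.choice(SUBJECTS)} {rng.choice(VERBS)} {rng.choice(OBJECTS)}."
        for _ in range(num_sentences)
    ]
    sentences.insert(int(num_sentences * depth), NEEDLE)
    return " ".join(sentences) + "\n\n" + QUESTION


def needle_retrieved(answer: str, code: str = "BANANA-MOOSE-7421") -> bool:
    """Exact-match check on the final answer, ignoring surrounding whitespace."""
    return answer.strip() == code
```

In practice the sentence count would be tuned against the server-side token count until the prompt lands near the 200 000-token target.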

| Metric | Value |
|---|---|
| Input tokens | 199 402 |
| TTFT (≈ prefill) | 125.9 s |
| Prefill speed | 1 584 tokens/s |
| Output tokens | 669 (mostly the reasoning trace) |
| Decode speed | 39.5 t/s |
| Total wall | 142.8 s |
| Needle retrieved | ✅ yes (exact match) |

Model's final answer (after <think> block):

BANANA-MOOSE-7421

Reasoning trace head (truncated to 400 chars):

Thinking Process:

1.  Analyze the Request:
    *   Input: A very long text containing numerous sentences about various
        characters (cat, team, librarian, John, chef, engineer, child, Mary,
        researcher, fox, gardener, Dr. Reed, pilot, Captain Ortiz, detective,
        astronaut) performing actions (walked, discovered, spoke, decided,
        etc.) involving various objects (printing mistake, photograph,
        door, l...

Prefilling 199 k tokens in 126 s (1 584 t/s) is in the same ballpark as the small-batch prefill rate from §1 (~1.8 k t/s). The request also fits comfortably in the engine-reported KV pool at this configuration: 551 076 token-slots, or 2.10× concurrency at the full 262 k context, so a single 199 k request fits with room to spare for the 669-token completion.
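
The arithmetic behind those figures is short enough to spell out. All inputs below are numbers quoted in this post; the variable names are just for illustration.

```python
KV_POOL_TOKENS = 551_076     # engine-reported KV cache size, in token slots
MAX_MODEL_LEN = 262_144      # --max-model-len
PROMPT_TOKENS = 199_402      # needle test input
COMPLETION_TOKENS = 669      # needle test output
TTFT_S = 125.9               # time to first token, roughly the prefill time

# How many full-context requests the KV pool could hold at once.
concurrency_at_full_context = KV_POOL_TOKENS / MAX_MODEL_LEN

# Tokens the needle request occupies, and what is left of the context window.
request_tokens = PROMPT_TOKENS + COMPLETION_TOKENS
context_headroom = MAX_MODEL_LEN - request_tokens

# Effective prefill rate for the 199k-token prompt.
prefill_rate = PROMPT_TOKENS / TTFT_S

print(f"{concurrency_at_full_context:.2f}x, {request_tokens} tokens used, "
      f"{context_headroom} tokens of headroom, {prefill_rate:.0f} t/s prefill")
```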


Engine-reported config at boot

Pulled from docker logs qwen397-fp8-mtp-tp4 on the head GX10:

A live SpecDecoding metrics line from the same log:

SpecDecoding metrics: Mean acceptance length: 2.72,
  Accepted: 31 tokens, Drafted: 36 tokens,
  Per-position acceptance rate: 0.889, 0.833,
  Avg Draft acceptance rate: 86.1%

The ~86% acceptance is what turns the 2-token speculator into the ~1.7× decode speedup visible vs the (older, non-MTP) baseline of ~23.5 t/s on this same cluster.
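
The metrics line is internally consistent, which the sketch below checks. It assumes that mean acceptance length counts the one token the target model always emits per step plus the accepted draft tokens per step; that reading reproduces the logged 2.72, but it is inferred from the numbers rather than taken from vLLM's source. The `spec_decode_stats` helper is hypothetical.

```python
def spec_decode_stats(accepted: int, drafted: int, k: int) -> dict[str, float]:
    """Derive summary stats from accepted/drafted token counts with k drafts per step."""
    draft_steps = drafted / k
    return {
        "draft_steps": draft_steps,
        "acceptance_rate": accepted / drafted,
        # One guaranteed token per step plus the accepted drafts per step.
        "mean_acceptance_length": 1 + accepted / draft_steps,
    }


stats = spec_decode_stats(accepted=31, drafted=36, k=2)

# Same quantity rebuilt from the per-position acceptance rates in the log.
mean_from_positions = 1 + 0.889 + 0.833

# Single-stream decode with MTP-2 vs the older non-MTP baseline quoted above.
speedup_vs_baseline = 40.3 / 23.5

print(stats, round(mean_from_positions, 2), round(speedup_vs_baseline, 2))
```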



3. MTP=5 retest (2026-06-13, same cluster, same bench)

Relaunched the same recipe with --speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":5}' (raised from 2) to test whether more speculative tokens win on this model. vLLM warned at startup, repeating the warning from the MTP-2 run:

WARNING speculative.py:722 Enabling num_speculative_tokens > 1 will run
multiple times of forward on same MTP layer, which may result in lower
acceptance rate

Engine init confirmed: SpeculativeConfig(method='mtp', num_spec_tokens=5). The bench got partway through the sweep before the engine crashed:

| N | Wall (s) | Agg prefill (t/s) | Agg decode (t/s) | vs MTP-2 |
|---|---|---|---|---|
| 1 | 34.6 | 1 066 | 40.3 | tie |
| 2 | 43.5 | 3 324 | 54.4 | −14% vs 63.6 |
| 4 | 76.4 | 6 172 | malformed | ⚠️ request returned 0 completion tokens before scheduler crash |
| 8 | HTTP 500 | | | engine dead |
| 16 | | | | not attempted |

At 2026-06-13 16:13:25, mid-N=8 step, all 4 ranks crashed with:

torch.AcceleratorError: CUDA error: an illegal memory access was encountered
[rank0]:[E613 16:13:25 ProcessGroupNCCL.cpp:2119] [PG ID 2 PG GUID 3 Rank 0]
  Process group watchdog thread terminated with exception:
  CUDA error: an illegal memory access was encountered
... Worker exit type: SYSTEM_ERROR ...
ERROR 06-13 16:14:15 [ray_executor_v2.py:464] RayWorkerProc rank=[0] died
  unexpectedly, shutting down executor.
ERROR async_llm.py:704 vllm.v1.engine.exceptions.EngineDeadError: EngineCore
  encountered an issue.

The container stayed Up (Ray PID1 persists per the known pitfall) but /health started refusing connections. Service was restored by relaunching the canonical MTP-2 script.

Verdict

MTP-5 is worse than MTP-2 on this stack and not safe to run.

  1. Even where MTP-5 returned valid completions, it was slower than MTP-2:
    • N=2 aggregate decode 54.4 t/s (MTP-5) vs 63.6 t/s (MTP-2) — −14%.
    • N=1 was a wash, which makes sense: at single-stream, the cost of the extra speculator forward passes cancels out the gain from the rare extra accepted token.
  2. It crashes the engine under concurrency. Illegal memory access is not a recoverable failure mode — it requires a full Ray cluster relaunch.
  3. The per-position acceptance decay observed at MTP-2 (0.889 → 0.833, a 6 pp drop) projects a sharply worse return for MTP-5: if the decay continues at the same rate while the same speculator is reused, positions 3–5 would land somewhere in the 0.65–0.78 range. This is a projection, not a measurement, but it is consistent with vLLM's warning.
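
The projection in point 3 can be reproduced with a straight-line extrapolation of the two measured positions. This is only a sketch under the assumption that acceptance keeps dropping by the same amount per position; nothing beyond positions 1 and 2 was measured, and `project_acceptance` is a hypothetical helper.

```python
def project_acceptance(p1: float, p2: float, positions: int) -> list[float]:
    """Linearly extrapolate per-position acceptance from the first two positions."""
    step = p1 - p2  # observed drop between position 1 and position 2
    return [max(0.0, p1 - step * i) for i in range(positions)]


rates = project_acceptance(0.889, 0.833, positions=5)
for position, rate in enumerate(rates, start=1):
    tag = "measured" if position <= 2 else "projected"
    print(f"position {position}: {rate:.3f} ({tag})")
```

A different decay model (for example a constant ratio instead of a constant drop) would give slightly different values, so the range is best read as a rough expectation.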

The likely root cause is Qwen3.5-MoE-MTP's architecture: it has a single MTP head that vLLM re-runs num_spec_tokens times. Architectures with native multi-token MTP heads (DeepSeek-V3 / Nemotron-3-Ultra style) tolerate higher num_speculative_tokens — Qwen3.5 here does not.

Operator memory updated: project-qwen397-mtp5-crash.md — never propose MTP-5 (or higher) on this model with this wheel.

MTP-3 not yet tested

Worth a follow-up to confirm whether MTP-3 is a small win, parity, or already losing. The MTP-5 crash means the test has real risk attached — recommend running it during a window where a relaunch is acceptable.


Reproducibility — Docker image

The container is vllm-node-mimo:latest layered on top of vllm-node-tf5 (which is built from the base vllm-node image in github.com/eugr/spark-vllm-docker). Lineage:

nvidia/cuda:13.2.0-devel-ubuntu24.04
   │  + ccache, build tools, libibverbs (RDMA)
   │  + pip install torch==2.11.0+cu130, triton, nvshmem
   │
   ▼
vllm-node-tf5:latest           (older transformers, stable for non-MTP)
   │
   └── Dockerfile.mimo-runtime  (transformers≥5.0 + reinstall vllm/flashinfer wheels)
       │
       └── vllm-node-mimo:latest   ◀── this benchmark used this image

Required pinned versions (do NOT skip)

| Package | Version | Source | Why |
|---|---|---|---|
| torch | 2.11.0+cu130 | https://download.pytorch.org/whl/cu130 | The fresh vLLM wheel ships metadata pinning torch==2.10.0 but the C++ ABI actually needs 2.11.0. If you let uv resolve naturally it lands torch==2.10.0+cpu and vllm dies at import with ImportError: libtorch_cuda.so: cannot open shared object file. Force-reinstall the cu130 wheel as a separate Dockerfile RUN layer AFTER the vllm install with --no-deps --force-reinstall. |
| transformers | ≥5.0.0 | pip | qwen3_next_mtp config classes live in transformers 5.x; tf5's base pin is 4.x, so mimo overrides it. |
| vllm | 0.22.1rc1.dev124+gace95c9cf.d20260603.cu132 | local wheel wheels/vllm-*.whl | qwen3_next_mtp speculator path. |
| flashinfer-python | 0.6.12 | local wheels wheels/flashinfer_*.whl | Required for FP8 KV + the MTP attention path. |

Dockerfile.mimo-runtime (exact contents)

FROM vllm-node-tf5:latest

ENV PIP_BREAK_SYSTEM_PACKAGES=1
ENV UV_SYSTEM_PYTHON=1
ENV UV_BREAK_SYSTEM_PACKAGES=1
ENV UV_LINK_MODE=copy

COPY wheels/*.whl /tmp/mimo-wheels/
RUN printf "%s\n" "transformers>=5.0.0" > /tmp/tf-override.txt \
    && uv pip install /tmp/mimo-wheels/*.whl --override /tmp/tf-override.txt \
    && rm -rf /tmp/mimo-wheels /tmp/tf-override.txt

# Fix torch CPU regression: the fresh vllm wheel ships metadata pinning
# torch==2.10.0 but the C++ ABI needs 2.11.0 from cu130. Force-reinstall.
RUN uv pip install --no-deps --force-reinstall \
    --index-url https://download.pytorch.org/whl/cu130 \
    torch==2.11.0

Build + verify + distribute (~10 min total on the head GX10)

# On head GX10 (<spark-user>@<head-node>)
cd ~/spark-vllm-docker

# 1. Drop wheels in wheels/ — vllm-0.22.1rc1.dev124+... and flashinfer_*-0.6.12
ls wheels/
#   vllm-0.22.1rc1.dev124+gace95c9cf.d20260603.cu132-cp312-cp312-linux_aarch64.whl
#   flashinfer_python-0.6.12-py3-none-any.whl
#   flashinfer_jit_cache-0.6.12-cp39-abi3-manylinux_2_28_aarch64.whl
#   flashinfer_cubin-0.6.12-py3-none-any.whl

# 2. Build (assumes vllm-node-tf5 already exists from build-and-copy.sh)
docker build -f Dockerfile.mimo-runtime -t vllm-node-mimo:latest .

# 3. VERIFY torch immediately BEFORE redistributing
docker run --rm --entrypoint python3 vllm-node-mimo:latest -c \
  "import torch; print(torch.__version__, torch.version.cuda)"
# MUST print: 2.11.0+cu130 13.0
# If it prints 2.10.0+cpu, the rebuild silently regressed — do NOT distribute.

# 4. Fan out to the 3 workers over the 200 GbE fabric
docker save vllm-node-mimo:latest > /tmp/vllm-node-mimo.tar
for n in <worker-fabric-ip-1> <worker-fabric-ip-2> <worker-fabric-ip-3>; do
  # Stream the tarball straight into `docker load` on each worker, in parallel.
  (ssh "$n" 'docker load' < /tmp/vllm-node-mimo.tar) &
done; wait
rm /tmp/vllm-node-mimo.tar

Pitfall: docker image prune -a silently deletes this image whenever no container (running or stopped) references it. Never run prune -a blind; either remove by tag (docker image rm vllm-node-mimo:latest, if you actually want to) or restrict the prune with --filter "until=24h", which only spares images created within the last 24 hours. We've reproduced this failure mode and now keep tar backups at <control-workspace>/docker-images/vllm-node-mimo.tar on the control node so a lost image can be restored without a wheels rebuild.

Reproducibility — launch the cluster

The launcher repo and recipe yaml are checked into ~/spark-vllm-docker/ on the head GX10 (copy at ../resources/scripts/relaunch-qwen397-fp8-mtp-qwen3next-tp4.sh). One-shot:

ssh -i ~/.ssh/<spark-key> <spark-user>@<head-node> \
  'cd ~/spark-vllm-docker && ./relaunch-qwen397-fp8-mtp-qwen3next-tp4.sh'

The relaunch script bakes in the exact vllm serve flags used for this bench. The full expanded command (after recipe + env substitution):

docker run -d --name qwen397-fp8-mtp-tp4 \
  --runtime nvidia --network host --ipc host --shm-size 16g \
  -e HF_HUB_OFFLINE=1 -e TRANSFORMERS_OFFLINE=1 \
  -e PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
  -e VLLM_USE_DEEP_GEMM=0 -e VLLM_USE_FLASHINFER_MOE_FP16=1 \
  -e VLLM_USE_FLASHINFER_SAMPLER=0 -e OMP_NUM_THREADS=4 \
  -v <spark-model-root>:/root/.cache/huggingface \
  vllm-node-mimo:latest \
  vllm serve /root/.cache/huggingface/Qwen/Qwen3.5-397B-A17B-FP8 \
    --served-model-name Qwen3.5-397B-A17B-FP8 \
    --host 0.0.0.0 --port 8000 \
    --max-model-len 262144 \
    --gpu-memory-utilization 0.90 \
    --load-format safetensors \
    --enable-prefix-caching \
    --enable-auto-tool-choice \
    --tool-call-parser qwen3_coder \
    --reasoning-parser qwen3 \
    --max-num-batched-tokens 8192 \
    --max-num-seqs 16 \
    --trust-remote-code \
    -tp 4 --distributed-executor-backend ray \
    --mm-encoder-tp-mode data \
    --kv-cache-dtype fp8 \
    --compilation-config.cudagraph_mode none \
    --attention-backend flashinfer \
    --speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":2}'

Note: qwen3_next_mtp is deprecated and transparently remapped to plain "mtp" on this wheel. Older builds rejected "mtp" as non-functional; the relationship inverted in dev124. Both names work today; once the alias is removed in a future wheel, flip to "mtp".

Verify the engine actually wired up MTP by grepping the container logs for "Detected MTP model. Sharing target model embedding weights" (one line per TP rank). That line is the unambiguous signal.
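
A minimal way to script that check, assuming the container name used in this post and that the marker appears once per rank (so 4 lines at TP=4). The helper names are hypothetical, and both stdout and stderr are scanned because `docker logs` can emit on either stream.

```python
import subprocess

MARKER = "Detected MTP model. Sharing target model embedding weights"


def count_mtp_lines(log_text: str) -> int:
    """Count log lines that contain the MTP wiring marker."""
    return sum(1 for line in log_text.splitlines() if MARKER in line)


def container_logs(name: str) -> str:
    """Return the combined stdout and stderr of `docker logs <name>`."""
    proc = subprocess.run(
        ["docker", "logs", name],
        capture_output=True,
        text=True,
        check=True,
    )
    return proc.stdout + proc.stderr


# Usage on the head node (expects one line per TP rank, i.e. 4 here):
#   count_mtp_lines(container_logs("qwen397-fp8-mtp-tp4"))
```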

Boot time: ~10-14 min to /health=200 depending on whether weights are warm in page cache. Model loading dominates (~95 GiB per rank).
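
Since the boot takes long enough to be worth automating, here is a sketch of a readiness wait that polls /health until it returns 200. The 20-minute budget and 15-second interval are arbitrary assumptions, and the `probe`, `sleep`, and `clock` parameters exist only so the loop can be exercised without a live server.

```python
import time
import urllib.error
import urllib.request


def health_ok(url: str, timeout: float = 5.0) -> bool:
    """True if GET <url> answers with HTTP 200; refused connections count as not ready."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_until_healthy(url, budget_s=20 * 60, interval_s=15.0,
                       probe=health_ok, sleep=time.sleep, clock=time.monotonic):
    """Poll until the probe succeeds; return elapsed seconds or raise TimeoutError."""
    start = clock()
    while clock() - start < budget_s:
        if probe(url):
            return clock() - start
        sleep(interval_s)
    raise TimeoutError(f"{url} not healthy after {budget_s} s")


# Usage (placeholder host as elsewhere in this post):
#   wait_until_healthy("http://<head-node>:8000/health")
```

The same probe is useful after a crash like the one in §3, where the container stays Up but /health refuses connections.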

Reproducibility — bench the cluster

# On the control node (Python 3.14; needs python3-requests from apt
# because pip/venv are not set up here).
cd <control-workspace>/blog/qwen-3.5-397b
/usr/bin/python3 ../resources/bench.py > results.json 2> bench.log

bench.py builds the prompts via the running engine's /tokenize endpoint so the input token count is exact, fires N concurrent streams synchronized on a threading.Barrier, parses streaming delta.content and delta.reasoning plus the usage block from the trailing chunk, and reports both per-request and wall-window aggregate throughput.
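
The two mechanisms described above can be sketched as follows. This is an illustration, not the contents of bench.py: `parse_stream` and `fire_concurrently` are hypothetical names, and the chunk layout (server-sent `data:` lines, a final chunk carrying `usage`, a `[DONE]` sentinel) is assumed from the OpenAI-compatible streaming format.

```python
import json
import threading


def parse_stream(lines):
    """Collect content, reasoning text, and the usage block from SSE lines."""
    content, reasoning, usage = [], [], None
    for raw in lines:
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        if not line.startswith("data:"):
            continue  # skip blank keep-alive lines
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break
        chunk = json.loads(data)
        if chunk.get("usage"):
            usage = chunk["usage"]  # sent in the trailing chunk with include_usage
        for choice in chunk.get("choices", []):
            delta = choice.get("delta", {})
            content.append(delta.get("content") or "")
            reasoning.append(delta.get("reasoning") or "")
    return "".join(content), "".join(reasoning), usage


def fire_concurrently(n, worker):
    """Run worker(i) in n threads that all start together after a barrier."""
    barrier = threading.Barrier(n)
    results = [None] * n

    def run(i):
        barrier.wait()  # block until all n threads are ready
        results[i] = worker(i)

    threads = [threading.Thread(target=run, args=(i,)) for i in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

In a real run, `worker` would open the streaming request and feed `response.iter_lines()` into `parse_stream`, recording timestamps for the first token and the final chunk.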

The shared harness lives at ../resources/bench.py (same file used for the Nemotron-3-Ultra post).

See ../resources/INFRA.md for full cluster + bench harness details shared with the Nemotron-3-Ultra blog post, and <control-workspace>/recipes/qwen-3.5-397b-fp8-mtp.md for the full attempt history and pitfall catalog.


Files in this folder

Shared scripts/mods/recipes used by this post live in ../resources/.