Qwen3.5-397B-A17B-FP8 with MTP on 4x ASUS GX10 — bench

Date: 2026-06-13
Hardware: 4x ASUS Ascent GX10 (NVIDIA GB10 Blackwell, 128 GiB unified memory each, ConnectX-7 200 GbE fabric), TP=4 + Ray
Container image: vllm-node-mimo:latest, built from Dockerfile.mimo-runtime on top of vllm-node-tf5 (the exact build steps and pinned versions are in the Docker image reproducibility section below)
vLLM: 0.22.1rc1.dev124+gace95c9cf.d20260603.cu132 (local wheel)
Model: Qwen/Qwen3.5-397B-A17B-FP8 (HF), local path /root/.cache/huggingface/Qwen/Qwen3.5-397B-A17B-FP8, ~379 GiB per node
MTP: qwen3_next_mtp, num_speculative_tokens=2 (single-head MTP; see §3 for why more breaks)
KV cache: fp8; max_model_len=262144; max_num_seqs=16; max_num_batched_tokens=8192
GPU memory budget: --gpu-memory-utilization 0.90 (the proven MTP ceiling; 0.92 has crashed twice)
API: http://<head-node>:8000
Bench client: ../resources/bench.py, streaming via requests + iter_lines() with stream_options={"include_usage": True}

Field notes

Qwen3.5-397B-A17B-FP8 is still the default model I reach for on this cluster. Once MTP was actually working, the performance moved from "large local model experiment" into "use this as a daily tool" territory: single-stream decode is fast enough to feel responsive, and aggregate throughput scales well enough that OpenClaude, Pi, and OpenCode can sit on top of it without the serving stack feeling fragile.

The other reason it remains the baseline is operational. The same Qwen3.5 family gives a clean migration path down to 2x Spark when using NVFP4, so the work here is not a one-off 4-node stunt. The tradeoff is behavioral rather than infrastructural: the model can get itself into continuous loops, especially in agentic sessions where tool results keep feeding back into the context. That is the main reason I keep comparing newer fine-tunes against it instead of just declaring the search over.

For the full cluster and image setup details shared with the Nemotron-3-Ultra post, see ../resources/INFRA.md.

Companion files: launch script, bench harness, and wheel provenance are checked into ../resources/.

Sampling for both tests: temperature=0.0. Prompts are tokenized via vLLM's /tokenize endpoint so the input token count is exact.
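A minimal sketch of how a prompt can be sized against the server's own tokenizer. It assumes that a POST to /tokenize with a JSON body containing model and prompt returns an object with a count field. The helper names (count_via_tokenize, build_filler) are hypothetical, and this is not the code in bench.py.

```python
import json
import urllib.request
from typing import Callable, Sequence


def count_via_tokenize(base_url: str, model: str, text: str) -> int:
    """Ask a running vLLM server how many tokens `text` occupies.

    Assumption: POST {base_url}/tokenize with {"model", "prompt"} returns
    a JSON object that contains a "count" field.
    """
    body = json.dumps({"model": model, "prompt": text}).encode("utf-8")
    req = urllib.request.Request(
        f"{base_url}/tokenize",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=60) as resp:
        return int(json.load(resp)["count"])


def build_filler(sentences: Sequence[str], target_tokens: int,
                 count_tokens: Callable[[str], int]) -> str:
    """Append sentences round-robin; stop before the target would be exceeded."""
    text = ""
    i = 0
    while True:
        candidate = text + sentences[i % len(sentences)] + " "
        if count_tokens(candidate) > target_tokens:
            return text
        text = candidate
        i += 1
```

Passing the counter in as a callable keeps the sizing loop testable offline. This version calls the counter once per appended sentence, so a real harness would probably grow the text in larger steps before trimming.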

Benchmarks

1. Concurrency sweep — 10k in / 1024 out

Each request: ~9 818 input tokens (built from varied English filler) + an instruction to continue the narrative; max_tokens=1024. All concurrent requests fire simultaneously (synchronized via a threading.Barrier). Per‑request and aggregate throughput are reported.
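The barrier-synchronized start described above can be sketched as follows. This is an illustration under stated assumptions, not the bench.py implementation: fire_together and the work callable are hypothetical names, and the real harness would issue a streaming HTTP request where this sketch calls work(i).

```python
import threading
import time


def fire_together(n: int, work) -> list:
    """Run work(i) in n threads, releasing all of them at the same instant."""
    barrier = threading.Barrier(n)
    results = [None] * n

    def runner(i: int) -> None:
        barrier.wait()               # block until all n threads have arrived
        start = time.perf_counter()  # per-request clock starts after release
        value = work(i)
        results[i] = (value, time.perf_counter() - start)

    threads = [threading.Thread(target=runner, args=(i,)) for i in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Starting each request's clock after barrier.wait() returns keeps thread start-up jitter out of the measured TTFT.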

Aggregate throughput (single‑run, fp8 KV, MTP active):

N	Wall (s)	Agg prefill (t/s)	Agg decode (t/s)	Median TTFT (s)	Median per‑req decode (t/s)
1	30.8	1 830	40.3	5.37	40.3
2	33.9	4 684	63.6	2.96	33.5
4	53.3	6 209	79.4	5.13	21.4
8	61.9	13 761	134.5	4.35	18.0
16	88.7	16 369	186.4	7.53	12.7

Per‑request totals (input always 9818 tokens, output always 1024 tokens):

N=1

req	TTFT (s)	duration (s)	prefill (t/s)	decode (t/s)
0	5.37	30.78	1 830	40.3

N=2

req	TTFT (s)	duration (s)	prefill (t/s)	decode (t/s)
0	1.72	33.21	5 693	32.5
1	4.19	33.88	2 342	34.5

N=4

req	TTFT (s)	duration (s)	prefill (t/s)	decode (t/s)
0	5.13	52.79	1 912	21.5
1	5.14	53.19	1 912	21.3
2	6.32	53.26	1 553	21.8
3	1.75	52.58	5 601	20.1

N=8

req	TTFT (s)	duration (s)	prefill (t/s)	decode (t/s)
0	4.35	61.86	2 258	17.8
1	4.35	58.85	2 258	18.8
2	4.35	60.88	2 255	18.1
3	4.35	61.62	2 257	17.9
4	5.70	61.71	1 721	18.3
5	1.01	59.84	9 732	17.4
6	5.70	59.16	1 722	19.1
7	4.35	61.50	2 255	17.9

N=16

req	TTFT (s)	duration (s)	prefill (t/s)	decode (t/s)
0	7.54	88.64	1 303	12.6
1	9.60	87.65	1 023	13.1
2	4.26	86.35	2 305	12.5
3	7.53	88.71	1 303	12.6
4	7.53	87.95	1 304	12.7
5	9.59	88.35	1 024	13.0
6	4.26	88.10	2 306	12.2
7	9.58	87.46	1 024	13.1
8	4.26	88.10	2 302	12.2
9	9.59	85.72	1 024	13.4
10	7.54	86.14	1 303	13.0
11	4.26	88.65	2 304	12.1
12	7.55	87.65	1 301	12.8
13	7.53	84.81	1 303	13.2
14	0.91	86.54	10 754	11.9
15	4.27	86.55	2 301	12.4
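The per-request columns above are consistent with two simple ratios: prefill rate as input tokens over TTFT, and decode rate as output tokens over the time remaining after the first token. The sketch below assumes that is how the columns were derived. The definition is inferred from the numbers, not taken from bench.py, and per_request_rates is a hypothetical name.

```python
def per_request_rates(input_tokens: int, output_tokens: int,
                      ttft_s: float, duration_s: float) -> tuple:
    """Return (prefill t/s, decode t/s) for one streamed request."""
    prefill_tps = input_tokens / ttft_s                 # prompt processed before first token
    decode_tps = output_tokens / (duration_s - ttft_s)  # tokens generated after first token
    return prefill_tps, decode_tps


# N=1 row: 9818 input tokens, 1024 output tokens, TTFT 5.37 s, duration 30.78 s
prefill, decode = per_request_rates(9818, 1024, 5.37, 30.78)
print(f"{prefill:.0f} {decode:.1f}")  # prints "1828 40.3"
```

The small gap to the table's 1 830 t/s is what you would expect if the table was computed from unrounded timings.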

Notes on the shape

Prefill is staircased because max_num_batched_tokens=8192. Each prefill needs ~9 818 input tokens, so vLLM admits prefills in micro-batches and TTFT comes in clusters (visible as 4 distinct TTFT bands at N=16: ~0.9, 4.3, 7.5, 9.6 s).
Per-request decode falls cleanly with N (40 → 33 → 21 → 18 → 13 t/s) as MoE expert dispatch and inter-node Ray collectives become the binding cost.
Aggregate decode scales sublinearly (40 → 64 → 79 → 134 → 186 t/s — ~4.6× from N=1 to N=16). With MTP enabled, the speculator's ~86% acceptance rate is the headroom that keeps per-stream decode usable even at N=16.
The aggregate prefill plateau around N=16 (~16 k t/s) reflects KV-block contention plus the 8192 batched-tokens limit; ramping --max-num-batched-tokens would shift this but cost KV memory.
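To make the sublinear scaling concrete, the aggregate decode column can be turned into a speedup over N=1 and a per-stream efficiency (speedup divided by N). This is plain arithmetic on the table values; the scaling helper is a hypothetical name.

```python
# Aggregate decode (t/s) from the sweep table, keyed by concurrency N.
agg_decode = {1: 40.3, 2: 63.6, 4: 79.4, 8: 134.5, 16: 186.4}


def scaling(agg: dict) -> dict:
    """Map N -> (speedup vs N=1, efficiency = speedup / N)."""
    base = agg[1]
    return {n: (tps / base, tps / base / n) for n, tps in agg.items()}


for n, (speedup, eff) in scaling(agg_decode).items():
    print(f"N={n:>2}  speedup={speedup:.2f}x  efficiency={eff:.0%}")
```

An efficiency well below 100% at N=16 is the same observation as the per-request decode rate falling from about 40 to about 13 t/s.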

2. Needle-in-a-haystack — 200k prefill

200 000-token filler (varied English sentence templates) with a single needle injected at ~50% depth:

IMPORTANT: The secret access code for Project Aurora is BANANA-MOOSE-7421. Remember this exactly.

Question appended at the end: "Based on the text above, what is the secret access code for Project Aurora? Reply with just the code itself, nothing else."

Metric	Value
Input tokens	199 402
TTFT (≈ prefill)	125.9 s
Prefill speed	1 584 tokens/s
Output tokens	669 (mostly the reasoning trace)
Decode speed	39.5 t/s
Total wall	142.8 s
Needle retrieved	✅ yes — exact match

Model's final answer (after <think> block):

BANANA-MOOSE-7421

Reasoning trace head (truncated to 400 chars):

Thinking Process:

1.  Analyze the Request:
    *   Input: A very long text containing numerous sentences about various
        characters (cat, team, librarian, John, chef, engineer, child, Mary,
        researcher, fox, gardener, Dr. Reed, pilot, Captain Ortiz, detective,
        astronaut) performing actions (walked, discovered, spoke, decided,
        etc.) involving various objects (printing mistake, photograph,
        door, l...

Prefill of 199 k tokens in 126 s (~1.6 k t/s) is in the same ballpark as the small-batch prefill rate from §1 (~1.8 k t/s). The request also fits comfortably inside the engine-reported KV pool at this configuration: 551 076 token-slots, or 2.10× concurrency at the full 262 k context, so a single 199 k request leaves room to spare for the 669-token completion.
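The capacity figures above can be checked with a few lines of arithmetic. The sketch assumes the reported concurrency is simply the KV pool size divided by max_model_len, which reproduces the logged 2.10× but is an inference rather than something read from the vLLM source.

```python
KV_POOL_TOKENS = 551_076   # "GPU KV cache size" reported at boot
MAX_MODEL_LEN = 262_144    # --max-model-len
NEEDLE_INPUT = 199_402     # input tokens of the needle request
NEEDLE_OUTPUT = 669        # output tokens of the needle request
TTFT_S = 125.9             # time to first token, roughly the prefill time

concurrency = KV_POOL_TOKENS / MAX_MODEL_LEN    # full-length requests that fit at once
request_tokens = NEEDLE_INPUT + NEEDLE_OUTPUT   # KV slots the needle request needs
prefill_tps = NEEDLE_INPUT / TTFT_S             # effective long-context prefill rate

print(f"concurrency at full context: {concurrency:.2f}x")
print(f"needle request tokens: {request_tokens} (limit {MAX_MODEL_LEN})")
print(f"prefill rate: {prefill_tps:.0f} t/s")
```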

Engine-reported config at boot

Pulled from docker logs qwen397-fp8-mtp-tp4 on the head GX10:

Effective speculative_config = SpeculativeConfig(method='mtp', num_spec_tokens=2). Note that the wheel deprecates qwen3_next_mtp in favor of mtp.
Drafter loaded on all 4 ranks: the line "Detected MTP model. Sharing target model embedding weights with the draft model." appears once per rank.
Available KV cache: 4.65–5.18 GiB per rank (tightest 4.65 GiB at TP1)
GPU KV cache size: 551 076 tokens
Maximum vLLM-reported concurrency at full 262 144-token request: 2.10×

A live SpecDecoding metrics line from the same log:

SpecDecoding metrics: Mean acceptance length: 2.72,
  Accepted: 31 tokens, Drafted: 36 tokens,
  Per-position acceptance rate: 0.889, 0.833,
  Avg Draft acceptance rate: 86.1%

The ~86% acceptance is what turns the 2-token speculator into the ~1.7× decode speedup visible vs the (older, non-MTP) baseline of ~23.5 t/s on this same cluster.
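The logged figures hang together arithmetically. The sketch below assumes each speculation round drafts exactly k tokens and that the mean acceptance length counts the one token the target model always emits. These definitions reproduce the log line but are inferred, and spec_decode_summary is a hypothetical helper.

```python
def spec_decode_summary(accepted: int, drafted: int, k: int) -> tuple:
    """Return (draft acceptance rate, mean acceptance length)."""
    num_drafts = drafted / k                     # speculation rounds of k draft tokens each
    acceptance_rate = accepted / drafted
    mean_accept_len = 1 + accepted / num_drafts  # +1 for the target model's own token
    return acceptance_rate, mean_accept_len


rate, length = spec_decode_summary(accepted=31, drafted=36, k=2)
print(f"{rate:.1%} {length:.2f}")                 # prints "86.1% 2.72"
print(f"speedup vs baseline: {40.3 / 23.5:.2f}x")  # prints "speedup vs baseline: 1.71x"
```

Under the same assumption, the mean acceptance length also equals 1 plus the sum of the per-position rates (1 + 0.889 + 0.833 ≈ 2.72).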

3. MTP=5 retest (2026-06-13, same cluster, same bench)

Relaunched the same recipe with --speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":5}' (raised from 2) to test whether more speculative tokens win on this model. vLLM warned at startup, repeating the warning from the MTP-2 run:

WARNING speculative.py:722 Enabling num_speculative_tokens > 1 will run
multiple times of forward on same MTP layer, which may result in lower
acceptance rate

Engine init confirmed: SpeculativeConfig(method='mtp', num_spec_tokens=5). The bench got partway through then crashed the engine:

N	Wall (s)	Agg prefill (t/s)	Agg decode (t/s)	vs MTP-2
1	34.6	1 066	40.3	tie
2	43.5	3 324	54.4	−14% vs 63.6
4	76.4	6 172	malformed ⚠️	request returned 0 completion tokens before scheduler crash
8	—	—	HTTP 500	engine dead
16	—	—	—	not attempted

At 2026-06-13 16:13:25, mid-N=8 step, all 4 ranks crashed with:

torch.AcceleratorError: CUDA error: an illegal memory access was encountered
[rank0]:[E613 16:13:25 ProcessGroupNCCL.cpp:2119] [PG ID 2 PG GUID 3 Rank 0]
  Process group watchdog thread terminated with exception:
  CUDA error: an illegal memory access was encountered
... Worker exit type: SYSTEM_ERROR ...
ERROR 06-13 16:14:15 [ray_executor_v2.py:464] RayWorkerProc rank=[0] died
  unexpectedly, shutting down executor.
ERROR async_llm.py:704 vllm.v1.engine.exceptions.EngineDeadError: EngineCore
  encountered an issue.

The container stayed Up (Ray PID1 persists per the known pitfall) but /health started refusing connections. Service was restored by relaunching the canonical MTP-2 script.
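Because the container keeps reporting Up after the engine dies, a liveness check has to probe the API rather than Docker. A minimal sketch, assuming /health answers 200 while the engine is alive; engine_is_healthy is a hypothetical helper and the base URL is whatever the head node exposes.

```python
import urllib.error
import urllib.request


def engine_is_healthy(base_url: str, timeout_s: float = 5.0) -> bool:
    """Return True only if GET {base_url}/health answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout_s) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Refused connections, timeouts, and HTTP errors all count as unhealthy.
        return False
```

A watchdog could call this on a timer and trigger the relaunch script after a few consecutive failures.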

Verdict

MTP-5 is worse than MTP-2 on this stack and not safe to run.

Even where MTP-5 returned valid completions, it was slower than MTP-2:
- N=2 aggregate decode 54.4 t/s (MTP-5) vs 63.6 t/s (MTP-2) — −14%.
- N=1 was a wash, which makes sense: at single-stream, the cost of the extra speculator forwards cancels out what the occasional extra accepted token gains.
It crashes the engine under concurrency. Illegal memory access is not a recoverable failure mode — it requires a full Ray cluster relaunch.
The per-position acceptance decay observed at MTP-2 (0.889 → 0.833, a 6 pp drop) suggests a sharply worse return for MTP-5: if the decay continues with the same speculator being reused, positions 3–5 would land somewhere in the 0.65–0.78 range, which is consistent with vLLM's warning.

The likely root cause is Qwen3.5-MoE-MTP's architecture: it has a single MTP head that vLLM re-runs num_spec_tokens times. Architectures with native multi-token MTP heads (DeepSeek-V3 / Nemotron-3-Ultra style) tolerate higher num_speculative_tokens — Qwen3.5 here does not.

Operator memory updated: project-qwen397-mtp5-crash.md — never propose MTP-5 (or higher) on this model with this wheel.

MTP-3 not yet tested

Worth a follow-up to confirm whether MTP-3 is a small win, parity, or already losing. The MTP-5 crash means the test has real risk attached — recommend running it during a window where a relaunch is acceptable.

Reproducibility — Docker image

The container is vllm-node-mimo:latest layered on top of vllm-node-tf5 (which is built from the base vllm-node image in github.com/eugr/spark-vllm-docker). Lineage:

nvidia/cuda:13.2.0-devel-ubuntu24.04
   │  + ccache, build tools, libibverbs (RDMA)
   │  + pip install torch==2.11.0+cu130, triton, nvshmem
   │
   ▼
vllm-node-tf5:latest           (older transformers, stable for non-MTP)
   │
   └── Dockerfile.mimo-runtime  (transformers≥5.0 + reinstall vllm/flashinfer wheels)
       │
       └── vllm-node-mimo:latest   ◀── this benchmark used this image

Required pinned versions (do NOT skip)

Package	Version	Source	Why
`torch`	2.11.0+cu130	`https://download.pytorch.org/whl/cu130`	The fresh vLLM wheel ships metadata pinning `torch==2.10.0` but the C++ ABI actually needs 2.11.0. If you let uv resolve naturally it lands `torch==2.10.0+cpu` and vllm dies at import with `ImportError: libtorch_cuda.so: cannot open shared object file`. Force-reinstall the cu130 wheel as a separate Dockerfile RUN layer AFTER the vllm install with `--no-deps --force-reinstall`.
`transformers`	`≥5.0.0`	pip	`qwen3_next_mtp` config classes live in transformers 5.x; tf5's base pin is 4.x — mimo overrides.
`vllm`	`0.22.1rc1.dev124+gace95c9cf.d20260603.cu132`	local wheel `wheels/vllm-*.whl`	qwen3_next_mtp speculator path.
`flashinfer-python`	`0.6.12`	local wheels `wheels/flashinfer_*.whl`	required for FP8 KV + the MTP attention path.

`Dockerfile.mimo-runtime` (exact contents)

FROM vllm-node-tf5:latest

ENV PIP_BREAK_SYSTEM_PACKAGES=1
ENV UV_SYSTEM_PYTHON=1
ENV UV_BREAK_SYSTEM_PACKAGES=1
ENV UV_LINK_MODE=copy

COPY wheels/*.whl /tmp/mimo-wheels/
RUN printf "%s\n" "transformers>=5.0.0" > /tmp/tf-override.txt \
    && uv pip install /tmp/mimo-wheels/*.whl --override /tmp/tf-override.txt \
    && rm -rf /tmp/mimo-wheels /tmp/tf-override.txt

# Fix torch CPU regression: the fresh vllm wheel ships metadata pinning
# torch==2.10.0 but the C++ ABI needs 2.11.0 from cu130. Force-reinstall.
RUN uv pip install --no-deps --force-reinstall \
    --index-url https://download.pytorch.org/whl/cu130 \
    torch==2.11.0

Build + verify + distribute (~10 min total on the head GX10)

# On head GX10 (<spark-user>@<head-node>)
cd ~/spark-vllm-docker

# 1. Drop wheels in wheels/ — vllm-0.22.1rc1.dev124+... and flashinfer_*-0.6.12
ls wheels/
#   vllm-0.22.1rc1.dev124+gace95c9cf.d20260603.cu132-cp312-cp312-linux_aarch64.whl
#   flashinfer_python-0.6.12-py3-none-any.whl
#   flashinfer_jit_cache-0.6.12-cp39-abi3-manylinux_2_28_aarch64.whl
#   flashinfer_cubin-0.6.12-py3-none-any.whl

# 2. Build (assumes vllm-node-tf5 already exists from build-and-copy.sh)
docker build -f Dockerfile.mimo-runtime -t vllm-node-mimo:latest .

# 3. VERIFY torch immediately BEFORE redistributing
docker run --rm --entrypoint python3 vllm-node-mimo:latest -c \
  "import torch; print(torch.__version__, torch.version.cuda)"
# MUST print: 2.11.0+cu130 13.0
# If it prints 2.10.0+cpu, the rebuild silently regressed — do NOT distribute.

# 4. Fan out to the 3 workers over the 200 GbE fabric
docker save vllm-node-mimo:latest > /tmp/vllm-node-mimo.tar
for n in <worker-fabric-ip-1> <worker-fabric-ip-2> <worker-fabric-ip-3>; do
  ssh "$n" 'docker load' < /tmp/vllm-node-mimo.tar &
done; wait
rm /tmp/vllm-node-mimo.tar
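The torch check in step 3 can also be scripted as a gate that runs before the fan-out. A minimal sketch, assuming the docker run line prints the torch version and CUDA version separated by whitespace; verify_output_ok is a hypothetical helper, not part of the repo.

```python
def verify_output_ok(line: str,
                     want_version: str = "2.11.0+cu130",
                     want_cuda: str = "13.0") -> bool:
    """Return True only if the printed torch build matches the pinned one."""
    parts = line.split()
    return len(parts) == 2 and parts[0] == want_version and parts[1] == want_cuda
```

A CPU-only regression prints something like "2.10.0+cpu None", which this check rejects.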

Pitfall: docker image prune -a will silently delete this image whenever no container is using it. Never run prune -a blind; either remove by tag (docker image rm vllm-node-mimo:latest, if you actually want it gone) or restrict the prune with --filter "until=24h", keeping in mind that the until filter only spares images created within the last 24 hours. We've reproduced this failure mode and now keep tar backups at <control-workspace>/docker-images/vllm-node-mimo.tar on the control node so a lost image can be restored without a wheels rebuild.

Reproducibility — launch the cluster

The launcher repo and recipe yaml are checked into ~/spark-vllm-docker/ on the head GX10 (copy at ../resources/scripts/relaunch-qwen397-fp8-mtp-qwen3next-tp4.sh). One-shot:

ssh -i ~/.ssh/<spark-key> <spark-user>@<head-node> \
  'cd ~/spark-vllm-docker && ./relaunch-qwen397-fp8-mtp-qwen3next-tp4.sh'

The relaunch script bakes in the exact vllm serve flags used for this bench. The full expanded command (after recipe + env substitution):

docker run -d --name qwen397-fp8-mtp-tp4 \
  --runtime nvidia --network host --ipc host --shm-size 16g \
  -e HF_HUB_OFFLINE=1 -e TRANSFORMERS_OFFLINE=1 \
  -e PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
  -e VLLM_USE_DEEP_GEMM=0 -e VLLM_USE_FLASHINFER_MOE_FP16=1 \
  -e VLLM_USE_FLASHINFER_SAMPLER=0 -e OMP_NUM_THREADS=4 \
  -v <spark-model-root>:/root/.cache/huggingface \
  vllm-node-mimo:latest \
  vllm serve /root/.cache/huggingface/Qwen/Qwen3.5-397B-A17B-FP8 \
    --served-model-name Qwen3.5-397B-A17B-FP8 \
    --host 0.0.0.0 --port 8000 \
    --max-model-len 262144 \
    --gpu-memory-utilization 0.90 \
    --load-format safetensors \
    --enable-prefix-caching \
    --enable-auto-tool-choice \
    --tool-call-parser qwen3_coder \
    --reasoning-parser qwen3 \
    --max-num-batched-tokens 8192 \
    --max-num-seqs 16 \
    --trust-remote-code \
    -tp 4 --distributed-executor-backend ray \
    --mm-encoder-tp-mode data \
    --kv-cache-dtype fp8 \
    --compilation-config.cudagraph_mode none \
    --attention-backend flashinfer \
    --speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":2}'

Note: qwen3_next_mtp is deprecated on this wheel and transparently remapped to plain "mtp". Older builds rejected "mtp" as non-functional; the relationship inverted in dev124. Both names work today; once the alias is removed in a future wheel, switch to "mtp".
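Since the flag value is JSON embedded in a shell command, generating it programmatically avoids quoting mistakes when switching method names or token counts. A small sketch; speculative_flag is a hypothetical helper, and the keys are the two already used in the launch command above.

```python
import json
import shlex


def speculative_flag(method: str, num_tokens: int) -> str:
    """Build a shell-safe --speculative-config argument."""
    payload = json.dumps(
        {"method": method, "num_speculative_tokens": num_tokens},
        separators=(",", ":"),  # compact form, matching the launch command
    )
    return f"--speculative-config {shlex.quote(payload)}"


print(speculative_flag("mtp", 2))
# prints: --speculative-config '{"method":"mtp","num_speculative_tokens":2}'
```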

Verify the engine actually wired up MTP by grepping the container logs for Detected MTP model. Sharing target model embedding weights (one line per TP rank); that is the unambiguous signal.
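The same check can be automated against captured log text. A minimal sketch, assuming the marker string appears verbatim once per TP rank; mtp_wired_up is a hypothetical helper, and the log text would come from docker logs on the head node.

```python
MTP_MARKER = "Detected MTP model. Sharing target model embedding weights"


def mtp_wired_up(log_text: str, tp_size: int) -> bool:
    """Return True if the MTP marker line shows up at least once per TP rank."""
    hits = sum(1 for line in log_text.splitlines() if MTP_MARKER in line)
    return hits >= tp_size
```

For this deployment the call would be mtp_wired_up(log_text, tp_size=4).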

Boot time: ~10-14 min to /health=200 depending on whether weights are warm in page cache. Model loading dominates (~95 GiB per rank).

Reproducibility — bench the cluster

# On the control node (Python 3.14; needs python3-requests from apt
# because pip/venv are not set up here).
cd <control-workspace>/blog/qwen-3.5-397b
/usr/bin/python3 ../resources/bench.py > results.json 2> bench.log

bench.py builds the prompts via the running engine's /tokenize endpoint so the input token count is exact, fires N concurrent streams synchronized on a threading.Barrier, parses streaming delta.content and delta.reasoning plus the usage block from the trailing chunk, and reports both per-request and wall-window aggregate throughput.
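The stream-parsing step can be sketched as below. It assumes OpenAI-style server-sent events: lines prefixed with data:, a [DONE] sentinel, and the usage block arriving in a trailing chunk when include_usage is set. The field names delta.content and delta.reasoning follow the description above; parse_stream is a hypothetical name and not the function in bench.py.

```python
import json
from typing import Iterable, Optional


def parse_stream(lines: Iterable[bytes]) -> dict:
    """Collect content, reasoning, and usage from raw SSE lines."""
    content, reasoning = [], []
    usage: Optional[dict] = None
    for raw in lines:
        line = raw.decode("utf-8").strip()
        if not line.startswith("data:"):
            continue                    # skip blank keep-alive lines
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break
        chunk = json.loads(data)
        if chunk.get("usage"):          # trailing chunk carries the token counts
            usage = chunk["usage"]
        for choice in chunk.get("choices", []):
            delta = choice.get("delta", {})
            if delta.get("content"):
                content.append(delta["content"])
            if delta.get("reasoning"):
                reasoning.append(delta["reasoning"])
    return {
        "content": "".join(content),
        "reasoning": "".join(reasoning),
        "usage": usage,
    }
```

In a harness built on requests, the lines argument would be response.iter_lines(), with TTFT taken at the first chunk that carries content or reasoning text.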

The shared harness lives at ../resources/bench.py (same file used for the Nemotron-3-Ultra post).

See ../resources/INFRA.md for full cluster + bench harness details shared with the Nemotron-3-Ultra blog post, and <control-workspace>/recipes/qwen-3.5-397b-fp8-mtp.md for the full attempt history and pitfall catalog.

Files in this folder

README.md — this file
results.json — full raw output (every per-request timing)
logs/ — bench.log + MTP-5 retest artifacts (bench-mtp5.log, results-mtp5.json) + the historical wait_and_bench.py wrapper (hardcoded pre-restructure paths; kept as a record of the original run, not runnable as-is)

Shared scripts/mods/recipes used by this post live in ../resources/.