Nex-N2-Pro-fp8 on 4x ASUS GX10 — patch, bench, rebench
Date: 2026-06-17
Hardware: 4x ASUS Ascent GX10 (NVIDIA GB10 Blackwell, 128 GiB unified memory each, ConnectX-7 200 GbE fabric), TP=4 + Ray
Container image: vllm-node-mimo:latest (same image as the Qwen3.5-397B post — see Reproducibility — Docker image)
vLLM: 0.22.1rc1.dev124+gace95c9cf.d20260603.cu132 (local wheel, image id 75429f413d11)
Model: nex-agi/Nex-N2-Pro-fp8 (HF), local path /root/.cache/huggingface/nex-agi/Nex-N2-Pro-fp8, 372 GiB per node (FP8 block-quant of Qwen3.5-397B-A17B, 40 shards × ~9.3 GiB)
Mod patch (required): mods/nex-fp8-loader — see §2
API: http://<head-node>:8000
Bench client: logs/bench.py (MTP-on baseline) / logs/bench-mtp-off.py (MTP-off rebench), streaming via requests + iter_lines() with stream_options={"include_usage": True}.
Field notes
Nex-N2-Pro-fp8 felt pretty good once it was up, but I did not end up using it as heavily as Qwen397. The useful story here is less "this replaced the baseline" and more "a same-family fine-tune can be made to serve on the same Spark stack, but the checkpoint details matter." The loader patch was a narrow fix, the MTP acceptance issue was not, and disabling MTP turned the model from an interesting failed speculation run into a serviceable target-only deployment.
The practical result is a model that feels in-family with Qwen3.5-397B and Ornith, but without Qwen397's working MTP speedup. That makes it worth documenting as a recovery and comparison point: the tool stack can serve it, the numerics are coherent, and the performance becomes sane after the speculator is removed, but it did not give me enough extra behavioral upside to become the go-to model.
Nex-N2-Pro-fp8 is nex-agi's FP8 block-quantized release of the same Qwen3.5-397B-A17B agentic checkpoint the Qwen team shipped at BF16 + a separate Qwen-team FP8 line. The model card recommends the vendor's sglang fork (which is x86_64 + sm_90, no GB10 build available), so we put it on the same 4x GX10 vLLM stack we already run Qwen3.5-397B-A17B-FP8 on. This post is the full journey from "won't load" to 2.1× the throughput we first measured at, in three acts.
Sampling: temperature=0.0. Prompts tokenized via vLLM's /tokenize endpoint so input token count is exact.
Benchmarks
1. TL;DR — the perf story in one table
What we ended up serving at vs. what the first working bench produced.
| Metric | First working bench (MTP-2 from Qwen397 recipe) | After we disabled MTP | Δ |
|---|---|---|---|
| Single-stream decode (N=1) | 15.1 t/s | 19.6 t/s | +30 % |
| Aggregate decode @ N=2 | 22.9 t/s | 36.7 t/s | +60 % |
| Aggregate decode @ N=4 | 28.8 t/s | 64.6 t/s | +124 % |
| Aggregate decode @ N=8 | 55.7 t/s | 81.1 t/s | +46 % |
| Aggregate decode @ N=16 | 72.8 t/s | 114.0 t/s | +57 % |
| NIAH 200k prefill | 1 227 t/s | 1 519 t/s | +24 % |
| NIAH 200k decode | 15.5 t/s | 21.3 t/s | +37 % |
| Engine-reported KV cache pool | 530 028 tokens | 844 024 tokens | +59 % |
| Tightest-rank available KV cache memory | 4.47 GiB | 6.38 GiB | +43 % |
The recipe shipped on the head GX10 as relaunch-nex-n2-pro-fp8-tp4.sh — same as Qwen397's MTP-2 recipe but with --speculative-config removed.
Below is how we got there: what the vendor's recommended recipe does on Spark (nothing — it crashes on weight load), what the bare Qwen397-FP8 recipe does (also crashes, but on a different line), the patch that fixed the load, the first bench with the Qwen397-style flags, the 0%-MTP-acceptance finding that explained why the numbers were below the same-arch baseline, and the rebench that recovered most of the gap.
2. Act I — the loader crash and the patch
The Nex model card recommends the vendor sglang fork (nexagi/sglang:v0.5.12-579f84b). On Spark, that image will not run as-is — it is built for x86_64 + sm_90 (H100), and Spark is aarch64 + sm_121. The supported alternative is a from-source build of the fork at github.com/nex-agi/sglang, which has had no GB10 validation and none of our existing tuning carries over. We've already shipped Qwen3.5-397B-A17B-FP8 (same arch family) on the Spark vLLM stack, so the natural first attempt was the Qwen397 recipe with just the model path swapped.
That attempt died during weight load:
KeyError: 'layers.0.mlp.experts.w2_weight_scale'
File ".../vllm/model_executor/models/qwen3_5.py", line 395, in load_weights
param = params_dict[name_mapped]
Precursor warnings on the same boot:
WARNING qwen3_5.py:420 Parameter layers.0.linear_attn.in_proj_qkvz.weight_scale
not found in params_dict, skip loading
WARNING qwen3_5.py:420 Parameter layers.0.linear_attn.out_proj.weight_scale
not found in params_dict, skip loading
Full trace: logs/engine-fail-pre-patch.log.
2.1 Why the stock loader misses
Nex's FP8 release is block-quantized (weight_block_size=[128,128], activation_scheme=dynamic) and stores per-expert and per-linear-projection scales in the conventional companion-tensor shape:
layers.0.mlp.experts.0.gate_proj.weight (FP8)
layers.0.mlp.experts.0.gate_proj.weight_scale ◀── per-block scale
layers.0.mlp.experts.0.up_proj.weight (FP8)
layers.0.mlp.experts.0.up_proj.weight_scale ◀── per-block scale
layers.0.mlp.experts.0.down_proj.weight (FP8)
layers.0.mlp.experts.0.down_proj.weight_scale ◀── per-block scale
…
layers.0.linear_attn.in_proj_qkv.weight (FP8)
layers.0.linear_attn.in_proj_qkv.weight_scale ◀── per-block scale
layers.0.linear_attn.in_proj_z.weight (FP8)
layers.0.linear_attn.in_proj_z.weight_scale ◀── per-block scale
layers.0.linear_attn.out_proj.weight (FP8)
layers.0.linear_attn.out_proj.weight_scale ◀── per-block scale
vLLM's Qwen3_5Model.load_weights rewrites the per-expert unfused weights into the fused experts.w13_weight / experts.w2_weight slots and the unfused linear-attn projections into linear_attn.in_proj_qkvz (via the stacked_params_mapping substring rule in_proj_qkv → in_proj_qkvz). That substring replace runs on the .weight_scale companion tensors too, producing names like experts.w2_weight_scale and linear_attn.in_proj_qkvz.weight_scale.
But vLLM's Fp8MoEMethod (and the matching Fp8LinearMethod) for block-quant register their scale slot as .weight_scale_inv — the FP8 block-scale convention vLLM uses internally (a leftover from the inverse-scale layout DeepSeek's checkpoints use). The lookup params_dict["…experts.w2_weight_scale"] misses because the actual registered slot is …experts.w2_weight_scale_inv. Hence the KeyError.
The linear-attn case fires the warning instead of crashing because the fall-through branch in load_weights warns-and-skips on missing-from-params_dict — but those scales never got loaded, so even if the engine had survived the expert KeyError, the linear-attn FP8 dequant would have used uninitialized scales and silently produced garbage.
Qwen-team's own FP8 release of Qwen397 does not hit this because its per-expert checkpoint already uses the _scale_inv suffix and matches vLLM's slot convention. Nex's FP8 toolchain (the same stack that produced Nex-N2-mini) emits .weight_scale instead.
2.2 The patch
A 13-line generator wrap at the top of Qwen3_5Model.load_weights that renames .weight_scale → .weight_scale_inv on every incoming tensor before the existing substring-remap loops run:
def load_weights(self, weights: Iterable[tuple[str, torch.Tensor]]) -> set[str]:
class="tok-comment"># Nex-N2-Pro-fp8 alias (added by spark-vllm-docker mods/nex-fp8-loader):
class="tok-comment"># FP8 block-quant stores per-(expert|projection) block scales as
class="tok-comment"># `<x>.weight_scale`; vLLM Fp8 quant methods register the slot as
class="tok-comment"># `<x>.weight_scale_inv`. Normalize at ingestion so the rest of the
class="tok-comment"># loader (stacked_params_mapping + expert_params_mapping substring
class="tok-comment"># remaps) lands on the correct param key.
def _alias_fp8_scales(ws):
for n, w in ws:
if n.endswith(class="tok-string">".weight_scale"):
yield n + "_inv", w
else:
yield n, w
weights = _alias_fp8_scales(weights)
stacked_params_mapping = [
…
After this generator runs, the existing substring-replace logic in stacked_params_mapping (in_proj_qkv → in_proj_qkvz) and expert_params_mapping (experts.{i}.down_proj → experts.w2_weight) operates on names that already have the _inv suffix, and every rewritten name lands on the correct Fp8MoEMethod / Fp8LinearMethod slot.
Unified diff at ../resources/mods/nex-fp8-loader/qwen3_5-fp8-scale-alias.diff.
2.3 Pitfall — . vs _ in the suffix check (cost ~25 min)
First version of the alias generator used:
if n.endswith(class="tok-string">"_weight_scale"): class="tok-comment"># WRONG — underscore separator
This silently no-op'd on every name. The HF safetensors tensor names use dots as the module-path separator (linear_attn.in_proj_qkv.weight_scale, experts.0.down_proj.weight_scale), not underscores. The check has to be:
if n.endswith(class="tok-string">".weight_scale"): class="tok-comment"># right — dot separator
The symptom was confusing: engine still KeyError'd on experts.w2_weight_scale and the not found in params_dict warnings still showed the bare weight_scale suffix (no _inv) even though inspect.getsource(Qwen3_5Model.load_weights) confirmed the generator code was in the function body. That was the tell: the generator was running but the endswith check matched zero names, so it passed the iterable through unchanged. Lesson: print one real tensor name from the source shard before writing a suffix-match condition; never assume _ vs ..
2.4 Why a loader patch and not the vendor sglang fork
For Spark, the vLLM patch is bounded: one KeyError on one named slot, single function in qwen3_5.py. We have the patch infrastructure (--apply-mod + auto-SCP + patch --fuzz=5 + Python fallback), matching wheels, and an existing Qwen397 baseline to A/B against. Hours-worst-case vs days-or-never for sglang-on-Spark, which would have needed an aarch64 + sm_121 source build of an externally-maintained fork.
If the loader patch had hit a deeper FP8 math-layout divergence (not just naming), the sglang source-build path would have been next. It didn't — the numerics work, the smoke tests are coherent, and the bench numbers in §3/§4 are in family with Qwen397.
2.5 Why the patch is Nex-scoped, not baked into the image
The rename .weight_scale → .weight_scale_inv is correct for Nex's block-quant unfused checkpoint (verified by smoke + bench), but would silently corrupt a per-tensor-scale FP8 checkpoint that legitimately uses .weight_scale as the registered slot name. The patch is therefore wired into Nex only:
- Applied via
--apply-mod mods/nex-fp8-loaderon the Nex relaunch script —launch-cluster.shauto-SCPs the mod dir to every node before container start. - Not baked into
vllm-node-mimo:latest. relaunch-qwen397-fp8-mtp-qwen3next-tp4.shand the Nemotron-3-Ultra relaunch script are untouched.
3. Act II — the first working bench, and the 0%-accept finding
With the patch applied, the engine booted in ~15 min (40 × ~15 s/shard + 138 s profile/compile) and /health returned 200. We ran the same concurrency sweep + NIAH harness used in the Qwen397 and Nemotron-3-Ultra posts (raw results: logs/results.json).
3.1 First bench (MTP-2, copied from the Qwen397 recipe)
Each request: ~9 808 input tokens + an instruction to continue, max_tokens=1024. All concurrent requests fire simultaneously via a threading.Barrier.
| N | Wall (s) | Agg prefill (t/s) | Agg decode (t/s) | Median TTFT (s) | Median per-req decode (t/s) |
|---|---|---|---|---|---|
| 1 | 119.0 | 192 | 15.1 | 51.08 | 15.1 |
| 2 | 91.2 | 4 060 | 22.9 | 4.83 [^n2-ttft] | 11.8 |
| 4 | 144.6 | 4 774 | 28.8 | 7.02 | 7.4 |
| 8 | 148.2 | 8 882 | 55.7 | 7.52 | 7.3 |
| 16 | 226.1 | 9 588 | 72.8 | 13.81 | 4.8 |
[^n2-ttft]: Only two requests at N=2; this column reports the higher of the two TTFTs (per the same convention as the other N rows). The lower TTFT (1.82 s) is in the raw per-request data in logs/results.json.
These were lower than expected for the same architecture as Qwen397 (which sits at ~40 t/s single-stream and ~186 t/s aggregate @ N=16 on the same image and the same flags). Something on the decode path was off.
3.2 The clue — SpecDecoding metrics showed 0% acceptance
The engine prints SpecDecoding metrics periodically when a speculator is configured. Through the entire bench, every sampling window looked like this:
INFO 06-17 15:18:49 [metrics.py:101] SpecDecoding metrics:
Mean acceptance length: 1.00,
Accepted throughput: 0.00 tokens/s,
Drafted throughput: 163.18 tokens/s,
Accepted: 0 tokens, Drafted: 1632 tokens,
Per-position acceptance rate: 0.000, 0.000,
Avg Draft acceptance rate: 0.0%
The drafter loaded (Detected MTP model. Sharing target model embedding weights with the draft model. × 4 ranks at boot, logs/engine-boot.log), it generated 2 draft tokens per step (Drafted throughput ≈ 2× decode_tps × N), but the target model rejected every single draft. By comparison, Qwen-team's Qwen3.5-397B-A17B-FP8 checkpoint runs at ~86% MTP acceptance on the same image, same flags, same architecture, same MTP single-head re-run pattern — that's the ~1.7× decode multiplier visible in the Qwen397 post.
3.3 Why MTP doesn't land on this checkpoint
Two plausible causes, with (1) the leading hypothesis:
- The MTP head was not recalibrated against the FP8 base. Vendor's FP8 release likely ships the BF16-trained MTP head verbatim (or quantized through a path the vLLM draft loader doesn't reproduce). The draft logits diverge from the FP8 target enough that every speculation fails verification.
- The MTP layer's own FP8 scales may hit the same
.weight_scalevs.weight_scale_invdivergence on the draft loader path — and unlike the main loader (which the patch in §2 fixes), the failure mode here might be "loads but uses uninitialized scales" rather than a crash, producing bit-noise that the target always rejects. The existingif name.startswith("mtp."): continueshort-circuit inQwen3_5Model.load_weightsmeans our generator wrap doesn't run onmtp.*names — the draft model has its own loader path.
Both are open. We didn't chase them further because the operational fix doesn't require knowing which one is true: any drafter contributing 0 accepted tokens is overhead.
3.4 What the 0%-accept finding cost on this bench
The intuition "MTP layer is 1 of 61 forward layers, so the overhead is bounded" was wrong by a factor of ~10×. Beyond the drafter's own forward pass, speculation forces the target to verify N+1 positions per step (where N is the number of draft tokens), even when every draft rejects. With 0% acceptance, every step pays: draft compute + target verification on extra positions + zero accepted tokens of speedup. That overhead, plus the fixed memory cost of the draft model's parameters (which take ~2 GiB of GPU memory per rank that would otherwise be available as KV cache), drops decode throughput dramatically.
We didn't know how dramatically until we measured. So we relaunched.
4. Act III — MTP-off rebench
Single change: dropped the --speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":2}' flag from the relaunch script. Everything else identical (KV-fp8, flashinfer attention, max-num-seqs=16, max-model-len=262144, the same --apply-mod mods/nex-fp8-loader patch). The new script ships as relaunch-nex-n2-pro-fp8-tp4.sh (vs the MTP-on relaunch-nex-n2-pro-fp8-mtp-tp4.sh). New container name nex-n2-pro-fp8-tp4.
Boot log: logs/engine-boot-mtp-off.log. Engine reported:
speculative_config=None(vs MTP-on'sSpeculativeConfig(method='mtp', num_spec_tokens=2))- No
Detected MTP modellines — the draft model is gone GPU KV cache size: 844,024 tokens(vs MTP-on's 530,028 — +59%)Available KV cache memory: 6.38 GiBon the tightest rank TP1 (vs MTP-on's 4.47 GiB — +43%)init engine took 137.54 s (compilation: 58.33 s)— unchanged vs MTP-on; disabling MTP saves no boot time
4.1 MTP-off concurrency bench
Raw results: logs/results-mtp-off.json.
| N | Wall (s) | Agg prefill (t/s) | Agg decode (t/s) | Median TTFT (s) | Median per-req decode (t/s) |
|---|---|---|---|---|---|
| 1 | 58.4 | 1 564 | 19.6 | 6.28 | 19.6 |
| 2 | 55.9 | 4 932 | 36.7 | 1.85 | 18.7 |
| 4 | 64.5 | 7 022 | 64.6 | 2.91 | 16.7 |
| 8 | 103.1 | 12 091 | 81.1 | 4.28 | 10.5 |
| 16 | 144.0 | 14 989 | 114.0 | 9.49 | 7.6 |
4.2 Side-by-side — what each N looks like with and without MTP
| N | MTP-on agg decode (t/s) | MTP-off agg decode (t/s) | Δ | MTP-on median per-req decode (t/s) | MTP-off median per-req decode (t/s) |
|---|---|---|---|---|---|
| 1 | 15.1 | 19.6 | +30 % | 15.1 | 19.6 |
| 2 | 22.9 | 36.7 | +60 % | 11.8 | 18.7 |
| 4 | 28.8 | 64.6 | +124 % | 7.4 | 16.7 |
| 8 | 55.7 | 81.1 | +46 % | 7.3 | 10.5 |
| 16 | 72.8 | 114.0 | +57 % | 4.8 | 7.6 |
The biggest wins are at N=2 and N=4 where the decode-vs-prefill mix is most decode-heavy and the per-step speculation overhead dominates. At N=8 / N=16 the engine is increasingly prefill-bound (look at the aggregate prefill column: it climbs from 7 k → 15 k t/s as N grows, vs MTP-on's 8.9 k → 9.6 k plateau) and the relative decode gain compresses, though the absolute aggregate decode still moves from 72.8 → 114 t/s.
The N=1 TTFT is also dramatically more sane on the MTP-off boot: 6.28 s vs MTP-on's anomalous 51.08 s. That difference is partly the cold-start staircase (MTP-on N=1 fired immediately after /health=200, MTP-off N=1 also did but with a tighter compile/profile footprint and no MTP warm-up) and partly the missing per-step MTP autotune work.
4.3 MTP-off NIAH — 200k prefill, same needle
Same NIAH harness as Qwen397 + Nemotron-3-Ultra posts: 200 000-token English filler with one needle injected at ~50% depth.
IMPORTANT: The secret access code for Project Aurora is BANANA-MOOSE-7421. Remember this exactly.
| Metric | MTP-on | MTP-off |
|---|---|---|
| Input tokens | 199 402 | 199 402 |
| TTFT (≈ prefill) | 162.5 s | 131.3 s |
| Prefill speed | 1 227 t/s | 1 519 t/s |
| Output tokens | 62 | 62 |
| Decode speed | 15.5 t/s | 21.3 t/s |
| Total wall | ~166 s | ~134 s |
| Needle retrieved | ✅ | ✅ |
Model's final answer (MTP-off run):
BANANA-MOOSE-7421
Both runs retrieve the needle exactly. The prefill speedup is roughly the share of prefill time that was previously also paying for the MTP head's parameter sharding work (the draft model shares the target's embedding + lm_head, but each TP rank still loads + processes its slice of the draft model on every step). At MTP-off, the per-rank KV pool grew from 4.47 GiB → 6.38 GiB which gave the engine a fatter batch budget for the 199 k prefill micro-batches.
4.4 Where we landed
The final serving config is the MTP-off recipe: N=1 single-stream 19.6 t/s decode, N=16 aggregate 114 t/s decode, 200 k NIAH at 1 519 t/s prefill + 21.3 t/s decode. That's still slower than Qwen397's same-hardware baseline (40 t/s / 186 t/s / 1 584 t/s) — Nex doesn't have a working drafter, so we're paying the full target-model decode cost on every step where Qwen397 amortizes ~86% of its decodes through MTP. The honest read: Nex-N2-Pro-fp8 on Spark with vLLM is throughput-limited by the broken drafter, not by anything we can patch from the launcher. The remaining gap closes only with a re-trained / re-quantized MTP head from the vendor (or the source build of sglang if it turns out the draft loader there handles the FP8 scales differently).
5. Engine-reported config — MTP-off, the version we serve
Pulled from docker logs nex-n2-pro-fp8-tp4 (full log: logs/engine-boot-mtp-off.log):
speculative_config=None,quantization=fp8,kv_cache_dtype=fp8,tensor_parallel_size=4Available KV cache memory: 6.38 GiBon TP1 (tightest), 7.39 GiB on TP3GPU KV cache size: 844,024 tokensinit engine (profile, create kv cache, warmup model) took 137.54 s (compilation: 58.33 s)- Full boot to
/health=200: ~15 min 40 s from containerupto first 200 response, with ~13 min of that in safetensors shard load (40 × ~9.3 GiB at ~18 s / shard).
The MTP-on baseline (for comparison) had:
speculative_config=SpeculativeConfig(method='mtp', num_spec_tokens=2)Available KV cache memory: 4.47 GiBtightestGPU KV cache size: 530,028 tokensinit engine … took 138.08 s (compilation: 64.94 s)— essentially the same; MTP doesn't move the boot needle
6. Reproducibility — Docker image
Same vllm-node-mimo:latest used for the Qwen3.5-397B post. Lineage:
nvidia/cuda:13.2.0-devel-ubuntu24.04
│ + ccache, build tools, libibverbs (RDMA)
│ + pip install torch==2.11.0+cu130, triton, nvshmem
│
▼
vllm-node-tf5:latest (older transformers, stable for non-MTP)
│
└── Dockerfile.mimo-runtime (transformers≥5.0 + reinstall vllm/flashinfer wheels)
│
└── vllm-node-mimo:latest ◀── this benchmark used this image
Required pinned versions (do NOT skip)
| Package | Version | Source | Why |
|---|---|---|---|
torch |
2.11.0+cu130 | https://download.pytorch.org/whl/cu130 |
Fresh vLLM wheel ships metadata pinning torch==2.10.0 but the C++ ABI actually needs 2.11.0. If you let uv resolve naturally it lands torch==2.10.0+cpu and vllm dies at import with ImportError: libtorch_cuda.so: cannot open shared object file. Force-reinstall the cu130 wheel as a separate Dockerfile RUN layer AFTER the vllm install with --no-deps --force-reinstall. |
transformers |
≥5.0.0 |
pip | qwen3_next_mtp config classes live in transformers 5.x; tf5's base pin is 4.x — mimo overrides. |
vllm |
0.22.1rc1.dev124+gace95c9cf.d20260603.cu132 |
local wheel wheels/vllm-*.whl |
the Qwen3_5MoeForConditionalGeneration model code our patch modifies. |
flashinfer-python |
0.6.12 |
local wheels wheels/flashinfer_*.whl |
required for FP8 KV + the long-context attention path. |
Full Dockerfile.mimo-runtime, build + verify + distribute steps, and the docker image prune -a pitfall are documented identically in the Qwen3.5-397B post — see Reproducibility — Docker image. Tar backup at <control-workspace>/docker-images/vllm-node-mimo.tar so a lost image is restorable without a wheels rebuild.
7. Reproducibility — apply the Nex mod
Copy ../resources/mods/nex-fp8-loader/ into ~/spark-vllm-docker/mods/ on the head GX10. Two files:
run.sh— appliesqwen3_5-fp8-scale-alias.diffviapatch -p1 --fuzz=5, with a Python sentinel-anchored fallback that inserts the same generator wrap by string substitution if the diff drifts vs a wheel update. Both paths converge on the same final source.qwen3_5-fp8-scale-alias.diff— the 13-line unified diff.
Both files in this post are exact copies of what's on the head GX10; the --apply-mod machinery does not require the mod be checked into the image, only on the launcher node's filesystem.
8. Reproducibility — launch the cluster (MTP-off, the recommended recipe)
Relaunch wrapper at ../resources/scripts/relaunch-nex-n2-pro-fp8-tp4.sh (exact copy of the head GX10's ~/spark-vllm-docker/relaunch-nex-n2-pro-fp8-tp4.sh). One-shot:
ssh -i ~/.ssh/<spark-key> <spark-user>@<head-node> \
'cd ~/spark-vllm-docker && ./relaunch-nex-n2-pro-fp8-tp4.sh'
The MTP-on baseline recipe — the one §3's numbers came from, kept on the head GX10 so the comparison can be re-run on demand — is at ../resources/scripts/relaunch-nex-n2-pro-fp8-mtp-tp4.sh.
The relaunch script:
- Stops any prior
nex-n2-pro-fp8-tp4container plus any leftovervllm serve/ray start/RayWorkerProc/boot-launch-tp.shprocesses on every node (PID-walk, SIGTERM then SIGKILL), thendrop_cachesfor a clean weight reload. - Verifies every node is clean (same PID walk, asserts no
vllm/ray/RayWorkerProc/boot script survivors). - Calls
launch-cluster.sh … --apply-mod mods/nex-fp8-loader …which auto-SCPs the mod to every Spark, applies the patch inside each container, then runsvllm serve.
The full expanded vllm serve command for the MTP-off recipe (after recipe + env substitution; the only difference vs the MTP-on script is the missing --speculative-config line and a different container name):
docker run -d --name nex-n2-pro-fp8-tp4 \
--runtime nvidia --network host --ipc host --shm-size 16g \
-e HF_HUB_OFFLINE=1 -e TRANSFORMERS_OFFLINE=1 \
-e PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
-e VLLM_USE_DEEP_GEMM=0 -e VLLM_USE_FLASHINFER_MOE_FP16=1 \
-e VLLM_USE_FLASHINFER_SAMPLER=0 -e OMP_NUM_THREADS=4 \
-v <spark-model-root>:/root/.cache/huggingface \
-v <spark-launcher-dir>/mods/nex-fp8-loader:/mods/nex-fp8-loader \
vllm-node-mimo:latest \
bash -c 'cd /mods/nex-fp8-loader && bash run.sh && \
vllm serve /root/.cache/huggingface/nex-agi/Nex-N2-Pro-fp8 \
--served-model-name Nex-N2-Pro-fp8 \
--host 0.0.0.0 --port 8000 \
--max-model-len 262144 \
--gpu-memory-utilization 0.90 \
--load-format safetensors \
--enable-prefix-caching \
--enable-auto-tool-choice \
--tool-call-parser qwen3_coder \
--reasoning-parser qwen3 \
--max-num-batched-tokens 8192 \
--max-num-seqs 16 \
--trust-remote-code \
-tp 4 --distributed-executor-backend ray \
--mm-encoder-tp-mode data \
--kv-cache-dtype fp8 \
--compilation-config.cudagraph_mode none \
--attention-backend flashinfer'
(The bind-mount of /mods/nex-fp8-loader + the bash run.sh && prefix is what launch-cluster.sh --apply-mod produces on each node before invoking vllm serve — the launcher generates this wrapper.)
Verify the engine actually wired up KV after a relaunch by grepping the container logs for Available KV cache memory: (4 lines, one per rank) and Started server process. With MTP off, do not look for Detected MTP model — that line is the wrong signal here; absence confirms the speculator is off.
Boot time: ~15 min 40 s to /health=200 on a cold cache (40 shards × ~18 s/shard load + 138 s profile/compile). Engine-init compile is fixed; shard load dominates and is bottlenecked by per-rank disk read of ~93 GiB.
9. Reproducibility — bench the cluster
# On the control node (Python 3.14; needs python3-requests from apt
# because pip/venv are not set up here).
cd <control-workspace>/blog/nex-n2-pro-fp8
/usr/bin/python3 logs/bench-mtp-off.py > logs/results-mtp-off.json 2> logs/bench-mtp-off.log
logs/bench-mtp-off.py and logs/bench.py are byte-identical copies of the shared ../resources/bench.py with MODEL="Nex-N2-Pro-fp8". Same concurrency sweep (N ∈ {1, 2, 4, 8, 16}), same NIAH (~199 k filler + needle at 50% depth, same BANANA-MOOSE-7421 secret as the Qwen397 + Nemotron posts so the numbers compare apples-to-apples). The two filenames are kept distinct so the MTP-on and MTP-off raw outputs (results.json vs results-mtp-off.json) don't collide.
10. Prior art — where the launch flags came from
- Qwen3.5-397B-A17B-FP8 4x GX10 post (
../qwen-3.5-397b/README.md) — the recipe this is derived from. Nex is the same architecture (Qwen3_5MoeForConditionalGeneration, 60 hybrid linear/full layers, 512 routed experts top-10, MTP-1) and the only differences in the launch command are the model path, served name, container name, the--apply-mod mods/nex-fp8-loaderinjection, and (the entire point of this post) the missing--speculative-config. - Nex-N2-Pro model card (
huggingface.co/nex-agi/Nex-N2-Pro-fp8) — confirms--reasoning-parser qwen3 --tool-call-parser qwen3_coder(we use both) and--mamba-scheduler-strategy extra_buffer(sglang-only; vLLM handles linear-attention internally and has no equivalent flag). - Qwen3.5-397B-A17B model card (
huggingface.co/Qwen/Qwen3.5-397B-A17B) — for the underlying architecture. - vLLM
qwen3_5.pysource — directly inspected atvllm/model_executor/models/qwen3_5.pyin the running container; theQwen3_5Model.load_weightssubstring-remap loops +AutoWeightsLoaderrouting invllm/model_executor/models/utils.py::_load_moduleare the surface area the patch sits on.
11. Files in this folder
README.md— this filelogs/bench.py— bench harness for the MTP-on baseline (copy of../resources/bench.py,MODEL=Nex-N2-Pro-fp8)logs/bench-mtp-off.py— identical, kept distinct so results files don't collidelogs/results.json— full raw MTP-on bench output (§3)logs/results-mtp-off.json— full raw MTP-off bench output (§4)logs/bench.log— MTP-on bench progress stderrlogs/bench-mtp-off.log— MTP-off bench progress stderrlogs/engine-boot.log— fulldocker logs nex-n2-pro-fp8-mtp-tp4from the MTP-on boot (Act II)logs/engine-boot-mtp-off.log— fulldocker logs nex-n2-pro-fp8-tp4from the MTP-off boot (Act III)logs/engine-fail-pre-patch.log— the pre-patch boot showing theKeyError+ linear-attn warnings the patch fixes (Act I)
Shared scripts / mods / harness used by this post live in ../resources/.