Nex-N2-Pro-fp8 on 4x ASUS GX10 — patch, bench, rebench

Date: 2026-06-17
Hardware: 4x ASUS Ascent GX10 (NVIDIA GB10 Blackwell, 128 GiB unified memory each, ConnectX-7 200 GbE fabric), TP=4 + Ray
Container image: vllm-node-mimo:latest (same image as the Qwen3.5-397B post; see §6)
vLLM: 0.22.1rc1.dev124+gace95c9cf.d20260603.cu132 (local wheel, image id 75429f413d11)
Model: nex-agi/Nex-N2-Pro-fp8 (HF), local path /root/.cache/huggingface/nex-agi/Nex-N2-Pro-fp8, 372 GiB per node (FP8 block-quant of Qwen3.5-397B-A17B, 40 shards × ~9.3 GiB)
Mod patch (required): mods/nex-fp8-loader (see §2)
API: http://<head-node>:8000
Bench client: logs/bench.py (MTP-on baseline) / logs/bench-mtp-off.py (MTP-off rebench), streaming via requests + iter_lines() with stream_options={"include_usage": True}.

Field notes

Nex-N2-Pro-fp8 felt pretty good once it was up, but I did not end up using it as heavily as Qwen397. The useful story here is less "this replaced the baseline" and more "a same-family fine-tune can be made to serve on the same Spark stack, but the checkpoint details matter." The loader patch was a narrow fix, the MTP acceptance issue was not, and disabling MTP turned the model from an interesting failed speculation run into a serviceable target-only deployment.

The practical result is a model that feels in-family with Qwen3.5-397B and Ornith, but without Qwen397's working MTP speedup. That makes it worth documenting as a recovery and comparison point: the tool stack can serve it, the numerics are coherent, and the performance becomes sane after the speculator is removed, but it did not give me enough extra behavioral upside to become the go-to model.

Nex-N2-Pro-fp8 is nex-agi's FP8 block-quantized release of the same Qwen3.5-397B-A17B agentic checkpoint that the Qwen team ships in BF16 and as a separate Qwen-team FP8 line. The model card recommends the vendor's sglang fork, which is built for x86_64 + sm_90 with no GB10 build available, so we put it on the same 4x GX10 vLLM stack we already run Qwen3.5-397B-A17B-FP8 on. This post is the full journey, in three acts, from "won't load" to as much as 2.2× the aggregate decode throughput of the first working bench (at N=4; 1.3× single-stream).

Sampling: temperature=0.0. Prompts tokenized via vLLM's /tokenize endpoint so input token count is exact.
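
The per-request timings in this post can be derived from an OpenAI-style SSE stream roughly as follows. This is a minimal sketch, not the actual logs/bench.py: the name measure_stream, the injectable clock, and the choice to treat the first non-empty content or reasoning delta as the first token are assumptions. It consumes the kind of iterable that requests' Response.iter_lines() yields, so it can be exercised without a server.

```python
import json
import time


def measure_stream(lines, t_start, clock=time.perf_counter):
    """Derive TTFT and decode rate from an OpenAI-style SSE stream.

    `lines` is any iterable of SSE lines (bytes or str), such as the result
    of requests' Response.iter_lines(). `t_start` is the clock reading taken
    just before the request was sent.
    """
    ttft = None
    usage = None
    for raw in lines:
        if not raw:
            continue  # blank lines separate SSE events
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        if not line.startswith("data:"):
            continue
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break
        chunk = json.loads(data)
        choices = chunk.get("choices") or []
        if ttft is None and choices:
            delta = choices[0].get("delta") or {}
            # Assumption: the first non-empty content or reasoning delta
            # marks the first generated token.
            if any(delta.get(k) for k in ("content", "reasoning_content", "reasoning")):
                ttft = clock() - t_start
        if chunk.get("usage"):
            # With stream_options={"include_usage": True} the final chunk
            # carries the prompt and completion token counts.
            usage = chunk["usage"]
    wall = clock() - t_start
    completion = usage["completion_tokens"] if usage else None
    decode_tps = None
    if ttft is not None and completion and wall > ttft:
        # Assumption: decode rate = output tokens / time after first token.
        decode_tps = completion / (wall - ttft)
    return {
        "ttft_s": ttft,
        "wall_s": wall,
        "prompt_tokens": usage["prompt_tokens"] if usage else None,
        "completion_tokens": completion,
        "decode_tps": decode_tps,
    }
```

In a live run the lines would come from requests.post(url, json=payload, stream=True).iter_lines(), with t_start captured immediately before the POST.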

Benchmarks

1. TL;DR — the perf story in one table

What we ended up serving vs. what the first working bench produced.

Metric	First working bench (MTP-2 from Qwen397 recipe)	After we disabled MTP	Δ
Single-stream decode (N=1)	15.1 t/s	19.6 t/s	+30 %
Aggregate decode @ N=2	22.9 t/s	36.7 t/s	+60 %
Aggregate decode @ N=4	28.8 t/s	64.6 t/s	+124 %
Aggregate decode @ N=8	55.7 t/s	81.1 t/s	+46 %
Aggregate decode @ N=16	72.8 t/s	114.0 t/s	+57 %
NIAH 200k prefill	1 227 t/s	1 519 t/s	+24 %
NIAH 200k decode	15.5 t/s	21.3 t/s	+37 %
Engine-reported KV cache pool	530 028 tokens	844 024 tokens	+59 %
Tightest-rank available KV cache memory	4.47 GiB	6.38 GiB	+43 %
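
The Δ column can be re-derived from the two measured columns. The short sketch below does only that arithmetic; pct_delta and ROWS are illustrative names, and the values are copied from the table.

```python
def pct_delta(before, after):
    """Relative change from `before` to `after`, rounded to a whole percent."""
    return round((after / before - 1.0) * 100)


# (metric, MTP-on value, MTP-off value) copied from the table above.
ROWS = [
    ("decode N=1 (t/s)", 15.1, 19.6),
    ("decode N=2 (t/s)", 22.9, 36.7),
    ("decode N=4 (t/s)", 28.8, 64.6),
    ("decode N=8 (t/s)", 55.7, 81.1),
    ("decode N=16 (t/s)", 72.8, 114.0),
    ("NIAH prefill (t/s)", 1227, 1519),
    ("NIAH decode (t/s)", 15.5, 21.3),
    ("KV pool (tokens)", 530028, 844024),
    ("tightest-rank KV (GiB)", 4.47, 6.38),
]

for name, on, off in ROWS:
    print(f"{name:24s} {off / on:4.2f}x  {pct_delta(on, off):+d} %")
```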

The recipe shipped on the head GX10 as relaunch-nex-n2-pro-fp8-tp4.sh — same as Qwen397's MTP-2 recipe but with --speculative-config removed.

Below is how we got there: what the vendor's recommended recipe does on Spark (nothing — it crashes on weight load), what the bare Qwen397-FP8 recipe does (also crashes, but on a different line), the patch that fixed the load, the first bench with the Qwen397-style flags, the 0%-MTP-acceptance finding that explained why the numbers were below the same-arch baseline, and the rebench that recovered most of the gap.

2. Act I — the loader crash and the patch

The Nex model card recommends the vendor sglang fork (nexagi/sglang:v0.5.12-579f84b). On Spark, that image will not run as-is: it is built for x86_64 + sm_90 (H100), and Spark is aarch64 + sm_121. The supported alternative is a from-source build of the fork at github.com/nex-agi/sglang, which has had no GB10 validation and would inherit none of our existing tuning. We've already shipped Qwen3.5-397B-A17B-FP8 (same arch family) on the Spark vLLM stack, so the natural first attempt was the Qwen397 recipe with just the model path swapped.

That attempt died during weight load:

KeyError: 'layers.0.mlp.experts.w2_weight_scale'
  File ".../vllm/model_executor/models/qwen3_5.py", line 395, in load_weights
    param = params_dict[name_mapped]

Precursor warnings on the same boot:

WARNING qwen3_5.py:420 Parameter layers.0.linear_attn.in_proj_qkvz.weight_scale
   not found in params_dict, skip loading
WARNING qwen3_5.py:420 Parameter layers.0.linear_attn.out_proj.weight_scale
   not found in params_dict, skip loading

Full trace: logs/engine-fail-pre-patch.log.

2.1 Why the stock loader misses

Nex's FP8 release is block-quantized (weight_block_size=[128,128], activation_scheme=dynamic) and stores per-expert and per-linear-projection scales in the conventional companion-tensor shape:

layers.0.mlp.experts.0.gate_proj.weight        (FP8)
layers.0.mlp.experts.0.gate_proj.weight_scale  ◀── per-block scale
layers.0.mlp.experts.0.up_proj.weight          (FP8)
layers.0.mlp.experts.0.up_proj.weight_scale    ◀── per-block scale
layers.0.mlp.experts.0.down_proj.weight        (FP8)
layers.0.mlp.experts.0.down_proj.weight_scale  ◀── per-block scale
…
layers.0.linear_attn.in_proj_qkv.weight        (FP8)
layers.0.linear_attn.in_proj_qkv.weight_scale  ◀── per-block scale
layers.0.linear_attn.in_proj_z.weight          (FP8)
layers.0.linear_attn.in_proj_z.weight_scale    ◀── per-block scale
layers.0.linear_attn.out_proj.weight           (FP8)
layers.0.linear_attn.out_proj.weight_scale     ◀── per-block scale

vLLM's Qwen3_5Model.load_weights rewrites the per-expert unfused weights into the fused experts.w13_weight / experts.w2_weight slots and the unfused linear-attn projections into linear_attn.in_proj_qkvz (via the stacked_params_mapping substring rule in_proj_qkv → in_proj_qkvz). That substring replace runs on the .weight_scale companion tensors too, producing names like experts.w2_weight_scale and linear_attn.in_proj_qkvz.weight_scale.

But vLLM's Fp8MoEMethod (and the matching Fp8LinearMethod) for block-quant register their scale slot as .weight_scale_inv — the FP8 block-scale convention vLLM uses internally (a leftover from the inverse-scale layout DeepSeek's checkpoints use). The lookup params_dict["…experts.w2_weight_scale"] misses because the actual registered slot is …experts.w2_weight_scale_inv. Hence the KeyError.

The linear-attn case fires the warning instead of crashing because the fall-through branch in load_weights warns-and-skips on missing-from-params_dict — but those scales never got loaded, so even if the engine had survived the expert KeyError, the linear-attn FP8 dequant would have used uninitialized scales and silently produced garbage.

Qwen-team's own FP8 release of Qwen397 does not hit this because its per-expert checkpoint already uses the _scale_inv suffix and matches vLLM's slot convention. Nex's FP8 toolchain (the same stack that produced Nex-N2-mini) emits .weight_scale instead.
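
The miss can be reproduced in miniature. The sketch below is a toy model of the remap, not vLLM's loader: remap_name, the two mapping pairs, and the two-entry params_dict are simplified stand-ins chosen to mirror the names in the traceback.

```python
# Toy stand-ins for the substring rules described above (not vLLM's real tables).
EXPERT_MAPPING = [("experts.0.down_proj.", "experts.w2_")]
STACKED_MAPPING = [("in_proj_qkv", "in_proj_qkvz")]

# Slots as the block-quant FP8 methods register them: scales end in _inv.
params_dict = {
    "layers.0.mlp.experts.w2_weight": "fp8 weight slot",
    "layers.0.mlp.experts.w2_weight_scale_inv": "block-scale slot",
}


def remap_name(name):
    """Apply the first matching substring replacement to a checkpoint name."""
    for old, new in EXPERT_MAPPING + STACKED_MAPPING:
        if old in name:
            return name.replace(old, new)
    return name


ckpt_name = "layers.0.mlp.experts.0.down_proj.weight_scale"
mapped = remap_name(ckpt_name)
print(mapped)                          # layers.0.mlp.experts.w2_weight_scale
print(mapped in params_dict)           # False: the lookup that raises KeyError
print(mapped + "_inv" in params_dict)  # True: the slot that is actually registered
```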

2.2 The patch

A 13-line generator wrap at the top of Qwen3_5Model.load_weights that renames .weight_scale → .weight_scale_inv on every incoming tensor before the existing substring-remap loops run:

def load_weights(self, weights: Iterable[tuple[str, torch.Tensor]]) -> set[str]:
    # Nex-N2-Pro-fp8 alias (added by spark-vllm-docker mods/nex-fp8-loader):
    # FP8 block-quant stores per-(expert|projection) block scales as
    # `<x>.weight_scale`; vLLM Fp8 quant methods register the slot as
    # `<x>.weight_scale_inv`. Normalize at ingestion so the rest of the
    # loader (stacked_params_mapping + expert_params_mapping substring
    # remaps) lands on the correct param key.
    def _alias_fp8_scales(ws):
        for n, w in ws:
            if n.endswith(".weight_scale"):
                yield n + "_inv", w
            else:
                yield n, w
    weights = _alias_fp8_scales(weights)
    stacked_params_mapping = [
        …

After this generator runs, the existing substring-replace logic in stacked_params_mapping (in_proj_qkv → in_proj_qkvz) and expert_params_mapping (experts.{i}.down_proj → experts.w2_weight) operates on names that already have the _inv suffix, and every rewritten name lands on the correct Fp8MoEMethod / Fp8LinearMethod slot.

Unified diff at ../resources/mods/nex-fp8-loader/qwen3_5-fp8-scale-alias.diff.

2.3 Pitfall — `.` vs `_` in the suffix check (cost ~25 min)

First version of the alias generator used:

if n.endswith("_weight_scale"):     # WRONG: underscore separator

This silently no-op'd on every name. The HF safetensors tensor names use dots as the module-path separator (linear_attn.in_proj_qkv.weight_scale, experts.0.down_proj.weight_scale), not underscores. The check has to be:

if n.endswith(".weight_scale"):     # right: dot separator

The symptom was confusing: the engine still raised KeyError on experts.w2_weight_scale, and the "not found in params_dict" warnings still showed the bare weight_scale suffix (no _inv), even though inspect.getsource(Qwen3_5Model.load_weights) confirmed the generator code was in the function body. That was the tell: the generator was running, but the endswith check matched zero names, so it passed the iterable through unchanged. Lesson: print one real tensor name from the source shard before writing a suffix-match condition; never assume `_` where the name may use `.`.
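
One way to act on that lesson is to count suffix matches against real tensor names before trusting the condition. A minimal sketch: count_suffix_matches is a hypothetical helper, and the sample names are the ones quoted in §2.1.

```python
def count_suffix_matches(names, suffix):
    """Return how many tensor names end with `suffix`, plus one example hit."""
    hits = [n for n in names if n.endswith(suffix)]
    return len(hits), (hits[0] if hits else None)


names = [
    "layers.0.mlp.experts.0.down_proj.weight",
    "layers.0.mlp.experts.0.down_proj.weight_scale",
    "layers.0.linear_attn.in_proj_qkv.weight_scale",
]

# Underscore separator: zero hits, so the alias generator is a silent no-op.
print(count_suffix_matches(names, "_weight_scale"))
# Dot separator: both scale tensors match.
print(count_suffix_matches(names, ".weight_scale"))
```

Against a real shard, the names could come from safetensors' safe_open(path, framework="pt").keys(), where path is any one of the shard files in the local model directory.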

2.4 Why a loader patch and not the vendor `sglang` fork

For Spark, the vLLM patch is bounded: one KeyError on one named slot, single function in qwen3_5.py. We have the patch infrastructure (--apply-mod + auto-SCP + patch --fuzz=5 + Python fallback), matching wheels, and an existing Qwen397 baseline to A/B against. Hours-worst-case vs days-or-never for sglang-on-Spark, which would have needed an aarch64 + sm_121 source build of an externally-maintained fork.

If the loader patch had hit a deeper FP8 math-layout divergence (not just naming), the sglang source-build path would have been next. It didn't — the numerics work, the smoke tests are coherent, and the bench numbers in §3/§4 are in family with Qwen397.

2.5 Why the patch is Nex-scoped, not baked into the image

The rename .weight_scale → .weight_scale_inv is correct for Nex's block-quant unfused checkpoint (verified by smoke + bench), but would silently corrupt a per-tensor-scale FP8 checkpoint that legitimately uses .weight_scale as the registered slot name. The patch is therefore wired into Nex only:

Applied via --apply-mod mods/nex-fp8-loader on the Nex relaunch script — launch-cluster.sh auto-SCPs the mod dir to every node before container start.
Not baked into vllm-node-mimo:latest.
relaunch-qwen397-fp8-mtp-qwen3next-tp4.sh and the Nemotron-3-Ultra relaunch script are untouched.

3. Act II — the first working bench, and the 0%-accept finding

With the patch applied, the engine booted in ~15 min (40 × ~15 s/shard + 138 s profile/compile) and /health returned 200. We ran the same concurrency sweep + NIAH harness used in the Qwen397 and Nemotron-3-Ultra posts (raw results: logs/results.json).

3.1 First bench (MTP-2, copied from the Qwen397 recipe)

Each request: ~9 808 input tokens + an instruction to continue, max_tokens=1024. All concurrent requests fire simultaneously via a threading.Barrier.
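
The simultaneous start can be sketched with the standard library alone. This illustrates the Barrier pattern rather than reproducing logs/bench.py; fire_concurrently and send_request are hypothetical names, and send_request stands in for whatever issues one streaming request.

```python
import threading
import time


def fire_concurrently(n, send_request):
    """Run send_request(i) on n threads that are released together."""
    barrier = threading.Barrier(n)
    results = [None] * n

    def worker(i):
        barrier.wait()  # block until all n threads have reached this point
        start = time.perf_counter()
        payload = send_request(i)
        results[i] = (start, time.perf_counter() - start, payload)

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results  # one (start_time, elapsed_s, payload) tuple per request


# Stand-in request function so the sketch runs without a server.
out = fire_concurrently(4, lambda i: i * 2)
print([payload for _, _, payload in out])
```

Each thread writes only its own slot in results, so no lock is needed around the list.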

N	Wall (s)	Agg prefill (t/s)	Agg decode (t/s)	Median TTFT (s)	Median per-req decode (t/s)
1	119.0	192	15.1	51.08	15.1
2	91.2	4 060	22.9	4.83 [^n2-ttft]	11.8
4	144.6	4 774	28.8	7.02	7.4
8	148.2	8 882	55.7	7.52	7.3
16	226.1	9 588	72.8	13.81	4.8

[^n2-ttft]: Only two requests at N=2; this column reports the higher of the two TTFTs (per the same convention as the other N rows). The lower TTFT (1.82 s) is in the raw per-request data in logs/results.json.

These were lower than expected for the same architecture as Qwen397 (which sits at ~40 t/s single-stream and ~186 t/s aggregate @ N=16 on the same image and the same flags). Something on the decode path was off.

3.2 The clue — `SpecDecoding metrics` showed 0% acceptance

The engine prints SpecDecoding metrics periodically when a speculator is configured. Through the entire bench, every sampling window looked like this:

INFO 06-17 15:18:49 [metrics.py:101] SpecDecoding metrics:
  Mean acceptance length: 1.00,
  Accepted throughput: 0.00 tokens/s,
  Drafted throughput: 163.18 tokens/s,
  Accepted: 0 tokens, Drafted: 1632 tokens,
  Per-position acceptance rate: 0.000, 0.000,
  Avg Draft acceptance rate: 0.0%

The drafter loaded (Detected MTP model. Sharing target model embedding weights with the draft model. × 4 ranks at boot, logs/engine-boot.log), it generated 2 draft tokens per step (Drafted throughput ≈ 2× decode_tps × N), but the target model rejected every single draft. By comparison, Qwen-team's Qwen3.5-397B-A17B-FP8 checkpoint runs at ~86% MTP acceptance on the same image, same flags, same architecture, same MTP single-head re-run pattern — that's the ~1.7× decode multiplier visible in the Qwen397 post.
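
A quick way to summarize those windows across a whole boot log is to sum the Accepted and Drafted counters. This is a sketch under assumptions: draft_acceptance is a hypothetical helper, the regex is written against the log excerpt quoted above, and the mean-acceptance-length formula assumes each step emits one target token plus however many drafts were accepted.

```python
import re

SPEC_RE = re.compile(r"Accepted:\s*(\d+)\s*tokens,\s*Drafted:\s*(\d+)\s*tokens")


def draft_acceptance(log_text, num_spec_tokens=2):
    """Aggregate acceptance over every SpecDecoding window found in a log."""
    accepted = drafted = 0
    for m in SPEC_RE.finditer(log_text):
        accepted += int(m.group(1))
        drafted += int(m.group(2))
    if drafted == 0:
        return None  # no speculator configured, or no windows logged yet
    steps = drafted / num_spec_tokens
    return {
        "accepted": accepted,
        "drafted": drafted,
        "acceptance_rate": accepted / drafted,
        # Assumption: one target token per step plus the accepted drafts.
        "mean_acceptance_length": 1 + accepted / steps,
    }


sample = """SpecDecoding metrics:
  Mean acceptance length: 1.00,
  Accepted: 0 tokens, Drafted: 1632 tokens,
  Avg Draft acceptance rate: 0.0%"""
print(draft_acceptance(sample))
```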

3.3 Why MTP doesn't land on this checkpoint

Two plausible causes, with (1) the leading hypothesis:

(1) The MTP head was not recalibrated against the FP8 base. The vendor's FP8 release likely ships the BF16-trained MTP head verbatim (or quantized through a path the vLLM draft loader doesn't reproduce). The draft logits diverge from the FP8 target enough that every speculation fails verification.
(2) The MTP layer's own FP8 scales may hit the same .weight_scale vs .weight_scale_inv divergence on the draft loader path. Unlike the main loader (which the patch in §2 fixes), the failure mode here might be "loads but uses uninitialized scales" rather than a crash, producing bit-noise that the target always rejects. The existing if name.startswith("mtp."): continue short-circuit in Qwen3_5Model.load_weights means our generator wrap doesn't run on mtp.* names, because the draft model has its own loader path.

Both are open. We didn't chase them further because the operational fix doesn't require knowing which one is true: any drafter contributing 0 accepted tokens is overhead.

3.4 What the 0%-accept finding cost on this bench

The intuition "MTP layer is 1 of 61 forward layers, so the overhead is bounded" was wrong by a factor of ~10×. Beyond the drafter's own forward pass, speculation forces the target to verify k+1 positions per step (where k is the number of draft tokens, 2 in this recipe, not the concurrency N used in the tables), even when every draft rejects. With 0% acceptance, every step pays: draft compute + target verification on extra positions + zero accepted tokens of speedup. That overhead, plus the fixed memory cost of the draft model's parameters (which take ~2 GiB of GPU memory per rank that would otherwise be available as KV cache), drops decode throughput dramatically.
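
A back-of-envelope model makes the asymmetry concrete. It is a simplification under stated assumptions: each draft position is accepted independently with probability p, a rejection discards all later positions, and k is the number of draft tokens per step (2 in this recipe). The function names are illustrative, and the two throughput figures are the single-stream rows from the TL;DR table.

```python
def expected_tokens_per_step(p, k):
    """Tokens emitted per target step: 1 + p + p**2 + ... + p**k."""
    return sum(p ** i for i in range(k + 1))


def step_time_ratio(tps_spec_on, tps_spec_off, tokens_per_step):
    """How much longer one speculative step takes than one plain decode step."""
    steps_per_s_on = tps_spec_on / tokens_per_step
    return tps_spec_off / steps_per_s_on


# At 0 % acceptance every step yields exactly one token.
print(expected_tokens_per_step(0.0, 2))            # 1.0
# So 15.1 t/s with MTP vs 19.6 t/s without implies each speculative step
# took about 1.3x as long as a plain decode step, for no extra tokens.
print(round(step_time_ratio(15.1, 19.6, 1.0), 2))  # 1.3
```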

We didn't know how dramatically until we measured. So we relaunched.

4. Act III — MTP-off rebench

Single change: dropped the --speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":2}' flag from the relaunch script. Everything else identical (KV-fp8, flashinfer attention, max-num-seqs=16, max-model-len=262144, the same --apply-mod mods/nex-fp8-loader patch). The new script ships as relaunch-nex-n2-pro-fp8-tp4.sh (vs the MTP-on relaunch-nex-n2-pro-fp8-mtp-tp4.sh). New container name nex-n2-pro-fp8-tp4.

Boot log: logs/engine-boot-mtp-off.log. Engine reported:

speculative_config=None (vs MTP-on's SpeculativeConfig(method='mtp', num_spec_tokens=2))
No Detected MTP model lines — the draft model is gone
GPU KV cache size: 844,024 tokens (vs MTP-on's 530,028 — +59%)
Available KV cache memory: 6.38 GiB on the tightest rank TP1 (vs MTP-on's 4.47 GiB — +43%)
init engine took 137.54 s (compilation: 58.33 s) — unchanged vs MTP-on; disabling MTP saves no boot time

4.1 MTP-off concurrency bench

Raw results: logs/results-mtp-off.json.

N	Wall (s)	Agg prefill (t/s)	Agg decode (t/s)	Median TTFT (s)	Median per-req decode (t/s)
1	58.4	1 564	19.6	6.28	19.6
2	55.9	4 932	36.7	1.85	18.7
4	64.5	7 022	64.6	2.91	16.7
8	103.1	12 091	81.1	4.28	10.5
16	144.0	14 989	114.0	9.49	7.6

4.2 Side-by-side — what each N looks like with and without MTP

N	MTP-on agg decode (t/s)	MTP-off agg decode (t/s)	Δ	MTP-on median per-req decode (t/s)	MTP-off median per-req decode (t/s)
1	15.1	19.6	+30 %	15.1	19.6
2	22.9	36.7	+60 %	11.8	18.7
4	28.8	64.6	+124 %	7.4	16.7
8	55.7	81.1	+46 %	7.3	10.5
16	72.8	114.0	+57 %	4.8	7.6

The biggest relative wins are at N=4 (+124 %) and N=2 (+60 %), where the decode-vs-prefill mix is most decode-heavy and the per-step speculation overhead dominates. At N=8 / N=16 the engine is increasingly prefill-bound: between N=4 and N=16 the MTP-off aggregate prefill climbs from 7.0 k to 15.0 k t/s, whereas MTP-on plateaus at 8.9 k to 9.6 k t/s between N=8 and N=16. The relative decode gain compresses there, though the absolute aggregate decode at N=16 still moves from 72.8 to 114 t/s.

The N=1 TTFT is also far saner on the MTP-off boot: 6.28 s vs MTP-on's anomalous 51.08 s. Both N=1 runs fired immediately after /health=200, so both sit on the cold-start staircase. The difference is partly that the MTP-off boot has a tighter compile/profile footprint and no MTP warm-up, and partly that it skips the per-step MTP autotune work.

4.3 MTP-off NIAH — 200k prefill, same needle

Same NIAH harness as Qwen397 + Nemotron-3-Ultra posts: 200 000-token English filler with one needle injected at ~50% depth.

IMPORTANT: The secret access code for Project Aurora is BANANA-MOOSE-7421. Remember this exactly.
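
Building such a prompt can be sketched as below. This is not the shared harness: build_niah_prompt, the filler sentence, and the closing question are assumptions, and depth is measured in sentences rather than tokens (the real run sizes the filler to ~200 000 tokens).

```python
NEEDLE = ("IMPORTANT: The secret access code for Project Aurora is "
          "BANANA-MOOSE-7421. Remember this exactly.")


def build_niah_prompt(filler_sentence, n_sentences, needle=NEEDLE, depth=0.5):
    """Return filler text with `needle` inserted at the given fractional depth."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be within [0, 1]")
    sentences = [filler_sentence] * n_sentences
    sentences.insert(int(n_sentences * depth), needle)
    haystack = " ".join(sentences)
    # Hypothetical retrieval question appended after the haystack.
    question = "What is the secret access code for Project Aurora?"
    return f"{haystack}\n\n{question}"


prompt = build_niah_prompt("The river kept its slow course through the valley.", 1000)
print(len(prompt), prompt.index(NEEDLE) / len(prompt))
```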

Metric	MTP-on	MTP-off
Input tokens	199 402	199 402
TTFT (≈ prefill)	162.5 s	131.3 s
Prefill speed	1 227 t/s	1 519 t/s
Output tokens	62	62
Decode speed	15.5 t/s	21.3 t/s
Total wall	~166 s	~134 s
Needle retrieved	✅	✅

Model's final answer (MTP-off run):

BANANA-MOOSE-7421

Both runs retrieve the needle exactly. The prefill speedup is roughly the share of prefill time that was previously also paying for the MTP head's parameter sharding work (the draft model shares the target's embedding + lm_head, but each TP rank still loads + processes its slice of the draft model on every step). At MTP-off, the per-rank KV pool grew from 4.47 GiB → 6.38 GiB which gave the engine a fatter batch budget for the 199 k prefill micro-batches.

4.4 Where we landed

The final serving config is the MTP-off recipe: N=1 single-stream 19.6 t/s decode, N=16 aggregate 114 t/s decode, 200 k NIAH at 1 519 t/s prefill + 21.3 t/s decode. That's still slower than Qwen397's same-hardware baseline (40 t/s / 186 t/s / 1 584 t/s) — Nex doesn't have a working drafter, so we're paying the full target-model decode cost on every step where Qwen397 amortizes ~86% of its decodes through MTP. The honest read: Nex-N2-Pro-fp8 on Spark with vLLM is throughput-limited by the broken drafter, not by anything we can patch from the launcher. The remaining gap closes only with a re-trained / re-quantized MTP head from the vendor (or the source build of sglang if it turns out the draft loader there handles the FP8 scales differently).

5. Engine-reported config — MTP-off, the version we serve

Pulled from docker logs nex-n2-pro-fp8-tp4 (full log: logs/engine-boot-mtp-off.log):

speculative_config=None, quantization=fp8, kv_cache_dtype=fp8, tensor_parallel_size=4
Available KV cache memory: 6.38 GiB on TP1 (tightest), 7.39 GiB on TP3
GPU KV cache size: 844,024 tokens
init engine (profile, create kv cache, warmup model) took 137.54 s (compilation: 58.33 s)
Full boot to /health=200: ~15 min 40 s from container up to first 200 response, with ~13 min of that in safetensors shard load (40 × ~9.3 GiB at ~18 s / shard).

The MTP-on baseline (for comparison) had:

speculative_config=SpeculativeConfig(method='mtp', num_spec_tokens=2)
Available KV cache memory: 4.47 GiB tightest
GPU KV cache size: 530,028 tokens
init engine … took 138.08 s (compilation: 64.94 s) — essentially the same; MTP doesn't move the boot needle

6. Reproducibility — Docker image

Same vllm-node-mimo:latest used for the Qwen3.5-397B post. Lineage:

nvidia/cuda:13.2.0-devel-ubuntu24.04
   │  + ccache, build tools, libibverbs (RDMA)
   │  + pip install torch==2.11.0+cu130, triton, nvshmem
   │
   ▼
vllm-node-tf5:latest           (older transformers, stable for non-MTP)
   │
   └── Dockerfile.mimo-runtime  (transformers≥5.0 + reinstall vllm/flashinfer wheels)
       │
       └── vllm-node-mimo:latest   ◀── this benchmark used this image

Required pinned versions (do NOT skip)

Package	Version	Source	Why
`torch`	2.11.0+cu130	`https://download.pytorch.org/whl/cu130`	Fresh vLLM wheel ships metadata pinning `torch==2.10.0` but the C++ ABI actually needs 2.11.0. If you let `uv` resolve naturally it lands `torch==2.10.0+cpu` and vllm dies at import with `ImportError: libtorch_cuda.so: cannot open shared object file`. Force-reinstall the cu130 wheel as a separate Dockerfile RUN layer AFTER the vllm install with `--no-deps --force-reinstall`.
`transformers`	`≥5.0.0`	pip	`qwen3_next_mtp` config classes live in transformers 5.x; tf5's base pin is 4.x — mimo overrides.
`vllm`	`0.22.1rc1.dev124+gace95c9cf.d20260603.cu132`	local wheel `wheels/vllm-*.whl`	the `Qwen3_5MoeForConditionalGeneration` model code our patch modifies.
`flashinfer-python`	`0.6.12`	local wheels `wheels/flashinfer_*.whl`	required for FP8 KV + the long-context attention path.
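
A small check run inside the container can catch a drifted pin before a 15-minute boot does. This is a sketch, not part of the image: check_pins is a hypothetical helper, the pins are copied from the table above, and the transformers bound is reduced to a major-version comparison.

```python
from importlib import metadata

# Exact pins copied from the table above.
EXACT_PINS = {
    "torch": "2.11.0+cu130",
    "vllm": "0.22.1rc1.dev124+gace95c9cf.d20260603.cu132",
    "flashinfer-python": "0.6.12",
}
# Lower bounds, compared on the major version only.
MIN_MAJOR = {"transformers": 5}


def check_pins(get_version=metadata.version):
    """Return a list of human-readable mismatches (empty means all pins hold)."""
    problems = []
    for pkg in list(EXACT_PINS) + list(MIN_MAJOR):
        try:
            got = get_version(pkg)
        except metadata.PackageNotFoundError:
            problems.append(f"{pkg}: not installed")
            continue
        if pkg in EXACT_PINS and got != EXACT_PINS[pkg]:
            problems.append(f"{pkg}: {got} (want {EXACT_PINS[pkg]})")
        if pkg in MIN_MAJOR and int(got.split(".")[0]) < MIN_MAJOR[pkg]:
            problems.append(f"{pkg}: {got} (want >= {MIN_MAJOR[pkg]}.0)")
    return problems
```

Calling check_pins() with no arguments reads the versions of whatever is installed in the current environment; passing a lookup function makes it testable offline.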

Full Dockerfile.mimo-runtime, build + verify + distribute steps, and the docker image prune -a pitfall are documented identically in the Qwen3.5-397B post — see Reproducibility — Docker image. Tar backup at <control-workspace>/docker-images/vllm-node-mimo.tar so a lost image is restorable without a wheels rebuild.

7. Reproducibility — apply the Nex mod

Copy ../resources/mods/nex-fp8-loader/ into ~/spark-vllm-docker/mods/ on the head GX10. Two files:

run.sh — applies qwen3_5-fp8-scale-alias.diff via patch -p1 --fuzz=5, with a Python sentinel-anchored fallback that inserts the same generator wrap by string substitution if the diff drifts vs a wheel update. Both paths converge on the same final source.
qwen3_5-fp8-scale-alias.diff — the 13-line unified diff.
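
The fallback idea can be sketched as an idempotent string substitution. Everything specific here is hypothetical: the SENTINEL text, the ANCHOR line, and apply_alias_patch are illustrative and are not copied from run.sh; only the inserted generator mirrors the patch shown in §2.2.

```python
# Hypothetical marker used to detect an already-patched source.
SENTINEL = "# nex-fp8-loader: alias .weight_scale -> .weight_scale_inv"
# Hypothetical anchor: the first statement of the method body being wrapped.
ANCHOR = "        stacked_params_mapping = ["

WRAP = '''\
        {sentinel}
        def _alias_fp8_scales(ws):
            for n, w in ws:
                if n.endswith(".weight_scale"):
                    yield n + "_inv", w
                else:
                    yield n, w
        weights = _alias_fp8_scales(weights)
'''


def apply_alias_patch(source):
    """Insert the alias generator before ANCHOR; a second call is a no-op."""
    if SENTINEL in source:
        return source  # already patched
    if source.count(ANCHOR) != 1:
        # Refuse to guess when the anchor is missing or ambiguous.
        raise ValueError("anchor not found exactly once; refusing to patch")
    return source.replace(ANCHOR, WRAP.format(sentinel=SENTINEL) + ANCHOR, 1)
```

Refusing to patch when the anchor is ambiguous is a deliberate choice in this sketch: a file with several load_weights methods would otherwise get the wrap in the wrong place.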

Both files in this post are exact copies of what's on the head GX10; the --apply-mod machinery does not require the mod be checked into the image, only on the launcher node's filesystem.

8. Reproducibility — launch the cluster (MTP-off, the recommended recipe)

Relaunch wrapper at ../resources/scripts/relaunch-nex-n2-pro-fp8-tp4.sh (exact copy of the head GX10's ~/spark-vllm-docker/relaunch-nex-n2-pro-fp8-tp4.sh). One-shot:

ssh -i ~/.ssh/<spark-key> <spark-user>@<head-node> \
  'cd ~/spark-vllm-docker && ./relaunch-nex-n2-pro-fp8-tp4.sh'

The MTP-on baseline recipe — the one §3's numbers came from, kept on the head GX10 so the comparison can be re-run on demand — is at ../resources/scripts/relaunch-nex-n2-pro-fp8-mtp-tp4.sh.

The relaunch script:

Stops any prior nex-n2-pro-fp8-tp4 container plus any leftover vllm serve / ray start / RayWorkerProc / boot-launch-tp.sh processes on every node (PID-walk, SIGTERM then SIGKILL), then drop_caches for a clean weight reload.
Verifies every node is clean (same PID walk, asserts no vllm/ray/RayWorkerProc/boot script survivors).
Calls launch-cluster.sh … --apply-mod mods/nex-fp8-loader … which auto-SCPs the mod to every Spark, applies the patch inside each container, then runs vllm serve.

The full expanded vllm serve command for the MTP-off recipe (after recipe + env substitution; the only difference vs the MTP-on script is the missing --speculative-config line and a different container name):

docker run -d --name nex-n2-pro-fp8-tp4 \
  --runtime nvidia --network host --ipc host --shm-size 16g \
  -e HF_HUB_OFFLINE=1 -e TRANSFORMERS_OFFLINE=1 \
  -e PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
  -e VLLM_USE_DEEP_GEMM=0 -e VLLM_USE_FLASHINFER_MOE_FP16=1 \
  -e VLLM_USE_FLASHINFER_SAMPLER=0 -e OMP_NUM_THREADS=4 \
  -v <spark-model-root>:/root/.cache/huggingface \
  -v <spark-launcher-dir>/mods/nex-fp8-loader:/mods/nex-fp8-loader \
  vllm-node-mimo:latest \
  bash -c 'cd /mods/nex-fp8-loader && bash run.sh && \
    vllm serve /root/.cache/huggingface/nex-agi/Nex-N2-Pro-fp8 \
      --served-model-name Nex-N2-Pro-fp8 \
      --host 0.0.0.0 --port 8000 \
      --max-model-len 262144 \
      --gpu-memory-utilization 0.90 \
      --load-format safetensors \
      --enable-prefix-caching \
      --enable-auto-tool-choice \
      --tool-call-parser qwen3_coder \
      --reasoning-parser qwen3 \
      --max-num-batched-tokens 8192 \
      --max-num-seqs 16 \
      --trust-remote-code \
      -tp 4 --distributed-executor-backend ray \
      --mm-encoder-tp-mode data \
      --kv-cache-dtype fp8 \
      --compilation-config.cudagraph_mode none \
      --attention-backend flashinfer'

(The bind-mount of /mods/nex-fp8-loader + the bash run.sh && prefix is what launch-cluster.sh --apply-mod produces on each node before invoking vllm serve — the launcher generates this wrapper.)

Verify the engine actually wired up KV after a relaunch by grepping the container logs for Available KV cache memory: (4 lines, one per rank) and Started server process. With MTP off, do not look for Detected MTP model — that line is the wrong signal here; absence confirms the speculator is off.
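
The same check can be scripted. A minimal sketch, assuming the log text has already been captured (for example from docker logs nex-n2-pro-fp8-tp4); check_boot_log is a hypothetical helper and the three marker strings are the ones named above.

```python
def check_boot_log(log_text, tp_size=4, expect_mtp=False):
    """Check a captured engine log for the post-relaunch markers."""
    kv_lines = log_text.count("Available KV cache memory:")
    mtp_seen = "Detected MTP model" in log_text
    return {
        "kv_reported_on_all_ranks": kv_lines == tp_size,
        "server_started": "Started server process" in log_text,
        # With MTP off, absence of the marker is the expected state.
        "mtp_state_ok": mtp_seen == expect_mtp,
    }


sample_log = "\n".join(
    ["Available KV cache memory: 6.38 GiB"] * 4 + ["Started server process"]
)
print(check_boot_log(sample_log))
```

For the MTP-on baseline script, the same helper would be called with expect_mtp=True.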

Boot time: ~15 min 40 s to /health=200 on a cold cache (40 shards × ~18 s/shard load + 138 s profile/compile). Engine-init compile is fixed; shard load dominates and is bottlenecked by per-rank disk read of ~93 GiB.

9. Reproducibility — bench the cluster

# On the control node (Python 3.14; needs python3-requests from apt
# because pip/venv are not set up here).
cd <control-workspace>/blog/nex-n2-pro-fp8
/usr/bin/python3 logs/bench-mtp-off.py > logs/results-mtp-off.json 2> logs/bench-mtp-off.log

logs/bench-mtp-off.py and logs/bench.py are byte-identical to each other: both are copies of the shared ../resources/bench.py with MODEL="Nex-N2-Pro-fp8". Same concurrency sweep (N ∈ {1, 2, 4, 8, 16}), same NIAH (~199 k filler + needle at 50% depth, same BANANA-MOOSE-7421 secret as the Qwen397 + Nemotron posts so the numbers compare apples-to-apples). The two filenames are kept distinct so the MTP-on and MTP-off raw outputs (results.json vs results-mtp-off.json) don't collide.

10. Prior art — where the launch flags came from

Qwen3.5-397B-A17B-FP8 4x GX10 post (../qwen-3.5-397b/README.md) — the recipe this is derived from. Nex is the same architecture (Qwen3_5MoeForConditionalGeneration, 60 hybrid linear/full layers, 512 routed experts top-10, MTP-1) and the only differences in the launch command are the model path, served name, container name, the --apply-mod mods/nex-fp8-loader injection, and (the entire point of this post) the missing --speculative-config.
Nex-N2-Pro model card (huggingface.co/nex-agi/Nex-N2-Pro-fp8) — confirms --reasoning-parser qwen3 --tool-call-parser qwen3_coder (we use both) and --mamba-scheduler-strategy extra_buffer (sglang-only; vLLM handles linear-attention internally and has no equivalent flag).
Qwen3.5-397B-A17B model card (huggingface.co/Qwen/Qwen3.5-397B-A17B) — for the underlying architecture.
vLLM qwen3_5.py source — directly inspected at vllm/model_executor/models/qwen3_5.py in the running container; the Qwen3_5Model.load_weights substring-remap loops + AutoWeightsLoader routing in vllm/model_executor/models/utils.py::_load_module are the surface area the patch sits on.

11. Files in this folder

README.md — this file
logs/bench.py — bench harness for the MTP-on baseline (copy of ../resources/bench.py, MODEL=Nex-N2-Pro-fp8)
logs/bench-mtp-off.py — identical, kept distinct so results files don't collide
logs/results.json — full raw MTP-on bench output (§3)
logs/results-mtp-off.json — full raw MTP-off bench output (§4)
logs/bench.log — MTP-on bench progress stderr
logs/bench-mtp-off.log — MTP-off bench progress stderr
logs/engine-boot.log — full docker logs nex-n2-pro-fp8-mtp-tp4 from the MTP-on boot (Act II)
logs/engine-boot-mtp-off.log — full docker logs nex-n2-pro-fp8-tp4 from the MTP-off boot (Act III)
logs/engine-fail-pre-patch.log — the pre-patch boot showing the KeyError + linear-attn warnings the patch fixes (Act I)

Shared scripts / mods / harness used by this post live in ../resources/.