Ornith-1.0-397B-FP8 on 4x ASUS GX10

Date: 2026-06-30
Model: deepreinforce-ai/Ornith-1.0-397B-FP8 (DeepReinforce's MIT-licensed 397B agentic coding model, FP8 release)
Hardware: 4x ASUS Ascent GX10 (NVIDIA GB10, 128 GiB unified per node, 200 GbE ConnectX-7 fabric), TP=4 via Ray
Image: vllm-node-mimo:latest (vllm 0.22.1rc1.dev124+gace95c9cf.d20260603.cu132)
Served at: http://<head-node>:8000, model name Ornith-1.0-397B-FP8

Field notes

Ornith is the one I am warming up to. It is still early, but so far it seems to mess around less with general tool calling than the other Qwen3.5-derived runs. That matters more in actual local-agent use than a small leaderboard-style quality delta: a model that calls tools plainly, does not over-negotiate the next step, and keeps the loop moving is often the one that feels better in OpenClaude, Pi, or OpenCode.

The caveat is the same one Nex exposed: the shipped MTP head does not accept against the FP8 target, so the model has to run without speculative decoding. That keeps it behind Qwen397 on raw decode speed, but the behavioral profile is interesting enough that I still want it in the rotation. If Qwen397 is the fast default, Ornith is the current candidate for "maybe the tool-use temperament is better."

Ornith is post-trained from Qwen3.5 (Qwen3_5MoeForConditionalGeneration: same 60 hybrid layers, 512 experts × 10 active, single-head MTP, 262 144 ctx). The FP8 release uses compressed-tensors per-channel weights + per-token dynamic activations, which vLLM loads natively. The launch is the Qwen3.5-397B-A17B-FP8 recipe with three swaps: model path, served-model-name, and --tool-call-parser qwen3_xml (per the Ornith card). That recipe also enables MTP, but the MTP head lands at 0% acceptance on this checkpoint; disabling it is what gets you the numbers in §1.


Benchmarks

1. TL;DR — what we got

Single concurrency sweep on the cluster, FP8 KV cache, max_num_seqs=16, max_model_len=262 144, temperature=0.7, ~1 040-token prompt × 512 decode tokens.

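| N (concurrent requests) | Wall (s) | Aggregate decode (t/s) | Median per-req decode (t/s) | Median TTFT (s) |
|---|---|---|---|---|
| 1 | 28.5 | 18.0 | 18.8 | 1.36 |
| 2 | 30.2 | 33.9 | 17.6 | 1.24 |
| 4 | 36.1 | 56.7 | 15.1 | 2.26 |
| 8 | 55.6 | 72.4 | 10.3 | 6.19 |
| 16 | 69.5 | 112.8 | 7.9 | 4.75 |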

Long-context single-stream NIAH-shaped prefill probe at ~200 k input tokens — TTFT 125.3 s ⇒ 1 597 t/s prefill, then 18.7 t/s decode.
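The prefill figure is simply prompt tokens divided by TTFT. A minimal sketch of that arithmetic, assuming the nominal 200 000-token prompt (the exact token count lives in the raw JSON, so the result only approximates the reported 1 597 t/s); the function name is illustrative, not taken from the bench harness:

```python
def prefill_tok_s(prompt_tokens: int, ttft_s: float) -> float:
    """Approximate prefill rate, treating all of TTFT as prefill time."""
    return prompt_tokens / ttft_s


# 200_000 is the nominal prompt size, not the exact tokenized length.
rate = prefill_tok_s(200_000, 125.3)
print(f"{rate:.0f} t/s prefill")
```

Treating TTFT as pure prefill slightly understates the kernel rate, since TTFT also includes queueing and the first decode step.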

Raw JSON: logs/results-mtp-off.json. Bench progress: logs/bench-mtp-off.log.

How it stacks up on the same hardware

Same 4x GX10 hardware, same image, same launcher, same bench harness:

| Model | Single-stream decode (t/s) | Agg @ N=16 (t/s) | 200 k prefill (t/s) | Notes |
|---|---|---|---|---|
| Qwen3.5-397B-A17B-FP8 (MTP-2) | 40.0 | 186 | 1 584 | MTP ~86% accept → ~1.7× decode multiplier |
| Ornith-1.0-397B-FP8 (MTP off) | 18.0 | 113 | 1 597 | Same arch; vendor-shipped MTP head at 0% accept; see §4.2 |
| Nex-N2-Pro-fp8 (MTP off) | 19.6 | 114 | 1 519 | Same arch; same MTP issue; FP8 loader patch required |

Prefill numbers are essentially identical across all three (same prefill kernels, same FP8 arithmetic). Decode is where MTP shows up: the Qwen-team checkpoint ships an MTP head that the FP8 base accepts at ~86% and gets ~1.7× decode, while both fine-tuned variants (Ornith and Nex) ship MTP heads that the target rejects outright, so they have to run target-only. That's the gap; §4.2 covers the leading hypothesis for why.


2. Setup at a glance

Topology: 4x GX10 (one head node, three worker GX10s)
Distributed backend: Ray, TP=4
Image: vllm-node-mimo:latest (image id 75429f413d11)
Model on disk: <spark-model-root>/deepreinforce-ai/Ornith-1.0-397B-FP8/ (377 GiB per node, 122 shards)
Context length: 262 144
KV cache dtype: fp8
Tool / reasoning parsers: qwen3_xml / qwen3
Speculative decoding: none (Ornith MTP head doesn't accept against the FP8 base; see §4.2)
API: http://<head-node>:8000
Boot time (cold): ~20.5 min to /health=200 (122 shard load + 121 s engine init)
Available KV @ TP1 (tightest): 4.98 GiB
GPU KV cache pool: 659 331 tokens
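The KV pool size bounds how many sequences can stay resident at once. A rough sketch of that arithmetic from the numbers above, assuming a flat tokens-per-sequence model; it ignores block granularity, prefix-cache sharing, and anything specific to the hybrid layers, so treat it as an estimate rather than what the scheduler actually does:

```python
KV_POOL_TOKENS = 659_331   # "GPU KV cache pool" reported at boot
MAX_MODEL_LEN = 262_144    # --max-model-len
BENCH_SEQ_TOKENS = 1_040 + 512  # approximate bench prompt + decode tokens


def max_resident(pool_tokens: int, tokens_per_seq: int) -> float:
    """How many sequences of a given length fit in the pool (flat model)."""
    return pool_tokens / tokens_per_seq


full_ctx = max_resident(KV_POOL_TOKENS, MAX_MODEL_LEN)
bench = max_resident(KV_POOL_TOKENS, BENCH_SEQ_TOKENS)
print(f"full-context sequences: {full_ctx:.2f}")
print(f"bench-sized sequences:  {bench:.0f}")
```

Under this estimate the N=16 sweep fits comfortably, while only about two and a half full-length 262 144-token sequences could be resident together.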

Pinned wheels in the image:

| Package | Version | Why |
|---|---|---|
| torch | 2.11.0+cu130 | Fresh vLLM wheel ships metadata pinning torch==2.10.0 but the C++ ABI actually needs 2.11.0. Force-reinstall the cu130 wheel as a separate Dockerfile RUN AFTER vllm install with --no-deps --force-reinstall, otherwise vllm dies at import with ImportError: libtorch_cuda.so. |
| transformers | ≥5.0.0 | qwen3_5_moe + qwen3_next_mtp config classes live in 5.x |
| vllm | 0.22.1rc1.dev124+gace95c9cf.d20260603.cu132 | local wheel; carries the Qwen3_5MoeForConditionalGeneration model code |
| flashinfer-python | 0.6.12 | FP8 KV + long-context attention path |
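A quick way to confirm the pins inside a running container is to compare installed versions against the table. The sketch below is one possible check, not part of the image; the rule names ("exact", "min_major") and helper functions are hypothetical, and local build tags after "+" are deliberately ignored:

```python
import re
from importlib import metadata

# Pins from the table above; local tags such as "+cu130" are not compared.
PINS = {
    "torch": ("exact", "2.11.0"),
    "transformers": ("min_major", 5),
    "vllm": ("exact", "0.22.1rc1.dev124"),
    "flashinfer-python": ("exact", "0.6.12"),
}


def base_version(version: str) -> str:
    """Strip the local build tag: '2.11.0+cu130' -> '2.11.0'."""
    return version.split("+")[0]


def satisfies(installed: str, rule: tuple) -> bool:
    kind, want = rule
    if kind == "exact":
        return base_version(installed) == want
    if kind == "min_major":
        major = int(re.match(r"\d+", installed).group())
        return major >= want
    raise ValueError(f"unknown rule kind: {kind}")


def check_pins(installed: dict) -> dict:
    """Map package name -> True/False; missing packages count as False."""
    return {
        name: name in installed and satisfies(installed[name], rule)
        for name, rule in PINS.items()
    }


def installed_versions(names) -> dict:
    """Read versions from the current environment, skipping absent packages."""
    found = {}
    for name in names:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            pass
    return found


if __name__ == "__main__":
    for name, ok in check_pins(installed_versions(PINS)).items():
        print(f"{name}: {'ok' if ok else 'MISMATCH'}")
```

Run it with the container's own interpreter; a torch mismatch here is the cheap early warning for the libtorch_cuda.so import failure described in the table.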

Full image lineage and Dockerfile details are documented in the Qwen3.5-397B post; same image. Tar backup at <control-workspace>/docker-images/vllm-node-mimo.tar.


3. Reproduce

Launch the cluster

Relaunch wrapper at ../resources/scripts/relaunch-ornith-fp8-tp4.sh (the MTP-on variant kept alongside for reference is relaunch-ornith-fp8-mtp-tp4.sh):

ssh -i ~/.ssh/<spark-key> <spark-user>@<head-node> \
  'cd ~/spark-vllm-docker && ./relaunch-ornith-fp8-tp4.sh'

The wrapper stops any stale vllm / ray / RayWorkerProc processes on all four nodes (PID walk + SIGTERM/SIGKILL), runs drop_caches, then dispatches the cluster via launch-cluster.sh. The expanded vllm serve it runs on each node:

vllm serve /root/.cache/huggingface/deepreinforce-ai/Ornith-1.0-397B-FP8 \
  --served-model-name Ornith-1.0-397B-FP8 \
  --host 0.0.0.0 --port 8000 \
  --max-model-len 262144 \
  --gpu-memory-utilization 0.90 \
  --load-format safetensors \
  --enable-prefix-caching \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_xml \
  --reasoning-parser qwen3 \
  --max-num-batched-tokens 8192 \
  --max-num-seqs 16 \
  --trust-remote-code \
  -tp 4 \
  --distributed-executor-backend ray \
  --mm-encoder-tp-mode data \
  --kv-cache-dtype fp8 \
  --compilation-config.cudagraph_mode none \
  --attention-backend flashinfer

Env passed in by the launcher: HF_HUB_OFFLINE=1 TRANSFORMERS_OFFLINE=1 PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True VLLM_USE_DEEP_GEMM=0 VLLM_USE_FLASHINFER_MOE_FP16=1 VLLM_USE_FLASHINFER_SAMPLER=0 OMP_NUM_THREADS=4.

Verify the engine is up:

python3 -c "import urllib.request; print(urllib.request.urlopen('http://<head-node>:8000/health', timeout=4).status)"
# 200

python3 - <<'PY'
import json, urllib.request as u
r = u.urlopen(u.Request("http://<head-node>:8000/v1/chat/completions",
    data=json.dumps({"model":"Ornith-1.0-397B-FP8",
      "messages":[{"role":"user","content":"Reply with the single word: alive"}],
      "max_tokens":256,"temperature":0}).encode(),
    headers={"Content-Type":"application/json"}), timeout=120)
print(json.loads(r.read())["choices"][0]["message"])
PY

Note max_tokens=256 — the reasoning parser needs budget for <think>…</think> plus the answer. With max_tokens=50 the reasoning eats the budget and content returns null.
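To tell "budget exhausted during reasoning" apart from a genuinely empty reply, a smoke test can look at finish_reason alongside content. A minimal sketch operating on one entry of the response's choices list; the function name is illustrative, and checking both reasoning and reasoning_content is a defensive assumption about which key a given vLLM build uses:

```python
def answer_or_diagnosis(choice: dict) -> str:
    """Return the answer text, or a short note explaining why there is none."""
    message = choice.get("message", {})
    content = message.get("content")
    if content:
        return content.strip()
    # Assumption: the chain-of-thought arrives under one of these two keys.
    reasoning = message.get("reasoning") or message.get("reasoning_content") or ""
    if choice.get("finish_reason") == "length":
        return (
            "<no answer: max_tokens exhausted after "
            f"{len(reasoning)} chars of reasoning>"
        )
    return "<no answer>"


# Shape of a truncated reply, as with max_tokens=50:
truncated = {
    "finish_reason": "length",
    "message": {"content": None, "reasoning": "The user wants one word"},
}
print(answer_or_diagnosis(truncated))
```

In the smoke test above, this would replace the final print by passing json.loads(r.read())["choices"][0] to the function.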

Bench

# Control node — Python 3.14 with apt's python3-requests.
/usr/bin/python3 <control-workspace>/scripts/bench_ornith_concurrency.py \
  --levels 1,2,4,8,16 --decode-tokens 512 --long-ctx 200000

Streaming bench uses requests + iter_lines() (not urllib — urlopen buffers SSE and breaks TTFT/per-stream timing). It also reads both delta.content and delta.reasoning for token-arrival events because the qwen3 reasoning parser emits the chain-of-thought via delta.reasoning in vLLM's stream chunks.
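The timing logic described above can be sketched independently of the HTTP client. This is not the harness in logs/bench.py, only an illustration under stated assumptions: the function names are hypothetical, the input is any iterable of raw SSE lines (for example what requests' iter_lines() yields), and "events" counts chunks that carried text, which is not necessarily the same as tokens:

```python
import json
import time
from typing import Callable, Iterable, Optional


def delta_text(raw_line: bytes) -> Optional[str]:
    """Text carried by one SSE line (reasoning or content), else None."""
    if not raw_line.startswith(b"data:"):
        return None
    payload = raw_line[len(b"data:"):].strip()
    if payload == b"[DONE]":
        return None
    choices = json.loads(payload).get("choices") or []
    if not choices:
        return None
    delta = choices[0].get("delta", {})
    # Count both keys: the qwen3 reasoning parser streams via delta.reasoning.
    text = (delta.get("reasoning") or "") + (delta.get("content") or "")
    return text or None


def time_stream(
    lines: Iterable[bytes],
    start: float,
    clock: Callable[[], float] = time.perf_counter,
) -> Optional[dict]:
    """TTFT and decode window from the arrival times of text-bearing chunks."""
    first = last = None
    events = 0
    for raw in lines:
        if delta_text(raw) is None:
            continue
        now = clock()
        if first is None:
            first = now
        last = now
        events += 1
    if first is None:
        return None
    return {"ttft_s": first - start, "decode_window_s": last - first, "events": events}


# Usage with requests (not executed here):
#   start = time.perf_counter()
#   resp = requests.post(url, json={**body, "stream": True}, stream=True)
#   stats = time_stream(resp.iter_lines(), start)
```

Passing the clock in as a parameter keeps the function testable with canned lines and fake timestamps.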

Harness: logs/bench.py. Output JSON: logs/results-mtp-off.json.


4. Details

4.1 Why no loader patch was needed

Ornith ships FP8 weights via compressed-tensors format:

"quantization_config": {
  "quant_method": "compressed-tensors",
  "config_groups": {
    "config_group_0": {
      "weights":           {"num_bits": 8, "strategy": "channel", "symmetric": true, ...},
      "input_activations": {"num_bits": 8, "strategy": "token",   "dynamic": true,  ...}
    }
  }
}

That is per-channel weights + per-token dynamic activations, registered through vLLM's native compressed-tensors loader. The Qwen-team Qwen3.5-397B-A17B-FP8 release uses block-FP8 (weight_block_size=[128,128]) with .weight_scale_inv-suffixed scales; the Nex-N2-Pro-fp8 release uses block-FP8 with .weight_scale suffix and needed the mods/nex-fp8-loader patch to rename. Ornith hits neither code path — compressed-tensors MoE loading lands on the right registered slots for the per-channel scheme on the first try. No mod, no KeyError, no warning lines.
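Before launching a new FP8 checkpoint it can help to see which of these schemes its config.json declares. A small sketch that classifies a quantization_config dict using only the fields mentioned above; the function name and the returned labels are illustrative, and it does not attempt to detect the scale-suffix difference, which lives in the tensor names rather than the config:

```python
def fp8_scheme(quant_cfg: dict) -> str:
    """Label an FP8 checkpoint's quantization_config by scheme."""
    method = quant_cfg.get("quant_method")
    if method == "compressed-tensors":
        # Look at the first config group only; enough for a quick triage.
        group = next(iter(quant_cfg.get("config_groups", {}).values()), {})
        weights = group.get("weights", {}).get("strategy")
        acts = group.get("input_activations", {})
        kind = "dynamic" if acts.get("dynamic") else "static"
        return (
            f"compressed-tensors: {weights} weights, "
            f"{kind} per-{acts.get('strategy')} activations"
        )
    if "weight_block_size" in quant_cfg:
        rows, cols = quant_cfg["weight_block_size"]
        return f"block-FP8 {rows}x{cols}"
    return f"unrecognized ({method})"


ornith_like = {
    "quant_method": "compressed-tensors",
    "config_groups": {
        "config_group_0": {
            "weights": {"num_bits": 8, "strategy": "channel", "symmetric": True},
            "input_activations": {"num_bits": 8, "strategy": "token", "dynamic": True},
        }
    },
}
print(fp8_scheme(ornith_like))
```

To use it on a real checkpoint, load config.json with json.load and pass its "quantization_config" entry.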

Engine init confirms: quantization=compressed-tensors, init engine (profile, create kv cache, warmup model) took 121.25 s (compilation: 46.97 s). Engine boot log: logs/engine-boot.log.

4.2 The MTP-acceptance check (and why MTP is off)

The first launch used the Qwen3.5-397B-A17B-FP8 recipe verbatim — including its --speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":2}' flag. The drafter loaded cleanly and ran every step. But every step looked like this:

SpecDecoding metrics: Mean acceptance length: 1.00,
  Accepted throughput: 0.00 tokens/s,
  Drafted throughput: 204.74 tokens/s,
  Accepted: 0 tokens, Drafted: 2048 tokens,
  Per-position acceptance rate: 0.000, 0.000,
  Avg Draft acceptance rate: 0.0%

Same exact pattern as Nex-N2-Pro-fp8: the MTP head drafts, the target rejects everything. The leading hypothesis is that the vendor's FP8 release ships an MTP head that was trained against the BF16 base (or quantized through a path the vLLM draft loader doesn't reproduce); the draft logits diverge from the FP8 target enough that no speculation lands. Qwen-team's Qwen3.5-397B-A17B-FP8 does not exhibit this — it runs at ~86% MTP acceptance — so the fine-tune is the difference, not the architecture.
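When checking a new checkpoint, the acceptance numbers can be pulled straight out of the log text rather than eyeballed. A sketch that parses the metrics shown above; the function name is hypothetical, and the relation mean acceptance length = 1 + accepted / number of drafts is an assumption that happens to agree with the logged 1.00 at zero accepts:

```python
import re

LOG = """SpecDecoding metrics: Mean acceptance length: 1.00,
  Accepted throughput: 0.00 tokens/s,
  Drafted throughput: 204.74 tokens/s,
  Accepted: 0 tokens, Drafted: 2048 tokens,
  Per-position acceptance rate: 0.000, 0.000,
  Avg Draft acceptance rate: 0.0%"""


def parse_spec_metrics(text: str, num_speculative_tokens: int) -> dict:
    """Extract accepted/drafted counts and derive acceptance statistics."""
    accepted = int(re.search(r"Accepted: (\d+) tokens", text).group(1))
    drafted = int(re.search(r"Drafted: (\d+) tokens", text).group(1))
    if drafted == 0:
        return {"accepted": 0, "drafted": 0, "accept_rate": 0.0,
                "mean_acceptance_length": 1.0}
    drafts = drafted / num_speculative_tokens  # one draft proposes k tokens
    return {
        "accepted": accepted,
        "drafted": drafted,
        "accept_rate": accepted / drafted,
        # Assumption: each step always yields one target token plus accepts.
        "mean_acceptance_length": 1 + accepted / drafts,
    }


stats = parse_spec_metrics(LOG, num_speculative_tokens=2)
print(stats)
```

An accept_rate that stays at exactly 0.0 over a couple of thousand drafted tokens is the signal to drop the drafter rather than tune it.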

The cost of running a 0%-accept drafter is not zero: every step pays draft compute plus target verification on k+1 positions (k = num_speculative_tokens, so 3 positions here), and the drafter's parameters claim ~2 GiB of GPU memory per rank that would otherwise be KV cache. The fix is to drop --speculative-config. Aggregate decode per concurrency level N:

| N | MTP-on agg decode (t/s) | MTP-off agg decode (t/s) | Δ |
|---|---|---|---|
| 1 | 14.2 | 18.0 | +27% |
| 2 | 23.2 | 33.9 | +46% |
| 4 | 29.2 | 56.7 | +94% |
| 8 | 55.3 | 72.4 | +31% |
| 16 | 80.3 | 112.8 | +40% |
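The Δ column is the relative gain of MTP-off over MTP-on at each concurrency level. A short sketch that recomputes it from the table values (the variable and function names are illustrative):

```python
# N -> (MTP-on agg decode t/s, MTP-off agg decode t/s), from the table above
AGG_DECODE = {
    1: (14.2, 18.0),
    2: (23.2, 33.9),
    4: (29.2, 56.7),
    8: (55.3, 72.4),
    16: (80.3, 112.8),
}


def pct_gain(before: float, after: float) -> int:
    """Relative change from before to after, rounded to a whole percent."""
    return round((after / before - 1) * 100)


gains = {n: pct_gain(on, off) for n, (on, off) in AGG_DECODE.items()}
for n, gain in gains.items():
    print(f"N={n}: +{gain}%")
```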

Single-stream decode: 14.5 → 18.8 t/s. Long-context 200 k decode: 14.6 → 18.7 t/s. Prefill is unchanged (1 584 → 1 597 t/s) — MTP is a decode-side feature; the prefill kernels don't care.

The raw MTP-on results are in logs/results-mtp-on.json and logs/bench-mtp-on.log for comparison.

4.3 Engine-reported config (the MTP-off recipe)

From logs/engine-boot.log:

4.4 Pitfalls inherited from the Qwen397 recipe

These are flagged in recipes/ornith-1.0-397b-fp8.md and the Qwen3.5-397B post; only the surprising ones are listed here.

  1. --gpu-memory-utilization 0.92 has crashed the head node twice on the same architecture. 0.90 is the firm ceiling.
  2. --load-format safetensors, not fastsafetensors — the latter CUDA-OOMs on the final shards. (Also: fastsafetensors is broken on MoE at TP=2 with a duplicate qweight key.)
  3. --compilation-config.cudagraph_mode none is required when MTP is on. It's also what we ran without MTP. Setting cudagraph mode FULL does not survive MTP's variable-step input shape and the boot crashes during graph capture.
  4. Local path, not HF repo ID, in vllm serve — repo ID triggers a duplicate Hub cache write that doesn't fit.
  5. docker ps Up ≠ engine alive. vllm-node stays Up after EngineDeadError because the Ray launcher PID 1 persists. Always check /health and a smoke chat completion before claiming "serving".
  6. --reasoning-parser qwen3 emits chain-of-thought via delta.reasoning in the SSE stream (not delta.reasoning_content). If your bench client counts only delta.content for TTFT, you'll get wall-time as TTFT and decode_tok_s=None. Read both keys.

5. Files in this folder

Shared launcher scripts at ../resources/scripts/relaunch-ornith-fp8-tp4.sh (production) and ../resources/scripts/relaunch-ornith-fp8-mtp-tp4.sh (MTP-on baseline, kept so §4.2's comparison can be re-run).


6. Prior art

@misc{ornith_397b,
    title  = {{Ornith-1.0-397B}: Agentic Coding, Open to All},
    url    = {https://deep-reinforce.com/ornith_1_0.html},
    author = {{DeepReinforce Team}},
    year   = {2026}
}