Ornith-1.0-397B-FP8 on 4x ASUS GX10
Date: 2026-06-30
Model: deepreinforce-ai/Ornith-1.0-397B-FP8 — DeepReinforce's MIT-licensed 397B agentic coding model, FP8 release
Hardware: 4x ASUS Ascent GX10 (NVIDIA GB10, 128 GiB unified per node, 200 GbE ConnectX-7 fabric), TP=4 via Ray
Image: vllm-node-mimo:latest (vllm 0.22.1rc1.dev124+gace95c9cf.d20260603.cu132)
Served at: http://<head-node>:8000, model name Ornith-1.0-397B-FP8
Field notes
Ornith is the one I am warming up to. It is still early, but so far it seems to mess around less with general tool calling than the other Qwen3.5-derived runs. That matters more in actual local-agent use than a small leaderboard-style quality delta: a model that calls tools plainly, does not over-negotiate the next step, and keeps the loop moving is often the one that feels better in OpenClaude, Pi, or OpenCode.
The caveat is the same one Nex exposed: the shipped MTP head does not accept against the FP8 target, so the model has to run without speculative decoding. That keeps it behind Qwen397 on raw decode speed, but the behavioral profile is interesting enough that I still want it in the rotation. If Qwen397 is the fast default, Ornith is the current candidate for "maybe the tool-use temperament is better."
Ornith is post-trained from Qwen3.5 (Qwen3_5MoeForConditionalGeneration: same 60 hybrid layers, 512 experts × 10 active, single-head MTP, 262 144 ctx). The FP8 release uses compressed-tensors per-channel weights + per-token dynamic activations, which vLLM loads natively. The launch is the Qwen3.5-397B-A17B-FP8 recipe with three swaps: model path, served-model-name, and --tool-call-parser qwen3_xml (per the Ornith card). The recipe ships with MTP enabled, but the head lands at 0% acceptance on this checkpoint; disabling it is what produces the numbers in §1.
Benchmarks
1. TL;DR — what we got
Single concurrency sweep on the cluster, FP8 KV cache, max_num_seqs=16, max_model_len=262 144, temperature=0.7, ~1 040-token prompt × 512 decode tokens.
| N | Wall (s) | Aggregate decode (t/s) | Median per-req decode (t/s) | Median TTFT (s) |
|---|---|---|---|---|
| 1 | 28.5 | 18.0 | 18.8 | 1.36 |
| 2 | 30.2 | 33.9 | 17.6 | 1.24 |
| 4 | 36.1 | 56.7 | 15.1 | 2.26 |
| 8 | 55.6 | 72.4 | 10.3 | 6.19 |
| 16 | 69.5 | 112.8 | 7.9 | 4.75 |
Long-context single-stream NIAH-shaped prefill probe at ~200 k input tokens — TTFT 125.3 s ⇒ 1 597 t/s prefill, then 18.7 t/s decode.
Raw JSON: logs/results-mtp-off.json. Bench progress: logs/bench-mtp-off.log.
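The prefill figure is just input tokens divided by time to first token. A minimal sketch of that arithmetic, assuming an exact 200 000-token prompt (the notes only say ~200 k, so the last digit can differ from the reported 1 597 t/s); the function name is illustrative and not part of the harness:

```python
def prefill_tokens_per_s(input_tokens: int, ttft_s: float) -> float:
    """Prefill throughput, treating TTFT as pure prefill time."""
    if ttft_s <= 0:
        raise ValueError("ttft_s must be positive")
    return input_tokens / ttft_s


# Assumed prompt size: exactly 200 000 tokens (the source says "~200 k").
rate = prefill_tokens_per_s(200_000, 125.3)
print(f"{rate:.0f} t/s prefill")
```

Treating TTFT as prefill time slightly understates prefill speed, since TTFT also includes queueing and the first decode step.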
How it stacks up on the same hardware
Same 4x GX10 hardware, same image, same launcher, same bench harness:
| Model | Single-stream decode (t/s) | Agg @ N=16 (t/s) | 200 k prefill (t/s) | Notes |
|---|---|---|---|---|
| Qwen3.5-397B-A17B-FP8 (MTP-2) | 40.0 | 186 | 1 584 | MTP ~86% accept → ~1.7× decode multiplier |
| Ornith-1.0-397B-FP8 (MTP off) | 18.0 | 113 | 1 597 | Same arch; vendor-shipped MTP head at 0% accept — see §4.2 |
| Nex-N2-Pro-fp8 (MTP off) | 19.6 | 114 | 1 519 | Same arch; same MTP issue; FP8 loader patch required |
Prefill numbers are essentially identical across all three (same prefill kernels, same FP8 arithmetic). Decode is where MTP shows up: Qwen-team's checkpoint ships an MTP head trained against the FP8 base and gets ~1.7× decode, while both fine-tuned variants (Ornith and Nex) ship MTP heads that the FP8 target rejects outright (0% acceptance, apparently a casualty of the fine-tune; see §4.2) and have to run target-only. That's the gap.
2. Setup at a glance
| Setting | Value |
|---|---|
| Topology | 4x GX10, head |
| Distributed backend | Ray, TP=4 |
| Image | vllm-node-mimo:latest (image id 75429f413d11) |
| Model on disk | <spark-model-root>/deepreinforce-ai/Ornith-1.0-397B-FP8/ (377 GiB per node, 122 shards) |
| Context length | 262 144 |
| KV cache dtype | fp8 |
| Tool / reasoning parsers | qwen3_xml / qwen3 |
| Speculative decoding | none (Ornith MTP head doesn't accept against the FP8 base; see §4.2) |
| API | http://<head-node>:8000 |
| Boot time (cold) | ~20.5 min to /health=200 (122 shard load + 121 s engine init) |
| Available KV @ TP1 (tightest) | 4.98 GiB |
| GPU KV cache pool | 659 331 tokens |
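To put the KV pool in perspective, here is a rough budget calculation. It is a sketch that assumes every sequence draws from one shared token pool and ignores prefix-cache sharing; the helper names are made up for illustration:

```python
KV_POOL_TOKENS = 659_331  # GPU KV cache pool reported by the engine
MAX_MODEL_LEN = 262_144   # --max-model-len
MAX_NUM_SEQS = 16         # --max-num-seqs


def full_length_seqs(pool_tokens: int, max_len: int) -> float:
    """How many maximum-length sequences fit in the pool at once."""
    return pool_tokens / max_len


def per_seq_budget(pool_tokens: int, n_seqs: int) -> int:
    """Average context each sequence can hold when n_seqs run together."""
    return pool_tokens // n_seqs


full = full_length_seqs(KV_POOL_TOKENS, MAX_MODEL_LEN)
budget = per_seq_budget(KV_POOL_TOKENS, MAX_NUM_SEQS)
print(f"{full:.2f} full-length sequences fit in the pool")
print(f"{budget} tokens of context per sequence at {MAX_NUM_SEQS} concurrent")
```

So the pool holds about two and a half full-length contexts, or roughly 41 k tokens each when all 16 slots are busy.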
Pinned wheels in the image:
| Package | Version | Why |
|---|---|---|
| torch | 2.11.0+cu130 | Fresh vLLM wheel ships metadata pinning torch==2.10.0 but the C++ ABI actually needs 2.11.0. Force-reinstall the cu130 wheel as a separate Dockerfile RUN AFTER vllm install with --no-deps --force-reinstall, otherwise vllm dies at import with ImportError: libtorch_cuda.so. |
| transformers | ≥5.0.0 | qwen3_5_moe + qwen3_next_mtp config classes live in 5.x |
| vllm | 0.22.1rc1.dev124+gace95c9cf.d20260603.cu132 | local wheel; carries the Qwen3_5MoeForConditionalGeneration model code |
| flashinfer-python | 0.6.12 | FP8 KV + long-context attention path |
Full image lineage and Dockerfile details are documented in the Qwen3.5-397B post; same image. Tar backup at <control-workspace>/docker-images/vllm-node-mimo.tar.
3. Reproduce
Launch the cluster
Relaunch wrapper at ../resources/scripts/relaunch-ornith-fp8-tp4.sh (the MTP-on variant kept alongside for reference is relaunch-ornith-fp8-mtp-tp4.sh):
ssh -i ~/.ssh/<spark-key> <spark-user>@<head-node> \
'cd ~/spark-vllm-docker && ./relaunch-ornith-fp8-tp4.sh'
The wrapper stops any stale vllm / ray / RayWorkerProc processes on all four nodes (PID walk + SIGTERM/SIGKILL), drops the page cache (drop_caches), then dispatches the cluster via launch-cluster.sh. The expanded vllm serve it runs on each node:
vllm serve /root/.cache/huggingface/deepreinforce-ai/Ornith-1.0-397B-FP8 \
--served-model-name Ornith-1.0-397B-FP8 \
--host 0.0.0.0 --port 8000 \
--max-model-len 262144 \
--gpu-memory-utilization 0.90 \
--load-format safetensors \
--enable-prefix-caching \
--enable-auto-tool-choice \
--tool-call-parser qwen3_xml \
--reasoning-parser qwen3 \
--max-num-batched-tokens 8192 \
--max-num-seqs 16 \
--trust-remote-code \
-tp 4 \
--distributed-executor-backend ray \
--mm-encoder-tp-mode data \
--kv-cache-dtype fp8 \
--compilation-config.cudagraph_mode none \
--attention-backend flashinfer
Env passed in by the launcher: HF_HUB_OFFLINE=1 TRANSFORMERS_OFFLINE=1 PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True VLLM_USE_DEEP_GEMM=0 VLLM_USE_FLASHINFER_MOE_FP16=1 VLLM_USE_FLASHINFER_SAMPLER=0 OMP_NUM_THREADS=4.
Verify the engine is up:
python3 -c "import urllib.request; print(urllib.request.urlopen('http://<head-node>:8000/health', timeout=4).status)"
# 200
python3 - <<'PY'
import json, urllib.request as u
r = u.urlopen(u.Request("http://<head-node>:8000/v1/chat/completions",
    data=json.dumps({"model":"Ornith-1.0-397B-FP8",
        "messages":[{"role":"user","content":"Reply with the single word: alive"}],
        "max_tokens":256,"temperature":0}).encode(),
    headers={"Content-Type":"application/json"}), timeout=120)
print(json.loads(r.read())["choices"][0]["message"])
PY
Note max_tokens=256 — the reasoning parser needs budget for <think>…</think> plus the answer. With max_tokens=50 the reasoning eats the budget and content returns null.
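A small check makes the budget failure obvious instead of leaving you staring at a null. This is a sketch, not part of the harness: classify_smoke_reply is a hypothetical helper, and it assumes the server reports the OpenAI-style finish_reason of "length" when max_tokens cuts the generation off.

```python
def classify_smoke_reply(response: dict) -> str:
    """Classify a /v1/chat/completions reply from the smoke test.

    Returns "ok" when an answer came back, "truncated-in-reasoning" when
    the token budget ran out before any answer text, and "empty" otherwise.
    """
    choice = response["choices"][0]
    content = choice.get("message", {}).get("content")
    if content:
        return "ok"
    # Assumption: OpenAI-compatible servers set finish_reason="length"
    # when generation stops at max_tokens.
    if choice.get("finish_reason") == "length":
        return "truncated-in-reasoning"
    return "empty"


good = {"choices": [{"message": {"content": "alive"}, "finish_reason": "stop"}]}
starved = {"choices": [{"message": {"content": None}, "finish_reason": "length"}]}
print(classify_smoke_reply(good), classify_smoke_reply(starved))
```

If the smoke test reports truncated-in-reasoning, raise max_tokens before suspecting the engine.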
Bench
# Control node — Python 3.14 with apt's python3-requests.
/usr/bin/python3 <control-workspace>/scripts/bench_ornith_concurrency.py \
--levels 1,2,4,8,16 --decode-tokens 512 --long-ctx 200000
Streaming bench uses requests + iter_lines() (not urllib — urlopen buffers SSE and breaks TTFT/per-stream timing). It also reads both delta.content and delta.reasoning for token-arrival events because the qwen3 reasoning parser emits the chain-of-thought via delta.reasoning in vLLM's stream chunks.
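The timing logic described above can be sketched roughly as follows. This is not the actual logs/bench.py; time_stream and its return keys are illustrative, and it counts stream chunks as arrival events, which only approximates tokens because a chunk may carry more than one.

```python
import json
import time
from typing import Callable, Iterable, Optional


def time_stream(lines: Iterable, clock: Callable[[], float] = time.perf_counter) -> dict:
    """Measure TTFT and decode time from an iterable of SSE lines."""
    start = clock()
    first: Optional[float] = None
    events = 0
    for raw in lines:
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        line = line.strip()
        if not line.startswith("data:"):
            continue  # blank keep-alive lines and SSE comments
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        choices = json.loads(payload).get("choices") or []
        if not choices:
            continue
        delta = choices[0].get("delta", {})
        # Count reasoning chunks too, otherwise TTFT collapses to wall time.
        if delta.get("content") or delta.get("reasoning"):
            events += 1
            if first is None:
                first = clock()
    end = clock()
    return {
        "ttft_s": None if first is None else first - start,
        "decode_s": None if first is None else end - first,
        "events": events,
    }
```

With requests, the input would be something like requests.post(url, json=body, stream=True).iter_lines(), where body sets "stream": true. The injectable clock is only there so the function can be tested without a server.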
Harness: logs/bench.py. Output JSON: logs/results-mtp-off.json.
4. Details
4.1 Why no loader patch was needed
Ornith ships FP8 weights via compressed-tensors format:
"quantization_config": {
  "quant_method": "compressed-tensors",
  "config_groups": {
    "config_group_0": {
      "weights": {"num_bits": 8, "strategy": "channel", "symmetric": true, ...},
      "input_activations": {"num_bits": 8, "strategy": "token", "dynamic": true, ...}
    }
  }
}
That is per-channel weights + per-token dynamic activations, registered through vLLM's native compressed-tensors loader. The Qwen-team Qwen3.5-397B-A17B-FP8 release uses block-FP8 (weight_block_size=[128,128]) with .weight_scale_inv-suffixed scales; the Nex-N2-Pro-fp8 release uses block-FP8 with .weight_scale suffix and needed the mods/nex-fp8-loader patch to rename. Ornith hits neither code path — compressed-tensors MoE loading lands on the right registered slots for the per-channel scheme on the first try. No mod, no KeyError, no warning lines.
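Before launching a new FP8 fine-tune it is worth checking which of these schemes its config.json declares. A minimal sketch under stated assumptions: describe_quant is a hypothetical helper, and the "fp8" quant_method used for the block-FP8 example is my assumption about how such releases label themselves, not something taken from the logs.

```python
def describe_quant(config: dict) -> str:
    """Summarise the quantization scheme declared in a model config dict."""
    q = config.get("quantization_config")
    if not q:
        return "unquantized"
    method = q.get("quant_method", "unknown")
    if "weight_block_size" in q:
        # Block-FP8 style: one scale per weight tile.
        block = "x".join(str(n) for n in q["weight_block_size"])
        return f"{method}: block weights {block}"
    parts = []
    groups = q.get("config_groups", {})
    for name in sorted(groups):
        weights = groups[name].get("weights") or {}
        acts = groups[name].get("input_activations") or {}
        mode = "dynamic" if acts.get("dynamic") else "static"
        parts.append(
            f"{weights.get('num_bits')}-bit {weights.get('strategy')} weights, "
            f"{mode} {acts.get('strategy')} activations"
        )
    return f"{method}: " + "; ".join(parts)


ornith_like = {"quantization_config": {
    "quant_method": "compressed-tensors",
    "config_groups": {"config_group_0": {
        "weights": {"num_bits": 8, "strategy": "channel", "symmetric": True},
        "input_activations": {"num_bits": 8, "strategy": "token", "dynamic": True},
    }},
}}
# Assumed shape of a block-FP8 config; only weight_block_size comes from the notes.
block_like = {"quantization_config": {"quant_method": "fp8", "weight_block_size": [128, 128]}}
print(describe_quant(ornith_like))
print(describe_quant(block_like))
```

In practice the dict would come from json.load on the checkpoint's config.json.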
Engine init confirms: quantization=compressed-tensors, init engine (profile, create kv cache, warmup model) took 121.25 s (compilation: 46.97 s). Engine boot log: logs/engine-boot.log.
4.2 The MTP-acceptance check (and why MTP is off)
The first launch used the Qwen3.5-397B-A17B-FP8 recipe verbatim — including its --speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":2}' flag. The drafter loaded cleanly and ran every step. But every step looked like this:
SpecDecoding metrics: Mean acceptance length: 1.00,
Accepted throughput: 0.00 tokens/s,
Drafted throughput: 204.74 tokens/s,
Accepted: 0 tokens, Drafted: 2048 tokens,
Per-position acceptance rate: 0.000, 0.000,
Avg Draft acceptance rate: 0.0%
Same exact pattern as Nex-N2-Pro-fp8: the MTP head drafts, the target rejects everything. The leading hypothesis is that the vendor's FP8 release ships an MTP head that was trained against the BF16 base (or quantized through a path the vLLM draft loader doesn't reproduce); the draft logits diverge from the FP8 target enough that no speculation lands. Qwen-team's Qwen3.5-397B-A17B-FP8 does not exhibit this — it runs at ~86% MTP acceptance — so the fine-tune is the difference, not the architecture.
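The numbers in that log line hang together in a simple way. The sketch below is my reading of the relationship, not code lifted from vLLM: it assumes each step drafts num_speculative_tokens tokens and that the target always contributes one token per step on top of whatever drafts it accepts.

```python
def spec_decode_summary(accepted: int, drafted: int, num_speculative_tokens: int) -> dict:
    """Derive acceptance rate and mean acceptance length from raw counters."""
    if drafted <= 0 or num_speculative_tokens <= 0:
        raise ValueError("drafted and num_speculative_tokens must be positive")
    # Assumption: every draft step proposes exactly num_speculative_tokens.
    num_drafts = drafted / num_speculative_tokens
    return {
        "draft_acceptance_pct": 100.0 * accepted / drafted,
        # One target token per step, plus the accepted drafts per step.
        "mean_acceptance_length": 1.0 + accepted / num_drafts,
    }


# Counters from the Ornith MTP-on log: 0 accepted out of 2048 drafted, 2 per step.
ornith = spec_decode_summary(accepted=0, drafted=2048, num_speculative_tokens=2)
print(ornith)
```

A mean acceptance length of exactly 1.00 is therefore the floor: the drafter contributes nothing and every emitted token comes from the target.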
The cost of running a 0%-accept drafter is not zero: every step pays draft compute + target verification on N+1 positions, plus the drafter's parameters claim ~2 GiB of GPU memory per rank that would otherwise be KV cache. The fix is to drop --speculative-config. Numbers per N:
| N | MTP-on agg decode (t/s) | MTP-off agg decode (t/s) | Δ |
|---|---|---|---|
| 1 | 14.2 | 18.0 | +27% |
| 2 | 23.2 | 33.9 | +46% |
| 4 | 29.2 | 56.7 | +94% |
| 8 | 55.3 | 72.4 | +31% |
| 16 | 80.3 | 112.8 | +40% |
Single-stream decode: 14.5 → 18.8 t/s. Long-context 200 k decode: 14.6 → 18.7 t/s. Prefill is unchanged (1 584 → 1 597 t/s) — MTP is a decode-side feature; the prefill kernels don't care.
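The Δ column in the table is a plain relative gain, rounded to the nearest percent. A quick way to recompute it from the aggregate decode figures (the dict and function names are just for illustration):

```python
# Aggregate decode throughput (t/s) per concurrency level, from the table.
MTP_ON = {1: 14.2, 2: 23.2, 4: 29.2, 8: 55.3, 16: 80.3}
MTP_OFF = {1: 18.0, 2: 33.9, 4: 56.7, 8: 72.4, 16: 112.8}


def gain_pct(mtp_on: float, mtp_off: float) -> int:
    """Relative throughput gain from disabling MTP, in whole percent."""
    return round((mtp_off / mtp_on - 1.0) * 100.0)


gains = {n: gain_pct(MTP_ON[n], MTP_OFF[n]) for n in MTP_ON}
for n, g in gains.items():
    print(f"N={n:>2}: +{g}%")
```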
The raw MTP-on results are in logs/results-mtp-on.json and logs/bench-mtp-on.log for comparison.
4.3 Engine-reported config (the MTP-off recipe)
From logs/engine-boot.log:
- speculative_config=None, quantization=compressed-tensors, kv_cache_dtype=fp8, tensor_parallel_size=4
- Available KV cache memory: 4.98 GiB on the tightest TP rank, 8.4 GiB on TP0
- GPU KV cache size: 659 331 tokens
- init engine (profile, create kv cache, warmup model) took 121.25 s (compilation: 46.97 s)
- Cold boot to /health=200: ~20.5 min (122 shards × ~3.1 GiB at ~9 s/shard rank-parallel + 121 s profile/compile)
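The cold-boot estimate is simple enough to recompute when the shard count or disk speed changes. A rough sketch, assuming shards load strictly one after another at the quoted ~9 s each (the function name is hypothetical):

```python
def cold_boot_minutes(shards: int, seconds_per_shard: float, engine_init_s: float) -> float:
    """Estimated minutes from launch to /health=200."""
    return (shards * seconds_per_shard + engine_init_s) / 60.0


# 122 shards at roughly 9 s each, plus the 121.25 s engine init from the boot log.
estimate = cold_boot_minutes(122, 9.0, 121.25)
print(f"~{estimate:.1f} min")
```

The result lands a little under the observed ~20.5 min, which is expected given that 9 s/shard is itself a rounded figure.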
4.4 Pitfalls inherited from the Qwen397 recipe
These are flagged in recipes/ornith-1.0-397b-fp8.md and the Qwen3.5-397B post; only the surprising ones are listed here.
- --gpu-memory-utilization 0.92 has crashed the head node twice on the same architecture. 0.90 is the firm ceiling.
- --load-format safetensors, not fastsafetensors: the latter CUDA-OOMs on the final shards. (Also: fastsafetensors is broken on MoE at TP=2 with a duplicate qweight key.)
- --compilation-config.cudagraph_mode none is required when MTP is on. It's also what we ran without MTP. Setting cudagraph mode FULL does not survive MTP's variable-step input shape and the boot crashes during graph capture.
- Local path, not HF repo ID, in vllm serve: repo ID triggers a duplicate Hub cache write that doesn't fit.
- docker ps Up ≠ engine alive. vllm-node stays Up after EngineDeadError because the Ray launcher PID 1 persists. Always check /health and a smoke chat completion before claiming "serving".
- --reasoning-parser qwen3 emits chain-of-thought via delta.reasoning in the SSE stream (not delta.reasoning_content). If your bench client counts only delta.content for TTFT, you'll get wall-time as TTFT and decode_tok_s=None. Read both keys.
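The "Up ≠ alive" pitfall is easy to guard against with a two-step gate: /health first, then a real completion. The following is a sketch rather than the script we actually use; engine_alive is a hypothetical name, and the opener parameter exists only so the function can be exercised without a live server.

```python
import json
import urllib.request
from typing import Callable


def engine_alive(base_url: str, model: str,
                 opener: Callable = urllib.request.urlopen) -> bool:
    """True only if /health returns 200 AND a chat completion comes back."""
    try:
        with opener(f"{base_url}/health", timeout=4) as resp:
            if resp.status != 200:
                return False
        body = json.dumps({
            "model": model,
            "messages": [{"role": "user", "content": "Reply with the single word: alive"}],
            "max_tokens": 256,  # leave room for the reasoning block
            "temperature": 0,
        }).encode()
        req = urllib.request.Request(
            f"{base_url}/v1/chat/completions", data=body,
            headers={"Content-Type": "application/json"})
        with opener(req, timeout=120) as resp:
            reply = json.loads(resp.read())
    except (OSError, ValueError):
        # URLError/HTTPError are OSError subclasses; bad JSON is a ValueError.
        return False
    return bool(reply.get("choices"))
```

Called as engine_alive("http://<head-node>:8000", "Ornith-1.0-397B-FP8"), it returns False for a container that is Up but whose engine has died, as long as the completion request fails or times out.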
5. Files in this folder
- README.md: this file
- logs/bench.py: concurrency + long-ctx harness (requests + iter_lines)
- logs/results-mtp-off.json: the MTP-off bench (the one §1's numbers came from)
- logs/bench-mtp-off.log: MTP-off bench stderr
- logs/results-mtp-on.json: the original MTP-on bench (for the §4.2 comparison)
- logs/bench-mtp-on.log: MTP-on bench stderr
- logs/engine-boot.log: docker logs ornith-fp8-tp4 from the MTP-off boot
Shared launcher scripts at ../resources/scripts/relaunch-ornith-fp8-tp4.sh (production) and ../resources/scripts/relaunch-ornith-fp8-mtp-tp4.sh (MTP-on baseline, kept so §4.2's comparison can be re-run).
6. Prior art
- Qwen3.5-397B-A17B-FP8 4x GX10 post: the recipe this is derived from. Same architecture (Qwen3_5MoeForConditionalGeneration, 60 hybrid linear/full layers, 512 experts top-10, single-head MTP), same image, same launcher. Three swaps: model path, served name, tool-call parser.
- Nex-N2-Pro-fp8 4x GX10 post: sibling fine-tune of the same Qwen3.5-397B-A17B base. Documents the FP8 loader patch (not needed for Ornith, which uses a different quant scheme) and the MTP 0%-accept finding that recurs here.
- Ornith HF page: model card; sources --tool-call-parser qwen3_xml, --reasoning-parser qwen3, and the BibTeX citation.
- Local recipe doc (recipes/ornith-1.0-397b-fp8.md): the launch recipe, pitfalls, and verification commands the operator reads before relaunching.
@misc{ornith_397b,
title = {{Ornith-1.0-397B}: Agentic Coding, Open to All},
url = {https://deep-reinforce.com/ornith_1_0.html},
author = {{DeepReinforce Team}},
year = {2026}
}