Journeys in Local LLMs

Field notes, benchmarks, launch recipes, and failure modes from running large local models on a 4-node DGX Spark cluster. Each model post starts with the usage story first, then the benchmark evidence, then the exact commands and technical details needed to reproduce the run.

Posts

hardware/ — why this lab is 4× DGX Spark, what the switch/cabling setup looks like, and why scaling memory mattered more than buying a single huge Mac
qwen-3.5-397b/ — Qwen3.5-397B-A17B-FP8 with MTP (qwen3_next_mtp, num_speculative_tokens=2)
nemotron-3-ultra/ — Nemotron-3-Ultra-550B-A55B-NVFP4 with MTP (nemotron_h_mtp, num_speculative_tokens=3)
nex-n2-pro-fp8/ — Nex-N2-Pro-fp8 (Qwen3.5-397B-A17B fine-tune) — FP8 loader patch + MTP-off rebench
ornith-1.0-397b-fp8/ — Ornith-1.0-397B-FP8 (Qwen3.5-397B-A17B fine-tune) — compressed-tensors loads native, MTP off

Resources

Shared scripts and infra docs live in resources/:

INFRA.md — cluster hardware, network, image distribution, common gotchas
WHEEL-PROVENANCE.md — vLLM + flashinfer git SHAs, build steps
bench.py — shared OpenAI-compatible concurrency + NIAH bench
recipes/ — per-model launch recipes (yaml)
Operational scripts, raw logs, and mod wrapper files are kept private because they contain site-local topology, usernames, and filesystem paths.

Cluster

4× NVIDIA DGX Spark (GB10 Blackwell, 128 GiB unified memory each), interconnected via ConnectX-7 200 GbE fabric. Hostnames, usernames, and private addresses are intentionally redacted from the public notes. Full inventory in resources/INFRA.md.