Journeys in Local LLMs
Field notes, benchmarks, launch recipes, and failure modes from running large local models on a 4-node DGX Spark cluster. Each model post starts with the usage story first, then the benchmark evidence, then the exact commands and technical details needed to reproduce the run.
Posts
- hardware/ — why this lab is 4× DGX Spark, what the switch/cabling setup looks like, and why scaling memory mattered more than buying a single huge Mac
- qwen-3.5-397b/ — Qwen3.5-397B-A17B-FP8 with MTP (qwen3_next_mtp, num_speculative_tokens=2)
- nemotron-3-ultra/ — Nemotron-3-Ultra-550B-A55B-NVFP4 with MTP (nemotron_h_mtp, num_speculative_tokens=3)
- nex-n2-pro-fp8/ — Nex-N2-Pro-fp8 (Qwen3.5-397B-A17B fine-tune) — FP8 loader patch + MTP-off rebench
- ornith-1.0-397b-fp8/ — Ornith-1.0-397B-FP8 (Qwen3.5-397B-A17B fine-tune) — compressed-tensors loads native, MTP off
Resources
Shared scripts and infra docs live in resources/:
- INFRA.md — cluster hardware, network, image distribution, common gotchas
- WHEEL-PROVENANCE.md — vLLM + flashinfer git SHAs, build steps
- bench.py — shared OpenAI-compatible concurrency + NIAH bench
- recipes/ — per-model launch recipes (yaml)
- Operational scripts, raw logs, and mod wrapper files are kept private because they contain site-local topology, usernames, and filesystem paths.
Cluster
4× NVIDIA DGX Spark (GB10 Blackwell, 128 GiB unified memory each), interconnected via ConnectX-7 200 GbE fabric. Hostnames, usernames, and private addresses are intentionally redacted from the public notes. Full inventory in resources/INFRA.md.