The Hardware Rack

The four nodes are ASUS Ascent GX10 boxes. They sit in the same GB10 software world as the NVIDIA-branded Spark machines, which is why so many notes and scripts around this setup use "Spark" as shorthand, but the actual hardware on the bench is ASUS.

This page is about the box of parts around the model posts: four GX10s, a small 400G MikroTik switch, too many DAC cables, and a bunch of decisions that were less glamorous than "buy a big GPU" but mattered more than expected.

I sold my old 8x3090 setup to help pay for this. I do not miss it. The 3090 rig was useful, but it was also heat, noise, risers, power, and the constant feeling that a tiny physical issue could take the whole evening. The GX10 cluster is still fussy, just in more interesting ways.

What I Bought

Part	Notes
4x ASUS Ascent GX10	NVIDIA GB10 Grace Blackwell, 128 GB unified memory per node, CUDA/vLLM stack, small enough that the cluster does not feel like a server-room commitment.
MikroTik CRS804-4DDQ	Four 400G QSFP-DD ports, which is the right shape for four nodes without a loud surplus datacenter switch.
NADDOD 200G DACs	Cheap enough to try, annoying enough to write about.
Existing control machine	Downloads, notes, blog build, and general orchestration stay off the GX10s.

The GX10 is interesting because it is not really a normal desktop and not really a normal server. ASUS wraps the NVIDIA GB10 platform in a little box with 128 GB LPDDR5x coherent unified memory, DGX OS, ConnectX-7, and a power budget that makes sense at home. ASUS lists peak system power at 240W, with the GB10 SoC at 140W and the rest for things like ConnectX-7, SSD, and USB-C.

That is a big part of why I went this way. The old 3090 machine could move tokens, but it made every experiment feel like I was firing up shop equipment. The GX10s are still real hardware, but they are the kind of real hardware I can leave on.

Sources:

Why Four Small Boxes Instead of One Big Mac

The Mac Studio M3 Ultra was the obvious thing to look at. It is quiet, clean, has a huge unified memory option, and would have been much simpler as a single machine. Apple will sell you up to 512 GB unified memory, and for a lot of local inference work that is a serious amount of room.

The part I could not get comfortable with was the ceiling.

I am trying to move toward 1 TB of usable model memory. Not because every model needs that today, but because memory is what keeps options open. More memory means less panic around quantization, more KV cache, longer contexts, and fewer launches where the whole plan depends on shaving off one more GiB.

Two Mac Studios do not become one 1 TB accelerator. They become two nice computers and a distributed-systems problem. For this project, I wanted the boring path to be the CUDA/NCCL/vLLM path, because that is where most of the large-model tooling already lands.

I am also just suspicious of quantization as the entire hardware strategy. Quantization is useful. I use it constantly. But I do not want to buy hardware on the assumption that every future model will be fine after being squeezed down hard. When a model gets weird after quantization, you can lose days arguing with the software before admitting the simple thing: you needed more memory.

Sources:

Why Spend This Much

The reason is not a benchmark trophy. It is that the cluster is good enough to use.

Qwen3.5 397B with MTP became my default because it crosses the line from "neat demo" to "tool I can leave wired into OpenClaude, Pi, and OpenCode." The prefill speeds are good enough that long-context work is not purely academic, and the serving shape is normal OpenAI-compatible vLLM rather than a bespoke science project for every model.

Four GX10s give 512 GB raw unified memory across the cluster. Eight would be the setup I really want: 1 TB raw, more room for KV, less pressure to choose the most aggressive quant, and more freedom to treat 400B-500B class models as normal workloads instead of special occasions.

That is the thing I kept coming back to when spending the money felt stupid: the 8x3090 rig was already a pile of compromises. This is also a pile of compromises, but they are pointed in the direction I actually care about.

RAM Is Still the Whole Game

The annoying part is that I bought the memory machines and still spend a lot of time thinking about memory.

Each GX10 is sold as 128 GB unified memory. vLLM does not get to spend 128 GB. There are reservations, driver overhead, kernel memory, container overhead, and then KV cache. The model can be "loaded" and still be one real request away from falling over if the launch flags are too optimistic.

The pain shows up in small places:

max_num_seqs becomes a serious design parameter.
Long context and concurrency fight each other.
MTP is not free; draft heads also need memory.
A launch flag that looks harmless can fail only when the first heavy prefill arrives.

There was also a broader reminder around this class of hardware: DGX Spark pricing reportedly moved up during a memory shortage cycle. My boxes are ASUS GX10s, not NVIDIA-branded Sparks, but they live in the same GB10 / 128 GB unified-memory world. Compute is easy to market. Memory is what decides whether the model actually stays up.

Source:

Tom's Hardware on DGX Spark price increase and memory shortages

NVFP4 Has Not Been Magic

NVFP4 was part of the attraction. On paper it is exactly what this kind of lab wants: smaller giant models, Blackwell-era support, less memory pressure, and speed that does not feel like a penalty box.

In practice it is another tool, not a magic door. Some checkpoints are great. Some need newer vLLM pieces. Some hit kernel or parser edges. Some boot and then do something strange enough that you start questioning the whole stack. And even when the weights are smaller, KV cache still exists, routing state still exists, MTP still costs memory, and the server still needs room to breathe.

I still want NVFP4 to work. I just do not trust it enough to make "buy less RAM" the plan.

The Switch

The switch is a MikroTik CRS804-4DDQ. It gives me four 400G QSFP-DD ports in a compact 1U box, plus two 10G management ports. For four GX10s that is the right shape: one high speed port per node, no daisy-chain games, no giant used datacenter switch screaming in the corner.

It is not fancy in the way GPUs are fancy. It is plumbing. But once there are more than two nodes, the plumbing becomes the project. Weight staging, Ray, NCCL, RoCE, interface selection, MTU, and all the other boring details either work through the fabric or they do not work.

Sources:

The Cable Problem

The NADDOD DACs were the part that made me feel like I was back in hardware debug land.

The symptoms were simple and irritating: after a cold power cycle, the switch could see that modules were present, but module initialization failed until the cables were physically reseated. Pull the tab, push it back in, and the link would come up. Leave it alone through the next cold boot and the problem would come back.

That is a miserable failure mode because it looks like five different things before it looks like the cable path. It can smell like RouterOS, a bad port, RoCE config, autoneg, FEC, firmware, or a driver issue. Then you reseat the cable and everything trains.

The lesson is boring and useful: for 200G/400G work, cable compatibility is part of the system design. QSFP56, QSFP-DD, breakout assumptions, EEPROM coding, and switch firmware all matter. A store page saying "compatible" is not the same as "known good with this NIC, this switch, and this firmware."

Sources:

What I Would Change

I would go to 8x if I could.

Four nodes made the project real. Eight nodes would make it feel less tight: 1 TB raw unified memory, more room for cache, fewer quantization compromises, and less time spent asking whether the launch is clever enough to fit.

The old 8x3090 setup got me here, but this is the direction I wanted to move: lower power, cleaner software path, more memory per node, and a cluster that feels like it can grow instead of a one-off rig I am always nursing along.