AMD's Trillion-Parameter Mini PC Cluster: What the Spec Sheet Skips

A year ago, running a trillion-parameter language model meant a server room. Racks, cooling, a power bill that needed its own meeting. Then AMD published a developer write-up showing four mini PCs sitting on a desk (the kind you could carry two at a time) doing the same job. Four identical little boxes, cabled together, running a model with more parameters than there are stars you can see from a city street.

The headline writes itself: "No cloud. No data center." And it's true. AMD really did run a 1.04-trillion-parameter model on four Framework Desktop systems with consumer silicon inside them.

But there's a part the headline skipped, and it's the part that decides whether this is a milestone or a magic trick. There's an architecture detail that makes "trillion parameters" technically honest, a catch that determines whether you could actually use this thing, and a reason it matters more than either the hype or the backlash gives it credit for.

The Short Version

The model is Kimi K2.5, and it's a Mixture-of-Experts design: 1.04 trillion total parameters, but only about 32 billion of them fire on any given token. "Trillion-parameter model" is accurate; the per-token compute is closer to a 32B-class workload.
The cluster generates around 8 to 9.5 tokens per second, with a time-to-first-token anywhere from 39.7 to 239.1 seconds depending on how long your prompt is. Fine for batch work. Brutal for an interactive coding loop.
The thing that changed isn't the speed. It's that unified memory put frontier-scale inference on hardware you can buy and set on a shelf, a category that used to start at "own a datacenter."

What AMD Actually Did

The setup is almost anticlimactic once you see it laid out. Four Framework Desktop machines, each carrying a Ryzen AI Max+ 395 and 128 GB of LPDDR5X unified memory. In BIOS, each node can expose up to 96 GB as dedicated VRAM, or 384 GB across the four nodes; AMD's Linux walkthrough then uses TTM/kernel settings to raise that to 120 GB per node, or 480 GB total. That matters because the Kimi K2.5 UD_Q2_K_XL GGUF build AMD used is listed at 375 GB, not 240 GB: the weights alone would nearly fill the 384 GB BIOS ceiling, so the higher Linux allocation is what leaves working room beyond the weights.
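The memory arithmetic is worth doing by hand once. The sketch below uses only the figures quoted above and treats every "GB" as a decimal gigabyte; the function name is made up for illustration, and the leftover it reports is a cluster-wide total, not a guarantee that any single node has that much free.

```python
# Memory budget sketch using the figures quoted in the article.
# Assumption: all sizes are decimal GB and the weights are the only fixed cost.
NODES = 4
BIOS_VRAM_PER_NODE_GB = 96     # per-node ceiling available in BIOS
LINUX_VRAM_PER_NODE_GB = 120   # per-node allocation after the TTM/kernel settings
MODEL_SIZE_GB = 375            # listed size of the GGUF build AMD used


def headroom_gb(per_node_gb: int, nodes: int = NODES, model_gb: int = MODEL_SIZE_GB) -> int:
    """Cluster-wide memory left over once the weights are loaded."""
    return per_node_gb * nodes - model_gb


if __name__ == "__main__":
    for label, per_node in (("BIOS", BIOS_VRAM_PER_NODE_GB), ("Linux", LINUX_VRAM_PER_NODE_GB)):
        total = per_node * NODES
        print(f"{label}: {total} GB total, {headroom_gb(per_node)} GB left after weights")
```

At the BIOS ceiling the weights leave 9 GB across the whole cluster; at the Linux allocation they leave 105 GB.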

The glue is llama.cpp running in RPC mode: one controller node and three RPC servers, with the model distributed across all four machines. AMD lists the interconnect as 5 Gbps Ethernet, which matches the Framework Desktop's built-in 5 Gbps Ethernet port. That's the whole rig. No exotic interconnect, no custom boards, nothing you couldn't order this afternoon.
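For a sense of what "one controller and three RPC servers" looks like on the command line, here is a rough sketch that only builds the commands and prints them. It is not AMD's exact invocation: the IP addresses and model file name are hypothetical, the choice of `llama-server` as the controller binary and the `-ngl 99` value are assumptions, and the flags follow llama.cpp's RPC documentation as I understand it, so check `--help` on your own build before relying on them.

```python
import shlex

RPC_PORT = 50052  # default rpc-server port in llama.cpp's RPC docs (verify on your build)


def rpc_server_cmd(bind_host: str = "0.0.0.0", port: int = RPC_PORT) -> list[str]:
    """Command each of the three worker nodes would run."""
    return ["rpc-server", "--host", bind_host, "--port", str(port)]


def controller_cmd(model_path: str, workers: list[str], port: int = RPC_PORT) -> list[str]:
    """Command for the controller node, listing every worker endpoint."""
    endpoints = ",".join(f"{host}:{port}" for host in workers)
    # -ngl 99 is an illustrative "offload as many layers as possible" value.
    return ["llama-server", "-m", model_path, "--rpc", endpoints, "-ngl", "99"]


if __name__ == "__main__":
    workers = ["10.0.0.2", "10.0.0.3", "10.0.0.4"]  # hypothetical addresses
    print(shlex.join(rpc_server_cmd()))
    print(shlex.join(controller_cmd("kimi-k2.5.gguf", workers)))  # hypothetical file name
```

The RPC server accepts connections without authentication as far as I know, so a setup like this belongs on a private network segment.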

The interesting word in all of that is unified. On a normal PC, your CPU's RAM and your GPU's VRAM are separate pools, and a model too big for the VRAM either spills to slow system memory or doesn't run. Unified memory collapses that wall: the GPU can address the whole bank, which is the entire reason a 4.5-liter desktop can hold a chunk of a model this size in the first place.

AMD's own technical write-up covers the configuration in detail. What it doesn't really cover is why "trillion parameters" is doing more rhetorical work than it looks like.

Diagram of AMD's 4-node mini PC cluster: four Framework Desktop nodes with Ryzen AI Max+ 395 and 128 GB unified memory each, linked over 5 Gbps Ethernet as one controller and three RPC servers, running the 375 GB Kimi K2.5 GGUF build with 96 GB BIOS VRAM and 120 GB Linux allocation per node (480 GB total)

The Trick: Why "Trillion Parameters" Is True but Not the Whole Truth

Here's the thing the spec sheet leans on without explaining: Kimi K2.5 is a Mixture-of-Experts model, and that changes what "trillion parameters" means in practice.

A dense model, the kind most people picture, runs every parameter for every token. A 70-billion-parameter dense model does 70 billion parameters' worth of math on each token it produces. A Mixture-of-Experts model is built differently. Kimi K2.5 has 384 separate "experts," 8 of which activate per token plus one shared expert, across 61 layers. So while the model carries 1.04 trillion parameters in total, only about 32 billion of them light up on any single forward pass. A router picks which experts to wake; the rest sit there doing nothing for that token.
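The routing step is easier to picture with a toy. The sketch below is not Kimi K2.5's router: it swaps the learned gating scores for random numbers and just shows the top-k selection, using the 384 and 8 figures from the text. The function and constant names are made up for illustration.

```python
import heapq
import random

NUM_EXPERTS = 384   # expert count quoted in the text
TOP_K = 8           # experts activated per token, per the text


def route(scores: list[float], k: int = TOP_K) -> list[int]:
    """Return the indices of the k highest-scoring experts, best first."""
    return heapq.nlargest(k, range(len(scores)), key=scores.__getitem__)


if __name__ == "__main__":
    rng = random.Random(0)
    # Stand-in for one token's router scores. A real router computes these
    # with a trained gating layer, not a random number generator.
    scores = [rng.random() for _ in range(NUM_EXPERTS)]
    chosen = route(scores)
    print(f"experts used for this token: {sorted(chosen)}")
    print(f"idle experts: {NUM_EXPERTS - len(chosen)} of {NUM_EXPERTS}")
```

Every token gets its own set of scores, so the eight experts that wake up change from token to token, which is why all of them have to stay resident in memory.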

So is "running a trillion-parameter model on four mini PCs" honest? Yes: you genuinely need the memory to hold all 1.04 trillion parameters, and that memory is the hard part. But the compute your hardware has to do per token is a 32B-class job, not a 1T-class one.

Which cuts both ways, and this is where it gets interesting. It makes the demo more impressive than it sounds, because holding a full trillion-parameter model in memory on consumer boxes is the genuinely hard thing they pulled off. And it makes it less impressive than the headline implies, because the actual per-token workload is something single boxes already chew through faster on smaller MoE models. A 120B MoE model runs at 50-plus tokens per second on one of these nodes. The trillion-parameter number is real, but it's a memory flex, not a compute flex.

The takeaway: when you're sizing hardware for a model, the total parameter count tells you how much memory you need, and the active-parameter count tells you how much work the machine has to do per token.
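Put as arithmetic, a back-of-envelope version looks like the sketch below. It uses only numbers from this article and assumes decimal gigabytes and a uniform number of bits per weight, which a mixed-precision GGUF quantization does not strictly have, so treat the outputs as averages rather than exact figures. The function names are illustrative.

```python
TOTAL_PARAMS = 1.04e12   # total parameters held in memory
ACTIVE_PARAMS = 32e9     # approximate parameters active per token
MODEL_SIZE_GB = 375      # listed size of the quantized build


def bits_per_weight(size_gb: float, params: float) -> float:
    """Average storage cost of one parameter, in bits."""
    return size_gb * 1e9 * 8 / params


def active_fraction(active: float, total: float) -> float:
    """Share of the model that does work on a single token."""
    return active / total


def active_weights_gb(size_gb: float, active: float, total: float) -> float:
    """Rough size of the weights touched per token (uniform-bits assumption)."""
    return size_gb * active_fraction(active, total)


if __name__ == "__main__":
    print(f"{bits_per_weight(MODEL_SIZE_GB, TOTAL_PARAMS):.2f} bits per weight on average")
    print(f"{active_fraction(ACTIVE_PARAMS, TOTAL_PARAMS):.1%} of parameters active per token")
    print(f"~{active_weights_gb(MODEL_SIZE_GB, ACTIVE_PARAMS, TOTAL_PARAMS):.1f} GB of weights per token")
```

Under those assumptions the build averages just under 3 bits per weight, and roughly 3% of the model does the work for any one token.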

Mixture-of-Experts explainer: 1.04 trillion total parameters must be held in memory, an MoE router selects 8 of 384 experts plus one shared expert per token, so only about 32 billion parameters are active per token. Total parameters decide memory, active parameters decide per-token compute

The Catch: What 8 Tokens a Second and a 40-Second-to-4-Minute Wait Actually Mean

Eight tokens a second is the number that decides everything, so sit with it for a second. AMD's article reports the cluster generating about 8.30 t/s at an 8,192-token context and roughly 9.45 t/s at steady state, with prompt processing around 100.77 t/s. Those are fine, fair numbers for what they are.

The one that hurts is time-to-first-token. Before the model produces a single word, it has to read your prompt, and AMD's own benchmark table puts that wait at 39.7 seconds for a 4,096-token prompt, 90.5 seconds for an 8,192-token prompt, and 239.1 seconds for a 16,384-token prompt with Flash Attention enabled. So you type a question, and then you wait, possibly for nearly four minutes, before anything comes back.
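Those waits are easier to reason about as rates and totals. The sketch below takes the three time-to-first-token figures above and the 8.30 t/s generation rate, and makes two simplifying assumptions that are mine, not AMD's: that time-to-first-token is all prompt processing, and that the generation rate measured at 8,192 tokens of context holds at the other prompt lengths.

```python
# Time-to-first-token in seconds, keyed by prompt length in tokens (from the article).
TTFT_SECONDS = {4096: 39.7, 8192: 90.5, 16384: 239.1}
GEN_TOKENS_PER_SEC = 8.30  # reported at an 8,192-token context


def prefill_rate(prompt_tokens: int) -> float:
    """Effective prompt-processing speed implied by a TTFT figure, in tokens/s."""
    return prompt_tokens / TTFT_SECONDS[prompt_tokens]


def total_seconds(prompt_tokens: int, output_tokens: int,
                  gen_rate: float = GEN_TOKENS_PER_SEC) -> float:
    """Wall-clock estimate for one request: wait for the first token, then generate."""
    return TTFT_SECONDS[prompt_tokens] + output_tokens / gen_rate


if __name__ == "__main__":
    for n in TTFT_SECONDS:
        print(f"{n:>6} prompt tokens: ~{prefill_rate(n):.1f} t/s effective prefill")
    print(f"8,192-token prompt + 500-token answer: ~{total_seconds(8192, 500):.0f} s")
```

The implied prefill rate falls from about 103 t/s to about 69 t/s as the prompt grows from 4,096 to 16,384 tokens, which is why the wait grows faster than the prompt does.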

For an interactive coding loop, that's rough, and developers in the Hacker News discussion said so plainly: a minute-plus of dead air before the first token does not fit the way anyone writes code with an assistant. But flip the workload. If you're running batch jobs overnight, processing documents async, generating things you'll read later, or doing private inference where the whole point is that nothing leaves the building, 8 tokens a second is completely livable. You weren't watching the screen anyway.

The asterisk: Don't expect these numbers to reproduce out of the box. The ROCm software stack on this hardware is version-sensitive in ways that bite: a GitHub issue documented a Strix Halo system stuck at idle GPU clocks and crawling at 0.5 t/s under LLM inference on ROCm 7.1.1 and Linux kernel 6.14. That's not "AMD is broken," but it does mean the published performance depends on a very specific software stack, and you may end up chasing ROCm, kernel, and firmware combinations before your rig matches the numbers in the write-up.

One more thing the backlash gets wrong is the cost. People keep calling it a "$10,000 cluster," but nobody's publishing that as a fixed bill of materials. Do the arithmetic yourself: four 128 GB Framework Desktops at the $1,999 launch price would put the machines alone at about $8,000, while a March 2026 Liliputing snapshot listed a 128GB/1TB Framework Desktop configuration at $2,851, or about $11,400 for four before networking. Add a few hundred dollars for switch and cabling, and the practical range is closer to roughly $8.2K to $11.7K depending on configuration, purchase date, and what you already have. Not nothing. Not a server room either.
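If you want to rerun that arithmetic with your own prices, a sketch like this does it. The two unit prices come from the paragraph above; the networking amounts of $200 and $300 are my stand-ins for "a few hundred dollars," not quoted figures.

```python
NODES = 4
LAUNCH_PRICE_USD = 1999     # 128 GB Framework Desktop at launch
SNAPSHOT_PRICE_USD = 2851   # 128GB/1TB configuration in the March 2026 snapshot


def cluster_cost(unit_price: int, networking: int = 0, nodes: int = NODES) -> int:
    """Total hardware cost: the machines plus an optional networking allowance."""
    return nodes * unit_price + networking


if __name__ == "__main__":
    # Networking allowances below are assumptions for illustration.
    low = cluster_cost(LAUNCH_PRICE_USD, networking=200)
    high = cluster_cost(SNAPSHOT_PRICE_USD, networking=300)
    print(f"machines only: ${cluster_cost(LAUNCH_PRICE_USD):,} to ${cluster_cost(SNAPSHOT_PRICE_USD):,}")
    print(f"with networking: ${low:,} to ${high:,}")
```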

Here's where I land on the whole thing: the cluster works. Whether eight tokens a second and a minute-plus wait is a triumph or a toy depends entirely on what you're trying to build. It is not an interactive coding workstation. It is also not a toy. It's a real machine for a specific kind of patient work, and pretending it's either more or less than that is how everyone in this argument ends up talking past each other.

Where This Actually Lands

The honest framing isn't "AMD beat Nvidia." It's that this is a different product for a different person. The reader who wants this is the one who needs privacy, wants offline, or doesn't want to pay per token forever, not the one chasing the fastest possible response.

And the strongest argument against the whole exercise deserves a straight answer: you can just hit Kimi's API. Artificial Analysis currently lists Kimi's own K2.5 endpoint around 56 to 60 tokens per second with a blended price around $0.49 per million tokens, while Kimi's official API platform lists K2.5 pricing at $0.10/M cache-hit input tokens, $0.60/M input tokens, and $3.00/M output tokens. Third-party K2.5 providers can be faster or cheaper depending on routing, but the basic point is the same: the API is faster than the cluster, avoids hardware babysitting, and will be the right call for most people on most days.

So the local story only makes sense when one of three things is true: the data can't leave (privacy), the connection can't be assumed (offline), or the token volume is high enough and sustained enough that owning the metal beats renting it forever (cost at scale). Outside those three, the API wins. Inside them, the cluster is the only thing that does the job at all.

Dimension	AMD 4-node cluster	Kimi API / cloud route
Generation speed	~8 to 9.5 t/s	~56 to 60 t/s on Kimi's own K2.5 endpoint
Time-to-first-token	39.7 to 239.1 s	provider-dependent, much lower
Cost model	~$8.2K to $11.7K hardware	per-token API pricing
Privacy / offline	fully local	provider-hosted
Best-fit use case	private, offline, batch work	interactive/API use
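The cost-at-scale case is the one you can put a rough number on. The sketch below compares a hardware price against the $3.00 per million output-token list price quoted earlier. It is deliberately crude: it ignores input-token charges, cache-hit pricing, electricity, and any batching or concurrency on the cluster, so read it as a way to frame the question rather than an answer. The function names are illustrative.

```python
SECONDS_PER_DAY = 86_400


def breakeven_output_tokens(hardware_usd: float, usd_per_million_tokens: float) -> float:
    """Output tokens you would have to buy from the API to spend the hardware price."""
    return hardware_usd / usd_per_million_tokens * 1_000_000


def days_to_generate(tokens: float, tokens_per_sec: float) -> float:
    """Days of nonstop single-stream generation needed to produce that many tokens."""
    return tokens / (tokens_per_sec * SECONDS_PER_DAY)


if __name__ == "__main__":
    hardware_usd = 8_200   # low end of the hardware range above
    api_price = 3.00       # USD per million output tokens (list price quoted above)
    cluster_rate = 9.0     # tokens/s, inside the reported 8 to 9.5 range

    tokens = breakeven_output_tokens(hardware_usd, api_price)
    print(f"break-even volume: {tokens / 1e9:.2f} billion output tokens")
    print(f"days of nonstop generation at {cluster_rate} t/s: {days_to_generate(tokens, cluster_rate):,.0f}")
```

Swap in your own hardware price, provider rate, and measured throughput; the answer is sensitive to all three.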

For the record, Nvidia's DGX Spark is the obvious "but what about" here, and it wins on some axes the AMD cluster doesn't. That's a whole separate fight, and one I'll take up elsewhere. If you want the rental side of the hardware-vs-cloud decision, Cloudzy's GPU VPS page is the more practical comparison point.

The Part That Actually Matters

Strip away the token rate and the price arguments, and one fact is left standing: the hardware that runs a trillion-parameter model is now a shelf, not a building.

That's the shift, and it's easy to miss under the speed bickering. A year ago, the category of people who could run a 1.04-trillion-parameter model was "datacenter operators." Full stop. Now it includes anyone with roughly ten grand and some patience. The line didn't move a little: a whole new group of people just walked through a door that was locked.

What that opens up is the interesting part. Private agents that run entirely on hardware you own. Inference that works on a plane or behind an air gap. Models that physically cannot phone home because there's nowhere for the call to go. An economics of AI where the marginal cost of a token is electricity instead of a metered API line. None of that was reachable on consumer hardware a year ago, and unified memory is the thing that reached it.

I've watched this pattern enough times to be wary of "this changes everything." Usually it doesn't; usually it's last year's thing with a new logo. This one's different, and not because it's fast. It's different because the floor moved. The slow, expensive, patient version of frontier-scale local inference exists now, and the fast version is just a matter of the next few hardware generations grinding it down. The hard part was never going to be speed. The hard part was access, and access just happened.

The milestone here isn't speed. It's who's allowed in the room. The machine that runs frontier-scale models used to be a building. Now it's four boxes on a shelf.

Frequently Asked Questions

Can You Really Run a Trillion-Parameter Model on a Mini PC Cluster?

Yes, with one important caveat. AMD ran Kimi K2.5, a 1.04-trillion-parameter model, across four Ryzen AI Max+ 395 mini PCs. In BIOS, the four systems can expose about 384 GB of dedicated VRAM total; AMD's Linux walkthrough then raises the allocation to 480 GB total through TTM/kernel settings. But Kimi K2.5 is a Mixture-of-Experts model: of those 1.04 trillion parameters, only about 32 billion activate on any given token. You need the memory to hold all of them, but the compute per token is closer to a 32-billion-parameter workload.

What Is Kimi K2.5 and Why Does the MoE Architecture Matter Here?

Kimi K2.5 is an open-weight language model from Moonshot AI with 1.04 trillion total parameters and 32 billion active per forward pass, built on a Mixture-of-Experts design (384 experts, 8 activated per token plus one shared). The architecture matters because the active-parameter count, not the total, is what your hardware has to compute for each token. That's why a model with a trillion parameters on paper can run on consumer boxes at all.

Is 8 Tokens per Second Fast Enough for Local AI?

It depends entirely on the workload. For batch processing, async jobs, offline use, or private inference where nothing can leave your hardware, 8 tokens per second is fine, because you're not staring at the screen. For interactive coding, it's rough, mostly because the time-to-first-token on this cluster runs from about 40 seconds to nearly 4 minutes depending on prompt length, and that dead air before the first word kills an iterative loop.

Why Not Just Use Kimi's API Instead?

For most people, you should. Kimi's own K2.5 endpoint is much faster than the local cluster in current Artificial Analysis data, and third-party K2.5 providers can be faster or cheaper still. The local hardware only makes sense when you need privacy (the data can't leave), offline capability (no connection to assume), or cost-at-scale (sustained high volume where owning beats renting). Outside those cases, the API is the better choice.
