What Is Unified Memory? Why a Mini PC Runs a 235B AI Model

A roughly $2,000 to $3,000 unified-memory mini PC can load some heavily quantized 235B-class models that do not fit on a single H100-class GPU.

That sounds backwards, so let's make the comparison precise. The expensive card is much faster, but its local GPU memory is smaller. The little box on the desk may have a larger shared pool, so the model can load even if generation is slow.

The short answer to how is "unified memory." It is printed on the spec sheet of many new AI mini PCs and Macs as a headline number ("128GB unified memory"), and almost nobody explains what it actually does. So that's the job here. By the end you'll know what unified memory is, why it lets a small machine run a model that used to need a server rack, and the catch nobody puts in the headline: it runs that model slowly.

TL;DR

Unified memory is one physical pool of memory that a chip's CPU and integrated GPU share, instead of a discrete graphics card's small, separate VRAM sitting next to your separate system RAM.
That shared pool is large, and the GPU can usually access far more memory than a discrete card's fixed VRAM limit, though the exact usable amount depends on the platform, firmware settings, OS, and runtime. So the first question becomes: does this quantized build fit in usable memory? A 128GB pool can fit models a 24GB or 32GB graphics card never could.
The catch is speed, not size. Unified memory moves data far slower than a discrete card's VRAM. The big model runs. It just generates tokens slowly. Unified memory lets you run the big model, not run it fast.
"Unified" isn't one thing. Apple's version is mostly invisible to the user; AMD's version exposes more knobs, because firmware and driver settings can affect how much memory is reserved for, or practically usable by, the GPU. And more memory does not mean faster.

What Is Unified Memory?

Picture two setups. A discrete graphics card has its own memory (VRAM) bolted right next to its processor, fast but small. Your system RAM is a second, separate pool the CPU uses. To run a model on the GPU, the data has to be copied from system RAM across the PCIe bus into VRAM first. Two pools, one copy step.

Unified memory throws out that split. It is a single physical pool of memory that the chip's CPU and integrated GPU both share, letting the GPU work from the shared pool instead of relying on a small separate VRAM box. On platforms like Apple Silicon, this also avoids the old copy-across-PCIe step. Apple's own architecture talk describes it as the CPU and GPU "working over the same memory" with no need to copy data across a PCIe bus. One pool. Zero copy.

The shared pool is usually LPDDR5X memory soldered onto the package, which is what lets it be both large and close to the processor. The headline examples right now are Apple Silicon Macs, AMD's Strix Halo systems built around chips like the Ryzen AI Max+ 395, and Nvidia's DGX Spark. AMD's Ryzen AI Halo developer platform lists 128GB of LPDDR5X memory at 256GB/s, while Nvidia's DGX Spark lists 128GB of LPDDR5X unified system memory at 273GB/s.

Shared memory between a CPU and an integrated GPU isn't new. Laptops have done it for years, and it was usually a compromise: slow memory, not much of it. What changed is capacity at usable bandwidth. Once a shared pool got big enough, around the 128GB class, while staying fast enough to be worth using, it crossed the line where very large open-weight models could fit locally. That's the whole story. The architecture is old; the size is new.

A note on "vs VRAM": People ask whether unified memory is VRAM. Not quite. VRAM is dedicated graphics memory on a discrete card, fast and separate. Unified memory is one shared pool that does the job of both VRAM and system RAM. It trades the discrete card's raw speed for size and the ability to skip the copy step.

Why Does a Model Need to Fit in Memory?

Figure: Comparison showing a 235B-class model failing to fit in 24GB GPU VRAM or 80-94GB H100-class GPU memory, but fitting in a 128GB unified memory pool.

For normal in-memory inference, the model's weights need to sit in memory the processor can address. If the usable memory is too small, the model will not load cleanly on that device. Some tools can offload parts of a model to CPU memory or storage, but that changes the performance profile sharply and is not the same as the model fitting comfortably in GPU-addressable memory. Capacity is a hard gate that comes before any question of speed.

This is the lever unified memory pulls. Many consumer graphics cards have 24GB of VRAM or less, and even top-end single consumer cards sit around 32GB. A 70-billion- or 235-billion-parameter model is far too big for that. Raw 4-bit arithmetic for 235B parameters starts around 118GB before format overhead, runtime buffers, and context memory. In practice, actual downloadable builds vary a lot: for example, Ollama's Qwen3-235B-A22B Q4_K_M build is listed at 142GB, while more aggressive lower-bit quantizations can come in closer to the range a 128GB unified-memory machine can handle. So the card built for the job runs out of room before it can begin. (How those memory numbers are calculated, parameters times bytes per weight plus the overhead the file size hides, is its own topic, and the sibling article on quantization math does the arithmetic.)
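The raw arithmetic in that paragraph is easy to check yourself. Here is a minimal sketch of parameters times bytes per weight, with nothing added for format overhead, runtime buffers, or context memory. The function name and the list of bit widths are illustrative assumptions; real quantized builds mix bit widths, so their file sizes will not match these flat numbers.

```python
def raw_weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Raw weight storage in decimal GB: parameters x bits per weight / 8.

    Deliberately ignores format overhead, runtime buffers, and context
    memory, so treat the result as a floor, not a file size.
    """
    return params_billion * bits_per_weight / 8


# Flat bit widths chosen for illustration only (assumption).
for bits in (16, 8, 4, 3):
    print(f"235B at {bits}-bit: ~{raw_weight_gb(235, bits):.1f} GB raw")

# The 4-bit case is the "starts around 118GB" figure from the text.
print(raw_weight_gb(235, 4))  # 117.5
```

The gap between 117.5GB here and a 142GB listed download is the overhead the raw arithmetic leaves out.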

A 128GB unified pool changes the answer to one question: does this particular quantized build fit after the OS, runtime, KV cache, and GPU allocation limits take their share? For some aggressive 235B-class quantizations, yes. That is why a compact unified-memory box can sometimes load a model that a smaller-VRAM GPU cannot. It isn't more powerful. It just has a bigger room to put the model in.
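That fit question can be written down as a tiny budget check. This is a sketch under loud assumptions: the 16GB reserved for the OS and runtime, the 8GB KV cache, and the 100GB "aggressive" build are hypothetical placeholders, not measurements of any real system. Only the 128GB pool and the 142GB Q4_K_M listing come from the text above.

```python
def fits(pool_gb, weights_gb, reserved_gb, kv_cache_gb):
    """Return (fits?, headroom_gb) for a model in a shared memory pool.

    usable = pool minus whatever the OS, runtime, and GPU allocation
    limits hold back; needed = stored weights plus KV cache.
    """
    usable = pool_gb - reserved_gb
    needed = weights_gb + kv_cache_gb
    return needed <= usable, usable - needed


POOL_GB = 128      # sticker number from the spec sheet
RESERVED_GB = 16   # hypothetical OS + runtime share (assumption)
KV_CACHE_GB = 8    # hypothetical context memory (assumption)

# 142GB: the Q4_K_M listing cited in the text. 100GB: a hypothetical
# lower-bit build, used only to show the other outcome.
for label, weights_gb in (("Q4_K_M, 142GB", 142), ("lower-bit, 100GB", 100)):
    ok, headroom = fits(POOL_GB, weights_gb, RESERVED_GB, KV_CACHE_GB)
    print(f"{label}: fits={ok}, headroom={headroom} GB")
```

Swap in your own reserved and KV cache figures; the point is that the check runs against usable memory, not the 128GB on the box.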

That's the first thing the headlines get right and leave unexplained. Pool size, not raw power, is what decides whether the model runs at all.

Why Is Unified Memory Slower Than a Graphics Card?

Figure: Diagram showing a 235B-class model failing to fit in 24GB GPU VRAM or 80-94GB H100-class GPU memory, but fitting in a 128GB unified memory pool at the cost of speed.

Generating text one token at a time is limited by memory bandwidth, not by how fast the processor can do math. Every token you produce requires streaming the model's active weights through the processor, so the speed ceiling is how fast memory can feed the chip. This is the well-documented "memory-bound" nature of single-stream decoding: the chip spends most of its time waiting on memory, not computing.

And bandwidth is exactly where unified memory gives ground. AMD's Strix Halo pool runs at 256GB/s on paper, and independent testing at llm-tracker.info clocks it at about 212GB/s in practice. The DGX Spark sits at 273GB/s. A high-end discrete graphics card, by contrast, moves data several times faster, because its dedicated VRAM is built for that. So when a model fits both a unified box and a discrete card, the discrete card generates tokens noticeably faster. Same model, same result, very different speed.

For dense models, a useful rule of thumb is:

tokens per second ≈ memory bandwidth ÷ model size in memory.

It is directional, not a benchmark, but it explains the tradeoff: smaller resident weights or higher bandwidth usually means faster decoding. For MoE models, do not apply the rule directly to the total parameter count. Capacity still depends on the total stored weights, but per-token speed depends more on the activated path, routing overhead, cache behavior, and implementation.
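To see the rule of thumb with numbers plugged in, here is a rough sketch for a dense model. The 40GB resident size is a hypothetical example, not a specific build; the bandwidth figures are the ones quoted earlier (256GB/s on paper and about 212GB/s measured for Strix Halo, 273GB/s for DGX Spark). The result is a ceiling on decode speed, not a prediction of what a real runtime will deliver.

```python
def decode_ceiling_tok_s(bandwidth_gb_s: float, resident_weights_gb: float) -> float:
    """Rule-of-thumb upper bound for dense single-stream decoding.

    Each generated token streams the resident weights once, so the
    ceiling is bandwidth divided by model size in memory.
    """
    return bandwidth_gb_s / resident_weights_gb


DENSE_MODEL_GB = 40  # hypothetical dense model size in memory (assumption)

bandwidths = {
    "Strix Halo (paper)": 256,
    "Strix Halo (measured)": 212,
    "DGX Spark (paper)": 273,
}

for name, gb_s in bandwidths.items():
    ceiling = decode_ceiling_tok_s(gb_s, DENSE_MODEL_GB)
    print(f"{name}: at most ~{ceiling:.1f} tokens/s")
```

Halve the resident size or double the bandwidth and the ceiling doubles, which is the whole tradeoff in one line of division.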

One nuance, then I'll leave it alone: there are two phases to a request. Reading your prompt (prefill) leans on compute. Generating the reply (decode) leans on bandwidth. The slow part you feel, words appearing one at a time, is the bandwidth-bound part.

So here's the takeaway the spec sheet skips: unified memory lets you run the big model, not run it fast. It wins the capacity argument and loses the bandwidth one. Whether that trade is worth it depends entirely on what you're doing. It's a fair trade to make on purpose, not a surprise to discover after you buy.

Is All Unified Memory the Same?

No. "Unified" describes a category, not a single implementation, and the versions differ in ways that matter. Apple's version is mostly invisible to the user: the memory is shared by default. AMD's Strix Halo is more hands-on: firmware and driver settings can affect how much memory is reserved for, or practically usable by, the GPU. Both are unified memory. They are not the same experience.

Let me name the misconception this whole topic produces, because it's the most common one: more memory does not mean faster inference. It means a bigger model can run. Someone buys a 128GB box expecting speed, loads a model that also fits a 24GB discrete card, and is disappointed it runs slower than the smaller card did. Both statements are true at once: the big pool fits more, and the small fast card runs faster on what they share. Size and speed are different axes. Unified memory buys you the first one.

A practical wrinkle on the AMD side: how much of the pool is actually usable for a model depends on the firmware setting and the operating system. AMD's Variable Graphics Memory FAQ covers how that allocation works; the short version is that a 128GB box does not hand all 128GB to the GPU, and the usable amount depends on the VGM setting, reserved system memory, the OS, and the runtime. Plan around usable memory, not the sticker number.

Pro Tip: When you're sizing a machine for local models, read the spec sheet as two numbers, not one. Capacity tells you which models fit. Bandwidth tells you how fast they'll run once they do. A box with a huge pool and modest bandwidth is a box that runs large models slowly, which may be exactly what you want, as long as you knew that going in.

There's one more case worth flagging, because it trips people up on these big-pool machines: Mixture-of-Experts models. A model like Qwen3-235B-A22B has 235 billion total parameters but only activates about 22 billion of them per token. It's tempting to assume that means it only needs memory for the active slice. For normal in-memory inference, it doesn't. All 235 billion weights still need to be resident somewhere the runtime can use, because any token might route to any expert: only the compute per token is reduced, not the capacity requirement. That distinction is precisely where unified memory's big pool earns its keep, and the sibling article on quantization math works through what those numbers come to.
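The split between what an MoE model stores and what it touches per token can be sketched with the same bytes-per-weight arithmetic. This assumes a flat 4-bit width and a hypothetical helper name; it ignores routing overhead, cache behavior, and format overhead, so the "active" figure is only a rough indication of per-token weight traffic, not a speed prediction.

```python
def moe_footprint_gb(total_params_b: float, active_params_b: float,
                     bits_per_weight: float) -> tuple[float, float]:
    """Return (resident_gb, active_gb) for a Mixture-of-Experts model.

    resident_gb: all experts must be loaded, so capacity follows the
                 total parameter count.
    active_gb:   raw size of the weights activated for one token.
    """
    bytes_per_weight = bits_per_weight / 8
    resident_gb = total_params_b * bytes_per_weight
    active_gb = active_params_b * bytes_per_weight
    return resident_gb, active_gb


# Qwen3-235B-A22B: 235B total, about 22B active (figures from the text).
# Flat 4-bit weights are an illustrative assumption.
resident, active = moe_footprint_gb(235, 22, 4)
print(f"must be resident: ~{resident:.1f} GB")    # 117.5
print(f"touched per token: ~{active:.1f} GB")     # 11.0
```

The machine has to hold the first number; only the second one moves through the chip for each token.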

Frequently Asked Questions

Is Unified Memory the Same as VRAM?

No. VRAM is dedicated, high-speed memory built into a discrete graphics card, kept separate from your system RAM. Unified memory is a single shared pool that the CPU and GPU both use, doing the job of VRAM and system RAM at once. Unified memory is usually larger but slower than a discrete card's VRAM, and it skips the step of copying data between two pools.

Why Is My Local Model Slow Even Though It Fits in Memory?

Because fitting and running fast are two different things. Whether a model loads depends on memory capacity; how fast it generates text depends on memory bandwidth. Unified memory has plenty of capacity but much lower bandwidth than a discrete graphics card, so a model that fits comfortably can still generate tokens slowly. For dense models, the rough relationship is tokens per second ≈ bandwidth ÷ model size. For MoE models, capacity still depends on total stored weights, but speed depends more on the activated path and runtime implementation.

Do You Still Need a GPU if You Have Unified Memory?

The integrated GPU is already part of a unified-memory chip; that's what runs the model. The real question is whether you also want a discrete GPU. Many discrete cards give you far higher bandwidth, which means faster generation, but less local memory than a large unified-memory system, so they may not hold the largest models by themselves. Unified memory gives you a big pool that fits large models at lower speed. Which you want depends on model size versus speed.

Why Can a Mini PC Run a Model That Needs a Datacenter GPU?

Because the bottleneck for loading a model is memory capacity, and a mini PC with a large unified pool can have more usable model memory than many single-GPU setups. A consumer GPU may have 24 to 32GB of VRAM, and a single H100-class datacenter GPU has 80 to 94GB, while some unified-memory systems advertise 128GB shared pools. The model's weights all have to fit somewhere the processor can reach; the big shared pool fits them, the small fast VRAM doesn't. The mini PC isn't more powerful. It just has room.

Fitting Is the Win: How Much It Needs Is the Next Question

Unified memory's contribution is one clean thing: a big, shared, addressable pool that lets a small machine fit models that used to require a server. That's the capacity win. The bandwidth catch is the price, and now you can read a spec sheet knowing which number governs which behavior.

The natural next question is the one this article kept handing off: how much memory does a given model actually need? That's arithmetic: parameters, bytes per weight, the compression level you choose, and the context tax the file size hides. The sibling article on GGUF, GPTQ, AWQ, and EXL2 quantization works through exactly that math, and it's worth doing before you size a box or pick a model.
