How AI Generates Games Without a Game Engine (GameNGen, Genie 3)

In 2024, a Google Research and Google DeepMind team showed that a neural model could simulate playable DOOM at over 20 frames per second without running the original game engine underneath it. No conventional engine loop stored coordinates, physics objects, health variables, or map state in the usual way. Instead the system, called GameNGen, learned to infer the next frame, including visual cues such as health, ammo, enemies, doors, and walls, from recent frames and player inputs. It is a modified version of Stable Diffusion (the same kind of model that generates images from text), and it plays DOOM by hallucinating each next frame from the previous frames plus whatever key you just pressed.

That is a fundamentally different thing from "AI inside a game engine." When a studio uses AI to generate textures or write NPC dialogue in Unity, the engine is still there doing the real work. GameNGen has no engine. The model is the game. And it is the start of a genuine frontier that the headlines keep getting wrong. GameNGen appeared through the ICLR research track, DIAMOND came through NeurIPS 2024, and companies like Google DeepMind, Microsoft Research, Decart, and Skywork AI are now pushing the idea from papers into demos, APIs, and open-source systems.

Here is what these systems actually do, how next-frame prediction works, why coherence and memory still break down over longer interaction, what they cost to run, and whether they are coming for Unity. The short answer to that last one is no, at least not the way the hype implies. The reason is architectural: more compute helps, but it does not by itself create persistent state, deterministic logic, or a debuggable game loop.

The Short Version

These models predict frames; they don't simulate rules. A game engine computes the next state from logic and stored variables. A world model like GameNGen or Oasis guesses the next image from previous frames plus your input. It is not running a traditional game-engine simulation with explicit object state, physics code, and inspectable variables; it is generating the next observation through a learned model.
Their coherence is still bounded by memory and context, but the limit is no longer as simple as "everything fails after a few seconds." GameNGen has a little over 3 seconds of direct frame history yet can remain visually stable over longer trajectories through learned heuristics. Genie 2 usually showed 10-20 second examples and could sometimes preserve out-of-view details, while Genie 3 pushes consistency to a few minutes at 720p/24fps. The core weakness remains: these systems do not yet provide the durable, inspectable, saveable state that production games rely on.
They are not naturally deterministic in the way production games need. You can constrain sampling or fix seeds, but that still does not give you the clean, inspectable state updates of a normal engine. Multiplayer, competitive balance, replays, skill progression, and save/load all depend on reliable state transitions. A frame generator can approximate that behavior, but a production game would still need a deterministic logic layer underneath or beside it.
DeepMind frames world models as a foundation for training and evaluating AI agents in rich simulated environments, while Project Genie shows the same technology in a consumer-facing world-creation prototype. Decart's newer Oasis 3 is even more explicitly aimed at physical AI, robotics, and autonomous-vehicle simulation. That reframes the "is this coming for Unity?" question: the most serious near-term market may be agent training and simulation, not finished consumer games.

What This Article Doesn't Cover

A few neighboring topics get pulled into the same conversation and don't belong here:

DLSS, FSR, upscaling, and frame generation. Those are AI replacing individual stages of a normal rendering pipeline; the engine is still running. That's a separate topic, neural rendering, and not what this article covers.
The detailed reinforcement-learning methodology used to gather training data. I'll describe it at a conceptual level; the papers have the full recipe.
Game-server hosting and infrastructure setup. This is an explainer about how the models work, not a deployment guide.

What People Mean by "AI Game Engine" (and Which One This Is)

The phrase "AI game engine" gets attached to three completely different things, and most confusion about this topic comes from collapsing them together. This article is about exactly one of them: a model that predicts every frame and replaces the engine entirely. Not AI tools bolted onto a traditional engine, and not a tool that builds 3D environments you then load into one.

The three meanings, in plain terms:

AI tools inside a traditional engine. Asset generation, texture synthesis, NPC behavior trees, dialogue writing: all running inside Unity or Unreal. The engine still renders frames, runs physics, and holds state. The AI is an assistant in the content pipeline. This is what most search results for "AI game engine" are actually about, and it is not this article's subject.
Authored 3D-space generators. World Labs, co-founded by Fei-Fei Li, offers Marble, a tool that creates persistent, downloadable 3D environments from text, images, videos, or other inputs. Crucially, Marble is closer to a spatial content-creation tool: its worlds can be moved through, edited, and exported into downstream workflows. That makes it different from GameNGen, Oasis, or Genie-style systems, where the playable experience itself is produced live through frame-by-frame generation.
World models that replace the engine. GameNGen, Oasis, the Genie family, DIAMOND, MineWorld, Matrix-Game. These generate playable observations directly instead of loading a normal authored scene into Unity or Unreal. Some newer systems add memory and consistency mechanisms, but they still do not expose the durable, inspectable, developer-controlled state model of a traditional game engine. This is the subject here.

A quick decision rule for any article you read: if the system produces a file you load into Unity, it's category 1 or 2. If the system is the thing you're playing, with frames generated live, it's category 3: a world model.

Infographic titled Three Meanings of AI Game Engine: category 1 is AI tools inside a traditional engine for assets, textures and NPC behavior; category 2 is authored 3D-space generators that export scenes; category 3 is world models that replace the engine and generate the interactive frame by frame. A banner notes this article is about category 3.

How a Model Generates a Game With No Engine

A world model learns what a game looks like in motion, then predicts the next frame conditioned on recent frames plus the player's current input. Unlike a traditional engine, it does not expose clean variables such as "the door is open," "this enemy is dead," or "the player is at coordinate X." In early frame-prediction systems, the model mostly learns that certain visual states tend to follow certain inputs. Play is just running that learned prediction loop fast enough to feel interactive.
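
The loop described above can be sketched in a few lines. This is a toy illustration, not any published system's code: `toy_model` is a hypothetical stand-in for a trained network, the "frames" are four-pixel tuples, and the context length of 4 is arbitrary. What it does show accurately is the shape of the loop: the model only ever sees recent frames and recent actions, and each frame it emits is fed back in as context.

```python
from collections import deque
from typing import Callable, Deque, List, Sequence, Tuple

Frame = Tuple[int, ...]   # toy "image": a flat tuple of pixel values
Action = str              # e.g. "left", "right", "noop"
Model = Callable[[Sequence[Frame], Sequence[Action]], Frame]


def toy_model(frames: Sequence[Frame], actions: Sequence[Action]) -> Frame:
    """Hypothetical stand-in for a trained network: rotates the last
    frame left or right depending on the most recent action."""
    last = frames[-1]
    if actions[-1] == "right":
        return last[-1:] + last[:-1]
    if actions[-1] == "left":
        return last[1:] + last[:1]
    return last


def play(model: Model, first_frame: Frame, inputs: List[Action],
         context_len: int = 4) -> List[Frame]:
    """Run the prediction loop; every output frame becomes future context."""
    frames: Deque[Frame] = deque([first_frame], maxlen=context_len)
    actions: Deque[Action] = deque(maxlen=context_len)
    history = [first_frame]
    for action in inputs:
        actions.append(action)
        next_frame = model(list(frames), list(actions))
        frames.append(next_frame)   # the model's own output is its next input
        history.append(next_frame)
    return history


frames = play(toy_model, (1, 0, 0, 0), ["right", "right", "left"])
```

Note what is absent: there is no player coordinate anywhere in `play`. The position of the lit pixel exists only inside the frames themselves.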

GameNGen is the cleanest worked example, because the paper lays out every step. The pipeline runs in two phases. First, a reinforcement-learning agent plays thousands of sessions of DOOM, and every session is recorded as a stream of frames paired with the actions that produced them. Second, a modified Stable Diffusion v1.4 is trained on that data to predict the next frame given the previous frames and the player's action. The action is baked directly into the conditioning, and that's the trick that makes it a game and not just a video generator. Your keypress is part of the prompt for the next image.

The hard part is speed. A normal diffusion model runs 20 to 50 denoising steps to turn noise into an image, which is far too slow for real-time play. GameNGen cuts that to 4 denoising steps, bringing total inference to roughly 50 milliseconds per frame: fast enough for 20 FPS on a single TPU at DOOM's native 320×240 resolution. Human raters could only do slightly better than chance at telling short clips of the simulation apart from real DOOM footage.
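
To see why the step count matters so much, here is a back-of-the-envelope latency model. The split into 10 ms per denoising step plus 10 ms of fixed overhead is an illustrative assumption, chosen only so that 4 steps total the roughly 50 ms quoted above; real costs depend on the hardware and the model.

```python
def frame_latency_ms(steps: int, step_ms: float = 10.0,
                     overhead_ms: float = 10.0) -> float:
    """Assumed cost model: fixed overhead plus a linear cost per denoising step."""
    return steps * step_ms + overhead_ms


def fps(latency_ms: float) -> float:
    """Frames per second achievable if each frame takes latency_ms."""
    return 1000.0 / latency_ms


for steps in (4, 20, 50):
    latency = frame_latency_ms(steps)
    print(f"{steps:>2} steps: {latency:.0f} ms/frame, {fps(latency):.1f} FPS")
```

Under these assumed numbers, 4 steps lands at 20 FPS, while 20 steps already falls below 5 FPS. That gap is the reason few-step sampling is the enabling trick rather than an optimization detail.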

Most systems in this space fall into overlapping architectural patterns:

Diffusion-based systems (GameNGen, Oasis, DIAMOND, Genie 2): start from noise and iteratively denoise into the next frame. They can produce strong short-horizon visual quality, but need speed tricks to run interactively.
Autoregressive systems (MineWorld): predict future frames or tokens sequentially, closer to how a language model predicts text. MineWorld trades frame rate for tighter action-following, landing around 4-7 FPS.
Memory- and control-augmented hybrids (Matrix-Game 2.0/3.0 and newer systems): combine real-time generation with action conditioning, camera control, and explicit memory mechanisms to reduce long-horizon drift.

One detail matters for the next section. During training, GameNGen deliberately adds noise to the past frames it conditions on. That forces the model to learn to correct its own errors instead of compounding them, a mitigation for the drift problem. It helps. It does not solve it.
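
A minimal sketch of that idea, with loud assumptions: the noise here is Gaussian, one noise level is drawn uniformly per training sample, and the function and variable names are invented for illustration. The paper has the actual recipe. The point is only that the model is trained on imperfect context while the target stays clean, so it practices recovering from the kind of errors it will make at inference time.

```python
import random
from typing import List, Tuple

Frame = List[float]  # toy frame: flat list of pixel intensities


def corrupt_context(context: List[Frame], max_sigma: float,
                    rng: random.Random) -> Tuple[List[Frame], float]:
    """Return a noisy copy of the context frames plus the noise level used.

    Assumption: a single Gaussian noise level is sampled per training
    example and applied to every context frame.
    """
    sigma = rng.uniform(0.0, max_sigma)
    noisy = [[pixel + rng.gauss(0.0, sigma) for pixel in frame]
             for frame in context]
    return noisy, sigma


rng = random.Random(0)
clean_context = [[0.2, 0.4, 0.6], [0.3, 0.5, 0.7]]
noisy_context, sigma = corrupt_context(clean_context, max_sigma=0.1, rng=rng)
# A training example would then pair (noisy_context, action) with the
# clean next frame as the target.
```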

Diagram of how next-frame prediction works in five steps: recent frames, player input, the world model running denoising diffusion steps, the predicted next frame, and a prediction loop that repeats at real time for interactive speed.

The Lineage: From Genie 1 to Genie 3 in Two Years

The single most striking thing about this field is the slope. In February 2024, Genie 1 generated controllable 2D platformers at 256×256. Eighteen months later, Genie 3 was generating navigable 3D worlds from a text prompt at 720p and 24 FPS. That is the trajectory worth paying attention to: not any single demo, but the rate of change between them.

Read as one progression, the story goes like this. Genie 1 (DeepMind, ICML 2024) proved you could learn interactive environments from unlabeled video. GameNGen (Google, ICLR 2025) showed the same idea could run a real, fast-paced game (DOOM) in real time. Oasis (Decart, October 2024) brought it to Minecraft and made it publicly playable. Genie 2 (DeepMind, December 2024) jumped to 3D worlds generated from a single image. DIAMOND (NeurIPS 2024) made the approach open-source and runnable on a consumer GPU. GameGen-X (academic, 2024) and MineWorld (Microsoft Research, 2025) pushed the open ecosystem further. Genie 3 (August 2025; public as Project Genie in January 2026) reached real-time 3D from text. Matrix-Game 2.0 (Skywork AI, 2025) pushed open-source, real-time streaming generation to 25 FPS, and Matrix-Game 3.0 (2026) attacked the memory problem more directly with a long-horizon memory architecture.

This is, in a real sense, the other end of the neural-rendering trend. Neural rendering is AI replacing individual stages of the graphics pipeline (upscaling here, shading there) while the engine keeps running. World models are AI replacing the pipeline entirely. If you read the two together, neural rendering is the "AI eats the parts" story and this is the "AI eats the whole thing" story. Each is the other's logical next step.

The specs for the major systems live in the table below; the point of the narrative is the arc, not the numbers.

System	Developer	Year	Approach	Resolution / FPS	Open-source?	Source
Genie 1	Google DeepMind	2024	Latent action	256×256	No	arXiv
GameNGen	Google	2024	Diffusion	320×240 / 20 FPS	No	arXiv
Oasis	Decart + Etched	2024	Diffusion (Forcing)	360p / 20 FPS	Partial (500M ckpt)	Project
Oasis 3	Decart	2026	API-accessible interactive world model for physical AI	Real-time API preview	No	Decart / TechCrunch
Genie 2	Google DeepMind	2024	Autoregressive latent diffusion	N/A	No	DeepMind
DIAMOND	Geneva / Edinburgh / MSR	2024	Diffusion	Atari / CS:GO	Yes (MIT)	arXiv
GameGen-X	Academic	2024	Diffusion transformer	N/A	Yes	arXiv
MineWorld	Microsoft Research	2025	Autoregressive	4-7 FPS	Yes	arXiv
Genie 3	Google DeepMind	2025	General-purpose real-time world model	720p / 24 FPS	No	DeepMind
Matrix-Game 2.0	Skywork AI	2025	Few-step autoregressive diffusion	25 FPS on a single H100	Yes	Project
Matrix-Game 3.0	Skywork AI	2026	Memory-augmented interactive world model	Up to 40 FPS at 720p with a 5B model	Yes	Project / arXiv

Timeline titled Rapid Evolution of Interactive World Models showing Genie 1 in 2024, GameNGen in 2024, Oasis in 2024, Genie 2 in 2024, DIAMOND in 2024, MineWorld in 2025, Genie 3 in 2025, and Matrix-Game 3.0 in 2026, illustrating the move from controllable 2D worlds to real-time 3D interactive generation in roughly two years.

Why These Worlds Fall Apart

These systems still break in four important ways, but the failure mode is not just "not enough compute." More GPUs can improve resolution, latency, and model scale, but production-grade coherence needs better memory, state tracking, and control architecture. A model that predicts plausible frames is not the same thing as an engine with explicit rules, inspectable variables, deterministic state updates, and save/load semantics. Each limitation below is something the model structurally cannot do, not something it simply hasn't gotten good enough at yet.

No Persistent World State

These systems do not expose variables in the way a traditional engine does. A normal engine stores the world as data: this chest is open, this enemy is dead, the player is at coordinate (412, 88). In early frame-prediction systems, there is no durable engine state in that game-development sense. The model mostly relies on recent visual context and learned priors, so objects can change, vanish, or reappear incorrectly once they leave view. Newer systems are adding explicit memory and consistency mechanisms, but they still do not expose the kind of clean, debuggable world state a traditional engine gives developers.

In weaker or early frame-prediction systems, a chest you opened can reappear closed, a monster you killed can walk back in, and a structure you built can dissolve once it leaves frame. Players described the original Oasis demo as having "dream logic": you turn, and you may not return to exactly the same place. Stronger memory makes these failures rarer, but it does not turn the model's internals into a state layer a developer can read, edit, or save.
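
For contrast, this is roughly what "the world as data" looks like on the engine side. It is a deliberately tiny sketch: the `WorldState` fields and the identifiers `chest_1` and `imp_3` are made up for illustration, and a real engine's state is far larger. The property that matters is that the state can be written out and read back exactly.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, Tuple


@dataclass
class WorldState:
    """Explicit, inspectable game state (illustrative fields only)."""
    player_pos: Tuple[int, int] = (412, 88)
    chest_open: Dict[str, bool] = field(default_factory=dict)
    enemy_alive: Dict[str, bool] = field(default_factory=dict)


def save(state: WorldState) -> str:
    """Serialize the whole world to a string: a save file."""
    return json.dumps(asdict(state), sort_keys=True)


def load(blob: str) -> WorldState:
    """Rebuild the world from a save file."""
    raw = json.loads(blob)
    raw["player_pos"] = tuple(raw["player_pos"])  # JSON has no tuple type
    return WorldState(**raw)


state = WorldState()
state.chest_open["chest_1"] = True    # the chest stays open, on screen or not
state.enemy_alive["imp_3"] = False    # the enemy stays dead
restored = load(save(state))
```

A pure frame-prediction model has nothing equivalent to pass to `save`. Its only record of the open chest is whatever recent frames and internal memory happen to retain.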

The Context-Window Ceiling

Coherence is bounded by the model's memory design, not just by raw visual quality. GameNGen uses a short direct frame history but still reports stable multi-minute play sessions through learned correction. Genie 2 introduced visible long-horizon memory examples and maintained consistency for up to a minute, with most examples lasting 10-20 seconds. Genie 3 pushes continuous interaction to a few minutes, and Matrix-Game 3.0 directly attacks the problem with long-horizon memory. The unsolved issue is not "can the model last more than a few seconds?" It is whether it can preserve a reliable, inspectable, saveable world state for the length and complexity of a real game.
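
The arithmetic behind a short direct history is simple. The sketch below assumes a 64-frame context at 20 FPS, a figure picked because it matches the "little over 3 seconds" cited earlier; treat the 64 as an assumption rather than a quoted spec. The `deque` shows the consequence: once the window is full, every new frame pushes the oldest one out of direct view.

```python
from collections import deque


def context_seconds(context_frames: int, fps: float) -> float:
    """How much wall-clock time a fixed frame window covers."""
    return context_frames / fps


window = deque(maxlen=64)            # assumed context length
for frame_index in range(200):       # play for 200 frames (10 s at 20 FPS)
    window.append(frame_index)

oldest_visible = window[0]           # frames before this are out of the window
seconds_covered = context_seconds(len(window), fps=20)
```

After 200 frames the earliest frame still in the window is number 136. Anything the player saw before that can only survive through learned priors or a separate memory mechanism, which is where the newer systems are focusing their effort.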

Stochastic, Not Deterministic

The output is probabilistic by default. Run the same setup twice and you may get different frames unless the system is heavily constrained. For an art tool, that can be useful; for many production games, it is a problem. Multiplayer, competitive balance, replays, skill progression, and save/load all depend on reliable state transitions. A world model can be made more repeatable, but a production game would still need a deterministic logic layer or state system to guarantee the behavior players and developers expect.
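
A toy comparison makes the distinction concrete. Both functions below are invented for illustration: `sample_frames` stands in for a sampling-based generator and `engine_frames` for a rule-based state update. Fixing the seed makes the sampler repeatable, but the repeatability lives in the random number generator, not in game rules anyone can inspect.

```python
import random
from typing import List


def sample_frames(seed: int, steps: int) -> List[int]:
    """Toy stochastic generator: each 'frame' is one sampled value."""
    rng = random.Random(seed)
    return [rng.randrange(256) for _ in range(steps)]


def engine_frames(steps: int) -> List[int]:
    """Toy deterministic engine: the next state follows a fixed rule."""
    state, out = 0, []
    for _ in range(steps):
        state = (state * 5 + 1) % 256
        out.append(state)
    return out


same_seed_match = sample_frames(42, 50) == sample_frames(42, 50)
different_seed_match = sample_frames(42, 50) == sample_frames(43, 50)
engine_match = engine_frames(50) == engine_frames(50)
```

Here `same_seed_match` and `engine_match` are both true, while `different_seed_match` is false. For replays or lockstep multiplayer, every client would need the same seed, the same model weights, and bit-identical numerics, which is a much more fragile guarantee than re-running a rule.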

Is It a Game, or Video Prediction With a Keyboard?

The sharpest critique is that these systems are not simulating worlds in the traditional game-engine sense; they are generating plausible visual continuations and letting you steer them. A game engine encodes rules; a world model encodes plausibility. One commenter in the GameNGen Hacker News thread called it "the world's least efficient video compression," and as a provocation it lands: the model has effectively memorized a distribution over gameplay footage and is interpolating through it in response to your inputs. There's a clean test for this, in the callout below.

The "drift when standing still" tell. If a world model were truly computing a world, a motionless player should yield a stable image: nothing is changing, so nothing should change. In weaker or early frame-prediction systems, even standing still can reveal drift: small details shift because the model is predicting the next plausible frame rather than rendering from a fixed, inspectable world state. That is the tell. The scene may look stable for a while, but the system is still generating continuity rather than reading it from a conventional engine.
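
The tell can be mimicked with a one-pixel toy. This is a simplified sketch under stated assumptions: the "frame" is a single brightness value, and the model's prediction error is modeled as small Gaussian noise added at every step, which is a simplification of how real models err. The engine renders from fixed state each time; the generator builds each frame on its own previous output.

```python
import random
from typing import List


def render_from_state(brightness: float) -> float:
    """Engine path: the frame is read from stored state every time."""
    return brightness


def predict_from_previous(previous_frame: float, rng: random.Random,
                          error: float) -> float:
    """Generator path: the frame is the previous frame plus a small error."""
    return previous_frame + rng.gauss(0.0, error)


def max_drift(frames: List[float], reference: float) -> float:
    """Largest deviation from the reference value across all frames."""
    return max(abs(frame - reference) for frame in frames)


rng = random.Random(7)
state = 0.5                                   # player stands still: state never changes
engine = [render_from_state(state) for _ in range(300)]
generated = [state]
for _ in range(299):
    generated.append(predict_from_previous(generated[-1], rng, error=0.01))

engine_drift = max_drift(engine, state)       # exactly 0.0
model_drift = max_drift(generated, state)     # greater than 0.0
```

Training tricks such as noise augmentation work by pulling that accumulated error back toward plausible frames, which is why stronger systems can look stable for a long time without ever reading from fixed state.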

Key takeaway: the determinism and persistence limits are architectural problems, not issues that raw scaling will solve by itself. Any system that needs a reliable, repeatable, saveable world still needs a deterministic logic layer, an explicit memory/state system, or a hybrid engine design that current frame-generation approaches do not provide on their own.

Infographic titled Why World Models Drift with four panels: no traditional world state means no clean inspectable engine variables; memory limits make long-horizon consistency difficult; probabilistic output means the same setup can yield different results; and drift over time means continuity is generated rather than retrieved from stable engine state.

What It Actually Costs to Run

Real-time generation is expensive, and the headline numbers hide a lot. GameNGen's "single TPU" sounds cheap until you remember it is simulating DOOM at 320×240, not a modern high-resolution game. The original Oasis demo ran in real time on H100-class infrastructure, and Decart's newer Oasis 3 makes the economics more concrete. Decart positions Oasis 3 as an API-accessible interactive world model for physical AI, and TechCrunch reported preview access pricing at $0.02 per second, or $1.20 for a 60-second session. That is useful for testing, simulation, and research workflows, but it is still a very different cost model from shipping a normal game client.
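
Per-second pricing is easier to reason about once it is scaled to play sessions. The sketch below uses the reported $0.02 per second and assumes the price stays linear with no volume discounts, which is an assumption, not something the source states. The 10-hour playthrough is a hypothetical figure for scale. Amounts are computed in integer cents to avoid floating-point rounding surprises.

```python
CENTS_PER_SECOND = 2  # $0.02 per second, the reported preview price


def session_cost_dollars(seconds: int,
                         cents_per_second: int = CENTS_PER_SECOND) -> float:
    """Cost of one generated session, assuming purely linear pricing."""
    return seconds * cents_per_second / 100


one_minute = session_cost_dollars(60)            # 1.2
one_hour = session_cost_dollars(60 * 60)         # 72.0
ten_hours = session_cost_dollars(10 * 60 * 60)   # 720.0, hypothetical playthrough
```

At $72 per player-hour under these assumptions, the pricing fits bounded simulation and evaluation runs far better than open-ended consumer play.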

To put scale on it: the hardware picture is moving fast. Some open research systems now report real-time or near-real-time generation on single H100-class GPUs, while frontier consumer-facing systems remain cloud-hosted and often undisclosed. The firm point is not "one GPU can never do it"; it is that production-quality, low-latency, high-resolution world generation is still a serious infrastructure problem.

The counterpoint is that the floor is dropping fast, and the open-source tier is real. DIAMOND trained in about 12 days on a single RTX 4090 and, according to its official project page, can be played at roughly 10 FPS on an RTX 3090. MineWorld and Matrix-Game are publicly runnable. So while the most impressive demos still depend on specialized, expensive infrastructure, a curious developer can already run some real world-model experiments on accessible hardware. Both things are true at once: frontier-quality interaction is costly, and the entry point for experimentation is already real.

So Will AI Replace Unity and Unreal?

Not in the near term, and the reason is the limits above, not a lack of investment. The market took this seriously. Google rolled out Project Genie to U.S. Google AI Ultra subscribers on January 29, 2026, and the next day several gaming stocks sold off sharply: The Verge reported Unity down 24.22%, Roblox down 13.17%, and Take-Two down 7.93% at Friday's close. The anxiety also showed up inside the industry: GDC's 2026 survey found that 52% of game professionals saw generative AI as having a negative impact on games, up from 30% the previous year. But stock moves and survey anxiety are reactions to a demo. The architecture is what sets the actual timeline.

Reading the trajectory as it stands, and this is my read, not a settled forecast, the next 1-3 years likely keep world models in research prototypes, simulation infrastructure, robotics/physical-AI training, and narrow consumer-facing demos rather than full commercial games. The plausible 3-7 year path is hybrid, not replacement: a world model handling visual generation sitting on top of a lightweight deterministic state machine that holds the actual game logic. That's augmentation. The trajectory is steep enough (DOOM at 320×240 to 720p-from-text in roughly a year) that confident long-term predictions are unwise, so I won't make one.
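
What that hybrid might look like, in the smallest possible form: a deterministic state machine owns the truth, and the generative model is only asked to draw it. Everything here is hypothetical, including the `GameState` fields, the conditioning format, and `fake_world_model_render`, which returns a string where a real model would return pixels. No shipping system is known to work exactly this way.

```python
from dataclasses import dataclass, replace
from typing import Dict


@dataclass(frozen=True)
class GameState:
    """Authoritative game logic state (illustrative fields)."""
    door_open: bool = False
    player_x: int = 0


def step(state: GameState, action: str) -> GameState:
    """Deterministic logic layer: same state and action, same result."""
    if action == "open_door":
        return replace(state, door_open=True)
    if action == "move_right":
        return replace(state, player_x=state.player_x + 1)
    return state


def describe_for_renderer(state: GameState) -> Dict[str, object]:
    """Conditioning payload for a generative renderer (hypothetical format)."""
    return {"door": "open" if state.door_open else "closed",
            "player_x": state.player_x}


def fake_world_model_render(conditioning: Dict[str, object]) -> str:
    """Placeholder for a learned renderer; a real one would emit pixels."""
    return f"<frame door={conditioning['door']} x={conditioning['player_x']}>"


state = GameState()
for action in ["move_right", "open_door", "move_right"]:
    state = step(state, action)
    frame = fake_world_model_render(describe_for_renderer(state))
```

In this split the door cannot silently close itself: the renderer may draw it badly, but `state.door_open` remains true and can be saved, replayed, and debugged.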

The detail that reframes the whole question: DeepMind ties world models to agent training and AGI research, while Project Genie shows the same technology as a consumer-facing world-creation prototype. Decart's Oasis 3 is even more explicitly aimed at robotics, autonomous vehicles, and physical-AI simulation. Consumer games matter to the story, but the near-term commercial pull may come from simulation, training, and prototyping first.

Frequently Asked Questions

What Is the Difference Between a World Model and a Game Engine?

A game engine encodes explicit rules and stores game state as data: it's deterministic, inspectable, and debuggable. A world model like GameNGen predicts plausible next frames from recent frames plus your input, without the traditional engine-style state, rules, and object variables developers normally inspect and control. The engine computes the world; the world model guesses it. That's why one is repeatable and the other isn't.

How Does GameNGen Work?

GameNGen runs DOOM in three broad steps. First, a reinforcement-learning agent plays thousands of DOOM sessions, recorded as frames paired with actions. Second, a modified Stable Diffusion v1.4 learns to predict the next frame conditioned on past frames plus the player's input. Third, inference is cut to 4 denoising steps, producing roughly 20 FPS on a single TPU at 320×240.

Why Does the World in Oasis Keep Changing When You Turn Around?

In the original Minecraft-like Oasis demo, the world could change when you turned around because the system did not preserve a traditional, engine-style world state. It generated the next view from recent visual context and learned priors, so out-of-view objects could return in altered form. Newer systems are adding stronger memory and consistency mechanisms, but that original "dream logic" is exactly what made the limitation easy to notice.

How Long Can an AI-Generated Game World Stay Consistent Before It Drifts?

It depends on the model. Early systems often drift within seconds to tens of seconds, but newer systems are extending that horizon. GameNGen has a little over 3 seconds of direct context yet can remain stable over longer gameplay through learned heuristics. Genie 2 mostly showed 10-20 second examples and up to a minute in some cases. Genie 3 raises the claim to a few minutes at 720p/24fps, and Matrix-Game 3.0 reports minute-long memory consistency. The unsolved problem is not short clips; it is durable, inspectable, saveable world state.

Will AI Replace Game Engines Like Unity or Unreal?

Not in the near term. The blockers are architectural more than purely a scale problem: production games need persistent state, reliable logic, deterministic behavior, and save/load semantics. Scaling helps quality and coherence, but it does not by itself create a traditional game loop. The plausible path is hybrid: a world model generating visuals on top of a deterministic engine for game logic, which is augmentation rather than replacement. DeepMind presents world models as important for agent training and AGI research, while Project Genie also makes the technology visible as a consumer-facing world-creation prototype. Decart's Oasis 3 is the cleaner example of a model explicitly aimed at robotics, autonomous vehicles, and physical-AI simulation.

Can You Play Any of These AI-Generated Games Right Now?

Yes, several. Decart's original Oasis had a public Minecraft-like web demo, and its newer Oasis 3 Preview is now API-accessible for real-time world-model experiments. Google's Project Genie also became available to Google AI Ultra subscribers in the U.S. in January 2026. For the open-source tier, DIAMOND and MineWorld can be downloaded and run on consumer GPUs, with DIAMOND reported at around 10 FPS on an RTX 3090.
