World Models: the operating system for spatial intelligence

Simulation is becoming a foundational capability for Physical AI, enabling machines to learn in virtual environments and operate more effectively in real-world contexts. 

Simulation is the new reality

3D technologies are no longer confined to entertainment or visualization. They have become a core enabler for spatial simulation, now widely adopted across product design, robotics, defense, and industrial training. This shift marks a transition from static digital assets to dynamic, physics-aware environments, where AI systems can learn how the world behaves rather than simply analyzing isolated data points. 

Despite recent advances in AI, most systems still struggle to reason about space, motion, and physical interaction. As Dr. Fei-Fei Li, one of the pioneers of modern AI, has observed, an LLM can explain quantum physics but cannot judge how far apart two objects are in an image or mentally rotate a cube. This limitation highlights the importance of spatial intelligence: the ability to model how objects relate to one another, how actions unfold over time, and how physical constraints shape outcomes. World Models address this gap by enabling AI to learn through structured, simulated environments that reflect the rules of the physical world. 

The landscape of world models 

How can AI systems learn about environments they have never directly experienced? Training exclusively in real-world conditions is often impractical, expensive, or unsafe. 

To overcome these constraints, a new generation of world model architectures has emerged, designed to reconstruct, generate, and simulate reality in ways that support learning through interaction. 

Key world model approaches 

  • Marble (World Labs) 
    A multimodal world model capable of reconstructing and simulating 3D environments from images, enabling interaction by both humans and AI agents. 

  • SAM 3D (Meta) 
    A reconstruction engine that transforms 2D objects and bodies into fully digitized 3D assets. 

  • Genie 3 (Google) 
    A general-purpose world model that generates interactive environments from text prompts, enabling real-time navigation in AI-generated spaces. 

  • HunyuanWorld-Mirror (Tencent) 
    A feed-forward model for comprehensive 3D geometric prediction, covering depth estimation, surface normals, point clouds, and novel view synthesis. 

  • Cosmos 2.5 (NVIDIA) 
    A suite of world foundation models unifying Text2World, Image2World, and Video2World generation, with strong support for Sim2Real workflows. 

  • SIMA 2 (Google) 
    A generalist agent designed to reason and act across diverse simulated environments, demonstrating transferable embodied capabilities. 

  • GWM-1 (Runway)
    An autoregressive general world model built on Gen-4.5 that simulates reality in real time, featuring three variants: Worlds for explorable environments, Avatars for conversational characters, and Robotics for synthetic training data generation.

While these approaches differ in implementation, they share a common objective: enabling AI systems to learn, test, and refine behaviors within simulated environments before deployment in real-world settings. 

Deep dive: Meta V-JEPA 2 and applied experimentation

Among these architectures, V-JEPA 2 (Video Joint Embedding Predictive Architecture) adopts a predictive, non-generative approach to learning physical dynamics from video data. Instead of generating pixels, it focuses on modeling how scenes evolve over time. 
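To make the predictive, non-generative idea concrete, the sketch below shows a JEPA-style training objective in PyTorch. It is a minimal illustration, not Meta's V-JEPA 2 code: the encoder, predictor, and frame shapes are simplified stand-ins. The key point is that the loss compares predicted and actual embeddings of unseen frames, never pixels.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    # Minimal JEPA-style sketch (illustrative only, not Meta's V-JEPA 2 code).
    # The predictor regresses the *embedding* of future frames from the
    # embedding of context frames, so learning happens in latent space
    # rather than in pixel space.

    class FrameEncoder(nn.Module):
        """Toy encoder mapping a batch of frames (B, 3, H, W) to latents (B, dim)."""
        def __init__(self, dim: int = 256):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.GELU(),
                nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.GELU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                nn.Linear(64, dim),
            )

        def forward(self, frames: torch.Tensor) -> torch.Tensor:
            return self.net(frames)

    class LatentPredictor(nn.Module):
        """Predicts the target embedding from the context embedding."""
        def __init__(self, dim: int = 256):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

        def forward(self, z_context: torch.Tensor) -> torch.Tensor:
            return self.net(z_context)

    def jepa_loss(context, target, encoder, target_encoder, predictor):
        """Predictive (non-generative) objective: regress target embeddings."""
        z_context = encoder(context)
        with torch.no_grad():                       # target encoder receives no gradients
            z_target = target_encoder(target)
        z_pred = predictor(z_context)
        return F.mse_loss(z_pred, z_target)

    # Random tensors stand in for "visible" and "future" frames of a clip.
    encoder, target_encoder, predictor = FrameEncoder(), FrameEncoder(), LatentPredictor()
    target_encoder.load_state_dict(encoder.state_dict())   # kept as an EMA copy in practice
    context_frames = torch.randn(4, 3, 64, 64)
    future_frames = torch.randn(4, 3, 64, 64)
    loss = jepa_loss(context_frames, future_frames, encoder, target_encoder, predictor)
    loss.backward()

Because the target is an embedding rather than pixels, the model can discard appearance details that do not matter for predicting how a scene evolves.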

At Reply, we explored this architecture through a focused experimental setup, assessing its ability to capture temporal relationships and motion patterns in complex scenarios. Key evaluation results include: 

  • 77.3% top-1 accuracy on the Something-Something v2 dataset 

  • 39.7 recall@5 on Epic-Kitchens-100 for first-person action anticipation 

Overall, these experiments confirm V-JEPA 2’s effectiveness in scenarios where understanding motion and interaction over time is critical, such as robotics and autonomous systems. 
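For readers unfamiliar with the metrics quoted above, the snippet below gives simplified, class-agnostic definitions of top-1 accuracy and recall@5 over a matrix of model scores. The official Something-Something v2 and Epic-Kitchens-100 evaluations use more elaborate, class-aware protocols, so this is only an approximation of how such numbers are computed.

    import torch

    # Simplified metric definitions (not the official benchmark scripts).

    def top1_accuracy(scores: torch.Tensor, labels: torch.Tensor) -> float:
        """Fraction of samples whose highest-scoring class matches the label."""
        return (scores.argmax(dim=1) == labels).float().mean().item()

    def recall_at_k(scores: torch.Tensor, labels: torch.Tensor, k: int = 5) -> float:
        """Fraction of samples whose label appears among the top-k predictions."""
        topk = scores.topk(k, dim=1).indices              # (N, k) class indices
        hits = (topk == labels.unsqueeze(1)).any(dim=1)
        return hits.float().mean().item()

    scores = torch.randn(8, 100)                          # 8 samples, 100 classes
    labels = torch.randint(0, 100, (8,))
    print(top1_accuracy(scores, labels), recall_at_k(scores, labels, k=5))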

From world models to embodied AI

World Models play a central role in Embodied AI, supporting the connection between perception, reasoning, and action. 

In applied robotics scenarios, these models enable systems to interpret their surroundings, plan actions, and adapt to changing conditions. This approach is particularly relevant for autonomous robots and humanoid platforms operating in unstructured or semi-structured environments such as industrial sites, logistics hubs, and healthcare facilities. 

A key enabler in this context is the use of spatial anchors: persistent digital reference points that allow AI systems to associate learned representations with precise physical locations, improving consistency and reliability across missions. 
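As a rough illustration of the concept, the sketch below models a spatial anchor as a persistent record tying a pose in the site's world frame to a learned representation. The class and field names are hypothetical, not taken from any particular spatial-computing SDK.

    from dataclasses import dataclass
    import numpy as np

    # Hypothetical illustration of a spatial anchor: a persistent reference
    # point binding a learned representation to a fixed pose in the physical
    # environment. Names and fields are illustrative, not an SDK API.

    @dataclass
    class SpatialAnchor:
        anchor_id: str
        position: np.ndarray        # (x, y, z) in the site's world frame, metres
        orientation: np.ndarray     # unit quaternion (x, y, z, w)
        embedding: np.ndarray       # learned scene representation tied to this spot

        def distance_to(self, point: np.ndarray) -> float:
            """Euclidean distance from a query point to the anchor, in metres."""
            return float(np.linalg.norm(self.position - point))

    def nearest_anchor(anchors: list[SpatialAnchor], point: np.ndarray) -> SpatialAnchor:
        """Resolve the closest anchor so a robot can reuse what it learned there."""
        return min(anchors, key=lambda a: a.distance_to(point))

    # Example: two anchors in a warehouse; find the one nearest the robot.
    anchors = [
        SpatialAnchor("dock_a", np.array([0.0, 0.0, 0.0]), np.array([0, 0, 0, 1]), np.zeros(16)),
        SpatialAnchor("shelf_3", np.array([12.5, 4.0, 0.0]), np.array([0, 0, 0, 1]), np.zeros(16)),
    ]
    robot_position = np.array([11.0, 3.5, 0.0])
    print(nearest_anchor(anchors, robot_position).anchor_id)   # -> "shelf_3"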

The road ahead

As AI systems move beyond purely conversational use cases, spatial simulation and physical reasoning are becoming increasingly important. World Models represent a foundational component in this evolution, supporting more reliable, context-aware, and physically grounded AI systems across industrial and real-world applications.