Generative World Explorer (Genex)

Author
Rahul Joshi
Founder, CEO
Humans navigate the 3D world by perceiving, acting, and engaging socially. Through these interactions, they form mental models (Johnson-Laird, 1983) that simulate and represent reality. These models enable reasoning, problem-solving, and prediction using language and imagery.
Inspiration for Artificial Intelligence: AI research draws on human cognition to build computational analogs of mental models, known as world models (WMs). Like their human counterparts, these models predict future states of the world to aid decision-making (Ha & Schmidhuber, 2018; LeCun, 2022; Diester et al., 2024).
Advancements and Gaps in Generative Vision Models: Generative vision models focus on simulating state transitions but often neglect explicit modeling of observations and beliefs. This omission is significant because agents in real-world environments frequently have only partial observations.
Importance of Observation and Belief Modeling: In partially observable settings, an agent is modeled as acting in a partially observable Markov decision process (POMDP). It forms and updates beliefs, that is, estimates of the state of its environment, from limited observations. Explicitly modeling these beliefs is essential for making rational decisions, because they can be refined as the agent explores more of the environment.
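To make the idea of "updating a belief from an observation" concrete, here is a minimal sketch of the observation step of a discrete Bayes filter. It is a generic textbook update, not the method of the work discussed here. The state names, the prior, and the likelihood values are all invented for illustration, and the transition step of a full POMDP belief update is omitted.

```python
def update_belief(belief, likelihood):
    """Bayes rule over discrete states: posterior(s) is proportional to likelihood(s) * prior(s)."""
    unnormalized = {state: belief[state] * likelihood[state] for state in belief}
    total = sum(unnormalized.values())
    if total == 0:
        raise ValueError("observation has zero probability under the current belief")
    return {state: p / total for state, p in unnormalized.items()}


# Hypothetical two-state world: is the road ahead blocked or clear?
prior = {"road_blocked": 0.2, "road_clear": 0.8}

# Assumed probability of seeing brake lights ahead under each state.
brake_lights_likelihood = {"road_blocked": 0.9, "road_clear": 0.1}

posterior = update_belief(prior, brake_lights_likelihood)
print(posterior)
```

With these made-up numbers, a single observation moves most of the probability mass onto `road_blocked`, which is the sense in which a belief is refined by further observation.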
Typically, in an unfamiliar environment, an embodied agent must acquire new observations through physical exploration to understand its surroundings, which is inevitably costly, unsafe, and time-consuming. However, if the agent can imagine hidden views by mentally simulating exploration, it can update its beliefs without physical effort. This enables the agent to take more informed actions and make more robust decisions.
Consider the scenario in Fig. 1: Suppose you are approaching an intersection. The light ahead is green, but you suddenly notice that the yellow taxi in front has come to an abrupt, unexpected stop. A surge of confusion and anxiety hits you, leaving you uncertain about the reason behind its halt. Physically investigating the situation would be unsafe and even impossible at that moment. However, by standing in the taxi’s position in your own imagination and envisioning the surroundings from its perspective, you sense a possible motivation behind the taxi’s puzzling behavior: perhaps an ambulance is approaching. Consequently, you clear the path for the emergency vehicle, a timely and decisive choice, thanks to your imagination.

To build agents capable of imaginative exploration in a physical world, we propose the Generative World Explorer (Genex), a video generative model that conditions on the agent’s current egocentric (first-person) view, takes the intended movement direction as an action input, and generates future egocentric observations. Although prior works can render novel views of a scene from 3D models, their limited render distance and field of view (FOV) constrain the range and coherence of the generated video. Video generation, by contrast, offers the potential to extend the exploration range.
To address the FOV constraint, we utilize panoramic representations to train our video diffusion models with spherical-consistent learning. As a result, the proposed Genex model achieves impressive generation quality while maintaining coherence and 3D consistency throughout long-distance exploration.
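The following sketch illustrates why a panoramic representation sidesteps the FOV limit. It assumes an equirectangular layout, which is an assumption made for illustration (the text above only says "panoramic"), and the function names are hypothetical. Every pixel maps to a direction on the unit sphere, so the image covers all viewing directions, and turning the viewer about the vertical axis amounts to a circular shift of the columns.

```python
import math


def pixel_to_direction(u, v, width, height):
    """Map an equirectangular pixel (column u, row v) to a unit direction (x, y, z)."""
    lon = (u + 0.5) / width * 2.0 * math.pi - math.pi   # longitude in [-pi, pi)
    lat = math.pi / 2.0 - (v + 0.5) / height * math.pi  # latitude in (-pi/2, pi/2)
    x = math.cos(lat) * math.sin(lon)
    y = math.sin(lat)
    z = math.cos(lat) * math.cos(lon)
    return (x, y, z)


def rotate_yaw(panorama, degrees):
    """Turn the viewer about the vertical axis by circularly shifting columns.

    `panorama` is a list of rows; positive `degrees` brings content from
    higher column indices toward the start of each row.
    """
    width = len(panorama[0])
    shift = round(degrees / 360.0 * width) % width
    return [row[shift:] + row[:shift] for row in panorama]


# Tiny 2x4 "panorama" with integers standing in for pixels.
pano = [[0, 1, 2, 3],
        [4, 5, 6, 7]]
turned = rotate_yaw(pano, 90)
direction = pixel_to_direction(1, 0, width=4, height=2)
print(turned, direction)
```

Because the columns wrap around, no content is lost when the viewer turns, unlike a perspective image with a fixed FOV.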
Genex can be applied to embodied decision-making. With Genex, the agent can imagine hidden views through imaginative exploration and revise its belief accordingly. The revised belief allows the agent to take more informed actions. Technically, we define the agent’s behavior as an extension of the POMDP with imagination-driven belief revision. Notably, Genex extends naturally to multi-agent scenarios, where one agent can mentally navigate to the positions of other agents and update its own beliefs based on the imagined beliefs of those agents.
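The loop described above (imagine a hidden view, revise the belief, then act) can be sketched as follows. This is a toy illustration under stated assumptions, not the formulation used by Genex: `imagine_taxi_view` is a hypothetical stand-in for the generative model, the imagined view is treated as ordinary evidence with an assumed likelihood, and all probabilities and utilities are invented, loosely following the taxi scenario.

```python
def revise_belief(belief, likelihood):
    """Bayes-style revision: posterior(s) is proportional to likelihood(s) * prior(s)."""
    unnormalized = {s: belief[s] * likelihood[s] for s in belief}
    total = sum(unnormalized.values())
    if total == 0:
        raise ValueError("imagined observation is impossible under the current belief")
    return {s: p / total for s, p in unnormalized.items()}


def choose_action(belief, utility):
    """Pick the action with the highest expected utility under the belief."""
    def expected(action):
        return sum(p * utility[action][s] for s, p in belief.items())
    return max(utility, key=expected)


def imagine_taxi_view():
    """Hypothetical stand-in for a generative model: returns a label for the imagined view."""
    return "ambulance_visible"


# Invented numbers for illustration only.
prior = {"ambulance_coming": 0.05, "no_ambulance": 0.95}
observation_model = {
    "ambulance_visible": {"ambulance_coming": 0.9, "no_ambulance": 0.05},
}
utility = {
    "clear_path": {"ambulance_coming": 10.0, "no_ambulance": -1.0},
    "keep_lane": {"ambulance_coming": -10.0, "no_ambulance": 1.0},
}

action_before = choose_action(prior, utility)
revised = revise_belief(prior, observation_model[imagine_taxi_view()])
action_after = choose_action(revised, utility)
print(action_before, "->", action_after)
```

In this toy setup the preferred action changes once the belief has been revised with the imagined view, without the agent having physically moved.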
In summary, our key contributions are as follows:
- Imaginative Exploration: We introduce Genex, a novel framework that enables agents to imaginatively explore the world with high generation quality and exploration consistency.
- Integration with Decision Processes: We present one of the first approaches to integrate generative video into the partially observable decision process by introducing imagination-driven belief revision.
- Applications in Multi-Agent Systems: We highlight the compelling applications of Genex, including multi-agent decision-making.