The gist: Astra combines an RL-trained vision-language model with a world simulator to improve spatial reasoning through selectively generated perspectives.
Researchers have developed a framework that equips vision-language models with spatial imagination through an integrated world simulator. The system learns when it makes sense to simulate scenes from new viewpoints in order to perform better on spatial reasoning tasks.
Vision-language models demonstrate strong visual capabilities but frequently fail at spatial reasoning: they struggle to infer hidden layouts from limited, egocentric observations, to stay consistent across different perspectives, and to reason from alternative viewpoints. The new work therefore asks how a VLM could actively incorporate imagined visual evidence by interacting with a world simulator during reasoning.
The Astra framework couples two components. Astra-VL is a VLM policy trained with reinforcement learning that learns a policy for simulator usage. Astra-WM is a Bagel-based world simulator that generates new perspectives from context images and camera movements described in natural language. To keep the simulations reliable, Astra-WM is trained with view-consistency tuning, which improves the consistency of camera pose and content across viewpoints. In the RL training stage, the system employs a two-phase curriculum that stabilizes tool exploration and guides the model to call the simulator only when imagined observations would improve on answering directly.
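The interplay between the two components can be pictured as a small tool-use loop. The following is a minimal sketch, not the authors' implementation: the names `Step`, `answer_with_imagination`, `toy_policy`, and `toy_simulator` are hypothetical, images are stood in for by plain strings, and the call budget `max_calls` is an assumption made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Step:
    """One policy decision: request a simulated view or give an answer."""
    camera_move: Optional[str] = None  # e.g. "turn around 180 degrees"
    answer: Optional[str] = None


# Hypothetical interfaces (assumptions, not the Astra API).
# Policy: (question, views, may_simulate) -> Step
Policy = Callable[[str, List[str], bool], Step]
# Simulator: (context views, camera movement in natural language) -> new view
Simulator = Callable[[List[str], str], str]


def answer_with_imagination(
    question: str,
    images: List[str],
    policy: Policy,
    simulator: Simulator,
    max_calls: int = 2,
) -> Tuple[Optional[str], List[str]]:
    """Let the policy decide, step by step, whether to imagine a new view."""
    views = list(images)
    for calls_left in range(max_calls, -1, -1):
        step = policy(question, views, calls_left > 0)
        # Stop when the policy answers directly or the budget is used up.
        if step.camera_move is None or calls_left == 0:
            return step.answer, views
        # Otherwise add the imagined observation to the visual context.
        views.append(simulator(views, step.camera_move))
    return None, views  # not reached; the last iteration always returns


def toy_policy(question: str, views: List[str], may_simulate: bool) -> Step:
    # Stand-in for the learned policy: imagine only for "behind" questions.
    if may_simulate and "behind" in question and len(views) < 2:
        return Step(camera_move="turn around 180 degrees")
    return Step(answer=f"answered from {len(views)} view(s)")


def toy_simulator(views: List[str], camera_move: str) -> str:
    # Stand-in for the world simulator: returns a label instead of an image.
    return f"imagined[{camera_move}] from {len(views)} view(s)"
```

In this toy setup, a question such as "What is behind the sofa?" triggers one simulator call before answering, while "What color is the sofa?" is answered directly from the original image. In the framework described above, that decision is learned through RL rather than hard-coded as it is here.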
In experiments, both components deliver gains: Astra-WM improves Gemini-3-Flash on MMSI-Bench from 45.1 to 49.5 points, while Astra-VL lifts the Qwen3-VL baseline from 29.8 to 38.8 points on MMSI-Bench and from 36.8 to 42.7 points on MindCube. The results indicate that simulated observations can provide spatial evidence, but that effective world-model-augmented reasoning requires the system to learn when, where, and how to imagine.
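To make the size of these improvements easier to compare, the reported scores can be turned into absolute gains. The snippet below is a small illustrative calculation using only the numbers quoted above; the dictionary keys and the helper `absolute_gains` are labels chosen here, not names from the paper.

```python
from typing import Dict, Tuple

# (baseline score, score with the Astra component), as reported above.
REPORTED: Dict[str, Tuple[float, float]] = {
    "Gemini-3-Flash + Astra-WM, MMSI-Bench": (45.1, 49.5),
    "Qwen3-VL -> Astra-VL, MMSI-Bench": (29.8, 38.8),
    "Qwen3-VL -> Astra-VL, MindCube": (36.8, 42.7),
}


def absolute_gains(scores: Dict[str, Tuple[float, float]]) -> Dict[str, float]:
    """Return the point gain per setting, rounded to one decimal place."""
    return {name: round(after - before, 1) for name, (before, after) in scores.items()}


if __name__ == "__main__":
    for name, gain in absolute_gains(REPORTED).items():
        print(f"{name}: +{gain} points")
```

By this calculation the gains are 4.4, 9.0, and 5.9 points respectively, so the largest reported jump comes from the RL-trained policy on MMSI-Bench.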
Source: arxiv.org · Published June 3, 2026
Lumi AI News — AI-assisted curation pursuant to Article 50 EU AI Act. Paraphrase and classification by Lumi News Pipeline v1.6.5.