
Optical Reasoning: Images Instead of Text as Reasoning Medium in AI Models

Bottom line: Optical reasoning uses images as the primary reasoning medium, cutting token consumption by an average of 28.57 percent on language tasks and 16 percent on multimodal tasks.

Researchers have demonstrated that AI models can represent their reasoning processes directly in images rather than relying exclusively on textual intermediate steps. This approach reduces token consumption and improves the efficiency of multimodal language models.

Traditionally, Large Language Models (LLMs) use chain-of-thought prompting to improve their performance through explicit textual intermediate steps. This approach has already been extended to multimodal language models (MLLMs). More recent research pushes the boundary further: instead of combining text with individual pieces of visual evidence, it examines whether images alone can serve as a reasoning medium.

The concept of “Optical Reasoning” realizes this idea in two variants: the typographic approach optimizes visual layouts for compact rationale representation, while the graphical approach combines text and graphical elements into structured visual rationales. Tested on mathematical, scientific, and multimodal reasoning benchmarks, both variants match or exceed traditional text-based reasoning. The token savings are considerable: language tasks show an average reduction in token consumption of 28.57 percent, while multimodal tasks see a 16 percent reduction. Overall, optical reasoning is reported to achieve 1.96 times better token efficiency than pure text reasoning.
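To make the reported percentages concrete, the following is a minimal sketch of the arithmetic, not a reproduction of the paper's evaluation. The two reduction constants come from the figures above; the baseline of 1,000 tokens and the function names are hypothetical, chosen only for illustration. The 1.96 times efficiency figure is not derived here, because the source does not state how it is defined.

```python
# Average reductions as reported in the article.
LANGUAGE_REDUCTION = 0.2857    # 28.57 percent on language tasks
MULTIMODAL_REDUCTION = 0.16    # 16 percent on multimodal tasks


def tokens_after_reduction(baseline_tokens: int, reduction: float) -> float:
    """Return the expected token count after applying a fractional reduction."""
    if not 0.0 <= reduction < 1.0:
        raise ValueError("reduction must be in [0, 1)")
    return baseline_tokens * (1.0 - reduction)


def compression_ratio(reduction: float) -> float:
    """Return baseline tokens divided by reduced tokens for a given reduction."""
    if not 0.0 <= reduction < 1.0:
        raise ValueError("reduction must be in [0, 1)")
    return 1.0 / (1.0 - reduction)


if __name__ == "__main__":
    baseline = 1_000  # hypothetical length of a text-only reasoning chain, in tokens
    for task, reduction in [
        ("language", LANGUAGE_REDUCTION),
        ("multimodal", MULTIMODAL_REDUCTION),
    ]:
        remaining = tokens_after_reduction(baseline, reduction)
        ratio = compression_ratio(reduction)
        print(f"{task}: {remaining:.1f} tokens remain, ratio {ratio:.2f}x")
```

Under these assumptions, a 1,000-token text rationale would shrink to roughly 714 tokens on language tasks and 840 tokens on multimodal tasks. Actual savings depend on the model, the tokenizer, and how image inputs are counted.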

For CTOs, this result points to a practical lever for reducing the deployment and operating costs of multimodal systems. Fewer tokens per reasoning step directly reduce latency and infrastructure requirements. The approach also opens new possibilities for how models can represent and communicate knowledge: not as a textual chain, but as visually encoded rationales. This broadens the understanding of what “reasoning” means in multimodal systems: not just text plus image, but the image itself as the reasoning carrier.


Source: arxiv.org · Published June 8, 2026
Lumi AI News — AI-assisted curation pursuant to Article 50 EU AI Act. Paraphrasing and classification by Lumi News Pipeline v1.6.5.
