
ARM: Autoregressive Model for Unified Image and Text Processing

In brief: ARM combines discrete visual tokens with a 7-billion-parameter model to handle image and text tasks uniformly as next-token prediction.

A new model called ARM unifies image understanding, image generation, and image editing in a single autoregressive architecture that operates on discrete token sequences. The system uses a learned tokenizer for visual content and is further optimized with reinforcement learning.

ARM is based on three main components. First, the authors train a discrete visual tokenizer that maps images into compact token sequences. This tokenizer is trained with multiple objectives that jointly promote semantic distinguishability, text alignment, and faithful image reconstruction. The result is a shared latent space that serves understanding, generation, and editing alike.
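To make the idea of mapping an image to a compact token sequence concrete, here is a minimal sketch of one common way to discretize features: a nearest-neighbor codebook lookup (vector quantization). The summary does not say which quantization scheme ARM uses, so the codebook, the patch features, and all function names below are illustrative assumptions.

```python
# Hypothetical sketch: nearest-neighbor codebook lookup (vector quantization).
# The codebook, patch features, and function names are illustrative assumptions,
# not details taken from the ARM paper.

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(patch_features, codebook):
    """Map each patch feature vector to the index of its closest codebook entry."""
    tokens = []
    for feature in patch_features:
        distances = [squared_distance(feature, entry) for entry in codebook]
        tokens.append(distances.index(min(distances)))
    return tokens


def reconstruct(tokens, codebook):
    """Look tokens back up in the codebook (the input to a decoder, in practice)."""
    return [codebook[t] for t in tokens]


# Toy codebook with 4 entries of dimension 2.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# Toy "image": 3 patch features produced by an (assumed) encoder.
patches = [[0.1, 0.2], [0.9, 0.8], [0.2, 0.9]]

tokens = quantize(patches, codebook)
print(tokens)
```

In this toy setup each patch becomes a single integer, which is what lets a language model treat an image as just another token sequence. A real tokenizer would learn both the encoder and the codebook, and the multiple training objectives described above would shape what those codes capture.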

Second, the core of the system is a 7-billion-parameter autoregressive model trained on large amounts of text and image tokens. Through this training it acquires vision-language capabilities for both understanding and generation. The autoregressive principle treats all tasks uniformly as next-token prediction.
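The following sketch illustrates what "all tasks as next-token prediction" can look like once text and image tokens share one vocabulary. The vocabulary sizes, the ID offset scheme, and the uniform stand-in model are hypothetical choices made for illustration; the summary does not describe ARM's actual vocabulary layout or training code.

```python
import math

# Illustrative assumptions: vocabulary sizes, the ID offset scheme, and the
# uniform "model" below are hypothetical and not taken from the ARM paper.
TEXT_VOCAB_SIZE = 1000
IMAGE_VOCAB_SIZE = 512
UNIFIED_VOCAB_SIZE = TEXT_VOCAB_SIZE + IMAGE_VOCAB_SIZE


def to_unified(text_ids, image_ids):
    """Concatenate text and image tokens into one sequence over a shared vocabulary."""
    # Image token IDs are shifted so they do not collide with text IDs.
    return list(text_ids) + [TEXT_VOCAB_SIZE + i for i in image_ids]


def next_token_pairs(sequence):
    """Return (context, target) pairs: each prefix predicts the token after it."""
    return [(sequence[:i], sequence[i]) for i in range(1, len(sequence))]


def uniform_model(context):
    """Stand-in model that assigns equal probability to every token."""
    return [1.0 / UNIFIED_VOCAB_SIZE] * UNIFIED_VOCAB_SIZE


def average_nll(sequence, model):
    """Mean negative log-likelihood of the targets under the model."""
    pairs = next_token_pairs(sequence)
    return sum(-math.log(model(ctx)[tgt]) for ctx, tgt in pairs) / len(pairs)


sequence = to_unified(text_ids=[5, 42, 7], image_ids=[0, 3, 2])
print(sequence)
print(average_nll(sequence, uniform_model))
```

The same loss applies whether the target is a word or a visual code, which is the sense in which understanding (image tokens in, text tokens out) and generation (text tokens in, image tokens out) reduce to one objective.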

Finally, the authors apply reinforcement learning to optimize outputs for text-to-image generation and instruction-guided editing. The RL training targets visual quality, instruction following, and consistency in editing operations. The results show measurable improvements: the WISE score rose from 0.50 to 0.56, and the GEdit-Bench-EN G_O score rose from 5.75 to 6.68.
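To put the reported gains in perspective, the short calculation below derives the absolute and relative improvements from the four scores quoted above. Only those four numbers come from the source; the helper name and the rounding choices are illustrative.

```python
# Scores as reported in the summary above (before and after RL training).
scores = {
    "WISE": (0.50, 0.56),
    "GEdit-Bench-EN G_O": (5.75, 6.68),
}


def improvement(before, after):
    """Return (absolute gain, relative gain in percent), rounded for display."""
    absolute = after - before
    relative = absolute / before * 100.0
    return round(absolute, 2), round(relative, 1)


for name, (before, after) in scores.items():
    absolute, relative = improvement(before, after)
    print(f"{name}: +{absolute} absolute, +{relative}% relative")
```

This works out to a gain of 0.06 points (12%) on WISE and 0.93 points (about 16.2%) on GEdit-Bench-EN G_O. Because the two benchmarks are scored differently, the relative figures are easier to compare than the absolute ones.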

Notably, RL tuning not only affects the target domain but also creates positive synergies between text-to-image generation and editing tasks. The authors interpret this as evidence that strong representations combined with preference optimization provide a scalable foundation for multimodal systems. The code is publicly available via GitHub.


Source: arxiv.org · Published 8 June 2026
Lumi AI News — AI-assisted curation pursuant to Article 50 EU AI Act. Paraphrase and classification by Lumi News Pipeline v1.6.5.
