
ViQ: Discrete Visual Representations at Arbitrary Resolution

At a glance: ViQ quantizes visual inputs of arbitrary resolution into discrete representations and reports 20–70% faster multimodal training than with continuous image encodings.

Researchers present ViQ, a framework for quantized visual representations that maps text and images into a unified discrete format. The system preserves both semantic information and low-level details at native image resolution.

The core challenge is that discretizing images, much like tokenizing text, typically causes substantial information loss. Existing methods struggle to balance two kinds of representation: reconstructive ones, which preserve low-level details but are semantically weak, and semantically strong features, which lose those details. ViQ addresses this conflict with a two-stage approach: text-aligned pretraining followed by feature discretization.

During the pretraining phase, a visual encoder is trained under supervision from a pretrained language model so that it captures semantic information better. In the same phase, the encoder is adapted to process images at their native resolution. During discretization, ViQ uses what the authors call a proximal representation learning strategy to progressively condense the feature space. A position-aware, head-wise quantization step additionally supports arbitrary resolutions, so images do not need to be resized to a uniform size.
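
To make the idea of head-wise quantization more concrete, here is a minimal sketch under stated assumptions. It is not ViQ's implementation: it shows only generic nearest-neighbor vector quantization in which each patch feature is split into equal "heads" and each head is looked up in its own codebook. The function names, the toy codebooks, and the patch-grid layout are all hypothetical. The position-aware component and the proximal representation learning strategy are not modeled, since the summary does not describe them in detail.

```python
import math

def nearest_code(vec, codebook):
    """Return the index of the codebook entry closest to vec (squared L2 distance)."""
    best_idx, best_dist = 0, math.inf
    for idx, code in enumerate(codebook):
        dist = sum((a - b) ** 2 for a, b in zip(vec, code))
        if dist < best_dist:
            best_idx, best_dist = idx, dist
    return best_idx

def quantize_headwise(feature, codebooks):
    """Split one patch feature into len(codebooks) equal heads and
    quantize each head against its own codebook (hypothetical helper)."""
    num_heads = len(codebooks)
    if len(feature) % num_heads != 0:
        raise ValueError("feature length must be divisible by the number of heads")
    head_dim = len(feature) // num_heads
    return [
        nearest_code(feature[h * head_dim:(h + 1) * head_dim], codebooks[h])
        for h in range(num_heads)
    ]

def quantize_image(patch_features, codebooks):
    """Quantize a rows x cols grid of patch features; the grid size may differ per image."""
    return [[quantize_headwise(f, codebooks) for f in row] for row in patch_features]

# Toy codebooks (assumed values): two heads, each with two 2-D codes.
codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[0.0, 0.0], [5.0, 5.0]],
]
wide_image = [[[0.9, 1.1, 4.0, 4.5], [0.1, 0.0, 0.2, 0.3]]]    # 1 x 2 patch grid
tall_image = [[[0.9, 1.1, 0.2, 0.3]], [[0.1, 0.0, 4.0, 4.5]]]  # 2 x 1 patch grid
print(quantize_image(wide_image, codebooks))
print(quantize_image(tall_image, codebooks))
```

Because quantization in this sketch is applied per patch, the number of discrete tokens simply follows the patch grid, which is one plausible reading of why no resizing to a uniform size is needed.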

In experiments on multimodal tasks, ViQ performs comparably to advanced continuous encoders while remaining precise in low-level image reconstruction. The main benefit is training efficiency: multimodal training with quantized visual representations ran 20–70% faster across different LLM bases and training recipes.

For engineers, ViQ is relevant because discrete, quantized visual representations can make large multimodal models considerably more efficient in memory, computation, and communication. The framework thus demonstrates a practical way to treat images as tokens without accepting the information loss that earlier discretization methods imposed.
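
As a rough illustration of the memory argument, the following back-of-the-envelope sketch compares the storage needed for continuous patch features with that for discrete code indices. All numbers (patch count, feature dimension, number of heads, codebook size) are hypothetical and are not taken from the paper. The calculation covers raw storage only and says nothing about the reported training speedup.

```python
def continuous_bytes(num_patches, feature_dim, bytes_per_value=2):
    """Bytes needed to store continuous features (2 bytes per value assumes float16)."""
    return num_patches * feature_dim * bytes_per_value

def discrete_bytes(num_patches, num_heads, codebook_size):
    """Bytes needed to store one code index per head per patch, tightly bit-packed."""
    bits_per_index = (codebook_size - 1).bit_length()
    total_bits = num_patches * num_heads * bits_per_index
    return (total_bits + 7) // 8  # round up to whole bytes

# Assumed example configuration, chosen only for illustration.
cont = continuous_bytes(num_patches=1024, feature_dim=1024)
disc = discrete_bytes(num_patches=1024, num_heads=8, codebook_size=8192)
print(cont, disc, round(cont / disc, 1))
```

Under these assumed values, the continuous features take 2,097,152 bytes and the packed indices 13,312 bytes, roughly a 158-fold reduction in what has to be stored or transmitted per image.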


Source: arxiv.org · Published June 24, 2026
Lumi AI News — AI-assisted curation pursuant to Article 50 EU AI Act. Paraphrase and classification by Lumi News Pipeline v1.7.1.
