ViQ quantizes visual inputs of arbitrary resolution into discrete representations, speeding up training by 20–70% compared with continuous image encodings.
InternVideo3 enables foundation models to analyze longer video sequences with iterative reasoning and tool use while avoiding efficiency problems in KV cache management.