In short: Open models are closing the gap to the frontier, but different benchmarking methods and evaluation frameworks make reliable performance comparisons between open and closed systems difficult.

Google, DeepSeek, Xiaomi and other developers have released a new generation of open AI models. An assessment by the Center for AI Standards and Innovation (CAISI) and an alternative measurement by Epoch AI disagree on how far these models trail closed systems, and both measurements are contested because standardized benchmarks may not adequately capture real-world capabilities.

In October 2024, several new open language models were released: Google’s Gemma 4 (4B, 9B, 31B Dense and 26B A4B MoE), DeepSeek’s V4-Flash, Moonshot AI’s Kimi K2.6, Xiaomi’s MiMo-V2.5-Pro and GLM-5.1. CAISI evaluated these models using an Elo score based on Item Response Theory, which is intended to make models comparable across different benchmarks. The evaluation used nine benchmarks and, according to CAISI, showed a widening gap between open models and American frontier models.
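The summary does not describe CAISI's exact procedure. As a rough sketch only, the idea of an IRT-based Elo score can be illustrated with a one-parameter (Rasch) model, in which the probability that a model solves a task depends on the difference between the model's ability and the task's difficulty. Everything below is an assumption made for illustration: the function names `fit_rasch` and `to_elo`, the toy results, the base rating of 1000 and the ridge penalty. The only fixed fact is the standard Elo convention that a 400-point difference corresponds to 10:1 odds.

```python
import math

# Standard Elo convention: 400 points = odds of 10:1, so one logit is
# 400 / ln(10) Elo points (about 173.7).
ELO_PER_LOGIT = 400.0 / math.log(10)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def fit_rasch(results, n_models, n_items, lr=0.1, steps=3000, l2=0.01):
    """Fit abilities and difficulties by gradient ascent on the log-likelihood.

    results: list of (model_index, item_index, solved) with solved in {0, 1}.
    The small L2 penalty is an illustrative choice that keeps estimates finite.
    """
    ability = [0.0] * n_models
    difficulty = [0.0] * n_items
    for _ in range(steps):
        grad_a = [-l2 * a for a in ability]
        grad_d = [-l2 * d for d in difficulty]
        for m, i, solved in results:
            err = solved - sigmoid(ability[m] - difficulty[i])
            grad_a[m] += err   # solving more than predicted raises ability
            grad_d[i] -= err   # being solved more than predicted lowers difficulty
        ability = [a + lr * g for a, g in zip(ability, grad_a)]
        difficulty = [d + lr * g for d, g in zip(difficulty, grad_d)]
    return ability, difficulty


def to_elo(ability, base=1000.0):
    """Map logit-scale abilities onto an Elo-like scale (base is arbitrary)."""
    return [base + ELO_PER_LOGIT * a for a in ability]


# Invented toy results: models 0 and 1 ran all four tasks,
# model 2 only ran the two harder tasks (items 2 and 3).
results = [
    (0, 0, 1), (0, 1, 1), (0, 2, 1), (0, 3, 0),
    (1, 0, 1), (1, 1, 0), (1, 2, 0), (1, 3, 0),
    (2, 2, 1), (2, 3, 1),
]
ability, difficulty = fit_rasch(results, n_models=3, n_items=4)
elo = to_elo(ability)
```

Because task difficulty is estimated jointly with model ability, a model that was only run on harder tasks is not penalized for skipping the easy ones, which is what allows scores from different benchmarks to be placed on one scale. It also shows why extrapolating from partial datasets or changing the methodology for a single benchmark can shift the resulting Elo score.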

The large Elo difference, however, can be explained by specific benchmark characteristics: DeepSeek V4 scored lower on CTF-Archive-Diamond (extrapolated from partial datasets), PortBench (a CAISI-internal benchmark) and ARC-AGI-2 (evaluated with a different methodology). An alternative measurement by Epoch AI using its Epoch Capabilities Index (ECI), by contrast, shows that the gap between open and closed models has been approximately three to seven months since the release of DeepSeek’s R1.

Both evaluation frameworks have limitations: they use standardized, simplified setups that may underestimate real-world capabilities. For example, coding tasks are evaluated via a Bash shell with a fixed token budget, not via specialized harnesses such as Claude Code or OpenCode that models are trained to work with. As a result, benchmarks can classify tasks such as language migrations (for example Bun’s migration from Zig to Rust, with a million lines of code changes) as impossible, even though they have been solved in practice.
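A minimal sketch of why the choice of setup matters for comparisons: if each model is scored under one fixed configuration, the ranking can differ from the ranking obtained when each model is scored under its best available configuration. The model names, configuration names and pass rates below are invented purely for illustration and do not come from CAISI, Epoch AI or any real evaluation.

```python
# Hypothetical pass rates per model and evaluation setup (invented numbers).
scores = {
    "model_a": {"bash_fixed_budget": 0.42, "native_harness": 0.61},
    "model_b": {"bash_fixed_budget": 0.45, "native_harness": 0.50},
}


def rank(scores, pick):
    """Return model names sorted from best to worst by the picked score."""
    return sorted(scores, key=lambda name: pick(scores[name]), reverse=True)


# Ranking under a single standardized setup for every model.
fixed_ranking = rank(scores, lambda s: s["bash_fixed_budget"])

# Ranking when each model is credited with its best setup.
best_ranking = rank(scores, lambda s: max(s.values()))

print(fixed_ranking)  # ['model_b', 'model_a']
print(best_ranking)   # ['model_a', 'model_b']
```

In this toy case the order flips, which is the kind of effect a standardized Bash-only setup could hide.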

A meaningful comparison between open and closed models would require model-specific prompting strategies and evaluation conditions that are optimal for each model.

Among the new open models, several stand out. Google has adopted the Apache 2.0 license for Gemma 4, eliminating the legal uncertainty of its earlier custom licenses. Xiaomi’s MiMo-V2.5-Pro competes on equal terms with flagship models like Kimi K2.6. Kimi K2.6 handles long context windows and multi-hour task sequences, which is relevant for autonomous research systems. Poolside AI’s Laguna-XS.2 (33B A3B) offers dedicated coding optimization in a compact size suited to local deployments.

Source: www.interconnects.ai · Published 16 May 2026
Lumi AI News — AI-assisted curation in accordance with Article 50 EU AI Act. Paraphrase and classification by Lumi News Pipeline v1.2.0.

Open Frontier Models: Gemma 4, DeepSeek V4 and Others Compared to Closed Systems
