Bottom line: Despite a record month for open AI models, CAISI’s assessment finds that the gap to the American frontier is growing, though an alternative metric suggests a smaller lag of approximately three to seven months.

It was a record-breaking month in open AI development: all major labs, including DeepSeek, released new models. The Center for AI Standards and Innovation (CAISI) evaluated these against the American frontier and reached a sobering conclusion.

The month was exceptionally productive: Gemma 4, DeepSeek V4, Kimi K2.6, MiMo 2.5, and GLM-5.1 make up an unprecedented wave of new open models. CAISI, which has previously assessed open models and their risks, conducted a comprehensive evaluation.

The results paint a mixed picture: open models continue to lag behind the American frontier, and the gap is widening over time. CAISI used nine different benchmarks for its V4 assessment and calculated an Elo rating using Item Response Theory (IRT), a method commonly employed to compare models even when they have been evaluated on entirely different benchmark suites.
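To make the idea concrete, here is a minimal sketch of how an IRT fit can put models on one scale even when they were not scored on the same items. It is not CAISI's actual procedure, which the assessment summary here does not spell out. It assumes the simplest one-parameter (Rasch) model, a small ridge penalty to keep the fit stable, and plain gradient ascent; the names `fit_rasch` and `to_elo_gap` and the toy data are hypothetical.

```python
import math

def fit_rasch(responses, n_models, n_items, lr=0.1, epochs=2000, l2=0.1):
    """Fit a Rasch (1PL) IRT model by gradient ascent.

    responses: list of (model_idx, item_idx, correct), correct in {0, 1}.
    Models need not share items; unobserved pairs are simply left out.
    The l2 penalty is an assumption of this sketch: it keeps estimates
    finite when an item was only ever answered right (or wrong).
    """
    ability = [0.0] * n_models
    difficulty = [0.0] * n_items
    for _ in range(epochs):
        # Gradient of the penalised log-likelihood.
        g_a = [-l2 * a for a in ability]
        g_d = [-l2 * d for d in difficulty]
        for m, i, y in responses:
            p = 1.0 / (1.0 + math.exp(-(ability[m] - difficulty[i])))
            g_a[m] += y - p
            g_d[i] -= y - p
        ability = [a + lr * g for a, g in zip(ability, g_a)]
        difficulty = [d + lr * g for d, g in zip(difficulty, g_d)]
    return ability, difficulty

def to_elo_gap(theta_a, theta_b):
    """Convert a difference in logit-scale ability to Elo points.

    Elo's expected score is 1 / (1 + 10 ** (-gap / 400)), so one logit
    corresponds to 400 / ln(10) Elo points.
    """
    return 400.0 / math.log(10) * (theta_a - theta_b)

# Toy data: model 0 saw items 0-3, model 1 saw items 2-5 (partial overlap).
toy = [
    (0, 0, 1), (0, 1, 1), (0, 2, 1), (0, 3, 0),
    (1, 2, 1), (1, 3, 0), (1, 4, 0), (1, 5, 0),
]
ability, difficulty = fit_rasch(toy, n_models=2, n_items=6)
gap = to_elo_gap(ability[0], ability[1])
print(f"Estimated Elo gap (model 0 minus model 1): {gap:.0f}")
```

In this sketch, the overlapping items are the only direct comparison between the two models; for items that only one model attempted, the estimate leans on the model assumptions and the penalty, which is one reason partially evaluated benchmarks can move the final rating.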

The large Elo gap can be attributed to several factors: DeepSeek V4 performed weakly on CTF-Archive-Diamond (which was only partially evaluated and then extrapolated via IRT), on PortBench (a private CAISI benchmark), and on ARC-AGI-2 (which used a different evaluation methodology than the public leaderboards). These three results weigh heavily on the overall rating.

An alternative perspective is offered by Epoch AI’s ECI metric, which also applies Item Response Theory across diverse benchmarks. By this measure, the gap between open and closed models has amounted to approximately three to seven months since R1, a considerably more nuanced picture than the Elo ratings alone convey.


