In short: The accuracy a multi-model system can reach is mathematically capped by β, the rate at which all models fail on the same query, a parameter that classical error-correlation metrics do not capture.
Researchers analyze the limits of routing, voting, and mixture-of-agents systems across 67 frontier models and show that the accuracy gain from combining models is limited by the rate of shared mispredictions, a quantity the field has rarely measured so far.
Systems that combine multiple language models, such as routing, voting, cascades, and mixture-of-agents, are intended to exceed the accuracy of any single model. However, a new analysis of 67 models from 21 providers finds that the realistic ceiling on combined accuracy is 1 − β, where β is the rate at which all participating models answer the same query incorrectly. Any gain over the best single model has to fit beneath that ceiling.
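To make the role of β concrete, here is a minimal Python sketch that computes the all-wrong rate from a per-query correctness table and compares 1 − β with the best single model. The function names and the toy data (three hypothetical models, eight queries) are illustrative assumptions, not taken from the paper.

```python
from typing import Sequence


def all_wrong_rate(correct: Sequence[Sequence[bool]]) -> float:
    """Fraction of queries that every model gets wrong (beta).

    correct[m][q] is True when model m answered query q correctly.
    """
    n_queries = len(correct[0])
    all_wrong = sum(
        1 for q in range(n_queries) if not any(row[q] for row in correct)
    )
    return all_wrong / n_queries


def best_single_accuracy(correct: Sequence[Sequence[bool]]) -> float:
    """Accuracy of the strongest individual model."""
    return max(sum(row) / len(row) for row in correct)


# Toy correctness table (assumed data): 3 models x 8 queries.
toy = [
    [True, True, False, False, True, True, False, True],
    [True, False, True, False, True, True, False, False],
    [False, True, True, False, True, False, False, True],
]

beta = all_wrong_rate(toy)        # queries 3 and 6 are missed by every model
ceiling = 1 - beta                # no selector over these answers can exceed this
best = best_single_accuracy(toy)  # strongest individual model
headroom = ceiling - best         # most a perfect router could add over it

print(f"beta={beta}, ceiling={ceiling}, best={best}, headroom={headroom}")
```

In this toy table a perfect per-query router could lift accuracy from 0.625 to at most 0.75. Adding a fourth model raises the ceiling only if it answers one of the two all-wrong queries correctly.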
The problem is that the standard diagnostic, the average pairwise error correlation ρ, does not capture this critical value. Two joint error distributions can share identical marginals and pairwise correlations yet differ in how often all models fail together. As a result, estimates built on pairwise statistics systematically underestimate the all-wrong tail probability. On open-ended math tasks, for example, the researchers observed β = 0.052, while a Gaussian copula model across all 67 models predicted only β = 0.023, an underestimate by a factor of 2.5 (90% CI: 1.7–3.4, k = 17). For code tasks with execution validation, β rose as high as 0.079.
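The claim that pairwise statistics cannot pin down β can be checked with a small textbook construction. The sketch below builds two joint error distributions over three hypothetical models: one with independent errors, and one in which the third model is wrong exactly when the other two agree. The 50% error rates are chosen only to keep the arithmetic exact and are not meant to be realistic. Both distributions are assumptions for illustration and do not come from the paper.

```python
from itertools import product


def summarize(joint):
    """Return (marginal error rates, pairwise correlations, all-wrong rate).

    joint maps a tuple of error indicators (1 = model wrong) to its probability.
    """
    n = 3
    marg = [sum(p * e[i] for e, p in joint.items()) for i in range(n)]
    rho = {}
    for i in range(n):
        for j in range(i + 1, n):
            both = sum(p * e[i] * e[j] for e, p in joint.items())
            cov = both - marg[i] * marg[j]
            sd = (marg[i] * (1 - marg[i]) * marg[j] * (1 - marg[j])) ** 0.5
            rho[(i, j)] = cov / sd
    beta = joint.get((1, 1, 1), 0.0)
    return marg, rho, beta


# A: three models that each err independently with probability 0.5.
independent = {e: 0.125 for e in product((0, 1), repeat=3)}

# B: models 1 and 2 err independently; model 3 errs exactly when they agree.
coupled = {(a, b, 1 - (a ^ b)): 0.25 for a, b in product((0, 1), repeat=2)}

marg_a, rho_a, beta_a = summarize(independent)
marg_b, rho_b, beta_b = summarize(coupled)

print("A:", marg_a, rho_a, beta_a)
print("B:", marg_b, rho_b, beta_b)
```

Both distributions have marginal error rates of 0.5 and pairwise correlations of 0, yet the all-wrong rate is 0.125 in the first and 0.25 in the second. A diagnostic that looks only at marginals and ρ cannot tell them apart.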
Another finding concerns the format dependence of co-failures: on the GPQA-Diamond dataset, reformulating questions from multiple-choice to free-response raised β from 0.023 to 0.127, which suggests that shared errors are anchored more in response format than in knowledge deficits. A five-member adjudication panel reached Cohen’s kappa values of 0.73–0.92.
In practice, this means that heterogeneous ensembles with low error correlation can outperform self-MoA systems when strong query-level routing information is available. However, the analysis of verifiable tasks shows that model combination typically does not beat the single best model unless the participating models fail systematically on different questions. Larger ensembles provide no advantage when their errors overlap.
Source: arxiv.org · Published June 24, 2026
Lumi AI News — AI-assisted curation in accordance with Article 50 EU AI Act. Paraphrase and classification through Lumi News Pipeline v1.7.1.