Quantifying Infrastructure Noise in Agentic Coding Evaluations

The Bottom Line: Infrastructure resource configuration can shift agentic coding benchmark scores by up to 6 percentage points. In tests, infrastructure error rates fell as resource headroom increased, which calls into question the validity of model comparisons on such benchmarks.

Agentic coding benchmarks like SWE-bench and Terminal-Bench are used to evaluate language models, with top rankings often separated by only a few percentage points. New analysis shows that infrastructure configuration alone can produce differences that exceed these margins—with deviations of up to 6 percentage points on Terminal-Bench 2.0.

Static benchmarks evaluate a model’s output directly. Agentic coding evaluations instead give the model a full environment in which it writes programs, runs tests, installs dependencies, and iterates across multiple runs. The runtime environment is therefore no longer a passive container but an integral part of the problem-solving process.

When calibrating a Terminal-Bench 2.0 setup on a Google Kubernetes Engine cluster, significant discrepancies from official leaderboard scores were discovered. The infrastructure error rate was surprisingly high—up to 6 percent of tasks failed due to pod errors unrelated to model capability. The problem lay in resource specification enforcement: the Kubernetes implementation treated per-task resource definitions as hard ceilings, leading to out-of-memory kills during transient memory spikes.
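The gap between specifying a resource and enforcing it as a hard ceiling can be sketched in a few lines. The example below is a minimal illustration, not the setup used in the analysis: the helper name `pod_resources` and its `headroom` parameter are hypothetical. It builds the `resources` mapping of a Kubernetes container spec, where `requests` is the per-task specification and `limits` is the ceiling above which the container can be out-of-memory killed.

```python
from typing import Optional


def pod_resources(cpu_cores: float, memory_mib: int,
                  headroom: Optional[float] = 1.0) -> dict:
    """Build a Kubernetes-style container `resources` mapping.

    Hypothetical helper for illustration only.
      headroom=1.0  -> limits equal requests (the spec is a hard ceiling)
      headroom=3.0  -> limits are 3x the per-task spec
      headroom=None -> no limits are set (uncapped)
    """
    if headroom is not None and headroom < 1.0:
        raise ValueError("headroom must be >= 1.0, or None for uncapped")

    # The per-task specification: what the task is declared to need.
    resources = {
        "requests": {
            "cpu": f"{round(cpu_cores * 1000)}m",
            "memory": f"{memory_mib}Mi",
        }
    }

    # The enforced ceiling: exceeding the memory limit gets the container killed.
    if headroom is not None:
        resources["limits"] = {
            "cpu": f"{round(cpu_cores * headroom * 1000)}m",
            "memory": f"{round(memory_mib * headroom)}Mi",
        }
    return resources


strict = pod_resources(1, 2048, headroom=1.0)
roomy = pod_resources(1, 2048, headroom=3.0)
uncapped = pod_resources(1, 2048, headroom=None)
```

Under the strict variant, a transient spike above 2048 MiB exceeds the memory limit and can kill the pod. The 3x variant tolerates the same spike, and the uncapped variant sets no ceiling at all.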

To quantify the effect, tests were run across six different resource configurations—from strict enforcement to completely uncapped resources. Success rates increased substantially with more resource headroom. The infrastructure error rate fell monotonically from 5.8 percent under strict enforcement to 0.5 percent with uncapped resources. The difference between 1x and 3x headroom (5.8 percent to 2.1 percent) was statistically significant (p < 0.001). These findings illustrate that infrastructure configuration materially influences what agentic coding benchmarks actually measure—and that specifying resources is not the same as consistently enforcing them.
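One common way to compare two error rates is a pooled two-proportion z-test. The sketch below shows the calculation using only the standard library. It is not necessarily the test used in the analysis, and because the summary does not report the number of trials per configuration, the counts in the usage lines are hypothetical placeholders chosen only to match the reported 5.8 percent and 2.1 percent rates.

```python
import math


def two_proportion_z_test(errors_a: int, n_a: int,
                          errors_b: int, n_b: int) -> tuple:
    """Pooled two-proportion z-test; returns (z, two-sided p-value)."""
    p_a = errors_a / n_a
    p_b = errors_b / n_b
    # Pooled error rate under the null hypothesis that both rates are equal.
    pooled = (errors_a + errors_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 0.0, 1.0
    z = (p_a - p_b) / se
    # Two-sided tail probability of the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# HYPOTHETICAL trial counts: 1000 task attempts per configuration.
# Only the rates (5.8% and 2.1%) come from the text above.
z, p = two_proportion_z_test(58, 1000, 21, 1000)
print(f"z = {z:.2f}, p = {p:.2g}")
```

With fewer trials per configuration the same rates would yield a larger p-value, so the significance of a gap like this depends on how many task attempts back each rate.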


Source: www.anthropic.com
