In brief: AI agents surpass the previous state of the art on only about 18 percent of genuine scientific tasks, and where they do succeed, it is mostly by reframing problems as familiar prediction tasks rather than through true innovation.

Researchers have developed NatureBench, a benchmark of 90 tasks drawn from Nature publications, to test whether AI agents can solve genuine scientific problems rather than merely replicate published results. In the initial evaluation, the strongest model exceeds the previous state of the art on only 17.8 percent of the tasks.

NatureBench builds on NatureGym, an automated pipeline that constructs a standardized, containerized environment for each task from the original scientific paper. This addresses the previously unsolved problem of environment fragmentation, which has undermined the credibility of earlier benchmarks that test agents on research tasks.

The evaluation covered ten frontier agent configurations with web search strictly restricted. Under the g>0.1 criterion, the best model exceeds the previous state of the art (SOTA) on only 17.8 percent of the 90 tasks. Analysis of successful solution paths shows that agents succeed primarily through methodological translation: they convert scientific problems into familiar supervised prediction tasks rather than arriving at genuine scientific invention.
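To make the headline figure concrete, here is a minimal sketch of how a success rate under a threshold criterion such as g>0.1 could be tallied. It rests on assumptions: the source does not define g, so the code treats it as a hypothetical per-task gain score relative to the previous SOTA, and the function name `success_rate` and the example scores are invented for illustration. The count of 16 successful tasks is inferred from the arithmetic (16 of 90 is the only count that rounds to 17.8 percent), not reported by the source.

```python
def success_rate(gains, threshold=0.1):
    """Return the fraction of tasks whose gain strictly exceeds the threshold.

    `gains` is a hypothetical list of per-task gain scores; the source
    does not specify how g is computed.
    """
    if not gains:
        raise ValueError("no tasks to score")
    wins = sum(1 for g in gains if g > threshold)
    return wins / len(gains)


# Illustrative scores only: 16 tasks above the threshold, 74 at or below it.
example_gains = [0.25] * 16 + [0.0] * 74

rate = success_rate(example_gains)
print(f"{rate:.1%}")  # 16 / 90, printed as 17.8%
```

Because the comparison is strict, a task that lands exactly on the threshold does not count as exceeding the previous SOTA in this sketch.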

Failed solutions stem predominantly from incorrect method selection and insufficient compute budgets, not from misunderstanding the task. The key takeaway for engineers is that the findings show where current coding agents systematically fail and where their strengths lie, which is essential for setting realistic expectations for AI-assisted problem-solving in R&D contexts.

The researchers have released the benchmark, the NatureGym pipeline, and a public leaderboard with reproducible results. The code is available on GitHub.

Source: arxiv.org · Published June 22, 2026
Lumi AI News — AI-assisted curation in accordance with Article 50 EU AI Act. Paraphrase and classification by Lumi News Pipeline v1.7.1.


NatureBench: How Far Coding Agents Really Get on Scientific Tasks
