At a glance: AI agents rarely cite non-existent sources, but 15.9% of their citations point to the wrong paper, and they stop using tools on exactly the difficult questions where tools would matter most.
Researchers have developed a benchmark with 12,553 unsolved biomedical research questions that reveals a critical weakness in AI agents: while they rarely link to non-existent papers (over 99% of citations are valid), 15.9% point to the wrong papers.
The new benchmark called OpenBioRQ takes a novel approach to evaluating agentic AI models. While existing tests use fixed answer keys – allowing models to simply reproduce the expected source – OpenBioRQ uses real, unsolved questions without predefined answer keys. This forces models to independently verify that sources they find actually support the claim in question. The benchmark comprises 12,553 questions from 12 biomedical domains and treats open questions explicitly as a probe for reliability and abstention capability (withholding answers).
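The distinction between a citation that exists and a citation that actually supports the claim can be made concrete with a small tally. The following is a minimal sketch, not the benchmark's own scoring code: the `Citation` record, its two boolean fields, and the toy data are assumptions for illustration, and the choice to divide by all citations (rather than only the valid ones) is likewise an assumption.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Citation:
    # Hypothetical record; both fields are illustrative, not from the paper.
    exists: bool    # the cited identifier resolves to a real paper
    supports: bool  # the paper actually backs the claim it is attached to


def citation_rates(citations):
    """Return the share of citations in each of three outcome classes."""
    labels = ("nonexistent", "wrong_paper", "supported")
    if not citations:
        return {label: 0.0 for label in labels}
    counts = Counter()
    for c in citations:
        if not c.exists:
            counts["nonexistent"] += 1   # fabricated reference
        elif not c.supports:
            counts["wrong_paper"] += 1   # real paper, but it does not back the claim
        else:
            counts["supported"] += 1
    total = len(citations)
    return {label: counts[label] / total for label in labels}


# Toy data: 8 supported citations and 2 real-but-wrong ones.
toy = [Citation(True, True)] * 8 + [Citation(True, False)] * 2
rates = citation_rates(toy)
print(rates)
```

A check that only resolves identifiers would score this toy set as fully valid. The second field is what separates "rarely fabricated" from "reliably correct".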
Evaluating three independent frontier agents – Gemini-3-Pro, Opus-4.7, and GPT-5.5 – reveals considerable variance: the models solve between 29 and 60 percent of the most difficult questions. Even the best model leaves 33–40 percent unsolved, demonstrating that the benchmark is not saturated and effectively captures actual performance differences between capability tiers.
A central problem with the tested agents is the so-called “agentic collapse”: on particularly difficult questions, the models stop using their tools. According to the research, this phenomenon is severe: for the model most susceptible to it, completely blocking tool access barely changes test results, which means the tools go unused exactly where they would be most urgently needed. This suggests that agents fail to meaningfully weigh tool use against answering directly when faced with difficult tasks.
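One way to look for this pattern is to compare, per difficulty level, how often the agent calls a tool at all and how much accuracy is lost when tools are blocked. The sketch below is an assumption about how such a comparison could be set up, not the authors' analysis: the record fields (`difficulty`, `tool_calls`, `correct_with_tools`, `correct_without_tools`) and the toy numbers are hypothetical.

```python
from collections import defaultdict


def collapse_report(records):
    """Per difficulty level: share of questions with at least one tool call,
    and the accuracy gained by having tools versus having them blocked."""
    buckets = defaultdict(list)
    for r in records:
        buckets[r["difficulty"]].append(r)

    report = {}
    for level, rows in buckets.items():
        n = len(rows)
        with_tools = sum(r["correct_with_tools"] for r in rows) / n
        without_tools = sum(r["correct_without_tools"] for r in rows) / n
        report[level] = {
            "tool_use_rate": sum(r["tool_calls"] > 0 for r in rows) / n,
            "tool_gain": with_tools - without_tools,
        }
    return report


def make(difficulty, tool_calls, with_tools, without_tools):
    # Helper for building hypothetical per-question records.
    return {
        "difficulty": difficulty,
        "tool_calls": tool_calls,
        "correct_with_tools": with_tools,
        "correct_without_tools": without_tools,
    }


# Toy data only: tools are used and help on easy questions,
# but are mostly skipped and add nothing on hard ones.
toy = [
    make("easy", 3, True, True),
    make("easy", 2, True, True),
    make("easy", 4, True, False),
    make("easy", 1, True, False),
    make("hard", 0, False, False),
    make("hard", 0, True, True),
    make("hard", 0, False, False),
    make("hard", 2, False, False),
]
report = collapse_report(toy)
print(report)
```

In this toy set, a low `tool_use_rate` together with a `tool_gain` near zero on the hard bucket is the signature described above: blocking tools changes little because they were barely being used.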
To improve evaluation reliability, the researchers introduced a standardized checklist approach that raises inter-rater agreement (Spearman correlation) from 0.35 to 0.82. OpenBioRQ thus represents the first biomedical benchmark that combines agentic scenarios with multiple tool calls and unsolved questions without answer keys, while defining difficulty empirically through actual model failure rather than through subjective difficulty labels.
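For readers unfamiliar with the agreement metric: Spearman correlation is the Pearson correlation computed on ranks, so it measures whether two raters order the answers the same way rather than whether they give identical scores. The following is a minimal standard-library sketch of that definition. The rater scores are made-up toy values, not data from the benchmark, and the sketch assumes each rater's scores are not all identical.

```python
def average_ranks(values):
    """Assign 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the full run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy scores from two hypothetical raters for the same five answers.
rater_a = [3, 1, 4, 2, 5]
rater_b = [2, 1, 5, 3, 4]
rho = spearman(rater_a, rater_b)
print(round(rho, 2))
```

A value near 1 means the raters rank answers almost identically, a value near 0 means their rankings are largely unrelated.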
Source: arxiv.org · Published June 19, 2026
Lumi AI News — AI-assisted curation in accordance with Art. 50 EU AI Act. Paraphrasing and classification via Lumi News Pipeline v1.7.1.