
OpenBioRQ: Benchmark for Agent-Based Biomedical Research Questions

The Bottom Line: OpenBioRQ shows that even the strongest agent-based AI models leave roughly 40% of complex biomedical research questions unsolved and, paradoxically, stop using their tools on the hardest tasks, where those tools matter most.

Researchers have released a benchmark dataset of 12,553 open-ended biomedical research questions, none of which comes with an answer key, to expose weaknesses in how AI agents verify their sources. According to the OpenBioRQ results, approximately 15.9% of citations generated by language models link to papers that exist but do not support the claimed statement.

OpenBioRQ targets a specific failure mode. Current agent-based models generate valid citations over 99% of the time, yet approximately 15.9% of those citations contain a critical error: the link points to the wrong paper. Existing benchmarks fail to capture this because, on questions with predefined answers, models can simply reproduce the expected source rather than independently verifying whether that source supports the claim.
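The difference between a citation that resolves and a citation that supports its claim can be made concrete with a small sketch. The `Citation` record, its two fields, and the toy data below are assumptions made for illustration; they are not the benchmark's actual schema or results. The sketch also assumes that the unsupported share is measured among citations that resolve.

```python
from dataclasses import dataclass


@dataclass
class Citation:
    # Hypothetical record, not the OpenBioRQ schema.
    resolves: bool  # the link points to a paper that exists
    supports: bool  # that paper actually backs the claimed statement


def citation_rates(citations):
    """Return (validity rate, share of valid citations that do not support the claim)."""
    valid = [c for c in citations if c.resolves]
    if not citations or not valid:
        return 0.0, 0.0
    validity_rate = len(valid) / len(citations)
    unsupported_rate = sum(not c.supports for c in valid) / len(valid)
    return validity_rate, unsupported_rate


# Made-up toy data: every link resolves, but two of ten papers do not support the claim.
sample = [Citation(True, True)] * 8 + [Citation(True, False)] * 2
validity, unsupported = citation_rates(sample)
print(f"valid: {validity:.0%}, valid but unsupported: {unsupported:.0%}")
```

A benchmark that only checks the first number would rate this toy model as perfect, which is the blind spot the article describes.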

The dataset comprises 12,553 open-ended research questions spanning 12 biomedical domains and was specifically designed as an agent-based benchmark: models must make multiple tool calls in sequence but receive no answer keys. Instead, correctness is validated against genuine evidence from research literature. Difficulty levels are not subjectively labeled but are empirically anchored to questions that three open-source reference models cannot answer.
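Empirical difficulty anchoring of this kind can be sketched as a simple filter. The data layout, the model names, and the reading that a question counts as hard only when all three reference models fail are assumptions for illustration, not details confirmed by the source.

```python
def hard_subset(results):
    """Return the ids of questions that no reference model answered correctly.

    results: question id -> {reference model name: answered correctly?}
    (a hypothetical layout chosen for this sketch)
    """
    return sorted(
        qid for qid, by_model in results.items() if not any(by_model.values())
    )


# Made-up outcomes for three placeholder reference models.
toy = {
    "q1": {"ref_a": True, "ref_b": False, "ref_c": False},
    "q2": {"ref_a": False, "ref_b": False, "ref_c": False},
    "q3": {"ref_a": False, "ref_b": True, "ref_c": True},
}
print(hard_subset(toy))
```

Anchoring difficulty to observed failures rather than to human labels means the hard subset is reproducible by anyone who can rerun the reference models.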

On this most difficult subset, a large performance gap emerges: models from the same family as the difficulty anchors solve only approximately 17% of questions, while three independent frontier agents (Gemini-3-Pro, Opus-4.7, GPT-5.5) achieve a broad range of 29-60%. The benchmark is thus not saturated — even the best agents leave 33-40% of questions unsolved.

Particularly problematic is a phenomenon called “agentic collapse” on difficult questions: agents stop using their tools precisely where these tools would be most needed. On the collapse-prone model, the score barely changes when tool use is disabled entirely. Separately, structured validation through a unified per-question checklist raises agreement between evaluators from a Spearman correlation of 0.35 to 0.82.
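For readers unfamiliar with the agreement metric, Spearman correlation compares how two evaluators rank the same answers rather than the raw scores they assign. The following is a minimal pure-Python sketch; the function names and the example scores are illustrative assumptions and say nothing about how the benchmark authors computed their figures.

```python
def average_ranks(values):
    """Rank values starting at 1, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(a, b):
    """Spearman rank correlation: the Pearson correlation of the two rank vectors."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equal-length sequences with at least two items")
    ra, rb = average_ranks(a), average_ranks(b)
    mean_a, mean_b = sum(ra) / len(ra), sum(rb) / len(rb)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(ra, rb))
    var_a = sum((x - mean_a) ** 2 for x in ra)
    var_b = sum((y - mean_b) ** 2 for y in rb)
    if var_a == 0 or var_b == 0:
        raise ValueError("undefined when one evaluator gives constant scores")
    return cov / (var_a * var_b) ** 0.5


# Made-up scores from two hypothetical evaluators on the same five answers.
evaluator_1 = [1, 2, 3, 4, 5]
evaluator_2 = [2, 1, 4, 3, 5]
print(round(spearman(evaluator_1, evaluator_2), 2))
```

A value near 1 means the evaluators order the answers almost identically, so a move from 0.35 to 0.82 indicates that the checklist makes their rankings far more alike.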


Source: arxiv.org · Published June 19, 2026
Lumi AI News — AI-assisted curation pursuant to Article 50 EU AI Act. Paraphrase and classification via Lumi News Pipeline v1.7.1.
