OpenBioRQ reveals that agent-based AI models fail on approximately 40% of complex biomedical research questions and, paradoxically, stop using their tools on difficult tasks, precisely where those tools are most critical.
A systematic data curation pipeline enables agentic models to be trained to generalize across diverse task types while achieving results competitive with or superior to those of specialized models.