AI agents rarely cite non-existent sources, but they link to the wrong paper in 15.9% of cases and stop using tools precisely on the difficult questions where tools matter most.
DailyReport is a new open-source benchmark that evaluates search agents on everyday, multidimensional search tasks and reveals room for improvement in existing systems.
A new benchmark pinpoints the exact step at which medical AI models begin to hallucinate and supports targeted countermeasures through trace-supervised fine-tuning.
The Claw-SWE-Bench framework demonstrates that adapter design is critical for code agents: OpenClaw achieves 19.1% Pass@1 with a minimal adapter and 73.4% with a complete one.
Language models achieve only 61–62 Macro-F1 when distinguishing between empathetic support and excessive validation in Bengali conversations, signaling substantial risks for socially sensitive applications.
Current AI agents cannot reliably execute long-running professional GUI workflows: they struggle to maintain consistency, to keep errors from propagating, and to apply domain-specific understanding.