The Bottom Line: DailyReport is a new open-source benchmark that evaluates search agents on everyday search tasks along multiple dimensions and shows where existing systems fall short.
Researchers have introduced DailyReport, a benchmark with 150 open-ended tasks and 3,546 evaluation criteria for assessing search agents in realistic deployment scenarios. Analysis of 17 agentic systems shows that current implementations fall short of user expectations.
Search agents leverage large language models to independently handle complex information retrieval tasks by exploring web sources and synthesizing information into comprehensive answers. Previous evaluation benchmarks have primarily focused on specialized tasks that rarely occur in real user scenarios.
DailyReport addresses this gap with 150 open-ended tasks that reflect frequently discussed, current information needs of real users. Each task is decomposed into subtasks and evaluated through cascading rubrics along disjoint dimensions. This structure enables more precise attribution of weaknesses to individual processing steps and aspects (research quality, synthesis performance, recency, etc.).
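To make the idea of cascading rubrics more concrete, here is a minimal Python sketch. It is not the benchmark's actual implementation: the `Criterion` fields, the `requires` prerequisite link, and the rule that a criterion only counts when its whole prerequisite chain passed are all assumptions chosen for illustration. The dimension names are taken loosely from the examples above.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Criterion:
    """One rubric item (hypothetical schema, not DailyReport's real format)."""
    cid: str                        # unique criterion id
    subtask: str                    # subtask the criterion belongs to
    dimension: str                  # e.g. "research", "synthesis", "recency"
    passed: bool                    # raw judgment for this criterion
    requires: Optional[str] = None  # assumed prerequisite criterion id


def score_by_dimension(criteria: List[Criterion]) -> Dict[str, float]:
    """Return the fraction of effectively passed criteria per dimension.

    Assumed cascade rule: a criterion counts as passed only if it passed
    AND every criterion in its prerequisite chain passed as well.
    """
    by_id = {c.cid: c for c in criteria}

    def effectively_passed(c: Criterion) -> bool:
        seen = set()
        current: Optional[Criterion] = c
        while current is not None:
            if current.cid in seen:
                raise ValueError(f"cyclic prerequisite at {current.cid}")
            seen.add(current.cid)
            if not current.passed:
                return False
            # Follow the prerequisite link; an unknown id raises KeyError.
            current = by_id[current.requires] if current.requires else None
        return True

    passed = defaultdict(int)
    total = defaultdict(int)
    for c in criteria:
        total[c.dimension] += 1
        if effectively_passed(c):
            passed[c.dimension] += 1
    return {dim: passed[dim] / total[dim] for dim in total}


example = [
    Criterion("c1", "find_sources", "research", True),
    Criterion("c2", "find_sources", "recency", False),
    Criterion("c3", "write_summary", "synthesis", True, requires="c1"),
    Criterion("c4", "write_summary", "synthesis", True, requires="c2"),
]
scores = score_by_dimension(example)
```

In this toy input, `c4` passed on its own but depends on the failed `c2`, so it does not count. Because each criterion belongs to exactly one dimension, a low score can be traced back to a specific subtask and aspect.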
An evaluation of 17 agentic systems shows that none of the tested implementations meets average user expectations. By aggregating rubric results per dimension and from the user's perspective, the benchmark provides interpretable scores for each dimension as well as a user preference score that makes systems easier to compare.
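The summary does not say how the user preference score is computed. One simple way to picture such an aggregation is a weighted average of per-dimension scores, with weights that express how much users care about each dimension. The sketch below assumes exactly that; the function name, the weights, and the example numbers are hypothetical and are not results from the benchmark.

```python
from typing import Dict


def user_preference_score(
    dimension_scores: Dict[str, float],
    weights: Dict[str, float],
) -> float:
    """Weighted mean of per-dimension scores (an assumed aggregation rule)."""
    missing = set(weights) - set(dimension_scores)
    if missing:
        raise ValueError(f"no score for dimensions: {sorted(missing)}")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted_sum = sum(dimension_scores[dim] * w for dim, w in weights.items())
    return weighted_sum / total_weight


# Illustrative numbers only, not taken from the benchmark.
toy_scores = {"research": 0.8, "synthesis": 0.6, "recency": 0.4}
toy_weights = {"research": 2.0, "synthesis": 1.0, "recency": 1.0}
preference = user_preference_score(toy_scores, toy_weights)
```

Keeping the per-dimension scores alongside the single aggregated number preserves interpretability: two systems with the same overall score can still differ in where they are weak.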
The dataset and source code are publicly available at https://github.com/AGI-Eval-Official/DailyReport, which is intended to support systematic improvement of search agent architectures.
Source: arxiv.org · Published June 10, 2026
Lumi AI News — AI-assisted curation in accordance with Art. 50 EU AI Act. Paraphrase and classification through Lumi News Pipeline v1.7.1.