
Workflow-GYM: Benchmark Reveals Limits of AI Agents in Complex GUI Tasks

The bottom line: Current AI agents cannot reliably execute long-horizon, professional GUI workflows. They struggle to stay consistent across many steps, let early errors propagate, and lack domain-specific understanding.

Researchers have developed Workflow-GYM, a benchmark that measures how well AI agents perform complex, multi-step GUI workflows in professional software. The results are sobering: the best available models achieve a success rate of just over 30%.

Workflow-GYM addresses a gap in the existing evaluation landscape: although AI agents are increasingly tested on complicated tasks, available benchmarks predominantly cover general-purpose software, simple applications, and short tasks. The new benchmark instead targets longer, high-quality workflows in specialized professional software, such as financial tools, engineering platforms, or other domain-specific applications that must deliver economically relevant results.

Comprehensive tests show that even the most powerful models complete only about 30% of tasks in such scenarios. Analysis of the failed attempts reveals systematic weaknesses: AI agents frequently skip workflow steps, let errors propagate (a mistake early in the process undermines later steps), lose sight of their original goal, and lack sufficient understanding of how specialized software works.

For CTOs, this means that current agent architectures are not yet production-ready for time-critical, multi-step tasks in professional environments — such as financial closings, approval processes, or engineering workflows. The study identifies consistency maintenance across long process sequences and deeper understanding of domain-specific software as key research directions for the next generation of GUI agents.


Source: arxiv.org · Published June 8, 2026
Lumi AI News — AI-assisted curation pursuant to Art. 50 EU AI Act. Paraphrase and classification by Lumi News Pipeline v1.6.5.
