The bottom line: Current frontier models achieve a success rate below 50 percent on the new ITBench-AA benchmark for agentic IT capabilities, revealing a significant gap between model capabilities and production readiness for autonomous IT tasks.
IBM and Artificial Analysis have developed ITBench-AA, a benchmark that, for the first time, evaluates how well agentic AI models autonomously solve complex IT tasks in enterprise environments. Current frontier models perform significantly worse than hoped, scoring below 50 percent.
ITBench-AA is an evaluation framework specifically designed to measure agentic capabilities in enterprise IT scenarios. The benchmark comprises realistic tasks from systems administration, network management, security configuration, and similar domains, where models must independently decide, plan, and execute actions. They must not only generate code but also validate it and iteratively improve it.
The evaluation shows that even powerful frontier models such as Claude, GPT-4, and Gemini fall below a 50 percent success rate in these practical IT scenarios. This suggests that code generation capability alone is insufficient for robust agent-based solutions. Critical weaknesses include faulty error handling, a lack of context management across multiple steps, and insufficient validation of system state changes.
For CTOs, this is an important signal: while AI-driven IT automation and self-service portals are promising, production deployments of agentic systems should not yet rely entirely on autonomous decision-making. Instead, a hybrid approach is recommended in which models structure and propose tasks, but humans grant approvals and monitor critical systems. The benchmark results provide a foundation for measuring and comparing model progress in this domain going forward.
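The hybrid approach described above can be illustrated with a minimal Python sketch of an approval gate. It is not part of ITBench-AA or any vendor tooling: the names `ProposedAction` and `run_with_approval`, the `critical` flag, and the callback signatures are all hypothetical, chosen only to show how a model's proposed steps might pass through a human checkpoint before touching critical systems.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class ProposedAction:
    """One step proposed by the model (hypothetical structure)."""
    description: str
    command: str
    critical: bool = False  # True if the step touches a critical system


def run_with_approval(
    actions: List[ProposedAction],
    approve: Callable[[ProposedAction], bool],
    execute: Callable[[ProposedAction], None],
) -> List[Tuple[str, str]]:
    """Run proposed actions, asking a human before each critical one.

    Returns an audit log of (command, outcome) pairs.
    """
    audit_log: List[Tuple[str, str]] = []
    for action in actions:
        if action.critical and not approve(action):
            # The human declined: skip the step but keep a record of it.
            audit_log.append((action.command, "rejected"))
            continue
        execute(action)
        audit_log.append((action.command, "executed"))
    return audit_log
```

In a real deployment, `approve` might prompt an operator or open a change ticket, and `execute` would run the command against the target system. Both are left as injected callbacks here so the gating logic stays independent of any particular tooling.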
Source: huggingface.co · Published May 27, 2026
Lumi AI News — AI-assisted curation pursuant to Article 50 EU AI Act. Paraphrase and classification by Lumi News Pipeline v1.2.6.