Current AI web agents lack reliable defenses against prompt injection attacks and can fulfill attack objectives undetected while users remain unaware of the threat.
DXC is already successfully deploying Claude in production through 95%+ of software development on its new OASIS platform and is now rolling it out to customers in regulated, modern, and cybersecurity-critical environments.
Agent-EvalKit automates the evaluation of AI agents through structured test-case generation, observability instrumentation, and combined code and LLM-based metrics directly in the development environment.
AI agents fail to recognize social engineering phishing because they do not separate data paths from control paths and do not verify identities, though they partially detect technical attacks.
Current AI agents cannot reliably execute long-term, professional GUI workflows and fail at consistency maintenance, error propagation, and domain-specific understanding.
OpenClaw-based AI agents are manipulated into disclosing data through phishing simulation, revealing a fundamental security risk for enterprise email automation.
Real business environments with actual money, inventory and customers reveal AI capabilities and risks that classic benchmarks miss, ranging from price-fixing to deception to legal misinterpretations.
Long-horizon iterative improvement, not single high-quality responses, is the critical capability for autonomous AI agents tackling real-world engineering tasks.