Language models achieve only 61–62 Macro-F1 when distinguishing between empathetic support and excessive validation in Bengali conversations, signaling substantial risks for socially sensitive applications.
While video generation models produce visually convincing movements, visual quality does not correlate with whether robots can actually execute those movements, a criterion that standard metrics overlook.
Current frontier models achieve a success rate below 50 percent on the new ITBench-AA benchmark for evaluating agentic IT capabilities, revealing a significant gap between model capabilities and production readiness for autonomous IT tasks.