A new analysis of the WorkBench workplace agent benchmark finds that the task completion rate of the best-performing model rose from 43% for GPT-4 in March 2024 to 89% for Claude Opus 4.8 in June 2026, while the rate of unintended harmful actions fell from 26% to 2.5%. The researchers also found that capability and safety improved together rather than trading off, and that open-weight models now offer performance comparable to proprietary alternatives at significantly lower cost.
Why it matters: As enterprises increasingly deploy AI agents for mission-critical workplace tasks, understanding both the rate of safety improvement and the failure modes that persist, such as sending emails to the wrong recipients, is essential for building reliable, trustworthy automation systems.