AI & Tech·June 15, 2026·1 source verified

WorkBench Study Shows Workplace AI Agents Doubled Task Completion While Slashing Harmful Mistakes in Two Years

Summarised by Relevant News AI · Read time: 3 min

A new analysis of the WorkBench workplace agent benchmark finds that the best task completion rate rose from 43% (GPT-4, March 2024) to 89% (Claude Opus 4.8, June 2026), while the rate of unintended harmful actions fell from 26% to 2.5%. The researchers also found that capability and safety are positively correlated rather than in tension, and that open-weight models now perform comparably to proprietary alternatives at significantly lower cost.

Why it matters: As enterprises increasingly deploy AI agents for mission-critical workplace tasks, understanding both the pace of safety improvement and the failure modes that persist, such as sending emails to the wrong recipients, is essential for building reliable and trustworthy automation systems.

All sources

arXiv cs.AI