Researchers introduce JobBench, a benchmark that evaluates AI agents on 130 real-world tasks across 35 occupations. Tasks are chosen by what workers themselves identify as high priority for delegation, not by economic replacement value alone. Across the 36 models tested, even top performers like Claude Opus reach only 45.9% accuracy, suggesting agents currently fall short of practical workplace deployment.
Why it matters: This research reframes how the industry should develop occupational AI, prioritizing human empowerment and augmentation over replacement narratives. That shift could influence product strategy and stakeholder adoption.