A new task-generation pipeline called Anchor addresses "artifact drift," the inconsistencies that make AI agent benchmarks unsolvable or open to gaming, by formalizing business workflows as constraint optimization programs with verifiable solutions. The team used Anchor to build ERP-Bench, a 300-task benchmark for procurement and manufacturing workflows. On it, frontier AI models meet the explicit constraints 26.1% of the time but reach a fully optimal solution only 17.4% of the time.
Why it matters: As AI agents move into mission-critical business operations, reliable, auditable evaluation environments are essential for ensuring these systems perform correctly on real enterprise tasks rather than exploiting benchmark flaws.
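To make the gap between "meets the explicit constraints" and "fully optimal" concrete, here is a minimal sketch of a procurement task written as a constraint optimization program. It is not Anchor's actual formulation: the suppliers, costs, capacities, demand figure, and the `grade` function are all hypothetical, and the optimum is found by brute force only because the toy problem is tiny.

```python
from itertools import product

# Hypothetical toy task (not from ERP-Bench): buy at least DEMAND units
# from two suppliers at minimum total cost, respecting each capacity.
DEMAND = 10
SUPPLIERS = {
    "A": {"unit_cost": 4, "capacity": 6},
    "B": {"unit_cost": 5, "capacity": 8},
}


def cost(order):
    """Total purchase cost of an order mapping supplier name -> quantity."""
    return sum(SUPPLIERS[s]["unit_cost"] * q for s, q in order.items())


def is_feasible(order):
    """Explicit constraints: capacity limits per supplier and demand coverage."""
    within_capacity = all(
        0 <= q <= SUPPLIERS[s]["capacity"] for s, q in order.items()
    )
    return within_capacity and sum(order.values()) >= DEMAND


def optimal_cost():
    """Reference optimum by exhaustive search over all integer orders."""
    names = list(SUPPLIERS)
    ranges = [range(SUPPLIERS[s]["capacity"] + 1) for s in names]
    candidates = (dict(zip(names, qs)) for qs in product(*ranges))
    return min(cost(o) for o in candidates if is_feasible(o))


def grade(order):
    """Classify an agent's answer as infeasible, feasible, or optimal."""
    if not is_feasible(order):
        return "infeasible"
    return "optimal" if cost(order) == optimal_cost() else "feasible"


print(grade({"A": 6, "B": 4}))  # cheapest order that covers demand
print(grade({"A": 2, "B": 8}))  # covers demand but costs more
print(grade({"A": 6, "B": 3}))  # falls one unit short of demand
```

Because the task has a computable reference optimum, an answer can be checked mechanically at both levels, which is the property that makes such a benchmark auditable.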