AI & Tech·May 27, 2026·1 source verified

Researchers Release Anchor Framework to Fix AI Agent Evaluation Problems in Enterprise Tasks

Summarised by Relevant News AI · Read time: 3 min

A new task-generation pipeline called Anchor targets "artifact drift": inconsistencies that leave AI agent benchmarks unsolvable or open to gaming. It does so by formalizing business workflows as constraint optimization programs with verifiable solutions. The team used Anchor to build ERP-Bench, a 300-task benchmark covering procurement and manufacturing workflows. On it, frontier AI models produce fully optimal solutions only 17.4% of the time, even though they satisfy the explicit constraints 26.1% of the time.
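
To make the gap between "meets the constraints" and "fully optimal" concrete, here is a minimal sketch of a workflow expressed as a constraint optimization problem with a checkable answer. It is not Anchor's actual code or an ERP-Bench task: the supplier data, the demand figure, and the function names (`is_feasible`, `optimal_cost`, `grade`) are all hypothetical, and brute-force enumeration stands in for whatever solver a real pipeline would use.

```python
from itertools import product

# Hypothetical toy procurement task (illustrative numbers only):
# buy exactly DEMAND units across suppliers at minimum total cost.
SUPPLIERS = {
    "A": {"price": 5, "capacity": 6},
    "B": {"price": 7, "capacity": 10},
    "C": {"price": 6, "capacity": 3},
}
DEMAND = 10


def cost(order):
    """Total cost of an order given as {supplier: quantity}."""
    return sum(SUPPLIERS[s]["price"] * q for s, q in order.items())


def is_feasible(order):
    """Explicit constraints: known suppliers, capacity limits, exact demand."""
    if set(order) != set(SUPPLIERS):
        return False
    within_capacity = all(
        0 <= q <= SUPPLIERS[s]["capacity"] for s, q in order.items()
    )
    return within_capacity and sum(order.values()) == DEMAND


def optimal_cost():
    """Ground-truth optimum found by enumerating every candidate order."""
    names = list(SUPPLIERS)
    ranges = [range(SUPPLIERS[s]["capacity"] + 1) for s in names]
    candidates = (dict(zip(names, qs)) for qs in product(*ranges))
    return min(cost(o) for o in candidates if is_feasible(o))


def grade(order):
    """Classify an agent's answer as infeasible, feasible, or optimal."""
    if not is_feasible(order):
        return "infeasible"
    return "optimal" if cost(order) == optimal_cost() else "feasible"


if __name__ == "__main__":
    print(grade({"A": 10, "B": 0, "C": 0}))  # exceeds supplier A's capacity
    print(grade({"A": 0, "B": 10, "C": 0}))  # valid, but not the cheapest
    print(grade({"A": 6, "B": 1, "C": 3}))   # cheapest valid order
```

Because the optimum is computed from the same formal specification used to check feasibility, an answer can be graded without relying on a hand-written reference solution. An agent that only satisfies the stated constraints lands in the "feasible" bucket rather than the "optimal" one.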

Why it matters: As AI agents move into mission-critical business operations, reliable, auditable evaluation environments are essential for ensuring these systems perform correctly on real enterprise tasks rather than exploiting benchmark flaws.

All sources

arXiv cs.AI