Researchers introduce Curation-Bench, a benchmark testing whether AI agents can autonomously optimize training data selection—a labor-intensive bottleneck in AI development. Out-of-the-box agents match published baselines in ten iterations, but only reach full potential when scaffolded to methodically adapt prior research rather than explore freely, ultimately creating policies that outperform baselines on 10× less data.
Why it matters: Data curation is a critical, manual-heavy stage of AI development; automating it reliably could accelerate research velocity and make model development more accessible, but the findings show that effective automation requires structured guidance rather than pure agent autonomy.