Research writer Nathan Witkin has published a detailed critique of the widely cited METR Long Tasks benchmark, identifying numerous flaws including guessed baseline data, perverse incentives for human benchmarkers, biased sample selection, and test-data contamination. Witkin argues that the errors are severe enough that the entire graph should be disregarded rather than patched, raising questions about scientific rigor in AI capability assessment.
Why it matters: The METR graph has been heavily used to inform industry expectations about AI timelines and capabilities; compromised benchmarks risk distorting investment decisions, regulation, and public understanding of AI progress.