Researchers have found that large language models used to evaluate and benchmark other AI systems can be manipulated into reversing their judgments through post-decision interaction and targeted challenges. The study, which tested LLM judges on MT-Bench and AlpacaEval, found that initial judgments remain stable under neutral reevaluation but become substantially reversible under motivated conversation, which can degrade agreement with human preferences and shift benchmark rankings. The researchers introduce an Evaluation Robustness Score (ERS) to measure this vulnerability and call for evaluation protocols that assess robustness under challenge, not just static accuracy.
Why it matters: As LLM-as-judge evaluation becomes increasingly central to AI benchmarking pipelines, this finding reveals a critical vulnerability that could undermine the reliability of model comparisons and rankings used across the industry.
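To make the measurement idea concrete, here is a minimal sketch of how reversal under challenge could be tallied. It is not the study's ERS, whose formula is not given in this summary. The `Trial` record, the `flip_rate` and `robustness_summary` helpers, and the toy verdicts are all hypothetical, and the sketch assumes each judgment is a simple pairwise verdict ("A" or "B") recorded three times: initially, after a neutral request to re-evaluate, and after a motivated challenge.

```python
from dataclasses import dataclass


@dataclass
class Trial:
    """One pairwise comparison judged three times (hypothetical record layout)."""
    initial: str          # judge's first verdict, e.g. "A" or "B"
    after_neutral: str    # verdict after a neutral "please re-evaluate" prompt
    after_challenge: str  # verdict after a motivated challenge to the verdict


def flip_rate(trials, attr):
    """Fraction of trials whose verdict in `attr` differs from the initial verdict."""
    if not trials:
        raise ValueError("need at least one trial")
    flips = sum(1 for t in trials if getattr(t, attr) != t.initial)
    return flips / len(trials)


def robustness_summary(trials):
    """Illustrative summary; NOT the paper's ERS, just simple flip rates."""
    neutral = flip_rate(trials, "after_neutral")
    challenged = flip_rate(trials, "after_challenge")
    return {
        "neutral_flip_rate": neutral,
        "challenge_flip_rate": challenged,
        "stability_under_challenge": 1.0 - challenged,
    }


if __name__ == "__main__":
    # Toy verdicts invented for illustration only; not data from the study.
    toy_trials = [
        Trial("A", "A", "B"),
        Trial("A", "A", "A"),
        Trial("B", "B", "A"),
        Trial("B", "B", "B"),
    ]
    print(robustness_summary(toy_trials))
```

A gap between a low neutral flip rate and a high challenge flip rate is the pattern the summary describes: verdicts that hold up on re-asking but give way under pushback.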