A new study reveals that consensus-based LLM jury systems suffer from unbounded bias when individual judges fail in typical ways, such as mode collapse or safety refusal. The researchers propose RoPoLL, which replaces standard averaging with geometric median aggregation. They report a 19% improvement over baseline jury methods and show that a smaller 3-judge committee (38B parameters) can outperform a single 675B-parameter model on biased evaluations.
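To make the difference between averaging and geometric median aggregation concrete, here is a minimal sketch. It is not the RoPoLL implementation: the summary does not say how the geometric median is computed, so the sketch assumes the standard Weiszfeld iteration, and the function names, the two-dimensional rubric, and the scores are all hypothetical.

```python
import math


def mean_vector(points):
    """Plain averaging: each judge pulls the result in proportion to its distance."""
    n = len(points)
    dims = len(points[0])
    return [sum(p[d] for p in points) / n for d in range(dims)]


def geometric_median(points, iters=200, tol=1e-9):
    """Approximate the point minimizing the sum of Euclidean distances to all
    points, using Weiszfeld's iteration (assumed here, not taken from the study)."""
    est = mean_vector(points)  # start from the ordinary mean
    dims = len(est)
    for _ in range(iters):
        # Weight each judge by inverse distance to the current estimate,
        # so far-away (outlier) judges contribute less on each step.
        weights = [1.0 / max(math.dist(p, est), 1e-12) for p in points]
        total = sum(weights)
        new = [
            sum(w * p[d] for w, p in zip(weights, points)) / total
            for d in range(dims)
        ]
        if math.dist(new, est) < tol:
            return new
        est = new
    return est


# Hypothetical scores from three judges on a two-dimensional rubric (1-10 scale).
# The third judge imitates a safety refusal by returning the minimum score.
scores = [[8.0, 7.0], [7.5, 7.5], [1.0, 1.0]]

print("mean:            ", mean_vector(scores))
print("geometric median:", geometric_median(scores))
```

In this toy case the mean is dragged well below both healthy judges, while the geometric median stays close to them. The geometric median tolerates corruption of fewer than half of its inputs, so a 3-judge committee can absorb one failing judge but not two that fail the same way.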
Why it matters: As multi-LLM evaluation becomes standard for assessing AI systems, understanding and fixing bias in jury consensus mechanisms directly affects the reliability of AI benchmarking and model selection decisions across the industry.