A developer created LLM Win, a tool that visualizes large language model benchmark results as a directed graph to reveal transitive relationships between models. Across 126,937 model pairs, a nominally weaker model can reach a nominally stronger one through a chain of benchmark wins 94.2% of the time, usually in just 2-3 hops, suggesting that AI capability rankings are better represented as multi-dimensional capability graphs than as a single linear ladder.
Why it matters: If LLM performance varies across benchmarks rather than following a single hierarchy, that has significant implications for evaluation methodology, benchmark design, and how organizations select AI systems for specific tasks.
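The "benchmark chain" idea can be sketched as reachability in a directed graph: an edge from model A to model B means A beats B on at least one benchmark, and a short path from a weaker model to a stronger one is a 2-3 hop win chain. The sketch below is illustrative only; the model names, edges, and graph construction are hypothetical assumptions, not LLM Win's actual data or implementation.

```python
from collections import deque

# Hypothetical win edges: (a, b) means model a beats model b on at
# least one benchmark. Names and matchups are invented for illustration.
edges = [
    ("model-small", "model-mid"),    # small beats mid on a coding benchmark
    ("model-mid", "model-large"),    # mid beats large on a math benchmark
    ("model-large", "model-small"),  # large beats small on general knowledge
]

def shortest_win_chain(edges, src, dst):
    """BFS over the directed win graph; returns the shortest chain
    of benchmark wins from src to dst, or None if unreachable."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, []).append(b)
    queue = deque([[src]])
    seen = {src}
    while queue:
        path = queue.popleft()
        if path[-1] == dst:
            return path
        for nxt in graph.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

# A 2-hop chain lets the "weaker" model reach the "stronger" one:
print(shortest_win_chain(edges, "model-small", "model-large"))
# → ['model-small', 'model-mid', 'model-large']
```

Note that the example graph contains a cycle (large also beats small), which is exactly why a linear ladder cannot represent these results: no single ordering of the three models is consistent with all the edges.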