Researchers have released CrowdMath, a dataset of 164 expert-annotated mathematical discussions from MIT's collaborative research program, to benchmark how well large language models understand open-ended problem-solving. Frontier models reach 83-88% accuracy at predicting the next post in a mathematical discussion, but they struggle to identify the functional role of individual contributions: the best model reaches only 0.42 macro-F1 when classifying whether a post represents progress, error correction, or proof completion.
Why it matters: This work exposes a critical limitation in how AI systems understand collaborative scientific reasoning. It suggests that current models are optimized for closed-problem solving but lack the nuanced understanding needed for real-world research workflows.
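
For readers unfamiliar with the macro-F1 figure cited above, here is a minimal sketch of how that metric is computed. The label names and the toy predictions are hypothetical, chosen only for illustration; they are not taken from CrowdMath, and this is not the authors' evaluation code. Macro-F1 averages the per-class F1 scores with equal weight, so a model that handles only the most common role well still scores low.

```python
def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # A class that is never predicted and never present scores 0.0 here.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical role labels (assumed names, not the dataset's schema).
LABELS = ["progress", "error_correction", "proof_completion"]

# Toy data: the model labels every post as "progress".
y_true = ["progress"] * 4 + ["error_correction", "proof_completion"]
y_pred = ["progress"] * 6

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
score = macro_f1(y_true, y_pred, LABELS)
print(f"accuracy={accuracy:.2f} macro_f1={score:.2f}")
```

In this toy case four of six posts are labeled correctly, yet the two rarer roles each contribute an F1 of zero, which pulls the macro average well below the accuracy.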