Researchers introduced AttuneBench, a benchmark that evaluates how well language models recognize and respond to emotions in multi-turn conversations. It draws on 200 real human-model interactions with turn-by-turn emotional annotations. Testing 11 models showed that emotional intelligence is not monolithic: models that excel at emotion recognition may struggle with response preference prediction, suggesting that emotionally intelligent behavior depends on distinct, separable capabilities.
Why it matters: As LLMs increasingly take on conversational roles, gaps in their emotional intelligence bear directly on deployment in customer service, mental health support, and other emotionally sensitive applications where misaligned responses could cause real harm.