Researchers introduced OmniToM, a benchmark that tests whether large language models can actually construct mental-state representations rather than simply answering questions about social scenarios. Built on 895 stories with over 22,000 labeled belief propositions, the benchmark reveals that current LLMs struggle to track how different actors' knowledge and beliefs diverge, particularly when modeling false or evolving beliefs across a narrative.
Why it matters: As LLMs are deployed in social-reasoning tasks, from customer service to content moderation, understanding their fundamental limitations in modeling other minds is critical for assessing their reliability in human-centric applications.