A new study questions recent findings that LLMs can detect and report their internal states, arguing that what appears to be genuine introspection may actually be pattern-matching on surface cues. Re-examining two popular evaluation methods, the researchers found that models cannot reliably distinguish tampering with their internal states from manipulation of their inputs, and that external classifiers match the performance of models predicting from their hidden states. Both results suggest that models lack true privileged access to their own representations.
Why it matters: Claims about LLM self-awareness and metacognition influence AI safety research and capability assessments, so rigorous scrutiny of the evidence behind them is critical for accurately understanding what current models can and cannot do.