A new arXiv study evaluates four methods for detecting when language models deliberately lie, using 13 custom-built model organisms with verified hidden beliefs and testing across 31 models ranging from 2B to 1T parameters. All four detectors perform well on prompted lying tasks and improve with model capability. On trained model organisms, the most realistic test case, activation-based and logprob detectors fail significantly; only chain-of-thought judges maintain strong performance, at 82% accuracy.
Why it matters: As AI safety and interpretability teams invest in lie detection for model auditing and monitoring, this work shows that current detection methods may give false confidence about whether a model is actually deceptive or merely appears to be.