A new preprint on arXiv describes a technique called Controlled Latent-space Evasion, which suppresses refusal behavior in safety-aligned language models by manipulating their internal representations. The authors report higher attack success rates than existing jailbreak methods across 15 models, including multimodal and reasoning variants. The work frames refusal suppression as an evasion attack against linear probes, which gives a theoretical framework for understanding how such attacks work.
Why it matters: As language models become more capable, understanding and strengthening defenses against jailbreak attacks is critical for AI safety. This research indicates that current safety mechanisms can be bypassed at the level of internal representations, a weakness that researchers and organizations deploying these models will need to address.
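To make the linear-probe framing concrete, here is a toy sketch of what "evading a linear probe" means geometrically. It is not the paper's method, whose details are not given here. It assumes a hypothetical probe that reads a hidden vector `h` and flags "refusal" when `w·h + b` is positive, and it computes the smallest L2 shift that pushes `h` just across that boundary. All names (`probe_score`, `minimal_evasion`, `margin`) and the two-dimensional numbers are illustrative assumptions, not taken from the study.

```python
import math

def probe_score(w, b, h):
    """Hypothetical linear probe: a positive score is read as 'refusal'."""
    return sum(wi * hi for wi, hi in zip(w, h)) + b

def minimal_evasion(w, b, h, margin=1e-3):
    """Return h shifted by the smallest L2 step that brings the score to -margin.

    For a linear probe the shortest path to the decision boundary runs
    along the weight vector w, so the shift is a scalar multiple of w.
    """
    score = probe_score(w, b, h)
    if score <= -margin:
        return list(h)  # already on the non-refusal side; nothing to do
    norm_sq = sum(wi * wi for wi in w)
    step = (score + margin) / norm_sq
    return [hi - step * wi for wi, hi in zip(w, h)]

# Toy two-dimensional example (made-up numbers, not model activations).
w = [3.0, 4.0]
b = -1.0
h = [1.0, 1.0]

h_adv = minimal_evasion(w, b, h)
shift = math.sqrt(sum((a - c) ** 2 for a, c in zip(h_adv, h)))

print("score before:", probe_score(w, b, h))
print("score after: ", probe_score(w, b, h_adv))
print("L2 shift:    ", shift)
```

The sketch shows why this framing is analytically convenient: against a linear probe, the distance to the decision boundary has a closed form, `(score + margin) / ||w||`, so the size of the perturbation needed to flip the probe's output can be reasoned about directly.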