Arc Sentry, a neural monitoring system, successfully detected the Crescendo multi-turn jailbreak attack from USENIX Security 2025 by analyzing model internal states rather than text content. While traditional text classifiers like LLM Guard failed entirely (0/8 detections), Arc Sentry flagged the attack by Turn 3 by monitoring shifts in the model's residual stream, achieving a 7x score increase on innocuous-appearing prompts.
Why it matters: As jailbreak techniques grow more sophisticated, understanding that internal model state monitoring can catch attacks invisible to text classifiers represents a significant advance in AI safety and has direct implications for how organizations should design their LLM security stacks.