Researchers mechanistically analyzed three popular vision-language models (LLaVA-1.5, PaliGemma, Qwen2-VL) and found that sharp attention maps, long assumed to signal model confidence, are nearly useless predictors of correctness, showing near-zero correlation with answer accuracy. Instead, model reliability is encoded in hidden-state geometry and sparse late-layer circuits: probes on hidden states achieve >0.95 AUROC at predicting correctness, and self-consistency emerges as the strongest behavioral predictor. The study also reveals a critical architectural difference: late-fusion models like LLaVA concentrate reliability signals in a fragile bottleneck, while early-fusion models distribute them robustly.
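As a rough sketch of what a hidden-state probe involves, the snippet below fits a logistic-regression probe on per-example hidden states and scores it with held-out AUROC. This is an illustration under assumed inputs, not the paper's pipeline: the array shapes are arbitrary, and `hidden_states` and `correct` are synthetic stand-ins (in practice they would be a chosen layer's activations and binary correctness labels for the model's answers).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Hypothetical inputs: one hidden-state vector per example (e.g. the
# last-token activation at a chosen layer) and a binary label marking
# whether the model answered that example correctly.
rng = np.random.default_rng(0)
n_examples, d_model = 2000, 512                  # arbitrary sizes
hidden_states = rng.normal(size=(n_examples, d_model))
correct = (hidden_states[:, 0] > 0).astype(int)  # synthetic labels only

X_train, X_test, y_train, y_test = train_test_split(
    hidden_states, correct, test_size=0.2, random_state=0
)

# The probe: a linear classifier trained to predict correctness from
# hidden states; its held-out AUROC is the reliability signal.
probe = LogisticRegression(max_iter=1000)
probe.fit(X_train, y_train)
scores = probe.predict_proba(X_test)[:, 1]
print(f"held-out probe AUROC: {roc_auc_score(y_test, scores):.3f}")
```

The high AUROC printed here only reflects the separable synthetic labels; the point is the recipe: collect hidden states, label examples by answer correctness, fit a linear probe, and read its held-out AUROC as a reliability score.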
Why it matters: For AI teams building monitoring systems and safety evaluations for vision-language models, this research overturns a widespread assumption and offers actionable mechanistic insight into where model reliability actually lives, shifting focus from easily visualized attention patterns to harder-to-inspect internal representations.