Researchers found that instruction-tuned language models show no racial bias in their mortgage underwriting decisions, yet still encode hidden demographic associations in their internal layers, and those associations can be manipulated. Using activation steering, they showed that reinjecting the suppressed bias information at critical model layers can reverse lending decisions. This latent bias is asymmetric across demographic groups and can be exploited through prompt engineering.
Why it matters: Behavioral audits of AI systems used in high-stakes decisions such as lending are insufficient if they measure only outputs, because bad actors can weaponize biases that remain inside the model. Regulators and practitioners will need deeper testing frameworks that also examine a model's internal representations.
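
How steering can flip a decision: the toy sketch below illustrates the general idea of activation steering and is not the researchers' method, model, or data. It assumes a hypothetical three-dimensional hidden state, a hypothetical "demographic direction" `d`, and a hypothetical linear readout that turns the hidden state into an approve/deny decision. The suppression step removes the component along `d`, so the output looks unbiased; adding that component back changes the outcome because the readout still carries weight on it.

```python
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def project_out(h: Vector, d: Vector) -> Vector:
    """Remove the component of h along the unit direction d (the 'suppression')."""
    scale = dot(h, d)
    return [hi - scale * di for hi, di in zip(h, d)]


def steer(h: Vector, d: Vector, alpha: float) -> Vector:
    """Add alpha * d to the hidden state (the 'reinjection')."""
    return [hi + alpha * di for hi, di in zip(h, d)]


def decide(h: Vector, readout: Vector) -> str:
    """Approve when the linear readout score is positive, otherwise deny."""
    return "approve" if dot(readout, h) > 0 else "deny"


# All values below are made up for illustration.
d = [0.0, 0.0, 1.0]         # hypothetical demographic direction (unit length)
readout = [1.0, 0.5, -2.0]  # hypothetical readout; note its weight on d
h_raw = [0.4, 0.2, 0.9]     # hypothetical hidden state for one applicant

h_clean = project_out(h_raw, d)          # component along d is removed
baseline = decide(h_clean, readout)      # decision with the bias suppressed
steered = decide(steer(h_clean, d, 0.5), readout)    # decision after reinjection
opposite = decide(steer(h_clean, d, -0.5), readout)  # steering the other way

print(baseline, steered, opposite)
```

In this toy setup an output-only audit sees just `baseline`, while the steered run shows that the readout's sensitivity to `d` is still present. Steering in the opposite direction leaves the decision unchanged here, but that is an artifact of the chosen numbers, not a reproduction of the asymmetry the researchers report.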