Researchers testing top-ranked AI-generated CUDA kernel submissions from NVIDIA's SOL-ExecBench found that many broke silently in production workloads, including a fused embedding-gradient kernel that caused training loss divergence. The kernel passed the benchmark's verifier but accumulated gradients in lower-precision bf16 instead of fp32, causing high-frequency token embeddings to drift—a bug masked by AdamW's normalization that could mislead researchers into thinking their models or ideas were flawed.
Why it matters: This reveals a critical gap between AI-generated code benchmarking and real-world reliability, raising concerns about using synthetic code generation for critical ML infrastructure where subtle numerical bugs can masquerade as fundamental research failures.