Researchers introduced SciConBench, a 9,110-question benchmark with expert-validated conclusions from systematic reviews, to evaluate how well AI agents synthesize scientific information in high-stakes domains such as health. Testing eight frontier models and consumer-facing AI agents (including Google AI Overview), they found that factual quality remains critically low: the best-performing agent achieved an F1 score of only 0.337 on factual accuracy. They also found that data leakage in unconstrained evaluation artificially inflates performance estimates.
Why it matters: As AI agents increasingly inform consequential health and scientific decisions, this research exposes a critical gap between perceived and actual capability, one that enterprises and regulators need to account for when evaluating AI reliability in high-stakes applications.
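
For readers less familiar with the metric behind the 0.337 figure: F1 is the harmonic mean of precision and recall, so a low value means an agent is missing correct claims, asserting incorrect ones, or both. The sketch below shows only the generic formula. How SciConBench actually scores factual accuracy is not described in this summary, and the precision and recall values used here are hypothetical, chosen purely for illustration.

```python
def f1_score(precision: float, recall: float) -> float:
    """Return the harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical inputs for illustration only; these are NOT figures from the study.
example_f1 = f1_score(precision=0.40, recall=0.30)
print(f"F1 = {example_f1:.3f}")
```

Because the harmonic mean is pulled toward the smaller of its two inputs, an agent cannot reach a high F1 by being strong on precision alone or on recall alone.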