A new benchmark, AgingBench, shows that deployed AI agents lose reliability over time even when model weights are frozen. The decay stems from memory compression, interference, fact revision, and maintenance issues. Testing across 14 models and 400+ runs shows that degradation is multi-faceted: behavioral tests can keep passing while factual accuracy decays. Diagnosis and repair therefore need to target the stage of the memory pipeline where a failure originates.
Why it matters: As companies deploy persistent AI agents in production systems, understanding and preventing reliability decay over time is critical for operational safety and cost. Current day-one benchmarks do not catch failures that emerge only over an agent's lifespan.