Procedural Memory Distillation (PMD) is a new training method that lets language models retain and reuse procedural knowledge across multiple training episodes: it converts cross-episode signals into reusable memory, which is then absorbed into the model's weights. In tests on Qwen3-8B and OLMo3-Instruct-7B, PMD outperformed existing self-distillation approaches by 3.8–5.5% on science knowledge evaluation and 7.9–13.6% on coding benchmarks.
Why it matters: As AI labs race to improve model reasoning and code generation through reinforcement learning, training methods that capture and reuse procedural patterns could accelerate capability gains without requiring larger models.