Researchers introduce Strategy-Guided Policy Optimization (SGPO), a new method that teaches weaker language models problem-solving strategies drawn from stronger models, rather than having them memorize the stronger models' solution steps. The approach extracts reusable strategy descriptions and uses adaptive weighting to guide learning, outperforming standard fine-tuning and reinforcement learning baselines by 2.2 points on mathematical benchmarks.
Why it matters: As organizations turn to smaller language models to cut cost and latency, distillation techniques that preserve generalizable reasoning, rather than encouraging memorization, directly affect whether capable smaller models are viable in production.