ML Tea: RL's Razor: Why On-Policy Reinforcement Learning Forgets Less
Speaker: Idan Shenfeld
Title: RL's Razor: Why On-Policy Reinforcement Learning Forgets Less
Abstract: Comparing models fine-tuned with reinforcement learning (RL) and with supervised fine-tuning (SFT) reveals that, despite similar performance on a new task, RL consistently forgets less. We find that the degree of forgetting is determined by the distributional shift, namely the KL-divergence between the fine-tuned and base policy evaluated on the new task distribution. We discover that on-policy RL is implicitly biased towards KL-minimal solutions among the many that solve the new task, whereas SFT can converge to distributions arbitrarily far from the base model. Our findings are validated empirically with large language models and in controlled toy settings. Further, we provide theoretical justification for why on-policy RL updates lead to a smaller KL change. We term this principle "RL's Razor": among all ways to solve a new task, RL prefers those closest in KL to the original model.
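To make the forgetting measure mentioned in the abstract concrete, here is a minimal LaTeX sketch of the quantity, under assumed notation (\pi_0 for the base policy, \pi_\theta for the fine-tuned policy, \mathcal{D}_{\text{new}} for the new task's input distribution; these symbols are illustrative and not taken from the announcement itself):

% Forgetting is predicted by the expected KL divergence between the
% fine-tuned policy and the base policy, with the expectation taken
% over inputs drawn from the new task distribution. The argument
% order follows the abstract's phrasing ("fine-tuned and base policy");
% the notation is assumed, not quoted from the talk.
\[
  \text{Forgetting} \;\propto\;
  \mathbb{E}_{x \sim \mathcal{D}_{\text{new}}}
  \Big[ D_{\mathrm{KL}}\big( \pi_\theta(\cdot \mid x) \,\|\, \pi_0(\cdot \mid x) \big) \Big]
\]

In words: among fine-tuned models that solve the new task equally well, the one whose output distribution has drifted least (in KL) from the base model on the new task's inputs is the one expected to forget the least.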