0 - Introduction

Reinforcement Learning from Human Feedback (RLHF) with Proximal Policy Optimization (PPO) [1] is a powerful approach for fine-tuning Large Language Models (LLMs). It combines the PPO algorithm, which is reliable and efficient, with feedback from human evaluators to improve the quality of model-generated responses. However, training LLMs with PPO presents several challenges, including keeping the training process stable and outperforming Direct Preference Optimization (DPO) [2]. We have therefore summarized practical training tricks for RLHF with PPO to help researchers fine-tune LLMs more easily while ensuring both training stability and high performance.

1 - Advanced Tricks for Training LLM with PPO

We present three types of PPO training techniques: 1) LLM-specific tricks, 2) PPO-specific tricks, and 3) innovative strategies from recent research. The LLM-specific and PPO-specific tricks have been implemented in various RL frameworks [3, 4] and have proven effective. The innovative strategies proposed in recent papers, however, have not yet been verified across specific tasks.

1.1 - LLM-specific Tricks

  • Token Level KL-Penalty: The KL-Divergence between the response distributions of the RL model and the SFT model is calculated for each token. This divergence is then incorporated as a penalty term in the reward function during training [5]. Specifically, the per-token reward is represented as follows:

r(s_t, a_t) = \textbf{I}(s_t = [\text{EOS}])\, r(x, y) - \beta\, \text{KL}(t) \ \ \ (1)

\text{KL}(t) = \log\left(\pi^{\text{RL}}_{\theta_{\text{old}}}(a_t|s_t) / \pi^{\text{SFT}}(a_t|s_t)\right) \ \ \ (2)

where x is the prompt, y is the response, \beta is the KL penalty coefficient, and \textbf{I}(s_t = [\text{EOS}]) is the indicator function, equal to 1 when t is the last token of the response and 0 otherwise.

Code Link: https://github.com/OpenLLMAI/OpenRLHF/blob/f8bfc76f1fc6fcf43241104dbee144a3be51ee93/openrlhf/models/utils.py#L56.
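To make Eq. (1) and (2) concrete, here is a minimal sketch in plain Python for a single response. It is not the OpenRLHF implementation linked above: the function name `per_token_rewards`, its parameters, and the default `beta=0.1` are hypothetical choices for illustration, and it assumes the per-token log-probabilities of the sampled tokens under the RL and SFT models have already been computed.

```python
from typing import List


def per_token_rewards(
    logprobs_rl: List[float],
    logprobs_sft: List[float],
    sequence_reward: float,
    beta: float = 0.1,  # hypothetical default for the KL coefficient
) -> List[float]:
    """Per-token rewards for one response, following Eq. (1) and (2).

    logprobs_rl[t]  : log pi_RL(a_t | s_t) for the sampled token a_t
    logprobs_sft[t] : log pi_SFT(a_t | s_t) for the same token
    sequence_reward : r(x, y), the reward model score for the full response
    """
    if len(logprobs_rl) != len(logprobs_sft):
        raise ValueError("log-probability lists must have the same length")
    if not logprobs_rl:
        return []

    # Eq. (2): KL(t) = log(pi_RL / pi_SFT) = log pi_RL - log pi_SFT
    kl = [lp_rl - lp_sft for lp_rl, lp_sft in zip(logprobs_rl, logprobs_sft)]

    # Eq. (1): every token receives the penalty -beta * KL(t) ...
    rewards = [-beta * k for k in kl]

    # ... and only the last (EOS) token also receives r(x, y).
    rewards[-1] += sequence_reward
    return rewards


if __name__ == "__main__":
    # Toy numbers, not taken from any real model.
    print(per_token_rewards([-1.0, -0.5, -0.2], [-1.2, -0.5, -0.1], 2.0))
```

In a batched tensor implementation, the same logic would typically be expressed with an action mask that locates the last non-padding token of each response; the list version above only illustrates how the sparse sequence-level reward and the dense KL penalty are combined.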

1.2 - PPO-specific Tricks