0 - Introduction
Reinforcement Learning from Human Feedback (RLHF) with Proximal Policy Optimization (PPO) [1] is a powerful approach for fine-tuning Large Language Models (LLMs). This method combines the PPO algorithm, which is reliable and efficient, with feedback from human evaluators to improve the quality of model-generated responses. However, training LLMs with PPO presents several challenges, including maintaining a stable training process and achieving better performance than simpler alternatives such as Direct Preference Optimization (DPO) [2]. Consequently, we summarize practical tricks for training with RLHF and PPO to help researchers fine-tune LLMs more easily, ensuring both training stability and strong performance.
1 - Advanced Tricks for Training LLM with PPO
We present three types of PPO training techniques: 1) LLM-specific tricks, 2) PPO-specific tricks, and 3) innovative strategies from recent research. The LLM-specific and PPO-specific tricks have been implemented in various RL frameworks [3, 4] and have proven effective. However, the applicability of the innovative strategies proposed in recent papers to specific tasks has not yet been verified.
1.1 - LLM-specific Tricks
- Token Level KL-Penalty: The KL divergence between the response distributions of the RL model and the SFT model is computed for each token and incorporated as a penalty term in the reward during training [5]. Specifically, the per-token reward is given by (a minimal sketch of this computation appears after this list):
r(s_t, a_t) = \textbf{I}(s_t = [\text{EOS}]) \, r(x, y) - \beta \, \text{KL}(t), \quad \text{KL}(t) = \log \frac{\pi^{\text{RL}}_{\theta}(a_t \mid s_t)}{\pi^{\text{SFT}}(a_t \mid s_t)}
where x is the prompt, y is the response, \beta is the KL penalty coefficient, and \textbf{I}(s_t = [\text{EOS}]) is the indicator function that marks whether $t$ is the last token of the response.
- Generalized Advantage Estimation (GAE): GAE [5], a \text{TD}(\lambda)-style return estimation method, is used to estimate the token-wise advantages in PPO. In practice, we typically set \lambda = 1, which reduces GAE to a Monte Carlo estimate (see the GAE sketch after this list). Code Link: https://github.com/OpenLLMAI/OpenRLHF/blob/f8bfc76f1fc6fcf43241104dbee144a3be51ee93/openrlhf/trainer/ppo_utils/experience_maker.py#L213.
- Adding SFT Loss: Incorporating an additional supervised next-token prediction loss, alongside the KL divergence penalty, into PPO can preserve the pre-existing abilities of the SFT model [5] (a sketch of this combined loss follows the list). Code Link: https://github.com/OpenLLMAI/OpenRLHF/blob/9923906e758627a5ae7d76c7fcea25af9415bfd6/openrlhf/trainer/ppo_trainer.py#L296.
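Below is a minimal PyTorch-style sketch of the token-level KL penalty described above. The tensor names (log_probs, ref_log_probs, reward_score, eos_mask) and the kl_coef value are illustrative assumptions, not the exact OpenRLHF implementation.

```python
import torch

def compute_token_level_rewards(
    log_probs: torch.Tensor,      # (batch, seq_len) log pi_RL(a_t | s_t) for response tokens
    ref_log_probs: torch.Tensor,  # (batch, seq_len) log pi_SFT(a_t | s_t) for the same tokens
    reward_score: torch.Tensor,   # (batch,) scalar reward r(x, y) from the reward model
    eos_mask: torch.Tensor,       # (batch, seq_len) 1.0 at the last response token, else 0.0
    kl_coef: float = 0.05,        # beta, the KL penalty coefficient (illustrative value)
) -> torch.Tensor:
    # Per-token KL estimate between the RL policy and the frozen SFT policy.
    kl = log_probs - ref_log_probs
    # Every token is penalized by -beta * KL(t).
    rewards = -kl_coef * kl
    # The scalar sequence-level reward is added only at the [EOS] (last) token.
    rewards = rewards + eos_mask * reward_score.unsqueeze(-1)
    return rewards
```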
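The next sketch shows GAE computed over token-level rewards with a backward scan. The function and argument names are assumptions for illustration rather than the exact code behind the link above; setting lam = 1 recovers the Monte Carlo case mentioned in the GAE item.

```python
import torch

def compute_gae(
    rewards: torch.Tensor,  # (batch, seq_len) per-token rewards
    values: torch.Tensor,   # (batch, seq_len) critic value estimates V(s_t)
    gamma: float = 1.0,     # discount factor
    lam: float = 1.0,       # GAE lambda; lambda = 1 reduces GAE to a Monte Carlo estimate
):
    seq_len = rewards.size(1)
    advantages = torch.zeros_like(rewards)
    last_gae = torch.zeros_like(rewards[:, 0])
    # Scan backwards over the response tokens.
    for t in reversed(range(seq_len)):
        next_value = values[:, t + 1] if t + 1 < seq_len else torch.zeros_like(values[:, t])
        delta = rewards[:, t] + gamma * next_value - values[:, t]
        last_gae = delta + gamma * lam * last_gae
        advantages[:, t] = last_gae
    returns = advantages + values  # targets for the value function
    return advantages, returns
```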
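Finally, a sketch of mixing a supervised next-token prediction loss into the actor update. The ptx_coef weight and the batch layout are illustrative assumptions, not OpenRLHF's exact interface.

```python
import torch
import torch.nn.functional as F

def actor_loss_with_sft(
    ppo_actor_loss: torch.Tensor,  # clipped surrogate loss already computed on RL samples
    sft_logits: torch.Tensor,      # (batch, seq_len, vocab) actor logits on supervised data
    sft_labels: torch.Tensor,      # (batch, seq_len) target token ids, -100 for ignored positions
    ptx_coef: float = 0.05,        # weight of the auxiliary SFT loss (illustrative value)
) -> torch.Tensor:
    # Standard next-token prediction loss on supervised data.
    sft_loss = F.cross_entropy(
        sft_logits.reshape(-1, sft_logits.size(-1)),
        sft_labels.reshape(-1),
        ignore_index=-100,
    )
    # The auxiliary loss anchors the policy to the SFT distribution.
    return ppo_actor_loss + ptx_coef * sft_loss
```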
1.2 - PPO-specific Tricks
- Model Initialization: When training LLMs with PPO, two models must be initialized: the actor model and the critic model [6, 7]. Specifically, initializing the actor with a Supervised Fine-Tuning (SFT) model and the critic with a reward model ensures efficient PPO training (see the initialization sketch after this list). Code Link: https://github.com/OpenLLMAI/OpenRLHF/blob/188139f809d9d14a8b1d8210f9e6746e2422e4e0/examples/train_ppo.py#L39.
- Adam Learning Rate: The Adam learning rate [6] for the actor model is roughly one-tenth of that used for SFT. For instance, in OpenRLHF, the Adam learning rate for SFT is 5e-6, while for the actor model it is 5e-7. The Adam learning rate for the critic model is roughly twice that of SFT, e.g., 9e-6 (these settings also appear in the initialization sketch after this list). Code Link: https://github.com/OpenLLMAI/OpenRLHF/blob/9d8b3fdac345f6a18b37d73c53bfb95a652d1db2/examples/scripts/train_ppo_llama.sh#L20.
- Mini-batch Updates: During the learning phase, the PPO implementation shuffles the indices of the training data, which is of size N \times M (where N is the size of the replay buffer and M is the length of each response), and splits them into mini-batches to compute the gradient and update the policy [6] (see the update-loop sketch after this list). Code Link: https://github.com/OpenLLMAI/OpenRLHF/blob/9d8b3fdac345f6a18b37d73c53bfb95a652d1db2/openrlhf/trainer/ppo_trainer.py#L216.
- Value Function Loss Clipping: PPO clips the value function in the same way as its clipped surrogate objective [6, 7]. Given V_{targ} = \text{returns} = \text{advantages} + \text{values}, PPO fits the value network by minimizing the following loss (a sketch follows this list):
L^{V} = \max\left[ \left( V_{\theta}(s_t) - V_{targ} \right)^2, \left( \text{clip}\left( V_{\theta}(s_t), V_{old}(s_t) - \varepsilon, V_{old}(s_t) + \varepsilon \right) - V_{targ} \right)^2 \right]
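The sketch below illustrates both the model initialization and the learning-rate choices from the items above. The checkpoint paths, the minimal Critic wrapper, and the use of AutoModel/AutoModelForCausalLM are illustrative assumptions, not OpenRLHF's actual classes.

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoModelForCausalLM

SFT_CKPT = "path/to/sft-model"      # hypothetical checkpoint path
RM_CKPT = "path/to/reward-model"    # hypothetical checkpoint path

# Actor: initialized from the SFT model.
actor = AutoModelForCausalLM.from_pretrained(SFT_CKPT)

class Critic(nn.Module):
    """Minimal critic: reward-model backbone plus a per-token value head (illustrative)."""
    def __init__(self, backbone_path: str):
        super().__init__()
        self.backbone = AutoModel.from_pretrained(backbone_path)
        self.value_head = nn.Linear(self.backbone.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask=None):
        hidden = self.backbone(input_ids, attention_mask=attention_mask).last_hidden_state
        return self.value_head(hidden).squeeze(-1)  # (batch, seq_len) value estimates

# Critic: initialized from the reward model's weights (the value head is freshly initialized here).
critic = Critic(RM_CKPT)

# Learning rates follow the rule of thumb above: actor ~= SFT LR / 10, critic ~= 2 * SFT LR.
sft_lr = 5e-6
actor_optimizer = torch.optim.Adam(actor.parameters(), lr=sft_lr / 10)  # 5e-7
critic_optimizer = torch.optim.Adam(critic.parameters(), lr=9e-6)       # roughly 2x the SFT LR
```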
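Next, a sketch of the shuffled mini-batch update loop. The replay buffer layout and the policy_update_fn callable are stand-ins assumed for illustration.

```python
import torch

def ppo_update(replay_buffer, policy_update_fn, micro_batch_size: int, epochs: int = 1):
    """Shuffle the flattened (N x M) training samples and update in mini-batches.

    `replay_buffer` is assumed to be a list of per-token training records and
    `policy_update_fn` a callable that runs one gradient step on a mini-batch;
    both are illustrative stand-ins, not the OpenRLHF API.
    """
    num_samples = len(replay_buffer)
    for _ in range(epochs):
        # Shuffle indices over the whole N x M buffer.
        perm = torch.randperm(num_samples)
        for start in range(0, num_samples, micro_batch_size):
            batch_indices = perm[start : start + micro_batch_size]
            mini_batch = [replay_buffer[i] for i in batch_indices]
            policy_update_fn(mini_batch)  # compute gradients and update actor/critic
```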
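Last, a sketch of the clipped value loss written above. The tensor names, the clip range, and the 0.5 scaling factor are illustrative conventions rather than the referenced implementation.

```python
import torch

def clipped_value_loss(
    values: torch.Tensor,      # (batch, seq_len) V_theta(s_t) from the current critic
    old_values: torch.Tensor,  # (batch, seq_len) V_old(s_t) recorded at rollout time
    returns: torch.Tensor,     # (batch, seq_len) V_targ = advantages + rollout-time values
    clip_range: float = 0.2,   # epsilon (illustrative value)
) -> torch.Tensor:
    # Clip the new value prediction so it stays close to the rollout-time prediction.
    values_clipped = old_values + (values - old_values).clamp(-clip_range, clip_range)
    loss_unclipped = (values - returns) ** 2
    loss_clipped = (values_clipped - returns) ** 2
    # Element-wise maximum of the two squared errors, then averaged over tokens.
    return 0.5 * torch.max(loss_unclipped, loss_clipped).mean()
```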