0 - Introduction
Reinforcement Learning from Human Feedback (RLHF) with Proximal Policy Optimization (PPO) [1] is a powerful approach for fine-tuning Large Language Models (LLMs). This method combines the PPO algorithm, which is reliable and efficient, with feedback from human evaluators to improve the quality of model-generated responses. However, training LLMs with PPO presents several challenges, including maintaining a stable training process and achieving better performance than simpler alternatives such as Direct Preference Optimization (DPO) [2]. Consequently, we summarize practical tricks for training with RLHF and PPO to help researchers fine-tune LLMs more easily, ensuring both training stability and strong performance.
1 - Advanced Tricks for Training LLM with PPO
We present three types of PPO training techniques: 1) LLM-specific tricks, 2) PPO-specific tricks, and 3) innovative strategies from recent research. The LLM-specific and PPO-specific tricks have been implemented in various RL frameworks [3, 4] and have proven effective. However, the applicability of the innovative strategies proposed in recent papers to specific tasks has not yet been verified.
1.1 - LLM-specific Tricks
- Token Level KL-Penalty: The KL divergence between the response distributions of the RL model and the SFT model is computed for each token and incorporated as a penalty term in the reward during training [5]. Specifically, the per-token reward is given by (a minimal sketch of this computation appears after this list):
r(s_t, a_t) = \textbf{I}(s_t = [\text{EOS}]) \, r(x, y) - \beta \, \text{KL}(t), \quad \text{KL}(t) = \log \frac{\pi^{\text{RL}}_{\theta}(a_t \mid s_t)}{\pi^{\text{SFT}}(a_t \mid s_t)}
where x is the prompt, y is the response, \beta is the KL penalty coefficient, and \textbf{I}(s_t = [\text{EOS}]) is the indicator function that marks whether $t$ is the last token of the response.
- Generalized Advantage Estimation (GAE): GAE [5], a \text{TD}(\lambda)-style return estimation method, is used to estimate the token-wise advantages in PPO. In practice, we typically set \lambda = 1, which reduces GAE to a Monte Carlo estimate (see the GAE sketch after this list). Code Link: https://github.com/OpenLLMAI/OpenRLHF/blob/f8bfc76f1fc6fcf43241104dbee144a3be51ee93/openrlhf/trainer/ppo_utils/experience_maker.py#L213.
- Adding SFT Loss: Incorporating an additional supervised next-token prediction loss, alongside the KL divergence penalty, into PPO can preserve the pre-existing abilities of the SFT model [5] (a sketch of this combined loss follows the list). Code Link: https://github.com/OpenLLMAI/OpenRLHF/blob/9923906e758627a5ae7d76c7fcea25af9415bfd6/openrlhf/trainer/ppo_trainer.py#L296.
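Below is a minimal PyTorch-style sketch of the token-level KL penalty described above. The tensor names (log_probs, ref_log_probs, reward_score, eos_mask) and the kl_coef value are illustrative assumptions, not the exact OpenRLHF implementation.

```python
import torch

def compute_token_level_rewards(
    log_probs: torch.Tensor,      # (batch, seq_len) log pi_RL(a_t | s_t) for response tokens
    ref_log_probs: torch.Tensor,  # (batch, seq_len) log pi_SFT(a_t | s_t) for the same tokens
    reward_score: torch.Tensor,   # (batch,) scalar reward r(x, y) from the reward model
    eos_mask: torch.Tensor,       # (batch, seq_len) 1.0 at the last response token, else 0.0
    kl_coef: float = 0.05,        # beta, the KL penalty coefficient (illustrative value)
) -> torch.Tensor:
    # Per-token KL estimate between the RL policy and the frozen SFT policy.
    kl = log_probs - ref_log_probs
    # Every token is penalized by -beta * KL(t).
    rewards = -kl_coef * kl
    # The scalar sequence-level reward is added only at the [EOS] (last) token.
    rewards = rewards + eos_mask * reward_score.unsqueeze(-1)
    return rewards
```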
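The next sketch shows GAE computed over token-level rewards with a backward scan. The function and argument names are assumptions for illustration rather than the exact code behind the link above; setting lam = 1 recovers the Monte Carlo case mentioned in the GAE item.

```python
import torch

def compute_gae(
    rewards: torch.Tensor,  # (batch, seq_len) per-token rewards
    values: torch.Tensor,   # (batch, seq_len) critic value estimates V(s_t)
    gamma: float = 1.0,     # discount factor
    lam: float = 1.0,       # GAE lambda; lambda = 1 reduces GAE to a Monte Carlo estimate
):
    seq_len = rewards.size(1)
    advantages = torch.zeros_like(rewards)
    last_gae = torch.zeros_like(rewards[:, 0])
    # Scan backwards over the response tokens.
    for t in reversed(range(seq_len)):
        next_value = values[:, t + 1] if t + 1 < seq_len else torch.zeros_like(values[:, t])
        delta = rewards[:, t] + gamma * next_value - values[:, t]
        last_gae = delta + gamma * lam * last_gae
        advantages[:, t] = last_gae
    returns = advantages + values  # targets for the value function
    return advantages, returns
```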
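Finally, a sketch of mixing a supervised next-token prediction loss into the actor update. The ptx_coef weight and the batch layout are illustrative assumptions, not OpenRLHF's exact interface.

```python
import torch
import torch.nn.functional as F

def actor_loss_with_sft(
    ppo_actor_loss: torch.Tensor,  # clipped surrogate loss already computed on RL samples
    sft_logits: torch.Tensor,      # (batch, seq_len, vocab) actor logits on supervised data
    sft_labels: torch.Tensor,      # (batch, seq_len) target token ids, -100 for ignored positions
    ptx_coef: float = 0.05,        # weight of the auxiliary SFT loss (illustrative value)
) -> torch.Tensor:
    # Standard next-token prediction loss on supervised data.
    sft_loss = F.cross_entropy(
        sft_logits.reshape(-1, sft_logits.size(-1)),
        sft_labels.reshape(-1),
        ignore_index=-100,
    )
    # The auxiliary loss anchors the policy to the SFT distribution.
    return ppo_actor_loss + ptx_coef * sft_loss
```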
1.2 - PPO-specific Tricks
- Model Initialization: When training LLMs with PPO, two models must be initialized: the actor model and the critic model [6, 7]. Specifically, initializing the actor with a Supervised Fine-Tuning (SFT) model and the critic with a reward model ensures efficient PPO training (see the initialization sketch after this list). Code Link: https://github.com/OpenLLMAI/OpenRLHF/blob/188139f809d9d14a8b1d8210f9e6746e2422e4e0/examples/train_ppo.py#L39.
- Adam Learning Rate: The Adam learning rate [6] for the actor model is roughly one-tenth of that used for SFT. For instance, in OpenRLHF, the Adam learning rate for SFT is 5e-6, while for the actor model it is 5e-7. The Adam learning rate for the critic model is roughly twice that of SFT, e.g., 9e-6 (these settings also appear in the initialization sketch after this list). Code Link: https://github.com/OpenLLMAI/OpenRLHF/blob/9d8b3fdac345f6a18b37d73c53bfb95a652d1db2/examples/scripts/train_ppo_llama.sh#L20.
- Mini-batch Updates: During the learning phase, the PPO implementation shuffles the indices of the training data, which is of size N \times M (where N is the size of the replay buffer and M is the length of each response), and splits them into mini-batches to compute the gradient and update the policy [6] (see the update-loop sketch after this list). Code Link: https://github.com/OpenLLMAI/OpenRLHF/blob/9d8b3fdac345f6a18b37d73c53bfb95a652d1db2/openrlhf/trainer/ppo_trainer.py#L216.
- Value Function Loss Clipping: PPO clips the value function in the same way as its clipped surrogate objective [6, 7]. Given V_{targ} = \text{returns} = \text{advantages} + \text{values}, PPO fits the value network by minimizing the following loss (a sketch follows this list):
L^{V} = \max\left[ \left( V_{\theta}(s_t) - V_{targ} \right)^2, \left( \text{clip}\left( V_{\theta}(s_t), V_{old}(s_t) - \varepsilon, V_{old}(s_t) + \varepsilon \right) - V_{targ} \right)^2 \right]
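The sketch below illustrates both the model initialization and the learning-rate choices from the items above. The checkpoint paths, the minimal Critic wrapper, and the use of AutoModel/AutoModelForCausalLM are illustrative assumptions, not OpenRLHF's actual classes.

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoModelForCausalLM

SFT_CKPT = "path/to/sft-model"      # hypothetical checkpoint path
RM_CKPT = "path/to/reward-model"    # hypothetical checkpoint path

# Actor: initialized from the SFT model.
actor = AutoModelForCausalLM.from_pretrained(SFT_CKPT)

class Critic(nn.Module):
    """Minimal critic: reward-model backbone plus a per-token value head (illustrative)."""
    def __init__(self, backbone_path: str):
        super().__init__()
        self.backbone = AutoModel.from_pretrained(backbone_path)
        self.value_head = nn.Linear(self.backbone.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask=None):
        hidden = self.backbone(input_ids, attention_mask=attention_mask).last_hidden_state
        return self.value_head(hidden).squeeze(-1)  # (batch, seq_len) value estimates

# Critic: initialized from the reward model's weights (the value head is freshly initialized here).
critic = Critic(RM_CKPT)

# Learning rates follow the rule of thumb above: actor ~= SFT LR / 10, critic ~= 2 * SFT LR.
sft_lr = 5e-6
actor_optimizer = torch.optim.Adam(actor.parameters(), lr=sft_lr / 10)  # 5e-7
critic_optimizer = torch.optim.Adam(critic.parameters(), lr=9e-6)       # roughly 2x the SFT LR
```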
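Next, a sketch of the shuffled mini-batch update loop. The replay buffer layout and the policy_update_fn callable are stand-ins assumed for illustration.

```python
import torch

def ppo_update(replay_buffer, policy_update_fn, micro_batch_size: int, epochs: int = 1):
    """Shuffle the flattened (N x M) training samples and update in mini-batches.

    `replay_buffer` is assumed to be a list of per-token training records and
    `policy_update_fn` a callable that runs one gradient step on a mini-batch;
    both are illustrative stand-ins, not the OpenRLHF API.
    """
    num_samples = len(replay_buffer)
    for _ in range(epochs):
        # Shuffle indices over the whole N x M buffer.
        perm = torch.randperm(num_samples)
        for start in range(0, num_samples, micro_batch_size):
            batch_indices = perm[start : start + micro_batch_size]
            mini_batch = [replay_buffer[i] for i in batch_indices]
            policy_update_fn(mini_batch)  # compute gradients and update actor/critic
```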
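Last, a sketch of the clipped value loss written above. The tensor names, the clip range, and the 0.5 scaling factor are illustrative conventions rather than the referenced implementation.

```python
import torch

def clipped_value_loss(
    values: torch.Tensor,      # (batch, seq_len) V_theta(s_t) from the current critic
    old_values: torch.Tensor,  # (batch, seq_len) V_old(s_t) recorded at rollout time
    returns: torch.Tensor,     # (batch, seq_len) V_targ = advantages + rollout-time values
    clip_range: float = 0.2,   # epsilon (illustrative value)
) -> torch.Tensor:
    # Clip the new value prediction so it stays close to the rollout-time prediction.
    values_clipped = old_values + (values - old_values).clamp(-clip_range, clip_range)
    loss_unclipped = (values - returns) ** 2
    loss_clipped = (values_clipped - returns) ** 2
    # Element-wise maximum of the two squared errors, then averaged over tokens.
    return 0.5 * torch.max(loss_unclipped, loss_clipped).mean()
```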