Rethinking the Role of PPO in RLHF

### Introduction
Large Language Models (LLMs) like GPT-4, Claude-2, Bard, and Bing Chat have revolutionized virtual assistants. These systems can handle complex queries, generate code, and write poetry. The technology underpinning these abilities is Reinforcement Learning from Human Feedback (RLHF). RLHF aims to align models with human values and to curb unintended behaviors, which often originate from pretraining on large volumes of low-quality data. Central to this process is Proximal Policy Optimization (PPO), a widely used RL optimizer. However, PPO is known to be unstable and hard to implement, and these problems are magnified by an inconsistency within the RLHF pipeline itself. This blog explores how Pairwise Proximal Policy Optimization (P3O) could address these issues by learning from comparisons between responses.

### Background
Traditionally, reward functions in RL are either defined by hand, as in Atari games, or derived from another well-defined source. In RLHF, the reward model is instead trained on human comparisons between responses, with the aim of nudging models towards helpful, harmless outputs. The RLHF pipeline comprises three stages:

1. **Supervised Fine-Tuning Stage:** The pre-trained model learns to imitate high-quality responses using a maximum likelihood loss.
2. **Reward Modeling Stage:** The model generates pairs of responses for each prompt, and human labelers indicate which response they prefer. These preferences are used to train a comparative reward model.
3. **RL Fine-Tuning Stage:** This stage initializes the policy from the model produced by the previous stages and applies an RL algorithm to maximize the learned reward while keeping the policy close to its initialization.

A significant inconsistency arises because the reward model is trained on comparisons between responses, whereas the RL stage optimizes the reward of each response individually. This mismatch can make existing problems worse, particularly in language generation tasks.
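To make the comparative side of this mismatch concrete, here is a minimal sketch of a pairwise reward-model loss. It assumes the Bradley-Terry form, which is a common choice but is not stated in this post; the function name `pairwise_reward_loss` and the toy reward values are hypothetical. The point it illustrates is that such a loss sees only reward differences, so adding a constant to every reward leaves it unchanged.

```python
import numpy as np

def pairwise_reward_loss(r_chosen, r_rejected):
    """Bradley-Terry negative log-likelihood: -log sigmoid(r_chosen - r_rejected).

    Assumption: the comparative reward model is trained with this loss.
    """
    diff = np.asarray(r_chosen, dtype=float) - np.asarray(r_rejected, dtype=float)
    # -log sigmoid(d) == log(1 + exp(-d)), computed in a numerically stable way
    return float(np.mean(np.logaddexp(0.0, -diff)))

# Toy reward-model scores for three (preferred, rejected) response pairs
r_chosen = np.array([1.2, 0.3, 2.0])
r_rejected = np.array([0.4, 0.9, -1.0])

loss = pairwise_reward_loss(r_chosen, r_rejected)
# Shifting every reward by the same constant does not change the loss
shifted_loss = pairwise_reward_loss(r_chosen + 100.0, r_rejected + 100.0)
print(loss, shifted_loss)
```

Under this assumption the reward model is only determined up to an additive constant, while an RL objective that consumes absolute reward values can be sensitive to that constant.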

### Derivation of P3O
Pairwise Proximal Policy Optimization (P3O) is derived from the vanilla policy gradient (VPG) method. Unlike VPG, whose update depends on the absolute magnitude of the reward, P3O uses only the difference in reward between two responses, so adding a constant to every reward (a reward translation) does not change the update. P3O adds two further ingredients:

- **Importance Sampling:** This incorporates past responses to compute weighted gradients, updating policies from batches of stored responses.
- **Clipping:** This controls the gradient updates and the importance sampling ratio, balancing KL divergence with reward improvement.
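The reward-difference idea that P3O starts from can be checked numerically. The sketch below is an illustration under assumptions, not the authors' implementation: it uses a toy softmax policy over four candidate responses, and it writes the pairwise estimate as half the reward difference times the difference of the two score functions. All names (`vpg_sample_grad`, `pairwise_sample_grad`) and values are hypothetical, and importance sampling and clipping are left out.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()
    p = np.exp(z)
    return p / p.sum()

def grad_log_prob(logits, y):
    """Gradient of log softmax(logits)[y] with respect to the logits."""
    g = -softmax(logits)
    g[y] += 1.0
    return g

def vpg_sample_grad(logits, y, rewards):
    """Single-sample vanilla policy gradient estimate: r(y) * grad log pi(y)."""
    return rewards[y] * grad_log_prob(logits, y)

def pairwise_sample_grad(logits, y1, y2, rewards):
    """Pairwise estimate: (r(y1) - r(y2)) * (grad log pi(y1) - grad log pi(y2)) / 2."""
    score_diff = grad_log_prob(logits, y1) - grad_log_prob(logits, y2)
    return 0.5 * (rewards[y1] - rewards[y2]) * score_diff

logits = np.array([0.5, -0.2, 1.0, 0.0])   # toy policy over 4 candidate responses
rewards = np.array([1.0, 0.2, -0.5, 0.7])  # toy reward for each response
shift = 10.0                               # constant added to every reward

# How much does each single-sample estimate move when rewards are translated?
vpg_change = np.abs(vpg_sample_grad(logits, 0, rewards + shift)
                    - vpg_sample_grad(logits, 0, rewards)).max()
pairwise_change = np.abs(pairwise_sample_grad(logits, 0, 2, rewards + shift)
                         - pairwise_sample_grad(logits, 0, 2, rewards)).max()

# Exact expectations over the toy policy: both estimators target the same gradient
probs = softmax(logits)
n = len(logits)
expected_vpg = sum(probs[y] * vpg_sample_grad(logits, y, rewards) for y in range(n))
expected_pairwise = sum(probs[a] * probs[b] * pairwise_sample_grad(logits, a, b, rewards)
                        for a in range(n) for b in range(n))
print(vpg_change, pairwise_change)
```

In this toy setting the pairwise estimate is untouched by the shift while the single-sample VPG estimate moves, even though both have the same expected value.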

P3O is implemented in two variants, distinguished by separate or joint clipping methods. Evaluation suggests that P3O can harmonize the learning stages in RLHF, facilitating more stable and effective policy optimization.
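For readers unfamiliar with clipping, the following sketch shows the standard PPO clipped surrogate for a single response. It is shown only as background: how P3O applies clipping separately or jointly across the two responses of a pair is not specified in this post and is not reproduced here. The function name `clipped_surrogate` and the default `eps=0.2` are assumptions chosen for illustration.

```python
import numpy as np

def clipped_surrogate(logp_new, logp_old, advantage, eps=0.2):
    """Standard PPO clipped objective (to be maximized), per sample.

    ratio = pi_new(y|x) / pi_old(y|x) is the importance sampling ratio.
    """
    ratio = np.exp(np.asarray(logp_new, dtype=float) - np.asarray(logp_old, dtype=float))
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    # Taking the minimum removes the incentive to push the ratio outside [1 - eps, 1 + eps]
    return np.minimum(unclipped, clipped)

# Ratio 1.5 with a positive advantage: the gain is capped at (1 + eps) * advantage
capped_gain = float(clipped_surrogate(np.log(1.5), 0.0, 1.0))
# Ratio 0.5 with a negative advantage: the objective is capped at (1 - eps) * advantage
capped_loss = float(clipped_surrogate(np.log(0.5), 0.0, -1.0))
print(capped_gain, capped_loss)
```

The clip range bounds how far a single batch of stored responses can move the policy, which is the mechanism behind the KL versus reward balance mentioned above.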

### Evaluation
P3O was compared against PPO and the more recent Direct Preference Optimization (DPO) on summarization and question-answering tasks. The evaluation tracked two quantities: the reward achieved and the KL divergence from the initial policy. The reported results indicate that P3O offers a better KL-Reward trade-off, meaning higher reward at a comparable KL divergence. Moreover, head-to-head comparisons, including evaluations by GPT-4, suggest that P3O's outputs align more closely with human preferences than those of PPO and DPO.
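As a rough illustration of the KL side of this trade-off, the sketch below estimates the KL divergence between a fine-tuned policy and its reference from sampled responses. It assumes per-token log-probabilities of each sampled response are available under both models; the post does not say how KL was measured, and the function name `sequence_kl_estimate` and the numbers are hypothetical.

```python
import numpy as np

def sequence_kl_estimate(logp_policy, logp_ref):
    """Monte Carlo estimate of KL(policy || reference).

    Both inputs have shape (num_responses, num_tokens) and hold the log-probability
    of each token of responses sampled from the policy. Token log-probs are summed
    into sequence log-probs, and the log-ratio is averaged over responses.
    """
    logp_policy = np.asarray(logp_policy, dtype=float)
    logp_ref = np.asarray(logp_ref, dtype=float)
    per_sequence = (logp_policy - logp_ref).sum(axis=1)
    return float(per_sequence.mean())

# Two toy responses of two tokens each
logp_policy = [[-1.0, -2.0], [-0.5, -0.5]]
logp_ref = [[-1.5, -2.5], [-0.5, -1.0]]

kl_estimate = sequence_kl_estimate(logp_policy, logp_ref)
print(kl_estimate)
```

Plotting an estimate like this against the average reward for several checkpoints gives a KL-Reward curve of the kind the comparison refers to.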

### Conclusion
This blog post has examined how a comparative learning approach, P3O, can address challenges in RLHF. By training on comparisons in both the reward modeling and RL fine-tuning stages, P3O brings the two stages into agreement. In the evaluation described above, it achieved a better KL-Reward trade-off and closer alignment with human preferences than PPO and DPO.
