v0.9.3 RLOO / PPOv2 Trainer, RM Visualization


We are excited to announce the v0.9.3 release, which brings many new features and algorithms. The highlights are as follows:

  1. RLOO Trainer: RLOO (REINFORCE Leave-One-Out) is a new online RL algorithm for RLHF, proposed by Ahmadian et al. from Cohere. Check out our docs to get started.
  2. PPOv2 Trainer: We are introducing a new experimental PPOv2 trainer that is more closely aligned with OpenAI's PPO implementation, based on https://arxiv.org/abs/2403.17031. Check out our docs to get started.
  3. Reward model visualization: reward model training now includes visualization of predictions on the eval dataset, as shown in the demo below.
(Video: reward model visualization demo, recorded 2024-05-09)
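The leave-one-out baseline at the heart of RLOO can be illustrated in a few lines of plain Python (a sketch of the math, not TRL's actual implementation): for each of the k completions sampled per prompt, the baseline is the mean reward of the other k - 1 samples, so the advantages are variance-reduced without training a value network.

```python
def rloo_advantages(rewards):
    """Leave-one-out advantages for one prompt's k sampled completions.

    Each completion's baseline is the mean reward of the other k - 1
    samples; subtracting it yields a low-variance advantage estimate.
    """
    k = len(rewards)
    total = sum(rewards)
    return [r - (total - r) / (k - 1) for r in rewards]


# With rewards [1.0, 2.0, 3.0] per prompt, the advantages are
# [-1.5, 0.0, 1.5]: each sample is scored against the others' mean.
print(rloo_advantages([1.0, 2.0, 3.0]))
```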
  4. New losses in the DPO Trainer: DPOTrainer now includes support for Self-Play Preference Optimization (SPPO), Robust DPO, TR-DPO, Iterative Reasoning Preference Optimization, and Pairwise Noise Contrastive Alignment.
  5. New loss in the KTO Trainer: KTOTrainer now includes the loss for Binary Classifier Optimization (BCO).
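The new DPO variants are all modifications of the same core objective. As a reminder of what they build on, the standard sigmoid DPO loss can be sketched in plain Python (the variable names here are illustrative, not TRL's internals):

```python
import math


def dpo_loss(chosen_logratio, rejected_logratio, beta=0.1):
    """Standard sigmoid DPO loss: -log sigmoid(beta * (chosen - rejected)).

    Each logratio is log pi_theta(y|x) - log pi_ref(y|x) for the chosen
    or rejected completion; beta controls deviation from the reference.
    """
    margin = beta * (chosen_logratio - rejected_logratio)
    return -math.log(1.0 / (1.0 + math.exp(-margin)))


# When the policy prefers the chosen completion more than the reference
# does, the margin is positive and the loss drops below log(2).
print(dpo_loss(2.0, -1.0))  # margin = 0.3, loss < log(2)
```

The new variants replace or regularize this pairwise term (e.g. adding label smoothing for Robust DPO), selectable in TRL via the trainer's loss configuration.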
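Unlike DPO's pairwise comparison, BCO scores each example independently as a binary classification problem. A minimal sketch of the loss from the BCO paper (assuming a fixed reward shift delta; TRL tracks it as a running mean across the batch):

```python
import math


def bco_loss(logratio, desirable, delta=0.0, beta=0.1):
    """Binary Classifier Optimization loss for a single example.

    logratio is log pi_theta(y|x) - log pi_ref(y|x); the implied reward
    beta * logratio is shifted by delta (a running mean of rewards) and
    pushed up for desirable examples, down for undesirable ones.
    """
    reward = beta * logratio - delta
    if desirable:
        return -math.log(1.0 / (1.0 + math.exp(-reward)))  # push reward up
    return -math.log(1.0 / (1.0 + math.exp(reward)))       # push reward down
```

Because each example needs only a desirable/undesirable label rather than a matched pair, this loss works on the same unpaired data format as KTO.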

What's Changed

New Contributors

Full Changelog: v0.8.6...v0.9.3
