v0.9.3 RLOO / PPOv2 Trainer, RM Visualization


We are excited to announce the v0.9.3 release, which brings many new features and algorithms. The highlights are as follows:

  1. RLOO Trainer: RLOO (REINFORCE Leave-One-Out) is a new online RL algorithm for RLHF, proposed by Ahmadian et al. from Cohere. Check out our docs to get started.
  2. PPOv2 Trainer: We are introducing a new experimental PPOv2 trainer that is more closely aligned with OpenAI's PPO implementation, based on https://arxiv.org/abs/2403.17031. Check out our docs to get started.
  3. Reward model visualization: reward model training now includes visualization of predictions on the eval dataset, as shown in the demo below.
(Video: reward model visualization demo, recorded 2024-05-09)
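The leave-one-out baseline at the heart of RLOO can be illustrated in a few lines of plain Python (a sketch of the math, not TRL's actual implementation): for each of the k completions sampled per prompt, the baseline is the mean reward of the other k - 1 samples, so the advantages are variance-reduced without training a value network.

```python
def rloo_advantages(rewards):
    """Leave-one-out advantages for one prompt's k sampled completions.

    Each completion's baseline is the mean reward of the other k - 1
    samples; subtracting it yields a low-variance advantage estimate.
    """
    k = len(rewards)
    total = sum(rewards)
    return [r - (total - r) / (k - 1) for r in rewards]


# With rewards [1.0, 2.0, 3.0] per prompt, the advantages are
# [-1.5, 0.0, 1.5]: each sample is scored against the others' mean.
print(rloo_advantages([1.0, 2.0, 3.0]))
```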
  4. New losses in the DPO Trainer: DPOTrainer now includes support for Self-Play Preference Optimization (SPPO), Robust DPO, TR-DPO, Iterative Reasoning Preference Optimization, and Pairwise Noise Contrastive Alignment.
  5. New loss in the KTO Trainer: KTOTrainer now includes the loss for Binary Classifier Optimization (BCO).
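The new DPO variants are all modifications of the same core objective. As a reminder of what they build on, the standard sigmoid DPO loss can be sketched in plain Python (the variable names here are illustrative, not TRL's internals):

```python
import math


def dpo_loss(chosen_logratio, rejected_logratio, beta=0.1):
    """Standard sigmoid DPO loss: -log sigmoid(beta * (chosen - rejected)).

    Each logratio is log pi_theta(y|x) - log pi_ref(y|x) for the chosen
    or rejected completion; beta controls deviation from the reference.
    """
    margin = beta * (chosen_logratio - rejected_logratio)
    return -math.log(1.0 / (1.0 + math.exp(-margin)))


# When the policy prefers the chosen completion more than the reference
# does, the margin is positive and the loss drops below log(2).
print(dpo_loss(2.0, -1.0))  # margin = 0.3, loss < log(2)
```

The new variants replace or regularize this pairwise term (e.g. adding label smoothing for Robust DPO), selectable in TRL via the trainer's loss configuration.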
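Unlike DPO's pairwise comparison, BCO scores each example independently as a binary classification problem. A minimal sketch of the loss from the BCO paper (assuming a fixed reward shift delta; TRL tracks it as a running mean across the batch):

```python
import math


def bco_loss(logratio, desirable, delta=0.0, beta=0.1):
    """Binary Classifier Optimization loss for a single example.

    logratio is log pi_theta(y|x) - log pi_ref(y|x); the implied reward
    beta * logratio is shifted by delta (a running mean of rewards) and
    pushed up for desirable examples, down for undesirable ones.
    """
    reward = beta * logratio - delta
    if desirable:
        return -math.log(1.0 / (1.0 + math.exp(-reward)))  # push reward up
    return -math.log(1.0 / (1.0 + math.exp(reward)))       # push reward down
```

Because each example needs only a desirable/undesirable label rather than a matched pair, this loss works on the same unpaired data format as KTO.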

What's Changed

New Contributors

Full Changelog: v0.8.6...v0.9.3
