trl 0.11.0 on Python PyPI

We are excited to introduce the new v0.11.0 release, with many new features and post-training algorithms. The highlights are as follows:

New post-training methods

Generalized Knowledge Distillation

Generalized Knowledge Distillation (GKD) is a post-training method from Google DeepMind that extends standard knowledge distillation by allowing the student to generate outputs during training and receive online feedback from the teacher. It consistently outperforms SFT and in some cases enables the student model to match the performance of the teacher, but with far fewer parameters.

To train models with this method, check out the GKDTrainer.

Exploratory Preference Optimization

Exploratory Preference Optimization is an online post-training method from researchers at Microsoft, MIT, and Wisconsin that extends DPO to incorporate online feedback from reward models or LLM judges. It is similar to online DPO, but has a slightly different theoretical basis concerning sample efficiency.

To train models with this method, check out the XPOTrainer.

Nash Learning with Human Feedback

Nash Learning with Human Feedback is a novel post-training method from Google DeepMind that uses pairwise preference models which are conditioned on two inputs, instead of the single one used in reward models. These preference models are then used to train a policy that consistently produces responses that are preferred over those from competing policies, thus approximating a Nash equilibrium (i.e. a two player game where actions are responses and payoffs are given by the preference model).

To train models with this method, check out the NashMDTrainer.

New trainer features

Online DPO now supports training LoRA adapters with PEFT, which means you can dramatically reduce the amount of VRAM needed to train models with this method. By @qgallouedec in #2041
The OrpoTrainer has better integration with PyTorchXLA for faster step time on TPUs ⚡ . By @wenxindongwork in #2001

Deprecations 🚨

The PPOTrainer is marked for deprecated in favour of PPOv2Trainer to provide a consistent API across TRL's trainers. It will be removed in v0.12.0. By @qgallouedec in #2016
The RichProgressCallback has been removed from the example scripts as it caused a variety of problems with logging in distributed environments. You can still use it by adding it manually to the trainer callbacks. By @lewtun in #2053

Bugfixes and improvements

Adds experimental Liger support to SFT script by @edbeeching in #1992
move slow-tests CI to new cluster by @glegendre01 in #1996
[Online-DPO] fixes to the training scripts and setup.py by @kashif in #1997
[pre-commit] update pre-commit yaml by @kashif in #2002
[Docs] Add Liger-Kernel usage to SFTTrainer page by @ryankert01 in #2007
[ci] pin numpy to < 2 on windows by @kashif in #2009
Remove prompts arg from WinrateCallback by @qgallouedec in #2010
Allow WinRateCallback to be used without reference model by @qgallouedec in #2013
Feat: Add support for APO-zero in KTOTrainer by @KarelDO in #1952
Clean configs documentation by @qgallouedec in #1944
Refactor reward modelling script to work with chat models by @lewtun in #2026
correct formatting of star sign in kto_trainer.mdx by @mattany in #2031
Remove unused functions in core.py by @northern-64bit in #2017
Improves formatting of docstring + newlines by @northern-64bit in #2006
Fix packing doc in SFTConfig and fix error when neither dataset_text_field nor formatting_func is provided. by @qgallouedec in #2035
fix: unpackaging error in Custom Mixture of Experts model when aux_loss_enabled is set to True. by @Jonathanjordan21 in #2039
Drop canonical namespaces by @qgallouedec in #2048
Change non_eos_penalty to be consistent across OnPolicy trainers by @RylanSchaeffer in #2033
Temporary pin the transformers hash in the CI by @qgallouedec in #2049
[XPO] xpo trainer by @kashif in #1943
Fix logits compuation in KTO trainer prediction step by @issamemari in #2050
[Draft, don't merge] Fix failing windows by @LysandreJik in #2051
Clean up DPO example by @lewtun in #2043
Remove debug and sanity_check args by @qgallouedec in #2055
Gkd trainer by @kashif in #1814
Documentation dataset format by @qgallouedec in #2020
Add missing autodocs by @qgallouedec in #2056
Mask loss in gkd when generating from the student by @gaetanlop in #2058
Support for SFTTrainer.evaluate() and SFTTrainer.predict() with null train_dataset by @Sohaib9920 in #2004
make cuda-only tests device-agnostic by @faaany in #2044
Make ConstantLengthDataset (or packing=True) shuffle examples before they are packed by @muupan in #2037
Standardise API for WinRateCallback and LogCompletionsCallback by @lewtun in #2061
Fix dataset in GKD script by @lewtun in #2067
[online models] remove min_new_tokens=args.max_new_tokens by @kashif in #2069
Standardising datasets for testing by @qgallouedec in #2065
[KTO] learning rate recomentations for kto by @kashif in #2070
Nash md by @kashif in #1853
Use transformers utilities when possible by @qgallouedec in #2064
Minor doc fixes and comments by @qgallouedec in #2073
Added error check to RLOO, PPOv2, OnlineDPO that ref_policy and policy have different identities by @RylanSchaeffer in #2057
processor(prompt, images=image) to processor(images=image, text=prompt) by @qgallouedec in #2076
Use wrapped model for reference completions in WinRateCallback and set default freq to eval_steps in LogCompletionsCallback` by @lewtun in #2074
Conversational dataset support for Online DPO by @qgallouedec in #2075
[WIP] Fix logits/chosen and logits/rejected metrics in kto_trainer. by @PhilipMay in #2077
Standardize dataset naming by @qgallouedec in #2081
Fix deepspeed for PPOv2Trainer by @qgallouedec in #2080

New Contributors

@AdnaneKhan made their first contribution in #1822
@mkopecki made their first contribution in #1825
@DZ9 made their first contribution in #1836
@MAOJIASONG made their first contribution in #1840
@davanstrien made their first contribution in #1845
@eliebak made their first contribution in #1863
@Rishav-hub made their first contribution in #1862
@cemiu made their first contribution in #1738
@SunMarc made their first contribution in #1919
@karel-contextual made their first contribution in #1928
@RylanSchaeffer made their first contribution in #1932
@mina-parham made their first contribution in #1961
@RhuiDih made their first contribution in #1887
@SeungyounShin made their first contribution in #1969
@kit1980 made their first contribution in #1933
@akakakakakaa made their first contribution in #1987
@hvaara made their first contribution in #1990
@glegendre01 made their first contribution in #1996
@ryankert01 made their first contribution in #2007
@KarelDO made their first contribution in #1952
@mattany made their first contribution in #2031
@northern-64bit made their first contribution in #2017
@Jonathanjordan21 made their first contribution in #2039
@issamemari made their first contribution in #2050
@wenxindongwork made their first contribution in #2001
@Sohaib9920 made their first contribution in #2004
@faaany made their first contribution in #2044
@muupan made their first contribution in #2037
@PhilipMay made their first contribution in #2077

Full Changelog: v0.9.6...v0.11.0

trl 0.11.0 v0.11.0 on Python PyPI