v0.5.0 DPOTrainer and multiple bug fixes on PPOTrainer and SFTTrainer

This release includes multiple important bugfixes (SFTTrainer, PPOTrainer), the release also extends the current DataCollatorForCompletionOnlyLM to support chat-like training.

DPO Trainer

The DPO algorithm (Direct Policy Optimization) has been introduced by Rafailov et al. in this paper and introduces a way of performing RL training without having to rely on a reward model. The DPOTrainer is now part of TRL library for anyone that wants to use it thanks to the amazing contributors!

DPO Trainer by @kashif in #416
[DPO] make sure all the concated batches are on same device by @kashif in #528
[DPO] remove response/pairs from the DPO side by @kashif in #540
[DPO] remove unnecessary batch size arg to Collator by @kashif in #554
[DPO] Resolve logging for DPOTrainer by @tomaarsen in #570

What's Changed

Reward trainer multi-gpu eval bug by @rlindskog in #513
Use local process index for _get_current_device() by @lewtun in #515

Extending the `DataCollatorForCompletionOnlyLM`

You can now mask out the users prompts in the DataCollatorForCompletionOnlyLM data collator and train only on chat completions. Check out the PR below or the appropriate section on the documentation to learn more about it!

Introducing DataCollatorForChatCompletionOnlyLM by @gaetanlop in #456

Important bug fixes

Multiple bugs on the supported trainers have been raised by the community and fixed in the below PRs

[core] Fix offline case by @younesbelkada in #538
Relax reward trainer constraint by @younesbelkada in #539
ADD: num_proc to SFTTrainer by @BramVanroy in #547
[SFTTrainer] Add warning for wrong padding_side by @younesbelkada in #550
Minor typo and whitespace fixes by @tmm1 in #559
[SFTTrainer] Add epochs and num steps on CLI by @younesbelkada in #562
Add DataCollatorForCompletionOnlyLM in the docs by @younesbelkada in #565
Add comment to explain how the sentiment pipeline is used to run the … by @jvhoffbauer in #555
Fix model output dim in reward trainer example by @liutianlin0121 in #566
Computes the KL penalty using the entire distribution by @edbeeching in #541
Add missing max_seq_length arg to example sft_trainer.py by @SharkWipf in #585
[PPO] fix corner cases with PPO batch size and forward_batch_size by @younesbelkada in #563
Update the example sft_trainer.py by @ZeusFSX in #587
docs: Replace SFTTrainer with RewardTrainer in comment by @tomaarsen in #589
Fix comparison in DataCollatorForCompletionOnlyLM (#588) by @RyujiTamaki in #594
refactor grad accum by @vwxyzjn in #546

Big refactor of examples and documentation

The examples and documentation has been refactored, check the PRs below for more details

[examples] Big refactor of examples and documentation by @younesbelkada in #509
[examples] Fix sentiment nit by @younesbelkada in #517
[examples] make the sft script more modulable by @younesbelkada in #543
Add use_auth_token arg to sft_trainer example by @corey-lambda in #544

New Contributors

@rlindskog made their first contribution in #513
@corey-lambda made their first contribution in #544
@tmm1 made their first contribution in #559
@jvhoffbauer made their first contribution in #555
@liutianlin0121 made their first contribution in #566
@SharkWipf made their first contribution in #585
@ZeusFSX made their first contribution in #587
@gaetanlop made their first contribution in #456
@RyujiTamaki made their first contribution in #594

Full Changelog: v0.4.7...v0.5.0

trl 0.5.0 v0.5.0 on Python PyPI