huggingface/trl v0.5.0


v0.5.0 DPOTrainer and multiple bug fixes on PPOTrainer and SFTTrainer

This release introduces the DPOTrainer, includes multiple important bug fixes for SFTTrainer and PPOTrainer, and extends DataCollatorForCompletionOnlyLM to support chat-like training.

DPO Trainer

The DPO algorithm (Direct Preference Optimization) was introduced by Rafailov et al. in this paper. It optimizes a language model directly on preference data, without having to train a separate reward model. The DPOTrainer is now part of the TRL library for anyone who wants to use it, thanks to the amazing contributors! A minimal usage sketch follows the PR list below.

  • DPO Trainer by @kashif in #416
  • [DPO] make sure all the concated batches are on same device by @kashif in #528
  • [DPO] remove response/pairs from the DPO side by @kashif in #540
  • [DPO] remove unnecessary batch size arg to Collator by @kashif in #554
  • [DPO] Resolve logging for DPOTrainer by @tomaarsen in #570
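
As a quick orientation, here is a minimal sketch of wiring up the new DPOTrainer. It assumes a pairwise preference dataset with `prompt`, `chosen`, and `rejected` text columns and a small causal LM checkpoint (`gpt2` here is just a placeholder); exact argument names and defaults may differ slightly between TRL versions, so treat it as illustrative rather than canonical.

```python
# Illustrative sketch only: argument names and defaults may vary across TRL versions.
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import DPOTrainer

model_name = "gpt2"  # placeholder checkpoint; use your own causal LM
model = AutoModelForCausalLM.from_pretrained(model_name)      # policy being optimized
ref_model = AutoModelForCausalLM.from_pretrained(model_name)  # frozen reference policy
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

# Pairwise preference data: for each prompt, a preferred and a rejected completion.
train_dataset = Dataset.from_dict({
    "prompt": ["What is the capital of France?"],
    "chosen": [" Paris."],
    "rejected": [" Berlin."],
})

training_args = TrainingArguments(
    output_dir="dpo-out",
    per_device_train_batch_size=1,
    remove_unused_columns=False,  # keep the raw text columns for the DPO collator
)

trainer = DPOTrainer(
    model,
    ref_model,
    beta=0.1,  # strength of the implicit KL penalty toward the reference model
    args=training_args,
    train_dataset=train_dataset,
    tokenizer=tokenizer,
)
trainer.train()
```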

What's Changed

  • Reward trainer multi-gpu eval bug by @rlindskog in #513
  • Use local process index for _get_current_device() by @lewtun in #515

Extending the DataCollatorForCompletionOnlyLM

You can now mask out the user prompts in the DataCollatorForCompletionOnlyLM data collator and train only on the chat completions. Check out the PR below or the corresponding section in the documentation to learn more; a minimal sketch follows the PR link.

  • Introducing DataCollatorForChatCompletionOnlyLM by @gaetanlop in #456
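
Below is a rough sketch of the chat-style masking, assuming it is exposed through DataCollatorForCompletionOnlyLM's `instruction_template` / `response_template` arguments and that the data uses "### Human:" / "### Assistant:" turn markers. Both the templates and the example text are placeholders for your own chat format.

```python
# Illustrative sketch: the turn markers below are placeholders for your own chat format.
from transformers import AutoTokenizer
from trl import DataCollatorForCompletionOnlyLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

# With both templates set, tokens belonging to user turns are masked out
# (labels set to -100) and the loss is computed only on assistant completions.
collator = DataCollatorForCompletionOnlyLM(
    instruction_template="### Human:",
    response_template="### Assistant:",
    tokenizer=tokenizer,
    mlm=False,
)

example = (
    "### Human: What is 2 + 2?\n"
    "### Assistant: 4\n"
    "### Human: And 3 + 3?\n"
    "### Assistant: 6\n"
)
batch = collator([tokenizer(example)])
print(batch["labels"])  # human turns appear as -100, assistant turns keep their token ids
```

The same collator can then be passed to SFTTrainer via its `data_collator` argument so that multi-turn conversations contribute loss only on the assistant's replies.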

Important bug fixes

Multiple bugs in the supported trainers were reported by the community and have been fixed in the PRs below.

Big refactor of examples and documentation

The examples and documentation have been refactored; check the PRs below for more details.

New Contributors

Full Changelog: v0.4.7...v0.5.0
