v0.5.0 DPOTrainer and multiple bug fixes on PPOTrainer and SFTTrainer
This release includes multiple important bug fixes (SFTTrainer, PPOTrainer) and extends the current DataCollatorForCompletionOnlyLM to support chat-like training.
DPO Trainer
The DPO algorithm (Direct Preference Optimization) was introduced by Rafailov et al. in this paper and provides a way of performing RL training without having to rely on a reward model. The DPOTrainer is now part of the TRL library for anyone who wants to use it, thanks to the amazing contributors! A minimal usage sketch is shown after the list of PRs below.
- DPO Trainer by @kashif in #416
- [DPO] make sure all the concated batches are on same device by @kashif in #528
- [DPO] remove response/pairs from the DPO side by @kashif in #540
- [DPO] remove unnecessary batch size arg to Collator by @kashif in #554
- [DPO] Resolve logging for DPOTrainer by @tomaarsen in #570
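As a quick illustration, here is a minimal sketch of a DPOTrainer run. The model name, the tiny inline preference dataset, and the hyperparameter values are placeholders rather than recommendations, and keyword names may shift slightly across TRL versions:

```python
# Minimal DPO sketch: the dataset provides "prompt", "chosen" and "rejected" text columns.
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import DPOTrainer

model_name = "gpt2"  # placeholder checkpoint, swap in your own model
model = AutoModelForCausalLM.from_pretrained(model_name)
ref_model = AutoModelForCausalLM.from_pretrained(model_name)  # frozen reference policy
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

# Toy preference data: each row pairs a prompt with a preferred and a rejected completion.
train_dataset = Dataset.from_dict(
    {
        "prompt": ["Hello, how are you?"],
        "chosen": [" I'm doing great, thanks for asking!"],
        "rejected": [" Leave me alone."],
    }
)

training_args = TrainingArguments(
    output_dir="dpo_output",
    per_device_train_batch_size=1,
    max_steps=10,
    remove_unused_columns=False,  # keep the raw text columns for the DPO data collator
)

trainer = DPOTrainer(
    model,
    ref_model,
    beta=0.1,  # controls how far the policy may drift from the reference model
    args=training_args,
    train_dataset=train_dataset,
    tokenizer=tokenizer,
)
trainer.train()
```

Because DPO optimizes the policy directly on preference pairs, no separate reward model training or PPO rollout loop is required.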
What's Changed
- Reward trainer multi-gpu eval bug by @rlindskog in #513
- Use local process index for `_get_current_device()` by @lewtun in #515
Extending the DataCollatorForCompletionOnlyLM
You can now mask out the user prompts in the DataCollatorForCompletionOnlyLM data collator and train only on the chat completions. Check out the PR below or the appropriate section of the documentation to learn more about it! A minimal sketch follows the PR link.
- Introducing DataCollatorForChatCompletionOnlyLM by @gaetanlop in #456
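For illustration, here is a minimal sketch of masking user turns so the loss is only computed on assistant completions. The `### Human:` / `### Assistant:` markers, the toy dataset, and the base model are placeholders; use the templates that actually appear in your data, and note that this collator is meant to be used without packing:

```python
# Sketch: train only on assistant completions by masking everything else out of the labels.
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DataCollatorForCompletionOnlyLM, SFTTrainer

model_name = "gpt2"  # placeholder checkpoint
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

# Toy chat-formatted dataset; real data would contain many multi-turn conversations.
train_dataset = Dataset.from_dict(
    {
        "text": [
            "### Human: What is the capital of France?\n### Assistant: The capital of France is Paris."
        ]
    }
)

collator = DataCollatorForCompletionOnlyLM(
    instruction_template="### Human:",   # tokens in the user turns are ignored by the loss
    response_template="### Assistant:",  # loss is computed only on what follows this marker
    tokenizer=tokenizer,
)

trainer = SFTTrainer(
    model,
    train_dataset=train_dataset,
    dataset_text_field="text",
    data_collator=collator,
    tokenizer=tokenizer,
    packing=False,  # the completion-only collator expects unpacked examples
)
trainer.train()
```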
Important bug fixes
Multiple bugs in the supported trainers were reported by the community and fixed in the PRs below:
- [core] Fix offline case by @younesbelkada in #538
- Relax reward trainer constraint by @younesbelkada in #539
- ADD: num_proc to SFTTrainer by @BramVanroy in #547
- [SFTTrainer] Add warning for wrong padding_side by @younesbelkada in #550
- Minor typo and whitespace fixes by @tmm1 in #559
- [SFTTrainer] Add epochs and num steps on CLI by @younesbelkada in #562
- Add `DataCollatorForCompletionOnlyLM` in the docs by @younesbelkada in #565
- Add comment to explain how the sentiment pipeline is used to run the … by @jvhoffbauer in #555
- Fix model output dim in reward trainer example by @liutianlin0121 in #566
- Computes the KL penalty using the entire distribution by @edbeeching in #541
- Add missing max_seq_length arg to example sft_trainer.py by @SharkWipf in #585
- [PPO] fix corner cases with PPO batch size and forward_batch_size by @younesbelkada in #563
- Update the example sft_trainer.py by @ZeusFSX in #587
- docs: Replace SFTTrainer with RewardTrainer in comment by @tomaarsen in #589
- Fix comparison in DataCollatorForCompletionOnlyLM (#588) by @RyujiTamaki in #594
- refactor grad accum by @vwxyzjn in #546
Big refactor of examples and documentation
The examples and documentation have been refactored; check the PRs below for more details.
- [examples] Big refactor of examples and documentation by @younesbelkada in #509
- [examples] Fix sentiment nit by @younesbelkada in #517
- [examples] make the sft script more modulable by @younesbelkada in #543
- Add `use_auth_token` arg to sft_trainer example by @corey-lambda in #544
New Contributors
- @rlindskog made their first contribution in #513
- @corey-lambda made their first contribution in #544
- @tmm1 made their first contribution in #559
- @jvhoffbauer made their first contribution in #555
- @liutianlin0121 made their first contribution in #566
- @SharkWipf made their first contribution in #585
- @ZeusFSX made their first contribution in #587
- @gaetanlop made their first contribution in #456
- @RyujiTamaki made their first contribution in #594
Full Changelog: v0.4.7...v0.5.0