Large model training, Naive Pipeline Parallelism, peft Data Parallelism support and distributed training bug fixes
This release includes a set of features and bug fixes to scale up your RLHF experiments to much larger models, leveraging `peft` and `bitsandbytes`.
Naive Pipeline Parallelism support
- Let's support naive Pipeline Parallelism by @younesbelkada in #210
We introduce a new paradigm in `trl`, termed Naive Pipeline Parallelism, to fit large-scale models on your training setup and apply RLHF to them. This feature uses `peft` to train adapters and `bitsandbytes` to reduce the memory footprint of your active model.
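Concretely, the recipe looks roughly like the sketch below. This is a minimal illustration, not the exact code from #210: the model name and LoRA hyperparameters are placeholders, and the plain `transformers`/`peft` calls are assumptions about the typical setup; in practice you would hand the resulting model to `trl`'s `PPOTrainer`.

```python
# Minimal sketch: load a large model in 8-bit, let the layers be placed across
# the available GPUs (the "naive" pipeline parallelism, via device_map="auto"),
# and train only small LoRA adapters with peft.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_name = "EleutherAI/gpt-neox-20b"  # placeholder: any large causal LM

# bitsandbytes 8-bit weights + automatic layer placement across GPUs
base_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    load_in_8bit=True,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Only the LoRA adapter parameters are trained; the 8-bit base model stays frozen.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()
```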
peft Data Parallelism support
- [`peft`] Fix DP issues by @younesbelkada in #221
- [`core`] fix DP issue by @younesbelkada in #222
There were some bugs with the `peft` integration and DP. This release includes the bug fixes that enable multi-GPU training using `accelerate` + DDP (Distributed Data Parallel).
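With these fixes, the standard `PPOTrainer` setup can run under DDP simply by launching the script with `accelerate`. Below is a minimal sketch assuming the `PPOConfig`/`PPOTrainer` API of this release; the model name, batch size, and script name are placeholders.

```python
# Sketch of a DDP-ready PPO setup. Launch it with accelerate, e.g.:
#   accelerate config          # choose multi-GPU
#   accelerate launch ppo_script.py
from transformers import AutoTokenizer
from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer

config = PPOConfig(model_name="gpt2", batch_size=16)

model = AutoModelForCausalLMWithValueHead.from_pretrained(config.model_name)
ref_model = AutoModelForCausalLMWithValueHead.from_pretrained(config.model_name)
tokenizer = AutoTokenizer.from_pretrained(config.model_name)

# PPOTrainer wraps the model with accelerate, so the same script runs on a
# single GPU or under DDP without code changes.
ppo_trainer = PPOTrainer(config, model, ref_model, tokenizer)
```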
Memory optimization
Your training runs can now be much more memory efficient thanks to a few tricks and bug fixes:
- `PPOConfig` now also supports the flag `optimize_cuda_cache` (set to `False` by default) to help avoid steadily increasing CUDA memory usage (see the sketch after this list)
- Grad accumulation and memory bugfix by @edbeeching in #220
- adds a missing detach to the ratio by @edbeeching in #224
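Enabling the new flag is a one-line change in your config. A minimal sketch, assuming the `PPOConfig` API of this release; the model name and batch size are placeholders:

```python
from trl import PPOConfig

config = PPOConfig(
    model_name="gpt2",
    batch_size=16,
    optimize_cuda_cache=True,  # off by default; helps keep CUDA memory in check
)
```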
PyTorch 2.0 fixes
This release also includes minor fixes related to the PyTorch 2.0 release:
- [`test`] attempt to fix CI test for PT 2.0 by @younesbelkada in #225
What's Changed
- adds sentiment example for a 20b model by @edbeeching in #208
- Update README.md blog post link by @TeamDman in #212
- spell mistakes by @k-for-code in #213
- spell corrections by @k-for-code in #214
- Small changes when integrating into H4 by @natolambert in #216
New Contributors
Full Changelog: v0.4.0...v0.4.1