Large model training, Naive Pipeline Parallelism, peft Data Parallelism support and distributed training bug fixes
This release includes a set of features and bug fixes to scale up your RLHF experiments to much larger models, leveraging `peft` and `bitsandbytes`.
Naive Pipeline Parallelism support
- Let's support naive Pipeline Parallelism by @younesbelkada in #210
We introduce a new paradigm in `trl`, termed Naive Pipeline Parallelism, to fit large-scale models on your training setup and apply RLHF to them. This feature uses `peft` to train adapters and `bitsandbytes` to reduce the memory footprint of your active model.
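Concretely, the recipe looks roughly like the sketch below. This is a minimal illustration, not the exact code from #210: the model name and LoRA hyperparameters are placeholders, and the plain `transformers`/`peft` calls are assumptions about the typical setup; in practice you would hand the resulting model to `trl`'s `PPOTrainer`.

```python
# Minimal sketch: load a large model in 8-bit, let the layers be placed across
# the available GPUs (the "naive" pipeline parallelism, via device_map="auto"),
# and train only small LoRA adapters with peft.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_name = "EleutherAI/gpt-neox-20b"  # placeholder: any large causal LM

# bitsandbytes 8-bit weights + automatic layer placement across GPUs
base_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    load_in_8bit=True,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Only the LoRA adapter parameters are trained; the 8-bit base model stays frozen.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()
```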
peft Data Parallelism support
- [`peft`] Fix DP issues by @younesbelkada in #221
- [`core`] fix DP issue by @younesbelkada in #222
There were some bugs with the `peft` integration and DP. This release includes the bug fixes that enable multi-GPU training using `accelerate` + DDP (Distributed Data Parallel).
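With these fixes, the standard `PPOTrainer` setup can run under DDP simply by launching the script with `accelerate`. Below is a minimal sketch assuming the `PPOConfig`/`PPOTrainer` API of this release; the model name, batch size, and script name are placeholders.

```python
# Sketch of a DDP-ready PPO setup. Launch it with accelerate, e.g.:
#   accelerate config          # choose multi-GPU
#   accelerate launch ppo_script.py
from transformers import AutoTokenizer
from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer

config = PPOConfig(model_name="gpt2", batch_size=16)

model = AutoModelForCausalLMWithValueHead.from_pretrained(config.model_name)
ref_model = AutoModelForCausalLMWithValueHead.from_pretrained(config.model_name)
tokenizer = AutoTokenizer.from_pretrained(config.model_name)

# PPOTrainer wraps the model with accelerate, so the same script runs on a
# single GPU or under DDP without code changes.
ppo_trainer = PPOTrainer(config, model, ref_model, tokenizer)
```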
Memory optimization
Your training runs can now be much more memory efficient thanks to a few tricks and bug fixes:
- `PPOConfig` now also supports the flag `optimize_cuda_cache` (set to `False` by default) to help avoid steadily increasing CUDA memory usage (see the sketch after this list)
- Grad accumulation and memory bugfix by @edbeeching in #220
- adds a missing detach to the ratio by @edbeeching in #224
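Enabling the new flag is a one-line change in your config. A minimal sketch, assuming the `PPOConfig` API of this release; the model name and batch size are placeholders:

```python
from trl import PPOConfig

config = PPOConfig(
    model_name="gpt2",
    batch_size=16,
    optimize_cuda_cache=True,  # off by default; helps keep CUDA memory in check
)
```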
PyTorch 2.0 fixes
This release also includes minor fixes related to the PyTorch 2.0 release:
- [`test`] attempt to fix CI test for PT 2.0 by @younesbelkada in #225
What's Changed
- adds sentiment example for a 20b model by @edbeeching in #208
- Update README.md blog post link by @TeamDman in #212
- spell mistakes by @k-for-code in #213
- spell corrections by @k-for-code in #214
- Small changes when integrating into H4 by @natolambert in #216
New Contributors
Full Changelog: v0.4.0...v0.4.1