Overview
Here are the main improvements in this release:
- Gemini: a heterogeneous memory space manager (sketched after this list)
- Refactored the pipeline parallelism API
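Gemini manages model data in a heterogeneous CPU+GPU memory space, moving tensors between devices according to runtime memory statistics. Below is a minimal sketch of that placement idea only; the names (`auto_place`, the fixed budget) are hypothetical and not ColossalAI's actual API:

```python
import torch

# Hypothetical budget for model data on the GPU (1 GiB), for illustration only.
GPU_MODEL_DATA_BUDGET = 1 << 30


def auto_place(tensor: torch.Tensor, gpu_model_data_bytes: int) -> torch.Tensor:
    """Keep the tensor on GPU while the budget allows, otherwise spill to CPU."""
    needed = tensor.numel() * tensor.element_size()
    if torch.cuda.is_available() and gpu_model_data_bytes + needed <= GPU_MODEL_DATA_BUDGET:
        return tensor.to('cuda', non_blocking=True)
    return tensor.cpu()
```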
What's Changed
Features
- [zero] initialize a stateful tensor manager by @feifeibear in #614
- [pipeline] refactor pipeline by @YuliangLiu0306 in #679
- [zero] stateful tensor manager by @ver217 in #687
- [zero] adapt zero hooks for unsharded module by @1SAA in #699
- [zero] refactor memstats collector by @ver217 in #706
- [zero] improve adaptability for non-sharded parameters by @1SAA in #708
- [zero] check whether gradients have inf and nan on GPU by @1SAA in #712 (sketched after this list)
- [refactor] refactor the memory utils by @feifeibear in #715
- [util] support detection of number of processes on current node by @FrankLeeeee in #723
- [utils] add synchronized cuda memory monitor by @1SAA in #740
- [zero] refactor ShardedParamV2 by @1SAA in #742
- [zero] add tensor placement policies by @ver217 in #743
- [zero] use factory pattern for tensor_placement_policy by @feifeibear in #752 (sketched after this list)
- [zero] refactor memstats_collector by @1SAA in #746
- [gemini] init gemini individual directory by @feifeibear in #754
- refactor shard and gather operation by @1SAA in #773
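The idea behind #712 is to keep the inf/nan test for gradients on the GPU so the check does not force a device-to-host sync per tensor. A minimal sketch of that technique, assuming a list of gradient tensors (not ColossalAI's actual implementation):

```python
from typing import List

import torch


def grads_have_inf_or_nan(grads: List[torch.Tensor]) -> bool:
    """Check every gradient for inf/nan, syncing to the host only once."""
    found = torch.zeros(1, device=grads[0].device)
    for g in grads:
        # ~isfinite flags both inf and nan entries; the sum stays on device
        found += (~torch.isfinite(g)).sum()
    return bool(found.item())
```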
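PRs #743 and #752 add tensor placement policies and construct them through a factory. A minimal sketch of the pattern, with hypothetical class and function names rather than ColossalAI's real API:

```python
from abc import ABC, abstractmethod

import torch


class TensorPlacementPolicy(ABC):
    """Decides on which device managed tensors should live."""

    @abstractmethod
    def get_device(self) -> torch.device:
        ...


class CPUPlacementPolicy(TensorPlacementPolicy):
    def get_device(self) -> torch.device:
        return torch.device('cpu')


class CUDAPlacementPolicy(TensorPlacementPolicy):
    def get_device(self) -> torch.device:
        return torch.device('cuda')


def tensor_placement_policy_factory(name: str) -> TensorPlacementPolicy:
    """Look a policy up by name so callers never instantiate classes directly."""
    policies = {'cpu': CPUPlacementPolicy, 'cuda': CUDAPlacementPolicy}
    try:
        return policies[name]()
    except KeyError as e:
        raise ValueError(f'unknown placement policy: {name}') from e


# e.g. policy = tensor_placement_policy_factory('cpu'); device = policy.get_device()
```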
Bug Fixes
- [zero] fix init bugs in zero context by @1SAA in #686
- [hotfix] update requirements-test by @ver217 in #701
- [hotfix] fix a bug in 3d vocab parallel embedding by @kurisusnowdeng in #707
- [compatibility] fixed tensor parallel compatibility with torch 1.9 by @FrankLeeeee in #700
- [hotfix] fixed bugs of assigning grad states to non-leaf nodes by @Gy-Lu in #711
- [hotfix] fix stateful tensor manager's cuda model data size by @ver217 in #710
- [bug] fixed broken test_found_inf by @FrankLeeeee in #725
- [util] fixed activation checkpointing on torch 1.9 by @FrankLeeeee in #719
- [util] fixed communication API with PyTorch 1.9 by @FrankLeeeee in #721
- [bug] removed zero installation requirements by @FrankLeeeee in #731
- [hotfix] remove duplicated param register to stateful tensor manager by @feifeibear in #728
- [utils] correct CPU memory usage and capacity in a multi-process context by @feifeibear in #726
- [bug] fixed grad scaler compatibility with torch 1.8 by @FrankLeeeee in #735
- [bug] fixed DDP compatibility with torch 1.8 by @FrankLeeeee in #739
- [hotfix] fix memory leak in backward of sharded model by @ver217 in #741
- [hotfix] fix initialize about zero by @ver217 in #748
- [hotfix] fix prepare grads in sharded optim by @ver217 in #749
- [hotfix] layernorm by @kurisusnowdeng in #750
- [hotfix] fix auto tensor placement policy by @ver217 in #753
- [hotfix] fix reuse_fp16_shard of sharded model by @ver217 in #756
- [hotfix] fix test_stateful_tensor_mgr by @ver217 in #762
- [compatibility] used backward-compatible API for global process group by @FrankLeeeee in #758
- [hotfix] fix the ckpt hook bugs when using DDP by @Gy-Lu in #769
- [hotfix] polish sharded optim docstr and warning by @ver217 in #770
Unit Testing
- [ci] replace the ngc docker image with self-built pytorch image by @FrankLeeeee in #672
- [ci] fixed compatibility workflow by @FrankLeeeee in #678
- [ci] update workflow trigger condition and support options by @FrankLeeeee in #691
- [ci] added missing field in workflow by @FrankLeeeee in #692
- [ci] remove ipc config for rootless docker by @FrankLeeeee in #694
- [test] added missing decorators to model checkpointing tests by @FrankLeeeee in #727
- [unittest] add checkpoint for MoE zero test by @1SAA in #729
- [test] added a decorator for the "address already in use" error with backward compatibility by @FrankLeeeee in #760
- [test] refactored with the new rerun decorator by @FrankLeeeee in #763
Documentation
- add PaLM link by @binmakeswell in #704
- [doc] removed outdated installation command by @FrankLeeeee in #730
- add video by @binmakeswell in #732
- [readme] polish readme by @feifeibear in #764
- [readme] sync CN readme by @binmakeswell in #766
Miscellaneous
- [Bot] Synchronize Submodule References by @github-actions in #556
- [Bot] Synchronize Submodule References by @github-actions in #695
- [refactor] zero directory by @feifeibear in #724
- [Bot] Synchronize Submodule References by @github-actions in #751
Full Changelog: v0.1.2...v0.1.3