Overview
Here are the main improvements in this release:
- Gemini: a heterogeneous memory space manager (sketched after this list)
- Refactored the pipeline parallelism API
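Gemini manages model data in a heterogeneous CPU+GPU memory space, moving tensors between devices according to runtime memory statistics. Below is a minimal sketch of that placement idea only; the names (`auto_place`, the fixed budget) are hypothetical and not ColossalAI's actual API:

```python
import torch

# Hypothetical budget for model data on the GPU (1 GiB), for illustration only.
GPU_MODEL_DATA_BUDGET = 1 << 30


def auto_place(tensor: torch.Tensor, gpu_model_data_bytes: int) -> torch.Tensor:
    """Keep the tensor on GPU while the budget allows, otherwise spill to CPU."""
    needed = tensor.numel() * tensor.element_size()
    if torch.cuda.is_available() and gpu_model_data_bytes + needed <= GPU_MODEL_DATA_BUDGET:
        return tensor.to('cuda', non_blocking=True)
    return tensor.cpu()
```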
What's Changed
Features
- [zero] initialize a stateful tensor manager by @feifeibear in #614
- [pipeline] refactor pipeline by @YuliangLiu0306 in #679
- [zero] stateful tensor manager by @ver217 in #687
- [zero] adapt zero hooks for unsharded module by @1SAA in #699
- [zero] refactor memstats collector by @ver217 in #706
- [zero] improve adaptability for non-sharded parameters by @1SAA in #708
- [zero] check whether gradients have inf and nan on GPU by @1SAA in #712 (sketched after this list)
- [refactor] refactor the memory utils by @feifeibear in #715
- [util] support detection of number of processes on current node by @FrankLeeeee in #723
- [utils] add synchronized cuda memory monitor by @1SAA in #740
- [zero] refactor ShardedParamV2 by @1SAA in #742
- [zero] add tensor placement policies by @ver217 in #743
- [zero] use factory pattern for tensor_placement_policy by @feifeibear in #752 (sketched after this list)
- [zero] refactor memstats_collector by @1SAA in #746
- [gemini] init gemini individual directory by @feifeibear in #754
- refactor shard and gather operation by @1SAA in #773
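The idea behind #712 is to keep the inf/nan test for gradients on the GPU so the check does not force a device-to-host sync per tensor. A minimal sketch of that technique, assuming a list of gradient tensors (not ColossalAI's actual implementation):

```python
from typing import List

import torch


def grads_have_inf_or_nan(grads: List[torch.Tensor]) -> bool:
    """Check every gradient for inf/nan, syncing to the host only once."""
    found = torch.zeros(1, device=grads[0].device)
    for g in grads:
        # ~isfinite flags both inf and nan entries; the sum stays on device
        found += (~torch.isfinite(g)).sum()
    return bool(found.item())
```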
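PRs #743 and #752 add tensor placement policies and construct them through a factory. A minimal sketch of the pattern, with hypothetical class and function names rather than ColossalAI's real API:

```python
from abc import ABC, abstractmethod

import torch


class TensorPlacementPolicy(ABC):
    """Decides on which device managed tensors should live."""

    @abstractmethod
    def get_device(self) -> torch.device:
        ...


class CPUPlacementPolicy(TensorPlacementPolicy):
    def get_device(self) -> torch.device:
        return torch.device('cpu')


class CUDAPlacementPolicy(TensorPlacementPolicy):
    def get_device(self) -> torch.device:
        return torch.device('cuda')


def tensor_placement_policy_factory(name: str) -> TensorPlacementPolicy:
    """Look a policy up by name so callers never instantiate classes directly."""
    policies = {'cpu': CPUPlacementPolicy, 'cuda': CUDAPlacementPolicy}
    try:
        return policies[name]()
    except KeyError as e:
        raise ValueError(f'unknown placement policy: {name}') from e


# e.g. policy = tensor_placement_policy_factory('cpu'); device = policy.get_device()
```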
Bug Fixes
- [zero] fix init bugs in zero context by @1SAA in #686
- [hotfix] update requirements-test by @ver217 in #701
- [hotfix] fix a bug in 3d vocab parallel embedding by @kurisusnowdeng in #707
- [compatibility] fixed tensor parallel compatibility with torch 1.9 by @FrankLeeeee in #700
- [hotfix] fixed bugs of assigning grad states to non-leaf nodes by @Gy-Lu in #711
- [hotfix] fix stateful tensor manager's cuda model data size by @ver217 in #710
- [bug] fixed broken test_found_inf by @FrankLeeeee in #725
- [util] fixed activation checkpointing on torch 1.9 by @FrankLeeeee in #719
- [util] fixed communication API with PyTorch 1.9 by @FrankLeeeee in #721
- [bug] removed zero installation requirements by @FrankLeeeee in #731
- [hotfix] remove duplicated param register to stateful tensor manager by @feifeibear in #728
- [utils] correct CPU memory usage and capacity in a multi-process context by @feifeibear in #726
- [bug] fixed grad scaler compatibility with torch 1.8 by @FrankLeeeee in #735
- [bug] fixed DDP compatibility with torch 1.8 by @FrankLeeeee in #739
- [hotfix] fix memory leak in backward of sharded model by @ver217 in #741
- [hotfix] fix initialize about zero by @ver217 in #748
- [hotfix] fix prepare grads in sharded optim by @ver217 in #749
- [hotfix] layernorm by @kurisusnowdeng in #750
- [hotfix] fix auto tensor placement policy by @ver217 in #753
- [hotfix] fix reuse_fp16_shard of sharded model by @ver217 in #756
- [hotfix] fix test_stateful_tensor_mgr by @ver217 in #762
- [compatibility] used backward-compatible API for global process group by @FrankLeeeee in #758
- [hotfix] fix the ckpt hook bugs when using DDP by @Gy-Lu in #769
- [hotfix] polish sharded optim docstr and warning by @ver217 in #770
Unit Testing
- [ci] replace the ngc docker image with self-built pytorch image by @FrankLeeeee in #672
- [ci] fixed compatibility workflow by @FrankLeeeee in #678
- [ci] update workflow trigger condition and support options by @FrankLeeeee in #691
- [ci] added missing field in workflow by @FrankLeeeee in #692
- [ci] remove ipc config for rootless docker by @FrankLeeeee in #694
- [test] added missing decorators to model checkpointing tests by @FrankLeeeee in #727
- [unittest] add checkpoint for MoE zero test by @1SAA in #729
- [test] added a decorator for the "address already in use" error with backward compatibility by @FrankLeeeee in #760
- [test] refactored with the new rerun decorator by @FrankLeeeee in #763
Documentation
- add PaLM link by @binmakeswell in #704
- [doc] removed outdated installation command by @FrankLeeeee in #730
- add video by @binmakeswell in #732
- [readme] polish readme by @feifeibear in #764
- [readme] sync CN readme by @binmakeswell in #766
Miscellaneous
- [Bot] Synchronize Submodule References by @github-actions in #556
- [Bot] Synchronize Submodule References by @github-actions in #695
- [refactor] zero directory by @feifeibear in #724
- [Bot] Synchronize Submodule References by @github-actions in #751
Full Changelog: v0.1.2...v0.1.3