What's Changed
- Update version.txt after 0.16.5 release by @loadams in #7180
- Cross layer overlapping for domino by @hwchen2017 in #7178
- Async TP allreduce by @inkcherry in #7115
- Fix issue #5242: grad_norm and loss are NaN by @Glaceon-Hyy in #7171
- Add qwen3 autotp support by @Yejing-Lai in #7187
- Update to new torch grad hook API: BF16Optimizer and Stage2 by @deepcharm in #7189
- Reland perf fix for NaN/Inf check by @nelyahu in #7184
- Update to fix pydantic warning by @loadams in #7193
- Update dependency version info by @inkcherry in #7206
- Fix HPU accelerator memory mapping broken by torch filling uninitialized memory by @oelayan7 in #7209
- Support complicated use cases with TiedLayerSpec by @limjcst in #7208
- Add defense for offload_states and reload_states without optimizer by @HollowMan6 in #7211
- DeepCompile for enhanced compiler integration by @tohtana in #7154
New Contributors
- @Glaceon-Hyy made their first contribution in #7171
- @limjcst made their first contribution in #7208
Full Changelog: v0.16.5...v0.16.6