What's Changed
- Update version.txt after 0.16.5 release by @loadams in #7180
- Cross layer overlapping for domino by @hwchen2017 in #7178
- Async TP allreduce by @inkcherry in #7115
- Fix issue #5242: grad_norm and loss are NaN by @Glaceon-Hyy in #7171
- Add qwen3 autotp support by @Yejing-Lai in #7187
- Update to new torch grad hook API: BF16Optimizer and Stage2 by @deepcharm in #7189
- Reland perf fix for NaN/Inf check by @nelyahu in #7184
- Update to fix pydantic warning by @loadams in #7193
- Update dependency version info by @inkcherry in #7206
- Fix HPU accelerator memory mapping broken by torch filling uninitialized memory by @oelayan7 in #7209
- Support complicated use cases with TiedLayerSpec by @limjcst in #7208
- Add defense for offload_states and reload_states without optimizer by @HollowMan6 in #7211
- DeepCompile for enhanced compiler integration by @tohtana in #7154
New Contributors
- @Glaceon-Hyy made their first contribution in #7171
- @limjcst made their first contribution in #7208
Full Changelog: v0.16.5...v0.16.6