Feb 23, 2026
- Add token distillation training support to distillation task wrappers
- Remove some torch.jit usage in prep for official deprecation
- Cautious optimizer option added to AdamP
- Call reset_parameters() even with meta-device init so that buffers get initialized when using hacks like init_empty_weights
- Tweak Muon optimizer to work with DTensor/FSDP2 (clamp_ instead of clamp_min_, alternate NS branch for DTensor)
- Release 1.0.25
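A minimal sketch of the DTensor/FSDP2-friendly clamp pattern referenced above (this is an illustrative stand-in, not the actual Muon update code; the helper name and the RMS-normalization context are assumptions). `Tensor.clamp_min_` doesn't work on DTensor shards, while the equivalent `Tensor.clamp_(min=...)` does:

```python
import torch


def rms_normalized(t: torch.Tensor, eps: float = 1e-7) -> torch.Tensor:
    # Hypothetical helper: normalize a tensor by its norm, flooring the
    # denominator in place. Using clamp_(min=eps) rather than clamp_min_(eps)
    # keeps the op compatible with DTensor / FSDP2 sharded tensors.
    denom = t.norm().clamp_(min=eps)  # was: t.norm().clamp_min_(eps)
    return t / denom
```

On plain tensors the two spellings are numerically identical; only the DTensor dispatch path differs.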
Jan 21, 2026
- Compat Break: Fix oversight w/ QKV vs MLP bias in ParallelScalingBlock (& DiffParallelScalingBlock). Does not impact any trained timm models but could impact downstream use.
What's Changed
- Token distill task & distill task refactoring by @rwightman in #2647
- Fix distilled head dropout using wrong token in PiT forward_head by @hassonofer in #2649
- Fix #2653, no models with weights impacted so just a clean fix by @rwightman in #2654
- Add the cautious optimizer to AdamP. by @Yuan-Jinghui in #2657
- Enhance the numerical stability of the Cautious Optimizer by @Yuan-Jinghui in #2658
- Some misc fixes for torch.jit deprecation and meta device init by @rwightman in #2664
- fix(optim): replace bare except with Exception in Lion optimizer by @llukito in #2666
- Change clamp_min_ to clamp_(min=) as the former doesn't work with DTensor / FSDP2 by @rwightman in #2668
- Add DTensor compatible NS impl for Muon by @rwightman in #2669
New Contributors
- @Yuan-Jinghui made their first contribution in #2657
- @llukito made their first contribution in #2666
Full Changelog: v1.0.24...v1.0.25