This patch release contains several fixes: mostly Gemma2 context-length and generation issues, plus small nits here and there.
- is_torchdynamo_compiling -- cast a wide exception net (#32476) by @gante
- Revert "fixes to properly shard FSDP across cpu and meta for cpu_effcient_loading for prequantized 4bit (#32276)" (#32477) by @gante and @matthewdouglas
- Gemma2: fix FA2 generation (#32553) by @zucchini-nlp
- Fix: FA2 with packed training (#32487) by @zucchini-nlp
- Fix sliding window attention used in Gemma2FlashAttention2 (#32522) by @brcps12
- Automatically add transformers tag to the modelcard (#32623) by @LysandreJik
- add back the position ids (#32554) by @ArthurZucker
- Use head_dim if in config for RoPE (#32495) by @suiyoubi and @ArthurZucker
- Revert PR 32299, flag users when Zero-3 was missed (#32851) by @muellerzr
- fix multi-gpu with static cache (#32543) by @SunMarc
- Reduce the error log when using core models that need their weights r… (#32656) by @muellerzr
- Fix VLM generation issues (#32836) by @zucchini-nlp
- Fix generate with inputs_embeds as input (#32493) (this PR includes some cherry-picked commits)
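Several of the fixes above concern Gemma2's sliding window attention. As background, a minimal, dependency-free sketch of what a sliding-window causal mask looks like (illustrative only; this is not the transformers implementation, and the function name is made up for this example):

```python
def sliding_window_causal_mask(seq_len: int, window: int) -> list[list[bool]]:
    """Build a boolean attention mask where True means "query i may attend key j".

    Each query position i attends only to itself and the (window - 1)
    positions immediately before it, never to future positions.
    """
    return [
        [0 <= i - j < window for j in range(seq_len)]
        for i in range(seq_len)
    ]

# With seq_len=5 and window=3, position 4 attends positions 2, 3, 4 only.
mask = sliding_window_causal_mask(5, 3)
```

In Gemma2 this windowed mask is applied on alternating layers; the fixes above ensure the window is respected by the FlashAttention2 code path as well.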
Full Changelog: v4.44.0...v4.44.1