Highlights
- Optimize BF16 MHA fusion to avoid transpose overhead, boosting BERT-* BF16 performance #992
- Remove the 64-byte alignment constraint for FP32 and BF16 AddLayerNorm fusion #992
- Fix INT8 RetinaNet accuracy issue #1032
- Fix `Cat.out` issue that does not update the `out` tensor (#1053) #1074
Full Changelog: v1.12.100...v1.12.300