FBGEMM_GPU v1.5.0 Release Notes (pytorch/FBGEMM)

Highlights

CUDA 13 and Blackwell Support

  • Enabled CUDA 13 builds in OSS with full preparation for next-generation GPU architectures (#5143, #5100, #5301)
  • Added lazy TMEM allocation to the Blackwell decode kernel for improved memory efficiency (#5262)
  • Added support for Blackwell CUTLASS attention kernels in torch.compile (#5136)
  • Added Paged Attention support to FMHA CUTLASS Blackwell Forward kernel for both fixed and variable length sequences (#4999, #5033)
  • Upgraded CUTLASS dependency to 4.3 with SM100 convolution fixes (#5127, #5047)

Table Batched Embedding (TBE) Improvements

  • Added hash_zch_identities and hash_zch_runtime_meta streaming logic for improved ZCH (Zero Collision Hashing) support (#5144, #5194)
  • Introduced KVZCHEvictionTBEConfig for flexible KVZCH eviction configuration (#5058)
  • Added sync trigger eviction support with a Python API and all2all synchronization (#4984, #5062)
  • Added feature score eviction policy with no-eviction mode support (#5059)
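
For readers less familiar with TBE, the sketch below shows a minimal Table Batched Embedding lookup with SplitTableBatchedEmbeddingBagsCodegen, as context for the items above. It does not exercise the new ZCH/KVZCH eviction features, whose configuration fields are not documented in these notes, and the exact import paths can vary slightly between fbgemm_gpu versions.

```python
import torch

# Import paths reflect recent fbgemm_gpu releases and may differ across versions.
from fbgemm_gpu.split_embedding_configs import EmbOptimType as OptimType
from fbgemm_gpu.split_table_batched_embeddings_ops_common import EmbeddingLocation
from fbgemm_gpu.split_table_batched_embeddings_ops_training import (
    ComputeDevice,
    SplitTableBatchedEmbeddingBagsCodegen,
)

# Two embedding tables (10,000 x 128 and 5,000 x 64), both held on the GPU,
# trained with exact row-wise Adagrad.
emb = SplitTableBatchedEmbeddingBagsCodegen(
    embedding_specs=[
        (10_000, 128, EmbeddingLocation.DEVICE, ComputeDevice.CUDA),
        (5_000, 64, EmbeddingLocation.DEVICE, ComputeDevice.CUDA),
    ],
    optimizer=OptimType.EXACT_ROWWISE_ADAGRAD,
)

# Batch of 2 samples. Bags are laid out table-major:
#   table 0: sample 0 -> rows [1, 3], sample 1 -> row [0]
#   table 1: sample 0 -> row [2],     sample 1 -> rows [4, 5]
indices = torch.tensor([1, 3, 0, 2, 4, 5], dtype=torch.int64, device="cuda")
offsets = torch.tensor([0, 2, 3, 4, 6], dtype=torch.int64, device="cuda")

pooled = emb(indices, offsets)  # shape (2, 128 + 64), sum-pooled per table
```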

GenAI and GEMM Performance

  • Added split-K support and heuristics to the decode attention kernel, improving inference performance (#5213, #5225); a reference sketch of the split-K reduction follows this list
  • Added sliding window attention support to the split-K generation kernel (#5231)
  • Added FP16 support for CUTLASS grouped GEMM operations (#5111)
  • Improved KleidiAI matmul register usage and matrix partitioning for better performance (#5165, #5155)
  • Optimized FmhaKernelBwdConvert block size and grid shape (#5229)
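
To make the split-K idea concrete, here is a small pure-PyTorch reference of how attention for a decode step (a single query token) can be split along the key/value sequence and the partial results merged in a numerically stable way. This is only an illustration of the reduction, not the FBGEMM kernel; the function name is invented for this sketch.

```python
import math
import torch

def splitk_decode_attention(q, k, v, num_splits=4):
    # Illustrative reference, not the FBGEMM kernel.
    # q: (d,) single decode query; k, v: (S, d) cached keys/values.
    # Each KV chunk produces a local max, an unnormalized softmax sum, and a
    # partial output; chunks are then merged with a log-sum-exp rescale.
    scale = 1.0 / math.sqrt(q.shape[-1])
    maxes, sums, outs = [], [], []
    for k_c, v_c in zip(k.chunk(num_splits), v.chunk(num_splits)):
        s = (k_c @ q) * scale        # (S_c,) chunk-local scores
        m = s.max()
        p = torch.exp(s - m)         # unnormalized probabilities
        maxes.append(m)
        sums.append(p.sum())
        outs.append(p @ v_c)         # (d,) chunk-local numerator
    m_all = torch.stack(maxes)
    w = torch.exp(m_all - m_all.max())                  # per-chunk rescale factors
    num = (w[:, None] * torch.stack(outs)).sum(dim=0)   # merged numerator
    den = (w * torch.stack(sums)).sum()                  # merged denominator
    return num / den

# Agrees with a direct softmax(k @ q / sqrt(d)) @ v computation.
q, k, v = torch.randn(64), torch.randn(1024, 64), torch.randn(1024, 64)
ref = torch.softmax((k @ q) / math.sqrt(64), dim=0) @ v
assert torch.allclose(splitk_decode_attention(q, k, v), ref, atol=1e-5)
```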

Quantization Improvements

  • Enabled direct MX4→BF16 dequantization to reduce memory footprint (#5206); an illustrative sketch follows this list
  • Added MXFP8 grouped GEMM improvements with better heuristics and assertions (#5190, #5203)
  • Enabled specifying output dtype for FP8 quantized communication (#5154)
  • Added an FP8 convolution kernel with improved heuristics (#4994, #5118)
  • Added NVFP4 grouped GEMM tuning and aligned numerics with eager PyTorch (#5012, #5156)
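
As a rough illustration of what MX4→BF16 dequantization means, and why skipping an intermediate FP32 buffer saves memory, the sketch below decodes unpacked 4-bit e2m1 codes with a shared power-of-two scale per 32-element group directly into bfloat16. This is not FBGEMM's packed layout or kernel: the function and argument names are invented for this example, and a real MX4 scale is an 8-bit E8M0 exponent rather than the small signed integers used here.

```python
import torch

# The 8 non-negative e2m1 (FP4) magnitudes used by the MX4 element format.
FP4_E2M1_VALUES = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def dequantize_mx4_to_bf16(codes, group_exponents, group_size=32):
    # Illustrative only -- not the FBGEMM op or its packed storage format.
    # codes: (N,) uint8, one 4-bit code per element (bit 3 = sign,
    #        bits 0-2 = index into FP4_E2M1_VALUES).
    # group_exponents: (N // group_size,) integer power-of-two exponents,
    #        one shared scale per group of group_size elements.
    # Producing bfloat16 directly avoids materializing an FP32 intermediate
    # that would be twice the size of the final tensor.
    sign = 1.0 - 2.0 * ((codes >> 3) & 1).to(torch.float32)
    mag = FP4_E2M1_VALUES[(codes & 0x7).long()]
    scale = torch.pow(2.0, group_exponents.to(torch.float32))
    scale = scale.repeat_interleave(group_size)
    return (sign * mag * scale).to(torch.bfloat16)

codes = torch.randint(0, 16, (64,), dtype=torch.uint8)
exps = torch.tensor([0, -1], dtype=torch.int8)    # one exponent per 32 values
x_bf16 = dequantize_mx4_to_bf16(codes, exps)      # shape (64,), dtype bfloat16
```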

ARM / AArch64 Platform Support

  • Added multiple NEON-optimized quantization implementations for ARM64 (#5089, #5115, #5199)
  • Vectorized requantize_ for ARM64 with NEON intrinsics (#5130)
  • Improved the KleidiAI matmul for ARM architectures (#5155, #5165)

ROCm / AMD Platform Support

  • Added MI350 performance optimizations for embedding forward and backward passes (#5064, #5177)
  • Updated the OSS build script to support AMD and CPU variants (#5257)
  • Updated the default target ROCm architectures in the OSS build (#5219)

Better Engineering

  • Upgraded GitHub Actions to the latest versions for improved CI reliability (#5223)
  • Upgraded CUTLASS dependency to version 4.3 (#5127)
  • Improved sparse ops with Kineto tracing support for better profiling (#5060, #5061); a profiling sketch follows this list
  • Added comprehensive FMHA tests and improved test organization (#5108, #5237)
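
The Kineto tracing improvements are about making FBGEMM sparse ops visible in profiler traces. A minimal way to capture such a trace from Python is shown below; asynchronous_complete_cumsum is just a representative, long-standing FBGEMM sparse op, and which ops gained new annotations is determined by #5060/#5061, not by this sketch.

```python
import torch
from torch.profiler import ProfilerActivity, profile

import fbgemm_gpu  # noqa: F401  -- registers the torch.ops.fbgemm operators

# Per-sample lengths of a sparse feature; the op returns the complete cumsum
# (with a leading zero), i.e. the offsets tensor used by sparse/jagged ops.
lengths = torch.randint(0, 10, (1024,), dtype=torch.int64, device="cuda")

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    offsets = torch.ops.fbgemm.asynchronous_complete_cumsum(lengths)

# Kineto-backed Chrome trace; open it in chrome://tracing or Perfetto.
prof.export_chrome_trace("fbgemm_sparse_ops.json")
```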
