github pytorch/FBGEMM v1.6.0
FBGEMM_GPU v1.6.0 Release Notes

Highlights

TBE Performance & Capabilities

  • Added LRU eviction support for improved cache management (#5439)
  • Introduced per-table EEG estimation for better resource allocation (#5374)
  • Fixed AMD FP16 UVM performance regression with vectorized stores (#5469)
  • Moved prefetched info to preallocated buffers for faster CPU TBE operations (#5450)

Quantization & Grouped GEMM

  • Added FP16 support for grouped GEMM wgrad and dgrad kernels (#5313)
  • Enabled direct MX4→BF16 dequantization to reduce memory usage (#5250)
  • Improved EmbeddingSpMDMNBitRowWiseSparse with autovectorized variant (#5244)

Sparse Operations & Optimizations

  • Added reserved slots to support always-on tables (#5377)
  • Optimized sparse_permute_2d kernel for better performance (#5370)
  • Improved group_index_select_or_add_2d_kernel on ROCm for small embedding dimensions (#5233)
  • Added export trace profiling support to sparse ops benchmarks (#5311)

Platform Support

  • Added Python 3.14 support across FBGEMM and TorchRec (#5300, #5310, #5322)
  • Re-enabled CUDA 13 in Nova builds (#5301)
  • Migrated to RocksDB 11.0 for SSD TBE (#5438)
  • Upgraded to MI300 runners for ROCm CI testing (#5414)

Software Requirements

FBGEMM_GPU v1.6.0 has been tested and is known to work on the following setups:

  • PyTorch: v2.11
  • CUDA: v12.6, 12.8, 13.0
  • Python: v3.9, 3.10, 3.11, 3.12, 3.13, 3.14
  • ROCm: 7.0, 7.1
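
As a quick sanity check, the tested-version matrix above can be encoded and checked against a running environment. This is an illustrative sketch only; `python_supported` and `toolkit_supported` are not part of FBGEMM, and the version sets are copied from the list above.

```python
import sys

# Tested version matrix for FBGEMM_GPU v1.6.0, copied from the list above.
SUPPORTED_PYTHON = {(3, minor) for minor in range(9, 15)}   # 3.9 .. 3.14
SUPPORTED_CUDA = {"12.6", "12.8", "13.0"}
SUPPORTED_ROCM = {"7.0", "7.1"}

def python_supported(version_info=None):
    """Return True if the interpreter version is in the tested set."""
    vi = sys.version_info if version_info is None else version_info
    return (vi[0], vi[1]) in SUPPORTED_PYTHON

def toolkit_supported(cuda=None, rocm=None):
    """Check a CUDA or ROCm version string against the tested sets."""
    if cuda is not None:
        return cuda in SUPPORTED_CUDA
    if rocm is not None:
        return rocm in SUPPORTED_ROCM
    return False
```

Versions outside these sets may still work, but have not been validated for this release.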

It is recommended to install and run FBGEMM_GPU in an isolated environment, such as a Conda environment or a Docker container.

Availability

FBGEMM_GPU can be fetched directly from PyPI:

# FBGEMM_GPU CUDA variant
pip install fbgemm-gpu==1.6.0

# FBGEMM_GPU CPU variant
pip install fbgemm-gpu-cpu==1.6.0

Alternatively, it can be fetched from the PyTorch package index:

# FBGEMM_GPU CUDA variant
pip install fbgemm-gpu==1.6.0 --index-url https://download.pytorch.org/whl/cu126/
pip install fbgemm-gpu==1.6.0 --index-url https://download.pytorch.org/whl/cu128/
pip install fbgemm-gpu==1.6.0 --index-url https://download.pytorch.org/whl/cu130/

# FBGEMM_GPU CPU variant
pip install fbgemm-gpu==1.6.0 --index-url https://download.pytorch.org/whl/cpu
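
The version-to-index mapping used by these commands can be sketched as a small helper. This is illustrative only; `pytorch_index_url` is not part of FBGEMM or pip, and assumes the `cuXYZ` naming convention of the PyTorch wheel indexes shown above.

```python
from typing import Optional

def pytorch_index_url(cuda_version: Optional[str] = None) -> str:
    """Build the PyTorch wheel index URL matching the pip commands above.

    cuda_version is a string like "12.6"; None selects the CPU-only index.
    """
    base = "https://download.pytorch.org/whl/"
    if cuda_version is None:
        return base + "cpu"
    # e.g. "12.6" -> "cu126", "13.0" -> "cu130"
    major, minor = cuda_version.split(".")
    return f"{base}cu{major}{minor}"
```

For example, `pytorch_index_url("12.6")` yields the URL to pass to pip's `--index-url` flag for a CUDA 12.6 build.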

Changes

Table Batched Embedding (TBE) Operators

For GPU

  • [Improvement] Modifying clear_all_staged_data to accommodate KV Tensor Deletion (#5202)
  • [Improvement] Replace data_ptr with {const,mutable}_data_ptr (#5282) (#5291)
  • [Fix] include prepare_inputs in total_prefetch_duration timer (#5308)
  • [Improvement] Replace data_ptr with {const,mutable}_data_ptr (#5304) (#5312)
  • [Improvement] Improve TBEDataConfig and TBEParamsReporter (#5334)
  • [Improvement] Benchmark merged VBE output (#5338)
  • [Fix] bug: fix split cache_weights from cache_aux in memory reporting (#5343)
  • [Improvement] Apply performance fixes (#5346) (#5349)
  • [Improvement] More performance fixes (#5352) (#5355)
  • [Fix] Passthrough optimized hip kernel when weights are not on device (#5357)
  • [Fix] Fix resource leakage and other modernization (#5362) (#5366)
  • [Improvement] Use std::ranges (#5369) (#5372)
  • [New] add support for per-table eeg estimation (#5374)
  • [Fix] Fix ROCm optimized kernel D320 case, opt out mixed precision (#5376)
  • [Improvement] Use C++17 and 20 features to simplify code (#5375) (#5378)
  • [Improvement] Refactor and improve backward_adagrad unit test (#5380)
  • [New] Support EC in writeback hook, but with only one feature (#2365) (#5388)
  • [Improvement] More modernization fixes (#5391) (#5395)
  • [Fix] Fix backward_adagrad unit test (#5397)
  • [Fix] Back out "More modernization fixes" (#5398)
  • [Improvement] Change to CUDA_KERNEL_ASSERT (#5407)
  • [Fix] Fix undefined symbol error in OSS (#5417)
  • [Fix] Skip test_backward_adagrad_large_dims (#5424)
  • [Improvement] Skip tests if GPU memory is insufficient (#5425)
  • [Fix] Fix precision loss in TBE test offset construction (#5430)
  • [Improvement] Thread raw embedding streamer to dram_kv_embedding_cache (#5432)
  • [Improvement] add variable Ls support in triton bench (#5434)
  • [New] LRU + opt in support (#5439)
  • [New] Thread raw embedding streamer to dram_kv_embedding_cache (#5460)
  • [Fix] Fix AMD FP16 UVM performance regression by using vectorized stores (#5469)
  • [Fix] Fix OOM in weight comparison, variable E request gen (#5472)

For CPU

  • [Fix] Update embedding_forward_quantized_cpu_template.cpp to use initialized output memory instead of uninitialized (#5054)
  • [Improvement] Move the prefetched info to preallocated buffers (#5251)
  • [Fix] achieve gpu-cpu parity for rowwise_adagrad_with_counter (#5405)
  • [Improvement] Replace std::atomic_ref with folly::atomic_ref (#5419)
  • [Fix] Back out "Move the prefetched info to preallocated buffers" (#5423)
  • [New] Add benchmark with input files and refactor (#5448)
  • [Improvement] Move the prefetched info to preallocated buffers (#5450)
  • [Improvement] Replace C-style assert() with TORCH_CHECK/CUDA_KERNEL_ASSERT (#5452)

SSD Table Batched Embedding (TBE) Operators

  • [Improvement] torchrec related changes for APF Integration (#5286)
  • [Fix] KVZCH inference test fix minor bug (#5358)
  • [Fix] Fix feature score eviction bucket edge case (#5399)
  • [Improvement] Migrate to unique_ptr DB::Open for RocksDB 11.0 (#5438)
  • [New] creating delete_rocksdb_checkpoint_dir function under KV Tensor (#5201) (#5457)

GenAI Support and Operators

Triton GEMM Support

  • [Improvement] Enable direct MX4→BF16 dequantization to reduce memory (python side) (2/2) (#5250)
  • [Improvement] Update triton and python (#5305)

Quantization Operators

  • [New] add dynamic quantize gemm benchmark [step 1: minmax qparams compute] (#2297)
  • [Improvement] Add EmbeddingSpMDMNBitRowWiseSparse autovectorized variant (#5244)
  • [Improvement] Specialize more cases to improve EmbeddingSpMDMNBitBenchmark (#5245)
  • [Fix] Work around MX4 correctness on ROCm issue for now (#5302)
  • [Fix] Fix BF16 Grouped GEMM wgrad allocation (#5309)
  • [New] Add FP16 support for grouped gemm wgrad and dgrad kernels. (#5313)
  • [Improvement] use codesign ck for fbgemm ck moe (#5314)
  • [Fix] Fix unused exception parameter warning (#5329)
  • [Improvement] More performance fixes (#5359) (#5360)
  • [Fix] Fix CUTLASS grouped GEMM wgrad NaN for zero-token experts (#5418)
  • [Improvement] Replace C-style assert() with TORCH_CHECK/CUDA_KERNEL_ASSERT (genai) (#5455)
  • [Improvement] Remove AVX file from aarch64 compilation (#5458)

Sparse Operators

  • [Improvement] Add warp parallelism to populate_bucketized_permute (#5189)
  • [Improvement] Optimize group_index_select_or_add_2d_kernel on ROCm by adding a separate codepath for small embedding dimensions (#5233)
  • [Improvement] Choose _autovec version of GenerateEmbeddingSpMDMRowWiseSparse on AArch64 (#5247)
  • [New] Add repeat_arange cuda kernel (#5278)
  • [Fix] Synchronize before and after smem writes (#5288)
  • [Fix] Fix infer warning (#5289)
  • [Fix] Fix implicit type conversions (#5293) (#5297)
  • [New] Add remap_indices_update_utils on CPU (#5307)
  • [New] Add export trace profiling support to sparse ops benchmarks (#5311)
  • [Improvement] Enable Half support for permute_2D_indices_weights_kernel_3 (#5333)
  • [Improvement] Improve register allocation on asm transpose routine (#5344)
  • [Improvement] Optimizations for index_select_scalar_cumsum_kernel on ROCm (#5263) (#5353)
  • [Improvement] optimize sparse_permute_2d kernel (#5370)
  • [New] use reserved slots to support always-on table (#5377)
  • [Improvement] Implement cached member_id upper bound search (#5365) (#5406)
  • [Fix] Do not use constexpr with __builtin_constant_p (#5410)
  • [Improvement] Improve block_bucketize_sparse_features error messages (#5446)
  • [Fix] Add dtype validation for group_index_select_dim0 inputs (#5454)
  • [Fix] Fix to block bucketize sparse features benchmark for int dtype case (#5470)

Build / CI Improvements and Better Engineering

  • [Improvement] [fbgemm_gpu] Increase timeout for ARM nova jobs (#3690)
  • [Improvement] Remove unused AVX{2,512}_FLAGS (#5198)
  • [Improvement] Cleanup branches for CUDA 9 (#5269)
  • [Improvement] Replace .data_ptr with .mutable_data_ptr or .const_data_ptr (#5267) (#5276)
  • [Improvement] Add tidy fixes (#5268) (#5284)
  • [Fix] Remove the use of tests_to_skip.txt from the torchrec tests workflow (#5290)
  • [Improvement] Add deprecation message for FBGEMM GenAI (#5292)
  • [Improvement] Lint fixes (#5298)
  • [Improvement] Remove unused code (#5294) (#5299)
  • [Improvement] Add Python 3.14 support for FBGEMM (#5300)
  • [Improvement] Re-enable CUDA 13 in Nova builds (#5301)
  • [Improvement] Port fbgemm CPU warnings to GPU targets and fix warnings (#5303)
  • [Improvement] Upgrade torchrec CI to python 3.14 (#5310)
  • [Improvement] Add CPU support for cumem_utils to remove GPU dep on MTIA (#5315)
  • [Improvement] Concurrent inference test (#5316)
  • [Fix] Back out "Add CPU support for cumem_utils to remove GPU dep on MTIA" (#5317)
  • [Improvement] Lint fixes (#5318)
  • [Improvement] Only bother with -fno-trapping-math and -ftree-vectorize when targeting GCC (#5319)
  • [Improvement] Add CPU support for cumem_utils to remove GPU dep on MTIA (#5315) (#5320)
  • [Improvement] Upgrade torchrec CI to python 3.14, pt2 (#5322)
  • [Improvement] Replace mutable_data_ptr with const_data_ptr for readonly tensors (#5323) (#5328)
  • [Improvement] Upgrade setuptools_git_versioning (#5331)
  • [Improvement] Fix lint errors (#5332)
  • [Fix] Fix fp32 and fp16 tests on Apple Silicon (#5340)
  • [Fix] Fix release version extraction (#5341)
  • [Improvement] Remove GCC 8 workarounds (#5347)
  • [Improvement] Better asserts (#5350)
  • [Fix] Remove invalid C++ flags (#5351)
  • [Improvement] Remove old code for clang <16 (#5356)
  • [Improvement] Improvements to better assert (#5361)
  • [Improvement] Explicitly set the minimal C++ versions (#5363)
  • [Improvement] Better assert (fbgemm CPU) (#5367)
  • [Improvement] Replace gcc builtin with std::atomic_ref (#5368)
  • [Improvement] Update AMD hardware detection script (#5373)
  • [Fix] Fix workflow to extract pytorch channel first before passing to build_fbgemm_gpu_package (#5379)
  • [Improvement] Migrate FBGEMM assert to FBGEMM_CHECK, pt 1 (#5382)
  • [Improvement] Apply C++17/20 modernization (#5384) (#5385)
  • [Improvement] Update OSS build script to include test packages setup (#5387)
  • [Fix] fix Wno-unused-command-line-argument (#5390) (#5394)
  • [Improvement] Scripts for building PyTorch (#5400)
  • [Improvement] Upgrade glibc to 2.28 (#5403)
  • [Fix] Fix kernel launcher test to work with CUDA_LAUNCH_BLOCKING=1 (#5408)
  • [Improvement] Update PyTorch OSS build scripts (#5412)
  • [New] Add dual s3/r2 upload (#5416)
  • [Improvement] Disable building asmjit and fbgemm when building genai module (#5420)
  • [Improvement] Trigger pytorch builds on exact SHA (#5426)
  • [Fix] Fix ROCm Nova builds (#5427)
  • [Fix] Fix build warnings (#5428)
  • [Improvement] Remove 12.0a arch from PyPI builds to reduce binary size (#5431)
  • [Fix] Pin MarkupSafe>=3.0.0 for Python 3.14 compatibility (#5433)
  • [Fix] Fix ROCm OSS build by setting GCC toolchain for C++20 support (#5443) (#5444)
  • [Improvement] Update json dependency to release/3.12.0 (#5473)

Tests and Benchmarks

  • [New] Add general helper scripts for benchmarking (#5345)
  • [Fix] Fix ROCm benchmark workflow (#5404)
  • [Improvement] Use MI300 runners for ROCm CI (#5414)
