Highlights
TBE Performance & Capabilities
- Added LRU eviction support for improved cache management (#5439)
- Introduced per-table EEG estimation for better resource allocation (#5374)
- Fixed AMD FP16 UVM performance regression with vectorized stores (#5469)
- Moved prefetched info to preallocated buffers for faster CPU TBE operations (#5450)
Quantization & Grouped GEMM
- Added FP16 support for grouped GEMM wgrad and dgrad kernels (#5313)
- Enabled direct MX4→BF16 dequantization to reduce memory usage (#5250)
- Improved EmbeddingSpMDMNBitRowWiseSparse with autovectorized variant (#5244)
Sparse Operations & Optimizations
- Added reserved slots to support always-on tables (#5377)
- Optimized sparse_permute_2d kernel for better performance (#5370)
- Improved group_index_select_or_add_2d_kernel on ROCm for small embedding dimensions (#5233)
- Added export trace profiling support to sparse ops benchmarks (#5311)
Platform Support
- Added Python 3.14 support across FBGEMM and TorchRec (#5300, #5310, #5322)
- Re-enabled CUDA 13 in Nova builds (#5301)
- Migrated to RocksDB 11.0 for SSD TBE (#5438)
- Upgraded to MI300 runners for ROCm CI testing (#5414)
Software Requirements
FBGEMM_GPU v1.6.0 has been tested and is known to work on the following setups:
- PyTorch: v2.11
- CUDA: v12.6, 12.8, 13.0
- Python: v3.9, 3.10, 3.11, 3.12, 3.13, 3.14
- ROCm: 7.0, 7.1
It is recommended to install and run FBGEMM_GPU inside an isolated environment, such as a Conda environment and/or a Docker container.
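Once a package has been installed (see Availability below), a quick way to confirm which FBGEMM_GPU variant is present in the environment is to query package metadata. This is an illustrative sketch, not part of the release; it uses only the Python standard library and assumes the PyPI distribution names `fbgemm-gpu` and `fbgemm-gpu-cpu` shown below.

```python
# Sanity check: report which FBGEMM_GPU distribution (if any) is installed.
# Uses only the standard library; distribution names are the PyPI package
# names used in the install commands below.
from importlib import metadata


def installed_fbgemm_version():
    """Return the installed fbgemm-gpu/fbgemm-gpu-cpu version string, or None."""
    for pkg in ("fbgemm-gpu", "fbgemm-gpu-cpu"):
        try:
            return metadata.version(pkg)
        except metadata.PackageNotFoundError:
            pass
    return None


if __name__ == "__main__":
    version = installed_fbgemm_version()
    if version is None:
        print("FBGEMM_GPU is not installed in this environment")
    else:
        print(f"FBGEMM_GPU version: {version}")
```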
Availability
FBGEMM_GPU can be fetched directly from PyPI:
# FBGEMM_GPU CUDA variant
pip install fbgemm-gpu==1.6.0
# FBGEMM_GPU CPU variant
pip install fbgemm-gpu-cpu==1.6.0
Alternatively, it can be fetched from PyTorch PIP:
# FBGEMM_GPU CUDA variant
pip install fbgemm-gpu==1.6.0 --index-url https://download.pytorch.org/whl/cu121/
pip install fbgemm-gpu==1.6.0 --index-url https://download.pytorch.org/whl/cu124/
# FBGEMM_GPU CPU variant
pip install fbgemm-gpu==1.6.0 --index-url https://download.pytorch.org/whl/cpu
Changes
Table Batched Embedding (TBE) Operators
For GPU
- [Improvement] Modifying clear_all_staged_data to accommodate KV Tensor Deletion (#5202)
- [Improvement] Replace data_ptr with {const,mutable}_data_ptr (#5282) (#5291)
- [Fix] include prepare_inputs in total_prefetch_duration timer (#5308)
- [Improvement] Replace data_ptr with {const,mutable}_data_ptr (#5304) (#5312)
- [Improvement] Improve TBEDataConfig and TBEParamsReporter (#5334)
- [Improvement] Benchmark merged VBE output (#5338)
- [Fix] bug: fix split cache_weights from cache_aux in memory reporting (#5343)
- [Improvement] Apply performance fixes (#5346) (#5349)
- [Improvement] More performance fixes (#5352) (#5355)
- [Fix] Passthrough optimized hip kernel when weights are not on device (#5357)
- [Fix] Fix resource leakage and other modernization (#5362) (#5366)
- [Improvement] Use std::ranges (#5369) (#5372)
- [New] add support for per-table eeg estimation (#5374)
- [Fix] Fix ROCm optimized kernel D320 case, opt out mixed precision (#5376)
- [Improvement] Use C++17 and 20 features to simplify code (#5375) (#5378)
- [Improvement] Refactor and improve backward_adagrad unit test (#5380)
- [New] Support EC in writeback hook, but with only one feature (#2365) (#5388)
- [Improvement] More modernization fixes (#5391) (#5395)
- [Fix] Fix backward_adagrad unit test (#5397)
- [Fix] Back out "More modernization fixes" (#5398)
- [Improvement] Change to CUDA_KERNEL_ASSERT (#5407)
- [Fix] Fix undefined symbol error in OSS (#5417)
- [Fix] Skip test_backward_adagrad_large_dims (#5424)
- [Improvement] Skip tests if GPU memory is insufficient (#5425)
- [Fix] Fix precision loss in TBE test offset construction (#5430)
- [Improvement] Thread raw embedding streamer to dram_kv_embedding_cache (#5432)
- [Improvement] add variable Ls support in triton bench (#5434)
- [New] LRU + opt in support (#5439)
- [New] Thread raw embedding streamer to dram_kv_embedding_cache (#5460)
- [Fix] Fix AMD FP16 UVM performance regression by using vectorized stores (#5469)
- [Fix] Fix OOM in weight comparison, variable E request gen (#5472)
For CPU
- [Fix] Update embedding_forward_quantized_cpu_template.cpp to use initialized output memory instead of uninitialized (#5054)
- [Improvement] Move the prefetched info to preallocated buffers (#5251)
- [Fix] achieve gpu-cpu parity for rowwise_adagrad_with_counter (#5405)
- [Improvement] Replace std::atomic_ref with folly::atomic_ref (#5419)
- [Fix] Back out "Move the prefetched info to preallocated buffers" (#5423)
- [New] Add benchmark with input files and refactor (#5448)
- [Improvement] Move the prefetched info to preallocated buffers (#5450)
- [Improvement] Replace C-style assert() with TORCH_CHECK/CUDA_KERNEL_ASSERT (#5452)
SSD Table Batched Embedding (TBE) Operators
- [Improvement] torchrec related changes for APF Integration (#5286)
- [Fix] KVZCH inference test fix minor bug (#5358)
- [Fix] Fix feature score eviction bucket edge case (#5399)
- [Improvement] Migrate to unique_ptr DB::Open for RocksDB 11.0 (#5438)
- [New] creating delete_rocksdb_checkpoint_dir function under KV Tensor (#5201) (#5457)
GenAI Support and Operators
Triton GEMM Support
- [Improvement] Enable direct MX4→BF16 dequantization to reduce memory (python side) (2/2) (#5250)
- [Improvement] Update triton and python (#5305)
Quantization Operators
- [New] add dynamic quantize gemm benchmark [step 1: minmax qparams compute] (#2297)
- [Improvement] Add EmbeddingSpMDMNBitRowWiseSparse autovectorized variant (#5244)
- [Improvement] Specialize more cases to improve EmbeddingSpMDMNBitBenchmark (#5245)
- [Fix] Work around MX4 correctness on ROCm issue for now (#5302)
- [Fix] Fix BF16 Grouped GEMM wgrad allocation (#5309)
- [New] Add FP16 support for grouped gemm wgrad and dgrad kernels. (#5313)
- [Improvement] use codesign ck for fbgemm ck moe (#5314)
- [Fix] Fix unused exception parameter warning (#5329)
- [Improvement] More performance fixes (#5359) (#5360)
- [Fix] Fix CUTLASS grouped GEMM wgrad NaN for zero-token experts (#5418)
- [Improvement] Replace C-style assert() with TORCH_CHECK/CUDA_KERNEL_ASSERT (genai) (#5455)
- [Improvement] Remove AVX file from aarch64 compilation (#5458)
Sparse Operators
- [Improvement] Add warp parallelism to populate_bucketized_permute (#5189)
- [Improvement] Optimize group_index_select_or_add_2d_kernel on ROCm by adding a separate codepath for small embedding dimensions (#5233)
- [Improvement] Choose _autovec version of GenerateEmbeddingSpMDMRowWiseSparse on AArch64 (#5247)
- [New] Add repeat_arange cuda kernel (#5278)
- [Fix] Synchronize before and after smem writes (#5288)
- [Fix] Fix infer warning (#5289)
- [Fix] Fix implicit type conversions (#5293) (#5297)
- [New] Add remap_indices_update_utils on CPU (#5307)
- [New] Add export trace profiling support to sparse ops benchmarks (#5311)
- [Improvement] Enable Half support for permute_2D_indices_weights_kernel_3 (#5333)
- [Improvement] Improve register allocation on asm transpose routine (#5344)
- [Improvement] Optimizations for index_select_scalar_cumsum_kernel on ROCm (#5263) (#5353)
- [Improvement] optimize sparse_permute_2d kernel (#5370)
- [New] use reserved slots to support always-on table (#5377)
- [Improvement] Implement cached member_id upper bound search (#5365) (#5406)
- [Fix] Do not use constexpr with __builtin_constant_p (#5410)
- [Improvement] Improve block_bucketize_sparse_features error messages (#5446)
- [Fix] Add dtype validation for group_index_select_dim0 inputs (#5454)
- [Fix] Fix to block bucketize sparse features benchmark for int dtype case (#5470)
Build / CI Improvements and Better Engineering
- [Improvement] [fbgemm_gpu] Increase timeout for ARM nova jobs (#3690)
- [Improvement] Remove unused AVX{2,512}_FLAGS (#5198)
- [Improvement] Cleanup branches for CUDA 9 (#5269)
- [Improvement] Replace .data_ptr with .mutable_data_ptr or .const_data_ptr (#5267) (#5276)
- [Improvement] Add tidy fixes (#5268) (#5284)
- [Fix] Remove the use of tests_to_skip.txt from the torchrec tests workflow (#5290)
- [Improvement] Add deprecation message for FBGEMM GenAI (#5292)
- [Improvement] Lint fixes (#5298)
- [Improvement] Remove unused code (#5294) (#5299)
- [Improvement] Add Python 3.14 support for FBGEMM (#5300)
- [Improvement] Re-enable CUDA 13 in Nova builds (#5301)
- [Improvement] Port fbgemm CPU warnings to GPU targets and fix warnings (#5303)
- [Improvement] Upgrade torchrec CI to python 3.14 (#5310)
- [Improvement] Add CPU support for cumem_utils to remove GPU dep on MTIA (#5315)
- [Improvement] Concurrent inference test (#5316)
- [Fix] Back out "Add CPU support for cumem_utils to remove GPU dep on MTIA" (#5317)
- [Improvement] Lint fixes (#5318)
- [Improvement] Only bother with -fno-trapping-math and -ftree-vectorize when targeting GCC (#5319)
- [Improvement] Add CPU support for cumem_utils to remove GPU dep on MTIA (#5315) (#5320)
- [Improvement] Upgrade torchrec CI to python 3.14, pt2 (#5322)
- [Improvement] Replace mutable_data_ptr with const_data_ptr for readonly tensors (#5323) (#5328)
- [Improvement] Upgrade setuptools_git_versioning (#5331)
- [Improvement] Fix lint errors (#5332)
- [Fix] Fix fp32 and fp16 tests on Apple Silicon (#5340)
- [Fix] Fix release version extraction (#5341)
- [Improvement] Remove GCC 8 workarounds (#5347)
- [Improvement] Better asserts (#5350)
- [Fix] Remove invalid C++ flags (#5351)
- [Improvement] Remove old code for clang <16 (#5356)
- [Improvement] Improvements to better assert (#5361)
- [Improvement] Explicitly set the minimal C++ versions (#5363)
- [Improvement] Better assert (fbgemm CPU) (#5367)
- [Improvement] Replace gcc builtin with std::atomic_ref (#5368)
- [Improvement] Update AMD hardware detection script (#5373)
- [Fix] Fix workflow to extract pytorch channel first before passing to build_fbgemm_gpu_package (#5379)
- [Improvement] Migrate FBGEMM assert to FBGEMM_CHECK, pt 1 (#5382)
- [Improvement] Apply C++17/20 modernization (#5384) (#5385)
- [Improvement] Update OSS build script to include test packages setup (#5387)
- [Fix] fix Wno-unused-command-line-argument (#5390) (#5394)
- [Improvement] Scripts for building PyTorch (#5400)
- [Improvement] Upgrade glibc to 2.28 (#5403)
- [Fix] Fix kernel launcher test to work with CUDA_LAUNCH_BLOCKING=1 (#5408)
- [Improvement] Update PyTorch OSS build scripts (#5412)
- [New] Add dual s3/r2 upload (#5416)
- [Improvement] Disable building asmjit and fbgemm when building genai module (#5420)
- [Improvement] Trigger pytorch builds on exact SHA (#5426)
- [Fix] Fix ROCm Nova builds (#5427)
- [Fix] Fix build warnings (#5428)
- [Improvement] Remove 12.0a arch from PyPI builds to reduce binary size (#5431)
- [Fix] Pin MarkupSafe>=3.0.0 for Python 3.14 compatibility (#5433)
- [Fix] Fix ROCm OSS build by setting GCC toolchain for C++20 support (#5443) (#5444)
- [Improvement] Update json dependency to release/3.12.0 (#5473)