Highlights
TBE Performance & Capabilities
- Added LRU eviction support for improved cache management (#5439)
- Introduced per-table EEG estimation for better resource allocation (#5374)
- Fixed AMD FP16 UVM performance regression with vectorized stores (#5469)
- Moved prefetched info to preallocated buffers for faster CPU TBE operations (#5450)
Quantization & Grouped GEMM
- Added FP16 support for grouped GEMM wgrad and dgrad kernels (#5313)
- Enabled direct MX4→BF16 dequantization to reduce memory usage (#5250)
- Improved EmbeddingSpMDMNBitRowWiseSparse with autovectorized variant (#5244)
Sparse Operations & Optimizations
- Added reserved slots to support always-on tables (#5377)
- Optimized sparse_permute_2d kernel for better performance (#5370)
- Improved group_index_select_or_add_2d_kernel on ROCm for small embedding dimensions (#5233)
- Added export trace profiling support to sparse ops benchmarks (#5311)
Platform Support
- Added Python 3.14 support across FBGEMM and TorchRec (#5300, #5310, #5322)
- Re-enabled CUDA 13 in Nova builds (#5301)
- Migrated to RocksDB 11.0 for SSD TBE (#5438)
- Upgraded to MI300 runners for ROCm CI testing (#5414)
Software Requirements
FBGEMM_GPU v1.6.0 has been tested and is known to work on the following setups:
- PyTorch: v2.11
- CUDA: v12.6, 12.8, 13.0
- Python: v3.9, 3.10, 3.11, 3.12, 3.13, 3.14
- ROCm: 7.0, 7.1
It is recommended to install and run FBGEMM_GPU inside an isolated environment, such as a Conda environment and/or a Docker container.
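Once a package has been installed (see Availability below), a quick way to confirm which FBGEMM_GPU variant is present in the environment is to query package metadata. This is an illustrative sketch, not part of the release; it uses only the Python standard library and assumes the PyPI distribution names `fbgemm-gpu` and `fbgemm-gpu-cpu` shown below.

```python
# Sanity check: report which FBGEMM_GPU distribution (if any) is installed.
# Uses only the standard library; distribution names are the PyPI package
# names used in the install commands below.
from importlib import metadata


def installed_fbgemm_version():
    """Return the installed fbgemm-gpu/fbgemm-gpu-cpu version string, or None."""
    for pkg in ("fbgemm-gpu", "fbgemm-gpu-cpu"):
        try:
            return metadata.version(pkg)
        except metadata.PackageNotFoundError:
            pass
    return None


if __name__ == "__main__":
    version = installed_fbgemm_version()
    if version is None:
        print("FBGEMM_GPU is not installed in this environment")
    else:
        print(f"FBGEMM_GPU version: {version}")
```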
Availability
FBGEMM_GPU can be fetched directly from PyPI:
# FBGEMM_GPU CUDA variant
pip install fbgemm-gpu==1.6.0
# FBGEMM_GPU CPU variant
pip install fbgemm-gpu-cpu==1.6.0
Alternatively, it can be fetched from PyTorch PIP:
# FBGEMM_GPU CUDA variant
pip install fbgemm-gpu==1.6.0 --index-url https://download.pytorch.org/whl/cu121/
pip install fbgemm-gpu==1.6.0 --index-url https://download.pytorch.org/whl/cu124/
# FBGEMM_GPU CPU variant
pip install fbgemm-gpu==1.6.0 --index-url https://download.pytorch.org/whl/cpu
Changes
Table Batched Embedding (TBE) Operators
For GPU
- [Improvement] Modifying clear_all_staged_data to accommodate KV Tensor Deletion (#5202)
- [Improvement] Replace data_ptr with {const,mutable}_data_ptr (#5282) (#5291)
- [Fix] include prepare_inputs in total_prefetch_duration timer (#5308)
- [Improvement] Replace data_ptr with {const,mutable}_data_ptr (#5304) (#5312)
- [Improvement] Improve TBEDataConfig and TBEParamsReporter (#5334)
- [Improvement] Benchmark merged VBE output (#5338)
- [Fix] bug: fix split cache_weights from cache_aux in memory reporting (#5343)
- [Improvement] Apply performance fixes (#5346) (#5349)
- [Improvement] More performance fixes (#5352) (#5355)
- [Fix] Passthrough optimized hip kernel when weights are not on device (#5357)
- [Fix] Fix resource leakage and other modernization (#5362) (#5366)
- [Improvement] Use std::ranges (#5369) (#5372)
- [New] add support for per-table eeg estimation (#5374)
- [Fix] Fix ROCm optimized kernel D320 case, opt out mixed precision (#5376)
- [Improvement] Use C++17 and 20 features to simplify code (#5375) (#5378)
- [Improvement] Refactor and improve backward_adagrad unit test (#5380)
- [New] Support EC in writeback hook, but with only one feature (#2365) (#5388)
- [Improvement] More modernization fixes (#5391) (#5395)
- [Fix] Fix backward_adagrad unit test (#5397)
- [Fix] Back out "More modernization fixes" (#5398)
- [Improvement] Change to CUDA_KERNEL_ASSERT (#5407)
- [Fix] Fix undefined symbol error in OSS (#5417)
- [Fix] Skip test_backward_adagrad_large_dims (#5424)
- [Improvement] Skip tests if GPU memory is insufficient (#5425)
- [Fix] Fix precision loss in TBE test offset construction (#5430)
- [Improvement] Thread raw embedding streamer to dram_kv_embedding_cache (#5432)
- [Improvement] add variable Ls support in triton bench (#5434)
- [New] LRU + opt in support (#5439)
- [New] Thread raw embedding streamer to dram_kv_embedding_cache (#5460)
- [Fix] Fix AMD FP16 UVM performance regression by using vectorized stores (#5469)
- [Fix] Fix OOM in weight comparison, variable E request gen (#5472)
For CPU
- [Fix] Update embedding_forward_quantized_cpu_template.cpp to use initialized output memory instead of uninitialized (#5054)
- [Improvement] Move the prefetched info to preallocated buffers (#5251)
- [Fix] achieve gpu-cpu parity for rowwise_adagrad_with_counter (#5405)
- [Improvement] Replace std::atomic_ref with folly::atomic_ref (#5419)
- [Fix] Back out "Move the prefetched info to preallocated buffers" (#5423)
- [New] Add benchmark with input files and refactor (#5448)
- [Improvement] Move the prefetched info to preallocated buffers (#5450)
- [Improvement] Replace C-style assert() with TORCH_CHECK/CUDA_KERNEL_ASSERT (#5452)
SSD Table Batched Embedding (TBE) Operators
- [Improvement] torchrec related changes for APF Integration (#5286)
- [Fix] KVZCH inference test fix minor bug (#5358)
- [Fix] Fix feature score eviction bucket edge case (#5399)
- [Improvement] Migrate to unique_ptr DB::Open for RocksDB 11.0 (#5438)
- [New] creating delete_rocksdb_checkpoint_dir function under KV Tensor (#5201) (#5457)
GenAI Support and Operators
Triton GEMM Support
- [Improvement] Enable direct MX4→BF16 dequantization to reduce memory (python side) (2/2) (#5250)
- [Improvement] Update triton and python (#5305)
Quantization Operators
- [New] add dynamic quantize gemm benchmark [step 1: minmax qparams compute] (#2297)
- [Improvement] Add EmbeddingSpMDMNBitRowWiseSparse autovectorized variant (#5244)
- [Improvement] Specialize more cases to improve EmbeddingSpMDMNBitBenchmark (#5245)
- [Fix] Work around MX4 correctness on ROCm issue for now (#5302)
- [Fix] Fix BF16 Grouped GEMM wgrad allocation (#5309)
- [New] Add FP16 support for grouped gemm wgrad and dgrad kernels. (#5313)
- [Improvement] use codesign ck for fbgemm ck moe (#5314)
- [Fix] Fix unused exception parameter warning (#5329)
- [Improvement] More performance fixes (#5359) (#5360)
- [Fix] Fix CUTLASS grouped GEMM wgrad NaN for zero-token experts (#5418)
- [Improvement] Replace C-style assert() with TORCH_CHECK/CUDA_KERNEL_ASSERT (genai) (#5455)
- [Improvement] Remove AVX file from aarch64 compilation (#5458)
Sparse Operators
- [Improvement] Add warp parallelism to populate_bucketized_permute (#5189)
- [Improvement] Optimize group_index_select_or_add_2d_kernel on ROCm by adding a separate codepath for small embedding dimensions (#5233)
- [Improvement] Choose _autovec version of GenerateEmbeddingSpMDMRowWiseSparse on AArch64 (#5247)
- [New] Add repeat_arange cuda kernel (#5278)
- [Fix] Synchronize before and after smem writes (#5288)
- [Fix] Fix infer warning (#5289)
- [Fix] Fix implicit type conversions (#5293) (#5297)
- [New] Add remap_indices_update_utils on CPU (#5307)
- [New] Add export trace profiling support to sparse ops benchmarks (#5311)
- [Improvement] Enable Half support for permute_2D_indices_weights_kernel_3 (#5333)
- [Improvement] Improve register allocation on asm transpose routine (#5344)
- [Improvement] Optimizations for index_select_scalar_cumsum_kernel on ROCm (#5263) (#5353)
- [Improvement] optimize sparse_permute_2d kernel (#5370)
- [New] use reserved slots to support always-on table (#5377)
- [Improvement] Implement cached member_id upper bound search (#5365) (#5406)
- [Fix] Do not use constexpr with __builtin_constant_p (#5410)
- [Improvement] Improve block_bucketize_sparse_features error messages (#5446)
- [Fix] Add dtype validation for group_index_select_dim0 inputs (#5454)
- [Fix] Fix to block bucketize sparse features benchmark for int dtype case (#5470)
Build / CI Improvements and Better Engineering
- [Improvement] [fbgemm_gpu] Increase timeout for ARM nova jobs (#3690)
- [Improvement] Remove unused AVX{2,512}_FLAGS (#5198)
- [Improvement] Cleanup branches for CUDA 9 (#5269)
- [Improvement] Replace .data_ptr with .mutable_data_ptr or .const_data_ptr (#5267) (#5276)
- [Improvement] Add tidy fixes (#5268) (#5284)
- [Fix] Remove the use of tests_to_skip.txt from the torchrec tests workflow (#5290)
- [Improvement] Add deprecation message for FBGEMM GenAI (#5292)
- [Improvement] Lint fixes (#5298)
- [Improvement] Remove unused code (#5294) (#5299)
- [Improvement] Add Python 3.14 support for FBGEMM (#5300)
- [Improvement] Re-enable CUDA 13 in Nova builds (#5301)
- [Improvement] Port fbgemm CPU warnings to GPU targets and fix warnings (#5303)
- [Improvement] Upgrade torchrec CI to python 3.14 (#5310)
- [Improvement] Add CPU support for cumem_utils to remove GPU dep on MTIA (#5315)
- [Improvement] Concurrent inference test (#5316)
- [Fix] Back out "Add CPU support for cumem_utils to remove GPU dep on MTIA" (#5317)
- [Improvement] Lint fixes (#5318)
- [Improvement] Only bother with -fno-trapping-math and -ftree-vectorize when targeting GCC (#5319)
- [Improvement] Add CPU support for cumem_utils to remove GPU dep on MTIA (#5315) (#5320)
- [Improvement] Upgrade torchrec CI to python 3.14, pt2 (#5322)
- [Improvement] Replace mutable_data_ptr with const_data_ptr for readonly tensors (#5323) (#5328)
- [Improvement] Upgrade setuptools_git_versioning (#5331)
- [Improvement] Fix lint errors (#5332)
- [Fix] Fix fp32 and fp16 tests on Apple Silicon (#5340)
- [Fix] Fix release version extraction (#5341)
- [Improvement] Remove GCC 8 workarounds (#5347)
- [Improvement] Better asserts (#5350)
- [Fix] Remove invalid C++ flags (#5351)
- [Improvement] Remove old code for clang <16 (#5356)
- [Improvement] Improvements to better assert (#5361)
- [Improvement] Explicitly set the minimal C++ versions (#5363)
- [Improvement] Better assert (fbgemm CPU) (#5367)
- [Improvement] Replace gcc builtin with std::atomic_ref (#5368)
- [Improvement] Update AMD hardware detection script (#5373)
- [Fix] Fix workflow to extract pytorch channel first before passing to build_fbgemm_gpu_package (#5379)
- [Improvement] Migrate FBGEMM assert to FBGEMM_CHECK, pt 1 (#5382)
- [Improvement] Apply C++17/20 modernization (#5384) (#5385)
- [Improvement] Update OSS build script to include test packages setup (#5387)
- [Fix] fix Wno-unused-command-line-argument (#5390) (#5394)
- [Improvement] Scripts for building PyTorch (#5400)
- [Improvement] Upgrade glibc to 2.28 (#5403)
- [Fix] Fix kernel launcher test to work with CUDA_LAUNCH_BLOCKING=1 (#5408)
- [Improvement] Update PyTorch OSS build scripts (#5412)
- [New] Add dual s3/r2 upload (#5416)
- [Improvement] Disable building asmjit and fbgemm when building genai module (#5420)
- [Improvement] Trigger pytorch builds on exact SHA (#5426)
- [Fix] Fix ROCm Nova builds (#5427)
- [Fix] Fix build warnings (#5428)
- [Improvement] Remove 12.0a arch from PyPI builds to reduce binary size (#5431)
- [Fix] Pin MarkupSafe>=3.0.0 for Python 3.14 compatibility (#5433)
- [Fix] Fix ROCm OSS build by setting GCC toolchain for C++20 support (#5443) (#5444)
- [Improvement] Update json dependency to release/3.12.0 (#5473)