# FBGEMM_GPU v0.6.0 Release Notes
## Highlights
- New optimizer and output type support for Table Batched Embedding (TBE) training
- Improvements and bug fixes for variable batch size TBE
- Enhanced TBE pipeline prefetching for UVM caching
- Many improvements to TBE CPU kernels
- New and enhanced low-precision operators
- Code refactoring and reorganization for faster builds
- New tests and benchmarks
- PyTorch 2 support for various operators
- Clang compilation support
## Software Requirements
FBGEMM_GPU v0.6.0 has been tested and is known to work on the following setups:
- PyTorch: v2.2
- CUDA: v11.8, 12.1
- Python: v3.8, 3.9, 3.10, 3.11, 3.12
It is recommended to prepare an isolated environment, such as Conda and/or Docker, for installing and running FBGEMM_GPU.
## Availability
FBGEMM_GPU can be fetched directly from PyPI:

```sh
# FBGEMM_GPU CUDA variant (only the CUDA 12.1 variant is available on PyPI)
pip install fbgemm-gpu==0.6.0

# FBGEMM_GPU CPU variant
pip install fbgemm-gpu-cpu==0.6.0
```
Alternatively, it can be fetched from the PyTorch PIP server:

```sh
# FBGEMM_GPU CUDA variant
pip install fbgemm-gpu==0.6.0 --index-url https://download.pytorch.org/whl/cu118/
pip install fbgemm-gpu==0.6.0 --index-url https://download.pytorch.org/whl/cu121/

# FBGEMM_GPU CPU variant
pip install fbgemm-gpu==0.6.0 --index-url https://download.pytorch.org/whl/cpu
```
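After installation, a quick sanity check confirms that the package imports cleanly and its operators are registered. This is a minimal sketch, not part of the release itself; importing fbgemm_gpu is what registers the torch.ops.fbgemm namespace, and the op resolved here is just one example.

```python
# Minimal post-install sanity check (sketch).
import torch
import fbgemm_gpu  # noqa: F401  -- importing registers the torch.ops.fbgemm ops

print(torch.__version__)
# Resolving a registered operator should succeed without error:
print(torch.ops.fbgemm.asynchronous_complete_cumsum)
```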
## Changes
### Table Batched Embedding (TBE) Operators
- [New] Added BF16 output support in TBE training (#2382); see the usage sketch after this list
- [New] Added support for INT8 output in sequence embeddings (#2316)
- [New] Added an auto-vectorized implementation of the CPU TBE-NBit kernel, selectable by the user (#2182, #2299)
- [New] Added CowClip optimizer (#2226, #2243)
- [Improvement] Extended support and bug fixes for variable batch size TBE (#2256, #2388, #2394, #2333)
- [Improvement] Optimized cache fetch for forward split (#2216, #2282, #2289, #2262, #2218)
- [Improvement] Fixes and enhancements to caching and cache lookup for pipeline prefetching (#2164, #2309, #2287, #2308)
- [Improvement] Built HIP rules by default (#2380)
- [New] Added a method to TBE module to recompute buffers (#2338)
- [New] Added meta functions for PyTorch 2 support (#2347)
- [New] Added support for MTIA in TBE modules (#2273, #2286)
- [Improvement] Improved TBE logging and stats report (#2379, #2378, #2377, #2386, #2337)
- [Improvement] General fixes and enhancements (#2235, #2398, #2212, #2269, #1782, #2270, #2265, #2385, #2370, #2349, #2312, #2411, #2400)
- [Deprecation] Deprecated optimizers (#2253, #2252)
- [Deprecation] Removed double type support from fbgemm_cuda_utils.cuh (#2335)
- [Deprecation] Removed INT8 weight/output support from TBE GPU training
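To illustrate the new BF16 output support (#2382), here is a minimal sketch that builds a single-table TBE training module with output_dtype set to BF16. The table shape, optimizer choice, and inputs are illustrative assumptions; the module and enum names follow the existing TBE training API, but check the signatures against your installed version.

```python
# Sketch: requesting BF16 output from a TBE training module (assumes a CUDA device).
import torch
from fbgemm_gpu.split_embedding_configs import EmbOptimType, SparseType
from fbgemm_gpu.split_table_batched_embeddings_ops_common import EmbeddingLocation
from fbgemm_gpu.split_table_batched_embeddings_ops_training import (
    ComputeDevice,
    SplitTableBatchedEmbeddingBagsCodegen,
)

emb = SplitTableBatchedEmbeddingBagsCodegen(
    # One table: 1000 rows of dimension 64, held in device memory (illustrative).
    embedding_specs=[(1000, 64, EmbeddingLocation.DEVICE, ComputeDevice.CUDA)],
    optimizer=EmbOptimType.EXACT_ROWWISE_ADAGRAD,
    output_dtype=SparseType.BF16,  # the new output type from this release
)

# Two bags over one table: indices [1, 2] and [3] (offsets has B*T + 1 entries).
indices = torch.tensor([1, 2, 3], dtype=torch.int64, device="cuda")
offsets = torch.tensor([0, 2, 3], dtype=torch.int64, device="cuda")

out = emb(indices=indices, offsets=offsets)
assert out.dtype == torch.bfloat16  # pooled output now comes back as BF16
```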
### Jagged Tensor Operators
- [Improvement] Removed device-host synchronization from keyed jagged index select (#2315)
- [Improvement] Fixed a half->int build error (#2240)
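For context on the layout these operators work with, the sketch below converts a jagged tensor (a flat values tensor plus an offsets tensor) into a padded dense tensor using the long-standing jagged_to_padded_dense op; the shapes and padding value are illustrative assumptions.

```python
# Sketch: jagged (values + offsets) -> padded dense conversion.
import torch
import fbgemm_gpu  # noqa: F401  -- registers the torch.ops.fbgemm ops

values = torch.tensor([[1.0], [2.0], [3.0]])   # 3 rows total, embedding dim 1
offsets = torch.tensor([0, 2, 2, 3])           # 3 bags with lengths 2, 0, 1
dense = torch.ops.fbgemm.jagged_to_padded_dense(
    values, [offsets], max_lengths=[2], padding_value=0.0
)
print(dense.shape)  # torch.Size([3, 2, 1]); short bags are zero-padded
```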
### Index Select Operators
- [Improvement] Fixed BF16 group_index_select_2d on AMD GPU (#2321)
### Low-Precision Operators
- [New] Added a CPU implementation of the per-channel quantize operator (#2341)
- [New] Added a CPU implementation of the qlinear_channelwise operator (#2343)
- [New] Enabled dequantization of CPU INT8 output to BF16 on CUDA (#2242)
- [New] Enabled dequantization to BF16 (#2241)
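The BF16 dequantization work above extends FBGEMM's fused rowwise quantization family. As background, this sketch shows the established FP32 roundtrip through the fused 8-bit rowwise ops; the exact entry points for the new BF16 output paths may differ, so treat this as orientation rather than a reference for the new ops.

```python
# Sketch: fused 8-bit rowwise quantize/dequantize roundtrip on CPU.
import torch
import fbgemm_gpu  # noqa: F401  -- registers the torch.ops.fbgemm ops

x = torch.randn(4, 8, dtype=torch.float32)

# Each row is quantized to int8, with a per-row scale and bias
# appended to the row's trailing bytes.
q = torch.ops.fbgemm.FloatToFused8BitRowwiseQuantized(x)

# Dequantize back; values match x up to rowwise quantization error.
y = torch.ops.fbgemm.Fused8BitRowwiseQuantizedToFloat(q)
assert y.shape == x.shape
```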
### Pooled Embedding
- [Improvement] Used gpu_library_selector for permute_pooled_embedding_ops_gpu (#2340)
### Misc
- [New] Added a CPU implementation of all_to_one_device (#2251)
- [Improvement] Improved performance of _block_bucketize_sparse_features_cuda_kernel1 (#2331)
- [New] Created cumem_utils_cpu and added it to all_deps_cpu (#2215)
- [New] Added float support to asynchronous_complete_cumsum_cpu (#2383); see the sketch after this list
- [Improvement] Added early exit to sparse ops (#2277, #2276, #2213, #2259)
- [New] Added an STBE GPU coalescing kernel (#2275)
- [Improvement] Removed symint from tbe_input_combine_with_length_abstract (#2336)
- [New] Added a GPU timing and basic reporting framework (#2314)
- [Improvement] Fixes for FBGEMM PT2 compliance (#2223, #2224, #2225, #2231, #2327)
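To show what the float support in #2383 applies to: asynchronous_complete_cumsum turns a lengths tensor of size N into an offsets tensor of size N + 1 that starts at zero, which is the standard lengths-to-offsets step for jagged and TBE inputs. A minimal sketch, with illustrative values:

```python
# Sketch: complete cumsum converts per-bag lengths into offsets.
import torch
import fbgemm_gpu  # noqa: F401  -- registers the torch.ops.fbgemm ops

lengths = torch.tensor([2, 0, 3], dtype=torch.int64)
offsets = torch.ops.fbgemm.asynchronous_complete_cumsum(lengths)
print(offsets)  # tensor([0, 2, 2, 5])

# Per #2383, the CPU path now also accepts float inputs:
f_offsets = torch.ops.fbgemm.asynchronous_complete_cumsum(torch.tensor([2.0, 0.0, 3.0]))
```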
### Benchmarks / Tests
- [New] Added a dynamic quantize GEMM benchmark (#2297, #2295, #2271)
- [New] Added a new CPU nbit-TBE benchmark that tries to reduce CPU frequency noise (#2306)
- [New] Added a unit test for stochastic rounding in UVM caching (#2324)
- [New] Added a unit test for AsyncSeriesTimer (#2364)
- [New] Added an int32 overflow unit test for TBE UVM caching (#2303)
- [Improvement] Disabled dynamo testing in TBE (#2381)
- [Improvement] Refactored and re-organized tests (#2305, #2292, #2291, #2284, #2281, #2274, #2272, #2266, #2263, #2260, #2407, #2406, #2402, #2304, #2399, #2393)
- [Improvement] General fixes for tests and benchmarks (#2301, #2300, #2298, #2255, #2205, #2296)
### Build / CI Improvements and Fixes
- [Improvement] Optimized EmbeddingSpMDMNBit_autovec (#2267)
- [Improvement] Switched between HIP and CUDA C++ shared-library (.so) loading (#2236)
- [Improvement] Fixed BF16 support issues (#2238)
- [New] Enabled Clang compilation in OSS for fbgemm_gpu (CPU and CUDA) (#2334, #2345, #2330, #2323)
- [New] Upgraded ROCm version (#2405)
- [Improvement] Enabled -Winfinite-recursion in deeplearning/PACKAGE (#2329)
- [Improvement] Fixed a shadowed variable in deeplearning/fbgemm/src/GroupwiseConv.cc (#2268)
- [Improvement] General CI and build system enhancement (#2489, #2430, #2427, #2423, #2356, #2348, #2342, #2328, #2307, #2211, #2219, #2220, #2228, #2233)
- [Improvement] Documentation enhancement (#2294, #2278, #2258, #2249, #2227, #2232, #2244, #2239, #2237)