# FBGEMM_GPU v0.6.0 Release Notes
## Highlights
- New optimizer and output type support for Table Batched Embedding (TBE) training
- Improvements and bug fixes for variable batch size TBE
- Enhanced TBE pipeline prefetching for UVM caching
- Many improvements to TBE CPU kernels
- New and enhanced low-precision operators
- Code refactoring and reorganization for faster builds
- New tests and benchmarks
- PyTorch 2 support for various operators
- Clang compilation support
## Software Requirements
FBGEMM_GPU v0.6.0 has been tested and is known to work on the following setups:
- PyTorch: v2.2
- CUDA: v11.8, 12.1
- Python: v3.8, 3.9, 3.10, 3.11, 3.12
It is recommended to prepare an isolated environment, such as Conda and/or Docker, for installing and running FBGEMM_GPU.
## Availability
FBGEMM_GPU can be fetched directly from PyPI:

```sh
# FBGEMM_GPU CUDA variant (only the CUDA 12.1 variant is available on PyPI)
pip install fbgemm-gpu==0.6.0

# FBGEMM_GPU CPU variant
pip install fbgemm-gpu-cpu==0.6.0
```
Alternatively, it can be fetched from the PyTorch PIP server:

```sh
# FBGEMM_GPU CUDA variant
pip install fbgemm-gpu==0.6.0 --index-url https://download.pytorch.org/whl/cu118/
pip install fbgemm-gpu==0.6.0 --index-url https://download.pytorch.org/whl/cu121/

# FBGEMM_GPU CPU variant
pip install fbgemm-gpu==0.6.0 --index-url https://download.pytorch.org/whl/cpu
```
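After installation, a quick sanity check confirms that the package imports cleanly and its operators are registered. This is a minimal sketch, not part of the release itself; importing fbgemm_gpu is what registers the torch.ops.fbgemm namespace, and the op resolved here is just one example.

```python
# Minimal post-install sanity check (sketch).
import torch
import fbgemm_gpu  # noqa: F401  -- importing registers the torch.ops.fbgemm ops

print(torch.__version__)
# Resolving a registered operator should succeed without error:
print(torch.ops.fbgemm.asynchronous_complete_cumsum)
```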
## Changes
### Table Batched Embedding (TBE) Operators
- [New] Added BF16 output support in TBE training (#2382); see the usage sketch after this list
- [New] Added support for INT8 output in sequence embeddings (#2316)
- [New] Added an auto-vectorized implementation of the CPU TBE-NBit kernel, selectable by the user (#2182, #2299)
- [New] Added CowClip optimizer (#2226, #2243)
- [Improvement] Extended support and bug fixes for variable batch size TBE (#2256, #2388, #2394, #2333)
- [Improvement] Optimized cache fetch for forward split (#2216, #2282, #2289, #2262, #2218)
- [Improvement] Fixes and enhancements to caching and cache lookup for pipeline prefetching (#2164, #2309, #2287, #2308)
- [Improvement] Built HIP rules by default (#2380)
- [New] Added a method to TBE module to recompute buffers (#2338)
- [New] Added meta functions for PyTorch 2 support (#2347)
- [New] Added support for MTIA in TBE modules (#2273, #2286)
- [Improvement] Improved TBE logging and stats report (#2379, #2378, #2377, #2386, #2337)
- [Improvement] General fixes and enhancements (#2235, #2398, #2212, #2269, #1782, #2270, #2265, #2385, #2370, #2349, #2312, #2411, #2400)
- [Deprecation] Deprecated optimizers (#2253, #2252)
- [Deprecation] Removed double type support from fbgemm_cuda_utils.cuh (#2335)
- [Deprecation] Removed INT8 weight/output support from TBE GPU training
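To illustrate the new BF16 output support (#2382), here is a minimal sketch that builds a single-table TBE training module with output_dtype set to BF16. The table shape, optimizer choice, and inputs are illustrative assumptions; the module and enum names follow the existing TBE training API, but check the signatures against your installed version.

```python
# Sketch: requesting BF16 output from a TBE training module (assumes a CUDA device).
import torch
from fbgemm_gpu.split_embedding_configs import EmbOptimType, SparseType
from fbgemm_gpu.split_table_batched_embeddings_ops_common import EmbeddingLocation
from fbgemm_gpu.split_table_batched_embeddings_ops_training import (
    ComputeDevice,
    SplitTableBatchedEmbeddingBagsCodegen,
)

emb = SplitTableBatchedEmbeddingBagsCodegen(
    # One table: 1000 rows of dimension 64, held in device memory (illustrative).
    embedding_specs=[(1000, 64, EmbeddingLocation.DEVICE, ComputeDevice.CUDA)],
    optimizer=EmbOptimType.EXACT_ROWWISE_ADAGRAD,
    output_dtype=SparseType.BF16,  # the new output type from this release
)

# Two bags over one table: indices [1, 2] and [3] (offsets has B*T + 1 entries).
indices = torch.tensor([1, 2, 3], dtype=torch.int64, device="cuda")
offsets = torch.tensor([0, 2, 3], dtype=torch.int64, device="cuda")

out = emb(indices=indices, offsets=offsets)
assert out.dtype == torch.bfloat16  # pooled output now comes back as BF16
```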
### Jagged Tensor Operators
- [Improvement] Removed device-host synchronization from keyed jagged index select (#2315)
- [Improvement] Fixed a half->int build error (#2240)
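For context on the layout these operators work with, the sketch below converts a jagged tensor (a flat values tensor plus an offsets tensor) into a padded dense tensor using the long-standing jagged_to_padded_dense op; the shapes and padding value are illustrative assumptions.

```python
# Sketch: jagged (values + offsets) -> padded dense conversion.
import torch
import fbgemm_gpu  # noqa: F401  -- registers the torch.ops.fbgemm ops

values = torch.tensor([[1.0], [2.0], [3.0]])   # 3 rows total, embedding dim 1
offsets = torch.tensor([0, 2, 2, 3])           # 3 bags with lengths 2, 0, 1
dense = torch.ops.fbgemm.jagged_to_padded_dense(
    values, [offsets], max_lengths=[2], padding_value=0.0
)
print(dense.shape)  # torch.Size([3, 2, 1]); short bags are zero-padded
```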
### Index Select Operators
- [Improvement] Fixed BF16 group_index_select_2d on AMD GPU (#2321)
### Low-Precision Operators
- [New] Added a CPU implementation of the per-channel quantize operator (#2341)
- [New] Added a CPU implementation of the qlinear_channelwise operator (#2343)
- [New] Enabled dequantization of CPU INT8 output to BF16 on CUDA (#2242)
- [New] Enabled dequantization to BF16 (#2241)
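The BF16 dequantization work above extends FBGEMM's fused rowwise quantization family. As background, this sketch shows the established FP32 roundtrip through the fused 8-bit rowwise ops; the exact entry points for the new BF16 output paths may differ, so treat this as orientation rather than a reference for the new ops.

```python
# Sketch: fused 8-bit rowwise quantize/dequantize roundtrip on CPU.
import torch
import fbgemm_gpu  # noqa: F401  -- registers the torch.ops.fbgemm ops

x = torch.randn(4, 8, dtype=torch.float32)

# Each row is quantized to int8, with a per-row scale and bias
# appended to the row's trailing bytes.
q = torch.ops.fbgemm.FloatToFused8BitRowwiseQuantized(x)

# Dequantize back; values match x up to rowwise quantization error.
y = torch.ops.fbgemm.Fused8BitRowwiseQuantizedToFloat(q)
assert y.shape == x.shape
```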
### Pooled Embedding
- [Improvement] Used gpu_library_selector for permute_pooled_embedding_ops_gpu (#2340)
### Misc
- [New] Added a CPU implementation of all_to_one_device (#2251)
- [Improvement] Improved performance of _block_bucketize_sparse_features_cuda_kernel1 (#2331)
- [New] Created cumem_utils_cpu and added it to all_deps_cpu (#2215)
- [New] Added float support to asynchronous_complete_cumsum_cpu (#2383); see the sketch after this list
- [Improvement] Added early exit to sparse ops (#2277, #2276, #2213, #2259)
- [New] Added an STBE GPU coalescing kernel (#2275)
- [Improvement] Removed symint from tbe_input_combine_with_length_abstract (#2336)
- [New] Added a GPU timing and basic reporting framework (#2314)
- [Improvement] Fixes for FBGEMM PT2 compliance (#2223, #2224, #2225, #2231, #2327)
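To show what the float support in #2383 applies to: asynchronous_complete_cumsum turns a lengths tensor of size N into an offsets tensor of size N + 1 that starts at zero, which is the standard lengths-to-offsets step for jagged and TBE inputs. A minimal sketch, with illustrative values:

```python
# Sketch: complete cumsum converts per-bag lengths into offsets.
import torch
import fbgemm_gpu  # noqa: F401  -- registers the torch.ops.fbgemm ops

lengths = torch.tensor([2, 0, 3], dtype=torch.int64)
offsets = torch.ops.fbgemm.asynchronous_complete_cumsum(lengths)
print(offsets)  # tensor([0, 2, 2, 5])

# Per #2383, the CPU path now also accepts float inputs:
f_offsets = torch.ops.fbgemm.asynchronous_complete_cumsum(torch.tensor([2.0, 0.0, 3.0]))
```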
### Benchmarks / Tests
- [New] Added a dynamic quantize GEMM benchmark (#2297, #2295, #2271)
- [New] Added a new CPU nbit-TBE benchmark that tries to reduce CPU frequency noise (#2306)
- [New] Added a unit test for stochastic rounding in UVM caching (#2324)
- [New] Added a unit test for AsyncSeriesTimer (#2364)
- [New] Added an int32 overflow unit test for TBE UVM caching (#2303)
- [Improvement] Disabled dynamo testing in TBE (#2381)
- [Improvement] Refactored and re-organized tests (#2305, #2292, #2291, #2284, #2281, #2274, #2272, #2266, #2263, #2260, #2407, #2406, #2402, #2304, #2399, #2393)
- [Improvement] General fixes for tests and benchmarks (#2301, #2300, #2298, #2255, #2205, #2296)
### Build / CI Improvements and Fixes
- [Improvement] Optimized EmbeddingSpMDMNBit_autovec (#2267)
- [Improvement] Switched between HIP and CUDA C++ shared-library (.so) loading (#2236)
- [Improvement] Fixed BF16 support issues (#2238)
- [New] Enabled Clang compilation in OSS for fbgemm_gpu (CPU and CUDA) (#2334, #2345, #2330, #2323)
- [New] Upgraded ROCm version (#2405)
- [Improvement] Enabled -Winfinite-recursion in deeplearning/PACKAGE (#2329)
- [Improvement] Fixed a shadowed variable in deeplearning/fbgemm/src/GroupwiseConv.cc (#2268)
- [Improvement] General CI and build system enhancement (#2489, #2430, #2427, #2423, #2356, #2348, #2342, #2328, #2307, #2211, #2219, #2220, #2228, #2233)
- [Improvement] Documentation enhancement (#2294, #2278, #2258, #2249, #2227, #2232, #2244, #2239, #2237)