Release Note
Highlights
- Improvements and bug fixes for TBE variable batch size
- Many TBE extensions and benchmarks
- Enhanced TBE pipeline prefetching for UVM caching
- Code refactoring and reorganization for faster builds
- Many improvements and new sparse ops added
- Improved low precision ops
- Support for Python 3.12
- PyTorch 2 support for various operators
Software Requirements
FBGEMM_GPU v0.6.0 has been tested and is known to work on the following setups:
- PyTorch: v2.2
- CUDA: v11.8, 12.1
- Python: v3.8, 3.9, 3.10, 3.11, 3.12
It is recommended to install and run FBGEMM_GPU in an isolated environment, such as a Conda environment and/or a Docker container.
Availability
FBGEMM_GPU can be fetched directly from PyPI:
# FBGEMM_GPU CUDA variant (only the CUDA 12.1 variant is available)
pip install fbgemm-gpu==0.6.0
# FBGEMM_GPU CPU variant
pip install fbgemm-gpu-cpu==0.6.0
Alternatively, it can be fetched from the PyTorch PIP index:
# FBGEMM_GPU CUDA variant
pip install fbgemm-gpu==0.6.0 --index-url https://download.pytorch.org/whl/cu118/
pip install fbgemm-gpu==0.6.0 --index-url https://download.pytorch.org/whl/cu121/
# FBGEMM_GPU CPU variant
pip install fbgemm-gpu==0.6.0 --index-url https://download.pytorch.org/whl/cpu
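After installing either variant, a quick sanity check is to confirm that the package is importable (both the CUDA and CPU packages install under the `fbgemm_gpu` module name). The snippet below is a minimal sketch of such a check; it only probes the import machinery and does not exercise any kernels:

```python
import importlib.util

# Minimal post-install check: confirm that fbgemm_gpu (installed by either
# the fbgemm-gpu or fbgemm-gpu-cpu package) can be found by the importer.
def fbgemm_available() -> bool:
    return importlib.util.find_spec("fbgemm_gpu") is not None

print("fbgemm_gpu importable:", fbgemm_available())
```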
Changes
Table batched embedding (TBE) operators
- [Improvement] Extended support and bug fixes for variable batch size (#2012, #2043, #2107, #2150, #2188)
- [Improvement] Improved caching and cache lookup for pipeline prefetching (#2147, #2154, #2151)
- [New] Support MTIA device type in FBGEMM TBE training (#1994)
- [New] Enable sequence TBE on CPU via AVX (#2195)
- [New] Enable subwarp only for unweighted (#2051)
- [New] Add meta functions (#2094, #2102)
- [New] Add reverse qparam option for MTIA (#2109)
- [New] Add uvm_cache_stats for the direct-mapped cache (#1951, #1952)
- [Improvement] Use memcpy for CPU embedding in-place update (#2166)
- [Improvement] Remove indices and offsets copying from prefetch (#2186)
- [Improvement] Improve perf for L=0 cases for TBE v2 (#2046)
- [Improvement] General fixes and enhancements (#2030, #2009)
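Several of the items above (the variable batch size work, the prefetch changes to indices/offsets copying, and the L=0 fast path) operate on TBE's flattened input layout: lookups for all bags are concatenated into one `indices` array, and `offsets[i]:offsets[i+1]` delimits bag `i`. The sketch below illustrates that layout in plain Python; the variable and function names are illustrative, not the FBGEMM API:

```python
# Illustrative sketch of the indices/offsets (CSR-style) layout consumed by
# TBE: per-bag embedding row ids are flattened into `indices`, and `offsets`
# records bag boundaries. The empty middle bag is an L=0 case.
bags = [[3, 7], [], [1, 4, 9]]

indices = [row for bag in bags for row in bag]
offsets = [0]
for bag in bags:
    offsets.append(offsets[-1] + len(bag))

print(indices)   # [3, 7, 1, 4, 9]
print(offsets)   # [0, 2, 2, 5]
```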
Jagged Tensor Operators
- [Improvement] Fix incorrect SymInt signature on dense_to_jagged (#2039)
- [Improvement] Fix non-contiguous tensor problem in jagged_index_select (#2060, #2061)
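For context, a jagged tensor pairs a flat values buffer with offsets marking row boundaries; ops like dense_to_jagged convert between this form and a padded dense layout. The following is an illustrative plain-Python sketch of the jagged-to-dense direction, not the FBGEMM operator itself:

```python
# Illustrative jagged -> dense conversion: `values` holds all rows
# back-to-back, `offsets` marks row boundaries, and densifying pads
# (or truncates) each row to `max_len`. Names are illustrative only.
def jagged_to_dense(values, offsets, max_len, fill=0):
    rows = []
    for start, end in zip(offsets, offsets[1:]):
        row = values[start:end][:max_len]
        rows.append(row + [fill] * (max_len - len(row)))
    return rows

values = [10, 20, 30, 40, 50]
offsets = [0, 2, 3, 5]           # rows: [10, 20], [30], [40, 50]
print(jagged_to_dense(values, offsets, max_len=3))
# [[10, 20, 0], [30, 0, 0], [40, 50, 0]]
```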
Index Select Operators
- [Improvement] Get total D from CPU buffer in batch_index_select_dim0 (#2079)
Low-precision operators
- [New] Add BF16 in padded FP8 quantize ops (#2010)
- [Improvement] Improve quantize_comm error message (#2018)
- [Improvement] Fix illegal memory access error and initialize empty values on fp8 quantize kernel (#2131, #2176)
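The quantize ops above follow the usual row-wise scheme: each row stores low-precision codes alongside per-row quantization parameters (qparams). Below is a hedged plain-Python sketch of symmetric 8-bit row-wise quantization, for intuition only; the actual FBGEMM FP8/BF16 kernels use different formats, rounding, and packed layouts:

```python
# Illustrative symmetric row-wise 8-bit quantization: each row gets its own
# scale = max(|x|) / 127, values are rounded to integer codes, and
# dequantization multiplies the codes back by the scale.
def quantize_row(row):
    scale = max(abs(v) for v in row) / 127 or 1.0   # guard all-zero rows
    codes = [round(v / scale) for v in row]
    return codes, scale

def dequantize_row(codes, scale):
    return [c * scale for c in codes]

row = [0.5, -1.0, 0.25]
codes, scale = quantize_row(row)
approx = dequantize_row(codes, scale)
print(codes, scale)
```

The reconstruction error of each element is bounded by the per-row scale, which is why row-wise qparams beat a single tensor-wide scale when row magnitudes vary.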
Pooled Embedding
- [New] Add permute_duplicate_pooled_embeddings op for CPU (#1939)
- [Improvement] Use PyTorch's p2p access enable function (#2000)
- [New] Add support for duplicate in permutations for permute_pooled_embs_split (#1940)
- [Improvement] Improve all_to_one error message (#2019)
- [New] Add meta function for fbgemm::merge_pooled_embeddings operator (#2069)
- [New] Add variable batch per feature support to EBC (table-wise/column-wise sharding only) (#1986)
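permute_duplicate_pooled_embeddings and permute_pooled_embs_split reorder the per-feature slices of a concatenated pooled-embedding buffer, with the duplicate support in #1939/#1940 allowing a slice to appear more than once in the output. A plain-Python sketch of that slicing follows; the names and signature are illustrative, not the real operators:

```python
# Illustrative permutation of pooled embeddings: `pooled` is one
# concatenated vector, `dims[f]` is feature f's embedding width, and the
# permutation (duplicates allowed) selects feature slices in output order.
def permute_pooled(pooled, dims, permutation):
    starts = [0]
    for d in dims:
        starts.append(starts[-1] + d)
    out = []
    for f in permutation:
        out.extend(pooled[starts[f]:starts[f] + dims[f]])
    return out

pooled = [1, 2, 3, 4, 5, 6]   # feature 0 -> [1, 2], feature 1 -> [3, 4, 5], feature 2 -> [6]
print(permute_pooled(pooled, dims=[2, 3, 1], permutation=[2, 0, 0]))
# [6, 1, 2, 1, 2]  (feature 0 appears twice)
```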
Misc
- [New] Add meta backend for new_managed_tensor and sparse ops (#1990, #2028, #2029, #2072)
- [New] Use 4k page instead of 2M for managed tensor (#2058)
- [New] Add BF16 support for reorder_batched_ad_indices (#2116)
- [New] SymInts for sparse ops (#2017, #2089)
- [New] Support for CPU/GPU compilation (#2040)
- [New] Add impl_abstract (#2084, #2087, #2090, #2097, #2098, #2129, #2132)
- [Improvement] Make FBGEMM PT2 compliant (#2174, #2172, #2170, #2180, #2181, #2201, #2198)
- [Improvement] Fix invalid CUDA configuration error for the empty input (#1993)
Benchmarks / Tests
- [New] Add benchmark for block_bucketize_sparse_features with uneven sharding (#2140, #2169)
- [New] Add unit test for unique cache lookup (#2160)
- [New] Add autogenerated opcheck tests (#2050, #2069, #2073, #2092, #2118, #2139, #2152, #2173, #2193)
- [New] Add tests for fbgemm ops (#2136, #2082)
- [Improvement] Modified the TBE testbench to use FBGEMM's generate_requests function to generate indices and offsets (#1882)
- [Improvement] Remove FP64 from TBE CPU tests (#2049)
- [Improvement] Add warmup_runs to TBE benchmarks and run at least 1 warmup iteration (#2163)
- [Improvement] Add --pooling in TBE nbit_cpu benchmark (#2200)
- [Improvement] Fill embedding tables with randomized scales and bias in split-TBE benchmarks (#2031)
Build / CI Improvements and Fixes
- [Improvement] General CI and build system enhancements (#2065, #2071, #2078, #2149, #2189, #2203, #2204, #2209, #2047)
- [Improvement] Reorganized code to enable faster builds (#1881, #2083, #2085, #2095, #2141, #2112, #2133, #2145, #2196, #2100, #2103)
- [New] Add support for Python 3.12 (#2194)
- [New] Updates for ROCm 5.6, 5.7 and 6.0 support and Hip.cmake changes (#2066, #2088, #2106)
- [New] Add debug flags for HIP runs (#2206)
- [Improvement] Add unknown C++ flag detection in CMake (#2057)
- [Improvement] Fix inconsistent dll linkage warning (#2059, #2064)
- [Improvement] Fix heap-buffer-overflow in radix_sort_parallel (#2075)
- [Improvement] Update AVX2 and AVX512 flags (#2167)