Release Note
Highlights
Table Batched Embedding
For GPU
- New Table Batched Embedding (TBE) operators and momentum type support
- New In-Training Embedding Pruning (ITEP) operators
- Variable batch size embedding (VBE) support for Dense TBE
- Global weight decay support in TBE
- New type support and improvements to SSD TBE
- Improvements and bug fixes for TBE training and inference modules and sparse operators
For MTIA
- MTIA support for Dense TBE
Generative AI
- GenAI ops integration
- Support for Triton-based (#2570, #2618) and CUTLASS-based (#2552, #2537) operators
- New FP8 GEMM and quantization operators
- New query attention operators
- New CAR and All-To-All (NCCL-based) communication operators
- AMD Support for FP8
Others
- New MX4 quantization operators
- Support for CUDA 12.4
Better engineering
- Code refactoring and reorganization for faster builds
- New tests and benchmarks
- Improved AMD support
Software Requirements
FBGEMM_GPU v0.8.0 has been tested and is known to work on the following setups:
- PyTorch: v2.4
- CUDA: v11.8, 12.1, 12.4
- Python: v3.8, 3.9, 3.10, 3.11, 3.12
It is recommended to install and run FBGEMM_GPU in an isolated environment, such as Conda and/or Docker.
Availability
FBGEMM_GPU can be fetched directly from PyPI:
# FBGEMM_GPU CUDA variant (only the CUDA 12.1 variant is available)
pip install fbgemm-gpu==0.8.0
# FBGEMM_GPU CPU variant
pip install fbgemm-gpu-cpu==0.8.0
Alternatively, it can be fetched from PyTorch PIP:
# FBGEMM_GPU CUDA variant (CUDA 11.8)
pip install fbgemm-gpu==0.8.0 --index-url https://download.pytorch.org/whl/cu118/
# FBGEMM_GPU CUDA variant (CUDA 12.1)
pip install fbgemm-gpu==0.8.0 --index-url https://download.pytorch.org/whl/cu121/
# FBGEMM_GPU CPU variant
pip install fbgemm-gpu==0.8.0 --index-url https://download.pytorch.org/whl/cpu
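After installation, the package can be sanity-checked from Python (a minimal sketch; the op shown is one of FBGEMM's long-standing sparse utility operators and runs on both the CPU and CUDA variants):

# Importing fbgemm_gpu registers the FBGEMM operators under torch.ops.fbgemm.
import torch
import fbgemm_gpu  # noqa: F401

lengths = torch.tensor([1, 2, 3], dtype=torch.int32)
# asynchronous_complete_cumsum prepends a zero and computes the cumulative sum,
# turning per-bag lengths into the offsets layout used by the TBE operators.
offsets = torch.ops.fbgemm.asynchronous_complete_cumsum(lengths)
print(offsets)  # tensor([0, 1, 3, 6], dtype=torch.int32)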
Changes
Table Batched Embedding (TBE) Operators
For GPU
- [New] VBE support for Dense TBE (#2628, #2620, #2641)
- [New] BF16 momentum support in PARTIAL_ROWWISE_ADAM (#2524, #2522, #2518)
- [New] Global weight decay support (#2516, #2507, #2506; see the usage sketch after this list)
- [New] Multi-pass prefetch for memory efficiency (#2566)
- [Improvement] Work around masked_select for numel > MAX_INT (#2648)
- [Improvement] Fused optimizer-in-backward support with aot_autograd (#2651)
- [Improvement] Declared weights mutations in TBE backward op schemas (#2698)
- [Improvement] Helper ops to support cache conflict misses (#2571)
- [Improvement] Fixed the hang issue in some TBE GPU optimizers (#2509)
- [Improvement] Misc TBE fixes and refactoring (#2583, #2597, #2529)
- [Improvement] Cache prefetch and conflict miss improvements (#2596, #2514)
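As context for the items above, a minimal training-side TBE usage sketch (the table sizes, optimizer, and learning rate are illustrative, and the new global weight decay and VBE paths are not shown):

import torch
from fbgemm_gpu.split_embedding_configs import EmbOptimType
from fbgemm_gpu.split_table_batched_embeddings_ops_common import EmbeddingLocation
from fbgemm_gpu.split_table_batched_embeddings_ops_training import (
    ComputeDevice,
    SplitTableBatchedEmbeddingBagsCodegen,
)

# Two embedding tables batched into one operator; each spec is
# (num_embeddings, embedding_dim, placement, compute device).
tbe = SplitTableBatchedEmbeddingBagsCodegen(
    embedding_specs=[
        (1000, 64, EmbeddingLocation.DEVICE, ComputeDevice.CUDA),
        (2000, 64, EmbeddingLocation.DEVICE, ComputeDevice.CUDA),
    ],
    optimizer=EmbOptimType.EXACT_ROWWISE_ADAGRAD,
    learning_rate=0.01,
)

# Batch size B=2 over T=2 tables: indices are flattened across all bags,
# and offsets has B * T + 1 entries delimiting the bags.
indices = torch.tensor([3, 7, 11, 42], dtype=torch.int64, device="cuda")
offsets = torch.tensor([0, 1, 2, 3, 4], dtype=torch.int64, device="cuda")
out = tbe(indices, offsets)  # shape (B, sum of dims) = (2, 128)
out.sum().backward()         # the optimizer update is fused into backward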
For MTIA
- [New] Support MTIA in DenseTableBatchedEmbeddingBagsCodegen (#2680)
SSD Table Batched Embedding (TBE) Operators
- [New] Add FP16 weight and output support to SSD TBE (#2638)
- [New] Implementation of PS KV DB for FBGEMM TBE operator (#2664, #2642)
- [Improvement] Removal of D->H sync when calling lxu_cache_lookup (#2672)
- [Improvement] Recording of functions in SSD TBE (#2670)
- [Improvement] Added options, assertions and logs for training and inference SSD TBE (#2689, #2657)
- [Improvement] SSD TBE backend fixes (#2645, #2671)
New Operator Groups
- [New] In-Training Embedding Pruning (ITEP) ops (#2700, #2690, #2682)
- [New] Populate bucketize permute kernel (#2533)
- [New] MX4 quantization support (#2709, #2703, #2696, #2675, #2659)
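As a usage illustration for MX4, a minimal round-trip sketch (assuming the fp32_to_mx4 / mx4_to_fp32 helpers in fbgemm_gpu.quantize_utils and their group_size argument; the sizes shown are illustrative):

import torch
from fbgemm_gpu.quantize_utils import fp32_to_mx4, mx4_to_fp32

x = torch.randn(1024, device="cuda")
# MX4 packs groups of 4-bit elements that share a group exponent, so the
# packed result is a flat uint8 tensor at roughly half the element count.
packed = fp32_to_mx4(x, group_size=32)
x_hat = mx4_to_fp32(packed, group_size=32)
print((x - x_hat).abs().max())  # lossy: expect a small nonzero error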
GenAI FP8 Operators
- [New] FP8 enablement (#2615, #2637)
- [New] CK FP8 GEMM kernels (#2630)
- [New] FP8 Rowwise GEMM (#2585, #2622)
- [New] FP8 quantization and conversions to FP32/FP16 (#2686, #2681, #2593, #2540, #2677; see the sketch after this list)
- [New] FP8 blockwise GEMM (#2676, #2600)
- [New] Triton-based FP8 GEMM and quantization support (#2701, #2688, #2643)
- [New] AMD support for FP8 (#2582, #2658, #2611)
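For intuition on what the rowwise FP8 path computes, a plain-PyTorch sketch of rowwise FP8 quantization (this illustrates the technique only and is not the FBGEMM kernel API; torch.float8_e4m3fn is used for the cast):

import torch

def quantize_fp8_rowwise(x: torch.Tensor):
    # One scale per row: map each row's max magnitude onto the FP8 E4M3
    # representable maximum (448), then cast the scaled row to FP8.
    fp8_max = torch.finfo(torch.float8_e4m3fn).max
    scale = x.abs().amax(dim=1, keepdim=True).clamp(min=1e-12) / fp8_max
    xq = (x / scale).to(torch.float8_e4m3fn)
    return xq, scale

x = torch.randn(4, 8)
xq, scale = quantize_fp8_rowwise(x)
x_hat = xq.to(torch.float32) * scale  # dequantize to check the error
print((x - x_hat).abs().max())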
GenAI Support and Operators
- [New] Integrated GenAI ops into the build (#2512)
- [New] Support for Triton-based operators (#2570, #2618)
- [New] Support for CUTLASS-based operators (#2552, #2537)
- [New] CAR and All-To-All (NCCL-based) communication ops (#2606, #2667, #2631, #2624)
- [New] Grouped query attention ops (#2673, #2504; see the sketch after this list)
- [New] CK BF16 GEMM (#2617)
- [New] W4A8 GEMM kernels (#2558, #2607)
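For reference, grouped query attention shares each key/value head across a group of query heads; a plain-PyTorch sketch of the computation (illustrative shapes, not the FBGEMM ops):

import torch
import torch.nn.functional as F

B, H_Q, H_KV, S, D = 2, 8, 2, 16, 64  # 4 query heads per KV head
q = torch.randn(B, H_Q, S, D)
k = torch.randn(B, H_KV, S, D)
v = torch.randn(B, H_KV, S, D)

# Expand each KV head to cover its group of query heads, then run
# standard scaled dot-product attention.
k = k.repeat_interleave(H_Q // H_KV, dim=1)
v = v.repeat_interleave(H_Q // H_KV, dim=1)
out = F.scaled_dot_product_attention(q, k, v)
print(out.shape)  # torch.Size([2, 8, 16, 64])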
Pooled Embeddings
- [Improvement] Clean up unused pooled embedding ops (#2626)
- [Improvement] PyTorch compatibility fixes (#2619, #2629)
Sparse Operators
- [Improvement] Increased dynamic shared memory size to support larger bucket sizes (#2500)
- [Improvement] UINT8 support for reorder sequence embedding operator (#2531)
- [Improvement] Fixed CPU blocking D2H in JaggedIndexSelect2dOp backward (#2510)
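Several of these items concern FBGEMM's jagged (variable-row-length) layout; a minimal sketch of that layout using the jagged_to_padded_dense op (the values and offsets are illustrative):

import torch
import fbgemm_gpu  # noqa: F401  (registers torch.ops.fbgemm)

# A jagged tensor stores variable-length rows as flat values plus offsets;
# the rows here are [0], [1, 2], [3, 4, 5].
values = torch.arange(6, dtype=torch.float32).view(6, 1)
offsets = torch.tensor([0, 1, 3, 6], dtype=torch.int64)
dense = torch.ops.fbgemm.jagged_to_padded_dense(values, [offsets], [3], 0.0)
print(dense.shape)  # torch.Size([3, 3, 1]); short rows are padded with 0.0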
Benchmarks / Tests
- [New] Unified benchmarks and unit tests for FP8 (#2609, #2699, #2666)
- [Improvement] SSD TBE benchmarks (#2579, #2580)
- [Improvement] SSD TBE tests (#2665, #2647)
- [Improvement] Fixes for TBE tests and benchmarks (#2632)
- [Improvement] nbit_cache benchmark bandwidth calculation (#2511)
Build / CI Improvements and Fixes
- [New] Support for CUDA 12.4 (#2565)
- [Improvement] Improved AMD support (#2541, #2679)
- [Improvement] Strengthened artifact installation process (#2491)
- [Improvement] Memcheck added across operators (#2576, #2574, #2572, #2612, #2594, #2589, #2578)
- [Improvement] Refactoring of large header files (#2650)
- [Improvement] Improved build scripts to support debug flags and custom (e.g., GenAI) variants (#2702)