github pytorch/FBGEMM v1.0.0
FBGEMM_GPU v1.0.0 Release Notes


Stable API

Stable API support is provided starting with FBGEMM_GPU v1.0.0. It covers the Table Batched Embedding (TBE) modules, pooled embedding operators and modules, sparse operators, jagged tensor operators, and quantization operators.

  • API backward-compatibility guarantees via thorough testing. We guarantee that our stable APIs will remain backward compatible within a major version, meaning that the stable APIs for v1.0.0 will be compatible with every future v1.x release unless explicitly announced in advance.
  • Enhanced documentation, ensuring that every stable API has comprehensive and up-to-date documentation.
  • Functionality guarantees are provided only through our unit testing framework. We do NOT guarantee any functionality that is not explicitly tested and documented in our unit tests.
  • No performance guarantees. However, we are committed to providing support on a best-effort basis.

More details can be found in the stable API documentation.
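
As an informal illustration of what the jagged tensor operators in the stable API compute, here is a pure-Python sketch of a jagged-to-padded-dense conversion. The function name, argument layout, and semantics are illustrative stand-ins for the real operator, which works on PyTorch tensors:

```python
def jagged_to_padded_dense(values, offsets, max_length, padding_value=0.0):
    """Convert a jagged (variable-length) batch into a padded 2D list.

    values:  flat list of all elements across the batch
    offsets: length B+1; row i spans values[offsets[i]:offsets[i+1]]
    """
    dense = []
    for i in range(len(offsets) - 1):
        row = values[offsets[i]:offsets[i + 1]][:max_length]
        dense.append(row + [padding_value] * (max_length - len(row)))
    return dense

# Three rows of lengths 2, 0, and 3, padded to width 3.
print(jagged_to_padded_dense([1, 2, 3, 4, 5], [0, 2, 2, 5], 3))
```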

Highlights

Table Batched Embedding (TBE)

  • New optimizer support for TBE Training
  • Enhanced Global weight decay support in TBE
  • Improvements and bug fixes for TBE training and inference modules and sparse operators

For SSD

  • New pipeline prefetching enabled
  • New cache and indices related ops
  • Integration of L2 cache into TBE operators
  • Many kernel and logging improvements

For CPU

  • New type support for CPU Sequence TBE
  • Kernel improvements and bug fixes

Generative AI

  • GenAI ops support and improvements
  • Improvements to Triton-based and CUTLASS-based operators
  • New and optimized FP8 GEMM and quantization operators

Others

  • Optimized MX4 quantization operators
  • New dequantization operator
  • Removal of Python 3.8 support

Better engineering

  • Code refactoring and reorganization for faster builds
  • New and improved tests and benchmarks
  • Improved AMD support

Software Requirements

FBGEMM_GPU v1.0.0 has been tested and known to work on the following setups:

  • PyTorch: v2.5
  • CUDA: v11.8, 12.1, 12.4
  • Python: v3.9, 3.10, 3.11, 3.12

We recommend preparing an isolated environment, such as Conda and/or Docker, for installing and running FBGEMM_GPU.

Availability

FBGEMM_GPU can be fetched directly from PyPI:

# FBGEMM_GPU CUDA variant (only the CUDA 12.4 variant is available)
pip install fbgemm-gpu==1.0.0

# FBGEMM_GPU CPU variant
pip install fbgemm-gpu-cpu==1.0.0

Alternatively, it can be fetched from the PyTorch package registry:

# FBGEMM_GPU CUDA variant
pip install fbgemm-gpu==1.0.0 --index-url https://download.pytorch.org/whl/cu118/
pip install fbgemm-gpu==1.0.0 --index-url https://download.pytorch.org/whl/cu121/

# FBGEMM_GPU CPU variant
pip install fbgemm-gpu==1.0.0 --index-url https://download.pytorch.org/whl/cpu

Changes

Table batched embedding (TBE) operators

For GPU

  • [New] Ensemble adagrad optimizer (#3197, #2955, #2954, #3161, #3091, #2981, #2889, #3180, #3158)
  • [New] Bounds check in prefetch in TBE training (#3015)
  • [New] Method to update internal hyperparameters for FBGEMM TBE (#3025)
  • [Improvement] Enhanced Global Weight Decay and state tracking (#2904, #2897, #2882, #2896, #2890, #2884, #2883)
  • [Improvement] masked_index_* values index type fix (#2979)
  • [Improvement] generate_vbe_metadata fixes (#3095, #3087)
  • [Improvement] Fixed inefficiency in VBE TBE forward caused by a blocking D2H copy (#2862)
  • [Improvement] Work around offsets and indices type mismatch in TBE training (#3037)
  • [Improvement] Add a host map option for a UVM tensor alloc (#3073)
  • [Improvement] uvm_to_device expose device as interface (#3030)
  • [Improvement] Add Meta backend/dispatcher for new_unified_tensor (#3005)
  • [Improvement] General TBE enhancements and bug fixes (#2892, #3114, #3022, #2958)
  • [Improvement] Consolidate repeat code in TBE inference (#3028)
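
For intuition, the idea behind global weight decay in sparse embedding training can be sketched in pure Python: the decay a dense optimizer would apply every iteration is instead applied lazily, in one shot, when a row is next accessed. The names and the exact compensation formula below are illustrative assumptions, not FBGEMM's implementation:

```python
def lazy_global_weight_decay(weight, last_seen_iter, current_iter, lr, wd):
    """Apply the weight decay a row missed while it was not being touched.

    A dense optimizer would multiply every row by (1 - lr * wd) once per
    iteration; a sparsely accessed embedding row can instead apply the
    accumulated factor in one shot when it is next looked up.
    """
    skipped = current_iter - last_seen_iter
    factor = (1.0 - lr * wd) ** skipped
    return [w * factor for w in weight]

# Row last updated at iteration 10, looked up again at iteration 13.
print(lazy_global_weight_decay([1.0, -2.0], 10, 13, lr=0.1, wd=0.01))
```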

For CPU

  • [New] Add int4 to int4 CPU Sequence TBE kernel (#2996, #2994)
  • [New] Use auto-vec kernel in CPU sequential embedding lookup for int8 tables (#2863, #2878)
  • [Improvement] Work around an OMP barrier issue with MSVC and an unused-variable error (#2918, #3084)

SSD Table batched embedding (TBE) operators

  • [New] Enable pipeline prefetching (#2963)
  • [New] Enable cache line locking support in SSD kernel (#2949)
  • [New] Add L2 flush (#3110)
  • [New] Added SSD ODS and IO/mem stats (#2906, #2913, #3035)
  • [New] Add SSDScratchPadIndicesQueue (#2911, #2948)
  • [New] Integrate L2 cache to TBE operator (#2959, #3032, #3031)
  • [New] Add ssd_update_row_addrs (#2953)
  • [New] Add bounds check in SSD-TBE (#3013)
  • [New] Add 32-bit index support in SSD kernels (#3064)
  • [New] Add kv cache related ops (#3001, #2968)
  • [New] Add compact_indices op (#3075)
  • [New] Create embedding cache interface and impl RocksDB cache (#2858)
  • [New] Reduce prefetch SM usage when using pipeline prefetching (#2991)
  • [New] Add a host map option for a UVM tensor alloc (#3003)
  • [New] Add masked_index_select and refactor masked_index_put (#2910)
  • [Improvement] Add parallelism on cache update (#3062)
  • [Improvement] Add parameter server attributes (#2947)
  • [Improvement] Make the scratch pad tensor UVA (#2844)
  • [Improvement] Use less thread blocks for find_uncached kernel (#3101)
  • [Improvement] Fix stream sync for scratch pad eviction (#2843)
  • [Improvement] Make indices related to cache eviction UVA tensors (#3077)
  • [Improvement] Split cachelib cache into header and src (#3063)
  • [Improvement] Record more functions and logging in SSD TBE (#2854, #2867, #2975)
  • [Improvement] Attach eviction filling logic to set_cache (#3034)
  • [Improvement] Move set_cache and set_async to background thread (#3033)
  • [Improvement] Refactoring vec copy in masked_index_put_kernel (#2861, #2908)
  • [Improvement] Increase memcpy and compute overlap (#2860)
  • [Improvement] Add set_async in background thread (#3036)
  • [Improvement] Make evicted_rows a UVA buffer (#3079)
  • [Improvement] General enhancement and bug fixes (#2937, #2993, #3151, #3089, #2898, #2930)
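
The pipeline prefetching listed above can be sketched generically: fetching the cache entries for batch i+1 overlaps with compute on batch i. This toy version uses a background thread and a one-slot queue, an illustrative simplification of the real CUDA-stream-based design:

```python
import queue
import threading

def run_pipelined(batches, prefetch, compute):
    """Overlap prefetch of batch i+1 with compute on batch i.

    `prefetch` simulates the SSD/cache lookup and `compute` consumes its
    result; a background thread keeps one prefetched batch in flight.
    """
    ready = queue.Queue(maxsize=1)  # at most one batch in flight

    def producer():
        for batch in batches:
            ready.put(prefetch(batch))
        ready.put(None)  # sentinel: no more batches

    threading.Thread(target=producer, daemon=True).start()
    results = []
    while (item := ready.get()) is not None:
        results.append(compute(item))
    return results

print(run_pipelined([1, 2, 3], prefetch=lambda b: b * 10, compute=lambda x: x + 1))
```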

GenAI Support and Operators

  • [New] Decode and Prefill support (#3009)
  • [New] Support rope with block tables (#3146)
  • [New] EP support (#3071)
  • [New] Implement SDPA kernel wrapper to use run_kernel flow for perf (#2820)
  • [Improvement] Move mqa code (#3011)
  • [Improvement] BE improvements to init_comms (#3103)
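
As background for the rope-with-block-tables work: a rotary position embedding (RoPE) rotates each (even, odd) feature pair by a position-dependent angle, and with a paged KV cache the token's absolute position is recovered via the block table before the rotation. A minimal pure-Python sketch of the rotation itself (names are illustrative):

```python
import math

def apply_rope(pairs, position, base=10000.0):
    """Rotate each (even, odd) feature pair by a position-dependent angle."""
    dim = 2 * len(pairs)
    out = []
    for i, (x, y) in enumerate(pairs):
        theta = position / (base ** (2 * i / dim))
        c, s = math.cos(theta), math.sin(theta)
        out.append((x * c - y * s, x * s + y * c))
    return out

# Position 0 rotates by angle zero, leaving the vector unchanged.
print(apply_rope([(1.0, 0.0), (0.0, 1.0)], position=0))
```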

Triton GEMM support

  • [New] Enable torch.compile compatibility for triton fp8 rowwise gemm (#2978)
  • [New] Add 3D+ input support for fp8 rowwise GEMM (#2845)
  • [New] GEMM custom op enablement (#3046)
  • [Improvement] Add fused bias to Triton FP8 Rowwise Kernels (#2852)
  • [Improvement] Triton dependency (#3027)
  • [Improvement] Fix triton fp8 handling of non-contiguous inputs (#2919)
  • [Improvement] More autotune configs and bug fixes in TMA kernel (#3078, #3066, #3072)
  • [Improvement] FP8 GEMM tweak for 405B decoding (#3104)
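
For intuition on the rowwise scheme these kernels implement: each row of A and each column of B carries its own scale, the low-precision matmul runs on the scaled values, and the output is rescaled by the product of the two scales. The sketch below uses int8 as a stand-in for FP8, since the scaling arithmetic is the same:

```python
def quantize_rowwise(matrix, qmax=127):
    """Symmetric per-row quantization: each row gets its own scale."""
    q, scales = [], []
    for row in matrix:
        s = (max(abs(v) for v in row) / qmax) or 1.0  # guard all-zero rows
        scales.append(s)
        q.append([round(v / s) for v in row])
    return q, scales

def rowwise_scaled_gemm(a, b_t):
    """C = A @ B with per-row scales on A and per-column scales on B.

    B is passed transposed, so its rows are B's columns; each integer
    dot product is rescaled by the product of the two scales.
    """
    qa, sa = quantize_rowwise(a)
    qb, sb = quantize_rowwise(b_t)
    k = len(a[0])
    return [[sum(qa[i][x] * qb[j][x] for x in range(k)) * sa[i] * sb[j]
             for j in range(len(b_t))] for i in range(len(a))]

print(rowwise_scaled_gemm([[1.0, 2.0]], [[3.0, 4.0]]))  # close to [[11.0]]
```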

FP8 and other Quantization support

  • [New] CK FP8 optimizations and fixes (#2940, #2912, #2987, #3017, #2893)
  • [New] FP8 kernel development and enablement (#2866)
  • [New] GenAI CK Version update and integration (#2865, #2971)
  • [Improvement] Also hipify the fp8 related cuda functions (#2834)
  • [Improvement] Auto-generation of CUTLASS Extension Kernel Templates (#2932)
  • [Improvement] Marlin Mixed Input Kernel Productionization (#3008)
  • [Improvement] Remove redundant torch.abs (#3020, #2822 )
  • [Improvement] Tuning for 405B/70B Prefill with small seqlen (#3042)
  • [Improvement] Added new instances for 405B decoding (#2936)

Permute and Pooled Embeddings Ops

  • [New] Implementation of permute_multi_embedding (#2833)
  • [Improvement] Clean up and removal of unused exception (#2832, #2891)
  • [Improvement] Use at::parallel_for in cpu kernel (#2817)
  • [Improvement] Add dispatch_to_cpu for the operators (#2874, #2881)
  • [Improvement] Print the exact variable values triggering the alert in Merge Pooled Embedding (#3038)
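
The permute_multi_embedding op reorders concatenated per-table embedding segments. A pure-Python sketch of the semantics (the real op operates on batched tensors and supports autograd; names here are illustrative):

```python
def permute_pooled_embeddings(flat, lengths, permutation):
    """Reorder concatenated per-table embedding segments.

    flat:        concatenation of per-table pooled embeddings
    lengths:     embedding width of each table, in original order
    permutation: desired table order
    """
    starts, pos = [], 0
    for n in lengths:
        starts.append(pos)
        pos += n
    out = []
    for t in permutation:
        out.extend(flat[starts[t]:starts[t] + lengths[t]])
    return out

# Tables of widths 2, 1, and 3, reordered as (2, 0, 1).
print(permute_pooled_embeddings([1, 2, 3, 4, 5, 6], [2, 1, 3], [2, 0, 1]))
```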

Sparse Operators

  • [New] Support original indices for FBGEMM block bucketization flag (#2999, #2925)
  • [Improvement] Fix pack_segments backward when grad is non-contig (#3006)
  • [Improvement] Fix FBGEMM_GPU_MEMCHECK in sparse_ops_cuda (#2943)
  • [Improvement] Update sparse_ops.py to use the generic GPU target fbgemm_gpu:input_combine to support both NVIDIA and AMD (#2905)
  • [Improvement] Add abstract impl and functions (#2962, #2983, #3000)
  • [Improvement] Use guard_size_oblivious in tbe_input_combine_abstract fake kernel (#2923)
  • [Improvement] Out variant for asynchronous_exclusive_cumsum_cpu + some more static dispatch kernels (#3090)
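
For context on the block-bucketization flag: block bucketization routes each embedding index to the bucket that owns its contiguous block of ids, and the new flag optionally preserves the original (global) indices instead of bucket-local offsets. A pure-Python sketch of that behavior, with illustrative names:

```python
def block_bucketize(indices, block_size, num_buckets, keep_original=False):
    """Route each embedding index to the bucket owning its id block.

    By default each index is rewritten as a bucket-local offset; with
    keep_original=True the global index is preserved instead.
    """
    buckets = [[] for _ in range(num_buckets)]
    for idx in indices:
        b = idx // block_size
        buckets[b].append(idx if keep_original else idx - b * block_size)
    return buckets

print(block_bucketize([0, 5, 3, 7], block_size=4, num_buckets=2))
print(block_bucketize([0, 5, 3, 7], block_size=4, num_buckets=2, keep_original=True))
```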

Quantize ops

  • [New] Add a CPU nbit to float dequantization op that supports torch.quintMxN type (#2995)
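
To illustrate this family of dequantization ops: in a fused row-wise layout, each row stores its quantized codes followed by a per-row scale and bias, and dequantization is a multiply-add per element. The 8-bit sketch below uses an assumed float32 scale/bias layout for illustration; the new op additionally handles sub-byte (nbit) types:

```python
import struct

def dequantize_8bit_rowwise(row_bytes, width):
    """Dequantize one fused row: `width` uint8 codes followed by a
    float32 scale and float32 bias (assumed layout, for illustration)."""
    codes = list(row_bytes[:width])
    scale, bias = struct.unpack("<ff", bytes(row_bytes[width:width + 8]))
    return [c * scale + bias for c in codes]

row = bytes([0, 128, 255]) + struct.pack("<ff", 0.5, -1.0)
print(dequantize_8bit_rowwise(row, 3))
```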

MX4 Ops

  • [New] Optimize FBGEMM Triton MX4 Quantize-Dequantize (#2838, #2837)
  • [New] Rounding Mode Support (#2821, #2816, #2933, #2859)
  • [New] FBGEMM/TorchRec MX4 padding support (#3055, #3047, #3010)
  • [New] Add Stochastic downcasting to MX4 Quantization (#2899)
  • [New] Support for other MX4 formats in Triton kernels (#2900)
  • [Improvement] Refactor MX4 Kernel to operate on flat tensors (#2836)
  • [Improvement] Optimize MX4 padding to minimize need for tuning (#3040)
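
For intuition on MX4: each group of elements shares one power-of-two scale, and each element is stored as a 4-bit FP4 (e2m1) value. A pure-Python round-to-nearest sketch, with the stochastic-rounding and padding variants listed above omitted:

```python
import math

# Magnitudes representable by a 4-bit e2m1 (FP4) element.
E2M1 = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_mx4_group(group):
    """Quantize one group to a shared power-of-two scale plus FP4 values."""
    amax = max(abs(v) for v in group)
    if amax == 0.0:
        return 1.0, [0.0] * len(group)
    # Shared exponent chosen so the largest magnitude lands near FP4's max (6.0).
    scale = 2.0 ** (math.floor(math.log2(amax)) - 2)
    q = [math.copysign(min(E2M1, key=lambda m: abs(m - abs(v) / scale)), v)
         for v in group]
    return scale, q

def dequantize_mx4_group(scale, q):
    return [scale * v for v in q]

scale, q = quantize_mx4_group([0.1, -0.6, 1.2, 0.0])
print(scale, q, dequantize_mx4_group(scale, q))
```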

Benchmarks / Tests

  • [New] Add schema compatibility test (#3130)
  • [New] Add SSD/UVM caching in TBE device benchmark (#3076)
  • [New] Add EmbeddingSpMDM8BitBenchmarkOutTypeFloat16 (#2952)
  • [New] Add benchmark EmbeddingSpMDMNBitBenchmarkOutTypeFloat16 (#2901)
  • [New] Add unit test for int4 to int4 sequence CPU TBE (#2997)
  • [New] Add rocm support for fp8 benchmarks (#2965)
  • [New] Add rotating buffer feature to quantize_bench (#2857)
  • [New] Benchmark of fbgemm op - permute_multi_embedding (#2828)
  • [New] Add test for supporting torch.float16 and torch.bfloat16 (#2992)
  • [Improvement] Fix logging and remove sync points in benchmarks (#3149, #3113, #2855)
  • [Improvement] Update TBE training benchmark (#3112, #3074, #3051)
  • [Improvement] Improve ssd-training benchmark (#2850, #3004, #3069, #2989)
  • [Improvement] Fix segfault in ssd training unit tests (#2929)
  • [Improvement] Fixes on genai tests (#2864, #2885, #2970, #2849, #2869)
  • [Improvement] Fix minor issues in EmbeddingSpMDMNBitBenchmark (#2894)
  • [Improvement] Fix test skipping for UVM tests (#3016)
  • [Improvement] Fix failures_dict_fast.json in TBE inference test (#3024, #3060)

Build / CI improvements and Better Engineering
