FBGEMM_GPU v0.6.0

Release Notes

Highlights

  • Improvements and bug fixes for TBE variable batch size
  • Many TBE extensions and benchmarks
  • Enhanced TBE pipeline prefetching for UVM caching
  • Code refactoring and reorganization for faster builds
  • Many improvements and new sparse ops added
  • Improved low precision ops
  • Support for Python 3.12
  • PyTorch 2 support for various operators

Software Requirements

FBGEMM_GPU v0.6.0 has been tested and is known to work on the following setups:

  • PyTorch: v2.2
  • CUDA: v11.8, 12.1
  • Python: v3.8, 3.9, 3.10, 3.11, 3.12

It is recommended to install and run FBGEMM_GPU in an isolated environment, such as Conda and/or Docker.
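
A quick way to confirm that the local PyTorch and CUDA combination matches one of the tested setups is shown below (a minimal sketch using standard PyTorch version introspection only; nothing here is FBGEMM-specific):

import torch

# Expect a 2.2.x build for FBGEMM_GPU v0.6.0.
print(torch.__version__)

# For the CUDA variant, expect 11.8 or 12.1 and at least one visible GPU.
print(torch.version.cuda)
print(torch.cuda.is_available())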

Availability

FBGEMM_GPU can be fetched directly from PyPI:

# FBGEMM_GPU CUDA variant (only the CUDA 12.1 variant is available on PyPI)
pip install fbgemm-gpu==0.6.0

# FBGEMM_GPU CPU variant
pip install fbgemm-gpu-cpu==0.6.0

Alternatively, it can be fetched from the PyTorch PIP package indexes:

# FBGEMM_GPU CUDA variant
pip install fbgemm-gpu==0.6.0 --index-url https://download.pytorch.org/whl/cu118/
pip install fbgemm-gpu==0.6.0 --index-url https://download.pytorch.org/whl/cu121/

# FBGEMM_GPU CPU variant
pip install fbgemm-gpu==0.6.0 --index-url https://download.pytorch.org/whl/cpu
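
After installation, a short smoke test can confirm that the package loads and registers its operators (a minimal sketch; the specific operator printed here, jagged_to_padded_dense, is just one example of the ops registered under torch.ops.fbgemm):

import torch
import fbgemm_gpu  # noqa: F401  # importing registers the FBGEMM operators with PyTorch

# If the import succeeds, FBGEMM ops resolve under the torch.ops.fbgemm namespace.
print(torch.ops.fbgemm.jagged_to_padded_dense)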

Changes

Table Batched Embedding (TBE) Operators

  • [Improvement] Extended support and bug fixes for variable batch size (#2012, #2043, #2107, #2150, #2188)
  • [Improvement] Enhanced caching and cache lookup for pipeline prefetching (#2147, #2154, #2151)
  • [New] Support MTIA device type in FBGEMM TBE training (#1994)
  • [New] Enable sequence TBE CPU via AVX (#2195)
  • [New] Enable subwarp only for unweighted (#2051)
  • [New] Add meta functions (#2094, #2102)
  • [New] Add reverse qparam option for MTIA (#2109)
  • [New] Add uvm_cache_stats for the direct-mapped cache (#1951, #1952)
  • [Improvement] Use memcpy for CPU embedding in-place update (#2166)
  • [Improvement] Remove indices and offsets copying from prefetch (#2186)
  • [Improvement] Improve perf for L=0 cases for TBE v2 (#2046)
  • [Improvement] General fixes and enhancements (#2030, #2009)
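
The changes above all target the TBE training operator. For orientation, a minimal CPU usage sketch follows; the module paths, constructor defaults, and input layout are assumptions based on the fbgemm_gpu Python package layout around this release, so treat it as illustrative rather than authoritative:

import torch
from fbgemm_gpu.split_table_batched_embeddings_ops_common import EmbeddingLocation
from fbgemm_gpu.split_table_batched_embeddings_ops_training import (
    ComputeDevice,
    SplitTableBatchedEmbeddingBagsCodegen,
)

# One embedding table: 1000 rows of dimension 8, held in host memory and computed on CPU.
emb_op = SplitTableBatchedEmbeddingBagsCodegen(
    embedding_specs=[(1000, 8, EmbeddingLocation.HOST, ComputeDevice.CPU)],
)

# Two pooled lookups (bags) in CSR-style form: rows {1, 4, 7} and row {3}.
indices = torch.tensor([1, 4, 7, 3], dtype=torch.int64)
offsets = torch.tensor([0, 3, 4], dtype=torch.int64)

pooled = emb_op(indices=indices, offsets=offsets)  # shape [2, 8]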

Jagged Tensor Operators

  • [Improvement] Fix incorrect SymInt signature on dense_to_jagged (#2039)
  • [Improvement] Fix non-contiguous tensor problem in jagged_index_select (#2060, #2061)
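
Both fixes touch the jagged/dense conversion path. A small round-trip sketch is shown below (assuming the jagged_to_padded_dense and dense_to_jagged signatures used here; a values tensor plus per-row offsets is the jagged representation these ops operate on):

import torch
import fbgemm_gpu  # noqa: F401  # registers the torch.ops.fbgemm operators

# Jagged tensor with 3 rows of lengths 2, 0, and 3: a values tensor plus offsets.
values = torch.arange(5, dtype=torch.float32).unsqueeze(1)  # [total_L, D] with D = 1
offsets = torch.tensor([0, 2, 2, 5], dtype=torch.int64)

# Pad every row to length 3, filling with 0.0.
padded = torch.ops.fbgemm.jagged_to_padded_dense(values, [offsets], [3], 0.0)
print(padded.shape)  # torch.Size([3, 3, 1])

# Convert back to the jagged (values, offsets) representation.
jagged_values, jagged_offsets = torch.ops.fbgemm.dense_to_jagged(padded, [offsets])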

Index Select Operators

  • [Improvement] Get total D from CPU buffer in batch_index_select_dim0 (#2079)

Low-Precision Operators

  • [New] Add BF16 in padded FP8 quantize ops (#2010)
  • [Improvement] Improve quantize_comm error message (#2018)
  • [Improvement] Fix illegal memory access error and initialize empty values on fp8 quantize kernel (#2131, #2176)
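
The FP8 work above sits in the same family as FBGEMM's fused row-wise quantization ops. As a point of reference, a round trip through the existing int8 row-wise variants (the op names below are the fused 8-bit ops, not the padded FP8 ops referenced in the changes):

import torch
import fbgemm_gpu  # noqa: F401  # registers the torch.ops.fbgemm operators

x = torch.randn(4, 16, dtype=torch.float32)

# Quantize row-wise to uint8; each row stores its own scale and bias alongside the data.
q = torch.ops.fbgemm.FloatToFused8BitRowwiseQuantized(x)
y = torch.ops.fbgemm.Fused8BitRowwiseQuantizedToFloat(q)

print((x - y).abs().max())  # small round-trip error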

Pooled Embedding

  • [New] Add permute_duplicate_pooled_embeddings op for CPU (#1939)
  • [Improvement] Use PyTorch's p2p access enable function (#2000)
  • [New] Add support for duplicate in permutations for permute_pooled_embs_split (#1940)
  • [Improvement] Improve all_to_one error message (#2019)
  • [New] Add meta function for fbgemm::merge_pooled_embeddings operator (#2069)
  • [New] Add variable batch per feature support to EBC (tw/cw only) (#1986)

Misc

Benchmarks / Tests

  • [New] Benchmark block_bucketize_sparse_features uneven sharding (#2140, #2169)
  • [New] Add unit test for unique cache lookup (#2160)
  • [New] Add autogenerated opcheck tests (#2050, #2069, #2073, #2092, #2118, #2139, #2152, #2173, #2193)
  • [New] Add tests for FBGEMM ops (#2136, #2082)
  • [Improvement] Modify the TBE testbench to use the FBGEMM generate_requests function to generate indices and offsets (#1882)
  • [Improvement] Remove FP64 from TBE CPU tests (#2049)
  • [Improvement] Add warmup_runs to TBE benchmarks and run at least one warmup iteration (#2163)
  • [Improvement] Add --pooling in TBE nbit_cpu benchmark (#2200)
  • [Improvement] Fill embedding tables with randomized scales and bias in split-TBE benchmarks (#2031)

Build / CI Improvements and Fixes
