Release Notes

Highlights

TBE training v2 (optimized TBE forward: up to 4x kernel performance improvement)
Many TBE extensions, including a defused TBE backward-optimizer, variable batch size support, and pipeline prefetching support for UVM caching
Many improvements and new sparse ops added
ARM support
SM 9.0 support for CUDA 12.1 for H100 GPUs
PyTorch 2 support for various operators, e.g., jagged tensor and pooled embedding ops

Software Requirements

FBGEMM_GPU v0.5.0 has been tested and is known to work on the following setups:

PyTorch: v2.1
CUDA: v11.8, 12.1
Python: v3.8, 3.9, 3.10, 3.11

It is recommended to prepare an isolated environment for installing and running FBGEMM_GPU, such as Conda and/or Docker.
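
As a quick sanity check before installing, the tested matrix above can be encoded in a few lines. This is only a minimal sketch: the names `TESTED` and `is_tested_setup` are hypothetical and not part of FBGEMM_GPU, and a setup outside the matrix may still work; it simply was not tested for this release.

```python
# Versions listed as tested for FBGEMM_GPU v0.5.0 in these notes.
TESTED = {
    "python": {(3, 8), (3, 9), (3, 10), (3, 11)},
    "pytorch": {(2, 1)},
    "cuda": {(11, 8), (12, 1)},
}


def _major_minor(version):
    """Turn a version string such as '2.1.0+cu121' into (2, 1)."""
    major, minor = version.split("+")[0].split(".")[:2]
    return int(major), int(minor)


def is_tested_setup(python, pytorch, cuda=None):
    """Return True if the given versions fall inside the tested matrix.

    `cuda=None` stands for a CPU-only install, which has no CUDA
    version to check.
    """
    if _major_minor(python) not in TESTED["python"]:
        return False
    if _major_minor(pytorch) not in TESTED["pytorch"]:
        return False
    return cuda is None or _major_minor(cuda) in TESTED["cuda"]
```

In practice the arguments would likely come from `platform.python_version()`, `torch.__version__`, and `torch.version.cuda`.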

Availability

FBGEMM_GPU can be fetched directly from PyPI:

# FBGEMM_GPU CUDA variant (only the CUDA 12.1 variant is available)
pip install fbgemm-gpu==0.5.0

# FBGEMM_GPU CPU variant
pip install fbgemm-gpu-cpu==0.5.0

Alternatively, it can be fetched from the PyTorch package index:

# FBGEMM_GPU CUDA variant
pip install fbgemm-gpu==0.5.0 --index-url https://download.pytorch.org/whl/cu118/
pip install fbgemm-gpu==0.5.0 --index-url https://download.pytorch.org/whl/cu121/

# FBGEMM_GPU CPU variant
pip install fbgemm-gpu==0.5.0 --index-url https://download.pytorch.org/whl/cpu
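
For scripted setups, the commands above can be generated from the chosen variant. The sketch below is one possible way to do that; `pip_install_args` and its parameters are hypothetical names introduced for illustration, while the package names and index URLs are copied from the commands in this section.

```python
PACKAGE_VERSION = "0.5.0"

# Index URLs exactly as listed in these release notes.
INDEX_URLS = {
    "cu118": "https://download.pytorch.org/whl/cu118/",
    "cu121": "https://download.pytorch.org/whl/cu121/",
    "cpu": "https://download.pytorch.org/whl/cpu",
}


def pip_install_args(variant, source="pytorch"):
    """Build the pip argument list for one FBGEMM_GPU v0.5.0 variant.

    variant: "cu118", "cu121", or "cpu"
    source:  "pytorch" (PyTorch package index) or "pypi"
    """
    if variant not in INDEX_URLS:
        raise ValueError(f"unknown variant: {variant!r}")
    if source == "pytorch":
        # Same package name for every variant; the index URL selects the build.
        return ["pip", "install", f"fbgemm-gpu=={PACKAGE_VERSION}",
                "--index-url", INDEX_URLS[variant]]
    if source == "pypi":
        if variant == "cu118":
            # PyPI carries only the CUDA 12.1 build of the CUDA variant.
            raise ValueError("the CUDA 11.8 build is not available on PyPI")
        name = "fbgemm-gpu-cpu" if variant == "cpu" else "fbgemm-gpu"
        return ["pip", "install", f"{name}=={PACKAGE_VERSION}"]
    raise ValueError(f"unknown source: {source!r}")
```

The returned list can be passed to `subprocess.run(..., check=True)`; inside a Conda or Docker environment it may be safer to replace the leading `"pip"` with `sys.executable, "-m", "pip"` so the intended interpreter is used.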

Changes

Table batched embedding (TBE) operators

[Improvement] TBE training v2 (optimized TBE forward: up to 4x kernel performance improvement) (#1641, #1804, #1787, #1904)
[New] Variable batch size support for TBE training (#1653, #1752, #1633, #1634, #1713, #1717, #1943)
[New] BFloat16 support for TBE CPU (#1839, #1851)
[New] Defused TBE backward-optimizer and SplitTBE optimizer (#1819, #1820, #1821)
[New] Max norm support for rowwise_adagrad (#1781)
[New] Support for 1024-2048 embedding dimension in TBE inference (#1656)
[Improvement] Backends via PyTorch dispatcher (#1948, #1976)
[Improvement] Deprecate many TBE optimizers (#1766, #1767, #1771, #1796, #1774, #1773, #1775, #1791, #1793)
[New] TBE UVM cache pipeline prefetching (#1883, #1893)

Jagged Tensor Operators

[New] New jagged tensor operators (#1690)
[New] Backends (Meta) (#1880, #1960)
[Improvement] Jagged operator optimizations (#1643, #1646, #1644, #1661, #1662, #1691, #1692, #1777)
[Improvement] Symbolic shape tracing on jagged operators for PyTorch 2 (#1758)
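
For readers new to the term, a jagged tensor stores variable-length rows as one flat values array plus an offsets array marking where each row starts and ends. The sketch below is plain Python that illustrates only this layout and a padded dense conversion. It does not call FBGEMM_GPU, and the function name `jagged_to_padded` is hypothetical.

```python
def jagged_to_padded(values, offsets, max_len, pad=0):
    """Expand a (values, offsets) pair into fixed-width rows.

    Row i holds values[offsets[i]:offsets[i + 1]], truncated or
    padded with `pad` to exactly `max_len` entries.
    """
    rows = []
    for start, end in zip(offsets[:-1], offsets[1:]):
        row = list(values[start:end])[:max_len]
        rows.append(row + [pad] * (max_len - len(row)))
    return rows


# Three rows of lengths 2, 0, and 4 packed into one flat list.
values = [1, 2, 3, 4, 5, 6]
offsets = [0, 2, 2, 6]
print(jagged_to_padded(values, offsets, max_len=3))
# prints [[1, 2, 0], [0, 0, 0], [3, 4, 5]]
```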

Index Select Operators

[New] batch_index_select_dim0 with TBE backend (#1897)
[New] Variable input sizes support for group_index_select_dim0 (#1968)
[Improvement] Improve group_index_select (#1764, #1884)

Low-precision operators

[New] Meta Backend FP8RowwiseQuantizedToFloat (#1890)
[New] Column-wise parallel quantization/dequantization (#1743)
[New] BF16 Support in FP8 quantize ops (#1961)
[Improvement] FP8 row-wise quantization optimization/improvement (#1729, #1858, #1981, #1909)

Pooled Embedding

[New] reduce_to_one (#1571)
[New] permute_duplicate_pooled_embeddings op (#1912)
[New] BF16 support for permute_pooled_embeddings op (#1937)
[New] Variable size input-output support for permute_pooled_embs_kernel (#1913)
[New] Backends (Meta) (#1853)
[Improvement] multi-gpu all_to_one enhancements (#1674, #1962)

Misc

[New] CUB kernel for 2D asynchronous_complete_cumsum (#1707)
[New] Backends (Meta) (#1709, #1905, #1970, #1971)
[New] BF16 support in permute_indices_weights_kernel_2 (#1852)
[New] FP16 and BF16 support in pack_segments (#1708)
[New] BF16 support for HBC ops (#1744)
[New] BFloat16 support (#1832, #1865)
[Improvement] Speed up reorder_batched_ad_indices (#1901, #1902, #1932, #1933, #1711)
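
As background for the asynchronous_complete_cumsum entry above: a "complete" cumulative sum is understood here to mean a prefix sum with a leading zero, so n lengths yield n + 1 offsets. The sketch below is a pure-Python reference for that assumed semantics, applied row by row for the 2D case. It is not the CUB kernel, and `complete_cumsum` and `complete_cumsum_2d` are hypothetical names.

```python
from itertools import accumulate


def complete_cumsum(lengths):
    """Prefix sums with a leading zero: n lengths -> n + 1 offsets."""
    return [0] + list(accumulate(lengths))


def complete_cumsum_2d(rows):
    """Apply complete_cumsum independently to each row of a 2D input."""
    return [complete_cumsum(row) for row in rows]


print(complete_cumsum([2, 0, 4]))
# prints [0, 2, 2, 6]
```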

Benchmarks / Tests

[New] CLI support for GEMMsBenchmark (#1721, #1725)
[New] Benchmark for variable batch on TBE (#1559)
[New] BF16 output test coverage (#1835, #1838)
[New] Benchmark for reorder_batched_ad_indices (#1895)
[New] CPU support (#1874, #1926)
[Improvement] GroupIndexSelect Benchmark with zero_grad (#1559)
[Improvement] Add nbit-cpu-with-spec benchmark in FBGEMM-GPU's TBE benchmark suite (#1892)

Build / CI improvements and Fixes

[New] C++17 support for FBGEMM and FBGEMM_GPU OSS builds (#1652)
[New] ARM Support in OSS CI (#1813)
[New] SM 9.0 Support for CUDA 12.1 (#1825, #2002)
[Improvement] General CI and build system enhancement (#1658, #1695, #1697, #1702, #1719, #1751, #1784, #1795, #1836, #1958, #2020, #2024)
[Improvement] Reorganized code to enable faster builds (#1843, #1849, #1856, #1860, #1863, #1864, #1866, #1886, #1694, #1705, #1710, #1723, #1757, #1783, #1871, #1873, #1879, #1944, #1816, #1753)

pytorch/FBGEMM v0.5.0 FBGEMM_GPU v0.5.0 on GitHub