github pytorch/FBGEMM v0.5.0
FBGEMM_GPU v0.5.0

Release Notes

Highlights

  • TBE training v2 (optimized TBE forward: up to 4x kernel performance improvement)
  • Many TBE extensions, including a defused TBE backward-optimizer, variable batch size support, and pipeline prefetching support for UVM caching
  • Many improvements and new sparse ops added
  • ARM support
  • SM 9.0 support for CUDA 12.1 for H100 GPUs
  • PyTorch 2 support for various operators, e.g., jagged tensor and pooled embedding ops

Software Requirements

FBGEMM_GPU v0.5.0 has been tested and is known to work on the following setups:

  • PyTorch: v2.1
  • CUDA: v11.8, 12.1
  • Python: v3.8, 3.9, 3.10, 3.11

It is recommended to prepare an isolated environment for installing and running FBGEMM_GPU, such as Conda and/or Docker.
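
The tested setups above can be checked programmatically before installing. The following is a minimal sketch using only the Python standard library; the helper names (`python_is_tested`, `torch_is_tested`, `installed_version`) are hypothetical and not part of FBGEMM_GPU, and the check assumes that any PyTorch 2.1.x build counts as "v2.1".

```python
import sys
from importlib import metadata

# Setups listed as tested for FBGEMM_GPU v0.5.0 in these release notes.
TESTED_PYTHON = {(3, 8), (3, 9), (3, 10), (3, 11)}
TESTED_TORCH_PREFIX = "2.1"


def python_is_tested(version_info=sys.version_info):
    """Return True if the interpreter's major.minor is in the tested list."""
    return (version_info[0], version_info[1]) in TESTED_PYTHON


def torch_is_tested(version):
    """Return True if a version string such as '2.1.0+cu121' is a 2.1 release."""
    # Assumption: every 2.1.x patch release is treated as "v2.1".
    return version == TESTED_TORCH_PREFIX or version.startswith(TESTED_TORCH_PREFIX + ".")


def installed_version(package):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    print("Python tested:", python_is_tested())
    torch_version = installed_version("torch")
    print("torch:", torch_version)
    if torch_version is not None:
        print("torch tested:", torch_is_tested(torch_version))
    print("fbgemm-gpu:", installed_version("fbgemm-gpu"))
```

A setup outside these lists may still work; the check only reports whether it matches what the release notes describe as tested.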

Availability

FBGEMM_GPU can be fetched directly from PyPI:

# FBGEMM_GPU CUDA variant (only the CUDA 12.1 variant is available)
pip install fbgemm-gpu==0.5.0

# FBGEMM_GPU CPU variant
pip install fbgemm-gpu-cpu==0.5.0

Alternatively, it can be fetched from the PyTorch PIP index:

# FBGEMM_GPU CUDA variant
pip install fbgemm-gpu==0.5.0 --index-url https://download.pytorch.org/whl/cu118/
pip install fbgemm-gpu==0.5.0 --index-url https://download.pytorch.org/whl/cu121/

# FBGEMM_GPU CPU variant
pip install fbgemm-gpu==0.5.0 --index-url https://download.pytorch.org/whl/cpu

Changes

Table batched embedding (TBE) operators

Jagged Tensor Operators

Index Select Operators

  • [New] batch_index_select_dim0 with TBE backend (#1897)
  • [New] Variable input sizes support for group_index_select_dim0 (#1968)
  • [Improvement] Improve group_index_select (#1764, #1884)
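
For readers unfamiliar with grouped index select, the idea can be sketched in plain Python: each input in a group is indexed along dimension 0 with its own index list, and inputs in the same group may have different shapes. This is only a reference illustration of the semantics as understood here, using nested lists instead of tensors; `group_index_select_dim0_reference` is a hypothetical name and does not reproduce the FBGEMM_GPU implementation or its exact signature.

```python
def group_index_select_dim0_reference(input_group, indices_group):
    """Select rows (dim 0) from each input using the matching index list.

    input_group:   list of 2D inputs, each a list of rows
    indices_group: list of index lists, one per input
    """
    if len(input_group) != len(indices_group):
        raise ValueError("input_group and indices_group must have the same length")
    return [
        [rows[i] for i in indices]
        for rows, indices in zip(input_group, indices_group)
    ]


# Two group members with different shapes (3x2 and 2x3).
a = [[1, 2], [3, 4], [5, 6]]
b = [[10, 20, 30], [40, 50, 60]]
selected = group_index_select_dim0_reference([a, b], [[2, 0], [1, 1, 0]])
print(selected)
```

Handling the whole group in one call is what lets a GPU implementation avoid launching one kernel per input; the sketch shows only the result each member would produce.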

Low-precision operators

  • [New] Meta Backend FP8RowwiseQuantizedToFloat (#1890)
  • [New] Column-wise parallel quantization/dequantization (#1743)
  • [New] BF16 Support in FP8 quantize ops (#1961)
  • [Improvement] FP8 row-wise quantization optimization/improvement (#1729, #1858, #1981, #1909)

Pooled Embedding

  • [New] reduce_to_one (#1571)
  • [New] permute_duplicate_pooled_embeddings op (#1912)
  • [New] BF16 support for permute_pooled_embeddings op (#1937)
  • [New] Variable size input-output support for permute_pooled_embs_kernel (#1913)
  • [New] Backends (Meta) (#1853)
  • [Improvement] multi-gpu all_to_one enhancements (#1674, #1962)

Misc

  • [New] CUB kernel for 2D asynchronous_complete_cumsum (#1707)
  • [New] Backends (Meta) (#1709, #1905, #1970, #1971)
  • [New] BF16 support in permute_indices_weights_kernel_2 (#1852)
  • [New] FP16 and BF16 support in pack_segments (#1708)
  • [New] BF16 support for HBC ops (#1744)
  • [New] BFloat16 support (#1832, #1865)
  • [Improvement] Speed up reorder_batched_ad_indices (#1901, #1902, #1932, #1933, #1711)
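
As background for the asynchronous_complete_cumsum entry above: a "complete" cumulative sum prepends a zero, so an input of N lengths yields N + 1 offsets. The sketch below is a pure-Python reference for that behavior, not the CUB kernel itself; the function names are hypothetical, and the 2D case is assumed to apply the same rule independently to each row.

```python
from itertools import accumulate


def complete_cumsum_1d(lengths):
    """Return [0, l0, l0+l1, ...]: N lengths in, N + 1 offsets out."""
    return [0] + list(accumulate(lengths))


def complete_cumsum_2d(rows):
    """Assumed 2D behavior: a complete cumsum computed per row."""
    return [complete_cumsum_1d(row) for row in rows]


offsets = complete_cumsum_1d([7, 8, 2, 1, 0, 9, 4])
print(offsets)  # [0, 7, 15, 17, 18, 18, 27, 31]

batched = complete_cumsum_2d([[1, 2, 3], [4, 0, 5]])
print(batched)  # [[0, 1, 3, 6], [0, 4, 4, 9]]
```

The leading zero is what makes the result directly usable as offsets: the segment for item i spans `offsets[i]` to `offsets[i + 1]`.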

Benchmarks / Tests

  • [New] CLI support to GEMMsBenchmark (#1721, #1725)
  • [New] Benchmark for variable batch on TBE (#1559)
  • [New] BF16 output test coverage (#1835, #1838)
  • [New] Benchmark for reorder_batched_ad_indices (#1895)
  • [New] CPU support (#1874, #1926)
  • [Improvement] GroupIndexSelect Benchmark with zero_grad (#1559)
  • [Improvement] Add nbit-cpu-with-spec benchmark in FBGEMM-GPU's TBE benchmark suite (#1892)

Build / CI improvements and Fixes