Highlights
TBE GPU
- Added support for int64_t table indices and offsets in TBE inference
- Added support for int32_t indices in TBE training
- Extended TBE support for larger embedding dimensions
- Made the learning rate a tensor value
- Improved indices bounds checking
TBE CPU
- Improved ARM support with SVE implementations for matrix multiplication and float matrix transpose
- Improved the EmbeddingSpMDMAutovec API
- Migrated FP32 ops to OSS
TBE SSD
- Enabled VBE in SSD-TBE
- Async initialization of RocksDB SSD tensors and padding before writing to RocksDB
- Improved checking of indices bounds and other constraints
Gen AI Ops
- Added nccl_alltoall function
- Custom allgather now supports multiple dtypes, with dtype checking to prevent silent failures
ROCm
- Added CK FP8 Batched GEMM and Rowwise GEMM kernels, along with heuristic tuning
- Fixed CK FP8 rowwise quantization for some GEMM shapes
- Introduced HIP-specific optimizations to the TBE forward and backward passes
SLL ops
- Migrated Sequence Learning Library (SLL) ops to OSS
Better Engineering
- Restructured the build to produce multiple smaller shared libraries instead of a single large binary
- New and improved tests and benchmarks
- Improved ROCm build variant support
- Added build support for CUDA 12.6 and Python 3.13
Software Requirements
FBGEMM_GPU v1.1.0 has been tested and is known to work on the following setups:
- PyTorch: v2.6
- CUDA: v11.8, 12.4, 12.6
- Python: v3.9, 3.10, 3.11, 3.12, 3.13
It is recommended to install and run FBGEMM_GPU in an isolated environment, such as a Conda environment or a Docker container.
Availability
FBGEMM_GPU can be fetched directly from PyPI:
# FBGEMM_GPU CUDA variant (only the CUDA 12.4 variant is available)
pip install fbgemm-gpu==1.1.0
# FBGEMM_GPU CPU variant
pip install fbgemm-gpu-cpu==1.1.0
Alternatively, it can be fetched from PyTorch PIP:
# FBGEMM_GPU CUDA variant
pip install fbgemm-gpu==1.1.0 --index-url https://download.pytorch.org/whl/cu118/
pip install fbgemm-gpu==1.1.0 --index-url https://download.pytorch.org/whl/cu124/
pip install fbgemm-gpu==1.1.0 --index-url https://download.pytorch.org/whl/cu126/
# FBGEMM_GPU CPU variant
pip install fbgemm-gpu==1.1.0 --index-url https://download.pytorch.org/whl/cpu
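As a quick post-install sanity check, the sketch below imports the package and calls one of the registered operators. The specific op used (asynchronous_complete_cumsum) is chosen only as an example of a long-standing FBGEMM sparse op available in both the CPU and CUDA variants; this is a minimal sketch, not part of the official installation instructions.
# Post-install sanity check (sketch): importing fbgemm_gpu registers its
# operators under torch.ops.fbgemm.
import torch
import fbgemm_gpu  # noqa: F401
print(torch.__version__)
# Op lookup and dispatch fail loudly if registration did not happen;
# asynchronous_complete_cumsum is used here purely as an example op.
lengths = torch.tensor([2, 3, 1])
print(torch.ops.fbgemm.asynchronous_complete_cumsum(lengths))  # tensor([0, 2, 5, 6])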
Changes
Table Batched Embedding (TBE) operators
For GPU
- [New] Add support for int32_t indices in TBE training (#3377, #3375, #3374, #3372, #3371, #3324, #3267, #3264, #3263, #3257); a usage sketch appears after this list
- [New] Add support for int64_t indices and offsets in TBE inference (#3254, #3233)
- [New] Extend TBE support for larger embedding dimensions (#3462, #3467)
- [New] Make learning rate a tensor (Backend) (#3287, #3310, #3332)
- [New] Add PTA checks to embedding_bounds_check kernels (#3318)
- [Fix] Fix PackedTensorAccessor for batch_index_select (#3281)
- [Fix] Set cache_precision = weights_precision in TBE if it is not explicitly set (#3370)
- [Fix] Fix pt2_wrapper registration for unified TBE interface (#3238)
- [Fix] Fix PT2 compliant opcheck tests (#3404)
- [Fix] Fix FBGEMM_GPU_MEMCHECK in Split optimizers (#3416)
- [Fix] Fix learning rate as tensor for PT2 compile (#3407)
- [New] Add new optimizer state row_counter for Adam [Frontend] (#3558)
- [New] Add new optimizer state row_counter for Adam [Backend] (#3342)
- [Fix] Back out "Add support for int64_t indices and offsets in TBE inference [7C/N]" (#3258)
- [Fix] Back out "Add support for int64_t indices and offsets in TBE inference [8/N]" (#3255)
- [Fix] Fix global weight decay Faketensor test (#3341)
- [Fix] Fix pt2_wrapper registration for unified TBE interface (#3237)
- [Fix] Fix ""Cannot call numel() on tensor with symbolic sizes/strides"" (#3368)
- [Fix] Fix grid size overflow in generate_vbe_metadata (#3484)
- [Fix] Fix an integer overflow in permute_multi_embedding() (#3465)
- [Fix] Fix the sync point caused by iter_cpu.item() (#3401)
- [Fix] Hot fix to skip VBE CPU reshaping for MTIA (#3466)
- [Fix] Address memory over-use during flushing (#3460)
- [Improvement] Add iter singular value into TBE optimizer state (#3228)
- [Improvement] V2 fwd modified warps (#3570)
- [Improvement] Add enable_async_update into TBE signature and config (#3431, #3461)
- [Improvement] Adjust kNumThreads for bounds_check_indices_kernel (#3299)
- [Improvement] Reduce registers in bounds_check_indices (#3298)
- [Improvement] Mark unified autograd function traceable (#3378)
- [Improvement] Improve bounds_check_indices for VBE (#3388, #3386)
- [Improvement] Do not call scalar_type (#3394)
- [Improvement] Optimizer 1d -- EMA in place (fbgemm part) (#3402)
- [Improvement] Clean up nbit_forward tests (#3286)
- [Improvement] Remove unused-variable in some generated code (#3327)
- [Improvement] Limit grid size of bounds_check_indices (#3282)
- [Improvement] Support config based bound check version via extended modes (#3418)
- [Improvement] Use int64_t index for SplitOptimizer grad (#3447)
- [Improvement] Remove unused arg from generate_vbe_metadata frontend (#3453)
- [Improvement] Add generate_vbe_metadata test (#3483)
- [Improvement] Support config based bound check version via extended modes (#3454)
- [Improvement] Make iter PT2 compatible (#3253)
- [Improvement] Add meta function for PT2 wrappers (#3240)
- [Improvement] Nesterov (#3232)
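To illustrate the new int32_t index support in TBE training (the sketch referenced in the list above), the example below drives a single-table training TBE with int32 indices and offsets. It assumes the existing SplitTableBatchedEmbeddingBagsCodegen API and module paths; constructor arguments and defaults may differ between releases, so treat this as a minimal sketch rather than a definitive usage pattern.
# Minimal sketch: single-table TBE training with int32 indices/offsets.
# Module paths and constructor arguments are assumed from the existing
# fbgemm_gpu training API and may differ between releases.
import torch
from fbgemm_gpu.split_embedding_configs import EmbOptimType as OptimType
from fbgemm_gpu.split_table_batched_embeddings_ops_common import EmbeddingLocation
from fbgemm_gpu.split_table_batched_embeddings_ops_training import (
    ComputeDevice,
    SplitTableBatchedEmbeddingBagsCodegen,
)

emb = SplitTableBatchedEmbeddingBagsCodegen(
    # One table: (num_embeddings, embedding_dim, placement, compute device)
    embedding_specs=[(1000, 128, EmbeddingLocation.DEVICE, ComputeDevice.CUDA)],
    optimizer=OptimType.EXACT_ROWWISE_ADAGRAD,
    learning_rate=0.01,
)

# int32 indices/offsets; previously training required int64.
indices = torch.randint(0, 1000, (8,), dtype=torch.int32, device="cuda")
offsets = torch.tensor([0, 2, 4, 6, 8], dtype=torch.int32, device="cuda")

out = emb(indices=indices, offsets=offsets)  # pooled embeddings, shape (4, 128)
out.sum().backward()                         # fused optimizer applies the update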
For CPU
- [New] Introduce SVE function for matrix multiplication (#3348)
- [New] Add SVE implementation for float matrix transpose (#3421)
- [New] autovec specialization framework (#3393)
- [New] Move FP32 kernels to OSS (#3568)
- [Improvement] Pull in PR for Kleidi-based FP16 kernel (#3507)
- [Improvement] Use local buffer where possible (#3304)
- [Improvement] Refactor GenerateEmbeddingXXX functions (#3307)
- [Improvement] Increase local_storage size to 512 floats (#3357)
- [Improvement] Adjust EmbeddingSpMDMAutovec API (#3366)
- [Improvement] Split loops to work around loop vectorizer weakness (#3406)
- [Improvement] Do an early check that data_size is not negative (#3305)
- [Improvement] Fix strict aliasing violation, code cleanup (#3306)
SSD TBE Operators
- [New] Enable VBE in SSD-TBE (#3247)
- [Improvement] put KVTensorWrapper in its own header (#3575)
- [Improvement] Moving KVTensorWrapper to a header file to be used in ModelStore checkpointing code (#3276)
- [Improvement] Async initialization of RocksDB SSD tensors (#3520)
- [Improvement] Pad before writing to RocksDB (#3245)
- [Improvement] use RocksDB iterator to read key range from ssd embedding (#3495)
- [Improvement] Log total duration spent prefetching (#3487)
- [Improvement] Address memory over-use during flushing (#3460)
- [Improvement] Create method to move TBE to the right device, and set Cache Load in the TBE class (#3438)
- [Improvement] Unit test for new move tbe from device/cache_load method (#3437)
- [Improvement] make L2/rocksdb update async optional (#3429)
- [Improvement] Drop RoPE when filling KV cache (#3346)
- [Improvement] Remove setting total_cache_hash_size as buffer (#3441)
- [Improvement] Add meta registrations for kv_cache operators (#3442)
- [Improvement] remove output dtype restriction in SSD TBE (#3524)
- [Improvement] change pmt require grad to false when detached (#3525)
- [Improvement] add more attributes to PartiallyMaterializedTensor (#3300)
- [Improvement] skip broken inference test that uses ssd TBE (#3494)
- [Improvement] "coro => fut" (#3430)
- [Improvement] Reland of D65489998 Optimize sharding performance of embeddings (#3549)
- [Improvement] Remove torch.jit.script (#3562)
GenAI Support and Operators
- [New] Add nccl_alltoall function (#3551)
- [New] Custom allgather supports multiple dtypes (#3498)
- [Improvement] Make sure fake tensor functions return on proper device (#3258)
- [Improvement] Add CPU registrations to custom operators (#3262)
- [Improvement] Check src & dst dtypes in allgather to prevent silent failures (#3523)
- [Improvement] Better shape function registration (#3237, #3340)
- [Improvement] Package re-organization improvements (#3546, #3251, #3419, #3268, #3512)
FP8 and other Quantization support
- [New] New autotune config for M=4 (#3277)
- [New] MoE FP8 grouped GEMM (#3321)
- [New] Add shape check on GroupedGEMM kernel (#3449)
- [New] Tuning for fp8 gemm with emu1.7 shapes (#3436)
- [Improvement] More FP8 tuning for decode, with no need to pad (#3576)
- [Improvement] FP8 rowwise GEMM tuning for LLM decode shapes (#3565)
- [Improvement] Split FP8 Grouped Gemm into dynamic and static version (#3543)
- [Improvement] Warp-specialized FP8 rowwise GEMM kernel (#3532)
- [Improvement] Add Cutlass FP8 Grouped Gemm to Quantize Bench (#3530)
- [Improvement] Fixed FBGEMM fp8 rowwise for irregular shapes (#3491)
- [Improvement] Properly define preallocated output as mutable in fp8 rowwise gemm (#3476)
- [Improvement] Fix FP8 Rowwise Gemm Compilation with Auto-functionalize V2 (#3457)
- [Improvement] Support zero-size inputs in FP8 cuda quantize kernel (#3448)
- [Improvement] update FP8 GEMM tuning for emu1.7 7B shapes (#3391)
- [Improvement] Customize FP8 grouped GEMM for non-zero calculation for token choice MoE (#3383)
- [Improvement] Support FP8 grouped GEMM with cudagraph (#3373)
- [Improvement] Refactor FP8 grouped GEMM to prepare cudagraph support (#3369)
- [Improvement] Improve FP8 BMM heuristic for large shapes and MoE E2E performance (#3344)
- [Improvement] retune some of the EMU1.6 7B FP8 GEMM shapes (#3328)
- [Improvement] Make FP8 BMM output contiguous (#3270)
- [Improvement] Tune FP8 rowwise BMM tile heuristic (#3256)
- [Improvement] more FP8 GEMM tuning for LDM shapes (#3414)
- [Improvement] Split up f8f8bf16_rowwise_batched.cu (#3381)
- [Improvement] Use sym int in quantize.cpp for f8f8bf16_rowwise_meta (#3410)
- [Improvement] Remove triton.ops dependency from fbgemm (#3329)
- [Improvement] Improve performance of prefill mode FP8 Grouped Gemm (#3522)
- [Improvement] Support quantize_fp8_row for up to 4D non-contiguous tensors (#3508); see the sketch after this list
- [Improvement] Back out "support quantize_fp8_row for up to 4d non contiguous tensor" (#3505)
- [Improvement] Make the scale match the shape of quantized value with N-D tensors (#3396)
- [Improvement] Fix out-of-bound load in row scaling (#3527)
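To tie the FP8 rowwise items above together, the sketch below runs rowwise FP8 quantization followed by a rowwise-scaled GEMM. The op names quantize_fp8_row and f8f8bf16_rowwise come from the change list above, but the exact signatures and the import used to register the GenAI ops (fbgemm_gpu.experimental.gen_ai) are assumptions; a GPU with FP8 support is required.
# Minimal sketch: rowwise FP8 quantization + rowwise-scaled FP8 GEMM.
# Op signatures and the gen_ai import path are assumptions; requires a GPU
# with FP8 support (e.g. H100-class hardware).
import torch
import fbgemm_gpu.experimental.gen_ai  # noqa: F401 -- registers the GenAI quantize ops

M, N, K = 128, 256, 512
x = torch.randn(M, K, dtype=torch.bfloat16, device="cuda")
w = torch.randn(N, K, dtype=torch.bfloat16, device="cuda")

# Rowwise quantization: FP8 tensor plus one FP32 scale per row.
xq, x_scale = torch.ops.fbgemm.quantize_fp8_row(x)
wq, w_scale = torch.ops.fbgemm.quantize_fp8_row(w)

# Rowwise-scaled FP8 GEMM producing a BF16 output of shape (M, N).
y = torch.ops.fbgemm.f8f8bf16_rowwise(xq, wq, x_scale, w_scale)
print(y.shape, y.dtype)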
ROCm
- [New] More CK FP8 rowwise GEMM instances and tuning (#3455)
- [New] Setup for ck fp8 batched gemm (#3322)
- [New] CK FP8 Batched Gemm Heuristic Tuning (#3336)
- [New] CK FP8 Grouped Gemm Support (#3316)
- [New] Enable v2 forward test for ROCm (#3573)
- [New] Add fused_moe kernel to ck_extension (#3518)
- [Improvement] Implement Vec2 load/store for ROCm devices (#3413, #3475)
- [Improvement] Manual loop unroll for rocm inference (#3439, #3405)
- [Improvement] Optimized backward pass for ROCm devices (#3367)
- [Improvement] Add manual loop unroll for rocm devices in fwd pass (#3309, #3345)
- [Improvement] Debug v2 kernel for ROCm (#3266)
- [Improvement] Optimized backward pass for ROCm devices (#3511, #3488)
- [Improvement] FP8 Rowwise compile fix followup for AMD (#3478)
- [Improvement] Use output zero fill into grouped gemm kernel setup (#3537)
- [Improvement] Remove the duplicated ROCm version print, as it is already done in PyTorch (#3330)
- [Improvement] Small cleanup of CK kernels (#3278)
- [Improvement] Cherry-pick CK PR #1636 for fp8 GEMM rowwise for 70B Prefill (#3517)
- [Improvement] Heuristic Tuning for CK FP8 Grouped Gemm (#3356)
- [Improvement] Temporarily disable nbit_forward_test on OSS ROCm clang (#3445)
- [Fix] Fix CK FP8 rowwise quantization for some GEMM shapes (#3486)
SLL
- [Improvement] Migrate SLL ops to OSS (#3485, #3479, #3459, #3456, #3428, #3458, #3354, #3352, #3351, #3350, #3347, #3331, #3472, #3482, #3474)
- [Improvement] Fix specialization issue in keyed_jagged_index_select_dim1_forward_cuda (#3578)
- [Improvement] Align sll function names (#3471)
- [Improvement] Break up SLL test files (#3550)
- [Improvement] Register jagged ops to CompositeImplicitAutograd (#3395, #3249)
Sparse Operators
Sparse Ops
- [Improvement] Register fake tensor impl for fbgemm::all_to_one_device (#3320)
- [Improvement] Code cleanups to sparse bucketize and sparse block bucketize kernels (#3296, #3295, #3302)
- [Improvement] Update impl_abstract in sparse ops (#3311)
- [Improvement] Cleanup stray testing line (#3353)
- [Improvement] Print the node infos when CUDA p2p init fails (#3390)
- [Improvement] Add large my_size support in _block_bucketize_pooled_sparse_features_cuda_kernel2 (#3294)
- [Improvement] Kernel support for multiple buckets per rank (#3323)
- [Improvement] Add CPU group_index_select fwd and bwd impl (#3273)
- [Improvement] Skip check_all_same_device if only CPU and meta tensors appear (#3241)
- [Improvement] create pack_segments_v2 with additional pad_minf and presence_mask functionality (#3427)
Quantization Operators
Quantize Ops
- [Improvement] Add meta dispatch for FusedNBitRowwiseQuantizedSBHalfToFloatOrHalf (#3248, #3231)
- [Improvement] Add torch checks for QuantizedCommCodec (#3260, #3389)
- [Fix] Fix index overflow for superlarge inputs (#3519)
MX4 Ops
- [Improvement] MX4 group size configuration for pyper (#3516)
- [Fix] Various illegal memory access fixes (#3229, #3509, #3349)
Better Engineering
Benchmarks and tests
- [New] Add a benchmark for VBE (#3464)
- [New] Add Machete to fbgemm quantize bench (#3259)
- [Improvement] Improve bounds check indices benchmark (#3283)
- [Improvement] Add trace for nbit_device (#3292)
- [Improvement] Use cudagraph for autotune (#3291)
- [Improvement] Improve benchmark accuracy with warmups and kineto profiling (#3585, #3580)
- [Fix] Fix test error (#3480)
- [Fix] Disable SLL test in OSS (#3545)
Build / CI improvements
- [New] Add build support for CUDA 12.6 (#3398, #3533, #3503, #3434)
- [New] Add build support for Python 3.13 (#3502, #3529, #3555)
- [New] Modularize the OSS CMake build (#3385, #3392, #3408, #3417, #3446, #3450, #3451, #3492, #3500)
- [Improvement] Add CUTLASS 3.6 compatibility (#3303)
- [Improvement] Misc CMake build fixes (#3513, #3382, #3477)
- [Improvement] Update ManyLinux support to ManyLinux 2.28 (#3521)
- [Improvement] Update Triton (#3497)
- [Improvement] Various build fixes and workflow improvements for ROCm jobs (#3566, #3557, #3554, #3547, #3501)
- [Improvement] Various GitHub workflow improvements (#3432, #3531, #3499, #3514, #3542, #3252, #3581, #3538, #3536, #3446, #3297, #3243, #3242, #3238, #3239, #3246, #3236, #3528, #3230)
- [Improvement] Various documentation fixes (#3339, #3333, #3244, #3365, #3289)
- [Improvement] Improvements to documentation regarding compatibility (#3569, #3280)
- [Improvement] Update package requirements.txt (#3574, #3469)
- [Improvement] Increase time-out for CUDA OSS CI (#3230)
- [Improvement] Add backwards compatibility checks for v1.1.0 release (#3489)
- [Improvement] Disable certain tests in OSS (#3443, #3548, #3272)