Highlights
TBE GPU
- Added support for int64_t table indices and offsets in TBE inference
- Added support for int32_t indices in TBE training
- Extended TBE support for larger embedding dimensions
- Made the learning rate a tensor value
- Improved indices bounds checking
TBE CPU
- Improved ARM support with SVE implementations for matrix multiplication and float matrix transpose
- Improved the EmbeddingSpMDMAutovec API
- Migrated FP32 ops to OSS
TBE SSD
- Enabled VBE in SSD-TBE
- Async initialization of RocksDB SSD tensors and padding before writing to RocksDB
- Improved checking of indices bounds and other constraints
Gen AI Ops
- Added nccl_alltoall function
- Custom allgather now supports multiple dtypes, with dtype checking to prevent silent failures
ROCm
- Added CK FP8 Batched GEMM and Rowwise GEMM kernels, along with heuristic tuning
- Fixed CK FP8 rowwise quantization for some GEMM shapes
- Introduced HIP-specific optimizations to the TBE forward and backward passes
SLL ops
- Migrated Sequence Learning Library (SLL) ops to OSS
Better Engineering
- Restructured the build to produce multiple smaller shared libraries instead of a single large binary
- New and improved tests and benchmarks
- Improved ROCm build variant support
- Added build support for CUDA 12.6 and Python 3.13
Software Requirements
FBGEMM_GPU v1.1.0 has been tested and is known to work on the following setups:
- PyTorch: v2.6
- CUDA: v11.8, 12.4, 12.6
- Python: v3.9, 3.10, 3.11, 3.12, 3.13
It is recommended to install and run FBGEMM_GPU in an isolated environment, such as a Conda environment or a Docker container.
Availability
FBGEMM_GPU can be fetched directly from PyPI:
# FBGEMM_GPU CUDA variant (only the CUDA 12.4 variant is available)
pip install fbgemm-gpu==1.1.0
# FBGEMM_GPU CPU variant
pip install fbgemm-gpu-cpu==1.1.0
Alternatively, it can be fetched from PyTorch PIP:
# FBGEMM_GPU CUDA variant
pip install fbgemm-gpu==1.1.0 --index-url https://download.pytorch.org/whl/cu118/
pip install fbgemm-gpu==1.1.0 --index-url https://download.pytorch.org/whl/cu124/
pip install fbgemm-gpu==1.1.0 --index-url https://download.pytorch.org/whl/cu126/
# FBGEMM_GPU CPU variant
pip install fbgemm-gpu==1.1.0 --index-url https://download.pytorch.org/whl/cpu
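As a quick post-install sanity check, the sketch below imports the package and calls one of the registered operators. The specific op used (asynchronous_complete_cumsum) is chosen only as an example of a long-standing FBGEMM sparse op available in both the CPU and CUDA variants; this is a minimal sketch, not part of the official installation instructions.
# Post-install sanity check (sketch): importing fbgemm_gpu registers its
# operators under torch.ops.fbgemm.
import torch
import fbgemm_gpu  # noqa: F401
print(torch.__version__)
# Op lookup and dispatch fail loudly if registration did not happen;
# asynchronous_complete_cumsum is used here purely as an example op.
lengths = torch.tensor([2, 3, 1])
print(torch.ops.fbgemm.asynchronous_complete_cumsum(lengths))  # tensor([0, 2, 5, 6])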
Changes
Table Batched Embedding (TBE) operators
For GPU
- [New] Add support for int32_t indices in TBE training (#3377, #3375, #3374, #3372, #3371, #3324, #3267, #3264, #3263, #3257); a usage sketch appears after this list
- [New] Add support for int64_t indices and offsets in TBE inference (#3254, #3233)
- [New] Extend TBE support for larger embedding dimensions (#3462, #3467)
- [New] Make learning rate a tensor (Backend) (#3287, #3310, #3332)
- [New] Add PTA checks to embedding_bounds_check kernels (#3318)
- [Fix] Fix PackedTensorAccessor for batch_index_select (#3281)
- [Fix] Set cache_precision = weights_precision in TBE if it is not explicitly set (#3370)
- [Fix] Fix pt2_wrapper registration for unified TBE interface (#3238)
- [Fix] Fix PT2 compliant opcheck tests (#3404)
- [Fix] Fix FBGEMM_GPU_MEMCHECK in Split optimizers (#3416)
- [Fix] Fix learning rate as tensor for PT2 compile (#3407)
- [New] Add new optimizer state row_counter for Adam [Frontend] (#3558)
- [New] Add new optimizer state row_counter for Adam [Backend] (#3342)
- [Fix] Back out "Add support for int64_t indices and offsets in TBE inference [7C/N]" (#3258)
- [Fix] Back out "Add support for int64_t indices and offsets in TBE inference [8/N]" (#3255)
- [Fix] Fix global weight decay Faketensor test (#3341)
- [Fix] Fix pt2_wrapper registration for unified TBE interface (#3237)
- [Fix] Fix ""Cannot call numel() on tensor with symbolic sizes/strides"" (#3368)
- [Fix] Fix grid size overflow in generate_vbe_metadata (#3484)
- [Fix] Fix an integer overflow in permute_multi_embedding() (#3465)
- [Fix] Fix the sync point caused by iter_cpu.item() (#3401)
- [Fix] Hot fix to skip VBE CPU reshaping for MTIA (#3466)
- [Fix] Address memory over-use during flushing (#3460)
- [Improvement] Add iter singular value into TBE optimizer state (#3228)
- [Improvement] V2 fwd modified warps (#3570)
- [Improvement] Add enable_async_update into TBE signature and config (#3431, #3461)
- [Improvement] Adjust kNumThreads for bounds_check_indices_kernel (#3299)
- [Improvement] Reduce registers in bounds_check_indices (#3298)
- [Improvement] Mark unified autograd function traceable (#3378)
- [Improvement] Improve bounds_check_indices for VBE (#3388, #3386)
- [Improvement] Do not call scalar_type (#3394)
- [Improvement] Optimizer 1d -- EMA in place (fbgemm part) (#3402)
- [Improvement] Clean up nbit_forward tests (#3286)
- [Improvement] Remove unused-variable in some generated code (#3327)
- [Improvement] Limit grid size of bounds_check_indices (#3282)
- [Improvement] Support config based bound check version via extended modes (#3418)
- [Improvement] Use int64_t index for SplitOptimizer grad (#3447)
- [Improvement] Remove unused arg from generate_vbe_metadata frontend (#3453)
- [Improvement] Add generate_vbe_metadata test (#3483)
- [Improvement] Support config based bound check version via extended modes (#3454)
- [Improvement] Make iter PT2 compatible (#3253)
- [Improvement] Add meta function for PT2 wrappers (#3240)
- [Improvement] Nesterov (#3232)
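To illustrate the new int32_t index support in TBE training (the sketch referenced in the list above), the example below drives a single-table training TBE with int32 indices and offsets. It assumes the existing SplitTableBatchedEmbeddingBagsCodegen API and module paths; constructor arguments and defaults may differ between releases, so treat this as a minimal sketch rather than a definitive usage pattern.
# Minimal sketch: single-table TBE training with int32 indices/offsets.
# Module paths and constructor arguments are assumed from the existing
# fbgemm_gpu training API and may differ between releases.
import torch
from fbgemm_gpu.split_embedding_configs import EmbOptimType as OptimType
from fbgemm_gpu.split_table_batched_embeddings_ops_common import EmbeddingLocation
from fbgemm_gpu.split_table_batched_embeddings_ops_training import (
    ComputeDevice,
    SplitTableBatchedEmbeddingBagsCodegen,
)

emb = SplitTableBatchedEmbeddingBagsCodegen(
    # One table: (num_embeddings, embedding_dim, placement, compute device)
    embedding_specs=[(1000, 128, EmbeddingLocation.DEVICE, ComputeDevice.CUDA)],
    optimizer=OptimType.EXACT_ROWWISE_ADAGRAD,
    learning_rate=0.01,
)

# int32 indices/offsets; previously training required int64.
indices = torch.randint(0, 1000, (8,), dtype=torch.int32, device="cuda")
offsets = torch.tensor([0, 2, 4, 6, 8], dtype=torch.int32, device="cuda")

out = emb(indices=indices, offsets=offsets)  # pooled embeddings, shape (4, 128)
out.sum().backward()                         # fused optimizer applies the update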
For CPU
- [New] Introduce SVE function for matrix multiplication (#3348)
- [New] Add SVE implementation for float matrix transpose (#3421)
- [New] autovec specialization framework (#3393)
- [New] Move FP32 kernels to OSS (#3568)
- [Improvement] Pull in PR for Kleidi-based FP16 kernel (#3507)
- [Improvement] Use local buffer where possible (#3304)
- [Improvement] Refactor GenerateEmbeddingXXX functions (#3307)
- [Improvement] Increase local_storage size to 512 floats (#3357)
- [Improvement] Adjust EmbeddingSpMDMAutovec API (#3366)
- [Improvement] Split loops to work around loop vectorizer weakness (#3406)
- [Improvement] Do an early check that data_size is not negative (#3305)
- [Improvement] Fix strict aliasing violation, code cleanup (#3306)
SSD TBE Operators
- [New] Enable VBE in SSD-TBE (#3247)
- [Improvement] put KVTensorWrapper in its own header (#3575)
- [Improvement] Moving KVTensorWrapper to a header file to be used in ModelStore checkpointing code (#3276)
- [Improvement] Async initialization of RocksDB SSD tensors (#3520)
- [Improvement] Pad before writing to RocksDB (#3245)
- [Improvement] use RocksDB iterator to read key range from ssd embedding (#3495)
- [Improvement] Log total duration spent prefetching (#3487)
- [Improvement] Address memory over-use during flushing (#3460)
- [Improvement] Create method to move TBE to the right device, and set Cache Load in the TBE class (#3438)
- [Improvement] Unit test for new move tbe from device/cache_load method (#3437)
- [Improvement] make L2/rocksdb update async optional (#3429)
- [Improvement] Drop RoPE when filling KV cache (#3346)
- [Improvement] Remove setting total_cache_hash_size as buffer (#3441)
- [Improvement] Add meta registrations for kv_cache operators (#3442)
- [Improvement] remove output dtype restriction in SSD TBE (#3524)
- [Improvement] change pmt require grad to false when detached (#3525)
- [Improvement] add more attributes to PartiallyMaterializedTensor (#3300)
- [Improvement] skip broken inference test that uses ssd TBE (#3494)
- [Improvement] "coro => fut" (#3430)
- [Improvement] Reland of D65489998 Optimize sharding performance of embeddings (#3549)
- [Improvement] Remove torch.jit.script (#3562)
GenAI Support and Operators
- [New] Add nccl_alltoall function (#3551)
- [New] Custom allgather supports multiple dtypes (#3498)
- [Improvement] Make sure fake tensor functions return on proper device (#3258)
- [Improvement] Add CPU registrations to custom operators (#3262)
- [Improvement] Check src & dst dtypes in allgather to prevent silent failures (#3523)
- [Improvement] Better shape function registration (#3237, #3340)
- [Improvement] Package re-organization improvements (#3546, #3251, #3419, #3268, #3512)
FP8 and other Quantization support
- [New] New autotune config for M=4 (#3277)
- [New] MoE FP8 grouped GEMM (#3321)
- [New] Add shape check on GroupedGEMM kernel (#3449)
- [New] Tuning for fp8 gemm with emu1.7 shapes (#3436)
- [Improvement] More FP8 tuning for decode, with no need to pad (#3576)
- [Improvement] FP8 rowwise GEMM tuning for LLM decode shapes (#3565)
- [Improvement] Split FP8 Grouped Gemm into dynamic and static version (#3543)
- [Improvement] Warp-specialized FP8 rowwise GEMM kernel (#3532)
- [Improvement] Add Cutlass FP8 Grouped Gemm to Quantize Bench (#3530)
- [Improvement] Fixed FBGEMM fp8 rowwise for irregular shapes (#3491)
- [Improvement] Properly define preallocated output as mutable in fp8 rowwise gemm (#3476)
- [Improvement] Fix FP8 Rowwise Gemm Compilation with Auto-functionalize V2 (#3457)
- [Improvement] Support zero-size inputs in FP8 cuda quantize kernel (#3448)
- [Improvement] update FP8 GEMM tuning for emu1.7 7B shapes (#3391)
- [Improvement] Customize FP8 grouped GEMM for non-zero calculation for token choice MoE (#3383)
- [Improvement] Support FP8 grouped GEMM with cudagraph (#3373)
- [Improvement] Refactor FP8 grouped GEMM to prepare cudagraph support (#3369)
- [Improvement] Improve FP8 BMM heuristic for large shapes and MoE E2E performance (#3344)
- [Improvement] retune some of the EMU1.6 7B FP8 GEMM shapes (#3328)
- [Improvement] Make FP8 BMM output contiguous (#3270)
- [Improvement] Tune FP8 rowwise BMM tile heuristic (#3256)
- [Improvement] more FP8 GEMM tuning for LDM shapes (#3414)
- [Improvement] Split up f8f8bf16_rowwise_batched.cu (#3381)
- [Improvement] Use sym int in quantize.cpp for f8f8bf16_rowwise_meta (#3410)
- [Improvement] Remove triton.ops dependency from fbgemm (#3329)
- [Improvement] Improve performance of prefill mode FP8 Grouped Gemm (#3522)
- [Improvement] Support quantize_fp8_row for up to 4D non-contiguous tensors (#3508); see the sketch after this list
- [Improvement] Back out "support quantize_fp8_row for up to 4d non contiguous tensor" (#3505)
- [Improvement] Make the scale match the shape of quantized value with N-D tensors (#3396)
- [Improvement] Fix out-of-bound load in row scaling (#3527)
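To tie the FP8 rowwise items above together, the sketch below runs rowwise FP8 quantization followed by a rowwise-scaled GEMM. The op names quantize_fp8_row and f8f8bf16_rowwise come from the change list above, but the exact signatures and the import used to register the GenAI ops (fbgemm_gpu.experimental.gen_ai) are assumptions; a GPU with FP8 support is required.
# Minimal sketch: rowwise FP8 quantization + rowwise-scaled FP8 GEMM.
# Op signatures and the gen_ai import path are assumptions; requires a GPU
# with FP8 support (e.g. H100-class hardware).
import torch
import fbgemm_gpu.experimental.gen_ai  # noqa: F401 -- registers the GenAI quantize ops

M, N, K = 128, 256, 512
x = torch.randn(M, K, dtype=torch.bfloat16, device="cuda")
w = torch.randn(N, K, dtype=torch.bfloat16, device="cuda")

# Rowwise quantization: FP8 tensor plus one FP32 scale per row.
xq, x_scale = torch.ops.fbgemm.quantize_fp8_row(x)
wq, w_scale = torch.ops.fbgemm.quantize_fp8_row(w)

# Rowwise-scaled FP8 GEMM producing a BF16 output of shape (M, N).
y = torch.ops.fbgemm.f8f8bf16_rowwise(xq, wq, x_scale, w_scale)
print(y.shape, y.dtype)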
ROCm
- [New] More CK FP8 rowwise GEMM instances and tuning (#3455)
- [New] Setup for ck fp8 batched gemm (#3322)
- [New] CK FP8 Batched Gemm Heuristic Tuning (#3336)
- [New] CK FP8 Grouped Gemm Support (#3316)
- [New] Enable v2 forward test for ROCm (#3573)
- [New] Add fused_moe kernel to ck_extension (#3518)
- [Improvement] Implement Vec2 load/store for ROCm devices (#3413, #3475)
- [Improvement] Manual loop unroll for rocm inference (#3439, #3405)
- [Improvement] Optimized backward pass for ROCm devices (#3367)
- [Improvement] Add manual loop unroll for rocm devices in fwd pass (#3309, #3345)
- [Improvement] Debug v2 kernel for ROCm (#3266)
- [Improvement] Optimized backward pass for ROCm devices (#3511, #3488)
- [Improvement] FP8 Rowwise compile fix followup for AMD (#3478)
- [Improvement] Use output zero fill into grouped gemm kernel setup (#3537)
- [Improvement] Remove the duplicated ROCm version print, as it is already done in PyTorch (#3330)
- [Improvement] Small cleanup of CK kernels (#3278)
- [Improvement] Cherry-pick CK PR #1636 for fp8 GEMM rowwise for 70B Prefill (#3517)
- [Improvement] Heuristic Tuning for CK FP8 Grouped Gemm (#3356)
- [Improvement] Temporarily disable nbit_forward_test on OSS ROCm clang (#3445)
- [Fix] Fix CK FP8 rowwise quantization for some GEMM shapes (#3486)
SLL
- [Improvement] Migrate SLL ops to OSS (#3485, #3479, #3459, #3456, #3428, #3458, #3354, #3352, #3351, #3350, #3347, #3331, #3472, #3482, #3474)
- [Improvement] Fix specialization issue in keyed_jagged_index_select_dim1_forward_cuda (#3578)
- [Improvement] Align sll function names (#3471)
- [Improvement] Break up SLL test files (#3550)
- [Improvement] Register jagged ops to CompositeImplicitAutograd (#3395, #3249)
Sparse Operators
Sparse Ops
- [Improvement] Register fake tensor impl for fbgemm::all_to_one_device (#3320)
- [Improvement] Code cleanups to sparse bucketize and sparse block bucketize kernels (#3296, #3295, #3302)
- [Improvement] Update impl_abstract in sparse ops (#3311)
- [Improvement] Cleanup stray testing line (#3353)
- [Improvement] Print the node infos when CUDA p2p init fails (#3390)
- [Improvement] Add large my_size support in _block_bucketize_pooled_sparse_features_cuda_kernel2 (#3294)
- [Improvement] Kernel support for multiple buckets per rank (#3323)
- [Improvement] Add CPU group_index_select fwd and bwd impl (#3273)
- [Improvement] Skip check_all_same_device if only CPU and meta tensors appear (#3241)
- [Improvement] create pack_segments_v2 with additional pad_minf and presence_mask functionality (#3427)
Quantization Operators
Quantize Ops
- [Improvement] Add meta dispatch for FusedNBitRowwiseQuantizedSBHalfToFloatOrHalf (#3248, #3231)
- [Improvement] Add torch checks for QuantizedCommCodec (#3260, #3389)
- [Fix] Fix index overflow for superlarge inputs (#3519)
MX4 Ops
- [Improvement] MX4 group size configuration for pyper (#3516)
- [Fix] Various illegal memory access fixes (#3229, #3509, #3349)
Better Engineering
Benchmarks and tests
- [New] Add a benchmark for VBE (#3464)
- [New] Add Machete to fbgemm quantize bench (#3259)
- [Improvement] Improve bounds check indices benchmark (#3283)
- [Improvement] Add trace for nbit_device (#3292)
- [Improvement] Use cudagraph for autotune (#3291)
- [Improvement] Improve benchmark accuracy with warmups and kineto profiling (#3585, #3580)
- [Fix] Fix test error (#3480)
- [Fix] Disable SLL test in OSS (#3545)
Build / CI improvements
- [New] Add build support for CUDA 12.6 (#3398, #3533, #3503, #3434)
- [New] Add build support for Python 3.13 (#3502, #3529, #3555)
- [New] Modularize the OSS CMake build (#3385, #3392, #3408, #3417, #3446, #3450, #3451, #3492, #3500)
- [Improvement] Add CUTLASS 3.6 compatibility (#3303)
- [Improvement] Misc CMake build fixes (#3513, #3382, #3477)
- [Improvement] Update ManyLinux support to ManyLinux 2.28 (#3521)
- [Improvement] Update Triton (#3497)
- [Improvement] Various build fixes and workflow improvements for ROCm jobs (#3566, #3557, #3554, #3547, #3501)
- [Improvement] Various GitHub workflow improvements (#3432, #3531, #3499, #3514, #3542, #3252, #3581, #3538, #3536, #3446, #3297, #3243, #3242, #3238, #3239, #3246, #3236, #3528, #3230)
- [Improvement] Various documentation fixes (#3339, #3333, #3244, #3365, #3289)
- [Improvement] Improvements to documentation regarding compatibility (#3569, #3280)
- [Improvement] Update package requirements.txt (#3574, #3469)
- [Improvement] Increase time-out for CUDA OSS CI (#3230)
- [Improvement] Add backwards compatibility checks for v1.1.0 release (#3489)
- [Improvement] Disable certain tests in OSS (#3443, #3548, #3272)