pytorch/FBGEMM v1.7.0 on GitHub

Highlights

Inference & Production Deployment

TurboSSDInferenceModule with streaming updates and snapshot loading for HSTU serving (#5558, #5554)
AMD/ROCm support for SSD TBE inference with cache locking and dedicated memcpy streams (#5559, #5480)
TBE EEG (Embedding Export Gateway) for inference workloads (#5688)
DRAM KV cache and L2 cache hit rate metrics for production monitoring (#5633, #5730)

Enrichment & Feature Store Integration

Configurable IGR enrichment support for DRAM KV embedding cache (#5463, #5488)
OneFlow OpenTab and Feature Store enrichment backends (#5465, #5466, #5493, #5494)
Per-feature pooling factors support for flexible embedding architectures (#5690)

Performance Optimizations

Double-buffered eviction and auto-sized RocksDB block cache reducing prefetch stalls (#5512, #5513)
Precomputed writeback dedup indices eliminating GPU-CPU sync in backward pass (#5522)
Optimized jagged_unique_indices_cuda with binary-search and custom CUB pipeline (#5718)
Vectorized FP16 row conversion in rowwise quantization (#5596)

Quantization & GenAI

BF16 scale/bias support for INT4 quantization (#5595)
AVX512-BF16 dequantization enabled in OSS builds (#5635)
FP8 rowwise padding for quantized AllToAll pooled embeddings (#5673)
New Triton IKBO LCE kernel and TLX IKBO Flash Attention (#5521, #5651)

Platform & Hardware Support

SVE-FP16 version of EmbeddingSpMDM8Bit for ARM architectures (#5720)
UVM pipeline support for MTIA accelerators (#5538)
Preallocated host buffer support for CPU TBE (#5692)

Developer Experience

C++20 modernization: concepts, requires clauses, std::ranges, and std::bit_cast (#5586, #5592, #5593)
Comprehensive benchmark trace export and analysis tooling (#5498, #5731, #5671, #5693)
Minimum GCC bumped to 11.4 for better C++20 support (#5553)

Software Requirements

FBGEMM_GPU v1.7.0 has been tested and is known to work on the following setups:

PyTorch: 2.12.x
CUDA: 12.6, 12.8, 12.9, 13.0
Python: 3.10, 3.11, 3.12, 3.13, 3.14
ROCm: 7.0, 7.1

It is recommended to prepare an isolated environment for installing and running FBGEMM_GPU, such as Conda and/or Docker.
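
As a quick pre-install check, the tested matrix above can be encoded in a few lines of Python. This is only a sketch: the function name `is_tested_setup` and its parameters are hypothetical and not part of FBGEMM_GPU, and a setup outside the matrix may still work; it has simply not been listed as tested.

```python
import sys

# Versions copied from the "Software Requirements" list above.
TESTED_PYTHON = {(3, 10), (3, 11), (3, 12), (3, 13), (3, 14)}
TESTED_CUDA = {"12.6", "12.8", "12.9", "13.0"}
TESTED_ROCM = {"7.0", "7.1"}


def is_tested_setup(python_version, cuda=None, rocm=None):
    """Return True if the Python (major, minor) version and the optional
    CUDA or ROCm version string all appear in the tested matrix.

    Hypothetical helper for illustration; not an FBGEMM_GPU API.
    """
    if tuple(python_version[:2]) not in TESTED_PYTHON:
        return False
    if cuda is not None and cuda not in TESTED_CUDA:
        return False
    if rocm is not None and rocm not in TESTED_ROCM:
        return False
    return True


# Check the running interpreter against a CUDA 12.8 toolchain.
current_ok = is_tested_setup(sys.version_info, cuda="12.8")
```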

Availability

FBGEMM_GPU can be fetched directly from PyPI:

# FBGEMM_GPU CUDA variant
pip install fbgemm-gpu==1.7.0

# FBGEMM_GPU CPU variant
pip install fbgemm-gpu-cpu==1.7.0

Alternatively, it can be fetched from PyTorch PIP:

# FBGEMM_GPU CUDA variant
pip install fbgemm-gpu==1.7.0 --index-url https://download.pytorch.org/whl/cu126/
pip install fbgemm-gpu==1.7.0 --index-url https://download.pytorch.org/whl/cu128/
pip install fbgemm-gpu==1.7.0 --index-url https://download.pytorch.org/whl/cu129/
pip install fbgemm-gpu==1.7.0 --index-url https://download.pytorch.org/whl/cu130/

# FBGEMM_GPU CPU variant
pip install fbgemm-gpu==1.7.0 --index-url https://download.pytorch.org/whl/cpu
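
For scripted environments, the commands above follow a simple pattern that can be generated rather than copied. The sketch below only reproduces the package names and index URLs shown in this section; `pip_command` and its arguments are hypothetical names chosen for illustration.

```python
FBGEMM_VERSION = "1.7.0"
PYTORCH_INDEX = "https://download.pytorch.org/whl"
# CUDA version -> index suffix, as listed in the commands above.
CUDA_TAGS = {"12.6": "cu126", "12.8": "cu128", "12.9": "cu129", "13.0": "cu130"}


def pip_command(variant, cuda=None, use_pytorch_index=False):
    """Build the pip argument list for the CUDA or CPU variant.

    Hypothetical helper; it mirrors the documented commands and nothing more.
    """
    if variant not in ("cuda", "cpu"):
        raise ValueError(f"unknown variant: {variant!r}")
    if not use_pytorch_index:
        # On PyPI the CPU build is a separate package.
        package = "fbgemm-gpu" if variant == "cuda" else "fbgemm-gpu-cpu"
        return ["pip", "install", f"{package}=={FBGEMM_VERSION}"]
    if variant == "cpu":
        index_url = f"{PYTORCH_INDEX}/cpu"
    else:
        if cuda not in CUDA_TAGS:
            raise ValueError(f"no wheel index listed for CUDA {cuda!r}")
        index_url = f"{PYTORCH_INDEX}/{CUDA_TAGS[cuda]}/"
    # On the PyTorch index both variants use the fbgemm-gpu package name.
    return ["pip", "install", f"fbgemm-gpu=={FBGEMM_VERSION}", "--index-url", index_url]
```

The returned list can be passed to `subprocess.run` directly, which avoids shell quoting issues.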

Changes

Table Batched Embedding (TBE) Operators

For GPU

[New] Add configurable IGR enrichment support for DRAM KV embedding cache (#5463)
[New] Add OneFlow OpenTab enrichment backend support (#5465)
[New] Add OneFlow Feature Store enrichment backend and refactor dispatch (#5466)
[New] Add sync fetch_sids_sync API for publish SID→VID mapping (#5467)
[Fix] Fix device mismatch errors in benchmark (#5490)
[Fix] Fix FBGEMM_MEMCHECK bug in vbe_metadata kernel (#5506)
[Fix] Fix bug in momentum type declaration in HIP TBE kernel (#5501) (#5514)
[Improvement] 52 [A] (#5518)
[Improvement] Add unit test to validate HIP backward kernel with FP16 momentum (#5519)
[Fix] Fix test_cache_int32_overflow test failure on ROCm (#5526) (#5530)
[Fix] Fix output_type=BF16 test_backward_adagrad_large_dims failure (#5531)
[Improvement] Fix ufmt lint: remove extra blank line in lxu_cache_test.py (#5535)
[Fix] Fix autovec EmbeddingSpMDMNBit to handle pruned (-1) indices (#5543)
[Improvement] Add periodic logging for L2 cache fill wait time (#5549)
[Fix] Use sym_numel() instead of numel() in TBE pt2 backward codegen (#5563)
[Fix] Fix TBE v2 forward kernel for embedding dim > 1024 (#5326) (#5569)
[Improvement] Fold UVM code into TBE package (#5576)
[Improvement] Migrate tbe_input_multiplexer.py and runtime_monitor.py into tbe/monitoring/ (#5590)
[Improvement] Fold tbe/stats/ into tbe/monitoring/ for better organization (#5591)
[Fix] Fix int32 truncation in tbe_input_combine offset accumulation (#5594)
[Fix] Get actual free GPU memory in test_cache_int32_overflow (#5605)
[Fix] Make warp segment threshold consistent with host function (#5606)
[Fix] Revert D99867633 (#5607)
[Fix] Fix tbe combine tests (#5614)
[Fix] Use aligned_unique_ptr in more places to avoid leak (#5621)
[Improvement] Remove stale ROCm 5.7 skip checks and dead SM70 code in tests (#5619) (#5626)
[Improvement] Simplify is_torchdynamo_compiling to direct import from torch.compiler (#5618) (#5628)
[Fix] Support multi-dimensional runtime_meta in RES streaming buffers by lazy init (#5643)
[Fix] Add missing C10_CUDA_CHECK (#5647)
[Fix] Fix VBE batch sizes not passed to request builder (#5653)
[Improvement] Log query empty count vs total count (#5657)
[Improvement] Use newer STL features in codegen templates (#5659)
[Improvement] Use Python 3.10+ typing in TBE ops and utilities (#5667)
[Fix] Apply proper grid striding on forward V2 kernel for ROCm (#5447) (#5669)
[Improvement] Exclude transient RES streaming buffers from checkpoints by setting persistent=False (#5674)
[Improvement] Use Python 3.10+ typing in core TBE ops (#5675)
[Improvement] TBE benchmark suites improvement (#5677)
[Improvement] Refactor bounds_check_indices offset checks to condition-first (Phase 1) (#5682)
[New] TBE EEG for Inference (#5688)
[New] Add per-feature pooling factors support (#5690)
[New] Add SVE-FP16 version of EmbeddingSpMDM8Bit (#5720)
[Improvement] Remove unnecessary field for FixedBlockPool in inference (#5729)
[Fix] Fix find_long_segments kernel launch failure for batch index select (#5732)
[Improvement] Support warpSize 32 and 64 in the same build (#5739)
[New] Create tbe/config/ package with foundational embedding types (#5742)
[Improvement] Remove unnecessary __syncthreads in bounds_check_indices_kernel_v2 (#5744)
[New] Add cache config types to tbe/cache/ package (#5752)
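
Several fixes in this list (for example #5594) concern 32-bit truncation when accumulating offsets. The sketch below is a generic illustration of that failure mode and is not the FBGEMM code: `accumulate_offsets` is a hypothetical helper, and `ctypes.c_int32` is used only to mimic a 32-bit accumulator in Python.

```python
import ctypes


def accumulate_offsets(lengths, wrap_int32):
    """Return the running-sum offsets for `lengths`.

    With wrap_int32=True each partial sum is wrapped to a signed 32-bit
    value, imitating an int32 accumulator. Illustrative only.
    """
    total = 0
    offsets = [0]
    for n in lengths:
        total += n
        if wrap_int32:
            # c_int32 performs no overflow check; the value wraps around.
            total = ctypes.c_int32(total).value
        offsets.append(total)
    return offsets


lengths = [1_500_000_000, 1_000_000_000]
wide = accumulate_offsets(lengths, wrap_int32=False)
narrow = accumulate_offsets(lengths, wrap_int32=True)
# The final 64-bit offset exceeds INT32_MAX, so the 32-bit version goes negative.
overflowed = narrow[-1] != wide[-1]
```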

For CPU

[Improvement] Replace spin-wait polling with condition variable in EmbeddingKVDB fill queue (#5510)
[Improvement] Precompute writeback dedup indices in forward to eliminate GPU-CPU sync in backward (#5522)
[Fix] Fix CPU TBE inline bounds check for unified embedding (#5523)
[Fix] Fix fused TBE weight buffer for MTIA (#5534)
[New] Add UVM pipeline support for MTIA (#5538)
[New] Add preallocated host buffer support to FBGEMM SplitTableBatchedEmbeddingBagsCodegen (#5692)
[Improvement] Enable TBE nobag backward test for SGD on CPU (#5759)
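
The change in #5510 replaces spin-wait polling with a condition variable. The EmbeddingKVDB fill queue itself is C++ and is not shown in these notes; the Python analogue below only illustrates the general pattern, and `FillQueue` is a hypothetical class name.

```python
import threading
from collections import deque


class FillQueue:
    """Blocking queue whose consumers sleep on a condition variable
    instead of polling in a loop. Illustrative sketch only."""

    def __init__(self):
        self._items = deque()
        self._cond = threading.Condition()

    def put(self, item):
        with self._cond:
            self._items.append(item)
            self._cond.notify()  # wake one waiting consumer

    def get(self, timeout=None):
        with self._cond:
            # wait_for re-checks the predicate after every wakeup,
            # so spurious wakeups are handled without extra code.
            ready = self._cond.wait_for(lambda: len(self._items) > 0, timeout=timeout)
            if not ready:
                raise TimeoutError("no item arrived before the timeout")
            return self._items.popleft()


# One consumer blocks in get() until the producer calls put().
queue = FillQueue()
results = []
consumer = threading.Thread(target=lambda: results.append(queue.get(timeout=5)))
consumer.start()
queue.put("row-42")
consumer.join()
```

Compared with a polling loop, the waiting thread consumes no CPU while the queue is empty and wakes as soon as an item is available.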

SSD Table Batched Embedding (TBE) Operators

[Improvement] Move compute thresholds logic for eviction (#5453)
[Improvement] Enable feature score auto collection in EBC (#5459)
[Improvement] Migrate cudaStreamAddCallback to cudaLaunchHostFunc (#5462)
[Improvement] Add Python enum configs and KJT builder for enrichment (#5464)
[Improvement] Add cache locking and dedicated memcpy stream for SSD TBE inference (#5480)
[New] Enable RES for DRAM KV embedding cache (#5488)
[New] Add OneFlow OpenTab enrichment backend support (#5493)
[New] Add OneFlow Feature Store enrichment backend and refactor dispatch (#5494)
[Fix] Fix race conditions (#5496)
[Improvement] Use atomicAdd for lxu_cache_locking_counter increments/decrements (#5509)
[Improvement] Tune RocksDB bloom filter and background thread pool sizing (#5511)
[Improvement] Double-buffer eviction buffers to reduce prefetch stalls (#5512)
[Improvement] Auto-size RocksDB block cache and expose L2 cache hit rate (#5513)
[Fix] Fix race conditions: make shared mutable state atomic (#5520)
[Fix] Fix sorted_ids None issue in SSD TBE optimizer state fetching (#5525)
[Improvement] Make inference cache locking opt-in via enable_cache_locking flag (#5546)
[New] Add embedding cache support to oneflow base model (#5552)
[New] Add streaming_update() and load_snapshot() for inference (#5554)
[New] Add TurboSSDInferenceModule for HSTU serving integration (#5558)
[New] Add AMD/ROCm support for SSD TBE inference (#5559)
[Improvement] Support input data not most recent in MP-ZCH (#5567) (#5570)
[Fix] Fix lint (#5611)
[New] Add DRAM KV cache and L1 hit rate metrics for training (#5633)
[Improvement] Skip scratch pad eviction data in enrichment mode to avoid cudaFree overhead (#5645)
[Improvement] Add laser_batch_size to IGR enrichment and add sleep for enrichment (#5697)
[Fix] Gate enrichment_policy by per-TBE embedding_cache_mode (#5698)
[Fix] Add spin-loop termination for AMD GPU hang on MP-ZCH (#5714)
[Improvement] Add unit tests for warp primitives, bitonic sort, and ROCm warpSize guards (#5715)
[Improvement] Add KVZCH inference read-time hit rate metrics via fb303 ODS counters (#5730)
[Fix] Fix int32 truncation of 64-bit ssd_row_addrs in unrolled forward path (#5743)
[Improvement] Add KVZCH inference read-time hit rate metrics via fb303 ODS counters (#5745)
[Improvement] Add SSD/KVZCH config types to tbe/ssd/ package (#5753)
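
The double-buffered eviction in #5512 is described only by its title here. As a rough, single-threaded sketch of the general double-buffering idea (not the SSD TBE implementation), the hypothetical `DoubleBuffer` class below lets the producer keep filling one buffer while the other is waiting to be written back.

```python
class DoubleBuffer:
    """Two fixed-capacity buffers: one is filled while the other drains.

    Illustrative sketch; names and structure are assumptions.
    """

    def __init__(self, capacity):
        self.capacity = capacity
        self._active = []    # currently being filled by the eviction path
        self._draining = []  # waiting to be written back to storage
        self.flushed = []    # stands in for the backing store

    def push(self, row):
        self._active.append(row)
        if len(self._active) == self.capacity:
            self._swap()

    def _swap(self):
        # The previous drain must be complete before its buffer is reused.
        self.drain()
        self._active, self._draining = self._draining, self._active

    def drain(self):
        """Write back the draining buffer; in a real pipeline this would
        run in the background while push() continues."""
        self.flushed.extend(self._draining)
        self._draining.clear()


buf = DoubleBuffer(capacity=2)
buf.push("a")
buf.push("b")   # buffer full: swapped out, not yet flushed
buf.push("c")   # producer continues into the second buffer
pending_before_drain = list(buf.flushed)
buf.drain()
```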

GenAI Support and Operators

Triton GEMM Support

[Improvement] Port reorder_batched_ad_lengths benchmark to tritonbench (#5505)
[New] IKBO LCE kernel in fbgemm (#5521)
[Improvement] Port group_index_select_2d to tritonbench (#5533)
[Improvement] Add Portions Copyright headers to modified third-party files (#5545)
[Improvement] Port jagged_index_select_2d benchmark to tritonbench (#5572)
[Improvement] Port bench_dense_to_jagged_1d to tritonbench (#5580)
[Improvement] Port bench_jagged_1d_to_dense to tritonbench (#5584)
[Improvement] Port bench_jagged_2d_to_dense and bench_dense_to_jagged_2d to (#5598)
[Improvement] Port jagged_dense_dense_elementwise_add_jagged_output and jagged_dense_elementwise_op_jagged_output to tritonbench (#5602)
[Improvement] Update jagged_acc_weights_and_counts and jagged_slice_cpu bench (#5620)
[Improvement] Upgrade permute_multi_embedding benchmark (#5627)
[Improvement] Upgrade batched_unary_embeddings benchmark (#5639)
[Improvement] Use Python 3.10+ typing in sparse/quantize/triton/utils (#5636) (#5642)
[New] Triton/TLX IKBO FA (#5651)
[Fix] TLX IKBO FA benchmarking with latest commit hash + bug fix (#5734)

Quantization Operators

[Fix] Fix fp16 code on aarch64 and Windows builds (#5548) (#5550)
[Improvement] Merge SFINAE overloads of CodeGenHelpers templates with if constexpr (#5565) (#5571)
[Improvement] bf16 scale/bias for INT4 (#5595)
[Improvement] Vectorize fp16 row conversion in rowwise quantization (#5596)
[Fix] Fix EmbeddingQuantizeFloatToFloatOrHalfBenchmark (#5622)
[Improvement] Use double in dequant ref/scalar to match FMA precision (#5623)
[Improvement] Remove legacy quantize path (#5624)
[Improvement] Cleanup stale code for ROCM < 6.2 and CUDA < 12 (#5616) (#5625)
[Fix] Fix stale pytorch version checks (#5631)
[Improvement] Enable AVX512-BF16 dequant in OSS CMake and Bazel builds (#5635)
[Fix] [fbgemm_gpu] Fix aarch64 build issues caused by D99968947 (#5655)
[Fix] Fix OOB read in _get_padding_value_kernel (#5652) (#5662)
[Improvement] Add trace export to mixdim benchmark and fix FP16 benchmark consistency (#5665)
[Fix] Add FP8 rowwise padding to quantized AllToAll pooled embeddings (#5673)
[Fix] Relax numerical tolerances in KV cache quantization tests (#5681)
[Improvement] Remove LEGACY parameter entirely from batch Quantize overload for API consistency (#5683)
[Improvement] Harden rowwise quantize benchmark with Kineto trace export (#5693)
[Fix] Fix fbgemm_dev build/test health issues (#5694)
[Improvement] benchmarks + stats tooling for bf16 AVX2 8-bit / N-bit dequant (D100932926) (#5709)
[Improvement] Remove unused test parameters (#5725)
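
Several entries above touch rowwise quantization, where each row carries its own scale and bias. The pure-Python sketch below shows the general 8-bit scheme under simple assumptions: scale and bias are kept as Python floats, whereas the real kernels store them in a compact floating-point format, and `quantize_row` / `dequantize_row` are hypothetical names rather than FBGEMM functions.

```python
def quantize_row(row):
    """Quantize one row of floats to integers in [0, 255].

    Returns (quantized values, scale, bias). Illustrative only.
    """
    lo, hi = min(row), max(row)
    scale = (hi - lo) / 255.0
    if scale == 0.0:
        scale = 1.0  # constant row: any scale works, avoid dividing by zero
    quantized = [min(255, max(0, round((x - lo) / scale))) for x in row]
    return quantized, scale, lo


def dequantize_row(quantized, scale, bias):
    """Reconstruct approximate floats from the quantized row."""
    return [q * scale + bias for q in quantized]


row = [-1.0, 0.0, 0.5, 2.0]
q, scale, bias = quantize_row(row)
restored = dequantize_row(q, scale, bias)
# Rounding to the nearest level bounds the error by half a quantization step.
max_error = max(abs(a - b) for a, b in zip(row, restored))
```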

Sparse Operators

[Improvement] Fixes and improvements to permute_2d_sparse_data_bench (#5477)
[Improvement] Add heterogeneous per-group input shapes support to group_index_select_2d_bench (#5487)
[Improvement] Add permute_1d comparison scripts and CPU cache flushing for old benchmark (#5492)
[Improvement] Add assertion to guard against overflow in keyed_jagged_index_select_dim1 (#5500)
[Improvement] Port reorder_batched_sequence_embeddings benchmark over to tritonbench (#5504)
[Improvement] Harden asynchronous_complete_cumsum_2d_bench in sparse_ops_benchmark (#5515)
[Improvement] Add meta function for block_bucketize_sparse_features_inference (#5529)
[Improvement] Enable block_bucketize* tests on ROCm (#5527) (#5532)
[Improvement] Remove redundant CUDA_KERNEL_ASSERTs in keyed_jagged_index_select_dim1 (#5539)
[Fix] Fix int32 overflow in keyed_jagged_index_select_dim1 (#5544)
[Improvement] Upgrade batch_reuse_index_select_device benchmark (#5562)
[Improvement] Improve keyed_jagged_index_select_dim1 and masked_select_jagged_1d bench (#5613)
[Improvement] Remove torch_compiled (#5617)
[Improvement] Validate total_num_blocks divisibility by my_size in block_bucketize (#5646)
[Fix] Fix 2 broken tests caused by D101141810 (#5654)
[Improvement] Add my_size > 0 guard and inference negative test for block_bucketize (#5663)
[Improvement] Optimize jagged_unique_indices_cuda (binary-search length + custom cub pipeline) (#5718)
[Fix] Fix Hypothesis differing_executors health check failure in index select (#5721)
[Improvement] Add unit test for batch_index_select_dim0 with large segment lengths (#5722)
[Fix] Fix int32 stride overflow in jagged_to_padded_dense at BLD > INT_MAX (#5755)
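
For readers unfamiliar with the jagged layout behind #5755, the sketch below is a slow pure-Python reference for converting jagged values to a padded dense tensor, plus a size check. It assumes "BLD" denotes batch size × max length × embedding dimension; both function names are hypothetical and this is not the CUDA kernel.

```python
INT32_MAX = 2**31 - 1


def jagged_to_padded_dense(values, offsets, max_length, padding_value=0.0):
    """Convert jagged rows to a dense [B][max_length][D] nested list.

    values:  list of rows, each a list of D floats
    offsets: B + 1 integers; sample b owns values[offsets[b]:offsets[b + 1]]
    Sequences longer than max_length are truncated, shorter ones padded.
    """
    batch = len(offsets) - 1
    dim = len(values[0]) if values else 0
    dense = []
    for b in range(batch):
        rows = values[offsets[b]:offsets[b + 1]][:max_length]
        padding = [[padding_value] * dim for _ in range(max_length - len(rows))]
        dense.append([list(r) for r in rows] + padding)
    return dense


def needs_64bit_indexing(batch, max_length, dim):
    """True when the dense element count no longer fits in a signed int32."""
    return batch * max_length * dim > INT32_MAX


values = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
offsets = [0, 2, 2, 3]  # sample lengths: 2, 0, 1
dense = jagged_to_padded_dense(values, offsets, max_length=2)
```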

Build / CI Improvements and Better Engineering

[Improvement] Remove NCCLX one-sided comm code from fbgemm (#5475)
[Improvement] Add missing copyright headers to Meta-authored files (#5482)
[Improvement] Add Meta copyright headers to modified NVIDIA CUTLASS files (#5483)
[Improvement] Add Portions Copyright headers to modified AMD CK/ROCm gen_ai files (#5484)
[Improvement] Add Portions Copyright headers to modified third-party include files (#5485)
[Improvement] Add Portions Copyright headers to modified Arm KleidiAI files (#5486)
[Fix] Fix Vec2/Vec4 UVM performance regression with vectorized at::BFloat16 loads/stores (#5489)
[Fix] Fix Vec2/Vec4 UVM performance regression with vectorized at::Half copy (#5491)
[Improvement] Install libdw (#5495)
[Fix] Remove _test suffix from package name for test channel builds (#5502)
[Improvement] Update docs and compatibility table for FBGEMM v1.6.0 release (#5503)
[Fix] Fix build-time error for tbb in CentOS (#5497) (#5516)
[Improvement] Update default CUDA version to 13.0.2 (#5524)
[Improvement] Move internal enrichment files to fb/ for OSS exclusion (#5541)
[Fix] Fix empty key lookup in gpu_detect.bash (#5551)
[Improvement] Bump minimum GCC to 11.4 (#5537) (#5553)
[Fix] Remove omp_set_num_threads from RadixSortTest to fix ASan leak (#5555)
[Improvement] Enable more clang-tidy checks on C++20 (#5575)
[Improvement] Add checks for uninitialized storage (#5579)
[Improvement] Simplify array_of_ones and remove array_of_zeroes (#5573) (#5581)
[Improvement] simplify PackingTraits methods (#5574) (#5582)
[Improvement] Simplify FP code (#5577) (#5583)
[Improvement] Replace SFINAE with C++20 concepts and requires clauses (#5586)
[Improvement] Use std::bit_cast and std::countl_zero in C++20 (#5592)
[Improvement] Use supported std::ranges algorithms (#5593)
[Improvement] Use CUB_WRAPPED_NAMESPACE instead of legacy CUB_NS_PREFIX (#5601)
[Fix] Strip -std=c++NN flag from pytorch package (#5604)
[Improvement] simplify ALIGNAS, remove useless attributes and stale CUDA workaround (#5608)
[Improvement] Add aligned_unique_ptr RAII wrapper to avoid leak risks (#5609)
[Improvement] Add CUDA 13.2 support to CI and release workflows (#5610) (reverted in this release; see #5750)
[Improvement] Remove dead CUDA < 11 workarounds and simplify bf16/CUB guards (#5600) (#5612)
[Improvement] Unify duplicated cmake code between CPU and GPU builds (#5629)
[Improvement] Use C++20 [[unlikely]] and defaulted operator== (#5630)
[Fix] Fix 3 broken tests caused by D100185387 (#5656)
[Fix] Fix pyre type annotations in test_utils.py (#5660)
[Improvement] Fix flake8 E402 warnings (#5658) (#5661)
[Fix] Fix duplicate symbol linker errors on ARM builds (#5664)
[Fix] Fix OSS CI ModuleNotFoundError: explicit pip in conda env (#5691)
[Improvement] Enable device-side assertions on ROCm (#5723)
[Improvement] Re-enable get_cuda_error_help in kernel error message (#5724)
[Improvement] Replace rocm-smi with amd-smi across ROCm build, CI, and docs (#5597) (#5726)
[Fix] Enable AMD tests for ZCH & Fix OSS (#5727)
[Improvement] Add FBGEMM_NO_JK=2 (EnvFirstThenJk) policy; refactor feature-gate lookup into singleton (#5748)
[Fix] Revert CUDA 13.2 enablement (#5610) due to OSS CI cost regression and upstream conda-forge instability (#5750)
[Improvement] Annotate unused function (#5758)
[Fix] Remove erroneous NVIDIA proprietary block from BSD-3 LICENSE (#5760)

Tests and Benchmarks

[New] Add common scripts for benchmark trace analysis (#5498)
[Improvement] Re-organize diff benchmarking scripts (#5508)
[Improvement] Set manual seed for fbgemm benchmark (#5540)
[Improvement] Benchmarks for D98170783 (#5547)
[Improvement] Remove pt2_cpu stubs and move isValidBlockingFactor (#5556)
[Improvement] Benchmark code refactoring (#5632)
[Improvement] Add --device and --export-trace flags to stride_gemm_benchmark (#5671)
[Improvement] Harden repeat_arange benchmark with input validation and trace export (#5676)
[Improvement] Harden histogram_binning_calibration benchmark with input validation and trace export (#5687)
[Fix] Fix type annotation (#5695)
[New] Add scripts for analyzing bench runs (#5731)
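
The analysis scripts added in #5498 and #5731 are not reproduced in these notes. As a minimal sketch of what such analysis can look like, assuming the exported traces use the Chrome trace event JSON format (a `traceEvents` array with complete events marked `"ph": "X"` and durations in microseconds), the hypothetical `summarize_trace` function below totals time per event name.

```python
import json
from collections import defaultdict


def summarize_trace(trace_json):
    """Return (name, total duration in microseconds) pairs, longest first.

    Accepts either an object with a "traceEvents" array or a bare array.
    Only complete events ("ph" == "X") are counted. Illustrative only.
    """
    trace = json.loads(trace_json)
    events = trace["traceEvents"] if isinstance(trace, dict) else trace
    totals = defaultdict(float)
    for event in events:
        if event.get("ph") == "X":
            totals[event["name"]] += event.get("dur", 0)
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Tiny hand-written trace; the event names are made up for the example.
sample = json.dumps({
    "traceEvents": [
        {"name": "kernel_a", "ph": "X", "ts": 0, "dur": 10},
        {"name": "kernel_b", "ph": "X", "ts": 10, "dur": 5},
        {"name": "kernel_a", "ph": "X", "ts": 15, "dur": 20},
        {"name": "marker", "ph": "i", "ts": 35},
    ]
})
summary = summarize_trace(sample)
```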
