github pytorch/FBGEMM v1.7.0
FBGEMM_GPU v1.7.0 Release Notes

Highlights

Inference & Production Deployment

  • TurboSSDInferenceModule with streaming updates and snapshot loading for HSTU serving (#5558, #5554)
  • AMD/ROCm support for SSD TBE inference with cache locking and dedicated memcpy streams (#5559, #5480)
  • TBE EEG (Embedding Export Gateway) for inference workloads (#5688)
  • DRAM KV cache and L2 cache hit rate metrics for production monitoring (#5633, #5730)

Enrichment & Feature Store Integration

  • Configurable IGR enrichment support for DRAM KV embedding cache (#5463, #5488)
  • OneFlow OpenTab and Feature Store enrichment backends (#5465, #5466, #5493, #5494)
  • Per-feature pooling factors support for flexible embedding architectures (#5690)

Performance Optimizations

  • Double-buffered eviction and auto-sized RocksDB block cache reducing prefetch stalls (#5512, #5513)
  • Precomputed writeback dedup indices eliminating GPU-CPU sync in backward pass (#5522)
  • Optimized jagged_unique_indices_cuda with binary-search and custom CUB pipeline (#5718)
  • Vectorized FP16 row conversion in rowwise quantization (#5596)

Quantization & GenAI

  • BF16 scale/bias support for INT4 quantization (#5595)
  • AVX512-BF16 dequantization enabled in OSS builds (#5635)
  • FP8 rowwise padding for quantized AllToAll pooled embeddings (#5673)
  • New Triton IKBO LCE kernel and TLX IKBO Flash Attention (#5521, #5651)

Platform & Hardware Support

  • SVE-FP16 version of EmbeddingSpMDM8Bit for ARM architectures (#5720)
  • UVM pipeline support for MTIA accelerators (#5538)
  • Preallocated host buffer support for CPU TBE (#5692)

Developer Experience

  • C++20 modernization: concepts, requires clauses, std::ranges, and std::bit_cast (#5586, #5592, #5593)
  • Comprehensive benchmark trace export and analysis tooling (#5498, #5731, #5671, #5693)
  • Minimum GCC bumped to 11.4 for better C++20 support (#5553)

Software Requirements

FBGEMM_GPU v1.7.0 has been tested and is known to work on the following setups:

  • PyTorch: 2.12.x
  • CUDA: 12.6, 12.8, 12.9, 13.0
  • Python: 3.10, 3.11, 3.12, 3.13, 3.14
  • ROCm: 7.0, 7.1

An isolated environment, such as Conda and/or Docker, is recommended for installing and running FBGEMM_GPU.
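
The tested matrix above can be turned into a quick pre-install check. The following is a minimal sketch and not part of FBGEMM_GPU: the helper `is_tested_python` and the constant `TESTED_PYTHON` are hypothetical names, and the version list is copied from these notes.

```python
import sys

# Python minor versions listed as tested for FBGEMM_GPU v1.7.0 in these notes.
TESTED_PYTHON = {(3, 10), (3, 11), (3, 12), (3, 13), (3, 14)}


def is_tested_python(version_info=None):
    """Return True if the (major, minor) pair is in the tested list."""
    info = sys.version_info if version_info is None else version_info
    return (info[0], info[1]) in TESTED_PYTHON


if __name__ == "__main__":
    status = "tested" if is_tested_python() else "untested"
    print(f"Python {sys.version_info[0]}.{sys.version_info[1]}: {status}")
```

An interpreter outside the list may still work; the check only reports whether the running version is one the release was tested against.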

Availability

FBGEMM_GPU can be fetched directly from PyPI:

# FBGEMM_GPU CUDA variant
pip install fbgemm-gpu==1.7.0

# FBGEMM_GPU CPU variant
pip install fbgemm-gpu-cpu==1.7.0

Alternatively, it can be fetched from the PyTorch package index:

# FBGEMM_GPU CUDA variant
pip install fbgemm-gpu==1.7.0 --index-url https://download.pytorch.org/whl/cu126/
pip install fbgemm-gpu==1.7.0 --index-url https://download.pytorch.org/whl/cu128/
pip install fbgemm-gpu==1.7.0 --index-url https://download.pytorch.org/whl/cu129/
pip install fbgemm-gpu==1.7.0 --index-url https://download.pytorch.org/whl/cu130/

# FBGEMM_GPU CPU variant
pip install fbgemm-gpu==1.7.0 --index-url https://download.pytorch.org/whl/cpu
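
The index URLs above follow a regular pattern (`cu` plus the CUDA version without the dot). As a rough sketch under that assumption, the helpers below build the URL for a given CUDA version and report which variant, if any, is installed. The names `wheel_index_url` and `installed_version` are hypothetical; only the standard-library `importlib.metadata` is used.

```python
from importlib import metadata

# CUDA versions listed as tested for v1.7.0 in these notes.
TESTED_CUDA = ("12.6", "12.8", "12.9", "13.0")


def wheel_index_url(cuda_version=None):
    """Build the PyTorch index URL for a CUDA version, or the CPU index if None."""
    if cuda_version is None:
        return "https://download.pytorch.org/whl/cpu"
    if cuda_version not in TESTED_CUDA:
        raise ValueError(f"CUDA {cuda_version} is not in the tested list")
    return f"https://download.pytorch.org/whl/cu{cuda_version.replace('.', '')}/"


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    print(wheel_index_url("12.8"))
    for name in ("fbgemm-gpu", "fbgemm-gpu-cpu"):
        print(name, installed_version(name))
```

After installation, `installed_version("fbgemm-gpu")` should return `1.7.0` for the CUDA variant, assuming the distribution name matches the one used in the `pip` commands above.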

Changes

Table Batched Embedding (TBE) Operators

For GPU

  • [New] Add configurable IGR enrichment support for DRAM KV embedding cache (#5463)
  • [New] Add OneFlow OpenTab enrichment backend support (#5465)
  • [New] Add OneFlow Feature Store enrichment backend and refactor dispatch (#5466)
  • [New] Add sync fetch_sids_sync API for publish SID→VID mapping (#5467)
  • [Fix] Fix device mismatch errors in benchmark (#5490)
  • [Fix] Fix FBGEMM_MEMCHECK bug in vbe_metadata kernel (#5506)
  • [Fix] Fix bug in momentum type declaration in HIP TBE kernel (#5501) (#5514)
  • [Improvement] 52 [A] (#5518)
  • [Improvement] Add unit test to validate HIP backward kernel with FP16 momentum (#5519)
  • [Fix] Fix test_cache_int32_overflow test failure on ROCm (#5526) (#5530)
  • [Fix] Fix output_type=BF16 test_backward_adagrad_large_dims failure (#5531)
  • [Improvement] Fix ufmt lint: remove extra blank line in lxu_cache_test.py (#5535)
  • [Fix] Fix autovec EmbeddingSpMDMNBit to handle pruned (-1) indices (#5543)
  • [Improvement] Add periodic logging for L2 cache fill wait time (#5549)
  • [Fix] Use sym_numel() instead of numel() in TBE pt2 backward codegen (#5563)
  • [Fix] Fix TBE v2 forward kernel for embedding dim > 1024 (#5326) (#5569)
  • [Improvement] Fold UVM code into TBE package (#5576)
  • [Improvement] Migrate tbe_input_multiplexer.py and runtime_monitor.py into tbe/monitoring/ (#5590)
  • [Improvement] Fold tbe/stats/ into tbe/monitoring/ for better organization (#5591)
  • [Fix] Fix int32 truncation in tbe_input_combine offset accumulation (#5594)
  • [Fix] Get actual free GPU memory in test_cache_int32_overflow (#5605)
  • [Fix] Make warp segment threshold consistent with host function (#5606)
  • [Fix] Revert D99867633 (#5607)
  • [Fix] Fix tbe combine tests (#5614)
  • [Fix] Use aligned_unique_ptr in more places to avoid leak (#5621)
  • [Improvement] Remove stale ROCm 5.7 skip checks and dead SM70 code in tests (#5619) (#5626)
  • [Improvement] Simplify is_torchdynamo_compiling to direct import from torch.compiler (#5618) (#5628)
  • [Fix] Support multi-dimensional runtime_meta in RES streaming buffers by lazy init (#5643)
  • [Fix] Add missing C10_CUDA_CHECK (#5647)
  • [Fix] Fix VBE batch sizes not passed to request builder (#5653)
  • [Improvement] log query empty count vs total count (#5657)
  • [Improvement] Use newer STL features in codegen templates (#5659)
  • [Improvement] Use Python 3.10+ typing in TBE ops and utilities (#5667)
  • [Fix] Apply proper grid striding on forward V2 kernel for ROCm (#5447) (#5669)
  • [Improvement] Exclude transient RES streaming buffers from checkpoints by setting persistent=False (#5674)
  • [Improvement] Use Python 3.10+ typing in core TBE ops (#5675)
  • [Improvement] TBE benchmark suites improvement (#5677)
  • [Improvement] Refactor bounds_check_indices offset checks to condition-first (Phase 1) (#5682)
  • [New] TBE EEG for Inference (#5688)
  • [New] Add per-feature pooling factors support (#5690)
  • [New] Add SVE-FP16 version of EmbeddingSpMDM8Bit (#5720)
  • [Improvement] Remove unnecessary field for FixedBlockPool in inference (#5729)
  • [Fix] Fix find_long_segments kernel launch failure for batch index select (#5732)
  • [Improvement] support warpSize 32 and 64 in the same build (#5739)
  • [New] Create tbe/config/ package with foundational embedding types (#5742)
  • [Improvement] Remove unnecessary __syncthreads in bounds_check_indices_kernel_v2 (#5744)
  • [New] Add cache config types to tbe/cache/ package (#5752)

For CPU

  • [Improvement] Replace spin-wait polling with condition variable in EmbeddingKVDB fill queue (#5510)
  • [Improvement] Precompute writeback dedup indices in forward to eliminate GPU-CPU sync in backward (#5522)
  • [Fix] Fix CPU TBE inline bounds check for unified embedding (#5523)
  • [Fix] Fix fused TBE weight buffer for MTIA (#5534)
  • [New] Add UVM pipeline support for MTIA (#5538)
  • [New] Add preallocated host buffer support to FBGEMM SplitTableBatchedEmbeddingBagsCodegen (#5692)
  • [Improvement] Enable TBE nobag backward test for SGD on CPU (#5759)

SSD Table Batched Embedding (TBE) Operators

  • [Improvement] Move compute thresholds logic for eviction (#5453)
  • [Improvement] enable feature score auto collection in EBC (#5459)
  • [Improvement] Migrate cudaStreamAddCallback to cudaLaunchHostFunc (#5462)
  • [Improvement] Add Python enum configs and KJT builder for enrichment (#5464)
  • [Improvement] Add cache locking and dedicated memcpy stream for SSD TBE inference (#5480)
  • [New] Enable RES for DRAM KV embedding cache (#5488)
  • [New] Add OneFlow OpenTab enrichment backend support (#5493)
  • [New] Add OneFlow Feature Store enrichment backend and refactor dispatch (#5494)
  • [Fix] Fix race conditions (#5496)
  • [Improvement] Use atomicAdd for lxu_cache_locking_counter increments/decrements (#5509)
  • [Improvement] Tune RocksDB bloom filter and background thread pool sizing (#5511)
  • [Improvement] Double-buffer eviction buffers to reduce prefetch stalls (#5512)
  • [Improvement] Auto-size RocksDB block cache and expose L2 cache hit rate (#5513)
  • [Fix] Fix race conditions: make shared mutable state atomic (#5520)
  • [Fix] Fix sorted_ids None issue in SSD TBE optimizer state fetching (#5525)
  • [Improvement] Make inference cache locking opt-in via enable_cache_locking flag (#5546)
  • [New] Add embedding cache support to oneflow base model (#5552)
  • [New] Add streaming_update() and load_snapshot() for inference (#5554)
  • [New] Add TurboSSDInferenceModule for HSTU serving integration (#5558)
  • [New] Add AMD/ROCm support for SSD TBE inference (#5559)
  • [Improvement] Support input data not most recent in MP-ZCH (#5567) (#5570)
  • [Fix] Fix lint (#5611)
  • [New] Add DRAM KV cache and L1 hit rate metrics for training (#5633)
  • [Improvement] Skip scratch pad eviction data in enrichment mode to avoid cudaFree overhead (#5645)
  • [Improvement] Add laser_batch_size to IGR enrichment and add sleep for enrichment (#5697)
  • [Fix] Gate enrichment_policy by per-TBE embedding_cache_mode (#5698)
  • [Fix] Add spin-loop termination for AMD GPU hang on MP-ZCH (#5714)
  • [Improvement] Add unit tests for warp primitives, bitonic sort, and ROCm warpSize guards (#5715)
  • [Improvement] Add KVZCH inference read-time hit rate metrics via fb303 ODS counters (#5730)
  • [Fix] Fix int32 truncation of 64-bit ssd_row_addrs in unrolled forward path (#5743)
  • [Improvement] Add KVZCH inference read-time hit rate metrics via fb303 ODS counters (#5745)
  • [Improvement] Add SSD/KVZCH config types to tbe/ssd/ package (#5753)

GenAI Support and Operators

Triton GEMM Support

  • [Improvement] Port reorder_batched_ad_lengths benchmark to tritonbench (#5505)
  • [New] IKBO LCE kernel in fbgemm (#5521)
  • [Improvement] Port group_index_select_2d to tritonbench (#5533)
  • [Improvement] Add Portions Copyright headers to modified third-party files (#5545)
  • [Improvement] Port jagged_index_select_2d benchmark to tritonbench (#5572)
  • [Improvement] Port bench_dense_to_jagged_1d to tritonbench (#5580)
  • [Improvement] Port bench_jagged_1d_to_dense to tritonbench (#5584)
  • [Improvement] Port bench_jagged_2d_to_dense and bench_dense_to_jagged_2d to (#5598)
  • [Improvement] Port jagged_dense_dense_elementwise_add_jagged_output and jagged_dense_elementwise_op_jagged_output to tritonbench (#5602)
  • [Improvement] Update jagged_acc_weights_and_counts and jagged_slice_cpu bench (#5620)
  • [Improvement] Upgrade permute_multi_embedding benchmark (#5627)
  • [Improvement] Upgrade batched_unary_embeddings benchmark (#5639)
  • [Improvement] Use Python 3.10+ typing in sparse/quantize/triton/utils (#5636) (#5642)
  • [New] Triton/TLX IKBO FA (#5651)
  • [Fix] TLX IKBO FA benchmarking with latest commit hash + bug fix (#5734)

Quantization Operators

  • [Fix] Fix fp16 code on aarch64 and Windows builds (#5548) (#5550)
  • [Improvement] merge SFINAE overloads of CodeGenHelpers templates with if constexpr (#5565) (#5571)
  • [Improvement] bf16 scale/bias for INT4 (#5595)
  • [Improvement] Vectorize fp16 row conversion in rowwise quantization (#5596)
  • [Fix] Fix EmbeddingQuantizeFloatToFloatOrHalfBenchmark (#5622)
  • [Improvement] Use double in dequant ref/scalar to match FMA precision (#5623)
  • [Improvement] Remove legacy quantize path (#5624)
  • [Improvement] Cleanup stale code for ROCM < 6.2 and CUDA < 12 (#5616) (#5625)
  • [Fix] Fix stale pytorch version checks (#5631)
  • [Improvement] Enable AVX512-BF16 dequant in OSS CMake and Bazel builds (#5635)
  • [Fix] [fbgemm_gpu] Fix aarch64 build issues caused by D99968947 (#5655)
  • [Fix] Fix OOB read in _get_padding_value_kernel (#5652) (#5662)
  • [Improvement] Add trace export to mixdim benchmark and fix FP16 benchmark consistency (#5665)
  • [Fix] Add FP8 rowwise padding to quantized AllToAll pooled embeddings (#5673)
  • [Fix] Relax numerical tolerances in KV cache quantization tests (#5681)
  • [Improvement] Remove LEGACY parameter entirely from batch Quantize overload for API consistency (#5683)
  • [Improvement] Harden rowwise quantize benchmark with Kineto trace export (#5693)
  • [Fix] Fix fbgemm_dev build/test health issues (#5694)
  • [Improvement] benchmarks + stats tooling for bf16 AVX2 8-bit / N-bit dequant (D100932926) (#5709)
  • [Improvement] Remove unused test parameters (#5725)
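
Several entries above concern rowwise quantization with a per-row scale and bias. For readers unfamiliar with the idea, here is a minimal pure-Python sketch of min/max affine rowwise quantization. It illustrates the general technique only: the function names are hypothetical, and FBGEMM's actual storage layout, rounding mode, and scale/bias data types (FP16, BF16) are not modeled.

```python
def quantize_row(row, bits=8):
    """Quantize one row of floats to unsigned codes with a per-row scale and bias."""
    levels = (1 << bits) - 1
    bias = min(row)
    span = max(row) - bias
    # A constant row has zero span; any nonzero scale maps every value to code 0.
    scale = span / levels if span > 0 else 1.0
    codes = [round((x - bias) / scale) for x in row]
    return codes, scale, bias


def dequantize_row(codes, scale, bias):
    """Reconstruct approximate floats from the codes and the row's scale and bias."""
    return [c * scale + bias for c in codes]


if __name__ == "__main__":
    row = [-1.0, 0.0, 0.5, 2.0]
    codes, scale, bias = quantize_row(row)
    restored = dequantize_row(codes, scale, bias)
    # The rounding error per element is bounded by half a quantization step.
    worst = max(abs(a - b) for a, b in zip(row, restored))
    print(codes, worst <= scale / 2 + 1e-12)
```

Passing `bits=4` gives the INT4 analogue with 16 levels per row; the per-row scale and bias are what entries such as #5595 store in a narrower floating-point type.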

Sparse Operators

  • [Improvement] Fixes and improvements to permute_2d_sparse_data_bench (#5477)
  • [Improvement] Add heterogeneous per-group input shapes support to group_index_select_2d_bench (#5487)
  • [Improvement] Add permute_1d comparison scripts and CPU cache flushing for old benchmark (#5492)
  • [Improvement] Add assertion to guard against overflow in keyed_jagged_index_select_dim1 (#5500)
  • [Improvement] Port reorder_batched_sequence_embeddings benchmark over to tritonbench (#5504)
  • [Improvement] Harden asynchronous_complete_cumsum_2d_bench in sparse_ops_benchmark (#5515)
  • [Improvement] Add meta function for block_bucketize_sparse_features_inference (#5529)
  • [Improvement] Enable block_bucketize* tests on ROCm (#5527) (#5532)
  • [Improvement] Remove redundant CUDA_KERNEL_ASSERTs in keyed_jagged_index_select_dim1 (#5539)
  • [Fix] Fix int32 overflow in keyed_jagged_index_select_dim1 (#5544)
  • [Improvement] Upgrade batch_reuse_index_select_device benchmark (#5562)
  • [Improvement] Improve keyed_jagged_index_select_dim1 and masked_select_jagged_1d bench (#5613)
  • [Improvement] Remove torch_compiled (#5617)
  • [Improvement] Validate total_num_blocks divisibility by my_size in block_bucketize (#5646)
  • [Fix] Fix 2 broken tests caused by D101141810 (#5654)
  • [Improvement] Add my_size > 0 guard and inference negative test for block_bucketize (#5663)
  • [Improvement] Optimize jagged_unique_indices_cuda (binary-search length + custom cub pipeline) (#5718)
  • [Fix] Fix Hypothesis differing_executors health check failure in index select (#5721)
  • [Improvement] Add unit test for batch_index_select_dim0 with large segment lengths (#5722)
  • [Fix] Fix int32 stride overflow in jagged_to_padded_dense at BLD > INT_MAX (#5755)
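
Several fixes in this release (for example #5544 and #5755 above) address 32-bit overflow in offset or stride arithmetic. The sketch below shows the general failure mode in pure Python by emulating a 32-bit accumulator; it is an illustration of the bug class, not the FBGEMM code, and the helper names are hypothetical.

```python
def wrap_int32(value):
    """Emulate truncation of a Python int to a two's-complement int32."""
    return (value + 2**31) % 2**32 - 2**31


def offsets(lengths, wrap=False):
    """Prefix-sum offsets starting at 0; wrap=True mimics a 32-bit accumulator."""
    out, total = [0], 0
    for n in lengths:
        total = wrap_int32(total + n) if wrap else total + n
        out.append(total)
    return out


if __name__ == "__main__":
    lengths = [2**30, 2**30, 2**30]  # the running total exceeds 2**31 - 1
    print(offsets(lengths))             # wide accumulator: offsets keep growing
    print(offsets(lengths, wrap=True))  # 32-bit accumulator: offsets go negative
```

With a 32-bit accumulator the second offset wraps to a negative value, which is why such fixes typically widen the accumulator to 64 bits or guard the inputs with an assertion.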

Build / CI Improvements and Better Engineering

  • [Improvement] Remove NCCLX one-sided comm code from fbgemm (#5475)
  • [Improvement] Add missing copyright headers to Meta-authored files (#5482)
  • [Improvement] Add Meta copyright headers to modified NVIDIA CUTLASS files (#5483)
  • [Improvement] Add Portions Copyright headers to modified AMD CK/ROCm gen_ai files (#5484)
  • [Improvement] Add Portions Copyright headers to modified third-party include files (#5485)
  • [Improvement] Add Portions Copyright headers to modified Arm KleidiAI files (#5486)
  • [Fix] Fix Vec2/Vec4 UVM performance regression with vectorized at::BFloat16 loads/stores (#5489)
  • [Fix] Fix Vec2/Vec4 UVM performance regression with vectorized at::Half copy (#5491)
  • [Improvement] Install libdw (#5495)
  • [Fix] Remove _test suffix from package name for test channel builds (#5502)
  • [Improvement] Update docs and compatibility table for FBGEMM v1.6.0 release (#5503)
  • [Fix] Fix build-time error for tbb in CentOS (#5497) (#5516)
  • [Improvement] Update default CUDA version to 13.0.2 (#5524)
  • [Improvement] Move internal enrichment files to fb/ for OSS exclusion (#5541)
  • [Fix] Fix empty key lookup in gpu_detect.bash (#5551)
  • [Improvement] Bump minimum GCC to 11.4 (#5537) (#5553)
  • [Fix] Remove omp_set_num_threads from RadixSortTest to fix ASan leak (#5555)
  • [Improvement] Enable more clang-tidy checks on C++20 (#5575)
  • [Improvement] Add checks for uninitialized storage (#5579)
  • [Improvement] Simplify array_of_ones and remove array_of_zeroes (#5573) (#5581)
  • [Improvement] simplify PackingTraits methods (#5574) (#5582)
  • [Improvement] Simplify FP code (#5577) (#5583)
  • [Improvement] Replace SFINAE with C++20 concepts and requires clauses (#5586)
  • [Improvement] Use std::bit_cast and std::countl_zero in C++20 (#5592)
  • [Improvement] Use supported std::ranges algorithms (#5593)
  • [Improvement] Use CUB_WRAPPED_NAMESPACE instead of legacy CUB_NS_PREFIX (#5601)
  • [Fix] Strip -std=c++NN flag from pytorch package (#5604)
  • [Improvement] simplify ALIGNAS, remove useless attributes and stale CUDA workaround (#5608)
  • [Improvement] Add aligned_unique_ptr RAII wrapper to avoid leak risks (#5609)
  • [Improvement] Add CUDA 13.2 support to CI and release workflows (#5610) — reverted in this release; see #5750
  • [Improvement] Remove dead CUDA < 11 workarounds and simplify bf16/CUB guards (#5600) (#5612)
  • [Improvement] Unify duplicated cmake code between CPU and GPU builds (#5629)
  • [Improvement] Use C++20 [[unlikely]] and defaulted operator== (#5630)
  • [Fix] Fix 3 broken tests caused by D100185387 (#5656)
  • [Fix] Fix pyre type annotations in test_utils.py (#5660)
  • [Improvement] Fix flake8 E402 warnings (#5658) (#5661)
  • [Fix] Fix duplicate symbol linker errors on ARM builds (#5664)
  • [Fix] Fix OSS CI ModuleNotFoundError: explicit pip in conda env (#5691)
  • [Improvement] Enable device-side assertions on ROCm (#5723)
  • [Improvement] Re-enable get_cuda_error_help in kernel error message (#5724)
  • [Improvement] Replace rocm-smi with amd-smi across ROCm build, CI, and docs (#5597) (#5726)
  • [Fix] Enable AMD tests for ZCH & Fix OSS (#5727)
  • [Improvement] Add FBGEMM_NO_JK=2 (EnvFirstThenJk) policy; refactor feature-gate lookup into singleton (#5748)
  • [Fix] Revert CUDA 13.2 enablement (#5610) due to OSS CI cost regression and upstream conda-forge instability (#5750)
  • [Improvement] Annotate unused function (#5758)
  • [Fix] Remove erroneous NVIDIA proprietary block from BSD-3 LICENSE (#5760)

Tests and Benchmarks

  • [New] Add common scripts for benchmark trace analysis (#5498)
  • [Improvement] Re-organize diff benchmarking scripts (#5508)
  • [Improvement] Set manual seed for fbgemm benchmark (#5540)
  • [Improvement] Benchmarks for D98170783 (#5547)
  • [Improvement] Remove pt2_cpu stubs and move isValidBlockingFactor (#5556)
  • [Improvement] Benchmark code refactoring (#5632)
  • [Improvement] Add --device and --export-trace flags to stride_gemm_benchmark (#5671)
  • [Improvement] Harden repeat_arange benchmark with input validation and trace export (#5676)
  • [Improvement] Harden histogram_binning_calibration benchmark with input validation and trace export (#5687)
  • [Fix] Fix type annotation (#5695)
  • [New] Add scripts for analyzing bench runs (#5731)
