Highlights
Inference & Production Deployment
- TurboSSDInferenceModule with streaming updates and snapshot loading for HSTU serving (#5558, #5554)
- AMD/ROCm support for SSD TBE inference with cache locking and dedicated memcpy streams (#5559, #5480)
- TBE EEG for inference workloads (#5688)
- DRAM KV cache and L1 hit rate metrics for production monitoring (#5633, #5730)
Enrichment & Feature Store Integration
- Configurable IGR enrichment support for DRAM KV embedding cache (#5463, #5488)
- OneFlow OpenTab and Feature Store enrichment backends (#5465, #5466, #5493, #5494)
- Per-feature pooling factors support for flexible embedding architectures (#5690)
Performance Optimizations
- Double-buffered eviction and auto-sized RocksDB block cache reducing prefetch stalls (#5512, #5513)
- Precomputed writeback dedup indices eliminating GPU-CPU sync in backward pass (#5522)
- Optimized jagged_unique_indices_cuda with binary-search and custom CUB pipeline (#5718)
- Vectorized FP16 row conversion in rowwise quantization (#5596)
Quantization & GenAI
- BF16 scale/bias support for INT4 quantization (#5595)
- AVX512-BF16 dequantization enabled in OSS builds (#5635)
- FP8 rowwise padding for quantized AllToAll pooled embeddings (#5673)
- New Triton IKBO LCE kernel and TLX IKBO Flash Attention (#5521, #5651)
Platform & Hardware Support
- SVE-FP16 version of EmbeddingSpMDM8Bit for ARM architectures (#5720)
- UVM pipeline support for MTIA accelerators (#5538)
- Preallocated host buffer support for CPU TBE (#5692)
Developer Experience
- C++20 modernization: concepts, requires clauses, std::ranges, and std::bit_cast (#5586, #5592, #5593)
- Comprehensive benchmark trace export and analysis tooling (#5498, #5731, #5671, #5693)
- Minimum GCC bumped to 11.4 for better C++20 support (#5553)
Software Requirements
FBGEMM_GPU v1.7.0 has been tested and is known to work on the following setups:
- PyTorch: 2.12.x
- CUDA: 12.6, 12.8, 12.9, 13.0
- Python: 3.10, 3.11, 3.12, 3.13, 3.14
- ROCm: 7.0, 7.1
It is recommended to prepare an isolated environment for installing and running FBGEMM_GPU, such as Conda and/or Docker.
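As a quick sanity check before installing, the local interpreter can be compared against the tested Python versions listed above. The following is a minimal sketch and not part of FBGEMM_GPU; `TESTED_PYTHON` and `check_python` are illustrative names, and the version set is simply copied from the list above.

```python
import sys

# Python minor versions listed as tested for FBGEMM_GPU v1.7.0
# (copied from the Software Requirements list; not queried from the package).
TESTED_PYTHON = {(3, 10), (3, 11), (3, 12), (3, 13), (3, 14)}


def check_python(version_info=sys.version_info):
    """Return True if the (major, minor) pair is in the tested set."""
    return (version_info[0], version_info[1]) in TESTED_PYTHON


if __name__ == "__main__":
    status = "tested" if check_python() else "untested"
    major, minor = sys.version_info[0], sys.version_info[1]
    print(f"Python {major}.{minor}: {status} with FBGEMM_GPU v1.7.0")
```

An "untested" result does not necessarily mean the package will fail; it only means the combination is outside the matrix above.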
Availability
FBGEMM_GPU can be fetched directly from PyPI:
# FBGEMM_GPU CUDA variant
pip install fbgemm-gpu==1.7.0
# FBGEMM_GPU CPU variant
pip install fbgemm-gpu-cpu==1.7.0
Alternatively, it can be fetched from PyTorch PIP:
# FBGEMM_GPU CUDA variant
pip install fbgemm-gpu==1.7.0 --index-url https://download.pytorch.org/whl/cu126/
pip install fbgemm-gpu==1.7.0 --index-url https://download.pytorch.org/whl/cu128/
pip install fbgemm-gpu==1.7.0 --index-url https://download.pytorch.org/whl/cu129/
pip install fbgemm-gpu==1.7.0 --index-url https://download.pytorch.org/whl/cu130/
# FBGEMM_GPU CPU variant
pip install fbgemm-gpu==1.7.0 --index-url https://download.pytorch.org/whl/cpu
Changes
Table Batched Embedding (TBE) Operators
For GPU
- [New] Add configurable IGR enrichment support for DRAM KV embedding cache (#5463)
- [New] Add OneFlow OpenTab enrichment backend support (#5465)
- [New] Add OneFlow Feature Store enrichment backend and refactor dispatch (#5466)
- [New] Add sync fetch_sids_sync API for publish SID→VID mapping (#5467)
- [Fix] Fix device mismatch errors in benchmark (#5490)
- [Fix] Fix FBGEMM_MEMCHECK bug in vbe_metadata kernel (#5506)
- [Fix] Fix bug in momentum type declaration in HIP TBE kernel (#5501) (#5514)
- [Improvement] 52 [A] (#5518)
- [Improvement] Add unit test to validate HIP backward kernel with FP16 momentum (#5519)
- [Fix] Fix test_cache_int32_overflow test failure on ROCm (#5526) (#5530)
- [Fix] Fix output_type=BF16 test_backward_adagrad_large_dims failure (#5531)
- [Improvement] Fix ufmt lint: remove extra blank line in lxu_cache_test.py (#5535)
- [Fix] Fix autovec EmbeddingSpMDMNBit to handle pruned (-1) indices (#5543)
- [Improvement] Add periodic logging for L2 cache fill wait time (#5549)
- [Fix] Use sym_numel() instead of numel() in TBE pt2 backward codegen (#5563)
- [Fix] Fix TBE v2 forward kernel for embedding dim > 1024 (#5326) (#5569)
- [Improvement] Fold UVM code into TBE package (#5576)
- [Improvement] Migrate tbe_input_multiplexer.py and runtime_monitor.py into tbe/monitoring/ (#5590)
- [Improvement] Fold tbe/stats/ into tbe/monitoring/ for better organization (#5591)
- [Fix] Fix int32 truncation in tbe_input_combine offset accumulation (#5594)
- [Fix] Get actual free GPU memory in test_cache_int32_overflow (#5605)
- [Fix] Make warp segment threshold consistent with host function (#5606)
- [Fix] Revert D99867633 (#5607)
- [Fix] Fix tbe combine tests (#5614)
- [Fix] Use aligned_unique_ptr in more places to avoid leak (#5621)
- [Improvement] Remove stale ROCm 5.7 skip checks and dead SM70 code in tests (#5619) (#5626)
- [Improvement] Simplify is_torchdynamo_compiling to direct import from torch.compiler (#5618) (#5628)
- [Fix] Support multi-dimensional runtime_meta in RES streaming buffers by lazy init (#5643)
- [Fix] Add missing C10_CUDA_CHECK (#5647)
- [Fix] Fix VBE batch sizes not passed to request builder (#5653)
- [Improvement] log query empty count vs total count (#5657)
- [Improvement] Use newer STL features in codegen templates (#5659)
- [Improvement] Use Python 3.10+ typing in TBE ops and utilities (#5667)
- [Fix] Apply proper grid striding on forward V2 kernel for ROCm (#5447) (#5669)
- [Improvement] Exclude transient RES streaming buffers from checkpoints by setting persistent=False (#5674)
- [Improvement] Use Python 3.10+ typing in core TBE ops (#5675)
- [Improvement] TBE benchmark suites improvement (#5677)
- [Improvement] Refactor bounds_check_indices offset checks to condition-first (Phase 1) (#5682)
- [New] TBE EEG for Inference (#5688)
- [New] Add per-feature pooling factors support (#5690)
- [New] Add SVE-FP16 version of EmbeddingSpMDM8Bit (#5720)
- [Improvement] remove unnecessary field for FixedBlockPool in inference (#5729)
- [Fix] Fix find_long_segments kernel launch failure for batch index select (#5732)
- [Improvement] support warpSize 32 and 64 in the same build (#5739)
- [New] Create tbe/config/ package with foundational embedding types (#5742)
- [Improvement] Remove unnecessary __syncthreads in bounds_check_indices_kernel_v2 (#5744)
- [New] Add cache config types to tbe/cache/ package (#5752)
For CPU
- [Improvement] Replace spin-wait polling with condition variable in EmbeddingKVDB fill queue (#5510)
- [Improvement] Precompute writeback dedup indices in forward to eliminate GPU-CPU sync in backward (#5522)
- [Fix] Fix CPU TBE inline bounds check for unified embedding (#5523)
- [Fix] Fix fused TBE weight buffer for MTIA (#5534)
- [New] Add UVM pipeline support for MTIA (#5538)
- [New] Add preallocated host buffer support to FBGEMM SplitTableBatchedEmbeddingBagsCodegen (#5692)
- [Improvement] Enable TBE nobag backward test for SGD on CPU (#5759)
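To illustrate the kind of change described in #5510 (replacing spin-wait polling with a condition variable), here is a generic sketch in Python using `threading.Condition`. It is not the EmbeddingKVDB code, which is C++; the `FillQueue` class and its methods are hypothetical names used only for illustration.

```python
import threading
from collections import deque


class FillQueue:
    """Toy blocking queue: consumers sleep on a condition variable instead of polling."""

    def __init__(self):
        self._items = deque()
        self._cond = threading.Condition()

    def put(self, item):
        with self._cond:
            self._items.append(item)
            self._cond.notify()  # wake one waiting consumer

    def get(self, timeout=None):
        with self._cond:
            # wait_for re-checks the predicate after every wakeup, so
            # spurious wakeups are handled without a busy loop.
            ready = self._cond.wait_for(lambda: len(self._items) > 0, timeout=timeout)
            if not ready:
                raise TimeoutError("no item arrived before the timeout")
            return self._items.popleft()


if __name__ == "__main__":
    q = FillQueue()
    results = []
    consumer = threading.Thread(target=lambda: results.append(q.get(timeout=5)))
    consumer.start()
    q.put("row-42")  # hypothetical payload
    consumer.join()
    print(results)
```

The waiting thread releases the lock and sleeps until notified, so it does not burn CPU cycles the way a polling loop would.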
SSD Table Batched Embedding (TBE) Operators
- [Improvement] Move compute thresholds logic for eviction (#5453)
- [Improvement] enable feature score auto collection in EBC (#5459)
- [Improvement] Migrate cudaStreamAddCallback to cudaLaunchHostFunc (#5462)
- [Improvement] Add Python enum configs and KJT builder for enrichment (#5464)
- [Improvement] Add cache locking and dedicated memcpy stream for SSD TBE inference (#5480)
- [New] Enable RES for DRAM KV embedding cache (#5488)
- [New] Add OneFlow OpenTab enrichment backend support (#5493)
- [New] Add OneFlow Feature Store enrichment backend and refactor dispatch (#5494)
- [Fix] Fix race conditions (#5496)
- [Improvement] Use atomicAdd for lxu_cache_locking_counter increments/decrements (#5509)
- [Improvement] Tune RocksDB bloom filter and background thread pool sizing (#5511)
- [Improvement] Double-buffer eviction buffers to reduce prefetch stalls (#5512)
- [Improvement] Auto-size RocksDB block cache and expose L2 cache hit rate (#5513)
- [Fix] Fix race conditions: make shared mutable state atomic (#5520)
- [Fix] Fix sorted_ids None issue in SSD TBE optimizer state fetching (#5525)
- [Improvement] Make inference cache locking opt-in via enable_cache_locking flag (#5546)
- [New] Add embedding cache support to oneflow base model (#5552)
- [New] Add streaming_update() and load_snapshot() for inference (#5554)
- [New] Add TurboSSDInferenceModule for HSTU serving integration (#5558)
- [New] Add AMD/ROCm support for SSD TBE inference (#5559)
- [Improvement] Support input data not most recent in MP-ZCH (#5567) (#5570)
- [Fix] Fix lint (#5611)
- [New] Add DRAM KV cache and L1 hit rate metrics for training (#5633)
- [Improvement] Skip scratch pad eviction data in enrichment mode to avoid cudaFree overhead (#5645)
- [Improvement] Add laser_batch_size to IGR enrichment, add sleep for enrichment (#5697)
- [Fix] Gate enrichment_policy by per-TBE embedding_cache_mode (#5698)
- [Fix] Add spin-loop termination to fix AMD GPU hang on MP-ZCH (#5714)
- [Improvement] Add unit tests for warp primitives, bitonic sort, and ROCm warpSize guards (#5715)
- [Improvement] Add KVZCH inference read-time hit rate metrics via fb303 ODS counters (#5730)
- [Fix] Fix int32 truncation of 64-bit ssd_row_addrs in unrolled forward path (#5743)
- [Improvement] Add KVZCH inference read-time hit rate metrics via fb303 ODS counters (#5745)
- [Improvement] Add SSD/KVZCH config types to tbe/ssd/ package (#5753)
GenAI Support and Operators
Triton GEMM Support
- [Improvement] Port reorder_batched_ad_lengths benchmark to tritonbench (#5505)
- [New] IKBO LCE kernel in fbgemm (#5521)
- [Improvement] Port group_index_select_2d to tritonbench (#5533)
- [Improvement] Add Portions Copyright headers to modified third-party files (#5545)
- [Improvement] Port jagged_index_select_2d benchmark to tritonbench (#5572)
- [Improvement] Port bench_dense_to_jagged_1d to tritonbench (#5580)
- [Improvement] Port bench_jagged_1d_to_dense to tritonbench (#5584)
- [Improvement] Port bench_jagged_2d_to_dense and bench_dense_to_jagged_2d to (#5598)
- [Improvement] Port jagged_dense_dense_elementwise_add_jagged_output and jagged_dense_elementwise_op_jagged_output to tritonbench (#5602)
- [Improvement] Update jagged_acc_weights_and_counts and jagged_slice_cpu bench (#5620)
- [Improvement] Upgrade permute_multi_embedding benchmark (#5627)
- [Improvement] Upgrade batched_unary_embeddings benchmark (#5639)
- [Improvement] Use Python 3.10+ typing in sparse/quantize/triton/utils (#5636) (#5642)
- [New] Triton/TLX IKBO FA (#5651)
- [Fix] TLX IKBO FA benchmarking with latest commit hash + bug fix (#5734)
Quantization Operators
- [Fix] Fix fp16 code on aarch64 and Windows builds (#5548) (#5550)
- [Improvement] merge SFINAE overloads of CodeGenHelpers templates with if constexpr (#5565) (#5571)
- [Improvement] bf16 scale/bias for INT4 (#5595)
- [Improvement] Vectorize fp16 row conversion in rowwise quantization (#5596)
- [Fix] Fix EmbeddingQuantizeFloatToFloatOrHalfBenchmark (#5622)
- [Improvement] Use double in dequant ref/scalar to match FMA precision (#5623)
- [Improvement] Remove legacy quantize path (#5624)
- [Improvement] Cleanup stale code for ROCM < 6.2 and CUDA < 12 (#5616) (#5625)
- [Fix] Fix stale pytorch version checks (#5631)
- [Improvement] Enable AVX512-BF16 dequant in OSS CMake and Bazel builds (#5635)
- [Fix] [fbgemm_gpu] Fix aarch64 build issues caused by D99968947 (#5655)
- [Fix] Fix OOB read in _get_padding_value_kernel (#5652) (#5662)
- [Improvement] Add trace export to mixdim benchmark and fix FP16 benchmark consistency (#5665)
- [Fix] Add FP8 rowwise padding to quantized AllToAll pooled embeddings (#5673)
- [Fix] Relax numerical tolerances in KV cache quantization tests (#5681)
- [Improvement] Remove LEGACY parameter entirely from batch Quantize overload for API consistency (#5683)
- [Improvement] Harden rowwise quantize benchmark with Kineto trace export (#5693)
- [Fix] Fix fbgemm_dev build/test health issues (#5694)
- [Improvement] benchmarks + stats tooling for bf16 AVX2 8-bit / N-bit dequant (D100932926) (#5709)
- [Improvement] Remove unused test parameters (#5725)
Sparse Operators
- [Improvement] Fixes and improvements to permute_2d_sparse_data_bench (#5477)
- [Improvement] Add heterogeneous per-group input shapes support to group_index_select_2d_bench (#5487)
- [Improvement] Add permute_1d comparison scripts and CPU cache flushing for old benchmark (#5492)
- [Improvement] Add assertion to guard against overflow in keyed_jagged_index_select_dim1 (#5500)
- [Improvement] Port reorder_batched_sequence_embeddings benchmark over to tritonbench (#5504)
- [Improvement] Harden asynchronous_complete_cumsum_2d_bench in sparse_ops_benchmark (#5515)
- [Improvement] Add meta function for block_bucketize_sparse_features_inference (#5529)
- [Improvement] Enable block_bucketize* tests on ROCm (#5527) (#5532)
- [Improvement] Remove redundant CUDA_KERNEL_ASSERTs in keyed_jagged_index_select_dim1 (#5539)
- [Fix] Fix int32 overflow in keyed_jagged_index_select_dim1 (#5544)
- [Improvement] Upgrade batch_reuse_index_select_device benchmark (#5562)
- [Improvement] Improve keyed_jagged_index_select_dim1 and masked_select_jagged_1d bench (#5613)
- [Improvement] Remove torch_compiled (#5617)
- [Improvement] Validate total_num_blocks divisibility by my_size in block_bucketize (#5646)
- [Fix] Fix 2 broken tests caused by D101141810 (#5654)
- [Improvement] Add my_size > 0 guard and inference negative test for block_bucketize (#5663)
- [Improvement] Optimize jagged_unique_indices_cuda (binary-search length + custom cub pipeline) (#5718)
- [Fix] Fix Hypothesis differing_executors health check failure in index select (#5721)
- [Improvement] Add unit test for batch_index_select_dim0 with large segment lengths (#5722)
- [Fix] Fix int32 stride overflow in jagged_to_padded_dense at BLD > INT_MAX (#5755)
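Two of the fixes above (#5544, #5755) address 32-bit integer overflow in index or stride arithmetic. The sketch below shows the general failure mode in plain Python, using `ctypes.c_int32` to mimic a signed 32-bit value. It is not taken from the FBGEMM kernels: the helper names are hypothetical, and reading "BLD" as batch × length × dim is an assumption.

```python
import ctypes

INT32_MAX = 2**31 - 1


def as_int32(value):
    """Wrap an integer the way a signed 32-bit variable would store it."""
    return ctypes.c_int32(value).value


def flat_size_overflows(b, l, d):
    """Return True if the flattened size b * l * d does not fit in int32."""
    return b * l * d > INT32_MAX


if __name__ == "__main__":
    # Hypothetical shape: batch, max sequence length, embedding dim.
    b, l, d = 4096, 2048, 512
    exact = b * l * d  # Python ints are arbitrary precision
    print("exact size:      ", exact)
    print("stored as int32: ", as_int32(exact))
    print("needs 64-bit:    ", flat_size_overflows(b, l, d))
```

Once a product like this exceeds `INT32_MAX`, a 32-bit stride silently wraps, which is why such fixes typically widen the arithmetic to 64 bits or add an explicit guard.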
Build / CI Improvements and Better Engineering
- [Improvement] Remove NCCLX one-sided comm code from fbgemm (#5475)
- [Improvement] Add missing copyright headers to Meta-authored files (#5482)
- [Improvement] Add Meta copyright headers to modified NVIDIA CUTLASS files (#5483)
- [Improvement] Add Portions Copyright headers to modified AMD CK/ROCm gen_ai files (#5484)
- [Improvement] Add Portions Copyright headers to modified third-party include files (#5485)
- [Improvement] Add Portions Copyright headers to modified Arm KleidiAI files (#5486)
- [Fix] Fix Vec2/Vec4 UVM performance regression with vectorized at::BFloat16 loads/stores (#5489)
- [Fix] Fix Vec2/Vec4 UVM performance regression with vectorized at::Half copy (#5491)
- [Improvement] Install libdw (#5495)
- [Fix] Remove _test suffix from package name for test channel builds (#5502)
- [Improvement] Update docs and compatibility table for FBGEMM v1.6.0 release (#5503)
- [Fix] Fix build-time error for tbb in CentOS. (#5497) (#5516)
- [Improvement] Update default CUDA version to 13.0.2 (#5524)
- [Improvement] Move internal enrichment files to fb/ for OSS exclusion (#5541)
- [Fix] Fix empty key lookup in gpu_detect.bash (#5551)
- [Improvement] Bump minimum GCC to 11.4 (#5537) (#5553)
- [Fix] Remove omp_set_num_threads from RadixSortTest to fix ASan leak (#5555)
- [Improvement] Enable more clang-tidy checks on C++20 (#5575)
- [Improvement] Add checks for uninitialized storage (#5579)
- [Improvement] Simplify array_of_ones and remove array_of_zeroes (#5573) (#5581)
- [Improvement] simplify PackingTraits methods (#5574) (#5582)
- [Improvement] Simplify FP code (#5577) (#5583)
- [Improvement] Replace SFINAE with C++20 concepts and requires clauses (#5586)
- [Improvement] Use std::bit_cast and std::countl_zero in C++20 (#5592)
- [Improvement] Use supported std::ranges algorithms (#5593)
- [Improvement] Use CUB_WRAPPED_NAMESPACE instead of legacy CUB_NS_PREFIX (#5601)
- [Fix] Strip -std=c++NN flag from pytorch package (#5604)
- [Improvement] simplify ALIGNAS, remove useless attributes and stale CUDA workaround (#5608)
- [Improvement] Add aligned_unique_ptr RAII wrapper to avoid leak risks (#5609)
- [Improvement] Add CUDA 13.2 support to CI and release workflows (#5610) — reverted in this release; see #5750
- [Improvement] Remove dead CUDA < 11 workarounds and simplify bf16/CUB guards (#5600) (#5612)
- [Improvement] Unify duplicated cmake code between CPU and GPU builds (#5629)
- [Improvement] Use C++20 [[unlikely]] and defaulted operator== (#5630)
- [Fix] Fix 3 broken tests caused by D100185387 (#5656)
- [Fix] Fix pyre type annotations in test_utils.py (#5660)
- [Improvement] Fix flake8 E402 warnings (#5658) (#5661)
- [Fix] Fix duplicate symbol linker errors on ARM builds (#5664)
- [Fix] Fix OSS CI ModuleNotFoundError: explicit pip in conda env (#5691)
- [Improvement] Enable device-side assertions on ROCm (#5723)
- [Improvement] Re-enable get_cuda_error_help in kernel error message (#5724)
- [Improvement] Replace rocm-smi with amd-smi across ROCm build, CI, and docs (#5597) (#5726)
- [Fix] Enable AMD tests for ZCH & Fix OSS (#5727)
- [Improvement] Add FBGEMM_NO_JK=2 (EnvFirstThenJk) policy; refactor feature-gate lookup into singleton (#5748)
- [Fix] Revert CUDA 13.2 enablement (#5610) due to OSS CI cost regression and upstream conda-forge instability (#5750)
- [Improvement] Annotate unused function (#5758)
- [Fix] Remove erroneous NVIDIA proprietary block from BSD-3 LICENSE (#5760)
Tests and Benchmarks
- [New] Add common scripts for benchmark trace analysis (#5498)
- [Improvement] Re-organize diff benchmarking scripts (#5508)
- [Improvement] Set manual seed for fbgemm benchmark (#5540)
- [Improvement] Benchmarks for D98170783 (#5547)
- [Improvement] Remove pt2_cpu stubs and move isValidBlockingFactor (#5556)
- [Improvement] Benchmark code refactoring (#5632)
- [Improvement] Add --device and --export-trace flags to stride_gemm_benchmark (#5671)
- [Improvement] Harden repeat_arange benchmark with input validation and trace export (#5676)
- [Improvement] Harden histogram_binning_calibration benchmark with input validation and trace export (#5687)
- [Fix] Fix type annotation (#5695)
- [New] Add scripts for analyzing bench runs (#5731)