Highlights
TBE
- Added new kernels and improved dispatch for CPU and GPU.
- Optimizations for SSD offloading, RocksDB integration, and checkpoint handling.
- Various bug fixes and performance improvements in TBE forward and backward passes.
Gen AI Ops
- Numerous fixes and optimizations in GEMM kernels, including Cutlass BF16 grouped GEMM tuning cache support.
- New kernels and performance improvements for FP8, Triton, and quantization operations.
- Added support for fused SILU with quantization and RMS with quantization.
- Enhanced heuristics and API updates for GenAI operators.
- Improved GPU atomic operations and kernel vectorization.
New Ops
- Introduction of HSTU ops, courtesy of NVIDIA.
Benchmarking
- New benchmarks for inference, Gather/Scatter, and DramKV.
Better Engineering
- Added build support for CUDA 12.9.
- Upgraded CI instances and build matrix filters.
- Various fixes for OSS compatibility and build stability.
Software Requirements
FBGEMM_GPU v1.3.0 has been tested and is known to work on the following setups:
- PyTorch: v2.8
- CUDA: v12.6, 12.8, 12.9
- Python: v3.9, 3.10, 3.11, 3.12, 3.13
It is recommended to prepare an isolated environment for installing and running FBGEMM_GPU (instructions here) and FBGEMM GenAI (instructions here).
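As a minimal sketch (assuming Python 3.9+ is already on the system; the environment name below is arbitrary, and the linked instructions describe the fully supported setup), an isolated environment can be created with Python's built-in venv:
# Create and activate an isolated environment, then update pip
python3 -m venv fbgemm-env
source fbgemm-env/bin/activate
python -m pip install --upgrade pip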
Availability
FBGEMM_GPU and FBGEMM GenAI can be fetched directly from PyPI:
# FBGEMM_GPU - CUDA (only the CUDA 12.6 variant is available)
pip install fbgemm-gpu==1.3.0
# FBGEMM_GPU - CPU
pip install fbgemm-gpu-cpu==1.3.0
# FBGEMM GenAI
pip install fbgemm-gpu-genai==1.3.0
Alternatively, they can be fetched from the PyTorch PIP index:
# FBGEMM_GPU - CUDA
pip install fbgemm-gpu==1.3.0 --index-url https://download.pytorch.org/whl/cu126/
pip install fbgemm-gpu==1.3.0 --index-url https://download.pytorch.org/whl/cu128/
pip install fbgemm-gpu==1.3.0 --index-url https://download.pytorch.org/whl/cu129/
# FBGEMM_GPU - CPU
pip install fbgemm-gpu==1.3.0 --index-url https://download.pytorch.org/whl/cpu
# FBGEMM GenAI
pip install fbgemm-gpu-genai==1.3.0 --index-url https://download.pytorch.org/whl/cu126/
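As a quick post-install smoke test (a sketch; it assumes the fbgemm-gpu or fbgemm-gpu-cpu package from above and only checks that importing fbgemm_gpu registers the torch.ops.fbgemm operators), run:
# Importing fbgemm_gpu registers the FBGEMM operators; this prints tensor([0, 1, 3, 6])
python -c "import torch, fbgemm_gpu; print(torch.ops.fbgemm.asynchronous_complete_cumsum(torch.tensor([1, 2, 3])))"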
GenAI
GEMM
- [Fix] Fix CQS signal facebook-unused-include-check in fbcode/deeplearning/fbgemm/src [B] [A] (#4401)
- [Improvement] Optimize some code out of compilation in the table lookup kernel (#4371)
- [Improvement] pyre-fixmes for D75477355 - FBGEMM (#4390)
- [Fix] fix feature eviction config conflict in inference operator (#4374)
- [Improvement] Add CudaEvent Sync to Two Hop All To One Copies (#4367)
- [Fix] fix output dtype issue in merge_pooled_embeddings when input tensors are all empty (#4325)
- [Fix] Fix meta implementation for nobag (#4370)
- [Fix] Enable stacked grouped GEMM (#4349)
- [New] Add new kernels for Cutlass BF16 grouped GEMM for tuning cache (#4300)
- [New] Support tuning cache for Cutlass BF16 grouped GEMM (#4299)
- [Improvement] 0 tokens for gather_scale_dense_tokens (#4319)
- [Improvement] improve write performance by ~10x (#4277)
- [New] OSS MPZCH CUDA kernel in FBGEMM (#4214)
- [Improvement] Migrate jagged tensor kernels to `FBGEMM_LAUNCH_KERNEL`, pt 1 (#4278)
- [Improvement] wrap shard_input into dispatcher and access indices through data_ptr instead of [] (#4264)
- [Improvement] Vectorize f16 conversion (#4253)
- [Improvement] Use hardware-accelerated float16 conversion (#4245)
- [Improvement] Speed up AMD training by reducing GPU atomic ops (#4255)
- [Fix] fix diff failure for D75474404 (#4247)
- [Improvement] Pass B_offsets to forward wrapper for MTIA (#4242)
- [Fix] Issue warnings once. (#4164)
- [Fix] Fix test breaking in internal github repo (#4162)
- [New] Update heuristic for Cutlass BF16 Grouped GEMM (#4138)
- [Improvement] Refactor Cutlass BF16 Grouped GEMM (#4124)
- [Improvement] Add flag to disable fbgemm autovec in OSS (#4100)
- [New] FuseScatterAdd supports non-WS kernels. (#4135)
- [New] Build MXFP4/NVFP4 CUTLASS grouped GEMM (#4128)
- [Fix] Disable some sources from build (#4120)
- [Fix] Revert D73898778 (#4107)
- [Improvement] More if-constexpr (#4054)
- [Fix] Update repo version and reenable rowwise scaling (#4038)
- [Improvement] Autodetect Triton WS support. (#4009)
- [Fix] Minor fix to bf16 grouped gemm test (#4002)
- [Improvement] Improve Fused8BitRowwiseQuantizedSBFloatToFloatOrHalfNeon by 2%-10% (#3879)
GenAI Ops
- [Fix] Remove all imports of <torch/torch.h> from CK extensions (#4407)
- [Improvement] BF16I4 Preshuffled Batched GEMM (#4399)
- [Fix] silu_mul_quant fix (#4395)
- [Fix] Fix trunk health GenAI tests (#4394)
- [Improvement] add actual_batch_size to rope_qkv_varseq_prefill (#4380)
- [Fix] silu mul quant torch api fix (#4376)
- [Improvement] silu_mul API Update (#4359)
- [Improvement] Add FP32 support for routing_score dtype (#4352)
- [New] Add initial version of TuningCache and scripts for heuristic + kernel (#4289)
- [Improvement] Update AI Codesign Cutlass to 4.0 (#4276)
- [Improvement] modify convert_e4m3fn_kv_cache_to_e4m3fnuz_inplace kernel to support N_H_L dim and address correctness (#4286)
- [Fix] Fix for T227105252 ("Your diff, D76285873, broke one test") (#4318)
- [Improvement] Update torch API Definition (#4310)
- [Fix] [fbgemm_gpu] Disable GenAI test in OSS (#4311)
- [New] Fused SILU with quantization and RMS with quantization (#4204)
- [New] Leverage fuse kernel in inference workload (#1237) (#4157)
- [Improvement] Triton based activation kernels runs on valid tokens. (#4185)
- [Improvement] Triton based Gather/Scatter kernels runs on valid tokens. (#4184)
- [Improvement] IndexShuffling supports TP2EP. (#4180)
- [Fix] iRoPE varseq flag for pre-calculated kv qparams (#4160)
- [Improvement] Clean up IndexShuffling op. (#4155)
- [Fix] Revert D74855940 (#4149)
- [Improvement] [fbgemm_gpu] Enable building FB-internal sources only if CUDA version… (#4143)
- [Improvement] Decouple some operator defs from operator impl (#4140)
- [New] introduce kernel for converting e4m3fn kv_cache to e4m3fnuz (#4113)
- [Fix] OSS CI fixes (#4077)
- [New] Create dedicated kv cache header file (#4089)
- [New] Add FP8 rowwise support for MetaShuffling demo. (#4094)
- [New] pipelining one_shot and two_shot allreduce (#4081)
- [Improvement] E2E cleanup and baseline. (#1164) (#4080)
- [New] Add external qparams parameters to mqa_attn API (#4069)
- [Improvement] Reduce OSS CI timeout (#4067)
- [New] TokenShuffling MoE Example for OSS (#4065)
- [Improvement] Add external qparams parameters to dequantize_int4_cache API (#4062)
- [New] `fbgemm_gpu.experimental.gen_ai.moe.silu_mul_quant`. (#4059)
- [Fix] Remove unused variable in gqa_attn_splitk_attn_kernel (#4014)
- [New] `fbgemm_gpu.experimental.gen_ai.moe.silu_mul`. (#1144) (#4057)
- [New] `fbgemm_gpu.experimental.gen_ai.moe.scatter_add_dense_tokens`. (#4055)
- [New] int4 kv (#3878)
- [Improvement] Optimize kv cache usage for yoco (#4030)
- [Improvement] Use cudaMemset/hipMemset to setup IndexShuffling kernel. (#4016)
- [Improvement] Cleanup gather/scatter ops. (#4001)
- [New] `torch.ops.fbgemm.gather_scale_quant_dense_tokens` (#3995)
- [Improvement] Make gqa tests device agnostic (#3985)
- [New] Add `index_shuffling` to `gather_scatter_bench`. (#3999)
- [Improvement] Fuse ScatterAdd into GroupedGEMM. (#3994)
- [Improvement] GroupedGEMM perf improvement. (#3992)
- [New] Open source TokenShuffling MoE kernels. (#3978)
- [Improvement] Make kv_cache tests device agnostic (#3961)
- [Improvement] Make quantize_bench easier to use in OSS (#3968)
- [Improvement] Modernize FP8 Rowwise Tuning (#3955)
- [Fix] reland fp8 kv cache dequantization fix (#3956)
- [Improvement] Update DeepGemm with Latest Performance Improvements (#3954)
- [New] Add DEEPGEMM Masked API. (#3949)
- [New] Add harness for comms benchmark (#3936)
- [Improvement] Use Int64 Indexing in Grouped Gemm (#3930)
FP8
- [New] Support skip scaling for input tensor for Triton rowwise FP8 kernel (#4362)
- [Improvement] Enrich auto-tune shapes for OC OBA model (#4368)
- [Improvement] Vectorize load/store for FP8 Quantization (#4262)
- [Improvement] Build and optimize BF16 grouped GEMM on blackwell (#4353)
- [Improvement] Tune FP8 grouped GEMM for Llama4 shapes (#4326)
- [Improvement] Add tensor checks in FP8 batched GEMM (#4347)
- [Improvement] Improve heuristic for Cutlass FP8 Grouped GEMM (#4309)
- [New] Support tuning cache for Cutlass FP8 Grouped GEMM (#4308)
- [New] Support tuning cache for Cutlass FP8 GEMM (#4301)
- [New] CK Preshuffle FP8 Rowwise GEMM (#4234)
- [Improvement] Simplify CK FP8 Kernel Launch and enable FP16 Outputs. (#4233)
- [Fix] Revert D76342974: Multisect successfully blamed "D76342974: [FBGEMM] Make Cutlass FP8 Rowwise bias always FP32" for one test failure (#4323)
- [Improvement] Make Cutlass FP8 Rowwise bias always FP32 (#4317)
- [Improvement] group gemm float8 dtype for AMD GPU (#4282)
- [Improvement] Use float8e4m3fn for MI350+ (#4280)
- [Improvement] Partition correctness issues from performance issues when pruning configs (#4261)
- [Fix] Disable use_cuda_graph for inductor integration with non-persistent fp8_rowwise (#4259)
- [Improvement] cap scale_ub used in fp8 kv_cache quantization (#4237)
- [Fix] Fix FP8 rowwise GEMM meta kernel with 3d input tensor (#4231)
- [Fix] Fix FP8 quantization meta kernels with 3d input tensor (#4221)
- [Improvement] Add best config for FP8 medium shapes (#4169)
- [Improvement] Update matmul_fp8_row_meta signature (#4168)
- [Improvement] Support Bias in _kernel_matmul_fp8_row_non_persistent (#4167)
- [Improvement] Optimize cudaGetDeviceProperties runtime overhead (#4209)
- [Improvement] Enable FP8 Grouped/Batched/Regular GEMM with dispatched SM100 and optimizations (resubmit) (#4197)
- [New] cache quantization (#4031)
- [Fix] FP8 kv code improvement (#4040)
- [Fix] Fix templates for FP8 Rowwise Slow Accumulation (#4037)
Triton
- [Fix] Fix trunk health (#4379)
- [Improvement] Replace torch quantization implementation with Triton version (#4217)
- [Improvement] Support BF16 in Triton downcast quantization mx4 unpack kernel (#4203)
- [Fix] FP4 Triton kernel bug fix (#4181)
- [Improvement] INT64 address indexing. (#1049) (#3965)
Quantize Ops
- [New] Utilities for slicing preshuffled tensors (#4396)
- [Improvement] Support scale_bias_last on tbe lookup kernel (#4363)
- [New] NVFP4 quantization emulation kernels as reference (#4324)
- [Improvement] Update quantize_test to fix lint (#4355)
- [Improvement] Disable `comm/multi_gpu_car_test.py` on non-CUDA machines (#4328)
- [Improvement] NVFP4 kernel fusion and optimization (#4329)
- [Improvement] SymInt-ify quantize.cpp (#4322)
- [Fix] Fix IMA issue in QKV quantize kernel (#4305)
- [New] Added triton implementation for nvfp4 quantization scheme (#4275)
- [Improvement] Enable K that is not divisible by group size for shuffled mixed dtype kernels. (#4267)
- [Fix] Fix the output shape for 3d input for `f8f8bf16_rowwise_meta` (#4254)
- [Improvement] optimization of perKVhead quantization (#4161)
- [Improvement] Allow multiple group sizes to be passed in quantize_bench (#4137)
- [New] Add Llama4 shapes in quantize_bench (#4129)
- [New] Add MXFP4 PT reference quantization kernel and refactor CUTLASS FP4 GEMM (#4117)
- [New] Support Triton unpacked MXFP4 quantization kernel (#4116)
- [Improvement] Migrate TBE inference kernels to `FBGEMM_LAUNCH_KERNEL` (#4092)
- [Fix] [fbgemm_gpu] Remove `sm_100` and `sm_120` (#4024)
- [New] Enable FP4 CUTLASS GEMM and CUDA quantization kernels (#4004)
- [Fix] Fix weighted TBE inference NaN (un-init) row_weights (#4006)
- [New] add logic for new heuristics of l4_17b_128e shapes (#3984)
- [Improvement] refine the heuristics sweeping scripts for llama4 17b_128e shapes (#3983)
- [Improvement] Performance Optimization: Improved TileShape Configuration for Large Llama Shapes (#3790)
TBE
TBE GPU
- [Improvement] Move tbe weights as buffers so export can track properly (#4369)
- [Improvement] Support prefetch pipeline in bounds_check_indices (#4312)
- [New] tbe cpu nobag dispatch and backward pass kernel impl (#4303)
- [Improvement] tbe cpu nobag dispatch and forward pass kernel impl (#4302)
- [Fix] Fix the large indices forward test (#4232)
- [Fix] Limit the grid size for the TBE forward kernel (#4208)
- [New] Add auto-gen aux args for SSD TBE (#4220)
- [Improvement] Replace enable_optimizer_offloading with aux_bool (#4219)
- [Fix] pyre fixes for D75477355, group 1 (#4212)
- [Improvement] Add more parameter specializations for autovec TBE kernels (#4153)
- [New] Add TBE data configuration reporter to TBE forward (#4130)
- [Fix] support filling partial rows from backend (#4158)
- [Improvement] Migrate TBE cache kernels to `FBGEMM_LAUNCH_KERNEL` (#4127)
- [Improvement] Migrate TBE utility kernels to `FBGEMM_LAUNCH_KERNEL` (#4122)
- [New] Writeback support (#4103)
- [New] Add meta functions for cache ops (#4118)
- [Fix] Fix backward_dense_test (#3702)
- [New] Add more parameter specializations for autovec TBE kernels (#4047)
- [Improvement] Migrate TBE forward kernels to `FBGEMM_LAUNCH_KERNEL` (#4079)
- [Improvement] Upcast round_up inputs/outputs to uint64_t (#4052)
- [Fix] Fix test_indices_estimation max_i should not be <1 (#4056)
- [Improvement] Make sure that all TBE grad_output is contiguous and 16-byte aligned (#4041)
- [Improvement] Use overflow_safe_int_t in TBE forward training (#3953)
- [Fix] [fbgemm_gpu] Add missing init.py (#4008)
- [New] DramKVstore (#4005)
- [Improvement] Support prefetch pipeline in bounds_check_indices (#3923)
- [Improvement] Include NONE in bounds_check_mode validation (#3990)
- [Fix] Fix learning_rate_tensor to avoid in-place op and recompilation issues (#3989)
- [Fix] Fix bounds_check_indices v2 bug (#3979)
- [New] Add `__TEMPLATE_SOURCE_FILE__` macro to TBE codegen sources (#3962)
- [Improvement] Add bound checks in TBE grad_indice_weights kernel (#3943)
- [Improvement] Use bounds_check_indices v2 on ROCm (#3916)
- [New] Add basic CLI for EEG parameter extraction (#3951)
- [Improvement] Pass in sharding position information to TBE to facilitate logging / dump / etc. (#3927)
- [Improvement] Fix flaky TBE unit tests (#3938)
- [Fix] Fix get_infos_metadata meta dispatch (#3946)
- [Improvement] Change set_learning_rate_tensor (#3945)
TBE SSD
- [Fix] Remove `debug_split_optimizer_states` (#4397)
- [Improvement] Update the rowwise adagrad optimizer to leverage optimizer state offloading, v4, frontend (#4249)
- [Fix] fix bugs in D76548519 (#4378)
- [Fix] patch fixes for eviction (#4304)
- [Fix] [fbgemm_gpu] SSD test fix for OSS (#4351)
- [Fix] Fixing reading from EmbeddingRocksDB connection (#4341)
- [Improvement] Adding a separate utils file for KVTensorMetaData (#4298) (#4335)
- [Improvement] Making create_rocksdb_hard_link_snapshot function a no_op (#4340)
- [New] Adding E2E unit tests for KVTensorMetaData class (#4298)
- [New] Adding KVTensorMetaData class (#4297)
- [New] Adding get_kvtensor_serializable_metadata function to (#4296)
- [Fix] ssd offloading: fix trunk break for optimizer state (#4313)
- [Improvement] Optimize integration test (#4284)
- [Fix] disable rdb ckpt with dram backend (#4292)
- [Fix] [fbgemm_gpu] Fix DRAM test (#4288)
- [New] Added unit tests for the entire ssd offloading using rocksdb checkpoint flow (#4228)
- [Improvement] Expose SE/DESE support to EmbeddingRocksDBWrapper for training pipeline (#4227)
- [Improvement] Adding Serialization and Deserialization functions for KVTensor (#4226)
- [New] Creating ReadOnlyEmbeddingKVDB class and necessary functions (#4225)
- [New] Creating RocksDBCheckpointHandler to expose rocksdb checkpoint to python (#4224)
- [New] Adding function to create a snapshot and exposing it from EmbeddingRocksDBWrapper (#4223)
- [New] Creating Checkpointhandle and connecting to EmbeddingRocksDB (#4222)
- [Improvement] Adding helper function for enabling RocksDB Checkpoint (#4213)
- [New] Track updated rows in SSDTBE (#4211)
- [Improvement] Pass the updated embeddings to EmbeddingKVDB (#4210)
- [Improvement] chunk processing l2 cache flush (#4216)
- [Improvement] chunking opt split and fix duplicate flush (#4260)
- [Improvement] optimize unit test running time (#4268)
- [Improvement] add load checkpoint support for virtual table (#4250)
- [Fix] fixes for dram kv enablement (#4246)
- [Improvement] Dram integration test - split tbe training (#4241)
- [New] Dram integration test - dram kv tensor wrapper (#4240)
- [New] Dram optimizer initialization (#4239)
- [Improvement] Adding a mutex lock to set_range function (#4207)
- [Improvement] Update the rowwise adagrad optimizer to leverage optimizer state offloading, v4, backend (#4195)
- [Improvement] support get state dict and apply state dict (#4145)
- [New] Add tensor_stream unit test (#4099)
- [New] implement optimizer state with opt offloading (#4141)
- [Fix] Fix a lint error. (#4156)
- [Improvement] shardTensor metadata recalc after checkpoint state_dict (#4146)
- [Improvement] Migrate TBE SSD cache kernels to `FBGEMM_LAUNCH_KERNEL` (#4142)
- [Improvement] make sure narrow returns contiguous tensor (#4139)
- [Improvement] do not create random value for optimizer (#4132)
- [Fix] fix lint error (#4131)
- [Improvement] Append columns to the SSD cache for storing optimizer data, v3 (#4125)
- [New] Support ordered read based on weight id in KVT (#4108)
- [New] Add new interfaces to SSD TBE for checkpoint saving and loading (#4088)
- [New] Add logic to stream weights in EmbeddingKVDB (#4058)
- [Improvement] Add enable_raw_embedding_streaming from TBE config to EmbeddingKVDB (#4053)
- [Improvement] update ssd tbe and kvtensor with UT (#4084)
- [Fix] Fix list OSS (#4083)
- [New] add UT for new use cases (#4074)
- [New] support zero collision tables in ssd operator (#4033)
- [Improvement] small changes for kvzch (#4073)
- [New] add c++ bucket sorted id utils (#3982)
- [New] add rocksdb backend api for getting id tensors (#3981)
- [New] add kvtensor discrete id IO support (#3972)
- [Improvement] skip multiple flush at the same iteration (#3967)
- [Improvement] always toggle compaction to true (#3935)
Other Ops
Sparse Ops
- [New] add optimized reorder_batched_ad_indices_kernel on AMD (#4388)
- [Improvement] add more checks for CPU combined inputs coalescing (#4202)
- [Improvement] preprocessor logic for hipBLAS V3 API changes (#4281) (#4320)
- [Fix] Move sparse_op registration + correct sigmoid XL lowering settings (#4179)
- [New] Add keep_orig_idx_per_feature parameter to block_bucketize_sparse_features kernel (#4027)
- [Fix] [fbgemm_gpu] Fix missing operator registration (#4042)
- [New] Move batched_complete_cumsum op to FBGEMM (#4036)
Permute Ops
- [Improvement] Relax the checks for dimensions of pooled_embs (#4165)
- [New] support permute_multi_embedding_function on torch.export (#3897)
Benchmarks
- [New] handle inference buck gpu deps (#4358)
- [New] kvzch inference python operator (#4344)
- [Fix] Revert D75462895: Multisect successfully blamed "D75462895: [fbgemm_gpu] Add TBE data configuration reporter to TBE forward (v2)" for one test failure (#4381)
- [New] Add TBE data configuration reporter to TBE forward (v2) (#4364)
- [Improvement] [fbgemm_gpu] Upgrade benchmark workflows (#4337)
- [Fix] comment out benchmark UT (#4338)
- [New] Added silu and rms fusion with nvfp4 quantization for Triton kernel (#4285)
- [Improvement] benchmark for DramKV (#4238)
- [New] Add kernel execution timing to the `KernelLauncher` class (#4201)
- [Fix] Fix vbe benchmark for MTIA (#4171)
- [Fix] Fix TBE benchmark results logging (#4170)
- [New] Report TBE data configuration with EEG-based indices (squash stack from D73450767) (#4046)
- [New] add batched support (#4003)
- [Improvement] Cleanup shuffling ops. (#4013)
- [New] Add Gather/Scatter related benchmark. (#3993)
- [Improvement] Add Support for Indices and Offsets File Options in Benchmarking Script (#3966)
- [New] Support MTIA for vbe (#3963)
- [Fix] [fbgemm_gpu] Fix CPU benchmark scripts (#3941)
Better Engineering
Utilities
- [Improvement] Migrate jagged tensor kernels to `FBGEMM_LAUNCH_KERNEL`, pt 2 (#4350)
- [Improvement] Migrate SourceContext macros to source_context.h (#4392)
- [Fix] Deprecate barrier isolation macros (#4357)
- [Improvement] Decouple some operator defs from operator impl (#4272)
- [Improvement] Decouple some operator defs from operator impl (#4175)
- [Fix] Silence autodeps warnings (#4229)
- [Improvement] Migrate TBE UVM cache kernels to `FBGEMM_LAUNCH_KERNEL` (#4193)
- [Improvement] pyre fixes for D75477355, group 11 (#4215)
- [Improvement] Migrate to the new `PackedTensorAccessor`/`TensorAccessor` definitions (#3991)
- [New] Add `set_max_dynamic_smem` (#4398)
Builds
- [Improvement] [fbgemm_gpu] Update release versioning to 1.3 (#4408)
- [Improvement] [fbgemm_gpu] Increase PIP install test timeouts (#4406)
- [Improvement] Use newer CMake module features (#4377)
- [Improvement] Increase timeout for Nova GenAI workflow (#4372)
- [Improvement] [fbgemm_gpu] Upgrade CI instances (#4366)
- [Improvement] FBGEMM build changes to support integration with pytorch (#4354)
- [Fix] [fbgemm_gpu] Fix CUDA 12.9 OSS compilation for HSTU (#4360)
- [Improvement] [fbgemm_gpu] Add build support for CUDA 12.9 (#4356)
- [Improvement] [fbgemm_gpu] Integrate HSTU into OSS CI (#4236)
- [Misc] OSS the TorchRec MPZCH CPU kernel functions (#4295)
- [Improvement] Enable HSTU builds in fbcode (#4290)
- [Fix] Fix use __fp16 by default (#4257)
- [Fix] [fbgemm_gpu] Fix detection mechanism for FBPKG builds (#4243)
- [Fix] Fix the logic around filtering FB-only code from the build (#4230)
- [Improvement] [fbgemm_gpu] Improvements to the Nova build matrix filter script (#4206)
- [Improvement] [fbgemm_gpu] Filter Build Coordinates Generated by Nova (#4196)
- [Improvement] Enable FP8 Grouped/Batched/Regular GEMM with dispatched SM100 and optimizations (#4172)
- [Fix] Trim constexpr from isA to improve Windows clang-cl support. (#4119)
- [Fix] [fbgemm_gpu] Remove sm90 target from OSS builds (#4177)
- [Fix] [fbgemm_gpu] Disable GenAI builds against CUDA 11.8 (#4173)
- [Improvement] [fbgemm_gpu] Support ROCm 6.4 builds (#4114)
- [Improvement] [fbgemm_gpu] Enable GenAI ROCm builds in CI and Nova workflows (#4066)
- [Improvement] Support building for armv8.1 (#4068)
- [Improvement] [ROCm OSS Enablement] Update bash build and install scripts to account for targets and variants (#4032)
- [Fix] [ROCm OSS Enablement] Update setup.py to account for targets and variants (#4023)
- [Fix] fix build that excludes a bunch of features (#4020)
- [Improvement] [fbgemm_gpu] Add rate limiting to github metrics scripts (#4007)
- [Fix] [fbgemm_gpu] Disable MoE tests in OSS (#3998)
- [New] [FBGEMM][PR] [fbgemm_gpu] Enable ROCm builds for GenAI, pt 2 (#3996)
- [New] [fbgemm_gpu] Enable ROCm builds for GenAI, pt 1 (#3910)
- [Fix] [fbgemm_gpu] Fix CUDA publish version for PyPI (#3988)
- [Fix] Fix runner for ROCm CI (#3986)
- [Fix] fix lint issue (#3980)
- [Fix] Fix split_embeddings_utils in CMake (#3974)
- [Fix] [fbgemm_gpu] Disable grouped GEMM tests in OSS (#3971)
- [Fix] Guard CUDA API call against older driver versions (#3970)
- [Fix] [fbgemm_gpu] Fix coalesce ops build in OSS (#3964)
- [Fix] [fbgemm_gpu] Install missing pyyaml package in PIP install tests (#3957)
- [Improvement] [fbgemm_gpu] Update release workflows (#3952)
- [Improvement] [fbgemm_gpu] Reduce OSS build sizes for non-GenAI FBGEMM_GPU (#3948)
- [New] [fbgemm_gpu] Add Scripts for Generating Release Reports (#3676)
- [Fix] [fbgemm_gpu] Fix ROCm test reliability (#4385)
- [Fix] [fbgemm_gpu] ROCm fixes for CI (#4345)
Documentation
- [Fix] Fix benchmark helper text (#4375)
- [Fix] [fbgemm_gpu] Fix broken docs link (#4252)
- [Fix] [fbgemm_gpu] Fix releases docs (#4188)
- [Improvement] Update ReadMe. (#4126)
- [Improvement] Update fbgemm gen_ai README (#3997)
- [Improvement] [fbgemm_gpu] Update release version in docs (#3959)
Other
Utilities
- [Improvement] Decouple embedding_ssd_{}_pt2_autograd from CUDA files (#4389)
- [Fix] Back out "Add make directory to filestore abstraction" (#4386)
- [Improvement] Add CudaEvents Barrier before MemCpy V33 (#4348)
- [Improvement] kv embedding inference cache wrapper (#4343)
- [New] Implement a stat library for fbgemm embedding (#4339)
- [Improvement] Add make directory to filestore abstraction (#4346)
- [Improvement] Add manifold wrapper (#4291)
- [Improvement] Prevent duplicate operator registrations (#4327)
- [Fix] Replace references to folly::hash::fnv(32|64) with the _BROKEN alias (#4332)
- [Fix] [CK] Update CK revision to include fix for F16 atomics (#4306)
- [Fix] Back out "Migrate TBE UVM cache kernels to
FBGEMM_LAUNCH_KERNEL
" (#4315) - [Improvement] Include template filename in the DSA file descriptor (#4287)
- [Improvement] Add feature evict for dram_kv_embedding_cache. (#4187)
- [Fix] Fix Signed-Unsigned Comparison in Tensor Utils (#4279)
- [Improvement] CK Version Update (#4235)
- [Improvement] Jemalloc Mempool and Adaptation for CPU HASHTABLE (#4154)
- [Fix] Revert D73927918 (#4190)
- [Improvement] [fbgemm_gpu] Install conda using Miniforge (#4178)
- [Fix] Silence autodeps warnings (#4163)
- [New] Add checks for dimensions of pooled_embs (#4159)
- [Fix] [fbgemm_gpu] Change Conda update channel (#4144)
- [Improvement] Migrate `embedding_bounds_check` to `FBGEMM_LAUNCH_KERNEL` (#4104)
- [Fix] Re-enable `c10_retrieve_device_side_assertion_info` in kernel launch check (#4109)
- [Fix] Remove call to c10_retrieve_device_side_assertion_info for now (#4098)
- [Fix] Deprecate `get_device_properties` for `at::cuda::getDeviceProperties` (#4096)
- [Improvement] update hipify_torch submodule for version 2 (#4028) (#4093)
- [Improvement] Update bounds_check_indices version selection logic (#4086)
- [Improvement] Simplify weight row cache load and evict routines, v2 (#4087)
- [New] Compute optimizer state pointer in table row (#4078)
- [Improvement] Refactor bounds_check_indices (#4049)
- [Improvement] Clean up `WeightRow` in preparation for optimizer state offloading (#4021)
- [New] Enable NaN checks on tensor arguments to kernel launches (#4029)
- [Fix] Use CUDAStream instead of cudaStream_t in kernel_launcher (#4071)
- [Fix] Remove cudaStreamGetDevice() call (#4070)
- [Improvement] use fmaf to do multiply and add (#4051)
- [Fix] Fix shared memory check for HIP (#4044)
- [Fix] Back out "Migrate make_pta_acc_format() away from old macros, v2]" (#4039)
- [Improvement] Move ComputeDevice to split_table_batched_embeddings_ops_common (#4034)
- [Improvement] Optimize if-statements with if-constexpr (#4022)
- [Fix] Fixes and enhancements to FBGEMM_LAUNCH_KERNEL (#4015)
- [Improvement] Expand the EEG CLI utility to support indices generation (#4017)
- [Improvement] Remove old code associated with TensorAccessor creation (#4010)
- [Improvement] Simplify `WeightRowAccessor` based on its usage (#3975)
- [New] Add kernel to initialize shared memory for testing (#4000)
- [Improvement] `PackedTensorAccessor` cleanup (#3987)
- [Improvement] `TensorAccessor` cleanup (#3973)
- [New] Incorporate `__TEMPLATE_SOURCE_FILE__` into kernel launcher facilities (#3969)
- [New] Incorporate Torch DSA (#3950)
- [New] Better kernel launch utilities, pt 2 (#3947)
- [New] Better kernel launch utilities (#3914)
Miscellaneous
- [New] Add HSTU in fbgemm_gpu/experimental/ (#4090)
- [Fix] Fix inputs in EmbeddingSpMDM8BitTest (#4200)
- [Fix] Fix weights pointer not moving in idx==-1 case (#4199)
- [Improvement] Support ROCm 6.4 builds, pt 2 (#4151)
- [Fix] Fix cpuinfo not being initialized before checking for ARM SVE2 (#4121)
- [Fix] Fix illegal memory access when weights are partially empty in input combine cuda (#4101) (#4111)
- [Fix] Add `+fp16fml` to `-march` for aarch64 (#4091)
- [New] Allow merge_pooled_embedding take in device without index (#4061)
- [Improvement] Migrate make_pta_acc_format() away from old macros, v3] (#4048)
- [Fix] fix the type hack in dramKV wrapper (#4012)