FBGEMM v1.3.0 Release Notes

Highlights

TBE

  • Added new kernels and improved dispatch for CPU and GPU.
  • Optimizations for SSD offloading, RocksDB integration, and checkpoint handling.
  • Various bug fixes and performance improvements in TBE forward and backward passes (see the usage sketch after this list).
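
For orientation, a minimal TBE lookup in Python might look like the following sketch. The import paths, constructor defaults, and CPU placement are assumptions based on recent FBGEMM_GPU releases and may differ between versions.

import torch

# Assumed import paths; these may move between FBGEMM_GPU versions.
from fbgemm_gpu.split_table_batched_embeddings_ops_common import EmbeddingLocation
from fbgemm_gpu.split_table_batched_embeddings_ops_training import (
    ComputeDevice,
    SplitTableBatchedEmbeddingBagsCodegen,
)

# Two embedding tables (rows x dim), held in host memory and computed on CPU.
tbe = SplitTableBatchedEmbeddingBagsCodegen(
    embedding_specs=[
        (1000, 16, EmbeddingLocation.HOST, ComputeDevice.CPU),
        (2000, 32, EmbeddingLocation.HOST, ComputeDevice.CPU),
    ],
)

# Batch of B=2 lookups per table; offsets has T * B + 1 entries.
indices = torch.tensor([3, 7, 11, 15, 19, 23], dtype=torch.int64)
offsets = torch.tensor([0, 2, 3, 5, 6], dtype=torch.int64)

pooled = tbe(indices, offsets)  # shape (B, sum of table dims) = (2, 48)
print(pooled.shape)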

Gen AI Ops

  • Numerous fixes and optimizations in GEMM kernels, including Cutlass BF16 grouped GEMM tuning cache support.
  • New kernels and performance improvements for FP8, Triton, and quantization operations.
  • Added support for fused SiLU with quantization and RMS with quantization (see the reference sketch after this list).
  • Enhanced heuristics and API updates for GenAI operators.
  • Improved GPU atomic operations and kernel vectorization.
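
As a rough reference for what the fused SiLU-with-quantization path computes, the plain-PyTorch sketch below performs the unfused equivalent. The rowwise float8_e4m3fn recipe here is an assumption for illustration only; it is not the FBGEMM kernel or its API.

import torch

def silu_mul_quant_reference(x1: torch.Tensor, x2: torch.Tensor):
    # Gated-MLP activation: silu(x1) * x2, followed by rowwise FP8 quantization.
    y = torch.nn.functional.silu(x1) * x2
    row_amax = y.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12)
    scale = row_amax / 448.0                    # 448 ~ max normal value of e4m3fn
    y_q = (y / scale).to(torch.float8_e4m3fn)   # quantized activation
    return y_q, scale.squeeze(-1)               # dequantize via y_q.float() * scale

x1, x2 = torch.randn(4, 128), torch.randn(4, 128)
y_q, scale = silu_mul_quant_reference(x1, x2)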

New Ops

  • Introduction of HSTU ops, courtesy of NVIDIA.

Benchmarking

  • New benchmarks for inference, Gather/Scatter, and DramKV.

Better Engineering

  • Added build support for CUDA 12.9.
  • Upgraded CI instances and build matrix filters.
  • Various fixes for OSS compatibility and build stability.

Software Requirements

FBGEMM_GPU v1.3.0 has been tested and is known to work on the following setups:

  • PyTorch: v2.8
  • CUDA: v12.6, 12.8, 12.9
  • Python: v3.9, 3.10, 3.11, 3.12, 3.13

It is recommended to prepare an isolated environment for installing and running FBGEMM_GPU and FBGEMM GenAI (see the respective installation instructions in the FBGEMM documentation).

Availability

FBGEMM_GPU and FBGEMM GenAI can be fetched directly from PyPI:

# FBGEMM_GPU - CUDA (only the CUDA 12.6 variant is available)
pip install fbgemm-gpu==1.3.0

# FBGEMM_GPU - CPU
pip install fbgemm-gpu-cpu==1.3.0

# FBGEMM GenAI
pip install fbgemm-gpu-genai==1.3.0

Alternatively, they can be fetched from the PyTorch PIP registry:

# FBGEMM_GPU - CUDA
pip install fbgemm-gpu==1.3.0 --index-url https://download.pytorch.org/whl/cu126/
pip install fbgemm-gpu==1.3.0 --index-url https://download.pytorch.org/whl/cu128/
pip install fbgemm-gpu==1.3.0 --index-url https://download.pytorch.org/whl/cu129/

# FBGEMM_GPU - CPU
pip install fbgemm-gpu==1.3.0 --index-url https://download.pytorch.org/whl/cpu

# FBGEMM GenAI 
pip install fbgemm-gpu-genai==1.3.0 --index-url https://download.pytorch.org/whl/cu126/
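
After installation, a quick smoke test is to import the package (which registers the torch.ops.fbgemm operators) and call a simple op. The snippet below is a sketch; the chosen op is assumed to be available in both the CPU and CUDA variants.

import torch
import fbgemm_gpu  # noqa: F401 -- importing registers the torch.ops.fbgemm.* operators

# Complete cumulative sum over lengths, prepended with 0.
lengths = torch.tensor([3, 1, 2])
print(torch.ops.fbgemm.asynchronous_complete_cumsum(lengths))  # tensor([0, 3, 4, 6])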

GenAI

GEMM

  • [Fix] Fix CQS signal facebook-unused-include-check in fbcode/deeplearning/fbgemm/src [B] [A] (#4401)
  • [Improvement] Optimize some code out of compilation in the table lookup kernel (#4371)
  • [Improvement] pyre-fixmes for D75477355 - FBGEMM (#4390)
  • [Fix] fix feature eviction config conflict in inference operator (#4374)
  • [Improvement] Add CudaEvent Sync to Two Hop All To One Copies (#4367)
  • [Fix] fix output dtype issue in merge_pooled_embeddings when input tensors are all empty (#4325)
  • [Fix] Fix meta implementation for nobag (#4370)
  • [Fix] Enable stacked grouped GEMM (#4349)
  • [New] Add new kernels for Cutlass BF16 grouped GEMM for tuning cache (#4300)
  • [New] Support tuning cache for Cutlass BF16 grouped GEMM (#4299)
  • [Improvement] 0 tokens for gather_scale_dense_tokens (#4319)
  • [Improvement] improve write performance by ~10x (#4277)
  • [New] OSS MPZCH CUDA kernel in FBGEMM (#4214)
  • [Improvement] Migrate jagged tensor kernels to FBGEMM_LAUNCH_KERNEL, pt 1 (#4278)
  • [Improvement] wrap shard_input into dispatcher and access indices through data_ptr instead of [] (#4264)
  • [Improvement] Vectorize f16 conversion (#4253)
  • [Improvement] Use hardware-accelerated float16 conversion (#4245)
  • [Improvement] Speed up AMD training by reducing GPU atomic ops (#4255)
  • [Fix] fix diff failure for D75474404 (#4247)
  • [Improvement] Pass B_offsets to forward wrapper for MTIA (#4242)
  • [Fix] Issue warnings once. (#4164)
  • [Fix] Fix test breaking in internal github repo (#4162)
  • [New] Update heuristic for Cutlass BF16 Grouped GEMM (#4138)
  • [Improvement] Refactor Cutlass BF16 Grouped GEMM (#4124)
  • [Improvement] Add flag to disable fbgemm autovec in OSS (#4100)
  • [New] FuseScatterAdd supports non-WS kernels. (#4135)
  • [New] Build MXFP4/NVFP4 CUTLASS grouped GEMM (#4128)
  • [Fix] Disable some sources from build (#4120)
  • [Fix] Revert D73898778 (#4107)
  • [Improvement] More if-constexpr (#4054)
  • [Fix] Update repo version and reenable rowwise scaling (#4038)
  • [Improvement] Autodetect Triton WS support. (#4009)
  • [Fix] Minor fix to bf16 grouped gemm test (#4002)
  • [Improvement] Improve Fused8BitRowwiseQuantizedSBFloatToFloatOrHalfNeon by 2%-10% (#3879)

GenAI Ops

  • [Fix] Remove all imports of <torch/torch.h> from CK extensions (#4407)
  • [Improvement] BF16I4 Preshuffled Batched GEMM (#4399)
  • [Fix] silu_mul_quant fix (#4395)
  • [Fix] Fix trunk health GenAI tests (#4394)
  • [Improvement] add actual_batch_size to rope_qkv_varseq_prefill (#4380)
  • [Fix] silu mul quant torch api fix (#4376)
  • [Improvement] silu_mul API Update (#4359)
  • [Improvement] Add FP32 support for routing_score dtype (#4352)
  • [New] Add initial version of TuningCache and scripts for heuristic + kernel (#4289)
  • [Improvement] Update AI Codesign Cutlass to 4.0 (#4276)
  • [Improvement] modify convert_e4m3fn_kv_cache_to_e4m3fnuz_inplace kernel to support N_H_L dim and address correctness (#4286)
  • [Fix] Fix for T227105252 ("Your diff, D76285873, broke one test") (#4318)
  • [Improvement] Update torch API Definition (#4310)
  • [Fix] [fbgemm_gpu] Disable GenAI test in OSS (#4311)
  • [New] Fused SILU with quantization and RMS with quantization (#4204)
  • [New] Leverage fuse kernel in inference workload (#1237) (#4157)
  • [Improvement] Triton-based activation kernels run on valid tokens. (#4185)
  • [Improvement] Triton-based Gather/Scatter kernels run on valid tokens. (#4184)
  • [Improvement] IndexShuffling supports TP2EP. (#4180)
  • [Fix] iRoPE varseq flag for pre-calculated kv qparams (#4160)
  • [Improvement] Clean up IndexShuffling op. (#4155)
  • [Fix] Revert D74855940 (#4149)
  • [Improvement] [fbgemm_gpu] Enable building FB-internal sources only if CUDA version… (#4143)
  • [Improvement] Decouple some operator defs from operator impl (#4140)
  • [New] introduce kernel for converting e4m3fn kv_cache to e4m3fnuz (#4113)
  • [Fix] OSS CI fixes (#4077)
  • [New] Create dedicated kv cache header file (#4089)
  • [New] Add FP8 rowwise support for MetaShuffling demo. (#4094)
  • [New] pipelining one_shot and two_shot allreduce (#4081)
  • [Improvement] E2E cleanup and baseline. (#1164) (#4080)
  • [New] Add external qparams parameters to mqa_attn API (#4069)
  • [Improvement] Reduce OSS CI timeout (#4067)
  • [New] TokenShuffling MoE Example for OSS (#4065)
  • [Improvement] Add external qparams parameters to dequantize_int4_cache API (#4062)
  • [New] fbgemm_gpu.experimental.gen_ai.moe.silu_mul_quant. (#4059)
  • [Fix] Remove unused variable in gqa_attn_splitk_attn_kernel (#4014)
  • [New] fbgemm_gpu.experimental.gen_ai.moe.silu_mul. (#1144) (#4057)
  • [New] fbgemm_gpu.experimental.gen_ai.moe.scatter_add_dense_tokens. (#4055)
  • [New] int4 kv (#3878)
  • [Improvement] Optimize kv cache usage for yoco (#4030)
  • [Improvement] Use cudaMemset/hipMemset to setup IndexShuffling kernel. (#4016)
  • [Improvement] Cleanup gather/scatter ops. (#4001)
  • [New] torch.ops.fbgemm.gather_scale_quant_dense_tokens (#3995)
  • [Improvement] Make gqa tests device agnostic (#3985)
  • [New] Add index_shuffling to gather_scatter_bench. (#3999)
  • [Improvement] Fuse ScatterAdd into GroupedGEMM. (#3994)
  • [Improvement] GroupedGEMM perf improvement. (#3992)
  • [New] Open source TokenShuffling MoE kernels. (#3978)
  • [Improvement] Make kv_cache tests device agnostic (#3961)
  • [Improvement] Make quantize_bench easier to use in OSS (#3968)
  • [Improvement] Modernize FP8 Rowwise Tuning (#3955)
  • [Fix] reland fp8 kv cache dequantization fix (#3956)
  • [Improvement] Update DeepGemm with Latest Performance Improvements (#3954)
  • [New] Add DEEPGEMM Masked API. (#3949)
  • [New] Add harness for comms benchmark (#3936)
  • [Improvement] Use Int64 Indexing in Grouped Gemm (#3930)

FP8

  • [New] Support skip scaling for input tensor for Triton rowwise FP8 kernel (#4362)
  • [Improvement] Enrich auto-tune shapes for OC OBA model (#4368)
  • [Improvement] Vectorize load/store for FP8 Quantization (#4262)
  • [Improvement] Build and optimize BF16 grouped GEMM on blackwell (#4353)
  • [Improvement] Tune FP8 grouped GEMM for Llama4 shapes (#4326)
  • [Improvement] Add tensor checks in FP8 batched GEMM (#4347)
  • [Improvement] Improve heuristic for Cutlass FP8 Grouped GEMM (#4309)
  • [New] Support tuning cache for Cutlass FP8 Grouped GEMM (#4308)
  • [New] Support tuning cache for Cutlass FP8 GEMM (#4301)
  • [New] CK Preshuffle FP8 Rowwise GEMM (#4234)
  • [Improvement] Simplify CK FP8 Kernel Launch and enable FP16 Outputs. (#4233)
  • [Fix] Revert D76342974: Multisect successfully blamed "D76342974: [FBGEMM] Make Cutlass FP8 Rowwise bias always FP32" for one test failure (#4323)
  • [Improvement] Make Cutlass FP8 Rowwise bias always FP32 (#4317)
  • [Improvement] group gemm float8 dtype for AMD GPU (#4282)
  • [Improvement] Use float8e4m3fn for MI350+ (#4280)
  • [Improvement] Partition correctness issues from performance issues when pruning configs (#4261)
  • [Fix] Disable use_cuda_graph for inductor integration with non-persistent fp8_rowwise (#4259)
  • [Improvement] cap scale_ub used in fp8 kv_cache quantization (#4237)
  • [Fix] Fix FP8 rowwise GEMM meta kernel with 3d input tensor (#4231)
  • [Fix] Fix FP8 quantization meta kernels with 3d input tensor (#4221)
  • [Improvement] Add best config for FP8 medium shapes (#4169)
  • [Improvement] Update matmul_fp8_row_meta signature (#4168)
  • [Improvement] Support Bias in _kernel_matmul_fp8_row_non_persistent (#4167)
  • [Improvement] Optimize cudaGetDeviceProperties runtime overhead (#4209)
  • [Improvement] Enable FP8 Grouped/Batched/Regular GEMM with dispatched SM100 and optimizations (resubmit) (#4197)
  • [New] cache quantization (#4031)
  • [Fix] FP8 kv code improvement (#4040)
  • [Fix] Fix templates for FP8 Rowwise Slow Accumulation (#4037)

Triton

  • [Fix] Fix trunk health (#4379)
  • [Improvement] Replace torch quantization implementation with Triton version (#4217)
  • [Improvement] Support BF16 in Triton downcast quantization mx4 unpack kernel (#4203)
  • [Fix] FP4 Triton kernel bug fix (#4181)
  • [Improvement] INT64 address indexing. (#1049) (#3965)

Quantize Ops

  • [New] Utilities for slicing preshuffled tensors (#4396)
  • [Improvement] Support scale_bias_last on tbe lookup kernel (#4363)
  • [New] NVFP4 quantization emulation kernels as reference (#4324)
  • [Improvement] Update quantize_test to fix lint (#4355)
  • [Improvement] Disable comm/multi_gpu_car_test.py on non-CUDA machines (#4328)
  • [Improvement] NVFP4 kernel fusion and optimization (#4329)
  • [Improvement] SymInt-ify quantize.cpp (#4322)
  • [Fix] Fix IMA issue in QKV quantize kernel (#4305)
  • [New] Added triton implementation for nvfp4 quantization scheme (#4275)
  • [Improvement] Enable K that is not divisible by group size for shuffled mixed dtype kernels. (#4267)
  • [Fix] Fix the output shape for 3d input for f8f8bf16_rowwise_meta (#4254)
  • [Improvement] optimization of perKVhead quantization (#4161)
  • [Improvement] Allow multiple group sizes to be passed in quantize_bench (#4137)
  • [New] Add Llama4 shapes in quantize_bench (#4129)
  • [New] Add MXFP4 PT reference quantization kernel and refactor CUTLASS FP4 GEMM (#4117)
  • [New] Support Triton unpacked MXFP4 quantization kernel (#4116)
  • [Improvement] Migrate TBE inference kernels to FBGEMM_LAUNCH_KERNEL (#4092)
  • [Fix] [fbgemm_gpu] Remove sm_100 and sm_120 (#4024)
  • [New] Enable FP4 CUTLASS GEMM and CUDA quantization kernels (#4004)
  • [Fix] Fix weighted TBE inference NaN (un-init) row_weights (#4006)
  • [New] add logic for new heuristics of l4_17b_128e shapes (#3984)
  • [Improvement] refine the heuristics sweeping scripts for llama4 17b_128e shapes (#3983)
  • [Improvement] Performance Optimization: Improved TileShape Configuration for Large Llama Shapes (#3790)

TBE

TBE GPU

  • [Improvement] Move tbe weights as buffers so export can track properly (#4369)
  • [Improvement] Support prefetch pipeline in bounds_check_indices (#4312)
  • [New] tbe cpu nobag dispatch and backward pass kernel impl (#4303)
  • [Improvement] tbe cpu nobag dispatch and forward pass kernel impl (#4302)
  • [Fix] Fix the large indices forward test (#4232)
  • [Fix] Limit the grid size for the TBE forward kernel (#4208)
  • [New] Add auto-gen aux args for SSD TBE (#4220)
  • [Improvement] Replace enable_optimizer_offloading with aux_bool (#4219)
  • [Fix] pyre fixes for D75477355, group 1 (#4212)
  • [Improvement] Add more parameter specializations for autovec TBE kernels (#4153)
  • [New] Add TBE data configuration reporter to TBE forward (#4130)
  • [Fix] support filling partial rows from backend (#4158)
  • [Improvement] Migrate TBE cache kernels to FBGEMM_LAUNCH_KERNEL (#4127)
  • [Improvement] Migrate TBE utility kernels to FBGEMM_LAUNCH_KERNEL (#4122)
  • [New] Writeback support (#4103)
  • [New] Add meta functions for cache ops (#4118)
  • [Fix] Fix backward_dense_test (#3702)
  • [New] Add more parameter specializations for autovec TBE kernels (#4047)
  • [Improvement] Migrate TBE forward kernels to FBGEMM_LAUNCH_KERNEL (#4079)
  • [Improvement] Upcast round_up inputs/outputs/ to uint64_t (#4052)
  • [Fix] Fix test_indices_estimation max_i should not be <1 (#4056)
  • [Improvement] Make sure that all TBE grad_output is contiguous and 16-byte aligned (#4041)
  • [Improvement] Use overflow_safe_int_t in TBE forward training (#3953)
  • [Fix] [fbgemm_gpu] Add missing init.py (#4008)
  • [New] DramKVstore (#4005)
  • [Improvement] Support prefetch pipeline in bounds_check_indices (#3923)
  • [Improvement] Include NONE in bounds_check_mode validation (#3990)
  • [Fix] Fix learning_rate_tensor to avoid in-place op and recompilation issues (#3989)
  • [Fix] Fix bounds_check_indices v2 bug (#3979)
  • [New] Add __TEMPLATE_SOURCE_FILE__ macro to TBE codegen sources (#3962)
  • [Improvement] Add bound checks in TBE grad_indice_weights kernel (#3943)
  • [Improvement] Use bounds_check_indices v2 on ROCm (#3916)
  • [New] Add basic CLI for EEG parameter extraction (#3951)
  • [Improvement] Pass in sharding position information to TBE to facilitate logging / dump / etc. (#3927)
  • [Improvement] Fix flaky TBE unit tests (#3938)
  • [Fix] Fix get_infos_metadata meta dispatch (#3946)
  • [Improvement] Change set_learning_rate_tensor (#3945)

TBE SSD

  • [Fix] Remove debug_split_optimizer_states (#4397)
  • [Improvement] Update the rowwise adagrad optimizer to leverage optimizer state offloading, v4, frontend (#4249)
  • [Fix] fix bugs in D76548519 (#4378)
  • [Fix] patch fixes for eviction (#4304)
  • [Fix] [fbgemm_gpu] SSD test fix for OSS (#4351)
  • [Fix] Fixing reading from EmbeddingRocksDB connection (#4341)
  • [Improvement] Adding a separate utils file for KVTensorMetaData (#4298) (#4335)
  • [Improvement] Making create_rocksdb_hard_link_snapshot function a no_op (#4340)
  • [New] Adding E2E unit tests for KVTensorMetaData class (#4298)
  • [New] Adding KVTensorMetaData class (#4297)
  • [New] Adding get_kvtensor_serializable_metadata function to (#4296)
  • [Fix] ssd offloading: fix trunk break for optimizer state (#4313)
  • [Improvement] Optimize integration test (#4284)
  • [Fix] disable rdb ckpt with dram backend (#4292)
  • [Fix] [fbgemm_gpu] Fix DRAM test (#4288)
  • [New] Added unit tests for the entire ssd offloading using rocksdb checkpoint flow (#4228)
  • [Improvement] Expose SE/DESE support to EmbeddingRocksDBWrapper for training pipeline (#4227)
  • [Improvement] Adding Serialization and Deserialization functions for KVTensor (#4226)
  • [New] Creating ReadOnlyEmbeddingKVDB class and necessary functions (#4225)
  • [New] Creating RocksDBCheckpointHandler to expose rocksdb checkpoint to python (#4224)
  • [New] Adding function to create a snapshot and exposing it from EmbeddingRocksDBWrapper (#4223)
  • [New] Creating Checkpointhandle and connecting to EmbeddingRocksDB (#4222)
  • [Improvement] Adding helper function for enabling RocksDB Checkpoint (#4213)
  • [New] Track updated rows in SSDTBE (#4211)
  • [Improvement] Pass the updated embeddings to EmbeddingKVDB (#4210)
  • [Improvement] chunk processing l2 cache flush (#4216)
  • [Improvement] chunking opt split and fix duplicate flush (#4260)
  • [Improvement] optimize unit test running time (#4268)
  • [Improvement] add load checkpoint support for virtual table (#4250)
  • [Fix] fixes for dram kv enablement (#4246)
  • [Improvement] Dram integration test - split tbe training (#4241)
  • [New] Dram integration test - dram kv tensor wrapper (#4240)
  • [New] Dram optimizer initialization (#4239)
  • [Improvement] Adding a mutex lock to set_range function (#4207)
  • [Improvement] Update the rowwise adagrad optimizer to leverage optimizer state offloading, v4, backend (#4195)
  • [Improvement] support get state dict and apply state dict (#4145)
  • [New] Add tensor_stream unit test (#4099)
  • [New] implement optimizer state with opt offloading (#4141)
  • [Fix] Fix a lint error. (#4156)
  • [Improvement] shardTensor metadata recalc after checkpoint state_dict (#4146)
  • [Improvement] Migrate TBE SSD cache kernels to FBGEMM_LAUNCH_KERNEL (#4142)
  • [Improvement] make sure narrow returns contiguous tensor (#4139)
  • [Improvement] do not create random value for optimizer (#4132)
  • [Fix] fix lint error (#4131)
  • [Improvement] Append columns to the SSD cache for storing optimizer data, v3 (#4125)
  • [New] Support ordered read based on weight id in KVT (#4108)
  • [New] Add new interfaces to SSD TBE for checkpoint saving and loading (#4088)
  • [New] Add logic to stream weights in EmbeddingKVDB (#4058)
  • [Improvement] Add enable_raw_embedding_streaming from TBE config to EmbeddingKVDB (#4053)
  • [Improvement] update ssd tbe and kvtensor with UT (#4084)
  • [Fix] Fix list OSS (#4083)
  • [New] add UT for new use cases (#4074)
  • [New] support zero collision tables in ssd operator (#4033)
  • [Improvement] small changes for kvzch (#4073)
  • [New] add c++ bucket sorted id utils (#3982)
  • [New] add rocksdb backend api for getting id tensors (#3981)
  • [New] add kvtensor discrete id IO support (#3972)
  • [Improvement] skip multiple flush at the same iteration (#3967)
  • [Improvement] always toggle compaction to true (#3935)

Other Ops

Sparse Ops

  • [New] add optimized reorder_batched_ad_indices_kernel on AMD (#4388)
  • [Improvement] add more checks for CPU combined inputs coalescing (#4202)
  • [Improvement] preprocessor logic for hipBLAS V3 API changes (#4281) (#4320)
  • [Fix] Move sparse_op registration + correct sigmoid XL lowering settings (#4179)
  • [New] Add keep_orig_idx_per_feature parameter to block_bucketize_sparse_features kernel (#4027)
  • [Fix] [fbgemm_gpu] Fix missing operator registration (#4042)
  • [New] Move batched_complete_cumsum op to FBGEMM (#4036)

Permute Ops

  • [Improvement] Relax the checks for dimensions of pooled_embs (#4165)
  • [New] support permute_multi_embedding_function on torch.export (#3897)

Benchmarks

  • [New] handle inference buck gpu deps (#4358)
  • [New] kvzch inference python operator (#4344)
  • [Fix] Revert D75462895: Multisect successfully blamed "D75462895: [fbgemm_gpu] Add TBE data configuration reporter to TBE forward (v2)" for one test failure (#4381)
  • [New] Add TBE data configuration reporter to TBE forward (v2) (#4364)
  • [Improvement] [fbgemm_gpu] Upgrade benchmark workflows (#4337)
  • [Fix] comment out benchmark UT (#4338)
  • [New] Added silu and rms fusion with nvfp4 quantization for Triton kernel (#4285)
  • [Improvement] benchmark for DramKV (#4238)
  • [New] Add kernel execution timing to the KernelLauncher class (#4201)
  • [Fix] Fix vbe benchmark for MTIA (#4171)
  • [Fix] Fix TBE benchmark results logging (#4170)
  • [New] ReportTBE data configuration with EEG-based indices (squash stack from D73450767) (#4046)
  • [New] add batched support (#4003)
  • [Improvement] Cleanup shuffling ops. (#4013)
  • [New] Add Gather/Scatter related benchmark. (#3993)
  • [Improvement] Add Support for Indices and Offsets File Options in Benchmarking Script (#3966)
  • [New] Support MTIA for vbe (#3963)
  • [Fix] [fbgemm_gpu] Fix CPU benchmark scripts (#3941)

Better Engineering

Utilities

  • [Improvement] Migrate jagged tensor kernels to FBGEMM_LAUNCH_KERNEL, pt 2 (#4350)
  • [Improvement] Migrate SourceContext macros to source_context.h (#4392)
  • [Fix] Deprecate barrier isolation macros (#4357)
  • [Improvement] Decouple some operator defs from operator impl (#4272)
  • [Improvement] Decouple some operator defs from operator impl (#4175)
  • [Fix] Silence autodeps warnings (#4229)
  • [Improvement] Migrate TBE UVM cache kernels to FBGEMM_LAUNCH_KERNEL (#4193)
  • [Improvement] pyre fixes for D75477355, group 11 (#4215)
  • [Improvement] Migrate to the new PackedTensorAccessor TensorAccessor definitions (#3991)
  • [New] Add set_max_dynamic_smem (#4398)

Builds

  • [Improvement] [fbgemm_gpu] Update release versioning to 1.3 (#4408)
  • [Improvement] [fbgemm_gpu] Increase PIP install test timeouts (#4406)
  • [Improvement] Use newer CMake module features (#4377)
  • [Improvement] Increase timeout for Nova GenAI workflow (#4372)
  • [Improvement] [fbgemm_gpu] Upgrade CI instances (#4366)
  • [Improvement] FBGEMM build changes to support integration with pytorch (#4354)
  • [Fix] [fbgemm_gpu] Fix CUDA 12.9 OSS compilation for HSTU (#4360)
  • [Improvement] [fbgemm_gpu] Add build support for CUDA 12.9 (#4356)
  • [Improvement] [fbgemm_gpu] Integrate HSTU into OSS CI (#4236)
  • [Misc] OSS the TorchRec MPZCH CPU kernel functions (#4295)
  • [Improvement] Enable HSTU builds in fbcode (#4290)
  • [Fix] Fix use __fp16 by default (#4257)
  • [Fix] [fbgemm_gpu] Fix detection mechanism for FBPKG builds (#4243)
  • [Fix] Fix the logic around filtering FB-only code from the build (#4230)
  • [Improvement] [fbgemm_gpu] Improvements to the Nova build matrix filter script (#4206)
  • [Improvement] [fbgemm_gpu] Filter Build Coordinates Generated by Nova (#4196)
  • [Improvement] Enable FP8 Grouped/Batched/Regular GEMM with dispatched SM100 and optimizations (#4172)
  • [Fix] Trim constexpr from isA to improve Windows clang-cl support. (#4119)
  • [Fix] [fbgemm_gpu] Remove sm90 target from OSS builds (#4177)
  • [Fix] [fbgemm_gpu] Disable GenAI builds against CUDA 11.8 (#4173)
  • [Improvement] [fbgemm_gpu] Support ROCm 6.4 builds (#4114)
  • [Improvement] [fbgemm_gpu] Enable GenAI ROCm builds in CI and Nova workflows (#4066)
  • [Improvement] Support building for armv8.1 (#4068)
  • [Improvement] [ROCm OSS Enablement] Update bash build and install scripts to account for targets and variants (#4032)
  • [Fix] [ROCm OSS Enablement] Update setup.py to account for targets and variants (#4023)
  • [Fix] fix build that excludes a bunch of features (#4020)
  • [Improvement] [fbgemm_gpu] Add rate limiting to github metrics scripts (#4007)
  • [Fix] [fbgemm_gpu] Disable MoE tests in OSS (#3998)
  • [New] [FBGEMM][PR] [fbgemm_gpu] Enable ROCm builds for GenAI, pt 2 (#3996)
  • [New] [fbgemm_gpu] Enable ROCm builds for GenAI, pt 1 (#3910)
  • [Fix] [fbgemm_gpu] Fix CUDA publish version for PyPI (#3988)
  • [Fix] Fix runner for ROCm CI (#3986)
  • [Fix] fix lint issue (#3980)
  • [Fix] Fix split_embeddings_utils in CMake (#3974)
  • [Fix] [fbgemm_gpu] Disable grouped GEMM tests in OSS (#3971)
  • [Fix] Guard CUDA API call against older driver versions (#3970)
  • [Fix] [fbgemm_gpu] Fix coalesce ops build in OSS (#3964)
  • [Fix] [fbgemm_gpu] Install missing pyyaml package in PIP install tests (#3957)
  • [Improvement] [fbgemm_gpu] Update release workflows (#3952)
  • [Improvement] [fbgemm_gpu] Reduce OSS build sizes for non-GenAI FBGEMM_GPU (#3948)
  • [New] [fbgemm_gpu] Add Scripts for Generating Release Reports (#3676)
  • [Fix] [fbgemm_gpu] Fix ROCm test reliability (#4385)
  • [Fix] [fbgemm_gpu] ROCm fixes for CI (#4345)

Documentation

  • [Fix] Fix benchmark helper text (#4375)
  • [Fix] [fbgemm_gpu] Fix broken docs link (#4252)
  • [Fix] [fbgemm_gpu] Fix releases docs (#4188)
  • [Improvement] Update ReadMe. (#4126)
  • [Improvement] Update fbgemm gen_ai README (#3997)
  • [Improvement] [fbgemm_gpu] Update release version in docs (#3959)

Other

Utilities

  • [Improvement] Decouple embedding_ssd_{}_pt2_autograd from CUDA files (#4389)
  • [Fix] Back out "Add make directory to filestore abstraction" (#4386)
  • [Improvement] Add CudaEvents Barrier before MemCpy V33 (#4348)
  • [Improvement] kv embedding inference cache wrapper (#4343)
  • [New] Implement a stat library for fbgemm embedding (#4339)
  • [Improvement] Add make directory to filestore abstraction (#4346)
  • [Improvement] Add manifold wrapper (#4291)
  • [Improvement] Prevent duplicate operator registrations (#4327)
  • [Fix] Replace references to folly::hash::fnv(32|64) with the _BROKEN alias (#4332)
  • [Fix] [CK] Update CK revision to include fix for F16 atomics (#4306)
  • [Fix] Back out "Migrate TBE UVM cache kernels to FBGEMM_LAUNCH_KERNEL" (#4315)
  • [Improvement] Include template filename in the DSA file descriptor (#4287)
  • [Improvement] Add feature evict for dram_kv_embedding_cache. (#4187)
  • [Fix] Fix Signed-Unsigned Comparison in Tensor Utils (#4279)
  • [Improvement] CK Version Update (#4235)
  • [Improvement] Jemalloc Mempool and Adaptation for CPU HASHTABLE (#4154)
  • [Fix] Revert D73927918 (#4190)
  • [Improvement] [fbgemm_gpu] Install conda using Miniforge (#4178)
  • [Fix] Silence autodeps warnings (#4163)
  • [New] Add checks for dimensions of pooled_embs (#4159)
  • [Fix] [fbgemm_gpu] Change Conda update channel (#4144)
  • [Improvement] Migrate embedding_bounds_check to FBGEMM_LAUNCH_KERNEL (#4104)
  • [Fix] Re-enable c10_retrieve_device_side_assertion_info in kernel launch check (#4109)
  • [Fix] Remove call to c10_retrieve_device_side_assertion_info for now (#4098)
  • [Fix] Deprecate get_device_properties for at::cuda::getDeviceProperties (#4096)
  • [Improvement] update hipify_torch submodule for version 2 (#4028) (#4093)
  • [Improvement] Update bounds_check_indices version selection logic (#4086)
  • [Improvement] Simplify weight row cache load and evict routines, v2 (#4087)
  • [New] Compute optimizer state pointer in table row (#4078)
  • [Improvement] Refactor bounds_check_indices (#4049)
  • [Improvement] Clean up WeightRow in preparation for optimizer state offloading (#4021)
  • [New] Enable NaN checks on tensor arguments to kernel launches (#4029)
  • [Fix] Use CUDAStream instead of cudaStream_t in kernel_launcher (#4071)
  • [Fix] Remove cudaStreamGetDevice() call (#4070)
  • [Improvement] use fmaf to do multiply and add (#4051)
  • [Fix] Fix shared memory check for HIP (#4044)
  • [Fix] Back out "Migrate make_pta_acc_format() away from old macros, v2]" (#4039)
  • [Improvement] Move ComputeDevice to split_table_batched_embeddings_ops_common (#4034)
  • [Improvement] Optimize if-statements with if-constexpr (#4022)
  • [Fix] Fixes and enhancements to FBGEMM_LAUNCH_KERNEL (#4015)
  • [Improvement] Expand the EEG CLI utility to support indices generation (#4017)
  • [Improvement] Remove old code associated with TensorAccessor creation (#4010)
  • [Improvement] Simplify WeightRowAccessor based on its usage (#3975)
  • [New] Add kernel to initialize shared memory for testing (#4000)
  • [Improvement] PackedTensorAccessor cleanup (#3987)
  • [Improvement] TensorAccessor cleanup (#3973)
  • [New] Incorporate __TEMPLATE_SOURCE_FILE__ into kernel launcher facilities (#3969)
  • [New] Incorporate Torch DSA (#3950)
  • [New] Better kernel launch utilities, pt 2 (#3947)
  • [New] Better kernel launch utilities (#3914)

Miscellaneous

  • [New] Add HSTU in fbgemm_gpu/experimental/ (#4090)
  • [Fix] Fix inputs in EmbeddingSpMDM8BitTest (#4200)
  • [Fix] Fix weights pointer not moving in idx==-1 case (#4199)
  • [Improvement] Support ROCm 6.4 builds, pt 2 (#4151)
  • [Fix] Fix cpuinfo not being initialized before checking for ARM SVE2 (#4121)
  • [Fix] Fix illegal memory access when weights are partially empty in input combine cuda (#4101) (#4111)
  • [Fix] Fix illegal memory access when weights are partially empty in input combine cuda (#4101)
  • [Fix] Add +fp16fml to -march for aarch64 (#4091)
  • [New] Allow merge_pooled_embedding take in device without index (#4061)
  • [Improvement] Migrate make_pta_acc_format() away from old macros, v3 (#4048)
  • [Fix] fix the type hack in dramKV wrapper (#4012)
