Highlights
TBE
- Added new kernels and improved dispatch for CPU and GPU.
- Optimizations for SSD offloading, RocksDB integration, and checkpoint handling.
- Various bug fixes and performance improvements in TBE forward and backward passes.
Gen AI Ops
- Numerous fixes and optimizations in GEMM kernels, including Cutlass BF16 grouped GEMM tuning cache support.
- New kernels and performance improvements for FP8, Triton, and quantization operations.
- Added support for fused SILU with quantization and RMS with quantization.
- Enhanced heuristics and API updates for GenAI operators.
- Improved GPU atomic operations and kernel vectorization.
New Ops
- Introduction of HSTU ops, courtesy of NVIDIA.
Benchmarking
- New benchmarks for inference, Gather/Scatter, and DramKV.
Better Engineering
- Added build support for CUDA 12.9.
- Upgraded CI instances and build matrix filters.
- Various fixes for OSS compatibility and build stability.
Software Requirements
FBGEMM_GPU v1.3.0 has been tested and is known to work on the following setups:
- PyTorch: v2.8
- CUDA: v12.6, 12.8, 12.9
- Python: v3.9, 3.10, 3.11, 3.12, 3.13
It is recommended to prepare an isolated environment for installing and running FBGEMM_GPU (instructions here) and FBGEMM GenAI (instructions here).
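As a minimal sketch (assuming Python 3.9+ is already on the system; the environment name below is arbitrary, and the linked instructions describe the fully supported setup), an isolated environment can be created with Python's built-in venv:
# Create and activate an isolated environment, then update pip
python3 -m venv fbgemm-env
source fbgemm-env/bin/activate
python -m pip install --upgrade pip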
Availability
FBGEMM_GPU and FBGEMM GenAI can be fetched directly from PyPI:
# FBGEMM_GPU - CUDA (only the CUDA 12.6 variant is available)
pip install fbgemm-gpu==1.3.0
# FBGEMM_GPU - CPU
pip install fbgemm-gpu-cpu==1.3.0
# FBGEMM GenAI
pip install fbgemm-gpu-genai==1.3.0
Alternatively, they can be fetched from the PyTorch PIP index:
# FBGEMM_GPU - CUDA
pip install fbgemm-gpu==1.3.0 --index-url https://download.pytorch.org/whl/cu126/
pip install fbgemm-gpu==1.3.0 --index-url https://download.pytorch.org/whl/cu128/
pip install fbgemm-gpu==1.3.0 --index-url https://download.pytorch.org/whl/cu129/
# FBGEMM_GPU - CPU
pip install fbgemm-gpu==1.3.0 --index-url https://download.pytorch.org/whl/cpu
# FBGEMM GenAI
pip install fbgemm-gpu-genai==1.3.0 --index-url https://download.pytorch.org/whl/cu126/
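As a quick post-install smoke test (a sketch; it assumes the fbgemm-gpu or fbgemm-gpu-cpu package from above and only checks that importing fbgemm_gpu registers the torch.ops.fbgemm operators), run:
# Importing fbgemm_gpu registers the FBGEMM operators; this prints tensor([0, 1, 3, 6])
python -c "import torch, fbgemm_gpu; print(torch.ops.fbgemm.asynchronous_complete_cumsum(torch.tensor([1, 2, 3])))"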
GenAI
GEMM
- [Fix] Fix CQS signal facebook-unused-include-check in fbcode/deeplearning/fbgemm/src [B] [A] (#4401)
- [Improvement] Optimize some code out of compilation in the table lookup kernel (#4371)
- [Improvement] pyre-fixmes for D75477355 - FBGEMM (#4390)
- [Fix] fix feature eviction config conflict in inference operator (#4374)
- [Improvement] Add CudaEvent Sync to Two Hop All To One Copies (#4367)
- [Fix] fix output dtype issue in merge_pooled_embeddings when input tensors are all empty (#4325)
- [Fix] Fix meta implementation for nobag (#4370)
- [Fix] Enable stacked grouped GEMM (#4349)
- [New] Add new kernels for Cutlass BF16 grouped GEMM for tuning cache (#4300)
- [New] Support tuning cache for Cutlass BF16 grouped GEMM (#4299)
- [Improvement] 0 tokens for gather_scale_dense_tokens (#4319)
- [Improvement] improve write performance by ~10x (#4277)
- [New] OSS MPZCH CUDA kernel in FBGEMM (#4214)
- [Improvement] Migrate jagged tensor kernels to `FBGEMM_LAUNCH_KERNEL`, pt 1 (#4278)
- [Improvement] wrap shard_input into dispatcher and access indices through data_ptr instead of [] (#4264)
- [Improvement] Vectorize f16 conversion (#4253)
- [Improvement] Use hardware-accelerated float16 conversion (#4245)
- [Improvement] Speed up AMD training by reducing GPU atomic ops (#4255)
- [Fix] fix diff failure for D75474404 (#4247)
- [Improvement] Pass B_offsets to forward wrapper for MTIA (#4242)
- [Fix] Issue warnings once. (#4164)
- [Fix] Fix test breaking in internal github repo (#4162)
- [New] Update heuristic for Cutlass BF16 Grouped GEMM (#4138)
- [Improvement] Refactor Cutlass BF16 Grouped GEMM (#4124)
- [Improvement] Add flag to disable fbgemm autovec in OSS (#4100)
- [New] FuseScatterAdd supports non-WS kernels. (#4135)
- [New] Build MXFP4/NVFP4 CUTLASS grouped GEMM (#4128)
- [Fix] Disable some sources from build (#4120)
- [Fix] Revert D73898778 (#4107)
- [Improvement] More if-constexpr (#4054)
- [Fix] Update repo version and reenable rowwise scaling (#4038)
- [Improvement] Autodetect Triton WS support. (#4009)
- [Fix] Minor fix to bf16 grouped gemm test (#4002)
- [Improvement] Improve Fused8BitRowwiseQuantizedSBFloatToFloatOrHalfNeon by 2%-10% (#3879)
GenAI Ops
- [Fix] Remove all imports of <torch/torch.h> from CK extensions (#4407)
- [Improvement] BF16I4 Preshuffled Batched GEMM (#4399)
- [Fix] silu_mul_quant fix (#4395)
- [Fix] Fix trunk health GenAI tests (#4394)
- [Improvement] add actual_batch_size to rope_qkv_varseq_prefill (#4380)
- [Fix] silu mul quant torch api fix (#4376)
- [Improvement] silu_mul API Update (#4359)
- [Improvement] Add FP32 support for routing_score dtype (#4352)
- [New] Add initial version of TuningCache and scripts for heuristic + kernel (#4289)
- [Improvement] Update AI Codesign Cutlass to 4.0 (#4276)
- [Improvement] modify convert_e4m3fn_kv_cache_to_e4m3fnuz_inplace kernel to support N_H_L dim and address correctness (#4286)
- [Fix] Fix for T227105252 ("Your diff, D76285873, broke one test") (#4318)
- [Improvement] Update torch API Definition (#4310)
- [Fix] [fbgemm_gpu] Disable GenAI test in OSS (#4311)
- [New] Fused SILU with quantization and RMS with quantization (#4204)
- [New] Leverage fuse kernel in inference workload (#1237) (#4157)
- [Improvement] Triton based activation kernels runs on valid tokens. (#4185)
- [Improvement] Triton based Gather/Scatter kernels runs on valid tokens. (#4184)
- [Improvement] IndexShuffling supports TP2EP. (#4180)
- [Fix] iRoPE varseq flag for pre-calculated kv qparams (#4160)
- [Improvement] Clean up IndexShuffling op. (#4155)
- [Fix] Revert D74855940 (#4149)
- [Improvement] [fbgemm_gpu] Enable building FB-internal sources only if CUDA version… (#4143)
- [Improvement] Decouple some operator defs from operator impl (#4140)
- [New] introduce kernel for converting e4m3fn kv_cache to e4m3fnuz (#4113)
- [Fix] OSS CI fixes (#4077)
- [New] Create dedicated kv cache header file (#4089)
- [New] Add FP8 rowwise support for MetaShuffling demo. (#4094)
- [New] pipelining one_shot and two_shot allreduce (#4081)
- [Improvement] E2E cleanup and baseline. (#1164) (#4080)
- [New] Add external qparams parameters to mqa_attn API (#4069)
- [Improvement] Reduce OSS CI timeout (#4067)
- [New] TokenShuffling MoE Example for OSS (#4065)
- [Improvement] Add external qparams parameters to dequantize_int4_cache API (#4062)
- [New] `fbgemm_gpu.experimental.gen_ai.moe.silu_mul_quant`. (#4059)
- [Fix] Remove unused variable in gqa_attn_splitk_attn_kernel (#4014)
- [New] `fbgemm_gpu.experimental.gen_ai.moe.silu_mul`. (#1144) (#4057)
- [New] `fbgemm_gpu.experimental.gen_ai.moe.scatter_add_dense_tokens`. (#4055)
- [New] int4 kv (#3878)
- [Improvement] Optimize kv cache usage for yoco (#4030)
- [Improvement] Use cudaMemset/hipMemset to setup IndexShuffling kernel. (#4016)
- [Improvement] Cleanup gather/scatter ops. (#4001)
- [New] `torch.ops.fbgemm.gather_scale_quant_dense_tokens` (#3995)
- [Improvement] Make gqa tests device agnostic (#3985)
- [New] Add `index_shuffling` to `gather_scatter_bench`. (#3999)
- [Improvement] Fuse ScatterAdd into GroupedGEMM. (#3994)
- [Improvement] GroupedGEMM perf improvement. (#3992)
- [New] Open source TokenShuffling MoE kernels. (#3978)
- [Improvement] Make kv_cache tests device agnostic (#3961)
- [Improvement] Make quantize_bench easier to use in OSS (#3968)
- [Improvement] Modernize FP8 Rowwise Tuning (#3955)
- [Fix] reland fp8 kv cache dequantization fix (#3956)
- [Improvement] Update DeepGemm with Latest Performance Improvements (#3954)
- [New] Add DEEPGEMM Masked API. (#3949)
- [New] Add harness for comms benchmark (#3936)
- [Improvement] Use Int64 Indexing in Grouped Gemm (#3930)
FP8
- [New] Support skip scaling for input tensor for Triton rowwise FP8 kernel (#4362)
- [Improvement] Enrich auto-tune shapes for OC OBA model (#4368)
- [Improvement] Vectorize load/store for FP8 Quantization (#4262)
- [Improvement] Build and optimize BF16 grouped GEMM on blackwell (#4353)
- [Improvement] Tune FP8 grouped GEMM for Llama4 shapes (#4326)
- [Improvement] Add tensor checks in FP8 batched GEMM (#4347)
- [Improvement] Improve heuristic for Cutlass FP8 Grouped GEMM (#4309)
- [New] Support tuning cache for Cutlass FP8 Grouped GEMM (#4308)
- [New] Support tuning cache for Cutlass FP8 GEMM (#4301)
- [New] CK Preshuffle FP8 Rowwise GEMM (#4234)
- [Improvement] Simplify CK FP8 Kernel Launch and enable FP16 Outputs. (#4233)
- [Fix] Revert D76342974: Multisect successfully blamed "D76342974: [FBGEMM] Make Cutlass FP8 Rowwise bias always FP32" for one test failure (#4323)
- [Improvement] Make Cutlass FP8 Rowwise bias always FP32 (#4317)
- [Improvement] group gemm float8 dtype for AMD GPU (#4282)
- [Improvement] Use float8e4m3fn for MI350+ (#4280)
- [Improvement] Partition correctness issues from performance issues when pruning configs (#4261)
- [Fix] Disable use_cuda_graph for inductor integration with non-persistent fp8_rowwise (#4259)
- [Improvement] cap scale_ub used in fp8 kv_cache quantization (#4237)
- [Fix] Fix FP8 rowwise GEMM meta kernel with 3d input tensor (#4231)
- [Fix] Fix FP8 quantization meta kernels with 3d input tensor (#4221)
- [Improvement] Add best config for FP8 medium shapes (#4169)
- [Improvement] Update matmul_fp8_row_meta signature (#4168)
- [Improvement] Support Bias in _kernel_matmul_fp8_row_non_persistent (#4167)
- [Improvement] Optimize cudaGetDeviceProperties runtime overhead (#4209)
- [Improvement] Enable FP8 Grouped/Batched/Regular GEMM with dispatched SM100 and optimizations (resubmit) (#4197)
- [New] cache quantization (#4031)
- [Fix] FP8 kv code improvement (#4040)
- [Fix] Fix templates for FP8 Rowwise Slow Accumulation (#4037)
Triton
- [Fix] Fix trunk health (#4379)
- [Improvement] Replace torch quantization implementation with Triton version (#4217)
- [Improvement] Support BF16 in Triton downcast quantization mx4 unpack kernel (#4203)
- [Fix] FP4 Triton kernel bug fix (#4181)
- [Improvement] INT64 address indexing. (#1049) (#3965)
Quantize Ops
- [New] Utilities for slicing preshuffled tensors (#4396)
- [Improvement] Support scale_bias_last on tbe lookup kernel (#4363)
- [New] NVFP4 quantization emulation kernels as reference (#4324)
- [Improvement] Update quantize_test to fix lint (#4355)
- [Improvement] Disable `comm/multi_gpu_car_test.py` on non-CUDA machines (#4328)
- [Improvement] NVFP4 kernel fusion and optimization (#4329)
- [Improvement] SymInt-ify quantize.cpp (#4322)
- [Fix] Fix IMA issue in QKV quantize kernel (#4305)
- [New] Added triton implementation for nvfp4 quantization scheme (#4275)
- [Improvement] Enable K that is not divisible by group size for shuffled mixed dtype kernels. (#4267)
- [Fix] Fix the output shape for 3d input for `f8f8bf16_rowwise_meta` (#4254)
- [Improvement] optimization of perKVhead quantization (#4161)
- [Improvement] Allow multiple group sizes to be passed in quantize_bench (#4137)
- [New] Add Llama4 shapes in quantize_bench (#4129)
- [New] Add MXFP4 PT reference quantization kernel and refactor CUTLASS FP4 GEMM (#4117)
- [New] Support Triton unpacked MXFP4 quantization kernel (#4116)
- [Improvement] Migrate TBE inference kernels to `FBGEMM_LAUNCH_KERNEL` (#4092)
- [Fix] [fbgemm_gpu] Remove `sm_100` and `sm_120` (#4024)
- [New] Enable FP4 CUTLASS GEMM and CUDA quantization kernels (#4004)
- [Fix] Fix weighted TBE inference NaN (un-init) row_weights (#4006)
- [New] add logic for new heuristics of l4_17b_128e shapes (#3984)
- [Improvement] refine the heuristics sweeping scripts for llama4 17b_128e shapes (#3983)
- [Improvement] Performance Optimization: Improved TileShape Configuration for Large Llama Shapes (#3790)
TBE
TBE GPU
- [Improvement] Move tbe weights as buffers so export can track properly (#4369)
- [Improvement] Support prefetch pipeline in bounds_check_indices (#4312)
- [New] tbe cpu nobag dispatch and backward pass kernel impl (#4303)
- [Improvement] tbe cpu nobag dispatch and forward pass kernel impl (#4302)
- [Fix] Fix the large indices forward test (#4232)
- [Fix] Limit the grid size for the TBE forward kernel (#4208)
- [New] Add auto-gen aux args for SSD TBE (#4220)
- [Improvement] Replace enable_optimizer_offloading with aux_bool (#4219)
- [Fix] pyre fixes for D75477355, group 1 (#4212)
- [Improvement] Add more parameter specializations for autovec TBE kernels (#4153)
- [New] Add TBE data configuration reporter to TBE forward (#4130)
- [Fix] support filling partial rows from backend (#4158)
- [Improvement] Migrate TBE cache kernels to `FBGEMM_LAUNCH_KERNEL` (#4127)
- [Improvement] Migrate TBE utility kernels to `FBGEMM_LAUNCH_KERNEL` (#4122)
- [New] Writeback support (#4103)
- [New] Add meta functions for cache ops (#4118)
- [Fix] Fix backward_dense_test (#3702)
- [New] Add more parameter specializations for autovec TBE kernels (#4047)
- [Improvement] Migrate TBE forward kernels to `FBGEMM_LAUNCH_KERNEL` (#4079)
- [Improvement] Upcast round_up inputs/outputs to uint64_t (#4052)
- [Fix] Fix test_indices_estimation max_i should not be <1 (#4056)
- [Improvement] Make sure that all TBE grad_output is contiguous and 16-byte aligned (#4041)
- [Improvement] Use overflow_safe_int_t in TBE forward training (#3953)
- [Fix] [fbgemm_gpu] Add missing init.py (#4008)
- [New] DramKVstore (#4005)
- [Improvement] Support prefetch pipeline in bounds_check_indices (#3923)
- [Improvement] Include NONE in bounds_check_mode validation (#3990)
- [Fix] Fix learning_rate_tensor to avoid in-place op and recompilation issues (#3989)
- [Fix] Fix bounds_check_indices v2 bug (#3979)
- [New] Add `__TEMPLATE_SOURCE_FILE__` macro to TBE codegen sources (#3962)
- [Improvement] Add bound checks in TBE grad_indice_weights kernel (#3943)
- [Improvement] Use bounds_check_indices v2 on ROCm (#3916)
- [New] Add basic CLI for EEG parameter extraction (#3951)
- [Improvement] Pass in sharding position information to TBE to facilitate logging / dump / etc. (#3927)
- [Improvement] Fix flaky TBE unit tests (#3938)
- [Fix] Fix get_infos_metadata meta dispatch (#3946)
- [Improvement] Change set_learning_rate_tensor (#3945)
TBE SSD
- [Fix] Remove `debug_split_optimizer_states` (#4397)
- [Improvement] Update the rowwise adagrad optimizer to leverage optimizer state offloading, v4, frontend (#4249)
- [Fix] fix bugs in D76548519 (#4378)
- [Fix] patch fixes for eviction (#4304)
- [Fix] [fbgemm_gpu] SSD test fix for OSS (#4351)
- [Fix] Fixing reading from EmbeddingRocksDB connection (#4341)
- [Improvement] Adding a separate utils file for KVTensorMetaData (#4298) (#4335)
- [Improvement] Making create_rocksdb_hard_link_snapshot function a no_op (#4340)
- [New] Adding E2E unit tests for KVTensorMetaData class (#4298)
- [New] Adding KVTensorMetaData class (#4297)
- [New] Adding get_kvtensor_serializable_metadata function to (#4296)
- [Fix] ssd offloading: fix trunk break for optimizer state (#4313)
- [Improvement] Optimize integration test (#4284)
- [Fix] disable rdb ckpt with dram backend (#4292)
- [Fix] [fbgemm_gpu] Fix DRAM test (#4288)
- [New] Added unit tests for the entire ssd offloading using rocksdb checkpoint flow (#4228)
- [Improvement] Expose SE/DESE support to EmbeddingRocksDBWrapper for training pipeline (#4227)
- [Improvement] Adding Serialization and Deserialization functions for KVTensor (#4226)
- [New] Creating ReadOnlyEmbeddingKVDB class and necessary functions (#4225)
- [New] Creating RocksDBCheckpointHandler to expose rocksdb checkpoint to python (#4224)
- [New] Adding function to create a snapshot and exposing it from EmbeddingRocksDBWrapper (#4223)
- [New] Creating Checkpointhandle and connecting to EmbeddingRocksDB (#4222)
- [Improvement] Adding helper function for enabling RocksDB Checkpoint (#4213)
- [New] Track updated rows in SSDTBE (#4211)
- [Improvement] Pass the updated embeddings to EmbeddingKVDB (#4210)
- [Improvement] chunk processing l2 cache flush (#4216)
- [Improvement] chunking opt split and fix duplicate flush (#4260)
- [Improvement] optimize unit test running time (#4268)
- [Improvement] add load checkpoint support for virtual table (#4250)
- [Fix] fixes for dram kv enablement (#4246)
- [Improvement] Dram integration test - split tbe training (#4241)
- [New] Dram integration test - dram kv tensor wrapper (#4240)
- [New] Dram optimizer initialization (#4239)
- [Improvement] Adding a mutex lock to set_range function (#4207)
- [Improvement] Update the rowwise adagrad optimizer to leverage optimizer state offloading, v4, backend (#4195)
- [Improvement] support get state dict and apply state dict (#4145)
- [New] Add tensor_stream unit test (#4099)
- [New] implement optimizer state with opt offloading (#4141)
- [Fix] Fix a lint error. (#4156)
- [Improvement] shardTensor metadata recalc after checkpoint state_dict (#4146)
- [Improvement] Migrate TBE SSD cache kernels to `FBGEMM_LAUNCH_KERNEL` (#4142)
- [Improvement] make sure narrow returns contiguous tensor (#4139)
- [Improvement] do not create random value for optimizer (#4132)
- [Fix] fix lint error (#4131)
- [Improvement] Append columns to the SSD cache for storing optimizer data, v3 (#4125)
- [New] Support ordered read based on weight id in KVT (#4108)
- [New] Add new interfaces to SSD TBE for checkpoint saving and loading (#4088)
- [New] Add logic to stream weights in EmbeddingKVDB (#4058)
- [Improvement] Add enable_raw_embedding_streaming from TBE config to EmbeddingKVDB (#4053)
- [Improvement] update ssd tbe and kvtensor with UT (#4084)
- [Fix] Fix list OSS (#4083)
- [New] add UT for new use cases (#4074)
- [New] support zero collision tables in ssd operator (#4033)
- [Improvement] small changes for kvzch (#4073)
- [New] add c++ bucket sorted id utils (#3982)
- [New] add rocksdb backend api for getting id tensors (#3981)
- [New] add kvtensor discrete id IO support (#3972)
- [Improvement] skip multiple flush at the same iteration (#3967)
- [Improvement] always toggle compaction to true (#3935)
Other Ops
Sparse Ops
- [New] add optimized reorder_batched_ad_indices_kernel on AMD (#4388)
- [Improvement] add more checks for CPU combined inputs coalescing (#4202)
- [Improvement] preprocessor logic for hipBLAS V3 API changes (#4281) (#4320)
- [Fix] Move sparse_op registration + correct sigmoid XL lowering settings (#4179)
- [New] Add keep_orig_idx_per_feature parameter to block_bucketize_sparse_features kernel (#4027)
- [Fix] [fbgemm_gpu] Fix missing operator registration (#4042)
- [New] Move batched_complete_cumsum op to FBGEMM (#4036)
Permute Ops
- [Improvement] Relax the checks for dimensions of pooled_embs (#4165)
- [New] support permute_multi_embedding_function on torch.export (#3897)
Benchmarks
- [New] handle inference buck gpu deps (#4358)
- [New] kvzch inference python operator (#4344)
- [Fix] Revert D75462895: Multisect successfully blamed "D75462895: [fbgemm_gpu] Add TBE data configuration reporter to TBE forward (v2)" for one test failure (#4381)
- [New] Add TBE data configuration reporter to TBE forward (v2) (#4364)
- [Improvement] [fbgemm_gpu] Upgrade benchmark workflows (#4337)
- [Fix] comment out benchmark UT (#4338)
- [New] Added silu and rms fusion with nvfp4 quantization for Triton kernel (#4285)
- [Improvement] benchmark for DramKV (#4238)
- [New] Add kernel execution timing to the `KernelLauncher` class (#4201)
- [Fix] Fix vbe benchmark for MTIA (#4171)
- [Fix] Fix TBE benchmark results logging (#4170)
- [New] Report TBE data configuration with EEG-based indices (squash stack from D73450767) (#4046)
- [New] add batched support (#4003)
- [Improvement] Cleanup shuffling ops. (#4013)
- [New] Add Gather/Scatter related benchmark. (#3993)
- [Improvement] Add Support for Indices and Offsets File Options in Benchmarking Script (#3966)
- [New] Support MTIA for vbe (#3963)
- [Fix] [fbgemm_gpu] Fix CPU benchmark scripts (#3941)
Better Engineering
Utilities
- [Improvement] Migrate jagged tensor kernels to `FBGEMM_LAUNCH_KERNEL`, pt 2 (#4350)
- [Improvement] Migrate SourceContext macros to source_context.h (#4392)
- [Fix] Deprecate barrier isolation macros (#4357)
- [Improvement] Decouple some operator defs from operator impl (#4272)
- [Improvement] Decouple some operator defs from operator impl (#4175)
- [Fix] Silence autodeps warnings (#4229)
- [Improvement] Migrate TBE UVM cache kernels to `FBGEMM_LAUNCH_KERNEL` (#4193)
- [Improvement] pyre fixes for D75477355, group 11 (#4215)
- [Improvement] Migrate to the new `PackedTensorAccessor`/`TensorAccessor` definitions (#3991)
- [New] Add `set_max_dynamic_smem` (#4398)
Builds
- [Improvement] [fbgemm_gpu] Update release versioning to 1.3 (#4408)
- [Improvement] [fbgemm_gpu] Increase PIP install test timeouts (#4406)
- [Improvement] Use newer CMake module features (#4377)
- [Improvement] Increase timeout for Nova GenAI workflow (#4372)
- [Improvement] [fbgemm_gpu] Upgrade CI instances (#4366)
- [Improvement] FBGEMM build changes to support integration with pytorch (#4354)
- [Fix] [fbgemm_gpu] Fix CUDA 12.9 OSS compilation for HSTU (#4360)
- [Improvement] [fbgemm_gpu] Add build support for CUDA 12.9 (#4356)
- [Improvement] [fbgemm_gpu] Integrate HSTU into OSS CI (#4236)
- [Misc] OSS the TorchRec MPZCH CPU kernel functions (#4295)
- [Improvement] Enable HSTU builds in fbcode (#4290)
- [Fix] Fix use __fp16 by default (#4257)
- [Fix] [fbgemm_gpu] Fix detection mechanism for FBPKG builds (#4243)
- [Fix] Fix the logic around filtering FB-only code from the build (#4230)
- [Improvement] [fbgemm_gpu] Improvements to the Nova build matrix filter script (#4206)
- [Improvement] [fbgemm_gpu] Filter Build Coordinates Generated by Nova (#4196)
- [Improvement] Enable FP8 Grouped/Batched/Regular GEMM with dispatched SM100 and optimizations (#4172)
- [Fix] Trim constexpr from isA to improve Windows clang-cl support. (#4119)
- [Fix] [fbgemm_gpu] Remove sm90 target from OSS builds (#4177)
- [Fix] [fbgemm_gpu] Disable GenAI builds against CUDA 11.8 (#4173)
- [Improvement] [fbgemm_gpu] Support ROCm 6.4 builds (#4114)
- [Improvement] [fbgemm_gpu] Enable GenAI ROCm builds in CI and Nova workflows (#4066)
- [Improvement] Support building for armv8.1 (#4068)
- [Improvement] [ROCm OSS Enablement] Update bash build and install scripts to account for targets and variants (#4032)
- [Fix] [ROCm OSS Enablement] Update setup.py to account for targets and variants (#4023)
- [Fix] fix build that excludes a bunch of features (#4020)
- [Improvement] [fbgemm_gpu] Add rate limiting to github metrics scripts (#4007)
- [Fix] [fbgemm_gpu] Disable MoE tests in OSS (#3998)
- [New] [FBGEMM][PR] [fbgemm_gpu] Enable ROCm builds for GenAI, pt 2 (#3996)
- [New] [fbgemm_gpu] Enable ROCm builds for GenAI, pt 1 (#3910)
- [Fix] [fbgemm_gpu] Fix CUDA publish version for PyPI (#3988)
- [Fix] Fix runner for ROCm CI (#3986)
- [Fix] fix lint issue (#3980)
- [Fix] Fix split_embeddings_utils in CMake (#3974)
- [Fix] [fbgemm_gpu] Disable grouped GEMM tests in OSS (#3971)
- [Fix] Guard CUDA API call against older driver versions (#3970)
- [Fix] [fbgemm_gpu] Fix coalesce ops build in OSS (#3964)
- [Fix] [fbgemm_gpu] Install missing pyyaml package in PIP install tests (#3957)
- [Improvement] [fbgemm_gpu] Update release workflows (#3952)
- [Improvement] [fbgemm_gpu] Reduce OSS build sizes for non-GenAI FBGEMM_GPU (#3948)
- [New] [fbgemm_gpu] Add Scripts for Generating Release Reports (#3676)
- [Fix] [fbgemm_gpu] Fix ROCm test reliability (#4385)
- [Fix] [fbgemm_gpu] ROCm fixes for CI (#4345)
Documentation
- [Fix] Fix benchmark helper text (#4375)
- [Fix] [fbgemm_gpu] Fix broken docs link (#4252)
- [Fix] [fbgemm_gpu] Fix releases docs (#4188)
- [Improvement] Update ReadMe. (#4126)
- [Improvement] Update fbgemm gen_ai README (#3997)
- [Improvement] [fbgemm_gpu] Update release version in docs (#3959)
Other
Utilities
- [Improvement] Decouple embedding_ssd_{}_pt2_autograd from CUDA files (#4389)
- [Fix] Back out "Add make directory to filestore abstraction" (#4386)
- [Improvement] Add CudaEvents Barrier before MemCpy V33 (#4348)
- [Improvement] kv embedding inference cache wrapper (#4343)
- [New] Implement a stat library for fbgemm embedding (#4339)
- [Improvement] Add make directory to filestore abstraction (#4346)
- [Improvement] Add manifold wrapper (#4291)
- [Improvement] Prevent duplicate operator registrations (#4327)
- [Fix] Replace references to folly::hash::fnv(32|64) with the _BROKEN alias (#4332)
- [Fix] [CK] Update CK revision to include fix for F16 atomics (#4306)
- [Fix] Back out "Migrate TBE UVM cache kernels to
FBGEMM_LAUNCH_KERNEL
" (#4315) - [Improvement] Include template filename in the DSA file descriptor (#4287)
- [Improvement] Add feature evict for dram_kv_embedding_cache. (#4187)
- [Fix] Fix Signed-Unsigned Comparison in Tensor Utils (#4279)
- [Improvement] CK Version Update (#4235)
- [Improvement] Jemalloc Mempool and Adaptation for CPU HASHTABLE (#4154)
- [Fix] Revert D73927918 (#4190)
- [Improvement] [fbgemm_gpu] Install conda using Miniforge (#4178)
- [Fix] Silence autodeps warnings (#4163)
- [New] Add checks for dimensions of pooled_embs (#4159)
- [Fix] [fbgemm_gpu] Change Conda update channel (#4144)
- [Improvement] Migrate `embedding_bounds_check` to `FBGEMM_LAUNCH_KERNEL` (#4104)
- [Fix] Re-enable `c10_retrieve_device_side_assertion_info` in kernel launch check (#4109)
- [Fix] Remove call to c10_retrieve_device_side_assertion_info for now (#4098)
- [Fix] Deprecate `get_device_properties` for `at::cuda::getDeviceProperties` (#4096)
- [Improvement] update hipify_torch submodule for version 2 (#4028) (#4093)
- [Improvement] Update bounds_check_indices version selection logic (#4086)
- [Improvement] Simplify weight row cache load and evict routines, v2 (#4087)
- [New] Compute optimizer state pointer in table row (#4078)
- [Improvement] Refactor bounds_check_indices (#4049)
- [Improvement] Clean up `WeightRow` in preparation for optimizer state offloading (#4021)
- [New] Enable NaN checks on tensor arguments to kernel launches (#4029)
- [Fix] Use CUDAStream instead of cudaStream_t in kernel_launcher (#4071)
- [Fix] Remove cudaStreamGetDevice() call (#4070)
- [Improvement] use fmaf to do multiply and add (#4051)
- [Fix] Fix shared memory check for HIP (#4044)
- [Fix] Back out "Migrate make_pta_acc_format() away from old macros, v2]" (#4039)
- [Improvement] Move ComputeDevice to split_table_batched_embeddings_ops_common (#4034)
- [Improvement] Optimize if-statements with if-constexpr (#4022)
- [Fix] Fixes and enhancements to FBGEMM_LAUNCH_KERNEL (#4015)
- [Improvement] Expand the EEG CLI utility to support indices generation (#4017)
- [Improvement] Remove old code associated with TensorAccessor creation (#4010)
- [Improvement] Simplify `WeightRowAccessor` based on its usage (#3975)
- [New] Add kernel to initialize shared memory for testing (#4000)
- [Improvement] `PackedTensorAccessor` cleanup (#3987)
- [Improvement] `TensorAccessor` cleanup (#3973)
- [New] Incorporate `__TEMPLATE_SOURCE_FILE__` into kernel launcher facilities (#3969)
- [New] Incorporate Torch DSA (#3950)
- [New] Better kernel launch utilities, pt 2 (#3947)
- [New] Better kernel launch utilities (#3914)
Miscellaneous
- [New] Add HSTU in fbgemm_gpu/experimental/ (#4090)
- [Fix] Fix inputs in EmbeddingSpMDM8BitTest (#4200)
- [Fix] Fix weights pointer not moving in idx==-1 case (#4199)
- [Improvement] Support ROCm 6.4 builds, pt 2 (#4151)
- [Fix] Fix cpuinfo not being initialized before checking for ARM SVE2 (#4121)
- [Fix] Fix illegal memory access when weights are partially empty in input combine cuda (#4101) (#4111)
- [Fix] Add `+fp16fml` to `-march` for aarch64 (#4091)
- [New] Allow merge_pooled_embedding take in device without index (#4061)
- [Improvement] Migrate make_pta_acc_format() away from old macros, v3] (#4048)
- [Fix] fix the type hack in dramKV wrapper (#4012)