GitHub ByteDance-Seed/Triton-distributed v0.0.1-rc

Pre-release · 15 days ago

Compiled with

  • Triton v3.4
  • NVSHMEM v3.3.9

What's Changed

  • feat: support mega kernels (#93) by @XG-zheng
  • feat: support E2E MoE models such as Qwen/Qwen3-235B-A22B (#85) by @houqi @XG-zheng @KnowingNothing @wenlei-bao @preminstrel
  • feat: support GEMM+AllReduce on Hopper
  • feat: support GroupedGEMM+ReduceScatter on L20/Ampere
  • feat: use NVLS ld_reduce with .acc::f32 accumulation by default for BF16/FP16 reductions, for better precision
  • fix: support NVLS multimem.st in a vectorized way
  • fix: resolve a hang with cooperative_launch_grids (closes #81)
  • fix: bugs in AG+GroupedGEMM that could cause unexpected memory accesses
  • opt: reduce AllReduce one-shot latency to 9 us on H800x8 for very small messages (closes #57)
  • opt: improve AllReduce two-shot latency by returning the symmetric buffer directly, saving some device-to-device copy overhead
  • opt: the AllReduce double-tree implementation is much faster, but not yet production-ready; better pipelining is needed
  • trivial: support compiling without the CUDA toolkit and torch
  • Enable rocSHMEM host API usage by @drprajap (#68)
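To illustrate why the `.acc::f32` default matters for BF16/FP16 reductions, here is a minimal NumPy sketch (illustrative only, not Triton-distributed or NVLS code; the function names are hypothetical): summing many half-precision values with a half-precision accumulator compounds rounding error on every add, while accumulating in FP32, as `ld_reduce` with `.acc::f32` does in hardware, stays close to the exact sum.

```python
import numpy as np

def reduce_fp16_acc(values):
    # Naive reduction: the accumulator stays in float16,
    # so the result is rounded after every single add.
    acc = np.float16(0.0)
    for v in values:
        acc = np.float16(acc + v)
    return float(acc)

def reduce_fp32_acc(values):
    # FP32-accumulation style: inputs are float16,
    # but the running sum is kept in float32.
    acc = np.float32(0.0)
    for v in values:
        acc += np.float32(v)
    return float(acc)

rng = np.random.default_rng(0)
vals = rng.standard_normal(4096).astype(np.float16)

# Exact reference sum in float64.
ref = float(np.sum(vals.astype(np.float64)))

err16 = abs(reduce_fp16_acc(vals) - ref)
err32 = abs(reduce_fp32_acc(vals) - ref)
print(err16, err32)  # the fp32-accumulated result is far closer to the reference
```

The same effect is why mixed-precision GEMM kernels accumulate in FP32 even when inputs and outputs are BF16/FP16.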

Known Issue

  • AMD support is not included in the wheels; to try AMD, build from source.

Full Changelog: experimental...v0.0.1-rc
