github NVIDIA/Megatron-LM core_v0.14.0
NVIDIA Megatron Core 0.14.0

  • Features
    • Inference
      • Add async support for DynamicInferenceEngine (MR !3187)
      • Pad input tensors and enable FP8 weights for FP8 inference (MR !3341)
      • Force inference to always gather logits with tensor parallelism (MR !3442)
      • Multi batch size CUDA Graphs for Dynamic Inference (MR !3402)
    • Post-training
      • ModelOpt updates (MR !3268)
        • Add speculative decoding AR validation feature
        • Add DeepSeek and Qwen model configs
    • Performance
      • ModelCommProcessGroup integration (MR !3391)
      • Add HyperCommGrid: N-Dimensional Communication Grid for Model Parallelism (MR !3398)
        • Flexible creation and management of communication groups
      • Add support for Spike No More embedding initializations and weight decay skipping (MR !3500)
    • MoE
      • We are actively optimizing large-scale, fine-grained MoE performance on the Blackwell platform.
      • Features:
      • Memory Optimization
        • Support recomputation for FP8 layernorm/moe_act/shared_experts (MR !3465)
        • Support optimizer offloading for DSV3 FP8 training (MR !3659)
      • Performance Optimization
      • Bug fixes:
        • Fix router input jitter dtype (MR !3774)
    • Model support
    • Ease of use
      • Add uv support for source installs (MR !3615)
      • Automated weekly prereleases (MR !3574)
  • Bug fixes
    • Use mscale_all_dim for softmax_factor (MR !2800)
    • Fix FP8 param blockwise scaling unit test (MR !3480)
    • Fix unit test blockwise scaling (MR !3491)
    • Optimize prefill for token-less requests (MR !3499)
    • Add default values for Fp8Padding and Fp8Unpadding (MR !3501)
    • Fix CUDA graph logic for flexible pp layout (MR !3505)
    • Load FP8 models with strict=False (MR !3508)
    • Skip rope check for torch < 1.4.0 (MR !3528)
    • Disable Apex tests for stability (MR !3539)
    • Fix typo in parallel_state expert parallelism (MR !3548)
    • Guard modelopt on macOS (MR !3549)
    • Retry on CUDA function failure (MR !3554)
    • Fix NCCL mem pool creation error (MR !3557)
    • Fix get_rotary_seq_len return type (MR !3559)
    • Retry on CUDA function failure (MR !3560)
    • Fix NCCL allocator attribute error (MR !3565)
    • Ensure multi-prompt inference works (MR !3568)
    • Fix MD5 on FIPS systems (MR !3577)
    • Fix dynamic context and inference bugs (MR !3582)
    • Fix TE version for interleaved fused RoPE (MR !3586)
    • Fix MTP with MoE and TP logging (MR !3594)
    • Guard TE import fix (MR !3596)
    • Add assertion for NCCL UB case (MR !3599)
    • Remove Encoder PP related Functions (MR !3604)
    • Fix segfaults in tests (MR !3605)
    • Fix TE error in distributed optimizer (MR !3625)
    • Remove redundant barrier in checkpoint flow (MR !3626)
    • Support VPP MTP, fix logging (MR !3630)
    • Retry mechanism for free(): invalid pointer errors (MR !3632)
    • Fix test_replication.py issues (MR !3633)
    • Fix typo in parallel_state (MR !3634)
    • Fix CUDA graph logic determination (MR !3635)
    • Fix TE installation error (MR !3636)
    • Ensure correct sharding type in local tests (MR !3643)
    • Fix cudagraphed backward buffer reuse for last layer (MR !3645)
    • Set default for packed_seq_params in get_rotary_seq_len (MR !3651)
    • Fix dynamic example script errors (MR !3653)
    • Guard TE import fix (MR !3666)
  • Breaking changes:
    • megatron.core.distributed.custom_fsdp has been refactored into megatron.core.distributed.fsdp.src.megatron_fsdp; code that imports from the old module path needs to be updated.
  • Known issues
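
For the custom_fsdp breaking change listed above, code that has to run against both older releases and 0.14.0 could probe the two module paths in order. The following is only a minimal sketch: the two module paths come from the release note, while the helper `import_first_available` and the constant `FSDP_MODULE_CANDIDATES` are hypothetical names introduced here for illustration. It also assumes nothing about which classes or functions each module exports; those may differ between the old and new locations and should be checked separately.

```python
import importlib
from types import ModuleType
from typing import Optional, Sequence

# Module paths as written in the release note: new location first, old one as fallback.
FSDP_MODULE_CANDIDATES = (
    "megatron.core.distributed.fsdp.src.megatron_fsdp",  # path named in the 0.14.0 notes
    "megatron.core.distributed.custom_fsdp",             # path used before the refactor
)


def import_first_available(candidates: Sequence[str]) -> Optional[ModuleType]:
    """Return the first module in `candidates` that imports cleanly, else None.

    Hypothetical helper, not part of Megatron Core.
    """
    for name in candidates:
        try:
            return importlib.import_module(name)
        except ImportError:
            # Covers ModuleNotFoundError too; try the next candidate path.
            continue
    return None


fsdp_module = import_first_available(FSDP_MODULE_CANDIDATES)
if fsdp_module is None:
    print("Megatron Core FSDP module not found under either path")
else:
    print(f"Using FSDP implementation from {fsdp_module.__name__}")
```

Trying the new path first means an environment that has already been upgraded never touches the removed module, and the fallback only applies to installations that predate the refactor.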
