# Changelog

## NVIDIA Megatron Core 0.12.0

- Add FP8 recipe selection to arguments (--fp8-recipe, --first-last-layers-bf16, --num-layers-at-start-in-bf16, --num-layers-at-end-in-bf16)
- Context parallel: fix loss scaling when calculate_per_token_loss=True
- Make the number of data parallel communication buckets configurable (--ddp-num-buckets, --ddp-pad-buckets-for-high-nccl-busbw)
- Inference
  - Support in-flight batching and chunked KV cache
  - Reduce memory usage:
    - by not materializing the full attention mask
    - by materializing logits only for the last token during decode
    - by removing an obsolete tensor reference
- Hybrid Model
  - Inference
    - Add CUDA graph support
    - Change tools/run_mamba_text_generation_server.py to use megatron.core.inference
    - Fix a shape issue when materializing logits for Mamba model
  - Improve initialization of Mamba layers
  - Add configuration switches (--mamba-state-dim, --mamba-head-dim, --mamba-num-groups, --is-hybrid-model)
  - Make num_floating_point_operations work with hybrid model
  - Make hybrid_conversion.py work with mixer that uses TE linear
  - Add FP8 support
  - Fix Mamba dt_bias tensor parallelism
  - Support multimodal tokenizer
  - Improve data parallelism scaling
- MoE
  - Features:
    - DeepEP support, compatible with all the parallelisms and token drop / dropless
    - Important precision improvement: enable FP32/FP64 routing and unpermutation using --moe-router-dtype; FP32 is recommended for all fine-grained MoE training (a routing-precision sketch follows this list)
    - CUDA Graph support for MoE
    - Multi-Token Prediction (MTP) Support
    - Fused indices_to_multihot kernel for DeepEP dispatcher
  - Bug fixes:
    - Fix hang issue with MoE+Dense hybrid models
    - Update theoretical memory and TFLOPS estimation for MoE and MLA
    - Fix MoE aux loss scaling for per-token loss
    - Fixes for group-limited routing and expert bias, verified through DeepSeek-V3 end-to-end runs
  - Known issues:
    - Checkpoints trained with Custom FSDP for MoE may not be compatible with 3D parallel training.

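As a rough illustration of the `--moe-router-dtype` precision note above, the sketch below upcasts the router computation to FP32 before the softmax and top-k. It is only a sketch of the idea, not Megatron's code path; the shapes and the `router_weight` tensor are made up for the example.

```python
import torch

# Made-up shapes for illustration: 8 tokens, hidden size 16, 4 experts.
hidden = torch.randn(8, 16, dtype=torch.bfloat16)
router_weight = torch.randn(16, 4, dtype=torch.bfloat16)

# Do the routing math in FP32: upcast before the matmul, softmax, and
# top-k so that small probability differences are not lost to BF16
# rounding (the idea behind selecting fp32 as the router dtype).
logits = hidden.float() @ router_weight.float()
probs = torch.softmax(logits, dim=-1)
topk_probs, topk_idx = torch.topk(probs, k=2, dim=-1)

# Downstream expert compute can stay in BF16; only the routing decision
# and the combine weights were computed in FP32.
print(topk_idx)
print(topk_probs.to(torch.bfloat16))
```
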
## NVIDIA Megatron Core 0.11.0

- Add multi-datacenter training support through N/S connection
- MoE
  - Features
    - Support DeepSeek-V3 fine-tuning
      - Aux-loss-free load balancing strategy (a bias-update sketch follows this list)
      - Node-limited routing and device-limited routing support
      - Tensor Parallelism support for MLA and Sequence Auxiliary Loss
      - MTP (with TP and PP support) is coming soon.
    - Permutation / Unpermutation fusion kernel from TransformerEngine.
    - Uneven virtual pipeline parallel split support in first and last PP stage.
  - Bug fixes:
    - Fix the grad scale when TP != expert-TP and average_in_collective is enabled in DDP.
    - Fix TEGroupedMLP distckpt compatibility issue with FP8 padding/unpadding.
  - Known Issues:
    - When training the Dense+MoE hybrid model, the process will hang if any PP rank does not have expert params.
- Add MX-FP16 support for optimizer and master weights
- CUDA Graph memory optimizations
- Enable UCC backend for PP communication
- Optimizer CPU offload support for memory savings
- Models
  - Initial RADIO/CRADIO implementation
  - llama3.2 support
- Hybrid Model
  - Support quantization via TensorRT Model Optimizer

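For the aux-loss-free load balancing strategy listed above, the general idea (as described for DeepSeek-V3) is to select experts with bias-adjusted scores while still combining with the raw scores, and to nudge the per-expert bias toward a balanced load. The sketch below shows a single routing step under that assumption; the tensor names and the simple sign-based update are illustrative, not Megatron's API.

```python
import torch

num_experts, top_k, update_rate = 4, 2, 1e-3
scores = torch.rand(16, num_experts)     # per-token routing scores (16 tokens)
expert_bias = torch.zeros(num_experts)   # persistent bias, not trained by the optimizer

# Select experts with the bias added, but combine with the raw scores,
# so the bias only influences *which* experts are picked.
_, topk_idx = torch.topk(scores + expert_bias, k=top_k, dim=-1)
combine_weights = torch.gather(scores, 1, topk_idx)

# Nudge the bias toward balanced load: raise it for experts that received
# fewer tokens than average, lower it for experts that received more.
load = torch.zeros(num_experts).scatter_add_(
    0, topk_idx.flatten(), torch.ones(topk_idx.numel()))
expert_bias += update_rate * torch.sign(load.mean() - load)

print(load, expert_bias, combine_weights.shape)
```
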
## NVIDIA Megatron Core 0.10.0

- Add MLA to MCore
- Enable FP8 for GroupedMLP
- MoE Parallel Folding
- Enhance MoE Architecture: Support MoE Layer Frequency Patterns and Configurable MoE FFN Hidden Size
- Multimodal: NVLM training and evaluation support in MCore
- Mamba Hybrid
  - Increase performance and reduce memory footprint of Triton language/compiler distributed caching
  - Add more unit testing and fix bugs

## NVIDIA Megatron Core 0.9.0

- Uneven pipeline parallelism
  - Enable pipeline parallelism where the first and last ranks have fewer transformer layers than the intermediate ranks (a layer-split sketch follows this list)
- Per layer CUDAGraph support for GPT training with Transformer Engine modules
- Enable different TP sizes for the vision encoder
- Enable pipeline parallelism for T5 & Llava models
- Support multi-tile multi-image input in Llava models
- MoE
  - FP8 support
  - Runtime upcycling support
  - Dispatcher implementation optimizations
  - Shared expert support with overlapping optimizations
  - Qwen model support
- Known Issues
  - When using sequence parallelism, dropout in the transformer block forward pass does not use the appropriate RNG context.
- NVRx / Fault tolerance
  - Fault and hang detection in addition to existing straggler detection
  - Graceful exit and auto restart

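For the uneven pipeline parallelism entry above, one way to picture the layer assignment is a split in which the embedding-bearing first stage and the loss-bearing last stage carry fewer transformer layers than the intermediate stages. The helper below is hypothetical and only sketches the arithmetic.

```python
def split_layers(num_layers, pp_size, first_stage_layers, last_stage_layers):
    """Hypothetical helper: give the first and last pipeline stages fewer
    layers and divide the remainder evenly across the intermediate stages."""
    middle = num_layers - first_stage_layers - last_stage_layers
    middle_stages = pp_size - 2
    assert middle % middle_stages == 0, "intermediate stages must split evenly"
    per_middle = middle // middle_stages
    return [first_stage_layers] + [per_middle] * middle_stages + [last_stage_layers]


# e.g. 32 layers over 4 pipeline stages, with lighter first/last stages:
print(split_layers(32, 4, 6, 6))  # [6, 10, 10, 6]
```
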
## NVIDIA Megatron Core 0.8.0

- Multimodal
  - Added initial support for training vision language models using the LLaVA architecture
  - Added initial support for inference with multimodal inputs
  - An end-to-end multimodal example, from data collection through training to evaluation, is provided in examples/multimodal
- MoE
  - Context Parallel support.
  - Distributed checkpoint support for grouped GEMM.
- Mamba model support

## NVIDIA Megatron Core 0.7.0

- MoE
  - Token drop support
  - Several efficiency optimizations
  - Improved model parallelism
  - Memory optimizations
- Distributed checkpointing
  - Enabled for RETRO
  - Asynchronous checkpoint saving
- Several minor bug fixes, speed improvements, and memory optimizations

## NVIDIA Megatron Core 0.6.0

- MoE (Mixture of Experts)
  - Performance optimization
    - Communication optimizations for multi-GPU and single-GPU
    - 23% improvement (323 TFLOPS/GPU) over MCore 0.5.0 on Mixtral with Hopper BF16
    - GroupedMLP enhancement for Hopper
    - DP overlapping: support overlapping computation with gradient reduction and parameter gathering
  - All-to-All based token dispatcher (a permutation sketch follows this list)
  - Layer-wise logging for the load balancing loss
  - Improved expert parallelism support, including the distributed optimizer
- Distributed optimizer
- RETRO
  - Data processing
- BERT
  - Distributed checkpointing
- Dist checkpointing
  - PyTorch native distributed backend
  - Improved saving/loading speed
- TensorRT-LLM Export
  - Integration with TensorRT Model Optimizer post-training quantization (PTQ)
  - Text generation driver to perform PTQ in Megatron-LM
  - Llama2 and Nemotron3-8b examples using the TensorRT-LLM unified build API to build the engine after training
- Several minor enhancements, bug fixes, and documentation updates

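For the All-to-All based token dispatcher above, the core preprocessing step is permuting tokens so that all tokens routed to the same expert sit contiguously before the collective exchange. The sketch below shows only that permutation on a single process; the actual `torch.distributed.all_to_all` call and any capacity/drop handling are omitted.

```python
import torch

num_experts = 4
tokens = torch.randn(8, 16)                    # 8 tokens, hidden size 16
expert_idx = torch.randint(num_experts, (8,))  # top-1 expert choice per token

# Sort tokens so that all tokens bound for the same expert are contiguous;
# this is the layout an all-to-all exchange (and a grouped GEMM) expects.
sorted_experts, perm = torch.sort(expert_idx)
permuted_tokens = tokens[perm]
tokens_per_expert = torch.bincount(sorted_experts, minlength=num_experts)

# After expert computation, the inverse permutation restores token order.
restored = torch.empty_like(permuted_tokens)
restored[perm] = permuted_tokens
assert torch.equal(restored, tokens)
print(tokens_per_expert)
```
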
## NVIDIA Megatron Core 0.5.0

### Key Features and Enhancements

Megatron Core documentation is now [live](https://docs.nvidia.com/megatron-core/developer-guide/latest/user-guide/index.html#quick-start)!

### Model Features

- MoE (Mixture of Experts)
  - Support for Z-loss, load balancing, and Sinkhorn routing (a Sinkhorn sketch follows this list)
  - Layer and communications refactor
  - Richer parallelism mappings: EP can be combined with other model-parallel techniques for larger MoE variants, e.g. EP + TP + DP + SP + PP
  - Token dropless architecture with Top-K routing
  - Performance optimization with GroupedGEMM when the number of local experts is > 1
  - Distributed checkpointing
- Interleaved rotary embedding

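For the Sinkhorn routing support listed above, the underlying idea is the Sinkhorn-Knopp iteration: alternately normalize the exponentiated router logits over experts and over tokens so that routing mass spreads evenly across experts. The function below is a short sketch of that iteration, not Megatron's implementation.

```python
import torch

def sinkhorn(logits, n_iters=8):
    """Sketch of the Sinkhorn-Knopp iteration: alternate normalization so
    that each expert ends up with an equal share of routing mass."""
    cost = torch.exp(logits)
    for _ in range(n_iters):
        cost = cost / cost.sum(dim=1, keepdim=True)  # normalize over experts per token
        cost = cost / cost.sum(dim=0, keepdim=True)  # normalize over tokens per expert
    return cost

logits = torch.randn(8, 4)   # 8 tokens, 4 experts
balanced = sinkhorn(logits)
print(balanced.sum(dim=0))   # every expert column sums to 1.0
```
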
### Datasets

- Masked WordPiece datasets for BERT and T5
- Raw and mock datasets

### Parallelism

### Performance

- Activation offloading to CPU
- Rope and Swiglu fusion
- Sliding window attention (via Transformer Engine)

### General Improvements

- Timers

## NVIDIA Megatron Core 0.4.0

### Key Features and Enhancements

#### Models

- BERT
- RETRO
- T5

#### Parallelism

- Mixture of Experts support for GPT
- Model parallel efficient Distributed Data Parallel (DDP)
- Context Parallel (2D Tensor Parallel) support

#### Datasets

- GPT Dataset
- Blended Dataset