# Changelog

## NVIDIA Megatron Core 0.13.0

* Features
  * Inference
    * Add async support for `DynamicInferenceEngine` ([MR !3187](https://github.com/NVIDIA/Megatron-LM/commit/05079d55a5bfcc7a43f4619e36a40a9e8db3f882))
    * Pad input tensors and enable FP8 weights for FP8 inference ([MR !3341](https://github.com/NVIDIA/Megatron-LM/commit/6a6cd478839d90cf09a837adf8c79cbc844bc920))
    * Force inference to always gather logits with tensor parallelism ([MR !3442](https://github.com/NVIDIA/Megatron-LM/commit/7c9cdcb794089968278c7272e0261a68edf5d369))
    * Multi-batch-size CUDA graphs for dynamic inference ([MR !3402](https://github.com/NVIDIA/Megatron-LM/commit/30aabe5e3133c6d70aa55aaabad4ea8cb04ce63c))
  * Post-training
    * ModelOpt updates ([MR !3268](https://github.com/NVIDIA/Megatron-LM/commit/550ed5243c3a18e39430c15e8918ee63e41d7eaf))
      * Add speculative decoding AR validation feature
      * Add DeepSeek and Qwen model configs
  * Performance
    * `ModelCommProcessGroup` integration ([MR !3391](https://github.com/NVIDIA/Megatron-LM/commit/26adc2dfde53fbc2b063e2fdd1d9ed26578811a6))
    * Add HyperCommGrid, an N-dimensional communication grid for model parallelism ([MR !3398](https://github.com/NVIDIA/Megatron-LM/commit/45400df7da7fa23e3aff86804e5ac254d9a8d3c0)); see the illustrative sketch below
      * Flexible creation and management of communication groups
    * Add support for Spike No More embedding initialization and weight-decay skipping ([MR !3500](https://github.com/NVIDIA/Megatron-LM/commit/ee74aa66a06b24e511270f285db475941ef63bfd))
  * Model support
    * Add MiMo video VLM training example (MR !3543)
    * Add AVLM for MIMO (MR !3624)
  * Ease of use
    * Add uv support for source installs ([MR !3615](https://github.com/NVIDIA/Megatron-LM/commit/164204cd7216e642bdef7299c569d95f02f9a79e))
    * Automated weekly prereleases ([MR !3574](https://github.com/NVIDIA/Megatron-LM/commit/7e59266c70ef34a246438640af690b55c7ecac28))
* Bug fixes
  * Use `mscale_all_dim` for `softmax_factor` ([MR !2800](https://github.com/NVIDIA/Megatron-LM/commit/e96a358f60c82b8ac8d965d91c3cc4ad0230a4e0))
  * Fix FP8 param blockwise scaling unit test ([MR !3480](https://github.com/NVIDIA/Megatron-LM/commit/57082f946a04c3390fcfc43634dc546ec3ded033))
  * Fix unit test for blockwise scaling ([MR !3491](https://github.com/NVIDIA/Megatron-LM/commit/6d95fe63658f967e56a3fda88a9c30a424fcb520))
  * Optimize prefill for token-less requests ([MR !3499](https://github.com/NVIDIA/Megatron-LM/commit/daaa650a9ac4291d4027ca2fdeb4298ce024efd2))
  * Add default values for `Fp8Padding` and `Fp8Unpadding` ([MR !3501](https://github.com/NVIDIA/Megatron-LM/commit/42b2b1d10a9cb699b7e5aa40f6bfba9c2a1348aa))
  * Fix CUDA graph logic for flexible PP layout ([MR !3505](https://github.com/NVIDIA/Megatron-LM/commit/020d85e50ddf0f0282802002acb3662129a519c5))
  * Load FP8 models with `strict=False` ([MR !3508](https://github.com/NVIDIA/Megatron-LM/commit/1ab876ddc4c1893c76f26d775226a8d1dcdfb3d2))
  * Skip RoPE check for torch < 1.4.0 ([MR !3528](https://github.com/NVIDIA/Megatron-LM/commit/d8180ef8ed0bb6f305dcdedf1b27d91304f361a3))
  * Disable Apex tests for stability ([MR !3539](https://github.com/NVIDIA/Megatron-LM/commit/d1256277fe378add0a2cfd7251f5a350b6d126ec))
  * Fix typo in `parallel_state` expert parallelism ([MR !3548](https://github.com/NVIDIA/Megatron-LM/commit/5783ff32af759b8102cf0cb0bb82b30c48b9da26))
  * Guard `modelopt` on macOS ([MR !3549](https://github.com/NVIDIA/Megatron-LM/commit/76144fe1106e4fb0e69aa75b7a6ab66e71e8f37f))
  * Retry on CUDA function failure ([MR !3554](https://github.com/NVIDIA/Megatron-LM/commit/809aab68307a64c1386d68cc78ef70f8f4e12a80))
  * Fix NCCL mem pool creation error ([MR !3557](https://github.com/NVIDIA/Megatron-LM/commit/b61e21153146a563309b5d44cb5d7f7425806072))
  * Fix `get_rotary_seq_len` return type ([MR !3559](https://github.com/NVIDIA/Megatron-LM/commit/1fa6bc83c7aeae95abc8e86ff0aac596985a01c3))
  * Retry on CUDA function failure ([MR !3560](https://github.com/NVIDIA/Megatron-LM/commit/7da88d74865c3f1a59894173246f26e7b3bf91b9))
  * Fix NCCL allocator attribute error ([MR !3565](https://github.com/NVIDIA/Megatron-LM/commit/6b656114795d74c3353cb007c59af49b1752f447))
  * Ensure multi-prompt inference works ([MR !3568](https://github.com/NVIDIA/Megatron-LM/commit/0fae48931000c9c7af06f7dcf037b5b7d96e0cd6))
  * Fix MD5 on FIPS systems ([MR !3577](https://github.com/NVIDIA/Megatron-LM/commit/83ee8c2848a3b1d42b40086a64da11e19f4b191f))
  * Fix dynamic context and inference bugs ([MR !3582](https://github.com/NVIDIA/Megatron-LM/commit/e9c1da60a1ccc85376666d58568ed1d3e5a4f9db))
  * Fix TE version for interleaved fused RoPE ([MR !3586](https://github.com/NVIDIA/Megatron-LM/commit/b72b6cc161f5273b545bca09677382917cf20492))
  * Fix MTP with MoE and TP logging ([MR !3594](https://github.com/NVIDIA/Megatron-LM/commit/9af96623b66693e058f6bfce8d0094dc976792d8))
  * Guard TE import fix ([MR !3596](https://github.com/NVIDIA/Megatron-LM/commit/1bf946b1ec3f11e71459c7c0d06a97edbed96a1a))
  * Add assertion for NCCL UB case ([MR !3599](https://github.com/NVIDIA/Megatron-LM/commit/e11d28592f19c122859be764b7afe7c208d9acc1))
  * Remove encoder-PP-related functions ([MR !3604](https://github.com/NVIDIA/Megatron-LM/commit/9e49aa4446a58cc21c4dc0c5d0806551ad075ca7))
  * Fix segfaults in tests ([MR !3605](https://github.com/NVIDIA/Megatron-LM/commit/f6492fe8164fd5b9ad55007d435ccfc66cb98cc7))
  * Fix TE error in distributed optimizer ([MR !3625](https://github.com/NVIDIA/Megatron-LM/commit/e6c510ff3c1159f8955589b26f7c395bdf0607d9))
  * Remove redundant barrier in checkpoint flow ([MR !3626](https://github.com/NVIDIA/Megatron-LM/commit/26869feb6a3ac7f5616cb7253c37a4244d107d70))
  * Support MTP with VPP, fix logging ([MR !3630](https://github.com/NVIDIA/Megatron-LM/commit/c351a473c7eedac2c43eab0815afb9759f4f8187))
  * Add retry mechanism for `free(): invalid pointer` errors ([MR !3632](https://github.com/NVIDIA/Megatron-LM/commit/ec35b41b2df145a7ccb84afc48d94e0786e094da))
  * Fix `test_replication.py` issues ([MR !3633](https://github.com/NVIDIA/Megatron-LM/commit/f7b50b271b2e0e396069e02551b21aa6fb374b43))
  * Fix typo in `parallel_state` ([MR !3634](https://github.com/NVIDIA/Megatron-LM/commit/3c79a2c330290df58804c33e28e7c197fcc1f0b9))
  * Fix CUDA graph logic determination ([MR !3635](https://github.com/NVIDIA/Megatron-LM/commit/90efa3ef8a3c4f9e0f1db9f67ab9348bfa501387))
  * Fix TE installation error ([MR !3636](https://github.com/NVIDIA/Megatron-LM/commit/7e7322c01c9cb8ec254ecd9042700b22b70fe5c8))
  * Ensure correct sharding type in local tests ([MR !3643](https://github.com/NVIDIA/Megatron-LM/commit/946357f8dd7fdc12424b3a66bc999e6c0a02696c))
  * Fix CUDA-graphed backward buffer reuse for the last layer ([MR !3645](https://github.com/NVIDIA/Megatron-LM/commit/ee61cf450d24760952e8995aab045ab6d55b986e))
  * Set default for `packed_seq_params` in `get_rotary_seq_len` ([MR !3651](https://github.com/NVIDIA/Megatron-LM/commit/510d58c46664f44c556005ac928c5c531e12f761))
  * Fix dynamic example script errors ([MR !3653](https://github.com/NVIDIA/Megatron-LM/commit/72e290bf1f4bbf0c8047bb10a51da6ea6372e163))
  * Guard TE import fix ([MR !3666](https://github.com/NVIDIA/Megatron-LM/commit/ac198fc0d60a8c748597e01ca4c6887d3a7bcf3d))
* Known issues
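The HyperCommGrid entry above replaces hand-built parallel-state groups with a single N-dimensional layout of ranks. The sketch below illustrates the underlying idea only, using plain `torch.distributed`; the function and variable names are ours, not the actual HyperCommGrid API.

```python
# Minimal sketch of an N-D communication grid: ranks are arranged in a
# dense grid and one process group is created per axis, per "row" along
# that axis. Assumes torch.distributed.init_process_group() already ran.
import itertools

import torch.distributed as dist


def build_axis_groups(dim_sizes):
    """dim_sizes, e.g. {"tp": 2, "dp": 2, "pp": 2}; first name varies fastest."""
    names = list(dim_sizes)
    sizes = [dim_sizes[n] for n in names]

    def to_rank(coord):
        # Map an N-D coordinate to a flat global rank.
        rank, stride = 0, 1
        for c, s in zip(coord, sizes):
            rank += c * stride
            stride *= s
        return rank

    groups = {}
    for axis, name in enumerate(names):
        other = [range(s) for i, s in enumerate(sizes) if i != axis]
        # Every process runs the same deterministic loop, as required for
        # collective group creation with dist.new_group().
        for fixed in itertools.product(*other):
            ranks = []
            for c in range(sizes[axis]):
                coord = list(fixed)
                coord.insert(axis, c)
                ranks.append(to_rank(coord))
            groups[(name, fixed)] = dist.new_group(ranks=ranks)
    return groups
```

With `{"tp": 2, "dp": 2, "pp": 2}` on 8 ranks this yields four 2-rank groups per axis, which is exactly the bookkeeping that a flexible grid abstraction takes over from manual parallel-state code.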
## NVIDIA Megatron Core 0.12.0

* Add FP8 recipe selection to arguments (`--fp8-recipe`, `--first-last-layers-bf16`, `--num-layers-at-start-in-bf16`, `--num-layers-at-end-in-bf16`)
* Context parallel: fix loss scaling when `calculate_per_token_loss=True`
* Make the number of data-parallel communication buckets configurable (`--ddp-num-buckets`, `--ddp-pad-buckets-for-high-nccl-busbw`)
* Inference
  * Support in-flight batching and chunked KV cache
  * Reduce memory usage
    * by not materializing the full attention mask
    * by materializing logits only for the last token during decode
    * by removing an obsolete tensor reference
* Hybrid Model
  * Inference
    * Add CUDA graph support
    * Change tools/run_mamba_text_generation_server.py to use megatron.core.inference
    * Fix a shape issue when materializing logits for the Mamba model
  * Improve initialization of Mamba layers
  * Add configuration switches (`--mamba-state-dim`, `--mamba-head-dim`, `--mamba-num-groups`, `--is-hybrid-model`)
  * Make `num_floating_point_operations` work with hybrid models
  * Make hybrid_conversion.py work with mixers that use TE linear layers
  * Add FP8 support
  * Fix Mamba `dt_bias` tensor parallelism
* Support multimodal tokenizer
* Improve data parallelism scaling
* MoE
  * Features:
    * DeepEP support, compatible with all parallelisms and with token drop / dropless modes
    * Important precision improvement: enable FP32/FP64 routing and unpermutation via `--moe-router-dtype`; FP32 is recommended for all fine-grained MoE training (see the illustrative sketch below)
    * CUDA graph support for MoE
    * Multi-Token Prediction (MTP) support
    * Fused `indices_to_multihot` kernel for the DeepEP dispatcher
  * Bug fixes:
    * Fix hang issue with MoE + dense hybrid models
    * Update theoretical memory and TFLOPS estimation for MoE and MLA
    * Fix MoE aux loss scaling for per-token loss
    * Fixes for group-limited routing and expert bias, verified through DeepSeek-V3 end-to-end runs
  * Known issues:
    * Checkpoints trained with custom FSDP for MoE may not be compatible with 3D parallel training.
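The `--moe-router-dtype` item deserves a word of explanation: with many fine-grained experts, low-precision rounding in the router can flip which experts land in the top-k and skew load balancing, so the gate is run in higher precision. Below is a standalone sketch of that idea, not Megatron's implementation; the function and parameter names are illustrative.

```python
# Sketch: compute routing in fp32 even when activations/weights are bf16.
import torch


def route(hidden, router_weight, top_k=2, router_dtype=torch.float32):
    # hidden: [tokens, hidden_size]; router_weight: [num_experts, hidden_size]
    # Up-cast the gate matmul and softmax: tiny relative errors in low
    # precision can change the top-k expert selection.
    logits = hidden.to(router_dtype) @ router_weight.to(router_dtype).t()
    probs = torch.softmax(logits, dim=-1)
    top_probs, top_experts = torch.topk(probs, k=top_k, dim=-1)
    # Combine weights go back to the activation dtype for the expert MLPs.
    return top_probs.to(hidden.dtype), top_experts
```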
## NVIDIA Megatron Core 0.11.0

* Add multi-datacenter training support through N/S connections
* MoE
  * Features:
    * Support DeepSeek-V3 fine-tuning
      * Aux-loss-free load balancing strategy (see the illustrative sketch below)
      * Node-limited routing and device-limited routing support
      * Tensor parallelism support for MLA and the sequence auxiliary loss
      * MTP (with TP and PP support) is coming soon
    * Permutation / unpermutation fusion kernel from TransformerEngine
    * Uneven virtual pipeline parallel split support in the first and last PP stages
  * Bug fixes:
    * Fix the grad scale when TP != expert-TP and `average_in_collective` is enabled in DDP
    * Fix TEGroupedMLP distributed-checkpoint compatibility issue with FP8 padding/unpadding
  * Known issues:
    * When training a dense + MoE hybrid model, the process will hang if any PP rank has no expert params.
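The aux-loss-free load balancing strategy listed above (introduced with DeepSeek-V3) drops the auxiliary balancing loss in favor of a per-expert bias that affects only expert selection, never the combine weights, and is nudged after each step toward under-loaded experts. A hedged sketch of that mechanism, with illustrative names:

```python
# Sketch of aux-loss-free balancing: the bias steers selection only.
import torch


def select_experts(scores, bias, top_k=2):
    # scores: [tokens, num_experts]; the bias shifts *which* experts win,
    # while the unbiased scores would still serve as combine weights.
    _, top_experts = torch.topk(scores + bias, k=top_k, dim=-1)
    return top_experts


def update_bias(bias, top_experts, num_experts, step=1e-3):
    # Count how many token slots each expert received this step...
    load = torch.bincount(top_experts.flatten(), minlength=num_experts).float()
    # ...then raise the bias of under-loaded experts and lower it for
    # over-loaded ones, so load equalizes without an auxiliary loss term.
    return bias + step * torch.sign(load.mean() - load)
```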
* Add MX-FP16 support for optimizer and master weights
* CUDA graph memory optimizations
* Enable UCC backend for PP communication
* Optimizer CPU offload support for memory savings
* Models
  * Initial RADIO/CRADIO implementation
  * Llama 3.2 support
* Hybrid Model
  * Support quantization via TensorRT Model Optimizer

## NVIDIA Megatron Core 0.10.0

* Add MLA to MCore
* Enable FP8 for GroupedMLP
* MoE parallel folding
* Enhance MoE architecture: support MoE layer frequency patterns and configurable MoE FFN hidden size
* Multimodal: NVLM training and evaluation support in MCore
* Mamba hybrid
* Increase performance and reduce memory footprint of Triton language/compiler distributed caching
* Add more unit tests and fix bugs

## NVIDIA Megatron Core 0.9.0

* Uneven pipeline parallelism
  * Enable pipeline parallelism where the first and last ranks have fewer transformer layers than the intermediate ranks (see the illustrative sketch below)
* Per-layer CUDA graph support for GPT training with Transformer Engine modules
* Enable different TP sizes for the vision encoder
* Enable pipeline parallelism for T5 and LLaVA models
* Support multi-tile, multi-image input in LLaVA models
* MoE
  * FP8 support
  * Runtime upcycling support
  * Dispatcher implementation optimizations
  * Shared expert support with overlapping optimizations
  * Qwen model support
* Known issues
  * When using sequence parallelism, dropout in the transformer block forward pass does not use the appropriate RNG context.
* NVRx / fault tolerance
  * Fault and hang detection in addition to the existing straggler detection
  * Graceful exit and auto restart
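The uneven split in the first 0.9.0 item above is simple arithmetic once the end-stage layer counts are chosen; the helper below is illustrative only (MCore exposes this through configuration, not this function):

```python
# Sketch: distribute transformer layers so the first and last pipeline
# ranks hold fewer layers, compensating for embedding and loss overhead.
def layers_per_rank(num_layers, pp_size, first, last):
    middle = num_layers - first - last
    assert pp_size > 2 and middle % (pp_size - 2) == 0
    per_mid = middle // (pp_size - 2)
    return [first] + [per_mid] * (pp_size - 2) + [last]


# 30 layers over 4 pipeline ranks, trimmed end stages: [6, 9, 9, 6]
assert layers_per_rank(30, 4, 6, 6) == [6, 9, 9, 6]
```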
## NVIDIA Megatron Core 0.8.0

* Multimodal
  * Add initial support for training vision-language models using the LLaVA architecture
  * Add initial support for inference with multimodal inputs
  * Provide an end-to-end multimodal example, from data collection through training to evaluation, in examples/multimodal
* MoE
  * Context parallel support
  * Distributed checkpoint support for grouped GEMM
* Mamba

## NVIDIA Megatron Core 0.7.0

* MoE
  * Token drop support
  * Several efficiency optimizations
  * Improved model parallelism
  * Memory optimizations
* Distributed checkpointing
  * Enabled for Retro
  * Asynchronous checkpoint saving
* Several minor bug fixes, speed improvements, and memory optimizations

## NVIDIA Megatron Core 0.6.0

* MoE (Mixture of Experts)
  * Performance optimization
    * Communication optimization for multi-GPU and single-GPU runs
    * 23% improvement (323 TFLOPS/GPU) over MCore 0.5.0 on Mixtral with Hopper BF16
    * GroupedMLP enhancement for Hopper
    * DP overlapping: support overlapping computation with gradient reduction and parameter gathering
  * All-to-all based token dispatcher
  * Layer-wise logging for the load balancing loss
  * Improved expert parallel support, including the distributed optimizer
* Distributed optimizer
* RETRO
  * Data processing
* BERT
  * Distributed checkpointing
* Dist checkpointing
  * PyTorch native distributed backend
  * Improved saving/loading speed
* TensorRT-LLM export
  * Integration with TensorRT Model Optimizer post-training quantization (PTQ)
  * Text generation driver to perform PTQ in Megatron-LM
  * Llama2 and Nemotron3-8B examples that use the TensorRT-LLM unified build API to build engines after training
* Several minor enhancements, bug fixes, and documentation updates

## NVIDIA Megatron Core 0.5.0

### Key Features and Enhancements

Megatron Core documentation is now [live!](https://docs.nvidia.com/megatron-core/developer-guide/latest/user-guide/index.html#quick-start)

### Model Features

* MoE (Mixture of Experts)
  * Support for Z-loss, load balancing, and Sinkhorn routing
  * Layer and communications refactor
  * Richer parallelism mappings: EP can be combined with other model-parallel techniques for larger MoE variants, e.g. EP + TP + DP + SP + PP
  * Token-dropless architecture with top-K routing
  * Performance optimization with GroupedGEMM when the number of local experts is > 1
  * Distributed checkpointing
* Interleaved rotary embedding

### Datasets

* Masked WordPiece datasets for BERT and T5
* Raw and mock datasets

### Parallelism

### Performance

* Activation offloading to CPU
* RoPE and SwiGLU fusion
* Sliding window attention (via Transformer Engine)

### General Improvements

* Timers

## NVIDIA Megatron Core 0.4.0

### Key Features and Enhancements

#### Models

* BERT
* RETRO
* T5

#### Parallelism

* Mixture of Experts support for GPT
* Model-parallel-efficient Distributed Data Parallel (DDP)
* Context parallel (2D tensor parallel) support

#### Datasets

* GPT dataset
* Blended dataset