This is the implementation of the popular GPT model. It supports several features such as model parallelism (Tensor Parallel, Pipeline Parallel, Data Parallel), mixture of experts, FP8, the distributed optimizer, and more. We are constantly adding new features, so keep an eye out or raise an issue if you would like something added.
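As a minimal sketch of how the model can be constructed with Megatron Core (loosely following the Megatron Core quickstart; exact import paths and constructor arguments may differ across releases):

```python
# Minimal sketch: assumes torch.distributed and Megatron's model-parallel state
# are already initialized (e.g. via parallel_state.initialize_model_parallel()).
import torch
from megatron.core.transformer.transformer_config import TransformerConfig
from megatron.core.models.gpt.gpt_model import GPTModel
from megatron.core.models.gpt.gpt_layer_specs import get_gpt_layer_local_spec

config = TransformerConfig(
    num_layers=2,
    hidden_size=128,
    num_attention_heads=4,
    use_cpu_initialization=True,
    pipeline_dtype=torch.float32,
)

gpt_model = GPTModel(
    config=config,
    transformer_layer_spec=get_gpt_layer_local_spec(),
    vocab_size=1024,
    max_sequence_length=512,
)
```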
This package contains most of the popular LLMs. Currently we support GPT, BERT, T5, and Retro. This is an ever-growing list, so keep an eye out.
Multi-Latent Attention ("MLA") is an attention mechanism introduced by the DeepSeek team that improves the efficiency of attention computation by compressing keys and values into low-rank latent representations. This approach is particularly beneficial for large language models (LLMs), as it reduces the computational and memory burden associated with traditional attention mechanisms. According to the DeepSeek-V2 technical report, MLA achieves better performance than Multi-Head Attention (MHA) while requiring a smaller KV cache.
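To make the idea concrete, below is a simplified, illustrative PyTorch sketch of the latent KV compression behind MLA. The module and parameter names (e.g. `kv_down_proj`) are invented for illustration; this is not Megatron's actual implementation, and it omits details such as query compression and the decoupled RoPE keys.

```python
import torch
import torch.nn as nn

class LatentKVCompression(nn.Module):
    """Toy illustration: compress hidden states into a small latent that is
    cached, then expand it back into per-head keys/values when needed."""
    def __init__(self, hidden_size=1024, num_heads=8, head_dim=128, kv_latent_dim=64):
        super().__init__()
        self.num_heads, self.head_dim = num_heads, head_dim
        # Down-projection to the latent space: only this output needs to be cached.
        self.kv_down_proj = nn.Linear(hidden_size, kv_latent_dim, bias=False)
        # Up-projections reconstruct full keys/values from the latent.
        self.k_up_proj = nn.Linear(kv_latent_dim, num_heads * head_dim, bias=False)
        self.v_up_proj = nn.Linear(kv_latent_dim, num_heads * head_dim, bias=False)

    def forward(self, hidden_states):                 # [batch, seq, hidden]
        latent = self.kv_down_proj(hidden_states)     # [batch, seq, kv_latent_dim] -> KV cache
        b, s, _ = latent.shape
        k = self.k_up_proj(latent).view(b, s, self.num_heads, self.head_dim)
        v = self.v_up_proj(latent).view(b, s, self.num_heads, self.head_dim)
        return latent, k, v

module = LatentKVCompression()
latent, k, v = module(torch.randn(2, 16, 1024))
print(latent.shape, k.shape, v.shape)  # the cached latent is much smaller than k and v
```

Caching the latent (size `kv_latent_dim` per token) instead of full per-head keys and values is what shrinks the KV cache.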
Enabling Multi-Latent Attention
-------------------------------
To enable MLA in Megatron-LM, set the following flag on the command line:
- `--multi-latent-attention` to enable MLA in the model.
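For example, a hypothetical launch sketch in Python (only `--multi-latent-attention` is the flag documented above; the script name and remaining arguments are illustrative placeholders):

```python
# Assemble an illustrative launch command; in practice the usual model, data,
# and optimizer arguments for pretrain_gpt.py must be supplied as well.
launch_cmd = [
    "torchrun", "--nproc_per_node=8", "pretrain_gpt.py",
    "--multi-latent-attention",   # enable Multi-Latent Attention
    # ... other required training arguments go here ...
]
print(" ".join(launch_cmd))
```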
Multi-Token Prediction (MTP) extends the prediction scope to multiple future tokens at each position. On the one hand, an MTP objective densifies the training signals and may improve
data efficiency. On the other hand, MTP may enable the model to pre-plan its representations for better prediction of future tokens. In this implementation of MTP, we sequentially predict additional tokens and keep the complete causal chain at each prediction depth. The following figure illustrates our implementation of MTP in [DeepSeek-V3](https://github.com/deepseek-ai/DeepSeek-V3/).
The k-th MTP module consists of a shared embedding layer, a projection matrix, a Transformer block, and a shared output head. For the i-th input token at the (k - 1)-th prediction depth, we first combine the representation of the i-th token and the embedding of the (i + k)-th token with a linear projection. The combined representation serves as the input to the Transformer block at the k-th depth to produce the output representation.
For more information, please refer to [DeepSeek-V3 Technical Report](https://github.com/deepseek-ai/DeepSeek-V3/blob/main/DeepSeek_V3.pdf)
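The following is a simplified, illustrative PyTorch sketch of the combine step at one MTP depth. The module and argument names are invented for illustration and this is not Megatron's actual implementation; DeepSeek-V3 normalizes with RMSNorm, approximated here with LayerNorm.

```python
import torch
import torch.nn as nn

class ToyMTPDepth(nn.Module):
    """Toy sketch of one MTP depth: combine the (k-1)-th depth representation of
    token i with the shared embedding of token i+k, project, then run a
    Transformer block while keeping the complete causal chain."""
    def __init__(self, hidden_size=256, num_heads=4):
        super().__init__()
        self.norm_hidden = nn.LayerNorm(hidden_size)   # RMSNorm in DeepSeek-V3
        self.norm_embed = nn.LayerNorm(hidden_size)
        self.proj = nn.Linear(2 * hidden_size, hidden_size, bias=False)  # projection matrix
        self.block = nn.TransformerEncoderLayer(
            hidden_size, num_heads, batch_first=True, norm_first=True)

    def forward(self, prev_hidden, shifted_embeddings, causal_mask):
        # prev_hidden:        output of depth k-1 for tokens i       [b, s, h]
        # shifted_embeddings: shared embeddings of tokens i+k        [b, s, h]
        combined = torch.cat(
            [self.norm_hidden(prev_hidden), self.norm_embed(shifted_embeddings)], dim=-1)
        return self.block(self.proj(combined), src_mask=causal_mask)

b, s, h = 2, 8, 256
mask = nn.Transformer.generate_square_subsequent_mask(s)
out = ToyMTPDepth()(torch.randn(b, s, h), torch.randn(b, s, h), mask)
print(out.shape)  # torch.Size([2, 8, 256])
```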
## Related Arguments
We can train GPTModel-like models with Multi-Token Prediction (MTP) by setting `mtp_num_layers` to a positive integer.
| Argument | Description |
| --- | --- |
| `mtp_num_layers` | Number of Multi-Token Prediction (MTP) layers. MTP extends the prediction scope to multiple future tokens at each position. This MTP implementation uses D sequential modules to predict D additional tokens. Default is None. |
| `mtp_loss_scaling_factor` | Scaling factor of the Multi-Token Prediction (MTP) loss. We compute the average of the MTP losses across all depths and multiply it by this scaling factor to obtain the overall MTP loss, which serves as an additional training objective. Default is 0.1. |
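As a toy numeric illustration of the loss combination described above (not the training code; the per-depth loss values are made up):

```python
# Average the per-depth MTP losses, then scale by mtp_loss_scaling_factor.
mtp_losses_per_depth = [2.31, 2.40, 2.52]   # e.g. D = 3 prediction depths
mtp_loss_scaling_factor = 0.1

overall_mtp_loss = mtp_loss_scaling_factor * sum(mtp_losses_per_depth) / len(mtp_losses_per_depth)
print(overall_mtp_loss)  # 0.1 * 2.41 ≈ 0.241, added to the main LM objective
```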
## Precautions
Please do not use Context Parallel (CP), arbitrary `AttnMaskType`, or the learned absolute position embedding type together with MTP. These use cases are not yet supported.
*This is an experimental feature and may be changed.*
`--pipeline-model-parallel-layout` is a flexible API for defining the pipeline-parallel partitioning, which is essential for balancing the partitioning of an imbalanced model. For example, to partition DeepSeek-V3 (61 decoder layers + 1 MTP layer) with PP16 and VPP2, we can include the arguments as follows:
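The layout string below is one arrangement that satisfies these counts (treat the exact string as illustrative): it places the embedding plus three decoder layers on the first virtual stage, two decoder layers on each of the next 29 stages, and gives the MTP layer and the loss their own stages, for 16 × 2 = 32 virtual stages in total.

```
--pipeline-model-parallel-layout "Et*3|(tt|)*29,m|L"
```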
In the layout string, stages are split by '|'. Replicated stages or layers can be described with multiplication. Commas can be used cosmetically. Symbol choices:
* `E` = embedding layer
* `t` = transformer decoder layer
* `m` = MTP layer
* `L` = loss calculation layer
Note that it is legal to have empty stages, e.g., `E||t|L` (the second stage is empty).
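To make the string semantics concrete, here is a small, illustrative parser. It is not Megatron's actual implementation; it handles per-layer multiplication, cosmetic commas, and empty stages, but not parenthesized stage-group replication such as `(tt|)*29` in the example above.

```python
import re

def parse_layout(layout: str):
    """Parse a simplified pipeline layout string into a list of stages,
    where each stage is a list of layer symbols ('E', 't', 'm', 'L')."""
    stages = []
    for stage_str in layout.split('|'):              # stages are split by '|'
        layers = []
        # commas are purely cosmetic, so treat them like whitespace
        for token in re.split(r'[,\s]+', stage_str):
            if not token:
                continue
            match = re.fullmatch(r'([EtmL]+)(?:\*(\d+))?', token)
            if match is None:
                raise ValueError(f"unrecognized token: {token!r}")
            symbols, count = match.group(1), int(match.group(2) or 1)
            layers.extend(list(symbols) * count)      # multiplication replicates layers
        stages.append(layers)                         # empty stages are allowed
    return stages

print(parse_layout("E|t*2,t|m|L"))  # [['E'], ['t', 't', 't'], ['m'], ['L']]
print(parse_layout("E||t|L"))       # [['E'], [], ['t'], ['L']] -- second stage is empty
```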