dist\_checkpointing.strategies package
======================================
Package defining different checkpoint formats (backends) and saving/loading algorithms (strategies).
Strategies can be used to implement new checkpoint formats or new ways of saving/loading existing formats that are better suited to a given use case.
Strategies are passed to `dist_checkpointing.load` and `dist_checkpointing.save` functions and control the actual saving/loading procedure.
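A minimal usage sketch of passing a strategy to the save/load entry points is shown below. The ``sharded_strategy`` keyword and the strategy class named in the comments are assumptions that may differ between Megatron-Core versions.

.. code-block:: python

    # Hedged sketch: save/load a sharded state dict, optionally overriding the
    # default backend with an explicit strategy object (names are illustrative).
    from megatron.core import dist_checkpointing

    sharded_state_dict = model.sharded_state_dict()   # assumed to be provided by the model
    dist_checkpointing.save(sharded_state_dict, "/path/to/checkpoint")

    # Loading with an explicit strategy (class name and keyword are assumptions):
    # from megatron.core.dist_checkpointing.strategies.tensorstore import TensorStoreLoadShardedStrategy
    # state_dict = dist_checkpointing.load(
    #     sharded_state_dict, "/path/to/checkpoint",
    #     sharded_strategy=TensorStoreLoadShardedStrategy(),
    # )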
Submodules
----------
dist\_checkpointing.strategies.base module
------------------------------------------
.. automodule:: core.dist_checkpointing.strategies.base
:members:
:undoc-members:
:show-inheritance:
dist\_checkpointing.strategies.tensorstore module
-------------------------------------------------
.. automodule:: core.dist_checkpointing.strategies.tensorstore
:members:
:undoc-members:
:show-inheritance:
dist\_checkpointing.strategies.two\_stage module
------------------------------------------------
.. automodule:: core.dist_checkpointing.strategies.two_stage
:members:
:undoc-members:
:show-inheritance:
dist\_checkpointing.strategies.zarr module
------------------------------------------
.. automodule:: core.dist_checkpointing.strategies.zarr
:members:
:undoc-members:
:show-inheritance:
Module contents
---------------
.. automodule:: core.dist_checkpointing.strategies
:members:
:undoc-members:
:show-inheritance:
# Distributed Optimizer
The motivation for the distributed optimizer is to save memory by distributing the optimizer state evenly across data parallel ranks (https://arxiv.org/abs/1910.02054), versus the naive method of replicating the optimizer state across data parallel ranks.
Theoretical memory savings vary depending on the combination of the datatype of the model's parameters (`param_dtype`) and main gradients accumulated across data-parallel replicas (`grad_dtype`). We always use `fp32` main parameters for optimizer steps. In the current implementation, the theoretical number of bytes per parameter is (where d is the data parallel size):
| | Non-distributed optim | Distributed optim |
| ------ | ------ | ------ |
| `fp16` parameters, `fp16` gradients | 20 | 4 + 16/d |
| `bf16` parameters, `fp32` gradients | 18 | 6 + 12/d |
| `fp32` parameters, `fp32` gradients | 16 | 8 + 8/d |
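As a quick check of the table, the snippet below reproduces the theoretical bytes-per-parameter numbers for a chosen data parallel size (the helper function is purely illustrative):

```python
# Reproduce the theoretical bytes-per-parameter figures from the table above.
# (Illustrative helper only; d is the data parallel size.)
def bytes_per_param(param_dtype: str, grad_dtype: str, d: int):
    table = {
        ("fp16", "fp16"): (20, 4, 16),
        ("bf16", "fp32"): (18, 6, 12),
        ("fp32", "fp32"): (16, 8, 8),
    }
    non_dist, fixed, sharded = table[(param_dtype, grad_dtype)]
    return non_dist, fixed + sharded / d

print(bytes_per_param("bf16", "fp32", d=8))  # (18, 7.5): 18 vs. 6 + 12/8
```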
Our implementation of the distributed optimizer uses contiguous buffers for parameters and main gradients; model gradients are copied over to the main gradients as soon as they are fully computed.
The figures below illustrate the distributed optimizer's sharding scheme, and the key steps of the distributed optimizer's parameter update:
## Data flow
![Data flow](../images/distrib_optimizer/data_flow.png)
## Sharding scheme
![Sharding scheme](../images/distrib_optimizer/sharding_scheme.png)
## Key steps
_(Note: the illustrations above assume `bf16` model weights, `bf16` model gradients computed by the backward pass, and `fp32` main gradients used for the optimizer step; we always use `fp32` main weights for optimizer steps. A simplified code sketch of this communication pattern follows the list.)_
- Backward pass finishes (gradient buffer holds 16 `fp32` gradient elements).
- Call reduce-scatter on each DP rank.
- Each DP rank now has 4 elements within the gradient buffer that are fully reduced (remaining 12 elements are garbage).
- DP rank 0 has gradient values for elements [0:4].
- DP rank 1 has gradient values for elements [4:8].
- DP rank 2 has gradient values for elements [8:12].
- DP rank 3 has gradient values for elements [12:16].
- Optimizer.step().
- Each DP rank copies its 4 `fp32` main parameter elements into the corresponding `bf16` parameter buffer (each element is cast from fp32 to bf16).
- Call all-gather on each DP rank.
- The parameter buffer now contains all 16, fully updated, `bf16` model parameter elements. Parameters in PyTorch modules already point to the appropriate locations in this parameter buffer, and thus forward passes are ready to run after the all-gather completes.
- At this point, the gradient buffer is also ready to be zero'd for the next iteration.
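The sketch below summarizes this communication pattern using plain `torch.distributed` collectives. It is a simplified illustration under the assumptions listed in the note, not the actual Megatron-LM implementation (the buffer and optimizer handling in the real code is considerably more involved):

```python
import torch
import torch.distributed as dist

def distributed_optimizer_step(grad_buffer, param_buffer, main_params, optimizer, group):
    """Simplified sketch: reduce-scatter grads, update the owned shard, all-gather params."""
    d = dist.get_world_size(group)
    rank = dist.get_rank(group)
    shard = grad_buffer.numel() // d

    # Reduce-scatter: each rank receives the fully reduced grads for its shard.
    grad_shard = torch.empty(shard, dtype=grad_buffer.dtype, device=grad_buffer.device)
    dist.reduce_scatter_tensor(grad_shard, grad_buffer, group=group)

    # Optimizer step on the fp32 main parameters owned by this rank.
    main_params.grad = grad_shard    # main_params: fp32 shard owned by this rank
    optimizer.step()

    # Cast the updated fp32 main params to bf16 and all-gather so every rank
    # has the full, updated parameter buffer for the next forward pass.
    local = main_params.detach().to(param_buffer.dtype)
    dist.all_gather_into_tensor(param_buffer, local, group=group)
```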
distributed package
===================
This package contains various utilities to finalize model weight gradients
on each rank before the optimizer step. This includes a distributed data
parallelism wrapper to all-reduce or reduce-scatter the gradients across
data-parallel replicas, and a `finalize_model_grads` method to
synchronize gradients across different parallelism modes (e.g., 'tied'
layers on different pipeline stages, or gradients for experts in a MoE on
different ranks due to expert parallelism).
Submodules
----------
distributed.distributed\_data\_parallel
---------------------------------------
Model wrapper for distributed data parallelism. Stores gradients in a
contiguous buffer, and supports the option of overlapping communication
(all-reduce or reduce-scatter) with backprop computation by breaking up
the full model's gradients into smaller buckets and running all-reduce /
reduce-scatter on each bucket asynchronously.
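The bucketed-overlap idea in its simplest form is sketched below with plain `torch.distributed`; this is a generic illustration, not the wrapper's actual implementation.

.. code-block:: python

    # Generic sketch of bucketed, asynchronous gradient all-reduce
    # (not the actual DistributedDataParallel implementation).
    import torch.distributed as dist

    def launch_bucketed_allreduce(buckets):
        """Start one async all-reduce per contiguous gradient bucket."""
        return [dist.all_reduce(bucket, async_op=True) for bucket in buckets]

    # Later, before the optimizer step, wait for all communication to finish:
    # for handle in handles:
    #     handle.wait()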
.. automodule:: core.distributed.distributed_data_parallel
:members:
:undoc-members:
:show-inheritance:
distributed.finalize\_model\_grads
----------------------------------
Finalize model gradients for optimizer step across all used parallelism modes.
Synchronizes the all-reduce / reduce-scatter of model gradients across DP replicas,
all-reduces the layernorm gradients for sequence parallelism, embedding gradients
across first and last pipeline stages (if not tied), and expert gradients for expert
parallelism.
.. automodule:: core.distributed.finalize_model_grads
:members:
:undoc-members:
:show-inheritance:
Module contents
---------------
Contains functionality to synchronize gradients across different ranks before
optimizer step.
.. automodule:: core.distributed
:members:
:undoc-members:
:show-inheritance:
encoder-decoder-parallelism package
===================================
Mcore (as of 0.9) supports heterogeneous parallelism for encoder-decoder models.
In particular, the user is now able to specify the amount of tensor and pipeline parallelism in the encoder
and have it be distinct from that in the decoder.
Submodules
----------
Encoder Pipeline Parallelism
----------------------------
Supported in: T5, LLaVa.
The new argument for encoder parallelism is `--encoder-pipeline-model-parallel-size`. This argument is completely distinct
from the usual argument that controls pipelining: `--pipeline-model-parallel-size`, which controls the amount of pipelining in the decoder
in the context of encoder-decoder models.
The total amount of pipelining in an encoder-decoder model is the sum of these two arguments. By default, the amount of
encoder pipelining is 0, and the amount of decoder pipelining is 1, meaning that the encoder & decoder share the single pipeline rank.
If `--pipeline-model-parallel-size` > 1, then the amount of encoder pipeline parallelism has to be specified and has to be greater than 0.
This is because we are not able to share pipeline ranks between the encoder and decoder anymore.
Encoder Tensor Parallelism
--------------------------
Supported in: LLaVa.
Since we expect encoders to be much smaller than decoders, we also give users the ability to set a different amount of tensor
parallelism in the encoder than in the decoder. This is achieved with the argument `--encoder-tensor-model-parallel-size`. To use this option, you must
be using encoder pipeline parallelism (i.e., `--encoder-pipeline-model-parallel-size` > 0).
Unlike encoder pipeline parallelism, which is unrestricted by the amount of decoder pipeline parallelism, encoders may only have
less than or the same amount of tensor parallelism as the decoder. In brief, within p2p_communication.py we
send the activations of one encoder rank to several decoder ranks and, correspondingly, sum the gradients from those
(downstream) decoder ranks back into the encoder rank. We have not yet seen a quantization-related degradation from summing these gradient tensors
together; it could happen in very large models.
Number of GPUs Required
-----------------------
The total number of GPUs required to train a model when these options are enabled is:
dp * etp * epp * cp + dp * tp * pp * cp
where:
dp: amount of data parallelism (this is the same for the encoder & decoder)
[e]tp: amount of tensor parallelism
[e]pp: amount of pipeline parallelism
cp: amount of context parallelism (as with dp, this is the same for the encoder & decoder)
The default value of `--encoder-tensor-model-parallel-size` is 0; in practice, we will use the amount of tensor parallelism in the decoder to construct the encoder.
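A worked example of the formula above, using purely illustrative parallelism settings:

.. code-block:: python

    # Worked example of dp*etp*epp*cp + dp*tp*pp*cp with illustrative values.
    dp, cp = 2, 1        # shared by encoder and decoder
    etp, epp = 1, 1      # encoder tensor / pipeline parallelism
    tp, pp = 4, 2        # decoder tensor / pipeline parallelism

    total_gpus = dp * etp * epp * cp + dp * tp * pp * cp
    print(total_gpus)    # 2*1*1*1 + 2*4*2*1 = 18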
fusions package
===============
This package provides commonly fused
operations. Fusing operations improves compute efficiency by
increasing the amount of work done each time a tensor is read from
memory. To perform the fusion, modules in this package either rely on
PyTorch functionality for just-in-time compilation
(i.e., `torch.jit.script` in older PyTorch versions or `torch.compile`
in recent versions), or call into custom kernels in external libraries
such as Apex or TransformerEngine.
Submodules
----------
fusions.fused\_bias\_dropout module
-----------------------------------
This module uses PyTorch JIT to fuse the bias add and dropout operations. Since dropout is not used during inference, different functions are used when in train mode and when in inference mode.
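The pattern is sketched below in plain PyTorch (an illustration of the technique, not the module's actual code): a scripted train-mode function that applies dropout and an inference-mode variant that skips it.

.. code-block:: python

    # Illustrative bias-add + dropout (+ residual) fusion via TorchScript;
    # not the module's actual implementation.
    import torch

    @torch.jit.script
    def bias_dropout_add_train(x: torch.Tensor, bias: torch.Tensor,
                               residual: torch.Tensor, p: float) -> torch.Tensor:
        return residual + torch.nn.functional.dropout(x + bias, p=p, training=True)

    @torch.jit.script
    def bias_dropout_add_inference(x: torch.Tensor, bias: torch.Tensor,
                                   residual: torch.Tensor, p: float) -> torch.Tensor:
        return residual + (x + bias)   # dropout is skipped at inference time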
.. automodule:: core.fusions.fused_bias_dropout
:members:
:undoc-members:
:show-inheritance:
fusions.fused\_bias\_gelu module
--------------------------------
This module uses PyTorch JIT to fuse the bias add and GeLU nonlinearity operations.
.. automodule:: core.fusions.fused_bias_gelu
:members:
:undoc-members:
:show-inheritance:
fusions.fused\_layer\_norm module
---------------------------------
This module provides a wrapper around various fused LayerNorm implementations in Apex.
.. automodule:: core.fusions.fused_layer_norm
:members:
:undoc-members:
:show-inheritance:
fusions.fused\_softmax module
-----------------------------
This module provides wrappers around variations of Softmax in Apex.
.. automodule:: core.fusions.fused_softmax
:members:
:undoc-members:
:show-inheritance:
fusions.fused\_cross\_entropy\_loss module
------------------------------------------
This module uses PyTorch JIT to fuse the cross entropy loss calculation and batches communication calls.
.. automodule:: core.fusions.fused_cross_entropy
:members:
:undoc-members:
:show-inheritance:
API Guide
=========
.. toctree::
:maxdepth: 4
models
tensor_parallel
context_parallel
pipeline_parallel
fusions
transformer
moe
dist_checkpointing
dist_optimizer
distributed
datasets
num_microbatches_calculator
optimizer_param_scheduler
encoder_decoder_parallelism
models.bert package
===================
Package for training BERT and BERT-like encoder-only models. It optionally comes with a binary head that can be used for classification tasks.
Submodules
----------
models.bert.bert\_model module
------------------------------
.. automodule:: core.models.bert.bert_model
:members:
:undoc-members:
:show-inheritance:
Module contents
---------------
.. automodule:: core.models.bert
:members:
:undoc-members:
:show-inheritance:
models.gpt package
==================
This is the implementation of the popular GPT model. It supports several features such as model parallelism (tensor parallel, pipeline parallel, data parallel), mixture of experts, FP8, the distributed optimizer, etc. We are constantly adding new features, so be on the lookout or raise an issue if you want something added.
Submodules
----------
models.gpt.gpt\_model module
----------------------------
.. automodule:: core.models.gpt.gpt_model
:members:
:undoc-members:
:show-inheritance:
Module contents
---------------
.. automodule:: core.models.gpt
:members:
:undoc-members:
:show-inheritance:
models package
==============
This package contains most of the popular LLMs. Currently we support GPT, BERT, T5, and Retro. This is an ever-growing list, so keep an eye out.
Subpackages
-----------
.. toctree::
:maxdepth: 4
models.gpt
models.t5
models.bert
Module contents
---------------
.. automodule:: core.models
:members:
:undoc-members:
:show-inheritance:
models.t5 package
=================
Submodules
----------
models.t5.t5\_model module
--------------------------
.. automodule:: core.models.T5.t5_model
:members:
:undoc-members:
:show-inheritance:
Module contents
---------------
.. automodule:: core.models.T5
:members:
:undoc-members:
:show-inheritance:
Mixture of Experts package
==========================
.. mdinclude:: ../../../megatron/core/transformer/moe/README.md
Microbatches Calculator
=======================
This API is used to calculate the number of microbatches to use in each training step, given the global batch size, the micro batch size, and the data-parallel size.
Module contents
---------------
.. automodule:: core.num_microbatches_calculator
:members:
:undoc-members:
:show-inheritance:
Optimizer Parameters Scheduler
==============================
This API is used to schedule the learning rate and weight decay for the optimizer over the course of training.
Module contents
---------------
.. automodule:: core.optimizer_param_scheduler
:members:
:undoc-members:
:show-inheritance:
pipeline\_parallel package
==========================
This package contains implementations for two different pipeline parallelism
schedules (one without interleaving and one with interleaving, see `Efficient
Large-Scale Language Model Training on GPU Clusters Using Megatron-LM <https://arxiv.org/abs/2104.04473>`_
for details), and a default no-pipelining schedule. It also contains methods
for the point-to-point communication that is needed between pipeline stages.
Submodules
----------
pipeline\_parallel.p2p\_communication module
--------------------------------------------
Contains implementations for the various point-to-point communication needed
(e.g., `recv_forward` and `recv_backward`) in the different pipeline parallelism
schedules.
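A stripped-down illustration of the underlying pattern is shown below using plain blocking `torch.distributed` calls; the real helpers additionally handle shapes/dtypes, batching, and communication overlap.

.. code-block:: python

    # Simplified sketch of pipeline point-to-point communication
    # (not the actual p2p_communication implementation).
    import torch
    import torch.distributed as dist

    def recv_forward(shape, dtype, prev_rank):
        """Receive activations from the previous pipeline stage."""
        buf = torch.empty(shape, dtype=dtype, device="cuda")
        dist.recv(buf, src=prev_rank)
        return buf

    def send_forward(tensor, next_rank):
        """Send activations to the next pipeline stage."""
        dist.send(tensor, dst=next_rank)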
.. automodule:: core.pipeline_parallel.p2p_communication
:members:
:undoc-members:
:show-inheritance:
pipeline\_parallel.schedules module
-----------------------------------
Contains implementations for two pipeline parallelism schedules
(`forward_backward_pipelining_with_interleaving` for pipeline parallelism with
interleaving, `forward_backward_pipelining_without_interleaving` for pipeline
parallelism without interleaving) and a default no-pipelining schedule
(`forward_backward_no_pipelining`). `get_forward_backward_func` returns the right
scheduling function to use based on the configuration being trained
(e.g., if pipeline-parallel size is 1, use `forward_backward_no_pipelining`).
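A hedged usage sketch is shown below; the exact keyword arguments accepted by the returned schedule function may differ between Megatron-Core versions.

.. code-block:: python

    # Hedged usage sketch: pick the appropriate schedule and run one
    # forward-backward pass over the microbatches (argument names may vary
    # across versions; forward_step, data_iterator, and model are user-provided).
    from megatron.core.pipeline_parallel import get_forward_backward_func

    forward_backward_func = get_forward_backward_func()
    losses = forward_backward_func(
        forward_step_func=forward_step,
        data_iterator=data_iterator,
        model=model,
        num_microbatches=8,
        seq_length=2048,
        micro_batch_size=4,
        forward_only=False,
    )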
.. automodule:: core.pipeline_parallel.schedules
:members:
:undoc-members:
:show-inheritance:
Module contents
---------------
.. automodule:: core.pipeline_parallel
:members:
:undoc-members:
:show-inheritance:
tensor\_parallel package
========================
This package contains an implementation for tensor parallelism in transformer
models (see `Megatron-LM: Training Multi-Billion Parameter Language Models
Using Model Parallelism <https://arxiv.org/abs/1909.08053>`_ and `Reducing
Activation Recomputation in Large Transformer Models <https://arxiv.org/abs/2205.05198>`_
for details).
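As a minimal, single-process illustration of the core idea (not library code), splitting a linear layer's weight along its output dimension and concatenating the partial results reproduces the unpartitioned computation:

.. code-block:: python

    # Column-parallel linear layer in miniature: two "tensor-parallel ranks"
    # each hold half of the weight's output dimension.
    import torch

    x = torch.randn(4, 16)          # [batch, hidden]
    w = torch.randn(32, 16)         # full weight, [output, hidden]
    w0, w1 = w.chunk(2, dim=0)      # shard along the output dimension

    full = x @ w.t()
    parallel = torch.cat([x @ w0.t(), x @ w1.t()], dim=-1)
    assert torch.allclose(full, parallel, atol=1e-5)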
Submodules
----------
tensor\_parallel.cross\_entropy module
--------------------------------------
.. automodule:: core.tensor_parallel.cross_entropy
:members:
:undoc-members:
:show-inheritance:
tensor\_parallel.data module
----------------------------
.. automodule:: core.tensor_parallel.data
:members:
:undoc-members:
:show-inheritance:
tensor\_parallel.layers module
------------------------------
.. automodule:: core.tensor_parallel.layers
:members:
:undoc-members:
:show-inheritance:
tensor\_parallel.mappings module
--------------------------------
.. automodule:: core.tensor_parallel.mappings
:members:
:undoc-members:
:show-inheritance:
tensor\_parallel.random module
------------------------------
.. automodule:: core.tensor_parallel.random
:members:
:undoc-members:
:show-inheritance:
tensor\_parallel.utils module
-----------------------------
.. automodule:: core.tensor_parallel.utils
:members:
:undoc-members:
:show-inheritance:
Module contents
---------------
.. automodule:: core.tensor_parallel
:members:
:undoc-members:
:show-inheritance:
transformer package
===================
The `transformer` package provides a customizable and configurable
implementation of the transformer model architecture. Each component
of a transformer stack, from entire layers down to individual linear
layers, can be customized by swapping in different PyTorch modules
using the "spec" parameters (see `here
<https://docs.nvidia.com/nemo-framework/user-guide/latest/nemotoolkit/nlp/nemo_megatron/mcore_customization.html>`_). The
configuration of the transformer (hidden size, number of layers,
number of attention heads, etc.) is provided via a `TransformerConfig`
object.
Submodules
----------
transformer.attention module
----------------------------
This is the entire attention portion, either self or cross attention,
of a transformer layer including the query, key, and value
projections, a "core" attention calculation (e.g. dot product
attention), and final output linear projection.
.. automodule:: core.transformer.attention
:members:
:undoc-members:
:show-inheritance:
transformer.dot\_product\_attention module
------------------------------------------
This is a PyTorch-only implementation of dot product attention. More
efficient implementations, like those provided by FlashAttention or
cuDNN's fused attention, are typically used when training speed is
important.
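For reference, the unfused computation looks roughly like this (an illustration, not the module's code):

.. code-block:: python

    # Reference scaled dot-product attention in plain PyTorch.
    import math
    import torch

    def dot_product_attention(q, k, v, mask=None):
        # q, k, v: [batch, heads, seq, head_dim]
        scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
        if mask is not None:
            scores = scores.masked_fill(mask, float("-inf"))
        return torch.softmax(scores, dim=-1) @ v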
.. automodule:: core.transformer.dot_product_attention
:members:
:undoc-members:
:show-inheritance:
transformer.enums module
------------------------
.. automodule:: core.transformer.enums
:members:
:undoc-members:
:show-inheritance:
transformer.identity\_op module
-------------------------------
This provides a pass-through module that can be used in specs to
indicate that the operation should not be performed. For example, when
LayerNorm is fused into the subsequent linear layer, an IdentityOp can
be passed in as the standalone LayerNorm module.
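Conceptually, such a pass-through module is just the following (illustrative sketch, not the library's exact class):

.. code-block:: python

    # Illustrative pass-through module: returns its input unchanged so a spec
    # slot can effectively be disabled.
    import torch

    class PassThrough(torch.nn.Module):
        def forward(self, x, *args, **kwargs):
            return x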
.. automodule:: core.transformer.identity_op
:members:
:undoc-members:
:show-inheritance:
transformer.mlp module
----------------------
This is the entire MLP portion of the transformer layer with an input
projection, non-linearity, and output projection.
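Schematically, the structure looks like this (an illustration, not the library's implementation):

.. code-block:: python

    # Schematic transformer MLP: up-projection, non-linearity, down-projection.
    import torch

    class SimpleMLP(torch.nn.Module):
        def __init__(self, hidden_size: int, ffn_hidden_size: int):
            super().__init__()
            self.fc1 = torch.nn.Linear(hidden_size, ffn_hidden_size)
            self.fc2 = torch.nn.Linear(ffn_hidden_size, hidden_size)

        def forward(self, x):
            return self.fc2(torch.nn.functional.gelu(self.fc1(x)))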
.. automodule:: core.transformer.mlp
:members:
:undoc-members:
:show-inheritance:
transformer.module module
-------------------------
This provides a common base class for all modules used in the
transformer that contains some common functionality.
.. automodule:: core.transformer.module
:members:
:undoc-members:
:show-inheritance:
transformer.transformer\_block module
-------------------------------------
A block, or stack, of several transformer layers. The layers can all
be the same or each can be unique.
.. automodule:: core.transformer.transformer_block
:members:
:undoc-members:
:show-inheritance:
transformer.transformer\_config module
--------------------------------------
This contains all of the configuration options for the
transformer. Using a dataclass reduces code bloat by keeping all
arguments together instead of passing them individually through
multiple layers of function calls.
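A hedged construction sketch follows; only a few common fields are shown, and required fields and defaults may differ between Megatron-Core versions.

.. code-block:: python

    # Hedged sketch: construct a TransformerConfig with a few common fields.
    from megatron.core.transformer.transformer_config import TransformerConfig

    config = TransformerConfig(
        num_layers=12,
        hidden_size=768,
        num_attention_heads=12,
    )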
.. automodule:: core.transformer.transformer_config
:members:
:undoc-members:
:show-inheritance:
transformer.transformer\_layer module
-------------------------------------
A single standard transformer layer including attention and MLP blocks.
.. automodule:: core.transformer.transformer_layer
:members:
:undoc-members:
:show-inheritance:
transformer.utils module
------------------------
Various utilities used in the transformer implementation.
.. automodule:: core.transformer.utils
:members:
:undoc-members:
:show-inheritance:
Module contents
---------------
.. automodule:: core.transformer
:members:
:undoc-members:
:show-inheritance: