Figure 1: A transformer layer running with TP2CP2. The communications next to Attention are for CP; the others are for TP. (AG/RS: all-gather in forward and reduce-scatter in backward; RS/AG: reduce-scatter in forward and all-gather in backward; /AG: no-op in forward and all-gather in backward.)
Context Parallelism ("CP") is a parallelization scheme over the sequence-length dimension. Unlike prior SP (sequence parallelism), which only splits the sequences of Dropout and LayerNorm activations, CP partitions the network inputs and all activations along the sequence dimension. With CP, all modules except attention (e.g., Linear, LayerNorm, etc.) work as usual without any changes, because they have no inter-token operations. As for attention, the Q (query) of each token needs to compute with the KV (key and value) of all tokens in the same sequence. Hence, CP requires an additional all-gather across GPUs to collect the full sequence of KV. Correspondingly, a reduce-scatter is applied to the activation gradients of KV in backward propagation. To reduce the activation memory footprint, each GPU only stores the KV of its own sequence chunk in the forward pass and gathers KV again in the backward pass. KV communication happens between a GPU and its counterparts in other TP groups. Under the hood, the all-gather and reduce-scatter are transformed into point-to-point communications in a ring topology. Exchanging KV can also leverage MQA/GQA to reduce communication volume, as they have only one or a few attention heads for KV.
For example, in Figure 1, assuming the sequence length is 8K, each GPU processes 4K tokens. GPU0 and GPU2 form a CP group and exchange KV with each other; the same happens between GPU1 and GPU3. CP is similar to `Ring Attention <https://arxiv.org/abs/2310.01889>`_ but provides better performance by (1) leveraging the latest OSS and cuDNN flash attention kernels and (2) removing unnecessary computation resulting from lower-triangle causal masking, achieving optimal load balance among GPUs.
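As a conceptual sketch of the KV all-gather (illustrative only, not the TransformerEngine implementation; causal masking and the ring-based point-to-point overlap are omitted), each CP rank could gather KV from its peers before attending with its local queries:

.. code-block:: python

    import torch
    import torch.distributed as dist

    def cp_attention(q_chunk, k_chunk, v_chunk, cp_group):
        """q/k/v_chunk: [local_seq, heads, head_dim] held by this CP rank."""
        cp = dist.get_world_size(group=cp_group)
        k_all = [torch.empty_like(k_chunk) for _ in range(cp)]
        v_all = [torch.empty_like(v_chunk) for _ in range(cp)]
        dist.all_gather(k_all, k_chunk, group=cp_group)  # collect full-sequence K
        dist.all_gather(v_all, v_chunk, group=cp_group)  # collect full-sequence V
        k_full, v_full = torch.cat(k_all), torch.cat(v_all)
        # Local queries attend over the full sequence of keys/values.
        scores = torch.einsum('qhd,khd->hqk', q_chunk, k_full) / q_chunk.shape[-1] ** 0.5
        return torch.einsum('hqk,khd->qhd', scores.softmax(dim=-1), v_full)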
Figure 2: Speedup of 175B GPT with various TP+CP combinations vs. full recompute (i.e., TP8CP1).
LLMs run into OOM (out-of-memory) issues with long context (i.e., long sequence length) because the activation memory footprint grows linearly with sequence length. Recomputing activations in the backward pass can avoid OOM but also introduces significant overhead (~30% with full recompute). Enlarging TP (tensor model parallelism) can fix the OOM issue as well, but it potentially makes compute (e.g., Linear) too short to overlap communication latencies. To be clear, scaling out to more GPUs with a bigger TP can hit this overlapping problem whether or not OOM happens.
CP addresses these issues better. With CP, each GPU only computes on a part of the sequence, which reduces both computation and communication by a factor of CP. Therefore, there is no concern about overlapping them. The activation memory footprint per GPU is also CP times smaller, so OOM is no longer an issue. As Figure 2 shows, combinations of TP and CP achieve optimal performance by eliminating recompute overhead and making the best tradeoff between computation and communication.
Enabling context parallelism
----------------------------
CP support has been added to GPT. All models that share the GPT code path, such as Llama, should also benefit from CP. CP can work with TP (tensor model parallelism), PP (pipeline model parallelism), and DP (data parallelism), where the total number of GPUs equals TPxCPxPPxDP. CP also works with different attention variants, including MHA/MQA/GQA and both uni-directional and bi-directional masking.
CP is enabled by simply setting ``context_parallel_size=<CP_SIZE>`` on the command line. The default ``context_parallel_size`` is 1, which means CP is disabled. Running with CP requires Megatron-Core (>=0.5.0) and Transformer Engine (>=1.1).
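For reference, a minimal sketch of enabling CP programmatically with Megatron-Core (assuming one process per GPU and that ``initialize_model_parallel`` exposes a ``context_parallel_size`` argument mirroring the option above):

.. code-block:: python

    import torch
    from megatron.core import parallel_state

    torch.distributed.init_process_group(backend='nccl')  # one process per GPU
    parallel_state.initialize_model_parallel(
        tensor_model_parallel_size=2,  # TP2
        context_parallel_size=2,       # CP2; the default of 1 disables CP
    )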
Package defining different checkpoint formats (backends) and saving/loading algorithms (strategies).
Strategies can be used to implement new checkpoint formats or new (more optimal for a given use case) ways of saving and loading existing formats.
Strategies are passed to `dist_checkpointing.load` and `dist_checkpointing.save` functions and control the actual saving/loading procedure.
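A minimal usage sketch follows; the ``sharded_strategy`` keyword and the ``sharded_state_dict()`` method are assumptions made for illustration, not the definitive API (see the strategies submodules for the actual interfaces):

.. code-block:: python

    from megatron.core import dist_checkpointing

    def save_with_strategy(model, ckpt_dir, save_strategy=None):
        """Save a model's sharded state dict, optionally with a custom save strategy."""
        sharded_sd = model.sharded_state_dict()  # assumed Megatron-Core module method
        # Keyword name 'sharded_strategy' is an assumption about save()'s signature.
        dist_checkpointing.save(sharded_sd, ckpt_dir, sharded_strategy=save_strategy)

    def load_with_strategy(model, ckpt_dir, load_strategy=None):
        """Load a checkpoint back into the model's sharded state dict layout."""
        sharded_sd = model.sharded_state_dict()
        return dist_checkpointing.load(sharded_sd, ckpt_dir, sharded_strategy=load_strategy)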
This package provides modules that implement commonly fused operations. Fusing operations improves compute efficiency by increasing the amount of work done each time a tensor is read from memory. To perform the fusion, modules in this package either rely on PyTorch functionality for just-in-time compilation (i.e., `torch.jit.script` in older PyTorch versions or `torch.compile` in recent versions), or call into custom kernels in external libraries such as Apex or TransformerEngine.
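As an illustrative example of the JIT-fusion pattern (a sketch, not the actual Megatron-Core kernels), scripting a bias add followed by GeLU lets PyTorch fuse the whole expression so the tensor makes fewer trips through memory:

.. code-block:: python

    import torch

    @torch.jit.script
    def bias_gelu(bias: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        x = bias + y
        # tanh approximation of GeLU, kept as one fusible expression
        return 0.5 * x * (1.0 + torch.tanh(0.7978845608 * x * (1.0 + 0.044715 * x * x)))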
Submodules
----------
fusions.fused\_bias\_dropout module
-----------------------------------
This module uses PyTorch JIT to fuse the bias add and dropout operations. Since dropout is not used during inference, different functions are used when in train mode and when in inference mode.
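A hedged sketch of that train/inference split (illustrative only, not the module's actual code):

.. code-block:: python

    import torch

    @torch.jit.script
    def bias_dropout_train(x: torch.Tensor, bias: torch.Tensor, prob: float) -> torch.Tensor:
        return torch.nn.functional.dropout(x + bias, p=prob, training=True)

    def bias_dropout_inference(x: torch.Tensor, bias: torch.Tensor, prob: float) -> torch.Tensor:
        # Dropout is an identity at inference time, so only the bias add remains.
        return x + bias

    def get_bias_dropout(training: bool):
        return bias_dropout_train if training else bias_dropout_inference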
.. automodule:: core.fusions.fused_bias_dropout
:members:
:undoc-members:
:show-inheritance:
fusions.fused\_bias\_gelu module
--------------------------------
This module uses PyTorch JIT to fuse the bias add and GeLU nonlinearity operations.
.. automodule:: core.fusions.fused_bias_gelu
:members:
:undoc-members:
:show-inheritance:
fusions.fused\_layer\_norm module
---------------------------------
This module provides a wrapper around various fused LayerNorm implementations in Apex.
.. automodule:: core.fusions.fused_layer_norm
:members:
:undoc-members:
:show-inheritance:
fusions.fused\_softmax module
-----------------------------
This module provides wrappers around variations of Softmax in Apex.
.. automodule:: core.fusions.fused_softmax
:members:
:undoc-members:
:show-inheritance:
fusions.fused\_cross\_entropy\_loss module
------------------------------------------
This module uses PyTorch JIT to fuse the cross entropy loss calculation and batches communication calls.
This is the implementation of the popular GPT model. It supports several features such as model parallelization (tensor parallel, pipeline parallel, data parallel), mixture of experts, FP8, the distributed optimizer, etc. We are constantly adding new features, so be on the lookout or raise an issue if you want something added.
This package contains most of the popular LLMs. Currently we support GPT, BERT, T5, and Retro. This is an ever-growing list, so keep an eye out.
The motivation for the distributed optimizer is to save memory by distributing the optimizer state evenly across data parallel ranks, versus the current method of replicating the optimizer state across data parallel ranks. As described in https://arxiv.org/abs/1910.02054, this branch specifically implements the following:
- [yes] distribute all 'non-overlapping' optimizer state (i.e., model params already in fp32 are NOT distributed)
- [no] distribute model gradients
- [no] distribute model parameters
Theoretical memory savings vary depending on the combination of the model's param dtype and grad dtype. In the current implementation, the theoretical number of bytes per parameter is (where 'd' is the data parallel size):
| | Non-distributed optim | Distributed optim |
| ------ | ------ | ------ |
| float16 param, float16 grads | 20 | 4 + 16/d |
| float16 param, fp32 grads | 18 | 6 + 12/d |
| fp32 param, fp32 grads | 16 | 8 + 8/d |
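For example, with float16 params, float16 grads, and a data parallel size of d = 8, the distributed optimizer needs 4 + 16/8 = 6 bytes per parameter, versus 20 bytes for the non-distributed optimizer.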
The implementation of the distributed optimizer is centered on using the contiguous grad buffer for communicating grads & params between the model state and the optimizer state. At any given moment, the grad buffer holds one of the following:
1. all model grads
2. a 1/d size _copy_ of the main grads (before copying to the optimizer state)
3. a 1/d size _copy_ of the main params (after copying from the optimizer state)
4. all model params
5. zeros (or None), between iterations
The grad buffer is used for performing the reduce-scatter and all-gather operations that pass grads & params between the model state and the optimizer state. With this implementation, no dynamic buffers are allocated.
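As a rough sketch of how these two collectives map onto `torch.distributed` (illustrative only, not Megatron's actual code; `dp_group` is a hypothetical data-parallel process group handle):

```python
import torch
import torch.distributed as dist

def reduce_scatter_grads(grad_buffer: torch.Tensor, dp_group) -> torch.Tensor:
    """Reduce-scatter the fp16 grad buffer; each DP rank gets its own 1/d shard."""
    d = dist.get_world_size(group=dp_group)
    shard = torch.empty(grad_buffer.numel() // d,
                        dtype=grad_buffer.dtype, device=grad_buffer.device)
    dist.reduce_scatter_tensor(shard, grad_buffer, group=dp_group)
    return shard  # in Megatron, this shard lives inside the same contiguous buffer

def all_gather_params(grad_buffer: torch.Tensor, param_shard: torch.Tensor, dp_group) -> None:
    """All-gather each rank's updated fp16 param shard back into the full buffer."""
    dist.all_gather_into_tensor(grad_buffer, param_shard, group=dp_group)
```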
The figures below illustrate the grad buffer's sharding scheme, and the key steps of the distributed optimizer's param update:
- After the fp16 grads have been reduce-scattered across the DP group, each DP rank has 4 elements within the grad buffer that are fully reduced (the remaining 12 elements are garbage)
- Each DP rank copies its relevant 4 fp16 grad elements from the grad buffer into 4 fp32 main grad elements (separate buffer, owned by the optimizer); i.e.
- DP rank 0 copies elements [0:4]
- DP rank 1 copies elements [4:8]
- DP rank 2 copies elements [8:12]
- DP rank 3 copies elements [12:16]
- Optimizer.step()
- Each DP rank copies its 4 fp32 main (/optimizer) param elements into the corresponding 4 fp16 elements in the grad buffer
- Call all-gather on each DP rank
- Grad buffer now contains all 16, fully updated, fp16 model param elements
- Copy updated model params from grad buffer into their respective param tensors
- (At this point, grad buffer is ready to be zero'd for the next iteration)
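The steps above can be mimicked in a tiny single-process simulation (purely illustrative; real code runs one DP rank per process and uses the collectives sketched earlier):

```python
import torch

d, n = 4, 16                         # 4 DP ranks, 16 param/grad elements
shard = n // d                       # 4 elements owned per rank
grad_buffer = torch.randn(n).half()  # stands in for the fp16 grad buffer after the grad reduce-scatter

for r in range(d):                   # each loop iteration plays the role of one DP rank
    main_grads = grad_buffer[r * shard:(r + 1) * shard].float()  # fp16 -> fp32 main grads (separate copy)
    main_params = torch.zeros_like(main_grads)                   # fp32 main params (optimizer state)
    main_params -= 0.1 * main_grads                              # stand-in for Optimizer.step()
    # Copy updated fp32 main params back into this rank's fp16 slice of the buffer.
    grad_buffer[r * shard:(r + 1) * shard] = main_params.half()

# An all-gather would now make all 16 updated fp16 params visible on every rank.
```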