update to core_v0.9

Commit 4b097dee authored Oct 29, 2024 by liangjing
20 changed files
--- a/docs/source/api-guide/index.rst
+++ b/docs/source/api-guide/index.rst
+API Guide
+=========
+.. toctree::
+   :maxdepth: 4
+   models
+   tensor_parallel
+   context_parallel
+   pipeline_parallel
+   fusions
+   transformer
+   moe
+   dist_checkpointing
+   dist_optimizer
+   distributed
+   datasets
+   num_microbatches_calculator
+   optimizer_param_scheduler
--- a/docs/source/api-guide/models.bert.rst
+++ b/docs/source/api-guide/models.bert.rst
+models.bert package
+===================
+Useful package for training BERT and BERT-like encoder-only models. It optionally comes with a binary head that can be used for classification tasks.
+Submodules
+----------
+models.bert.bert\_model module
+------------------------------
+.. automodule:: core.models.bert.bert_model
+   :members:
+   :undoc-members:
+   :show-inheritance:
+Module contents
+---------------
+.. automodule:: core.models.bert
+   :members:
+   :undoc-members:
+   :show-inheritance:
--- a/docs/source/api-guide/models.gpt.rst
+++ b/docs/source/api-guide/models.gpt.rst
+models.gpt package
+==================
+This is an implementation of the popular GPT model. It supports features such as model parallelism (tensor parallel, pipeline parallel, data parallel), mixture of experts, FP8, and the distributed optimizer. New features are added regularly; raise an issue if you would like something added.
+Submodules
+----------
+models.gpt.gpt\_model module
+----------------------------
+.. automodule:: core.models.gpt.gpt_model
+   :members:
+   :undoc-members:
+   :show-inheritance:
+Module contents
+---------------
+.. automodule:: core.models.gpt
+   :members:
+   :undoc-members:
+   :show-inheritance:
--- a/docs/source/api-guide/models.rst
+++ b/docs/source/api-guide/models.rst
+models package
+==============
+This package contains implementations of several popular LLMs. GPT, BERT, T5, and Retro are currently supported, and the list is growing.
+Subpackages
+-----------
+.. toctree::
+   :maxdepth: 4
+   models.gpt
+   models.t5
+   models.bert
+Module contents
+---------------
+.. automodule:: core.models
+   :members:
+   :undoc-members:
+   :show-inheritance:
--- a/docs/source/api-guide/models.t5.rst
+++ b/docs/source/api-guide/models.t5.rst
+models.t5 package
+=================
+Submodules
+----------
+models.t5.t5\_model module
+--------------------------
+.. automodule:: core.models.T5.t5_model
+   :members:
+   :undoc-members:
+   :show-inheritance:
+Module contents
+---------------
+.. automodule:: core.models.T5
+   :members:
+   :undoc-members:
+   :show-inheritance:
--- a/docs/source/api-guide/moe.rst
+++ b/docs/source/api-guide/moe.rst
+Mixture of Experts package
+==========================
+.. mdinclude:: ../../../megatron/core/transformer/moe/README.md
--- a/docs/source/api-guide/num_microbatches_calculator.rst
+++ b/docs/source/api-guide/num_microbatches_calculator.rst
+Microbatches Calculator
+=======================
+This API is used to calculate the number of microbatches required to fit a given model on a given batch size.
+Module contents
+---------------
+.. automodule:: core.num_microbatches_calculator
+   :members:
+   :undoc-members:
+   :show-inheritance:
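As a rough illustration of the quantity this module deals with, the sketch below derives a microbatch count from a global batch size, a per-rank micro-batch size, and a data-parallel size. The function name `num_microbatches` and its parameters are hypothetical and are not taken from `core.num_microbatches_calculator`; the sketch assumes a constant global batch size that divides evenly.

```python
def num_microbatches(global_batch_size: int,
                     micro_batch_size: int,
                     data_parallel_size: int) -> int:
    """Hypothetical helper: microbatches each data-parallel rank runs per step."""
    # One pass over all data-parallel ranks consumes this many samples.
    samples_per_pass = micro_batch_size * data_parallel_size
    if global_batch_size % samples_per_pass != 0:
        raise ValueError(
            "global_batch_size must be divisible by "
            "micro_batch_size * data_parallel_size"
        )
    return global_batch_size // samples_per_pass
```

For example, a global batch of 512 samples with a micro-batch size of 4 on 8 data-parallel ranks gives `num_microbatches(512, 4, 8) == 16`.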
--- a/docs/source/api-guide/optimizer_param_scheduler.rst
+++ b/docs/source/api-guide/optimizer_param_scheduler.rst
+Optimizer Parameters Scheduler
+==============================
+This API is used to calculate the learning rate and weight decay for the optimizer.
+Module contents
+---------------
+.. automodule:: core.optimizer_param_scheduler
+   :members:
+   :undoc-members:
+   :show-inheritance:
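To give a feel for what a learning-rate schedule computes, here is a minimal sketch of one common shape: linear warmup followed by cosine decay. The function `lr_at_step` and all of its parameters are hypothetical; this is not the interface of `core.optimizer_param_scheduler`, and weight-decay scheduling is left out for brevity.

```python
import math


def lr_at_step(step: int, max_lr: float, min_lr: float,
               warmup_steps: int, decay_steps: int) -> float:
    """Hypothetical schedule: linear warmup to max_lr, then cosine decay to min_lr."""
    # Linear warmup from 0 up to max_lr.
    if warmup_steps > 0 and step < warmup_steps:
        return max_lr * step / warmup_steps
    # Past the decay horizon the rate stays at its floor.
    if step >= decay_steps:
        return min_lr
    # Cosine decay between the end of warmup and decay_steps.
    progress = (step - warmup_steps) / (decay_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (max_lr - min_lr) * cosine
```

With `max_lr=1e-3`, `min_lr=1e-5`, `warmup_steps=100`, and `decay_steps=1000`, the rate starts at 0, peaks at step 100, and settles at `min_lr` from step 1000 onward.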
--- a/docs/source/api-guide/pipeline_parallel.rst
+++ b/docs/source/api-guide/pipeline_parallel.rst
+pipeline\_parallel package
+==========================
+This package contains implementations for two different pipeline parallelism
+schedules (one without interleaving and one with interleaving, see `Efficient
+Large-Scale Language Model Training on GPU Clusters Using Megatron-LM <https://arxiv.org/abs/2104.04473>`_
+for details), and a default no-pipelining schedule. It also contains methods
+for the point-to-point communication that is needed between pipeline stages.
+Submodules
+----------
+pipeline\_parallel.p2p\_communication module
+--------------------------------------------
+Contains implementations for the various point-to-point communication needed
+(e.g., `recv_forward` and `recv_backward`) in the different pipeline parallelism
+schedules.
+.. automodule:: core.pipeline_parallel.p2p_communication
+   :members:
+   :undoc-members:
+   :show-inheritance:
+pipeline\_parallel.schedules module
+-----------------------------------
+Contains implementations for two pipeline parallelism schedules
+(`forward_backward_pipelining_with_interleaving` for pipeline parallelism with
+interleaving, `forward_backward_pipelining_without_interleaving` for pipeline
+parallelism without interleaving) and a default no-pipelining schedule
+(`forward_backward_no_pipelining`). `get_forward_backward_func` returns the right
+scheduling function to use based on the configuration being trained
+(e.g., if pipeline-parallel size is 1, use `forward_backward_no_pipelining`).
+.. automodule:: core.pipeline_parallel.schedules
+   :members:
+   :undoc-members:
+   :show-inheritance:
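The selection described above can be sketched as a small dispatch function. This is a stand-alone illustration, not the body of `get_forward_backward_func`: the helper `pick_schedule_name` and its parameters are hypothetical, and the rule that a set virtual pipeline size selects the interleaved schedule is an assumption of this sketch. It returns the schedule's name rather than the function itself so that it runs without the library installed.

```python
from typing import Optional


def pick_schedule_name(pipeline_parallel_size: int,
                       virtual_pipeline_parallel_size: Optional[int] = None) -> str:
    """Hypothetical dispatcher mirroring the three schedules named above."""
    # A single pipeline stage needs no pipelining at all.
    if pipeline_parallel_size <= 1:
        return "forward_backward_no_pipelining"
    # Assumption: a virtual pipeline size means interleaving is requested.
    if virtual_pipeline_parallel_size is not None:
        return "forward_backward_pipelining_with_interleaving"
    return "forward_backward_pipelining_without_interleaving"
```

For instance, `pick_schedule_name(1)` yields the no-pipelining schedule, while `pick_schedule_name(4, 2)` yields the interleaved one.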
+Module contents
+---------------
+.. automodule:: core.pipeline_parallel
+   :members:
+   :undoc-members:
+   :show-inheritance:
--- a/docs/source/api-guide/tensor_parallel.rst
+++ b/docs/source/api-guide/tensor_parallel.rst
+tensor\_parallel package
+========================
+This package contains an implementation for tensor parallelism in transformer
+models (see `Megatron-LM: Training Multi-Billion Parameter Language Models
+Using Model Parallelism <https://arxiv.org/abs/1909.08053>`_ and `Reducing
+Activation Recomputation in Large Transformer Models <https://arxiv.org/abs/2205.05198>`_
+for details).
+Submodules
+----------
+tensor\_parallel.cross\_entropy module
+--------------------------------------
+.. automodule:: core.tensor_parallel.cross_entropy
+   :members:
+   :undoc-members:
+   :show-inheritance:
+tensor\_parallel.data module
+----------------------------
+.. automodule:: core.tensor_parallel.data
+   :members:
+   :undoc-members:
+   :show-inheritance:
+tensor\_parallel.layers module
+------------------------------
+.. automodule:: core.tensor_parallel.layers
+   :members:
+   :undoc-members:
+   :show-inheritance:
+tensor\_parallel.mappings module
+--------------------------------
+.. automodule:: core.tensor_parallel.mappings
+   :members:
+   :undoc-members:
+   :show-inheritance:
+tensor\_parallel.random module
+------------------------------
+.. automodule:: core.tensor_parallel.random
+   :members:
+   :undoc-members:
+   :show-inheritance:
+tensor\_parallel.utils module
+-----------------------------
+.. automodule:: core.tensor_parallel.utils
+   :members:
+   :undoc-members:
+   :show-inheritance:
+Module contents
+---------------
+.. automodule:: core.tensor_parallel
+   :members:
+   :undoc-members:
+   :show-inheritance:
--- a/docs/source/api-guide/transformer.rst
+++ b/docs/source/api-guide/transformer.rst
+transformer package
+===================
+The `transformer` package provides a customizable and configurable
+implementation of the transformer model architecture. Each component
+of a transformer stack, from entire layers down to individual linear
+layers, can be customized by swapping in different PyTorch modules
+using the "spec" parameters (see `here
+<https://docs.nvidia.com/nemo-framework/user-guide/latest/nemotoolkit/nlp/nemo_megatron/mcore_customization.html>`_). The
+configuration of the transformer (hidden size, number of layers,
+number of attention heads, etc.) is provided via a `TransformerConfig`
+object.
+Submodules
+----------
+transformer.attention module
+----------------------------
+This is the entire attention portion, either self or cross attention,
+of a transformer layer including the query, key, and value
+projections, a "core" attention calculation (e.g. dot product
+attention), and final output linear projection.
+.. automodule:: core.transformer.attention
+   :members:
+   :undoc-members:
+   :show-inheritance:
+transformer.dot\_product\_attention module
+------------------------------------------
+This is a PyTorch-only implementation of dot product attention. A more
+efficient implementation, such as those provided by FlashAttention or
+cuDNN's FusedAttention, is typically used when training speed is
+important.
+.. automodule:: core.transformer.dot_product_attention
+   :members:
+   :undoc-members:
+   :show-inheritance:
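For readers who want the arithmetic spelled out, the following is a minimal pure-Python sketch of single-head scaled dot product attention, softmax(QK^T / sqrt(d)) V. It is not the code in `core.transformer.dot_product_attention`: the function names are hypothetical, and masking, dropout, batching, and multiple heads are omitted.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def dot_product_attention(queries, keys, values):
    """Hypothetical single-head attention over lists of row vectors."""
    scale = 1.0 / math.sqrt(len(keys[0]))  # 1 / sqrt(d)
    output = []
    for q in queries:
        # Scaled similarity of this query against every key.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        weights = softmax(scores)
        # Weighted sum of the value vectors, one output dimension at a time.
        output.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return output
```

A query that scores every key equally receives the plain average of the value vectors, which is a quick sanity check for any attention implementation.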
+transformer.enums module
+------------------------
+.. automodule:: core.transformer.enums
+   :members:
+   :undoc-members:
+   :show-inheritance:
+transformer.identity\_op module
+-------------------------------
+This provides a pass-through module that can be used in specs to
+indicate that the operation should not be performed. For example, when
+LayerNorm is fused with the subsequent linear layer, an IdentityOp can be
+passed in as the LayerNorm module to use.
+.. automodule:: core.transformer.identity_op
+   :members:
+   :undoc-members:
+   :show-inheritance:
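The pass-through idea can be shown without PyTorch. In the sketch below, `PassThrough`, `mean_center`, and `run_layer` are hypothetical stand-ins written for this example; they are not classes or functions from `core.transformer.identity_op`. The point is only that a layer with a pluggable normalization slot can be told to skip that step by handing it an object that returns its input unchanged.

```python
class PassThrough:
    """Plain-Python stand-in for a pass-through module (hypothetical)."""

    def __call__(self, x, *args, **kwargs):
        # Extra arguments are accepted and ignored; the input is returned as-is.
        return x


def mean_center(values):
    """Toy normalization: subtract the mean from every element."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def run_layer(values, norm=PassThrough()):
    """Toy layer: apply the pluggable norm, then a fixed scaling by 2."""
    normalized = norm(values)
    return [2.0 * v for v in normalized]
```

Calling `run_layer([1.0, 3.0])` skips normalization, while `run_layer([1.0, 3.0], norm=mean_center)` applies it first; the layer code itself is the same in both cases.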
+transformer.mlp module
+----------------------
+This is the entire MLP portion of the transformer layer with an input
+projection, non-linearity, and output projection.
+.. automodule:: core.transformer.mlp
+   :members:
+   :undoc-members:
+   :show-inheritance:
+transformer.module module
+-------------------------
+This provides a common base class for all modules used in the
+transformer that contains some common functionality.
+.. automodule:: core.transformer.module
+   :members:
+   :undoc-members:
+   :show-inheritance:
+transformer.transformer\_block module
+-------------------------------------
+A block, or stack, of several transformer layers. The layers can all
+be the same or each can be unique.
+.. automodule:: core.transformer.transformer_block
+   :members:
+   :undoc-members:
+   :show-inheritance:
+transformer.transformer\_config module
+--------------------------------------
+This contains all of the configuration options for the
+transformer. Using a dataclass reduces code bloat by keeping all
+arguments together in a dataclass instead of passing several arguments
+through multiple layers of function calls.
+.. automodule:: core.transformer.transformer_config
+   :members:
+   :undoc-members:
+   :show-inheritance:
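The benefit of a single configuration object can be sketched with a small dataclass. `TinyTransformerConfig` below is a hypothetical, three-field stand-in and not the real `TransformerConfig`; the helper functions are likewise made up for this example. Nested functions receive one `config` argument instead of a growing list of individual parameters.

```python
from dataclasses import dataclass


@dataclass
class TinyTransformerConfig:
    """Hypothetical stand-in holding a few transformer hyperparameters."""
    hidden_size: int
    num_layers: int
    num_attention_heads: int

    def __post_init__(self):
        # Validate once, at construction, rather than in every consumer.
        if self.hidden_size % self.num_attention_heads != 0:
            raise ValueError("hidden_size must be divisible by num_attention_heads")

    @property
    def head_dim(self) -> int:
        return self.hidden_size // self.num_attention_heads


def describe_layer(config: TinyTransformerConfig, layer_index: int) -> str:
    # Only the config is passed down, not each hyperparameter separately.
    return f"layer {layer_index}: {config.num_attention_heads} heads x {config.head_dim} dims"


def describe_model(config: TinyTransformerConfig) -> list:
    return [describe_layer(config, i) for i in range(config.num_layers)]
```

Adding a new option in this style means adding one field to the dataclass; the signatures of `describe_model` and `describe_layer` stay unchanged.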
+transformer.transformer\_layer module
+-------------------------------------
+A single standard transformer layer including attention and MLP blocks.
+.. automodule:: core.transformer.transformer_layer
+   :members:
+   :undoc-members:
+   :show-inheritance:
+transformer.utils module
+------------------------
+Various utilities used in the transformer implementation.
+.. automodule:: core.transformer.utils
+   :members:
+   :undoc-members:
+   :show-inheritance:
+Module contents
+---------------
+.. automodule:: core.transformer
+   :members:
+   :undoc-members:
+   :show-inheritance:
--- a/docs/source/images/context_parallel/CP_overview.png
+++ b/docs/source/images/context_parallel/CP_overview.png
--- a/docs/source/images/context_parallel/CP_results.png
+++ b/docs/source/images/context_parallel/CP_results.png
--- a/docs/source/images/distrib_optimizer/data_flow.png
+++ b/docs/source/images/distrib_optimizer/data_flow.png
--- a/docs/source/images/distrib_optimizer/sharding_scheme.png
+++ b/docs/source/images/distrib_optimizer/sharding_scheme.png
--- a/docs/source/images/moe/token_drop.png
+++ b/docs/source/images/moe/token_drop.png
--- a/docs/source/index.rst
+++ b/docs/source/index.rst
+.. Lumache documentation master file, created by
+   sphinx-quickstart on Tue Aug 15 13:44:10 2023.
+   You can adapt this file completely to your liking, but it should at least
+   contain the root `toctree` directive.
+Megatron Core User Guide
+===================================
+**Megatron Core** is a Python library that has the core components required to build your language models. 
+A reference implementation of Megatron Core can be found in `NeMo <https://github.com/NVIDIA/NeMo/tree/main>`_. It offers a *simple* and
+*intuitive* API.
+.. toctree::
+   :maxdepth: 2
+   :caption: User Guide
+   user-guide/index
+.. toctree::
+   :maxdepth: 3
+   :caption: API Guide
+   api-guide/index
--- a/docs/source/user-guide/index.rst
+++ b/docs/source/user-guide/index.rst
+USER GUIDE 
+==========
+.. mdinclude:: ../../../megatron/core/QuickStart.md
\ No newline at end of file
--- a/examples/detxoify_lm/README.md
+++ b/examples/detxoify_lm/README.md
--- a/examples/detxoify_lm/annotations/filter-selfgeneration.py
+++ b/examples/detxoify_lm/annotations/filter-selfgeneration.py