"docs/en/advanced_guides/datasets/lyft.md" did not exist on "17ac0691daf640426961e147b985837a0fab1b44"
Commit 523ec9cc authored by wangsen's avatar wangsen
Browse files

all

parents
Pipeline #1668 failed with stages
in 0 seconds
kill 150486
kill 150491
kill 150497
kill 150501
kill 150503
kill 150504
kill 150505
kill 150522
kill 150526
kill 150529
kill 150534
kill 150535
kill 150539
kill 150541
kill 150544
kill 150549
kill 150551
kill 150554
kill 150556
kill 150561
kill 150562
kill 150567
kill 150568
kill 150571
kill 150575
kill 150576
kill 150581
kill 150582
kill 150583
kill 150589
kill 150590
kill 150593
kill 150597
kill 150598
kill 150600
kill 150604
kill 150606
kill 150608
kill 150611
kill 150615
kill 150617
kill 150619
kill 150621
kill 150624
kill 150626
kill 150630
kill 150633
kill 150634
kill 150635
kill 150638
kill 150643
kill 150644
kill 150645
kill 150647
kill 150653
kill 150654
kill 150655
kill 150658
kill 150663
kill 150664
kill 150665
kill 150667
kill 150673
kill 150674
kill 150675
kill 150677
kill 150683
kill 150684
kill 150685
kill 150686
kill 150693
kill 150694
kill 150695
kill 150696
kill 150703
kill 150704
kill 150705
kill 150706
kill 150707
kill 150714
kill 150716
kill 150717
kill 150718
kill 150720
kill 150726
kill 150728
kill 150730
kill 150731
kill 150736
kill 150738
kill 150740
kill 150742
kill 150744
kill 150748
kill 150749
kill 150751
kill 150752
kill 150756
kill 150757
kill 150762
kill 150763
kill 150764
kill 150765
kill 150772
kill 150773
kill 150774
kill 150775
kill 150778
kill 150779
kill 150781
kill 150788
kill 150789
kill 150790
kill 150793
kill 150795
kill 150797
kill 150798
kill 150803
kill 150804
kill 150806
kill 150807
kill 150813
kill 150814
kill 150816
kill 150819
kill 150822
kill 150823
kill 150828
kill 150830
kill 150831
kill 150832
kill 150833
kill 150838
kill 150839
kill 150842
kill 150843
kill 150844
kill 150845
kill 150850
kill 150851
kill 150854
kill 150855
kill 150856
kill 150857
kill 150861
kill 150862
kill 150867
kill 150868
kill 150871
kill 150872
kill 150873
kill 150876
kill 150878
kill 150880
kill 150882
kill 150883
kill 150885
kill 150887
kill 150890
kill 150893
kill 150894
kill 150895
kill 150899
kill 150900
kill 150903
kill 150904
kill 150906
kill 150909
kill 150910
kill 150914
kill 150915
kill 150918
kill 150919
kill 150921
kill 150923
kill 150926
kill 150927
kill 150929
kill 150930
kill 150934
kill 150936
kill 150937
kill 150938
kill 150942
kill 150944
kill 150945
kill 150946
kill 150949
kill 150953
kill 150955
kill 150956
kill 150959
kill 150961
kill 150962
kill 150964
kill 150967
kill 150969
kill 150970
kill 150971
kill 150975
kill 150976
kill 150977
kill 150981
kill 150982
kill 150983
kill 150987
kill 150988
kill 150990
kill 150993
kill 150994
kill 150996
kill 150998
kill 151000
kill 151003
kill 151004
kill 151005
kill 151009
kill 151010
kill 151011
kill 151016
kill 151017
kill 151020
kill 151021
kill 151024
kill 151026
kill 151027
kill 151031
kill 151033
kill 151035
kill 151037
kill 151039
This diff is collapsed.
This source diff could not be displayed because it is too large. You can view the blob instead.
# Llama, Mistral and other Llama-like model support in Megatron-LM
NOTE: Llama-3 and Mistral support in Megatron is currently experimental, and we are still evaluating benchmark results to confirm model conversion, training, and inference correctness.
The [Llama-2](https://ai.meta.com/llama/) and [Llama-3](https://llama.meta.com/) family of models are an open-source set of pretrained & finetuned (for chat) models that have achieved strong results across a wide set of benchmarks. At their times of release, both Llama-2 and Llama-3 models achieved among the best results for open-source models, and were competitive with leading closed-source models (see https://arxiv.org/pdf/2307.09288.pdf and https://ai.meta.com/blog/meta-llama-3/).
Similarly, [Mistral-7b](https://mistral.ai/news/announcing-mistral-7b/) is an open-source model with pretrained and finetuned (for chat) variants that achieve strong benchmark results.
Architecturally, Llama-2, Llama-3, and Mistral-7b are very similar. As such, Megatron can support loading checkpoints from all three for inference and finetuning. Converting and loading the checkpoints is slightly different for each model and is detailed for each below.
# Llama-2
Llama-2 checkpoints can be loaded into Megatron for inference and for finetuning. Loading these checkpoints consists of three steps:
1. Get access to download the checkpoints.
2. Convert the checkpoints from Meta/Huggingface format to Megatron format.
3. Setup arguments for launching the model.
The following sections detail these steps. The final section lists benchmark result comparisons between: 1) Llama-2 inference code running the Meta-format checkpoints, and 2) Megatron inference code running the converted checkpoints.
## Contents
* [Download Meta or Huggingface checkpoints](#download-meta-or-huggingface-checkpoints)
* [Convert checkpoint format](#convert-checkpoint-format)
* [Meta format](#meta-format)
* [Huggingface format](#huggingface-format)
* [Launch model](#launch-model)
* [Megatron](#launch-megatron)
* [Meta](#launch-meta)
* [Huggingface](#launch-hf)
* [Benchmark results](#benchmark-results)
## Download Meta or Huggingface checkpoints
Users must first apply for access to download the Llama-2 checkpoints either directly from [Meta](https://ai.meta.com/resources/models-and-libraries/llama-downloads/) or through [Huggingface](https://huggingface.co/docs/transformers/main/model_doc/llama2) (HF). The checkpoints are available in two formats, Meta's native format (available from both the Meta and HF links), and HF's format (available only from HF). Either format can be converted to Megatron, as detailed next.
## Convert checkpoint format
We recommend passing `--dtype bf16` for training or finetuning. Inference can be done in bfloat16 or float16.
### Meta format
The Meta format checkpoints are converted to HF format as an intermediate step before converting to Megatron format. The `transformers` package is required, and must have version >=4.31.0 (e.g., `pip install transformers>=4.31.0`). (**Note**: we have specifically tested with versions `4.31.0` and `4.32.0`; your experience may vary with newer versions.) Assuming the downloaded checkpoints are in `$CHECKPOINT_DIR` (with separate sub-directories for 7B, 13B, 70B, etc.), the following example command can be used to convert from Llama-2 format to HF format in bfloat16:
```
$>: python tools/checkpoint/convert.py --model-type GPT \
> --loader llama_mistral \
> --saver megatron \
> --checkpoint-type meta \
> --model-size llama2-7B \
> --load-dir $LLAMA_META_FORMAT_DIR \
> --save-dir ${MEGATRON_FORMAT_DIR} \
> --tokenizer-model ${TOKENIZER_MODEL} \
> --target-tensor-parallel-size ${TP} \
> --target-pipeline-parallel-size ${PP} \
> --bf16
```
Valid values for `--model-size` are `llama2-7B`, `llama2-13B`, and `llama2-70B` (for pretrained-only models), and `llama2-7Bf`, `llama2-13Bf`, and `llama2-70Bf` (for chat-finetuned models).
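For concreteness, the variables referenced above might be set as follows before running the command; the paths are placeholders, and `TP=1`/`PP=1` are assumptions for converting the 7B model, not requirements:
```
# Placeholder paths -- point these at your downloaded Meta-format checkpoint and tokenizer.
LLAMA_META_FORMAT_DIR=/data/llama2/llama-2-7b        # contains consolidated.*.pth and params.json
TOKENIZER_MODEL=/data/llama2/tokenizer.model         # downloaded alongside the weights
MEGATRON_FORMAT_DIR=/data/megatron/llama2-7b-tp1-pp1 # output directory for the converted checkpoint
TP=1                                                 # assumed tensor parallel size for the 7B model
PP=1                                                 # assumed pipeline parallel size
```
With these set, the conversion command above can be run as written.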
### Huggingface format
The HF checkpoints can be converted to Megatron format by using Megatron's own Llama-2 checkpoint converter for HF format (see script `tools/checkpoint/loader_llama_mistral.py`). One important argument that must be set correctly is the tensor parallel size (`TP`) for each model. The following table shows these values:
| Model size | Tensor parallel size (`TP`) |
| ---------- | --------------------------- |
| 7B | 1 |
| 13B | 2 |
| 70B | 8 |
Using these values for `TP`, along with the path to the Llama-2 tokenizer model (automatically downloaded with original checkpoint download; see `${TOKENIZER_MODEL}` below), run the following command from the root of your Megatron source code to convert from HF format to Megatron format:
```
$>: python tools/checkpoint/convert.py \
> --model-type GPT \
> --loader llama_mistral \
> --saver megatron \
> --target-tensor-parallel-size ${TP} \
> --checkpoint-type hf \
> --load-dir ${HF_FORMAT_DIR} \
> --save-dir ${MEGATRON_FORMAT_DIR} \
> --tokenizer-model ${TOKENIZER_MODEL}
```
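As a usage sketch with placeholder paths, converting the 13B checkpoint would take `TP=2` from the table above:
```
HF_FORMAT_DIR=/data/hf/Llama-2-13b-hf              # placeholder: local copy of the HF checkpoint
TOKENIZER_MODEL=${HF_FORMAT_DIR}/tokenizer.model   # tokenizer.model ships with the HF download
MEGATRON_FORMAT_DIR=/data/megatron/llama2-13b-tp2  # placeholder output directory
TP=2                                               # from the table above for the 13B model
```
The `convert.py` command above can then be run unchanged.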
After this conversion, we are ready to load the checkpoints into a Megatron GPT model.
## Launch model
### Launch Megatron
If loading for either inference or finetuning, use the following arguments:
```
--tensor-model-parallel-size ${TP} \
--pipeline-model-parallel-size 1 \
--seq-length 4096 \
--max-position-embeddings 4096 \
--tokenizer-type Llama2Tokenizer \
--tokenizer-model ${TOKENIZER_MODEL} \
--load ${CHECKPOINT_DIR} \
--exit-on-missing-checkpoint \
--use-checkpoint-args \
--no-load-optim \
--no-load-rng \
--untie-embeddings-and-output-weights \
--use-rotary-position-embeddings \
--normalization RMSNorm \
--no-position-embedding \
--no-masked-softmax-fusion \
--attention-softmax-in-fp32
```
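As a minimal sketch of one way to use these arguments (not the only way to launch): collect them in a shell variable and append them to your usual entry point, e.g. `pretrain_gpt.py` for finetuning. `CHECKPOINT_DIR`, `TOKENIZER_MODEL`, and `TP` are placeholders, and the data, batch-size, and optimizer flags required for a real run are omitted here.
```
# CHECKPOINT_DIR, TOKENIZER_MODEL and TP must be set as in the sections above.
LLAMA2_ARGS="--tensor-model-parallel-size ${TP} --pipeline-model-parallel-size 1 \
    --seq-length 4096 --max-position-embeddings 4096 \
    --tokenizer-type Llama2Tokenizer --tokenizer-model ${TOKENIZER_MODEL} \
    --load ${CHECKPOINT_DIR} --exit-on-missing-checkpoint --use-checkpoint-args \
    --no-load-optim --no-load-rng --untie-embeddings-and-output-weights \
    --use-rotary-position-embeddings --normalization RMSNorm --no-position-embedding \
    --no-masked-softmax-fusion --attention-softmax-in-fp32"

# Single-node example; add your data/training flags after ${LLAMA2_ARGS}.
torchrun --nproc_per_node ${TP} pretrain_gpt.py ${LLAMA2_ARGS}
```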
### Launch Meta
Meta checkpoints can be launched with: https://github.com/facebookresearch/llama
### Launch Huggingface
Huggingface checkpoints can be launched with: https://github.com/huggingface/transformers/blob/main/src/transformers/models/llama/modeling_llama.py
## Benchmark results
The tables below list the benchmark comparisons between native Llama-2 (using Meta's checkpoint and Meta's inference code) and Megatron (using a converted HF checkpoint and Megatron's inference code).
The values are the percent error between Megatron and Llama-2, calculated using the formula: `|<llama_score> - <megatron_score>| / <llama_score>`, where the type of score is detailed before each table. Across all tests (80 total per model size), the mean error is 0.15%. The small difference in benchmark scores between the two models is due to minor arithmetic differences in implementation that alter the numerics slightly. Some of the factors that influence this difference include:
- Megatron performs batch matrix multiplications in a couple places, such as within self attention and in SwiGLU, that Llama performs separately.
- Megatron uses `torch.baddbmm` within self attention, versus Llama using `torch.matmul`.
- Megatron uses a `sin`/`cos` implementation for rotary position embeddings, versus Llama using a `polar`/`complex` implementation.
- Llama calls `torch.set_default_dtype(torch.float16)` during initialization, which Megatron does not.
### Big Bench
Score type: multiple choice grade.
| bigbench / standard | 7b | 13b | 70b |
| -- | -- | -- | -- |
| date_understanding | 0.29% | 0.13% | 0.12% |
| general_knowledge | 0.00% | 0.00% | 0.00% |
| human_organs_senses | 0.00% | 0.00% | 0.00% |
| intent_recognition | 0.00% | 0.11% | 0.00% |
| riddle_sense | 0.00% | 0.00% | 0.00% |
| similarities_abstraction | 0.00% | 0.58% | 0.00% |
| simple_arithmetic_json_multiple_choice | 0.00% | 0.00% | 0.00% |
| undo_permutation | 0.19% | 0.19% | 0.18% |
### Multilingual
Score type: multiple choice grade.
| multilingual / xcopa | 7b | 13b | 70b |
| -- | -- | -- | -- |
| en-template-mGPT-remove-punctuation | 0.08% | 0.00% | 0.00% |
| et-template-mGPT-remove-punctuation | 0.00% | 0.13% | 0.25% |
| ht-template-mGPT-remove-punctuation | 0.26% | 0.13% | 0.26% |
| id-template-mGPT-remove-punctuation | 0.11% | 0.00% | 0.19% |
| it-template-mGPT-remove-punctuation | 0.00% | 0.10% | 0.09% |
| qu-template-mGPT-remove-punctuation | 0.00% | 0.00% | 0.27% |
| sw-template-mGPT-remove-punctuation | 0.14% | 0.13% | 0.13% |
| th-template-mGPT-remove-punctuation | 0.25% | 0.13% | 0.13% |
| tr-template-mGPT-remove-punctuation | 0.26% | 0.00% | 0.34% |
| vi-template-mGPT-remove-punctuation | 0.00% | 0.11% | 0.00% |
| zh-template-mGPT-remove-punctuation | 0.00% | 0.10% | 0.09% |
### LM Evaluation Harness
Score type: multiple choice grade.
| lm-eval | 7b | 13b | 70b |
| -- | -- | -- | -- |
| boolq | 0.04% | 0.04% | 0.07% |
| hellaswag | 0.02% | 0.03% | 0.03% |
| piqa | 0.00% | 0.00% | 0.07% |
| winogrande | 0.00% | 0.11% | 0.20% |
### MMLU
Score type: multiple choice grade.
Note: the number in brackets is the number of sub-tasks for each supercategory.
| mmlu | 7b | 13b | 70b |
| -- | -- | -- | -- |
| stem [18] | 0.79% | 0.05% | 0.01% |
| humanities [13] | 0.19% | 0.01% | 0.02% |
| other (business, health, misc.) [14] | 0.08% | 0.06% | 0.12% |
| social sciences [12] | 0.37% | 0.21% | 0.01% |
# Llama-3
Llama-3 checkpoints can be loaded into Megatron for inference and for finetuning. Loading these checkpoints consists of several steps:
1. Get access to download the checkpoints (weights and tokenizer).
2. Clone the llama3 loading code from Meta.
3. Install the llama package from source.
4. Convert the checkpoints from Meta/Huggingface format to Megatron format.
5. Setup arguments for launching the model.
The following sections detail these steps.
## Contents
* [Download Meta or Huggingface checkpoints](#download-meta-or-huggingface-checkpoints)
* [Install tiktoken](#install-tiktoken)
* [Install llama package from Meta](#install-llama-package-from-meta)
* [Convert checkpoint format](#convert-checkpoint-format)
* [Meta format](#meta-format)
* [Huggingface format](#huggingface-format)
* [Launch model](#launch-model)
* [Megatron](#launch-megatron)
* [Meta](#launch-meta)
* [Huggingface](#launch-hf)
* [Benchmark results](#benchmark-results)
## Download Meta or Huggingface checkpoints
Users must first apply for access to download the Llama-3 checkpoints either directly from [Meta](https://llama.meta.com/llama-downloads) or through [Huggingface](https://huggingface.co/meta-llama) (HF). The checkpoints are available in two formats, Meta's native format (available from both the Meta and HF links), and HF's format (available only from HF). Either format can be converted to Megatron, as detailed next.
## Install tiktoken
The Llama-3 tokenizer relies on the availability of the `tiktoken` module which can be installed through `pip`.
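For example, it can be installed directly with pip:
```
pip install tiktoken
```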
## Install llama package from Meta
1. In a location outside of the megatron-lm source directory, e.g. `~`: `git clone https://github.com/meta-llama/llama3.git`
2. `cd $LLAMA3_SOURCE_DIR`
3. `pip install -e .`
## Convert checkpoint format
We recommend passing `--dtype bf16` for training or finetuning. Inference can be done in bfloat16 or float16.
### Meta format
The Meta format checkpoints are converted to HF format as an intermediate step before converting to Megatron format. The `transformers` package is required, and must have version >=4.31.0 (e.g., `pip install transformers>=4.31.0`). (**Note**: we have specifically tested with versions `4.31.0` and `4.32.0`; your experience may vary with newer versions.) Assuming the downloaded checkpoints are in `$CHECKPOINT_DIR` (with separate sub-directories for 8B, 70B, etc.), the following example command can be used to convert from Llama-3 format to HF format in bfloat16:
```
$>: python tools/checkpoint/convert.py \
> --model-type GPT \
> --loader llama_mistral \
> --saver mcore \
> --checkpoint-type meta \
> --model-size llama3-8B \
> --load-dir $LLAMA_META_FORMAT_DIR \
> --save-dir ${MEGATRON_FORMAT_DIR} \
> --tokenizer-model ${TOKENIZER_MODEL} \
> --target-tensor-parallel-size ${TP} \
> --target-pipeline-parallel-size ${PP} \
> --bf16
```
Valid values for `--model-size` are `llama3-8B` and `llama3-70B` (for pretrained-only models), and `llama3-8Bf` and `llama3-70Bf` (for chat-finetuned models).
### Huggingface format
The HF checkpoints can be converted to Megatron format by using Megatron's own Llama-3 checkpoint converter for HF format (see script `tools/checkpoint/loader_llama_mistral.py`). One important argument that must be set correctly is the tensor parallel size (`TP`) for each model. The following table shows these values:
| Model size | Tensor parallel size (`TP`) |
| ---------- | --------------------------- |
| 8B | 1 |
| 70B | 8 |
Using these values for `TP`, along with the path to the Llama-3 tokenizer model (automatically downloaded with original checkpoint download; see `${TOKENIZER_MODEL}` below), run the following command from the root of your Megatron source code to convert from HF format to Megatron format:
```
$>: python tools/checkpoint/convert.py \
> --model-type GPT \
> --loader llama_mistral \
> --saver mcore \
> --target-tensor-parallel-size ${TP} \
> --checkpoint-type hf \
> --load-dir ${HF_FORMAT_DIR} \
> --save-dir ${MEGATRON_FORMAT_DIR} \
> --tokenizer-model ${TOKENIZER_MODEL} \
> --model-size llama3-8B
```
Valid values for `--model-size` are `llama3-8B` and `llama3-70B` (for pretrained-only models), and `llama3-8Bf` and `llama3-70Bf` (for chat-finetuned models).
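For example, converting the 70B checkpoint takes `TP=8` from the table above; the paths below are placeholders:
```
HF_FORMAT_DIR=/data/hf/llama3-70b              # placeholder: local copy of the HF checkpoint
TOKENIZER_MODEL=/data/llama3/tokenizer.model   # placeholder: path to the downloaded Llama-3 tokenizer
MEGATRON_FORMAT_DIR=/data/megatron/llama3-70b-tp8
TP=8                                           # from the table above for the 70B model
```
The `convert.py` command above can then be run as written, substituting `--model-size llama3-70B`.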
After this conversion, we are ready to load the checkpoints into a Megatron GPT model.
## Launch model
### Launch Megatron
If loading for either inference or finetuning, use the following arguments:
```
--tensor-model-parallel-size ${TP} \
--pipeline-model-parallel-size 1 \
--seq-length 4096 \
--max-position-embeddings 4096 \
--tokenizer-type Llama3Tokenizer \
--tokenizer-model ${TOKENIZER_MODEL} \
--load ${CHECKPOINT_DIR} \
--exit-on-missing-checkpoint \
--use-checkpoint-args \
--no-load-optim \
--no-load-rng \
--untie-embeddings-and-output-weights \
--normalization RMSNorm \
--position-embedding-type rope \
--no-masked-softmax-fusion \
--attention-softmax-in-fp32
```
### Launch Meta
Meta checkpoints can be launched with: https://github.com/meta-llama/llama3
### Launch Huggingface
Huggingface checkpoints can be launched by following the instructions here: https://huggingface.co/blog/llama3
## Benchmark results
Llama-3 support in Megatron is currently experimental and we are still carrying out benchmark evaluations.
# Mistral-7b
Megatron currently supports loading the v0.3 release of Mistral-7b (which does not use sliding window attention and offers a larger 32768-token vocabulary) for inference and finetuning. Loading these checkpoints consists of several steps:
1. Get access to download the checkpoints (weights and tokenizer).
2. Install the `mistral-common` package.
3. Convert the checkpoints from HuggingFace format to Megatron format.
4. Setup arguments for launching the model.
The following sections detail these steps.
## Contents
* [Download Huggingface checkpoints](#download-huggingface-checkpoints)
* [Install the mistral-common package](#install-the-mistral-common-package)
* [Convert checkpoint format](#convert-checkpoint-format)
* [Launch model](#launch-model)
* [Benchmark results](#benchmark-results)
## Download Huggingface checkpoints
Users must first apply for access to download the Mistral-7b checkpoints through [Huggingface](https://huggingface.co/mistralai/Mistral-7B-v0.3) (HF). Megatron does not currently support the v0.1 or v0.2 checkpoints; ensure you download v0.3. Megatron also does not currently support using the raw weights directly from [Mistral](https://docs.mistral.ai/getting-started/open_weight_models/).
## Install the mistral-common package
`pip install mistral-common`
## Convert checkpoint format
The HF checkpoints can be converted to Megatron format by using Megatron's own Mistral checkpoint converter for HF format (see script `tools/checkpoint/loader_llama_mistral.py`).
Using the path to the Mistral tokenizer model (downloaded alongside the HF checkpoint), run the following command from the root of your Megatron source code to convert from HF format to mcore format:
```
$>: python tools/checkpoint/convert.py \
> --model-type GPT \
> --loader llama_mistral \
> --saver mcore \
> --target-tensor-parallel-size ${TP} \
> --checkpoint-type hf \
> --load-dir ${HF_FORMAT_DIR} \
> --save-dir ${MEGATRON_FORMAT_DIR} \
> --tokenizer-model ${TOKENIZER_MODEL} \
> --model-size mistral-7B
```
Valid values for `--model-size` are `mistral-7B` for the pretrained model and `mistral-7Bf` for the chat-finetuned model.
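As a usage sketch with placeholder paths (the tensor parallel size of 1 below is an assumption for converting the 7B model, not a stated requirement):
```
HF_FORMAT_DIR=/data/hf/Mistral-7B-v0.3        # placeholder: local copy of the HF checkpoint
TOKENIZER_MODEL=/data/mistral/tokenizer.model # placeholder: tokenizer file from the HF download
MEGATRON_FORMAT_DIR=/data/megatron/mistral-7b-tp1
TP=1                                          # assumed tensor parallel size
```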
After this conversion, we are ready to load the checkpoints into an mcore GPT model.
## Launch model
If loading for either inference or finetuning, use the following arguments:
```
--tensor-model-parallel-size ${TP} \
--pipeline-model-parallel-size 1 \
--seq-length 4096 \
--max-position-embeddings 4096 \
--tokenizer-type MistralTokenizer \
--tokenizer-model ${TOKENIZER_MODEL} \
--load ${CHECKPOINT_DIR} \
--exit-on-missing-checkpoint \
--use-checkpoint-args \
--no-load-optim \
--no-load-rng \
--untie-embeddings-and-output-weights \
--normalization RMSNorm \
--position-embedding-type rope \
--no-masked-softmax-fusion \
--attention-softmax-in-fp32
```
## Benchmark results
Mistral-7B support in Megatron is currently experimental and we are still carrying out benchmark evaluations.
# Other Llama-like model support
*Note: Experimental*
Many models, such as Yi-34B, use the Llama architecture and may be converted from HuggingFace to Megatron using the commands in the [Llama-3](#llama-3) section above.
context\_parallel package
=========================
Context parallelism overview
----------------------------
.. figure:: ../images/context_parallel/CP_overview.png
:alt: cp_overview
:align: center
Figure 1: A transformer layer running with TP2CP2. Communications next to Attention are for CP, others are for TP. (AG/RS: all-gather in forward and reduce-scatter in backward, RS/AG: reduce-scatter in forward and all-gather in backward, /AG: no-op in forward and all-gather in backward).
Context Parallelism ("CP") is a parallelization scheme on the dimension of sequence length. Unlike prior SP (sequence parallelism), which only splits the sequence of Dropout and LayerNorm activations, CP partitions the network inputs and all activations along the sequence dimension. With CP, all modules except attention (e.g., Linear, LayerNorm, etc.) can work as usual without any changes, because they do not have inter-token operations. As for attention, the Q (query) of each token needs to compute with the KV (key and value) of all tokens in the same sequence. Hence, CP requires an additional all-gather across GPUs to collect the full sequence of KV. Correspondingly, a reduce-scatter is applied to the activation gradients of KV in backward propagation. To reduce the activation memory footprint, each GPU only stores the KV of a sequence chunk in forward and gathers KV again in backward. KV communication happens between a GPU and its counterparts in other TP groups. Under the hood, the all-gather and reduce-scatter are transformed into point-to-point communications in a ring topology. Exchanging KV can also leverage MQA/GQA to reduce communication volumes, as they have only one or a few attention heads for KV.
For example, in Figure 1, assuming the sequence length is 8K, each GPU processes 4K tokens. GPU0 and GPU2 compose a CP group and exchange KV with each other; the same happens between GPU1 and GPU3. CP is similar to `Ring Attention <https://arxiv.org/abs/2310.01889>`_ but provides better performance by (1) leveraging the latest OSS and cuDNN flash attention kernels and (2) removing the unnecessary computation that results from lower-triangular causal masking, achieving optimal load balance among GPUs.
Context parallelism benefits
----------------------------
.. figure:: ../images/context_parallel/CP_results.png
:alt: cp_results
:align: center
Figure 2: Speedup of 175B GPT with various TP+CP combinations vs. full recompute (i.e., TP8CP1).
LLMs encounter OOM (out of memory) issues with long context (i.e., long sequence length) because of the linearly increasing memory footprint of activations. Recomputing activations in the backward pass can avoid OOM but also introduces significant overhead (~30% with full recompute). Enlarging TP (tensor model parallelism) can fix the OOM issue as well, but it potentially makes compute (e.g., Linear) too short to overlap communication latencies. To be clear, scaling out to more GPUs with a bigger TP can hit this overlapping problem whether or not OOM happens.
CP can better address these issues. With CP, each GPU only computes on a part of the sequence, which reduces both computation and communication by a factor of CP. Therefore, there are no concerns about overlapping them. The activation memory footprint per GPU is also CP times smaller, so OOM is no longer an issue. As Figure 2 shows, combinations of TP and CP can achieve optimal performance by eliminating recompute overhead and making the best tradeoff between computation and communication.
Enabling context parallelism
----------------------------
CP support has been added to GPT. All models that share the GPT code path, such as Llama, should also be able to benefit from CP. CP can work with TP (tensor model parallelism), PP (pipeline model parallelism), and DP (data parallelism), where the total number of GPUs equals TPxCPxPPxDP. CP also works with different attention variants, including MHA/MQA/GQA and uni-directional and bi-directional masking.
CP is enabled by simply setting ``context_parallel_size=<CP_SIZE>`` on the command line. The default ``context_parallel_size`` is 1, which means CP is disabled. Running with CP requires Megatron-Core (>=0.5.0) and Transformer Engine (>=1.1).
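As a minimal sketch (the ``--context-parallel-size`` flag spelling reflects Megatron-LM's standard argument parser; all other arguments are placeholders), running a GPT model with TP=2 and CP=2 on 8 GPUs (hence PP=1 and DP=2) might look like:

.. code-block:: bash

   # TP x CP x PP x DP = 2 x 2 x 1 x 2 = 8 GPUs on a single node.
   torchrun --nproc_per_node 8 pretrain_gpt.py \
       --tensor-model-parallel-size 2 \
       --context-parallel-size 2 \
       --pipeline-model-parallel-size 1
       # ... remaining model/data/training arguments go here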
datasets package
================
.. mdinclude :: ../../../megatron/core/datasets/readme.md
Submodules
----------
datasets.blended\_megatron\_dataset\_config module
---------------------------------------------------
.. automodule:: core.datasets.blended_megatron_dataset_config
:members:
:undoc-members:
:show-inheritance:
datasets.blended\_megatron\_dataset\_builder module
---------------------------------------------------
.. automodule:: core.datasets.blended_megatron_dataset_builder
:members:
:undoc-members:
:show-inheritance:
datasets.megatron\_tokenizer module
-----------------------------------
.. automodule:: core.datasets.megatron_tokenizer
:members:
:undoc-members:
:show-inheritance:
datasets.indexed\_dataset module
--------------------------------
.. automodule:: core.datasets.indexed_dataset
:members:
:undoc-members:
:show-inheritance:
datasets.megatron\_dataset module
---------------------------------
.. automodule:: core.datasets.megatron_dataset
:members:
:undoc-members:
:show-inheritance:
datasets.gpt\_dataset module
----------------------------
.. automodule:: core.datasets.gpt_dataset
:members:
:undoc-members:
:show-inheritance:
datasets.masked\_dataset module
-------------------------------
.. automodule:: core.datasets.masked_dataset
:members:
:undoc-members:
:show-inheritance:
datasets.bert\_dataset module
-----------------------------
.. automodule:: core.datasets.bert_dataset
:members:
:undoc-members:
:show-inheritance:
datasets.t5\_dataset module
---------------------------
.. automodule:: core.datasets.t5_dataset
:members:
:undoc-members:
:show-inheritance:
datasets.blended\_dataset module
----------------------------------
.. automodule:: core.datasets.blended_dataset
:members:
:undoc-members:
:show-inheritance:
datasets.utils module
---------------------
.. automodule:: core.datasets.utils
:members:
:undoc-members:
:show-inheritance:
Module contents
---------------
.. automodule:: core.datasets
:members:
:undoc-members:
:show-inheritance:
dist\_checkpointing package
===========================
A library for saving and loading distributed checkpoints.
A "distributed checkpoint" can have various underlying formats (the current default format is based on Zarr)
but has a distinctive property: a checkpoint saved in one parallel configuration (tensor/pipeline/data parallelism)
can be loaded in a different parallel configuration.
Using the library requires defining sharded state_dict dictionaries with functions from the *mapping* and *optimizer* modules.
Those state dicts can be saved or loaded by the *serialization* module using strategies from the *strategies* module.
Subpackages
-----------
.. toctree::
:maxdepth: 4
dist_checkpointing.strategies
Submodules
----------
dist\_checkpointing.serialization module
----------------------------------------
.. automodule:: core.dist_checkpointing.serialization
:members:
:undoc-members:
:show-inheritance:
dist\_checkpointing.mapping module
----------------------------------
.. automodule:: core.dist_checkpointing.mapping
:members:
:undoc-members:
:show-inheritance:
dist\_checkpointing.optimizer module
------------------------------------
.. automodule:: core.dist_checkpointing.optimizer
:members:
:undoc-members:
:show-inheritance:
dist\_checkpointing.core module
-------------------------------
.. automodule:: core.dist_checkpointing.core
:members:
:undoc-members:
:show-inheritance:
dist\_checkpointing.dict\_utils module
--------------------------------------
.. automodule:: core.dist_checkpointing.dict_utils
:members:
:undoc-members:
:show-inheritance:
dist\_checkpointing.utils module
--------------------------------
.. automodule:: core.dist_checkpointing.utils
:members:
:undoc-members:
:show-inheritance:
Module contents
---------------
.. automodule:: core.dist_checkpointing
:members:
:undoc-members:
:show-inheritance:
dist\_checkpointing.strategies package
======================================
Package defining different checkpoint formats (backends) and saving/loading algorithms (strategies).
Strategies can be used to implement new checkpoint formats or new (more optimal for a given use case) ways of saving/loading existing formats.
Strategies are passed to the `dist_checkpointing.load` and `dist_checkpointing.save` functions and control the actual saving/loading procedure.
Submodules
----------
dist\_checkpointing.strategies.base module
------------------------------------------
.. automodule:: core.dist_checkpointing.strategies.base
:members:
:undoc-members:
:show-inheritance:
dist\_checkpointing.strategies.tensorstore module
-------------------------------------------------
.. automodule:: core.dist_checkpointing.strategies.tensorstore
:members:
:undoc-members:
:show-inheritance:
dist\_checkpointing.strategies.two\_stage module
------------------------------------------------
.. automodule:: core.dist_checkpointing.strategies.two_stage
:members:
:undoc-members:
:show-inheritance:
dist\_checkpointing.strategies.zarr module
------------------------------------------
.. automodule:: core.dist_checkpointing.strategies.zarr
:members:
:undoc-members:
:show-inheritance:
Module contents
---------------
.. automodule:: core.dist_checkpointing.strategies
:members:
:undoc-members:
:show-inheritance:
distributed package
===================
This package contains various utilities to finalize model weight gradients
on each rank before the optimizer step. This includes a distributed data
parallelism wrapper to all-reduce or reduce-scatter the gradients across
data-parallel replicas, and a `finalize\_model\_grads` method to
synchronize gradients across different parallelism modes (e.g., 'tied'
layers on different pipeline stages, or gradients for experts in a MoE on
different ranks due to expert parallelism).
Submodules
----------
distributed.distributed\_data\_parallel
---------------------------------------
Model wrapper for distributed data parallelism. Stores gradients in a
contiguous buffer, and supports the option of overlapping communication
(all-reduce or reduce-scatter) with backprop computation by breaking up
the full model's gradients into smaller buckets and running all-reduce /
reduce-scatter on each bucket asynchronously.
.. automodule:: core.distributed.distributed_data_parallel
:members:
:undoc-members:
:show-inheritance:
distributed.finalize\_model\_grads
----------------------------------
Finalize model gradients for optimizer step across all used parallelism modes.
Synchronizes the all-reduce / reduce-scatter of model gradients across DP replicas,
all-reduces the layernorm gradients for sequence parallelism, embedding gradients
across first and last pipeline stages (if not tied), and expert gradients for expert
parallelism.
.. automodule:: core.distributed.finalize_model_grads
:members:
:undoc-members:
:show-inheritance:
Module contents
---------------
Contains functionality to synchronize gradients across different ranks before
optimizer step.
.. automodule:: core.distributed
:members:
:undoc-members:
:show-inheritance:
fusions package
===============
This package provides modules containing commonly fused
operations. Fusing operations improves compute efficiency by
increasing the amount of work done each time a tensor is read from
memory. To perform the fusion, modules in this package either rely on
PyTorch functionality for just-in-time compilation
(i.e., `torch.jit.script` in older PyTorch versions or `torch.compile`
in recent versions), or call into custom kernels in external libraries
such as Apex or TransformerEngine.
Submodules
----------
fusions.fused\_bias\_dropout module
-----------------------------------
This module uses PyTorch JIT to fuse the bias add and dropout operations. Since dropout is not used during inference, different functions are used when in train mode and when in inference mode.
.. automodule:: core.fusions.fused_bias_dropout
:members:
:undoc-members:
:show-inheritance:
fusions.fused\_bias\_gelu module
--------------------------------
This module uses PyTorch JIT to fuse the bias add and GeLU nonlinearity operations.
.. automodule:: core.fusions.fused_bias_gelu
:members:
:undoc-members:
:show-inheritance:
fusions.fused\_layer\_norm module
---------------------------------
This module provides a wrapper around various fused LayerNorm implementations in Apex.
.. automodule:: core.fusions.fused_layer_norm
:members:
:undoc-members:
:show-inheritance:
fusions.fused\_softmax module
-----------------------------
This module provides wrappers around variations of Softmax in Apex.
.. automodule:: core.fusions.fused_softmax
:members:
:undoc-members:
:show-inheritance:
fusions.fused\_cross\_entropy\_loss module
------------------------------------------
This module uses PyTorch JIT to fuse the cross entropy loss calculation and batches communication calls.
.. automodule:: core.fusions.fused_cross_entropy
:members:
:undoc-members:
:show-inheritance:
API Guide
=========
.. toctree::
:maxdepth: 4
models
tensor_parallel
context_parallel
pipeline_parallel
fusions
transformer
moe
dist_checkpointing
distributed
datasets
models.bert package
===================
Useful package for training BERT and BERT-like encoder-only models. It optionally comes with a binary head that can be used for classification tasks.
Submodules
----------
models.bert.bert\_model module
------------------------------
.. automodule:: core.models.bert.bert_model
:members:
:undoc-members:
:show-inheritance:
Module contents
---------------
.. automodule:: core.models.bert
:members:
:undoc-members:
:show-inheritance:
models.gpt package
==================
This is the implementation of the popular GPT model. It supports several features, such as model parallelism (tensor, pipeline, and data parallel), mixture of experts, FP8, and the distributed optimizer. We are constantly adding new features, so be on the lookout or raise an issue if you want to have something added.
Submodules
----------
models.gpt.gpt\_model module
----------------------------
.. automodule:: core.models.gpt.gpt_model
:members:
:undoc-members:
:show-inheritance:
Module contents
---------------
.. automodule:: core.models.gpt
:members:
:undoc-members:
:show-inheritance:
models package
==============
This package contains most of the popular LLMs. Currently we have support for GPT, BERT, T5, and Retro. This is an ever-growing list, so keep an eye out.
Subpackages
-----------
.. toctree::
:maxdepth: 4
models.gpt
models.t5
models.bert
Module contents
---------------
.. automodule:: core.models
:members:
:undoc-members:
:show-inheritance:
models.t5 package
=================
Submodules
----------
models.t5.t5\_model module
--------------------------
.. automodule:: core.models.T5.t5_model
:members:
:undoc-members:
:show-inheritance:
Module contents
---------------
.. automodule:: core.models.T5
:members:
:undoc-members:
:show-inheritance:
Mixture of Experts package
==========================
.. mdinclude :: ../../../megatron/core/transformer/moe/README.md
pipeline\_parallel package
==========================
This package contains implementations for two different pipeline parallelism
schedules (one without interleaving and one with interleaving, see `Efficient
Large-Scale Language Model Training on GPU Clusters Using Megatron-LM <https://arxiv.org/abs/2104.04473>`_
for details), and a default no-pipelining schedule. It also contains methods
for the point-to-point communication that is needed between pipeline stages.
Submodules
----------
pipeline\_parallel.p2p\_communication module
--------------------------------------------
Contains implementations for the various point-to-point communication needed
(e.g., `recv_forward` and `recv_backward`) in the different pipeline parallelism
schedules.
.. automodule:: core.pipeline_parallel.p2p_communication
:members:
:undoc-members:
:show-inheritance:
pipeline\_parallel.schedules module
-----------------------------------
Contains implementations for two pipeline parallelism schedules
(`forward_backward_pipelining_with_interleaving` for pipeline parallelism with
interleaving, `forward_backward_pipelining_without_interleaving` for pipeline
parallelism without interleaving) and a default no-pipelining schedule
(`forward_backward_no_pipelining`). `get_forward_backward_func` returns the right
scheduling function to use based on the configuration being trained
(e.g., if pipeline-parallel size is 1, use `forward_backward_no_pipelining`).
.. automodule:: core.pipeline_parallel.schedules
:members:
:undoc-members:
:show-inheritance:
Module contents
---------------
.. automodule:: core.pipeline_parallel
:members:
:undoc-members:
:show-inheritance:
tensor\_parallel package
========================
This package contains an implementation for tensor parallelism in transformer
models (see `Megatron-LM: Training Multi-Billion Parameter Language Models
Using Model Parallelism <https://arxiv.org/abs/1909.08053>`_ and `Reducing
Activation Recomputation in Large Transformer Models <https://arxiv.org/abs/2205.05198>`_
for details).
Submodules
----------
tensor\_parallel.cross\_entropy module
--------------------------------------
.. automodule:: core.tensor_parallel.cross_entropy
:members:
:undoc-members:
:show-inheritance:
tensor\_parallel.data module
----------------------------
.. automodule:: core.tensor_parallel.data
:members:
:undoc-members:
:show-inheritance:
tensor\_parallel.layers module
------------------------------
.. automodule:: core.tensor_parallel.layers
:members:
:undoc-members:
:show-inheritance:
tensor\_parallel.mappings module
--------------------------------
.. automodule:: core.tensor_parallel.mappings
:members:
:undoc-members:
:show-inheritance:
tensor\_parallel.random module
------------------------------
.. automodule:: core.tensor_parallel.random
:members:
:undoc-members:
:show-inheritance:
tensor\_parallel.utils module
-----------------------------
.. automodule:: core.tensor_parallel.utils
:members:
:undoc-members:
:show-inheritance:
Module contents
---------------
.. automodule:: core.tensor_parallel
:members:
:undoc-members:
:show-inheritance: