# Yuan2.0 Pretraining

## Introduction

This document provides instructions for pretraining the Yuan2.0-M32 model.

The main parameters of the model are as follows:

| Model  | Layer number | Hidden size | Attention heads | Expert number |
| :--:   | :----------: | :---------: | :------------: | :--------: |
| 2*32B  |      24      |    2048     |       16       |     32     |

## Usage

The following script is provided for the Yuan2.0-M32 model:

- 2x32B: [`pretrain_yuan2.0_moe_2x32B.sh`](../examples/pretrain_yuan2.0_moe_2x32B.sh)

### Example

An example command to run Yuan2.0-M32 pretraining is:

```shell
bash examples/pretrain_yuan2.0_moe_2x32B.sh
```

### Arguments setting

Before running the script, the relevant arguments should be set correctly.

First, make any desired modifications, including setting the environment variables `CHECKPOINT_PATH`, `DATA_PATH`, `TOKENIZER_MODEL_PATH`, and `TENSORBOARD_PATH`.
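A minimal sketch of these settings, assuming the variables are edited near the top of the launch script. Every path below is a hypothetical placeholder, not a location the repository provides:

```shell
# Hypothetical placeholder paths; replace each with a location on your system.
CHECKPOINT_PATH='/path/to/checkpoints'      # where checkpoints are saved and loaded
DATA_PATH='1 /path/dataset'                 # 'weight dataset_path'
TOKENIZER_MODEL_PATH='/path/to/tokenizer'   # tokenizer files
TENSORBOARD_PATH='/path/to/tensorboard'     # TensorBoard logs

# Abort with a message if any variable is empty or unset,
# so a missing setting is caught before a long job starts.
: "${CHECKPOINT_PATH:?must be set}"
: "${DATA_PATH:?must be set}"
: "${TOKENIZER_MODEL_PATH:?must be set}"
: "${TENSORBOARD_PATH:?must be set}"
```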

If the dataset file is:

```bash
/path/dataset.bin
```

then `DATA_PATH` can be set to a weight followed by the dataset path without the `.bin` extension:

```shell
#DATA_PATH='weight dataset_path'
DATA_PATH='1 /path/dataset'
```
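The same value can be derived from the file name rather than typed by hand. The sketch below is illustrative only: `DATASET_BIN` and `DATASET_PREFIX` are hypothetical helper variables, and the check for a companion `.idx` file assumes the dataset was produced by Megatron-LM style preprocessing, which writes a `.bin` and an `.idx` file with the same prefix.

```shell
# Hypothetical location of the preprocessed dataset file.
DATASET_BIN='/path/dataset.bin'

# Strip the .bin suffix to get the prefix that DATA_PATH expects.
DATASET_PREFIX="${DATASET_BIN%.bin}"

# A single dataset with weight 1.
DATA_PATH="1 ${DATASET_PREFIX}"
echo "DATA_PATH=${DATA_PATH}"

# Warn early if either file is missing (assumes a .bin/.idx pair).
for ext in bin idx; do
    [ -f "${DATASET_PREFIX}.${ext}" ] || echo "warning: ${DATASET_PREFIX}.${ext} not found" >&2
done
```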

For dataset preprocessing, see the documentation [here]().

A simple and efficient three-dimensional model-parallel approach is controlled by the `--tensor-model-parallel-size` and `--pipeline-model-parallel-size` flags. If the `--pipeline-model-parallel-method` flag is set to `block`, the number of transformer layers for each pipeline stage should be specified with `--pipeline-model-parallel-blocks`.
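The third dimension, data parallelism, is not set directly: in Megatron-LM style launches it is the total GPU count divided by the product of the tensor and pipeline sizes. The sketch below checks that arithmetic before launching. The cluster shape and parallel sizes are hypothetical values chosen for illustration, not recommended settings for this model.

```shell
# Hypothetical cluster and parallel settings, for illustration only.
GPUS_PER_NODE=8
NNODES=2
TP_SIZE=4   # value passed to --tensor-model-parallel-size
PP_SIZE=2   # value passed to --pipeline-model-parallel-size

WORLD_SIZE=$((GPUS_PER_NODE * NNODES))
MODEL_PARALLEL_SIZE=$((TP_SIZE * PP_SIZE))

# The GPU count must be divisible by TP * PP; the quotient is the data-parallel size.
if [ $((WORLD_SIZE % MODEL_PARALLEL_SIZE)) -ne 0 ]; then
    echo "error: ${WORLD_SIZE} GPUs is not divisible by TP*PP=${MODEL_PARALLEL_SIZE}" >&2
    exit 1
fi

DP_SIZE=$((WORLD_SIZE / MODEL_PARALLEL_SIZE))
echo "data-parallel size: ${DP_SIZE}"
```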

Localized Filtering-based Attention (LFA) is activated by the `--use-lf-gate` flag. The `--lf-conv2d-num-pad` flag should be set to `1` for training and `0` for inference.
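One way to keep the padding value consistent with the run type is to derive it from a single switch. In this sketch, `MODE`, `LF_PAD`, and `LFA_ARGS` are hypothetical variables introduced for illustration; only the two flags come from the description above.

```shell
# MODE is a hypothetical switch used only in this sketch.
MODE='train'   # 'train' or 'inference'

# Pick the padding value: 1 for training, 0 for inference.
case "$MODE" in
    train)     LF_PAD=1 ;;
    inference) LF_PAD=0 ;;
    *) echo "unknown MODE: $MODE" >&2; exit 1 ;;
esac

# Flags to append to the launch command.
LFA_ARGS="--use-lf-gate --lf-conv2d-num-pad ${LF_PAD}"
echo "$LFA_ARGS"
```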

The `--use-distributed-optimizer` and `--recompute-method` flags control memory usage during training.
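As a rough sketch, these flags might be combined as shown below. The values and the two companion flags (`--recompute-granularity`, `--recompute-num-layers`) follow upstream Megatron-LM and are an assumption here; confirm them against `arguments.py` in this repository before use. `MEMORY_ARGS` is a hypothetical variable name.

```shell
# Flag values follow upstream Megatron-LM and are an assumption here;
# confirm them against megatron/arguments.py before use.

# Shard optimizer state across data-parallel ranks.
MEMORY_ARGS="--use-distributed-optimizer"

# Recompute activations in the backward pass instead of storing them.
MEMORY_ARGS="$MEMORY_ARGS --recompute-granularity full"
MEMORY_ARGS="$MEMORY_ARGS --recompute-method uniform"
MEMORY_ARGS="$MEMORY_ARGS --recompute-num-layers 1"

echo "$MEMORY_ARGS"
```

Both options trade extra communication or computation for lower memory use per GPU, so they are most useful when the model does not otherwise fit.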

Further command-line arguments are described in the source file [`arguments.py`](../megatron/arguments) and in the [Megatron-LM](https://github.com/NVIDIA/Megatron-LM/blob/main/README.md) README.