In LiBai, you can try out different parallel strategies by simply changing the distributed config in [training config file](https://github.com/Oneflow-Inc/libai/blob/main/configs/common/train.py).
```python
# Distributed arguments
dist=dict(
    data_parallel_size=1,
    tensor_parallel_size=1,
    pipeline_parallel_size=1,
    # users must set the `pipeline_num_layers` attribute when `pipeline_parallel_size > 1`
    pipeline_num_layers=None,
    # users can customize the number of layers in different stages
    # by setting the `custom_pipeline_stage_id` attribute, which is used to
    # manually balance the computation between stages when running pipeline parallelism
    # e.g., you can set `custom_pipeline_stage_id=[0, 0, 0, 1]`
    # for `pipeline_num_layers=4` and `pipeline_parallel_size=2`,
    # which means the first 3 layers will be placed on stage 0 and
    # the last layer will be placed on stage 1
    # NOTE: if it is None, LiBai will automatically set the pipeline stage ids;
    # `auto_pipeline_stage_id` and `actual_pipeline_stage_id` will be saved in `config.yaml`
    custom_pipeline_stage_id=None,
)
```
For example, you can set `data_parallel_size=2` which will automatically split the input data into two groups for data parallel training.
## Distributed Setting Example
Here are some simple examples for you to understand the basic configuration of LiBai's distributed settings. LiBai's **BERT** model supports three parallelism techniques (**data parallel training**, **tensor parallel training** and **pipeline parallel training**). Take 1 node with 8 GPUs as an example. If you do not change any default settings, LiBai will execute **data parallel training** by default. You can try out different combinations of parallelism training techniques by updating [bert config file](https://github.com/Oneflow-Inc/libai/blob/main/configs/bert_large_pretrain.py) as follows:
#### **Pure Data Parallel Training on 8 GPUs**
In this example, the model is replicated on 8 GPUs, and each replica handles only part of the input data during iteration.
```python
from .common.train import train
...
train.dist.data_parallel_size = 8
```
#### **Pure Tensor Parallel Training on 8 GPUs**
In this example, the weight of the layers in the model will be split into 8 parts for tensor parallel training on 8 GPUs.
```python
from .common.train import train
...
train.dist.tensor_parallel_size = 8
```
**Note:** This only works for models built with ``libai.layers``.
#### **Pure Pipeline Parallel Training on 8 GPUs**
In this example, 8 GPUs will be split into 8 stages, and different layers of the model will be put on different stages automatically for pipeline parallel training.
- `train.dist.pipeline_num_layers` must be set consistently with the number of model layers. If unset, the default value `1000` will be used, which might trigger unexpected behavior.
- For models that have already been configured with pipeline parallelism (e.g., BERT, GPT-2, T5 and ViT), you can simply update the distributed config to run pipeline parallel training on them, as in the sketch below. If you need to train your own model with the pipeline parallel strategy, please refer to [Write Models](https://libai.readthedocs.io/en/latest/tutorials/basics/Write_Models.html) for more details about configuring your own model with pipeline parallelism.
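A minimal config sketch for this setting (the layer count `24` is illustrative and must match your model):
```python
from .common.train import train
...
train.dist.pipeline_parallel_size = 8
# must equal the actual number of layers in your model; 24 is illustrative
train.dist.pipeline_num_layers = 24
```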
#### **Data Parallel + Tensor Parallel for 2D Parallel Training on 8 GPUs**
In this example, 8 GPUs will be split into **2 groups**, and each group contains **4 GPUs**. The input data will be split into 2 parts by chunking in the batch dimension for data parallel training between the 2 groups. The model is replicated between the **2 data-parallel groups**. Within each group, the weight of each layer will be split across **4 GPUs** for tensor parallel training.
```python
from .common.train import train
...
train.dist.data_parallel_size = 2
train.dist.tensor_parallel_size = 4
```
Here we provide a specific example to help you understand this. We number the 8 GPUs from 0 to 7, i.e., ``[0, 1, 2, 3, 4, 5, 6, 7]``. For ``data parallel + tensor parallel``, the 8 GPUs will be split into 2 groups as ``[[0, 1, 2, 3], [4, 5, 6, 7]]``, with ``GPU: [0, 1, 2, 3]`` as group-0 and ``GPU: [4, 5, 6, 7]`` as group-1. The model is replicated between group-0 and group-1. In group-0, the model executes tensor parallelism across ``GPU: [0, 1, 2, 3]``; in group-1, across ``GPU: [4, 5, 6, 7]``. Each group only handles a portion of the input data for data parallel training.
#### **Data Parallel + Pipeline Parallel for 2D Parallel Training on 8 GPUs**
In this example, 8 GPUs will be split into **4 stages**. Each stage contains **2 GPUs** which will be split into **2 data-parallel groups**. Each stage only contains a portion of the model. The weight of the layers put on the specific stage is replicated on **2 data-parallel groups**. Each group handles a portion of the input data.
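A minimal config sketch for this setting (the layer count is illustrative):
```python
from .common.train import train
...
train.dist.data_parallel_size = 2
train.dist.pipeline_parallel_size = 4
train.dist.pipeline_num_layers = 24  # must match your model
```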
#### **Tensor Parallel + Pipeline Parallel for 2D Parallel Training on 8 GPUs**
In this example, 8 GPUs will be split into **4 stages**, each stage contains **2 GPUs** as a **group**. And different layers in the model will be put on different stages automatically for pipeline parallel training. The weight of the layers put on the specific stage will be split into 2 parts for tensor parallel training within the group.
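A minimal config sketch for this setting (the layer count is illustrative):
```python
from .common.train import train
...
train.dist.tensor_parallel_size = 2
train.dist.pipeline_parallel_size = 4
train.dist.pipeline_num_layers = 24  # must match your model
```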
#### **Data Parallel + Tensor Parallel + Pipeline Parallel for 3D Parallel Training on 8 GPUs**
In this example, 8 GPUs will also be split into **2 stages**, and different layers in the model will be put on different stages for pipeline parallel training. Each stage only contains a portion of the whole model, and each stage will be split into **2 groups**. In each stage, the model will be replicated between **2 data-parallel groups**, and each **data-parallel group** contains **2 GPUs**. The input data will be split into 2 parts by chunking in the batch dimension for data-parallel training between **2 data-parallel groups**. Within each group, the weight of each layer will be split across **2 GPUs** for tensor parallel training.
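A minimal config sketch for this 2 x 2 x 2 layout (the layer count is illustrative):
```python
from .common.train import train
...
train.dist.data_parallel_size = 2
train.dist.tensor_parallel_size = 2
train.dist.pipeline_parallel_size = 2
train.dist.pipeline_num_layers = 24  # must match your model
```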
**Note:** `train.dist.data_parallel_size` will be automatically calculated by `(gpu_nums / (tensor_parallel_size * pipeline_parallel_size))` if only `train.dist.tensor_parallel_size` and `train.dist.pipeline_parallel_size` are set. For example:
```python
from .common.train import train
...
# only set tensor_parallel_size and pipeline_parallel_size
train.dist.tensor_parallel_size = 2
train.dist.pipeline_parallel_size = 2
```
On 8 GPUs, `data_parallel_size` will be automatically set to `(8 / (2 * 2)) = 2`.
#### **Set `custom_pipeline_stage_id` for Load Balance**
In most cases, the transformer layers of common models have the same computational overhead, so there is no need to set `custom_pipeline_stage_id`.
But when the transformer layers have unbalanced computational overhead, you can set `custom_pipeline_stage_id` to manually balance the computation between stages in pipeline parallelism.
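A minimal sketch, reusing the `[0, 0, 0, 1]` example from the config comments above:
```python
from .common.train import train
...
train.dist.pipeline_parallel_size = 2
train.dist.pipeline_num_layers = 4
# place the first 3 layers on stage 0 and the heavier last layer on stage 1
train.dist.custom_pipeline_stage_id = [0, 0, 0, 1]
```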
Evaluation is a process that takes a number of input/output pairs and aggregates them into metrics. You can always use the model directly and parse its inputs/outputs manually to perform evaluation. Alternatively, evaluation can be implemented in LiBai using the `DatasetEvaluator` interface.
LiBai includes a few `DatasetEvaluator` subclasses that compute metrics like top-N accuracy, PPL (perplexity), etc. You can also implement your own `DatasetEvaluator` that performs some other jobs using the input/output pairs. For example, to count how many instances are detected on the validation set:
```python
class Counter(DatasetEvaluator):
    def reset(self):
        self.count = 0

    def process(self, inputs, outputs):
        for output in outputs:
            self.count += len(output["instances"])

    def evaluate(self):
        # save self.count somewhere, or print it, or return it
        return {"count": self.count}
```
## Customize Evaluator using DatasetEvaluator
`DatasetEvaluator` is the base class for a dataset evaluator. This class accumulates information about the inputs/outputs (via `process`) after every batch inference, and produces the evaluation results at the end (via `evaluate`). The input comes from `trainer.get_batch()`, which converts the outputs of `dataset.__getitem__()` to a dict. The output is the dict returned by `model.forward()`.
First, declare a new evaluator class that inherits from `DatasetEvaluator` and overrides its `process` and `evaluate` functions to satisfy your needs.
For example, declare a `MyEvaluator` class in `libai/evaluator/myevaluator.py`:
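A minimal sketch of such an evaluator; the accuracy metric and the `prediction_scores`/`labels` keys are assumptions for illustration, so adapt them to your model's actual dict keys:
```python
from libai.evaluation.evaluator import DatasetEvaluator  # assumed import path


class MyEvaluator(DatasetEvaluator):
    def reset(self):
        self.correct = 0
        self.total = 0

    def process(self, inputs, outputs):
        # `prediction_scores` and `labels` are illustrative keys
        preds = outputs["prediction_scores"].argmax(dim=-1)
        labels = inputs["labels"]
        self.correct += (preds == labels).sum().item()
        self.total += labels.numel()

    def evaluate(self):
        return {"my_accuracy": self.correct / max(self.total, 1)}
```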
LiBai provides many features out of the box. This section shows how to configure them step by step.
## Automatic Mixed Precision Training
AMP stands for automatic mixed precision training. To enable AMP in LiBai, add `train.amp.enabled=True` in your configuration file.
### Usage
```python
# import config
from .common.train import train

# get config
from libai.config import get_config
train = get_config("common/train.py").train

# enable amp
train.amp.enabled = True

# disable amp
train.amp.enabled = False
```
## Gradient Clipping
Gradient clipping is a technique that tackles exploding gradients. The idea of gradient clipping is very simple: the gradient will be rescaled down if it gets too large.
LiBai supports gradient clipping in a convenient way, and you don't have to implement it by yourself. You just need to add some settings to your configuration file to enable gradient clipping.
**Note:** We do not recommend writing gradient clipping by yourself, because naive gradient clipping may fail when using tensor parallel or pipeline parallel.
### Usage
```python
# import config
from .common.optim import optim

# get config
from libai.config import get_config
optim = get_config("common/optim.py").optim

# enable gradient clipping
optim.params.clip_grad_max_norm = 1.0
optim.params.clip_grad_norm_type = 2.0

# disable gradient clipping
optim.params.clip_grad_max_norm = None
optim.params.clip_grad_norm_type = None
```
`clip_grad_max_norm` and `clip_grad_norm_type` are described in the [OneFlow API docs](https://oneflow.readthedocs.io/en/master/nn.html#oneflow.nn.utils.clip_grad_norm_).
## Gradient Accumulation
Gradient accumulation is a common strategy to train large-scale models when memory becomes the bottleneck. This technique splits the mini-batch into several micro-batches, then performs normal forward and backward operations. Models will only be updated after accumulating the gradients of all these micro-batches.
Besides, when training with pipeline parallelism, gradient accumulation lets different stages execute different micro-batches in parallel, so the computation of the stages can be overlapped.
### Usage
```python
# import config
from .common.train import train

# get config
from libai.config import get_config
train = get_config("common/train.py").train

# enable grad accumulation for 4 steps
train.num_accumulation_steps = 4

# disable grad accumulation
train.num_accumulation_steps = None
```
## Activation Checkpointing
To reduce GPU memory usage and deploy a large model to a training system, LiBai supports activation checkpointing. LiBai uses a Transformer layer as the unit of checkpointing, because the activation size bloats in the middle of a Transformer layer, so checkpointing the input of a Transformer layer is storage-efficient.
LiBai supports [activation checkpointing](https://arxiv.org/abs/1604.06174) by `set_activation_checkpoint` in `GraphBase`. So models using `libai.layers.TransformerLayer` support activation checkpointing by default. If you want to set activation checkpointing for customized layers, you need to override this function.
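As with the other features, this can be toggled from the config. A sketch, assuming a `train.activation_checkpoint.enabled` flag; verify the exact name in `configs/common/train.py`:
```python
from libai.config import get_config

train = get_config("common/train.py").train

# enable activation checkpointing (flag name assumed; check configs/common/train.py)
train.activation_checkpoint.enabled = True
```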
## Zero Redundancy Optimizer (ZeRO)
Unlike normal data parallelism, where model states and gradients are replicated across data-parallel processes, the Zero Redundancy Optimizer (ZeRO) partitions them across data-parallel processes, which can reduce the memory footprint significantly.
- Level 1: The optimizer states (e.g., for Adam optimizer, 32-bit weights, and the first and second moment estimates) are partitioned across the processes so that each process will only update its own partition.
- Level 2: The reduced 32-bit gradients for updating the model weights are also partitioned so that each process retains only the gradients corresponding to its portion of the optimizer states.
> **Note:** ZeRO only supports data parallel and pipeline parallel, or the combination of them. If you use tensor parallel in your training, make sure ZeRO is disabled.
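A usage sketch in the same style as the features above; the `train.zero_optimization` entry and its fields are assumptions, so check `configs/common/train.py` for the actual names:
```python
from libai.config import get_config

train = get_config("common/train.py").train

# enable ZeRO at level 1 (entry and field names assumed)
train.zero_optimization.enabled = True
train.zero_optimization.stage = 1
```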
Instead of directly using [`flow.save()`](https://oneflow.readthedocs.io/en/master/oneflow.html?highlight=save#oneflow.save) and [`flow.load()`](https://oneflow.readthedocs.io/en/master/oneflow.html?highlight=oneflow.load#oneflow.load), LiBai provides the [`checkpoint module`](https://libai.readthedocs.io/en/latest/modules/libai.utils.html#module-libai.utils.checkpoint) to deal with the complex situations when saving/loading model.
Typically, you don't need to write the relative code to load/save weights trained from LiBai when using LiBai's `DefaultTrainer` and `LazyConfig`. For more details, see [Training & Evaluation in Command Line](https://libai.readthedocs.io/en/latest/tutorials/basics/Train_and_Eval_Command_Line.html) which introduces `weight load` and `resume training` settings in `config.py` or in command line for standard training.
Here we introduce how to load and save weights according to your custom needs. Suppose you have a model trained with LiBai:
```shell
# your model directory
output/finetune_qqp
├── config.yaml
├── last_checkpoint
├── log.txt
├── log.txt.rank1
├── log.txt.rank2
├── log.txt.rank3
├── metrics.json
├── model_0000009
│ ├── graph
│ ├── lr_scheduler
│ └── model
├── model_0000019
│ ├── graph
│ ├── lr_scheduler
│ └── model
├── model_best
│ ├── graph
│ ├── lr_scheduler
│ └── model
└── model_final
├── graph
├── lr_scheduler
└── model
```
The following code shows how to load/save model weights:
```python
from libai.utils.checkpoint import Checkpointer
from path.to.your.build_model import build_model

model = build_model(cfg.model)

# load model weights
checkpointer = Checkpointer(model, save_dir="output")
checkpointer.load(path_to_model)  # path_to_model should be "output/finetune_qqp/model_final"

# save model weights
checkpointer.save("model_999")  # save to output/model_999
```
You can also save other information (e.g., `optim`, `scheduler`) besides model weights by using `checkpointer`. See [libai.utils.checkpoint](https://libai.readthedocs.io/en/latest/modules/libai.utils.html#module-libai.utils.checkpoint) for more details.
You can set the `--json-keys` argument to select the specific keys of each sample; the other keys will not be used.
Then, process the JSON file into a binary format for training. To convert the JSON into the mmap, cached index file, or lazy loader format, use `tools/preprocess_data.py` and set the `--dataset-impl` flag to `mmap`, `cached`, or `lazy`, respectively. You can run a command like the sketch below to prepare your own dataset for training BERT.
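A sketch of such an invocation; apart from `--json-keys` and `--dataset-impl` mentioned above, the flag names follow the Megatron-style interface this script derives from, so verify them in the source file linked below:
```shell
python tools/preprocess_data.py \
  --input path/to/dataset.json \
  --json-keys text \
  --vocab-file path/to/bert-vocab.txt \
  --dataset-impl mmap \
  --tokenizer-name BertTokenizer \
  --split-sentences \
  --output-prefix my_bert_dataset \
  --workers 4
```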
Further command line arguments are described in the source file [`preprocess_data.py`](https://github.com/Oneflow-Inc/libai/blob/main/tools/preprocess_data.py).
LiBai provides multiple arguments covering a variety of situations.
## Training
LiBai provides `tools/train.sh` and `tools/train_net.py` for launching training & evaluation tasks.
You can modify `tools/train_net.py` according to your own needs.
### Training & Evaluation
To run the full training and evaluation, run:
```shell
bash tools/train.sh \
tools/train_net.py \
path_to_your_config.py \ # config.py for your task
4 # number of gpus
```
### Training & Partial Evaluation
If the evaluation process is time-consuming, you can set the parameter `train.evaluation.eval_iter` in your `config.py` to a smaller number such as 20, which speeds up evaluation by using only part of the test set. You can also set the parameter directly on the command line:
**Note:** the eval metric will then be calculated on only part of the test set.
```shell
bash tools/train.sh \
tools/train_net.py \
path_to_your_config.py \ # config.py for your task
4 \ # number of gpus
train.evaluation.eval_iter=20 # set eval_iter for testing
```
### Training & No Evaluation
To train without evaluation, set `train.evaluation.enabled=False` in your `config.py` or in the command line:
```shell
bash tools/train.sh \
tools/train_net.py \
path_to_your_config.py \ # config.py for your task
4 \ # number of gpus
train.evaluation.enabled=False # set no evaluation
```
### Resume Training
To resume training, set `--resume` in the command line, and set `train.output_dir` in your `config.py` or in the command line.
For example, if your training was interrupted unexpectedly and your latest model path is `output/demo/model_0000019/`, then set `train.output_dir=output/demo` to resume training:
```shell
bash tools/train.sh \
tools/train_net.py \
path_to_your_config.py \ # config.py for your task
4 \ # number of gpus
--resume \ # set resume training
train.output_dir=path/task # set resume path, it should be parent directory of model path
```
## Evaluation
To evaluate your model without training, set `--eval-only` in your command line, and set `train.load_weight`.
Besides, `train.evaluation.eval_iter=20` will still take effect with `--eval-only` if you set it. You can set `eval_iter` according to your own needs.
```shell
bash tools/train.sh \
tools/train_net.py \
path_to_your_config.py \ # config.py for your task
4 \ # number of gpus
--eval-only \ # set eval without train
train.load_weight=path/task/model_final # set model path
```
## Quickly Check the Training and Evaluation Loops
To find out whether there are any bugs in your program, pass `--fast-dev-run` to the command line, which will change config settings to:
```python
train.train_epoch = 0
train.train_iter = 20
train.evaluation.eval_period = 10
train.log_period = 1
```
Besides, `train.evaluation.eval_iter=20` will still take effect with `--fast-dev-run` if you set it. You can set `eval_iter` according to your own needs.
```shell
bash tools/train.sh \
tools/train_net.py \
path_to_your_config.py \ # config.py for your task
4 \ # number of gpus
--fast-dev-run # set fast dev run
```
To run training, we highly recommend using the standardized `trainer` in LiBai.
## Trainer Abstraction
LiBai provides a standardized `trainer` abstraction with a hook system to help simplify the standard training behavior.
`DefaultTrainer` is initialized from the lazy config system, used by `tools/train_net.py` and many scripts. It includes many standard default behaviors that you might want to opt in, including default configurations for the optimizer, learning rate scheduler, logging, evaluation, model checkpointing, etc.
For simple customizations (e.g. change optimizer, evaluator, LR scheduler, data loader, etc.), you can just modify the corresponding configuration in `config.py` according to your own needs (refer to [Config_System](https://libai.readthedocs.io/en/latest/tutorials/Config_System.html#configs-in-libai)).
## Customize a DefaultTrainer
For complicated customizations, we recommend overriding functions in [DefaultTrainer](https://github.com/Oneflow-Inc/libai/blob/main/libai/engine/default.py).
In `DefaultTrainer`, the training process consists of `run_step in trainer` and `hooks` which can be modified according to your own needs.
The following code indicates how `run_step` and `hooks` work during training:
```python
class DefaultTrainer(TrainerBase):
    def train(self, start_iter: int, max_iter: int):
        ...
        with EventStorage(self.start_iter) as self.storage:
            try:
                self.before_train()  # in hooks
                for self.iter in range(start_iter, max_iter):
                    self.before_step()  # in hooks
                    self.run_step()  # in self._trainer
                    self.after_step()  # in hooks
                self.iter += 1
            except Exception:
                logger.exception("Exception during training:")
                raise
            finally:
                self.after_train()  # in hooks
```
Refer to `tools/train_net.py` to rewrite `tools/my_train_net.py` with your modified `_trainer` and `hooks`. The next subsection will introduce how to modify them.
Using the ``trainer & hook`` system means there will always be some non-standard behaviors that are hard to support in LiBai, especially for research. Therefore, we intentionally keep the ``trainer & hook`` system minimal rather than powerful.
### Customize Hooks in Trainer
You can customize your own hooks for some extra tasks during training.
[HookBase](https://github.com/Oneflow-Inc/libai/blob/ffe5ca0e46544d1cbb4fbe88d9185f96c0dc2c95/libai/engine/trainer.py#L28) in `libai/engine/trainer.py` provides a standard behavior for hooks. You can override its functions according to your own needs. Please refer to [libai/engine/hooks.py](https://github.com/Oneflow-Inc/libai/blob/main/libai/engine/hooks.py) for more details.
```python
class HookBase:
    def before_train(self):
        """
        Called before the first iteration.
        """

    def after_train(self):
        """
        Called after the last iteration.
        """

    def before_step(self):
        """
        Called before each iteration.
        """

    def after_step(self):
        """
        Called after each iteration.
        """
```
Depending on the functionality of the hook, you can specify what the hook will do at each stage of the training in ``before_train``, ``after_train``, ``before_step``, ``after_step``. For example, to print `iter` in trainer during training:
```python
class InfoHook(HookBase):
    def before_train(self):
        logger.info(f"start training at {self.trainer.iter}")

    def after_train(self):
        logger.info(f"end training at {self.trainer.iter}")

    def after_step(self):
        if self.trainer.iter % 100 == 0:
            logger.info(f"iteration {self.trainer.iter}!")
```
Then you can import your `hook` in `tools/my_train_net.py`, for example:
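A minimal registration sketch, assuming the trainer exposes the detectron2-style `register_hooks` method:
```python
# in tools/my_train_net.py (sketch)
trainer = DefaultTrainer(cfg)
trainer.register_hooks([InfoHook()])
trainer.train()
```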
### Modify train_step in Trainer
LiBai provides `EagerTrainer` and `GraphTrainer` in `libai/engine/trainer.py` by default. `EagerTrainer` is used in `eager` mode, while `GraphTrainer` is used in `graph` mode, and the mode is determined by the `graph.enabled` parameter in your `config.py`.
> For more details about `eager` and `graph` mode, please refer to [oneflow doc](https://docs.oneflow.org/en/master/basics/08_nn_graph.html).
For example, to use a temp variable to keep the model's output in `run_step`:
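A sketch of such a customized trainer; the `run_step` signature and return value here are assumptions, so check `libai/engine/trainer.py` for the real interface:
```python
from libai.engine.trainer import EagerTrainer


class MyEagerTrainer(EagerTrainer):
    def run_step(self, get_batch):
        # run the standard training step, then keep the model's output in a
        # temp variable so that hooks or later code can inspect it
        output = super().run_step(get_batch)
        self.last_output = output
        return output
```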
Then you can set your `MyEagerTrainer` as `self.trainer` in `tools/my_train_net.py`.
## Logging of Metrics
During training, the trainer puts metrics into a centralized [EventStorage](https://libai.readthedocs.io/en/latest/modules/libai.utils.html#module-libai.utils.events). The following code can be used to access it and log metrics to it:
```python
from libai.utils.events import get_event_storage

# inside the model:
if self.training:
    value = ...  # compute the value from inputs
    storage = get_event_storage()
    storage.put_scalar("some_accuracy", value)
```
See [EventStorage](https://libai.readthedocs.io/en/latest/modules/libai.utils.html#module-libai.utils.events) for more details.
Metrics are then written to various destinations with `EventWriter`, and metrics information is written to `{cfg.train.output_dir}/metrics.json`. `DefaultTrainer` enables a few `EventWriter`s with default configurations; see above for how to customize them.
This tutorial explains how the dataset APIs work, and how to customize your own datasets with them.
## Build Common Dataloaders
To build dataloaders in LiBai, we highly recommend using the default `build_nlp_train_val_test_loader`, `build_nlp_train_loader`, `build_nlp_test_loader`, `build_image_train_loader` and `build_image_test_loader` defined in [`libai/data/build.py`](https://github.com/Oneflow-Inc/libai/blob/main/libai/data/build.py) for most of the common cases.
The only thing you need to do is to write a PyTorch-style `Dataset` and return the `Instance` structure in `__getitem__`. The `Instance` structure stores the attributes of an instance (e.g., image, tokens) as "fields", and the `DistTensorData` structure provides a standard `to_global()` function (called in `get_batch()`) to convert local tensors to global tensors.
The instance returned by the `__getitem__` function must contain the same keys as the `args` passed to the `forward` function of the `model`. The following shows an example:
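A sketch of such a dataset, assuming the `libai.data.structures` import path; the `input_ids`/`labels` field names are illustrative and must line up with your model's `forward` arguments:
```python
import oneflow as flow
from oneflow.utils.data import Dataset

from libai.data.structures import DistTensorData, Instance


class MyDataset(Dataset):
    def __init__(self, tokens, labels):
        self.tokens = tokens
        self.labels = labels

    def __len__(self):
        return len(self.tokens)

    def __getitem__(self, idx):
        # the keys here must match the argument names of model.forward
        return Instance(
            input_ids=DistTensorData(flow.tensor(self.tokens[idx], dtype=flow.long)),
            labels=DistTensorData(
                flow.tensor(self.labels[idx], dtype=flow.long),
                placement_idx=-1,  # only used in the loss function; see the note below
            ),
        )
```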
**NOTE:** Set `placement_idx=-1` in `DistTensorData` when the `tensor` is **only** used in `loss_function`, it is used for pipeline parallel training.
In particular, if you need to generate your own `attention_mask`, its values can only be `0` or `1`, because LiBai has already processed the `attention_mask` in [`libai/layers/attention.py`](https://github.com/Oneflow-Inc/libai/blob/main/libai/layers/attention.py).
After finishing your `MyDataset`, set `dataloader` in your `config.py` depending on your needs. If you have only one training dataset for an NLP task and want to split it into `train`, `valid` and `test` datasets automatically, you can choose `build_nlp_train_val_test_loader`; the evaluation will then be calculated on the `valid` and `test` datasets.
Otherwise, you can choose `build_nlp_train_loader` && `build_nlp_test_loader` or `build_image_train_loader` && `build_image_test_loader` in `config.py` according to your own needs.
See [`libai/data/build.py`](https://github.com/Oneflow-Inc/libai/blob/main/libai/data/build.py) for more details.
This section introduces how to implement a new model entirely from scratch and make it compatible with LiBai.
## Construct Models in LiBai
LiBai uses [LazyConfig](https://libai.readthedocs.io/en/latest/tutorials/Config_System.html) for a more flexible config system, which means you can simply import your own model in your config and train it under LiBai.
For an image classification task, the input data is usually a batch of images and labels. The following code shows how to build a toy model for this task:
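A minimal sketch of such a toy model (the layer choices are illustrative):
```python
import oneflow as flow
import oneflow.nn as nn


class ToyModel(nn.Module):
    def __init__(self, num_classes=1000):
        super().__init__()
        self.conv = nn.Conv2d(3, 16, kernel_size=3, padding=1)
        self.avgpool = nn.AdaptiveAvgPool2d((1, 1))
        self.classifier = nn.Linear(16, num_classes)
        self.loss_func = nn.CrossEntropyLoss()

    def forward(self, images, labels=None):
        x = self.avgpool(flow.relu(self.conv(images))).flatten(1)
        logits = self.classifier(x)
        if labels is not None and self.training:
            # return losses as a dict during training
            return {"losses": self.loss_func(logits, labels)}
        # return prediction scores as a dict during inference
        return {"prediction_scores": logits}
```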
- For classification models, the ``forward`` function must have ``images`` and ``labels`` as arguments, which correspond to the output in ``__getitem__`` of LiBai's built-in datasets. Please refer to [imagenet.py](https://github.com/Oneflow-Inc/libai/blob/main/libai/data/datasets/imagenet.py) for more details about the dataset.
- **This toy model** will return ``losses`` during training and ``prediction_scores`` during inference, and both of them should be ``dict``s, which means you should implement the ``loss function`` in your model, like ``self.loss_func = nn.CrossEntropyLoss()`` as the ToyModel above shows.
## Import the model in config
With ``LazyConfig System``, you can simply import the model in your config file. The following code shows how to use ``ToyModel`` in your config file:
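A sketch, where the module path `path.to.toy_model` is a placeholder for wherever you defined `ToyModel`:
```python
# config.py
from libai.config import LazyCall
from path.to.toy_model import ToyModel

model = LazyCall(ToyModel)(num_classes=1000)
```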
Here we provide our benchmark speed test results of LiBai's models compared with the [Megatron-LM](https://github.com/NVIDIA/Megatron-LM) implementations. In LiBai v0.2.0, we only benchmarked speed on up to 32 GPUs across 4 nodes, and all of the experiments were conducted under the same settings for a fair comparison.
## Settings
### Environments
- The commit of LiBai for comparison: [commit](https://github.com/Oneflow-Inc/libai/commit/9fc504c457da4fd1e92d854c60b7271c89a55222)
- The commit of OneFlow for comparison: [commit](https://github.com/Oneflow-Inc/oneflow/commit/55b822e4d3c88757d11077d7546981309125c73f)
- The commit of Megatron-LM for comparison: [commit](https://github.com/NVIDIA/Megatron-LM/commit/e156d2fea7fc5c98e645f7742eb86b643956d840)
### Model Hyper-parameters
- **BERT Model**
```python
num_layers=24/48
num_attention_heads=16
hidden_size=1024
seq_length=512
```
- **GPT-2 Model**
```python
num_layers=24/48
num_attention_heads=16
hidden_size=1024
seq_length=1024
```
## Main Results
Here we explain the evaluation indicators in the following tables:
- **fp16**: mixed precision training
- **nl**: num layers (when pipeline parallel size = 8, in order to have a comparable number of layers per stage for computation, we adjust the num layers from 24 to 48)
- **ac**: enable activation checkpointing
- **mb**: micro-batch size per GPU
- **gb**: global batch size in total
- **d x m x p**:
  - d: data-parallel-size
  - m: tensor-model-parallel-size
  - p: pipeline-model-parallel-size
- **1n1g**: 1 node, 1 GPU
- **2n8g**: 2 nodes, 8 GPUs per node, 16 GPUs in total
- **4n8g**: 4 nodes, 8 GPUs per node, 32 GPUs in total
- Create a conda virtual environment and activate it:
```bash
conda create -n libai python=3.8 -y
conda activate libai
```
- Install the stable release of OneFlow with `CUDA` support. See the [OneFlow installation guide](https://github.com/Oneflow-Inc/oneflow#install-with-pip-package). To use the **latest** LiBai (branch `main`), we highly recommend installing the **nightly** OneFlow.
A collection of parallel training strategies is supported in LiBai:
-**Data Parallel Training**
-**Tensor Parallel Training**
-**Pipeline Parallel Training**
You can refer to OneFlow official [tutorial](https://docs.oneflow.org/en/master/parallelism/01_introduction.html) to better understand the basic conception of parallelization techniques.
## Supported Models in LiBai
For more details about the supported parallelism training on different models, please refer to the following table:
✔ means you can train this model under the specific parallelism technique, or combine two or three ✔ techniques for 2D or 3D parallelism training.
## Baselines
Here is the collection of baselines trained with LiBai. Due to our resource constraints, we will gradually release the training results in the future.
### Main Results on ImageNet with Pretrained Models
- Download the dataset and move the data file to the folder. The file structure should be like:
```bash
$ tree data
path/to/bert_data
├── bert-base-chinese-vocab.txt
├── loss_compara_content_sentence.bin
└── loss_compara_content_sentence.idx
```
### How to Train Bert_large Model with Parallelism
We provide `train.sh` for executing training. Before invoking the script, perform the following steps.
**Step 1. Set data path and vocab path**
- Update the data path and vocab path in [bert_large_pretrain](https://github.com/Oneflow-Inc/libai/blob/main/configs/bert_large_pretrain.py) config file:
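A sketch of what to set; the `vocab_file`/`data_prefix` variable names are assumptions, so check the config file for the exact names (note that `data_prefix` omits the `.bin`/`.idx` suffixes):
```python
vocab_file = "path/to/bert_data/bert-base-chinese-vocab.txt"
data_prefix = "path/to/bert_data/loss_compara_content_sentence"
```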
**Step 2. Modify the parameters according to your needs**
- In the provided [`configs/bert_large_pretrain.py`](https://github.com/Oneflow-Inc/libai/blob/main/configs/bert_large_pretrain.py), a set of parameters are defined, including the training scheme, model, etc.
- You can also modify the parameter settings. For example, if you want to use 8 GPUs for training, you can refer to [`configs/common/train.py`](https://github.com/Oneflow-Inc/libai/blob/main/configs/common/train.py). If you want to train the model with 2D mesh hybrid parallelism (4 groups for data parallel and 2 groups for tensor parallel), you can set the parameters as follows:
```python
train.dist.data_parallel_size = 4
train.dist.tensor_parallel_size = 2
```
**Step 3. Invoke parallel training**
- To train the `BertForPreTraining` model on a single node with 8 GPUs, run:
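Following the `tools/train.sh` usage shown earlier:
```shell
bash tools/train.sh tools/train_net.py configs/bert_large_pretrain.py 8
```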
- The default ViT model in LiBai is `vit_tiny_patch16_224`. To train other ViT models, update the [vit_imagenet](https://github.com/Oneflow-Inc/libai/blob/main/configs/vit_imagenet.py) config file by importing a different ViT model as follows:
```python
# from .common.models.vit.vit_tiny_patch16_224 import model