libai.tokenizer
##############################
.. currentmodule:: libai.tokenizer
.. automodule:: libai.tokenizer
:member-order: bysource
:members:
BertTokenizer,
RobertaTokenizer,
GPT2Tokenizer,
T5Tokenizer,
PreTrainedTokenizer,
libai.utils
##############################
libai.utils.distributed module
---------------------------------
.. currentmodule:: libai.utils
.. automodule:: libai.utils.distributed
:members:
get_dist_util,
setup_dist_util,
ttol,
tton,
synchronize,
convert_to_distributed_default_setting,
get_nd_sbp,
get_layer_placement,
get_world_size,
get_num_nodes,
get_rank,
get_local_rank,
same_sbp,
get_data_parallel_size,
get_data_parallel_rank,
get_hidden_sbp
libai.utils.events module
---------------------------------
.. currentmodule:: libai.utils
.. automodule:: libai.utils.events
:members:
get_event_storage,
JSONWriter,
CommonMetricPrinter,
EventStorage,
libai.utils.logger module
---------------------------------
.. currentmodule:: libai.utils
.. automodule:: libai.utils.logger
:members:
setup_logger,
log_first_n,
log_every_n,
log_every_n_seconds
libai.utils.checkpoint module
---------------------------------
.. currentmodule:: libai.utils
.. automodule:: libai.utils.checkpoint
:members:
Checkpointer,
PeriodicCheckpointer
# Frequently Asked Questions
We list some common problems encountered by users and the corresponding solutions here. Feel free to enrich the list if you find any frequent issues and have ways to help others solve them.
## Training
- "Loss goes NaN or very large"
1. Check if the dataset annotations are valid. The mask must be `{0, 1}`, where `1` marks tokens that are **not masked** and `0` marks tokens that are **masked** (see the sketch below).
2. Check `initializer_range` in the config file. It can safely be set to `0.02` in most cases. If the model is very large, decreasing `initializer_range` is a good choice. For example, `initializer_range` can be set to `0.006` when training the 175-billion-parameter GPT-3 configuration.
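A minimal sketch (not LiBai code) of what a valid attention mask looks like under this convention:
```python
import oneflow as flow

# Two sequences padded to length 5: 1 = real token (not masked), 0 = padding (masked).
attention_mask = flow.tensor(
    [[1, 1, 1, 1, 0],
     [1, 1, 0, 0, 0]],
    dtype=flow.int64,
)
# Any value outside {0, 1} indicates a broken annotation.
assert attention_mask.min().item() >= 0 and attention_mask.max().item() <= 1
```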
- "AMP enabled goes NaN"
Set `ONEFLOW_DEBUG_KERNEL_SYNC_CHECK_NUMERICS=1` to check what triggers an overflow of the value range in fp16.
- "GPU out of memory when validation"
Decrease `test_micro_batch_size` and use `--fast-dev-run` for quickly running through training and evaluation to check if memory is sufficient.
## Model
- "`apply_query_key_layer_scaling` in MultiheadAttention"
As the number of attention heads increases, some of the GEMMs inside the self-attention layer become smaller, and the number of elements in the self-attention softmax also increases.
- "QKV implementation is not consistent with Hugging Face in self attention"
```python
# query_key_value:[batch_size, seq_len, 3*hidden_size]
# QKV in LiBai
query_key_value = query_key_value.view(bsz, -1, self.num_heads, 3 * self.head_size)
query_key_value = query_key_value.permute(0, 2, 1, 3)
query, key, value = flow.chunk(query_key_value, chunks=3, dim=-1)
# QKV in Huggingface
query, key, value = flow.chunk(query_key_value, chunks=3, dim=-1)
query = query.view(query.size(0), query.size(1), self.num_heads, -1).permute(0, 2, 1, 3)
key = key.view(key.size(0), key.size(1), self.num_heads, -1).permute(0, 2, 1, 3)
value = value.view(value.size(0), value.size(1), self.num_heads, -1).permute(0, 2, 1, 3)
```
In tensor parallelism, the `chunk` dimension and the `flow.sbp.split` dimension are the same in Huggingface's implementation, which can cause unexpected behavior (i.e., changing the tensor's SBP unexpectedly).
We also provide a tutorial about how to load Huggingface weights correctly. Please refer to [How to use Huggingface's pretrained weights in LiBai](https://libai.readthedocs.io/en/latest/notes/How_to_implement_huggingface%27s_weights_in_LiBai.html) for more details.
- "the order of layer normalization and the residual connection"
This is critical for scaling BERT-style models beyond BERT-Large. The architecture with `apply_residual_post_layernorm=False` eliminates the instabilities observed with the original BERT architecture (`apply_residual_post_layernorm=True`) and also reaches a lower training loss, according to [Megatron-LM](https://arxiv.org/pdf/1909.08053.pdf). The difference between the two orderings is sketched below.
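A schematic of the two orderings, where `sublayer` stands for the self-attention or MLP sub-module (this sketches the general pre-LN/post-LN difference, not LiBai's exact `TransformerLayer` code):
```python
def post_ln_block(x, sublayer, layer_norm):
    # Original BERT ordering: residual addition first, then layer normalization.
    return layer_norm(x + sublayer(x))


def pre_ln_block(x, sublayer, layer_norm):
    # Megatron-LM ordering (LiBai's default): normalize first; the residual bypasses the layernorm.
    return x + sublayer(layer_norm(x))
```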
If you run into problems that are hard to understand, feel free to open an issue in [OneFlow](https://github.com/Oneflow-Inc/oneflow).
# Detailed instruction on building Vision Transformer models in LiBai
It's easy for users to build transformer-based models with LiBai's built-in [layers](https://libai.readthedocs.io/en/latest/modules/libai.layers.html). Let's take a deep dive into the process of building a Vision Transformer model in LiBai.
## Model Architecture
**Vision Transformer** was released in the paper [An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale](https://arxiv.org/abs/2010.11929) by Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby.
A **Vision Transformer** model contains three parts: `Patch Embedding` + `Transformer Block` + `Linear Classification Head`, which can be summarized in the following picture:
![](./assets/vision_transformer.png)
## A simple Torch implementation of Vision Transformer
The following code shows the PyTorch implementation of ViT models modified from [timm.models.vision_transformer](https://github.com/rwightman/pytorch-image-models/blob/master/timm/models/vision_transformer.py):
```python
import torch
import torch.nn as nn
from timm.models.layers import trunc_normal_, PatchEmbed, Mlp, DropPath
"""
1. Build a self-attention module
"""
class Attention(nn.Module):
def __init__(self, dim, num_heads=8, qkv_bias=False, attn_drop=0.0, proj_drop=0.0):
super().__init__()
self.num_heads = num_heads
head_dim = dim // num_heads
self.scale = head_dim ** -0.5
self.qkv = nn.Linear(dim, dim * 3, bias=qkv_bias)
self.attn_drop = nn.Dropout(attn_drop)
self.proj = nn.Linear(dim, dim)
self.proj_drop = nn.Dropout(proj_drop)
def forward(self, x):
B, N, C = x.shape
qkv = (
self.qkv(x)
.reshape(B, N, 3, self.num_heads, C // self.num_heads)
.permute(2, 0, 3, 1, 4)
)
q, k, v = qkv[0], qkv[1], qkv[2]
attn = (q @ k.transpose(-2, -1)) * self.scale
attn = attn.softmax(dim=-1)
attn = self.attn_drop(attn)
x = (attn @ v).transpose(1, 2).reshape(B, N, C)
x = self.proj(x)
x = self.proj_drop(x)
return x
"""
2. Build a transformer block, which contains:
self-attention layer + mlp layer
"""
class Block(nn.Module):
def __init__(
self,
dim,
num_heads,
mlp_ratio=4.0,
qkv_bias=False,
drop=0.0,
attn_drop=0.0,
drop_path=0.0,
act_layer=nn.GELU,
norm_layer=nn.LayerNorm,
):
super().__init__()
self.norm1 = norm_layer(dim)
self.attn = Attention(
dim,
num_heads=num_heads,
qkv_bias=qkv_bias,
attn_drop=attn_drop,
proj_drop=drop,
)
# Use drop_path here
self.drop_path = DropPath(drop_path) if drop_path > 0.0 else nn.Identity()
self.norm2 = norm_layer(dim)
mlp_hidden_dim = int(dim * mlp_ratio)
self.mlp = Mlp(
in_features=dim,
hidden_features=mlp_hidden_dim,
act_layer=act_layer,
drop=drop,
)
def forward(self, x):
x = x + self.drop_path(self.attn(self.norm1(x)))
x = x + self.drop_path(self.mlp(self.norm2(x)))
return x
"""
3. Build a Vision Transformer model which contains three parts:
patch embedding + transformer block + mlp classification head
"""
class VisionTransformer(nn.Module):
def __init__(
self,
img_size=224,
patch_size=16,
in_chans=3,
embed_dim=192,
depth=12,
num_heads=3,
mlp_ratio=4.0,
drop_rate=0.0,
attn_drop_rate=0.0,
drop_path_rate=0.0,
num_classes=1000,
):
super().__init__()
self.num_classes = num_classes
# patch embedding
self.patch_embed = PatchEmbed(
img_size=img_size,
patch_size=patch_size,
in_chans=in_chans,
embed_dim=embed_dim
)
num_patches = self.patch_embed.num_patches
# cls token and position embedding
self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))
self.pos_drop = nn.Dropout(p=drop_rate)
# stochastic depth decay rule
dpr = [x.item() for x in torch.linspace(0, drop_path_rate, depth)]
# transformer block
self.blocks = nn.Sequential(*[
Block(
dim=embed_dim, num_heads=num_heads, mlp_ratio=mlp_ratio, qkv_bias=True, drop=drop_rate,
attn_drop=attn_drop_rate, drop_path=dpr[i], norm_layer=nn.LayerNorm, act_layer=nn.GELU)
for i in range(depth)])
# classification head
self.norm = nn.LayerNorm(embed_dim)
self.head = nn.Linear(embed_dim, num_classes)
# weight init
trunc_normal_(self.pos_embed, std=0.02)
trunc_normal_(self.cls_token, std=0.02)
self.apply(self._init_weights)
def _init_weights(self, m):
if isinstance(m, nn.Linear):
trunc_normal_(m.weight, std=0.02)
if m.bias is not None:
nn.init.constant_(m.bias, 0)
elif isinstance(m, nn.LayerNorm):
nn.init.constant_(m.bias, 0)
nn.init.constant_(m.weight, 1.0)
def forward_features(self, x):
# patch embedding
x = self.patch_embed(x)
cls_token = self.cls_token.expand(x.shape[0], -1, -1)
x = torch.cat((cls_token, x), dim=1)
# position embedding
pos_embed = self.pos_embed.expand(x.shape[0], -1, -1)
x = self.pos_drop(x + pos_embed)
# transformer block
x = self.blocks(x)
return x
def forward_head(self, x):
# only use cls token for classification
x = self.norm(x)
outcome = x[:, 0]
return self.head(outcome)
def forward(self, x):
x = self.forward_features(x)
x = self.forward_head(x)
return x
```
We have further decoupled the forward function into `forward_features` and `forward_head`:
- `forward_features`: extract the image features using the `patch_embed` layer and a stack of `transformer` blocks
- `forward_head`: take the `cls_token` of each sample and use `nn.Linear` for classification
## Implement 3D parallel Vision Transformer in LiBai
In this section, we will show users how to use [libai.layers](https://libai.readthedocs.io/en/latest/modules/libai.layers.html) to build a 3D-parallel Vision Transformer model in only 100+ lines of code, modified from [libai.models.vision_transformer](https://github.com/Oneflow-Inc/libai/blob/main/libai/models/vision_transformer.py).
Here is the LiBai implementation of Vision Transformer models, and users only need to replace the PyTorch modules with the corresponding `libai.layers` as follows:
```python
# LiBai's implementation of Vision Transformer
import oneflow as flow
import oneflow.nn as nn
from flowvision.layers.weight_init import trunc_normal_
import libai.utils.distributed as dist
from libai.config.config import configurable
from libai.layers import LayerNorm, Linear, PatchEmbedding, TransformerLayer
"""
LiBai has already implemented:
1. PatchEmbedding Layer
2. Transformer Layer: Self-Attention + MLP + DropPath + LayerNorm
3. Linear Layer
We can directly build a Vision Transformer model with the built-in layers in LiBai as follows:
"""
class VisionTransformer(nn.Module):
def __init__(
self,
img_size=224,
patch_size=16,
in_chans=3,
embed_dim=192,
depth=12,
num_heads=3,
mlp_ratio=4.0,
drop_rate=0.0,
attn_drop_rate=0.0,
drop_path_rate=0.0,
num_classes=1000,
loss_func=None
):
super().__init__()
self.num_classes = num_classes
# patch embedding
self.patch_embed = PatchEmbedding(
img_size=img_size,
patch_size=patch_size,
in_chans=in_chans,
embed_dim=embed_dim,
)
num_patches = self.patch_embed.num_patches
# cls token and position embedding with sbp signature
self.cls_token = nn.Parameter(
flow.zeros(1, 1, embed_dim,
sbp=dist.get_nd_sbp([flow.sbp.broadcast, flow.sbp.broadcast]),
placement=dist.get_layer_placement(0),)
)
self.pos_embed = nn.Parameter(
flow.zeros(1, num_patches+1, embed_dim,
sbp=dist.get_nd_sbp([flow.sbp.broadcast, flow.sbp.broadcast]),
placement=dist.get_layer_placement(0),)
)
self.pos_drop = nn.Dropout(p=drop_rate)
# stochastic depth decay rule
dpr = [x.item() for x in flow.linspace(0, drop_path_rate, depth)]
# a stack of transformer block
ffn_size = int(embed_dim * mlp_ratio)
self.blocks = nn.Sequential(
*[
TransformerLayer(
hidden_size=embed_dim,
ffn_hidden_size=ffn_size,
num_attention_heads=num_heads,
attention_dropout_prob=attn_drop_rate,
output_dropout_prob=drop_rate,
drop_path_prob=dpr[i],
layer_idx=i,
)
for i in range(depth)
]
)
self.norm = LayerNorm(embed_dim, layer_idx=-1)
self.head = Linear(embed_dim, num_classes, layer_idx=-1)
# implement loss function in nn.Module to match LiBai style
self.loss_func = nn.CrossEntropyLoss() if loss_func is None else loss_func
# weight init function
trunc_normal_(self.pos_embed, std=0.02)
trunc_normal_(self.cls_token, std=0.02)
self.apply(self._init_weights)
def _init_weights(self, m):
if isinstance(m, Linear):
trunc_normal_(m.weight, std=0.02)
if m.bias is not None:
nn.init.constant_(m.bias, 0)
elif isinstance(m, LayerNorm):
nn.init.constant_(m.bias, 0)
nn.init.constant_(m.weight, 1.0)
def forward_features(self, x):
# patch embedding
x = self.patch_embed(x)
cls_token = self.cls_token.expand(
x.shape[0], -1, -1
) # stole cls_tokens impl from Phil Wang, thanks
cls_token = cls_token.to_global(sbp=x.sbp, placement=cls_token.placement)
x = flow.cat((cls_token, x), dim=1)
# position embedding
pos_embed = self.pos_embed.expand(x.shape[0], -1, -1)
pos_embed = pos_embed.to_global(sbp=x.sbp, placement=pos_embed.placement)
x = self.pos_drop(x + pos_embed)
# transformer block
x = self.blocks(x)
return x
def forward_head(self, x):
x = self.norm(x)
outcome = x[:, 0]
outcome = self.head(outcome)
return outcome
def forward(self, images, labels=None):
x = self.forward_features(images)
x = self.forward_head(x)
if labels is not None and self.training:
losses = self.loss_func(x, labels)
return {"losses": losses}
else:
return {"prediction_scores": x}
@staticmethod
def set_pipeline_stage_id(model):
dist_utils = dist.get_dist_util()
# Set pipeline parallelism stage_id
for module_block in model.modules():
if isinstance(module_block.to(nn.Module), PatchEmbedding):
module_block.to(nn.graph.GraphModule).set_stage(dist_utils.get_layer_stage_id(0))
elif isinstance(module_block.to(nn.Module), TransformerLayer):
module_block.to(nn.graph.GraphModule).set_stage(dist_utils.get_layer_stage_id(module_block.layer_idx))
# Set pos_embed and cls_token stage id
model.pos_embed.to(nn.graph.GraphModule).set_stage(dist_utils.get_layer_stage_id(0))
model.cls_token.to(nn.graph.GraphModule).set_stage(dist_utils.get_layer_stage_id(0))
model.pos_drop.to(nn.graph.GraphModule).set_stage(dist_utils.get_layer_stage_id(0))
model.norm.to(nn.graph.GraphModule).set_stage(dist_utils.get_layer_stage_id(-1))
model.head.to(nn.graph.GraphModule).set_stage(dist_utils.get_layer_stage_id(-1))
model.loss_func.to(nn.graph.GraphModule).set_stage(dist_utils.get_layer_stage_id(-1))
```
## Details about LiBai's implementation of the Vision Transformer model
**1. Replace nn.Module with libai.layers**
LiBai already implements the `PatchEmbedding`, `TransformerLayer`, `Linear`, and `LayerNorm` layers; users only need to replace the corresponding modules in the Torch Vision Transformer model to convert it into LiBai's style:
- `Block` -> `libai.layers.TransformerLayer`
- `nn.Linear` -> `libai.layers.Linear`
- `nn.LayerNorm` -> `libai.layers.LayerNorm`
- `PatchEmbed` -> `libai.layers.PatchEmbedding`
**2. Manually set the SBP signature of `cls_token` and `pos_embed`**
In order to fit the different parallel modes in LiBai, users must manually set the [SBP signature](https://docs.oneflow.org/en/master/parallelism/02_sbp.html#spb-signature) for all parameters and buffers of layers not implemented in LiBai, such as `cls_token` and `pos_embed` in the Vision Transformer:
```python
import oneflow as flow
import oneflow.nn as nn
import libai.utils.distributed as dist
self.cls_token = nn.Parameter(
flow.zeros(
1, 1, embed_dim,
sbp=dist.get_nd_sbp([flow.sbp.broadcast, flow.sbp.broadcast]),
placement=dist.get_layer_placement(0),
)
)
self.pos_embed = nn.Parameter(
flow.zeros(
1, num_patches+1, embed_dim,
sbp=dist.get_nd_sbp([flow.sbp.broadcast, flow.sbp.broadcast]),
placement=dist.get_layer_placement(0),)
)
```
- The SBP signature returned by `dist.get_nd_sbp([flow.sbp.broadcast, flow.sbp.broadcast])` means to broadcast `cls_token` and `pos_embed` across each GPU group.
**3. Use the `to_global()` function to update the SBP signature of `cls_token` and `pos_embed` in the forward function**
In the forward function, `cls_token` and `pos_embed` are expanded to fit the input size. For efficiency, we can use the `to_global()` function to match the SBP signature of `cls_token` and `pos_embed` with that of the input like this:
```python
def forward_features(self, x):
cls_token = self.cls_token.expand(
x.shape[0], -1, -1
)
# use to_global to update the sbp signature of cls_token
cls_token = cls_token.to_global(sbp=x.sbp, placement=cls_token.placement)
x = flow.cat((cls_token, x), dim=1)
# use to_global to update the sbp signature of pos_embed
pos_embed = self.pos_embed.expand(x.shape[0], -1, -1)
pos_embed = pos_embed.to_global(sbp=x.sbp, placement=pos_embed.placement)
```
**4. Manually set the stage id for pipeline parallel training**
Most of the built-in layers in LiBai have an argument named `layer_idx` for pipeline-parallel settings. To configure a **1F1B pipeline parallel** model, users should manually set the stage id for each layer in the model, which automatically assigns different layers to different stages and inserts buffers in the forward & backward computation for 1F1B pipeline-parallel training. With the help of `layer_idx`, we can simply get a pipeline-parallel Vision Transformer model like:
```python
import libai.utils.distributed as dist
"""
This is a staticmethod for class inherited from nn.Module,
"""
@staticmethod
def set_pipeline_stage_id(model):
dist_utils = dist.get_dist_util()
# Set pipeline parallelism stage_id
for module_block in model.modules():
# module_block.to(nn.Module) can get the original module
if isinstance(module_block.to(nn.Module), PatchEmbedding):
module_block.to(nn.graph.GraphModule).set_stage(dist_utils.get_layer_stage_id(0))
elif isinstance(module_block.to(nn.Module), TransformerLayer):
module_block.to(nn.graph.GraphModule).set_stage(dist_utils.get_layer_stage_id(module_block.layer_idx))
# Set pos_embed and cls_token stage id
    model.pos_embed.to(nn.graph.GraphModule).set_stage(dist_utils.get_layer_stage_id(0))
model.cls_token.to(nn.graph.GraphModule).set_stage(dist_utils.get_layer_stage_id(0))
model.pos_drop.to(nn.graph.GraphModule).set_stage(dist_utils.get_layer_stage_id(0))
model.norm.to(nn.graph.GraphModule).set_stage(dist_utils.get_layer_stage_id(-1))
model.head.to(nn.graph.GraphModule).set_stage(dist_utils.get_layer_stage_id(-1))
model.loss_func.to(nn.graph.GraphModule).set_stage(dist_utils.get_layer_stage_id(-1))
```
Manually set the stage id:
- `PatchEmbedding` should be on the first stage
- `TransformerLayer` layers are automatically assigned to stages via the `layer_idx` argument
- `cls_token`, `pos_embed`, `pos_drop` should be on the first stage
- `norm`, `head` and `loss_func` should be on the last stage
Please see [Write your own pipeline parallel model](https://libai.readthedocs.io/en/latest/tutorials/advanced_tutorials/customize_parallel.html#write-your-own-pipeline-parallel-model) for more details about the settings of pipeline parallel training in LiBai.
# How to load pretrained model in LiBai
In this tutorial, we introduce how to instantiate a pretrained model in OneFlow.
## Steps
First, prepare the pretrained model weights, which can be in `OneFlow` or `HuggingFace` format.
- `OneFlow`'s pretrained model weights are saved using `oneflow.save()`.
- `Huggingface`'s pretrained model weights file (`pytorch_model.bin`) can be downloaded from https://huggingface.co/models.
Second, prepare the config file.
> The config file is required when loading the `HuggingFace` model.
> `OneFlow`'s config file can be imported directly from `configs/common/models`.
- `Huggingface`'s config file (`config.json`) can be downloaded from https://huggingface.co/models.
Finally, the structure of the pretrained model folder should look like this:
```bash
# OneFlow pretrained model
$ tree pretrained_model_dir
path/to/pretrained_model_dir/
└── oneflow_model
# Huggingface pretrained model
$ tree pretrained_model_dir
path/to/pretrained_model_dir/
├── pytorch_model.bin
└── config.json
```
## Start Loading
You can load a pretrained BERT as follows:
```python
import libai
from libai.models.utils import BertLoaderHuggerFace, BertLoaderLiBai
from libai.config.configs.common.models.bert import cfg
# load huggingface weight
loader = BertLoaderHuggerFace(
model=libai.models.BertModel,
libai_cfg=cfg,
pretrained_model_path='path/to/huggingface_pretrained_model_directory',
hidden_dropout_prob=0,
)
bert = loader.load()
# load libai weight
loader = BertLoaderLiBai(
model=libai.models.BertModel,
libai_cfg=cfg,
pretrained_model_path='path/to/libai_pretrained_model_directory',
hidden_dropout_prob=0,
)
bert = loader.load()
```
# Use Custom ModelLoader
## Model Loader for HuggerFace
If you want to define your own HuggerFace model loader, you can inherit the base `ModelLoaderHuggerFace` class in `libai.models.utils.model_utils.base_loader`.
Then you need to overwrite the `_convert_state_dict` and `_load_config_from_json` methods to load HuggingFace's pretrained model in LiBai.
Finally, you need to set the `base_model_prefix_1` and `base_model_prefix_2` attributes, which represent the base model name for HuggingFace and LiBai respectively.
The following code shows how to use custom ModelLoaderHuggerFace:
```python
from libai.models.utils import ModelLoaderHuggerFace
class ToyModelLoaderHuggerFace(ModelLoaderHuggerFace):
def __init__(self, model, libai_cfg, pretrained_model_path, **kwargs):
super().__init__(model, libai_cfg, pretrained_model_path, **kwargs)
"""NOTE: base_model_prefix_1 is ToyModel's prefix in Transformers.
base_model_prefix_2 is ToyModel's prefix in LiBai."""
self.base_model_prefix_1 = "toy_model"
self.base_model_prefix_2 = "toy_model"
def _convert_state_dict(self, flow_state_dict, cfg):
"""Convert state_dict's keys to match model.
Args:
flow_state_dict (OrderedDict): model state dict.
cfg (dict): model's default config dict in LiBai.
Returns:
OrderedDict: flow state dict.
"""
...
def _load_config_from_json(self, config_file):
"""load config from `config.json`, and update default config.
Args:
config_file (str): Path of config file.
"""
...
```
## Model Loader for LiBai
If you want to define your own LiBai model loader, you can inherit the base `ModelLoaderLiBai` class in `libai.models.utils.model_utils.base_loader`.
You just need to set the `base_model_prefix_2` attribute to load LiBai's pretrained model.
The following code shows how to use custom ModelLoaderLiBai:
```python
from libai.models.utils import ModelLoaderLiBai
class ToyModelLoaderLiBai(ModelLoaderLiBai):
def __init__(self, model, libai_cfg, pretrained_model_path, **kwargs):
super().__init__(model, libai_cfg, pretrained_model_path, **kwargs)
self.base_model_prefix_2 = "toy_model"
```
# Detailed instruction for using distributed inference in LiBai
If you want to use distributed inference in LiBai with a pretrained `pytorch` model, you can refer to the [DALLE2 inference doc](https://github.com/Oneflow-Inc/libai/blob/main/docs/source/notes/How_to_use_model_parallel_in_LiBai.md). A [Chinese doc for distributed inference](https://github.com/Oneflow-Inc/libai/discussions/386) is also available.
Here we introduce how to use distributed inference in LiBai:
## Check `model.py`
Check your `model.py` first:
1. Ensure there are `libai.layers` in your `model.py`:
```python
# NOTE: you don't need to import all layers from libai; if you only use libai.layers.Linear
# in your `model.py`, your model will run model/pipeline parallel only in the `Linear` layers
from libai.layers import (
Linear,
LayerNorm,
...
)
```
2. If you want to run pipeline parallel in LiBai, you should additionally insert the code `x = x.to_global(placement=target_tensor.placement)` in your `model.forward()`.
It is equivalent to the torch code `x.to(cuda_device)`, which moves a tensor from GPU A to GPU B. There are many examples in LiBai: [example1](https://github.com/Oneflow-Inc/libai/blob/92dbe7c1b1496290e32e595f8473f9288ea1886e/projects/MT5/layers/attention_layer.py#L220), [example2](https://github.com/Oneflow-Inc/libai/blob/92dbe7c1b1496290e32e595f8473f9288ea1886e/projects/MT5/layers/attention_layer.py#L156) ...
If you don't know where to insert the code, you can run your model first, and it will raise an error on the line that needs `to_global` (a sketch follows the error message below).
For example:
```shell
File "libai/libai/layers/layer_norm.py", line 129, in forward
return flow._C.rms_layer_norm(hidden_states, self.weight, self.l2norm_epsilon) RuntimeError: return flow._C.rms_layer_norm(hidden_states, self.weight, self.l2norm_epsilon)RuntimeErrorExpected all tensors to be on the same placement, but found at least two placements, oneflow.placement(type="cuda", ranks=[0, 1]) (positional 0) and oneflow.placement(type="cuda", ranks=[2, 3]) (positional 1)!
```
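Here is a minimal sketch of such a placement transition inside a model's `forward` (the module and attribute names are illustrative, not LiBai API):
```python
from oneflow import nn


class ToyStage(nn.Module):
    def __init__(self, layer):
        super().__init__()
        # e.g., a libai.layers.Linear whose weights live on a later pipeline stage
        self.layer = layer

    def forward(self, x):
        # Move the activation to the placement of this stage's weights before using them,
        # similar to `x.to(cuda_device)` in torch.
        x = x.to_global(placement=self.layer.weight.placement)
        return self.layer(x)
```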
## Build `config.py`
If your model was trained with LiBai, you can use the same `config.py` from training; refer to [Couplets](https://github.com/Oneflow-Inc/libai/tree/main/projects/Couplets#inference) for more details.
If your model was trained with another framework, you should build your own `inference_config.py`; you can refer to [`dalle2_config.py`](https://github.com/Oneflow-Inc/libai/blob/main/projects/DALLE2/configs/dalle2_config.py) and [`t5_inference_config.py`](https://github.com/Oneflow-Inc/libai/blob/main/projects/MT5/configs/t5_inference.py).
## Refine `pipeline_inference.py`
The base class [libai/inference/basic.py](https://github.com/Oneflow-Inc/libai/blob/main/libai/inference/basic.py) is already provided in `LiBai`.
Users only need to overload the functions they need; refer to [text_generation.py](https://github.com/Oneflow-Inc/libai/blob/main/libai/inference/text_generation.py).
If your model was trained with `LiBai`, it is easy to use; you can refer to [distribute_infer.py](https://github.com/Oneflow-Inc/libai/blob/main/projects/Couplets/distribute_infer.py) in [Couplets](https://github.com/Oneflow-Inc/libai/tree/main/projects/Couplets).
If your model was trained with another framework, you need to build your own `model_loader` to load your model weights in LiBai; refer to [model_loader](https://libai.readthedocs.io/en/latest/notes/How_to_load_huggingface%27s_pretrained_model_in_libai.html) for more details.
Here is a simple example of the functions overloaded in `LiBai`:
```python
from libai.inference.basic import BasePipeline
from libai.utils import distributed as dist
class MyPipeline(BasePipeline):
def _parse_parameters(self, **pipeline_parameters):
        # By overloading this function, the input parameters of MyPipeline.__call__() are
        # handed out to the preprocess/forward/postprocess stages of inference.
preprocess_params = {
"preprocess_param1": pipeline_parameters["preprocess_param1"],
"preprocess_param2": pipeline_parameters["preprocess_param2"],
}
forward_params = {
"forward_param": pipeline_parameters["forward_param"]
}
postprocess_params = {
"postprocess_param": pipeline_parameters["postprocess_param"]
}
return preprocess_params, forward_params, postprocess_params
def load_pretrain_weight(self, libai_cfg_model, model_path, mode="myloader"):
        # load your pretrained weights in this function
        # set your own "MyLoader" if your model was pretrained with another framework
        # set mode to "libai" if your model was pretrained with libai
if mode == "myloader":
import MyLoader
model_loader = MyLoader(
libai_cfg_model,
libai_cfg_model.cfg,
model_path,
...,
)
return model_loader.load()
else:
return super().load_pretrain_weight(
libai_cfg_model,
model_path,
mode=mode,
)
def preprocess(self, inputs, preprocess_param1, preprocess_param2, **kwargs):
...
# model_input_dict: {"key1": flow.Tensor1, ...}
return model_input_dict
def forward(self, model_input_dict, forward_param, **kwargs):
...
model_output_dict = self.model(**model_input_dict)
return model_output_dict
def postprocess(self, model_output_dict, postprocess_param, **kwargs):
...
return out_dict
if __name__ == "__main__":
pipeline = MyPipeline(
"path/to/myconfig.py",
data_parallel=1,
tensor_parallel=...,
pipeline_parallel=...,
pipeline_stage_id=...,
pipeline_num_layers=...,
model_path=...,
mode=...,
)
out = pipeline(
input_text=...,
preprocess_param1=...,
preprocess_param2=...,
forward_param=...,
postprocess_param=...,
)
if dist.is_main_process():
print(out)
```
## Distributed run `pipeline_inference.py`
To run the model on 2 nodes with 4 GPUs in total:
On `node0`, run:
```bash
NODE=2 NODE_RANK=0 ADDR=192.168.0.1 PORT=12345 bash tools/infer.sh pipeline_inference.py 2
```
- `NODE=2` is the total number of nodes
- `NODE_RANK=0` means the current node is node0
- `ADDR=192.168.0.1` is the IP address of node0
- `PORT=12345` is the port of node0
On `node1`, run:
```bash
NODE=2 NODE_RANK=1 ADDR=192.168.0.1 PORT=12345 bash tools/infer.sh pipeline_inference.py 2
```
- `NODE=2` is the total number of nodes
- `NODE_RANK=1` means the current node is node1
- `ADDR=192.168.0.1` is the IP address of node0
- `PORT=12345` is the port of node0
# How to use Huggingface's pretrained weights in LiBai
The built-in layers in [LiBai](https://github.com/Oneflow-Inc/libai) adopt a structure that is more suitable for parallel training, so the implementation in LiBai may differ slightly from that in Huggingface. In this tutorial, we introduce how to correctly load Huggingface's pretrained weights into LiBai's model. Let's take BERT as an example.
## LiBai Transformer vs Huggingface Transformer
There are subtle differences in the BERT structure as shown in the following figure (left: LiBai, right: Huggingface), which can be summarized as:
- Location of layernorm: The location of layernorm is different, but the calculation order is the same.
- A different slicing way to get the `query`, `key` and `value` matrix.
- LiBai follows [Megatron-LM](https://github.com/NVIDIA/Megatron-LM) and uses its order of layernorm and residual connections by default. Megatron-LM shows that this structure eliminates instabilities and yields a lower training loss. LiBai can also support the original BERT architecture described in the [paper](https://arxiv.org/pdf/1810.04805.pdf) by setting `apply_residual_post_layernorm=True`.
![architecture](./assets/architecture.jpg)
## QKV slicing logic
LiBai's QKV slicing logic is different from that in Huggingface.
```python
# LiBai's QKV slicing logic
query_key_value = query_key_value.view(batch_size, -1, num_heads, 3 * head_size)
query_key_value = query_key_value.permute(0, 2, 1, 3)
query, key, value = flow.chunk(query_key_value, chunks=3, dim=-1)
# Huggingface's QKV slicing logic
query, key, value = flow.chunk(query_key_value, chunks=3, dim=-1)
query = query.view(query.size(0), query.size(1), num_heads, -1).permute(0, 2, 1, 3)
key = key.view(key.size(0), key.size(1), num_heads, -1).permute(0, 2, 1, 3)
value = value.view(value.size(0), value.size(1), num_heads, -1).permute(0, 2, 1, 3)
```
## How to correctly load QKV weights
- To correctly load Huggingface's transformer weights, you only need to rearrange the loaded weights as follows:
```python
def convert_qkv_weight(cfg, qkv_weight, qkv_bias):
qkv_weight = qkv_weight.view([3, cfg.num_heads, cfg.head_size, cfg.hidden_size])
qkv_weight = qkv_weight.permute(1, 0, 2, 3).contiguous().view(3*cfg.hidden_size, cfg.hidden_size)
qkv_bias = qkv_bias.view(3, cfg.num_heads, cfg.head_size)
qkv_bias = qkv_bias.permute(1,0,2).contiguous().view(-1)
return qkv_weight, qkv_bias
```
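As a usage sketch (not LiBai's actual loader code), the separate Huggingface `query`/`key`/`value` weights can first be concatenated into the fused `[3*hidden_size, hidden_size]` layout and then rearranged with `convert_qkv_weight` above; the sizes and the `SimpleNamespace` config are illustrative:
```python
from types import SimpleNamespace

import oneflow as flow

# Illustrative BERT-base sizes: hidden_size = num_heads * head_size.
cfg = SimpleNamespace(hidden_size=768, num_heads=12, head_size=64)

# Stand-ins for the query/key/value weights and biases of one Huggingface BERT layer.
hf_q_w, hf_k_w, hf_v_w = (flow.randn(cfg.hidden_size, cfg.hidden_size) for _ in range(3))
hf_q_b, hf_k_b, hf_v_b = (flow.randn(cfg.hidden_size) for _ in range(3))

# Fuse into the [3*hidden_size, hidden_size] layout, then rearrange into LiBai's
# per-head (q, k, v) interleaved layout.
qkv_weight = flow.cat([hf_q_w, hf_k_w, hf_v_w], dim=0)
qkv_bias = flow.cat([hf_q_b, hf_k_b, hf_v_b], dim=0)
qkv_weight, qkv_bias = convert_qkv_weight(cfg, qkv_weight, qkv_bias)
```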
- For detailed examples, please refer to [load-huggingface-bert](https://github.com/Oneflow-Inc/libai/tree/test_bert_load_huggingface_weight/projects/test_bert_load_huggingface_weight). You can verify this by running:
```bash
bash test.sh
```
# Detailed instruction on using model parallel in LiBai
This document is a tutorial on how to transfer a PyTorch model to OneFlow and use model parallelism in LiBai for inference. We will first take the DALLE2 model as an example, and then show how model parallelism can easily be enabled in LiBai.
**Note**: the code of DALLE2 is adapted from [this repo](https://github.com/lucidrains/DALLE2-pytorch), which is an unofficial implementation. The final result may differ from the original generated images in the [paper](https://arxiv.org/abs/2204.06125). You can also try the model in [google colab](https://colab.research.google.com/github/LAION-AI/dalle2-laion/blob/main/notebooks/dalle2_laion_alpha.ipynb).
## Transfer the pytorch model to oneflow
It's easy for users to transfer a PyTorch model to OneFlow, since most of OneFlow's API is consistent with PyTorch. First we change `import torch` to `import oneflow as flow`, and then we can replace all `torch` in the code with `flow`. If the model works correctly in the original PyTorch code, it is likely to work correctly in OneFlow. Sometimes the program may raise an error like
```
AttributeError: module 'oneflow' has no attribute 'xxx'
```
Try installing the latest version of oneflow, which might help; you can find more details [here](https://github.com/Oneflow-Inc/oneflow#install-oneflow).
**1. Download the pytorch DALLE2 model**:
As shown in the [google colab](https://colab.research.google.com/github/LAION-AI/dalle2-laion/blob/main/notebooks/dalle2_laion_alpha.ipynb), we will use version 0.15.4:
```
git clone -b v0.15.4 https://github.com/lucidrains/DALLE2-pytorch.git
```
The pretrained model weights can be found on Hugging Face: [the prior weight](https://huggingface.co/zenglishuci/conditioned-prior/resolve/main/vit-l-14/prior_aes_finetune.pth) and [the decoder weight](https://huggingface.co/laion/DALLE2-PyTorch/resolve/main/decoder/1.5B_laion2B/latest.pth).
A simple inference script can be written as:
```python
# inference_dalle2.py
import numpy as np
import torch
import os,sys
from dalle2_pytorch import tokenizer
from dalle2_pytorch import OpenAIClipAdapter, DALLE2, DiffusionPriorNetwork, DiffusionPrior, Unet, Decoder
def generate_images_from_text(texts):
    clip = OpenAIClipAdapter("ViT-L-14.pt").to("cuda")
    tokens = tokenizer.tokenize(texts).to("cuda")
_, text_encodings, text_mask = clip.embed_text(tokens)
prior_network = DiffusionPriorNetwork(
dim = 768,
depth = 24,
num_timesteps = 1000,
num_time_embeds = 1,
num_image_embeds=1,
num_text_embeds = 1,
dim_head = 64,
heads = 32,
ff_mult = 4,
attn_dropout = 0.05,
ff_dropout = 0.05,
normformer = True,
)
diffusion_prior = DiffusionPrior(
net = prior_network,
clip = clip,
image_embed_dim = 768,
timesteps = 1000,
cond_drop_prob = 0.1,
loss_type="l2",
condition_on_text_encodings = True
)
state_dict = torch.load("prior_aes_finetune.pth", map_location="cpu")['ema_model']
diffusion_prior.load_state_dict(state_dict, strict=True)
diffusion_prior.to("cuda")
image_embed = diffusion_prior.sample(tokens, num_samples_per_batch = 2, cond_scale = 1.)
unet = Unet(
dim = 320,
image_embed_dim = 768,
text_embed_dim = 768,
cond_dim = 512,
channels = 3,
dim_mults=(1, 2, 3, 4),
num_resnet_blocks = 4,
attn_heads = 8,
attn_dim_head = 64,
sparse_attn = True,
memory_efficient = True,
cond_on_text_encodings = True, # set to True for any unets that need to be conditioned on text encodings
self_attn = [False, True, True, True]
)
decoder = Decoder(
unet = (unet,),
image_sizes = [64],
clip = clip,
channels = 3,
timesteps = 1000,
loss_type = "l2",
beta_schedule = ["cosine"],
learned_variance = True
)
state_dict = torch.load("latest.pth", map_location = "cpu")
new_dict = {}
for k,v in state_dict.items():
if 'clip.' in k: continue
if ('cross_attn' in k or 'fn.fn.' in k) and k.endswith(".g"):
k = k[:-1] + "gamma"
new_dict[k] = v
assert k in decoder.state_dict().keys(), k
decoder.load_state_dict(new_dict, strict=False)
decoder.to("cuda")
images = decoder.sample(image_embed = image_embed, text_encodings = text_encodings, text_mask = text_mask, cond_scale = 3.5)
return images
def save_images(images):
import torchvision.transforms as T
to_pil = T.ToPILImage()
images = list(map(to_pil,images.unbind(dim = 0)))
for i,image in enumerate(images):
image.save(f"./result_{i}.png")
def main():
text = ["a dolphin in an astronaut suit on saturn, artstation"]
    images = generate_images_from_text(text)
save_images(images)
if __name__ == "__main__":
main()
```
Run `python inference_dalle2.py`; this should work.
## 2. Change the deep learning framework to oneflow
As mentioned above, we replace all `torch` symbols with `flow` by first changing `import torch` to `import oneflow as flow` in all Python files.
Note that the original PyTorch code also imports other Python packages that use the PyTorch backend, such as [einops](https://github.com/arogozhnikov/einops), [einops_ext](https://github.com/lucidrains/einops-exts), and [kornia](https://github.com/kornia/kornia), which should be modified at the same time.
Fortunately, only a few APIs of these packages are used, so we can take the relevant code out of their GitHub repos and merge it into a separate file.
For example, we can simply create an `einops_ext.py` file adapted from [here](https://github.com/lucidrains/einops-exts/blob/main/einops_exts/einops_exts.py), and then import `einops_ext` from the Python file that uses oneflow instead of the torch-based package.
```python
# einops_ext.py
import re
from oneflow import nn  # here change `from torch import nn` to `from oneflow import nn`
from functools import wraps, partial
from einops import rearrange, reduce, repeat
```
## 3. Using LiBai's API
[LiBai](https://github.com/Oneflow-Inc/libai) is a large-scale open-source model training toolbox based on OneFlow.
LiBai provides many efficient APIs that can easily be used for distributed training and evaluation. It also supports some popular models under the projects folder, such as [CLIP](https://github.com/Oneflow-Inc/libai/tree/main/projects/CLIP). To avoid duplicated work, we directly use the CLIP model implemented in LiBai. The relevant code in the original PyTorch implementation is the `OpenAIClipAdapter` class, which can be written as follows:
```python
# _clip.py
import os
import sys
import oneflow as flow
import oneflow.nn.functional as F
from oneflow import nn
from collections import namedtuple
def import_flow_clip(fn):
def wrapper(*args, **kwargs):
sys.path.append(os.path.join(os.path.abspath(os.path.join(os.path.dirname(__file__), "../..")), "CLIP"))
fn(*args, **kwargs)
sys.path.pop()
return wrapper
class BaseClipAdapter(nn.Module):
pass
class OpenAIClipAdapter(BaseClipAdapter):
@import_flow_clip
def __init__(
self,
name = 'ViT-L/14'
):
import clip
openai_clip, preprocess = clip.load(name)
super().__init__(openai_clip)
```
[DiffusionPrior](https://github.com/lucidrains/DALLE2-pytorch/blob/v0.15.4/dalle2_pytorch/dalle2_pytorch.py#L873) and [Decoder](https://github.com/lucidrains/DALLE2-pytorch/blob/v0.15.4/dalle2_pytorch/dalle2_pytorch.py#L1802) follow their original implementation.
**Using libai.layers**
LiBai provides multiple parallelisms such as Data Parallelism, Tensor Parallelism, and Pipeline Parallelism. To experience these features, we will use libai.layers like Linear and LayerNorm:
```python
from libai.layers import Linear, LayerNorm
```
The `nn.Linear` layers will be replaced with `libai.layers.Linear`.
**Compare the outputs** To make sure the code is correctly modified from `torch` to `flow`, it's necessary to compare the outputs and check that they are the same after the change. A notable point here is that in the sampling stage, the noise is randomly generated, like
```python
noise = flow.randn(shape)
# or noise = torch.randn(shape) in torch code
```
torch and oneflow will generate different numbers here even if they are given the same random seed. An alternative is to make a transition through numpy:
```python
import numpy as np
np.random.seed(6666)
noise = flow.tensor(np.random.randn(*shape))
# or noise = torch.tensor(np.random.randn(*shape)) in torch code
```
When the model is fed the same input text, the output images from the oneflow and torch code should be the same.
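For example, intermediate outputs can be compared through numpy; `flow_out` and `torch_out_numpy` below are placeholders for an oneflow output and the corresponding torch output already converted to numpy:
```python
import numpy as np
import oneflow as flow

flow_out = flow.ones(2, 3)                            # placeholder for an oneflow output
torch_out_numpy = np.ones((2, 3), dtype=np.float32)   # placeholder for a torch output as numpy
assert np.allclose(flow_out.numpy(), torch_out_numpy, atol=1e-4)
```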
**LazyConfig and LazyCall**
LiBai provides the LazyConfig system for more flexible syntax with no predefined structures; find more details [here](https://libai.readthedocs.io/en/latest/tutorials/basics/Config_System.html). For DALLE2, the config file can be written as:
```python
from omegaconf import DictConfig
from libai.config import LazyCall
from dalle2.models import DiffusionPrior, DiffusionPriorNetwork, Unet, Decoder, DALLE2
from dalle2._clip import OpenAIClipAdapter
clip = LazyCall(OpenAIClipAdapter)(name="./dalle2/model_weights/ViT-L-14.pt")
prior = LazyCall(DiffusionPrior)(
net = LazyCall(DiffusionPriorNetwork)(
dim=768,
depth=24,
num_timesteps=1000,
max_text_len=77,
num_time_embeds=1,
num_image_embeds=1,
num_text_embeds=1,
dim_head=64,
heads=32,
ff_mult=4,
attn_dropout=0.05,
ff_dropout=0.05,
normformer=True,
),
clip=clip,
image_embed_dim=768,
timesteps=1000,
cond_drop_prob=0.1,
loss_type="l2",
condition_on_text_encodings=True
)
unet1 = LazyCall(Unet)(
dim=320,
image_embed_dim=768,
text_embed_dim=768,
cond_dim=512,
channels=3,
dim_mults=(1, 2, 3, 4),
num_resnet_blocks=4,
attn_heads=8,
attn_dim_head=64,
sparse_attn=True,
memory_efficient=True,
cond_on_text_encodings=True, # set to True for any unets that need to be conditioned on text encodings
self_attn=[False, True, True, True]
)
decoder = LazyCall(Decoder)(
unet=(unet1,),
image_sizes=[64, ],
clip=None,
channels=3,
timesteps=1000,
loss_type="l2",
beta_schedule=["cosine"],
learned_variance=True
)
dalle2_model = LazyCall(DALLE2)(
prior=prior,
decoder=decoder,
prior_weight_path='',
decoder_weight_path=''
)
```
## 4. Model parallel in LiBai
In order to run model-parallel inference in LiBai, we should set the parallel mode according to our needs. The default value of the `parallel` argument in `libai.layers.Linear` is `data`, which means data parallel. To enable model parallelism, we change `parallel` to `col` or `row`. The most efficient way is to arrange the Linear layers in a col -> row -> col order.
A transformer block contains an attention submodule and a feed-forward submodule, and each submodule contains exactly 2 Linear layers.
The attention module contains the qkv projection and the output projection, so we set the qkv projection to `col` and the output projection to `row`:
```python
#attention
class Attention(nn.Module):
def __init__(self, *args, **kwargs):
super().__init__()
# 1、 qkv projection
self.to_q = Linear(dim, inner_dim, bias = False, parallel='col')
self.to_kv = Linear(dim, dim_head * 2, bias = False, parallel='col')
#2、 output projection
self.to_out = nn.Sequential(
Linear(inner_dim, dim, bias = False, parallel='row'), #'row'
LayerNorm(dim)
)
```
The feed-forward module contains an in-projection and an out-projection; the former is set to `col` and the latter to `row`.
```python
def FeedForward(
dim,
mult = 4,
dropout = 0.,
post_activation_norm = False
):
inner_dim = int(mult * dim)
return nn.Sequential(
LayerNorm(dim),
Linear(dim, inner_dim * 2, bias = False, parallel='col'),
SwiGLU(),
LayerNorm(inner_dim) if post_activation_norm else nn.Identity(),
nn.Dropout(dropout),
Linear(inner_dim, dim, bias = False, parallel='row')
)
```
For a single machine with 4 GPUs, model parallelism can be set up like this:
```python
import libai.utils.distributed as dist
dist.setup_dist_util(
DictConfig(
dict(
data_parallel_size=1,
tensor_parallel_size=4,
pipeline_parallel_size=1,
)
)
)
```
If you have successfully completed the above steps, you can now have fun with the (unofficial) DALLE2 model.
Notes
=============
.. toctree::
:glob:
:maxdepth: 2
How_to_use_huggingface's_weights_in_LiBai.md
How_to_build_vision_transformer_model_in_LiBai.md
How_to_load_huggingface's_pretrained_model_in_libai.md
How_to_use_model_parallel_in_LiBai.md
How_to_use_distributed_inference_in_LiBai.md
FAQ.md
# How to Customize Dataloader
Dataloader is the component that provides data to models. A dataloader usually (but not necessarily) takes raw information from datasets (see [write dataloaders](https://libai.readthedocs.io/en/latest/tutorials/basics/Write_Dataloaders.html)) and processes it into the format needed by the model.
## How the Existing Dataloader Works
LiBai contains a built-in data loading pipeline. It's beneficial to understand how it works, in case you need to write a custom one.
LiBai provides some functions [build_{image,nlp}_{train,test}_loader](https://libai.readthedocs.io/en/latest/modules/libai.data.html#libai.data.build.build_nlp_train_loader) that create a default dataloader from a given config. Here is how `build_{image,nlp}_{train,test}_loader` work:
1. It instantiates the `list[flow.utils.data.Dataset]` (e.g., `BertDataset`) by loading dataset items in a lightweight format. These dataset items are not yet ready to be used by the model (e.g., images are not loaded into memory, random augmentations have not been applied, etc.).
2. The output format of the dataset (`__getitem__(...)`) must be a dict whose keys are consistent with the argument names of the dataloader's consumer (usually `model.forward(...)`). The role of this step is to transform the lightweight representation of a dataset item into a format that is ready for the model to consume (including, e.g., reading images, performing random data augmentation, and converting to oneflow tensors). If you would like to perform custom transformations on the data, you often want to rewrite it. Details about the dataset format can be found in [write dataloaders](https://libai.readthedocs.io/en/latest/tutorials/basics/Write_Dataloaders.html) and in the dataset sketch after this list.
3. The outputs of the dataset are simply batched with the following function.
```python
def trivial_batch_collator(batch):
assert isinstance(batch[0], Instance), "batch[0] must be `instance` for trivial batch collator"
batch = Instance.stack(batch)
return batch
```
4. This batched data is the output of the dataloader. Typically, it's also the input of `get_batch`. After `get_batch(...)`, it becomes the input of `model.forward()`. `get_batch` simply changes the local tensors to global tensors with the given `sbp` and `placement` meta information.
```python
@classmethod
def get_batch(cls, data, mixup_func = None):
...
ret_dict = {}
for key, value in data.get_fields().items():
value.to_global()
ret_dict[key] = value.tensor
return ret_dict
```
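To illustrate step 2, here is a minimal dataset sketch that returns an `Instance` built from `DistTensorData`, following the pattern in LiBai's [write dataloaders](https://libai.readthedocs.io/en/latest/tutorials/basics/Write_Dataloaders.html) tutorial (the shapes, dataset length, and field names are placeholders):
```python
import oneflow as flow
from libai.data.structures import DistTensorData, Instance


class MyDataset(flow.utils.data.Dataset):
    def __len__(self):
        return 100

    def __getitem__(self, idx):
        # Keys must match the argument names of the consumer, e.g. model.forward(images, labels).
        return Instance(
            images=DistTensorData(flow.randn(3, 224, 224)),
            # placement_idx=-1 places the labels on the last pipeline stage.
            labels=DistTensorData(flow.tensor(0, dtype=flow.int64), placement_idx=-1),
        )
```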
## Use Custom Dataloader
If you use `DefaultTrainer`, you can overwrite its `build_train_loader` method to use your own dataloader, which can be implemented with any tools you like. But you need to make sure that each rank reads the data correctly under the different parallelism settings. A minimal sketch is shown below.
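A sketch of such an override, assuming `build_train_loader` only needs to return the training dataloader (check the exact signature of `DefaultTrainer` in your LiBai version); the flowvision CIFAR10 dataset is used only as a stand-in:
```python
import oneflow as flow
from flowvision import datasets, transforms
from libai.engine import DefaultTrainer


class MyTrainer(DefaultTrainer):
    @classmethod
    def build_train_loader(cls, cfg):
        # Any dataloader works here, as long as each rank reads the right shard of data.
        dataset = datasets.CIFAR10(
            root="./data", train=True, download=True, transform=transforms.ToTensor()
        )
        return flow.utils.data.DataLoader(dataset, batch_size=16, shuffle=True)
```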
Then you need to overwrite the `get_batch` method. The `data` argument in `get_batch` is the output of your dataloader. You need to convert the local tensors to global tensors manually, which means you should set the `sbp` and `placement` correctly.
Here is an example: the process on rank 0 reads all the data and redistributes it to the other ranks.
```python
@classmethod
def get_batch(cls, data, mixup_func=None):
if data is None:
# not rank0, set placeholders for data
# Note: make sure imgs and labels have the same shape and dtype on all ranks
imgs = flow.empty(16, 3, 224, 224, dtype=flow.float32)
labels = flow.empty(16, dtype=flow.int64)
else:
# rank0
imgs, labels = data
dist.synchronize()
    imgs = imgs.to_global(sbp=flow.sbp.broadcast, placement=flow.env.all_device_placement("cuda"))
    imgs = imgs.to_global(
        sbp=dist.get_nd_sbp([flow.sbp.split(0),
                             flow.sbp.broadcast]),
        placement=dist.get_layer_placement(0))
    labels = labels.to_global(sbp=flow.sbp.broadcast, placement=flow.env.all_device_placement("cuda"))
    labels = labels.to_global(
        sbp=dist.get_nd_sbp([flow.sbp.split(0),
                             flow.sbp.broadcast]),
        placement=dist.get_layer_placement(-1))
return {
"images": imgs,
"labels": labels
}
```
# How to Customize Parallelism
Common parallelisms have already been implemented in LiBai, such as data parallel, tensor parallel, and pipeline parallel. But there is also a need for user-customized parallelism. In this tutorial, we will show you how to customize your own parallelism.
## Define your own Parallel Model with LiBai.layers
### Large-scale FC
Suppose you have a huge fully-connected layer for large-scale classification (e.g., 10 million classes), which makes it impossible to fit into a single GPU.
Don't worry: with the help of `LiBai.layers`, you can write the model in the same familiar way you would for a single GPU. Here is a simple example showing how to write a tensor-parallel fully-connected layer on 2 GPUs.
```python
# huge_fc_example.py
import oneflow as flow
from omegaconf import DictConfig
from oneflow import nn
from libai.layers import Linear
from libai.utils import distributed as dist
cfg = DictConfig(dict(data_parallel_size=1, tensor_parallel_size=2, pipeline_parallel_size=1))
dist.setup_dist_util(cfg)
class Huge_FC(nn.Module):
def __init__(self):
super().__init__()
self.fc = Linear(2048, 32768, parallel="col")
def forward(self, x):
return self.fc(x)
huge_fc = Huge_FC()
x = flow.rand(32, 2048, sbp=flow.sbp.broadcast, placement=flow.placement("cuda", ranks=[0, 1]))
y = huge_fc(x)
print(f"rank: {flow.env.get_rank()}, tensor shape: {y.to_local().shape}")
```
You can run this toy example with command line as follows:
```shell
python3 -m oneflow.distributed.launch --nproc_per_node 2 huge_fc_example.py
>> rank: 0, tensor shape: oneflow.Size([32, 16384])
>> rank: 1, tensor shape: oneflow.Size([32, 16384])
```
From the result, you can see that `y` has been split along `axis=1` across the 2 GPUs.
### Large MLP models
Suppose you have a huge MLP model, which is very popular in transformer-based models, with a hidden size so large that it is difficult to fit into a single GPU.
You can then split the model weights across GPUs in a hybrid parallel mode while still writing your model in the familiar way.
Here is a simple example of a 2D-parallel MLP in the LiBai context.
```python
import oneflow as flow
from omegaconf import DictConfig
from oneflow import nn
from libai.layers import Linear
from libai.utils import distributed as dist
cfg = DictConfig(dict(data_parallel_size=2, tensor_parallel_size=2, pipeline_parallel_size=1))
dist.setup_dist_util(cfg)
# Write a Simple 2D Parallel MLP
class MLP_2D(nn.Module):
def __init__(self):
super().__init__()
self.linear_1 = Linear(in_features=1024, out_features=16384, parallel="col")
self.relu = nn.GELU()
self.linear_2 = Linear(in_features=16384, out_features=1024, parallel="row")
def forward(self, x):
x = self.linear_1(x)
x = self.relu(x)
x = self.linear_2(x)
return x
# define a model
mlp = MLP_2D()
# define input with 2D sbp
x = flow.rand(
32,
1024,
sbp=dist.get_nd_sbp([flow.sbp.split(0), flow.sbp.broadcast]),
placement=dist.get_layer_placement(0)
)
y = mlp(x)
print(f"rank: {flow.env.get_rank()}, tensor shape: {y.to_local().shape}")
```
You can run it with command line as follows:
```shell
python3 -m oneflow.distributed.launch --nproc_per_node 4 huge_mlp_example.py
>> rank: 2, tensor shape: oneflow.Size([16, 1024])
>> rank: 3, tensor shape: oneflow.Size([16, 1024])
>> rank: 1, tensor shape: oneflow.Size([16, 1024])
>> rank: 0, tensor shape: oneflow.Size([16, 1024])
```
From the output above, you can see that the data are split into 2 groups for data parallelism and the weights are split into 2 groups for tensor model parallelism, so this simple example implements 2D parallelism.
For your convenience, we provide some prevalent models such as BERT, GPT-2, and ViT in the Model Zoo. Feel free to customize them into different sizes to fit your specific needs.
## Write your own Pipeline Parallel Model
This tutorial describes how to use pipeline parallelism in your own model. LiBai has two pipeline-parallel modes: naive pipeline parallel, and a 1F1B pipeline parallel similar to the one introduced by [Megatron-LM](https://arxiv.org/abs/1909.08053).
### Introduction of Naive Pipeline Parallel
In LiBai, naive pipeline parallel can be implemented by setting the `placement` of layers and parameters.
You can easily configure their `placement` with `dist.get_layer_placement(idx)`.
Here is an example of `placement` configuration.
```python
# set a free tensor placement to first stage
self.pos_embed = nn.Parameter(
flow.zeros(
1,
num_patches + 1,
embed_dim,
sbp=dist.get_nd_sbp([flow.sbp.broadcast, flow.sbp.broadcast]),
placement=dist.get_layer_placement(0),
)
)
# set a Linear placement to last stage
# set it manually
self.head = Linear(embed_dim, num_classes, layer_idx=-1).to_global(placement=dist.get_layer_placement(-1))
# use `layer_idx` API
self.head = Linear(embed_dim, num_classes, layer_idx=-1)
```
After configuring the model's placement, add the input placement transition across the different stages. LiBai sets a `layer_idx` attribute in each `nn.Module`, so you can simply add `to_global` in `forward` to implement the input placement transition.
```python
class MyModule(nn.Module):
def __init__(self, ... *, layer_idx):
...
self.layer_idx = layer_idx
...
def forward(self, hidden_states):
hidden_states = hidden_states.to_global(placement=dist.get_layer_placement(self.layer_idx))
...
```
After configuring the model and data placement, you only need to set the distributed configuration before training.
```python
# set pipeline stages to 2
train.dist.pipeline_parallel_size = 2
# set model layers for pipeline
train.dist.pipeline_num_layers = hidden_layers
```
### Introduction of 1F1B Pipeline Parallel
First, we will introduce GPipe to give you a better understanding of pipeline parallelism. In GPipe, the backward passes are executed only after the forward passes of all microbatches have finished (as shown below).
![gpipe](../assets/gpipe.png)
1F1B performs one forward pass followed by one backward pass; at the end of a batch, it completes the backward passes for all remaining in-flight microbatches. In general, 1F1B is more efficient than GPipe.
There are two 1F1B pipeline schedules, non-interleaved and interleaved, shown in the figures below.
![1f1b](../assets/1f1b.png)
LiBai currently supports the non-interleaved schedule, which is more memory-efficient than GPipe.
In addition to the placement configuration used in naive pipeline parallel, you only need to set the model's stage ids, which help create stashed buffers for activations.
This example shows how to configure the BERT model's stage ids:
```python
class BertForPreTraining(nn.Module):
def __init__(self, ...):
...
def forward(self, ...):
...
@staticmethod
def set_pipeline_stage_id(model):
dist_utils = dist.get_dist_util()
# Set pipeline parallelism stage_id
for module_block in model.modules():
# module_block.to(nn.Module) can get the original module
if isinstance(module_block.to(nn.Module), BertEmbeddings):
module_block.to(nn.graph.GraphModule).set_stage(dist_utils.get_layer_stage_id(0))
elif isinstance(module_block.to(nn.Module), BertExtendedAttnMask):
module_block.to(nn.graph.GraphModule).set_stage(dist_utils.get_layer_stage_id(0))
elif isinstance(module_block.to(nn.Module), TransformerLayer):
module_block.to(nn.graph.GraphModule).set_stage(dist_utils.get_layer_stage_id(module_block.layer_idx))
elif isinstance(module_block.to(nn.Module), BertPooler):
module_block.to(nn.graph.GraphModule).set_stage(dist_utils.get_layer_stage_id(-1))
elif isinstance(module_block.to(nn.Module), BertPreTrainingHeads):
module_block.to(nn.graph.GraphModule).set_stage(dist_utils.get_layer_stage_id(-1))
# Set the last layernorm stage id
model.bert.final_layernorm.config.stage_id = dist_utils.get_layer_stage_id(-1)
```
In `set_pipeline_stage_id`, `BertEmbeddings` and `BertExtendedAttnMask` are placed in the first stage, each `TransformerLayer` is then placed uniformly across the stages, and finally `BertPooler` and `BertPreTrainingHeads` are placed in the last stage. Don't forget to also place the last `layernorm` in `BertEncoder`, which does not belong to any `TransformerLayer`, in the last stage.
After adding the `set_pipeline_stage_id` function to a pre-defined `nn.Module`, `GraphBase` will invoke it automatically as below:
```python
def set_pipeline_stage_id(self):
if hasattr(type(self.model.to(nn.Module)), "set_pipeline_stage_id"):
type(self.model.to(nn.Module)).set_pipeline_stage_id(self.model)
```
The last thing left is to set the training configuration as below:
```python
# set pipeline stages to 2
train.dist.pipeline_parallel_size = 2
# set model layers for pipeline
train.dist.pipeline_num_layers = hidden_layers
# enable activation checkpointing
train.activation_checkpoint.enabled = True
# enable gradient accumulation with 8 micro-batches
train.num_accumulation_steps = 8
```
Advanced Tutorials
===================
.. toctree::
:glob:
:maxdepth: 2
customize_dataloader.md
customize_parallel.md
# Auto Parallel Training
LiBai supports **auto-parallel training** which means LiBai will automatically find **an efficient parallel training strategy** for a specific model during training. Users can try out auto-parallel training by the following steps.
## Installation
Install the OneFlow nightly build:
```shell
python3 -m pip install --pre oneflow -f https://staging.oneflow.info/branch/master/[PLATFORM]
```
- All available `[PLATFORM]`:
<table class="docutils">
<tbody>
<tr>
<th width="80"> Platform </th>
<th valign="bottom" align="left" width="120">CUDA Driver Version</th>
<th valign="bottom" align="left" width="120">Supported GPUs</th>
</tr>
<tr>
<td align="left"> cu112 </td>
<td align="left"> >= 450.80.02 </td>
<td align="left"> GTX 10xx, RTX 20xx, A100, RTX 30xx</td>
</tr>
<tr>
<td align="left"> cu102 </td>
<td align="left"> >= 440.33 </td>
<td align="left"> GTX 10xx, RTX 20xx</td>
</tr>
<tr>
<td align="left"> cpu </td>
<td align="left"> N/A </td>
<td align="left"> N/A </td>
</tr>
</tbody>
</table>
## Train/Evaluate model in auto-parallel mode
You can train your own model in auto-parallel mode by simply updating the config as follows:
### Modify config file
```python
# your config
from .common.models.graph import graph
graph.auto_parallel.enabled = True
```
Train a model with auto-parallelism on 4 GPUs:
```shell
bash ./tools/train.sh tools/train_net.py configs/your_own_config.py 4
```
### Directly modify the training command line
- auto-parallel training:
```shell
bash ./tools/train.sh tools/train_net.py configs/your_own_config.py 4 graph.auto_parallel.enabled=True
```
- auto-parallel evaluation:
```shell
bash ./tools/train.sh tools/train_net.py configs/your_own_config.py 4 --eval graph.auto_parallel.enabled=True
```
### More details with instructions and interface
See [OneFlow Auto-Parallelism](https://oneflow.readthedocs.io/en/master/auto_parallel.html).
# Build New Project on LiBai
This is a basic guide to building new projects based on LiBai. The advantages of using LiBai to start a new project (such as a paper reproduction or a finetuning task) are as follows:
- Avoid redundant work. Developers can directly inherit many built-in modules from LiBai.
- Easily reproduce the experiments already run, because LiBai will save the configuration file automatically.
- Automatically output useful information during training, such as the remaining training time, current iteration, throughput, loss information, current learning rate, etc.
- Set only a few config params to enable distributed training techniques.
## Introduction
We take the [Bert Finetune](https://github.com/Oneflow-Inc/libai/tree/main/projects/QQP) task as an example to introduce how to build a project on LiBai.
The complete file structure of the project is:
```
projects/my_project
├── configs
│   ├── config_custom.py
│   └── ...
├── dataset
│   ├── custom_dataset.py
│   └── ...
├── modeling
│   ├── custom_model.py
│   └── ...
└── README.md
```
To start a new project based on LiBai, follow these steps:
Step 1. Prepare an independent config file (such as [config.py](https://github.com/Oneflow-Inc/libai/blob/main/projects/QQP/configs/config_qqp.py)) which contains:
- The relevant parameters of the task.
- The pre-defined related classes, such as `Model`, `Optimizer`, `Scheduler` and `Dataset`.
- You can inherit the default configs in `configs/common` and override them, which greatly reduces the workload.
- Related classes are defined with `LazyCall`, which returns a config dict instead of instantiating the object; a partial sketch is shown after this list.
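A partial sketch of such a config is shown below. The project and class names (`projects/my_project/...`, `MyClassifier`) are hypothetical, and a complete config also needs the `dataloader` (and optionally `tokenization`) namespaces described in the Config System tutorial:
```python
# projects/my_project/configs/config_custom.py -- a partial sketch
from libai.config import LazyCall, get_config

from projects.my_project.modeling.custom_model import MyClassifier

# Inherit the default configs and override only what the task needs.
train = get_config("common/train.py").train
optim = get_config("common/optim.py").optim
graph = get_config("common/models/graph.py").graph

# LazyCall records the class and its arguments instead of instantiating the object.
model = LazyCall(MyClassifier)(vocab_size=30522, hidden_size=768, num_classes=2)

# Task-specific overrides.
train.train_iter = 2000
train.train_micro_batch_size = 16
```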
Step 2. Prepare a model file (such as [model.py](https://github.com/Oneflow-Inc/libai/blob/main/projects/QQP/modeling/model.py)):
- Build the related models in this file. The construction method is similar to native OneFlow.
- Because LiBai sets up a static graph by default, the loss calculation needs to be inside the model.
- The `forward` function must return a dict.
- When defining a tensor in the model, use `to_global` to turn it into a global tensor.
- When defining layers, you can import them directly from `libai.layers`, because they already have pre-defined SBP signatures. A minimal sketch is shown after this list.
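The following minimal sketch illustrates these points. The class name `MyClassifier` and its structure are hypothetical; it only assumes that `libai.layers` provides `Embedding` and `Linear` with pre-defined SBP signatures:
```python
# projects/my_project/modeling/custom_model.py -- a minimal sketch
from oneflow import nn

from libai.layers import Embedding, Linear


class MyClassifier(nn.Module):
    def __init__(self, vocab_size=30522, hidden_size=768, num_classes=2):
        super().__init__()
        # Layers from libai.layers already carry pre-defined SBP signatures.
        self.embedding = Embedding(vocab_size, hidden_size)
        self.classifier = Linear(hidden_size, num_classes)
        self.loss_fn = nn.CrossEntropyLoss()

    def forward(self, input_ids, labels=None):
        # Mean-pool the token embeddings into a single sentence vector.
        hidden_states = self.embedding(input_ids).mean(dim=1)
        logits = self.classifier(hidden_states)
        if labels is not None:
            # The loss is computed inside the model and returned as a dict.
            return {"loss": self.loss_fn(logits, labels)}
        return {"prediction_scores": logits}
```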
Step 3. Prepare a dataset file (such as [dataset.py](https://github.com/Oneflow-Inc/libai/tree/main/projects/QQP/dataset)):
- Build the `Dataset` in this file. The construction method is similar to native OneFlow.
- The difference is that you need to wrap the data with `DistTensorData` and `Instance`.
- The tensors in each batch must be global.
- In the `__getitem__` function, the keys returned must be consistent with the parameter names of the `forward` function in the model. A minimal sketch is shown after this list.
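A corresponding dataset sketch (the class name `MyDataset` is hypothetical; it assumes `DistTensorData` and `Instance` come from `libai.data.structures`) might look like this:
```python
# projects/my_project/dataset/custom_dataset.py -- a minimal sketch
import oneflow as flow
from oneflow.utils.data import Dataset

from libai.data.structures import DistTensorData, Instance


class MyDataset(Dataset):
    def __init__(self, samples):
        # `samples` is assumed to be a list of (token_ids, label) pairs.
        self.samples = samples

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        token_ids, label = self.samples[idx]
        # The keys below must match the parameter names of MyClassifier.forward().
        return Instance(
            input_ids=DistTensorData(flow.tensor(token_ids, dtype=flow.int64)),
            labels=DistTensorData(flow.tensor(label, dtype=flow.int64)),
        )
```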
## Main Function Entry
[tools/train_net.py](https://github.com/Oneflow-Inc/libai/blob/main/tools/train_net.py) is the default main function entry provided in LiBai.
## Build Config
The `config.py` in LiBai is special: it takes the form of a lazy config and is saved as a `.yaml` file at runtime. The config has several required fields, such as `train`, `model`, `optim`, `lr_scheduler`, `graph`. For more information, please refer to [Config_System.md](https://libai.readthedocs.io/en/latest/tutorials/Config_System.html).
> All imported modules must take LiBai as the root directory. Otherwise, the saved `yaml` file cannot record the correct module path, which causes an error when the `yaml` is loaded and makes the experiment irreproducible.
After building `config.py`, if you want to access the corresponding fields in the project, use the form `cfg.my_cfg.***`.
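For example (a sketch with a hypothetical field name; it assumes the loaded config object is available as `cfg` inside the project code):
```python
# config.py: define a project-specific namespace
my_cfg = dict(
    num_classes=2,
    pretrained_model_path="/path/to/weights",
)

# elsewhere in the project, after the config has been loaded as `cfg`:
num_classes = cfg.my_cfg.num_classes
```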
## Start Training
The `train.sh` file contains some parameters, such as `GPUS`, `NODE`, etc.
```bash
#!/usr/bin/env bash
FILE=$1
CONFIG=$2
GPUS=$3
NODE=${NODE:-1}
NODE_RANK=${NODE_RANK:-0}
ADDR=${ADDR:-127.0.0.1}
PORT=${PORT:-12345}
python3 -m oneflow.distributed.launch \
--nproc_per_node $GPUS --nnodes $NODE --node_rank $NODE_RANK --master_addr $ADDR --master_port $PORT \
$FILE --config-file $CONFIG ${@:4}
```
After building the above modules, you can start training with a single GPU.
> The config can be either a `py` file or a generated `yaml` file.
```bash
bash projects/my_projects/train.sh tools/train_net.py projects/my_projects/config.py 1
```
# Config System
Given that the traditional yacs-based config system and python argparse command-line options do not provide enough flexibility for the development of new projects, we borrowed the [lazy config system](https://detectron2.readthedocs.io/en/latest/tutorials/lazyconfigs.html) design from detectron2, which forms the non-intrusive config system of LiBai.
You can refer to the [d2 tutorial](https://detectron2.readthedocs.io/en/latest/tutorials/lazyconfigs.html) for more details about the syntax and basic usage of lazy config. This section shows some examples of usage in LiBai.
## Configs in LiBai
LiBai defines a standard set of config namespaces for later use. This set of namespaces must be kept if you want to run the complete training and evaluation process of LiBai.
In summary, the namespaces are `model`, `graph`, `train`, `optim`, `dataloader` and `tokenization` (optional). The details are as follows.
### model
This is the configuration for model definition. You can refer to `configs/common/models` for more examples.
A model config file can be loaded like this:
```python
# bert.py:
from libai.config import LazyCall
from libai.models import BertModel
# define a model with lazycall
bert_model = LazyCall(BertModel)(
vocab_size=30522,
hidden_size=768,
hidden_layers=24,
num_attention_heads=12,
intermediate_size=4096,
hidden_dropout_prob=0.1,
attention_probs_dropout_prob=0.1,
max_position_embeddings=512,
num_tokentypes=2,
add_pooling_layer=True,
initializer_range=0.02,
layernorm_eps=1e-5,
bias_gelu_fusion=True,
bias_dropout_fusion=True,
scale_mask_softmax_fusion=False,
apply_query_key_layer_scaling=True,
add_binary_head=True,
amp_enabled=False,
)
# my_config.py:
from bert import bert_model as model
assert model.hidden_size == 768
model.hidden_layers = 12 # change hidden layers
```
After defining the model config in a python file, you can `import` it in the global scope of the config file. Note that you need to rename it to `model` regardless of the name used in the model config file.
You can access and change all keys in the model config after import.
### graph
This is the configuration for static `nn.Graph` mode. For more information about the static graph mode, refer to the official [nn.Graph docs](https://docs.oneflow.org/master/basics/08_nn_graph.html).
LiBai has already defined a `GraphBase` class for almost all models to use. You can simply turn on this option to convert eager mode to graph mode.
The graph config can be found in [graph.py](https://github.com/Oneflow-Inc/libai/blob/main/configs/common/models/graph.py), and two useful options are shown as follows:
```python
# Turn on graph mode, if set to `False`, will use eager mode.
graph.enabled = True
# Set graph debug level, -1 means no debug info, and 0,1,2,3 can be
# set for different debug levels.
# More information can be found in nn.Graph documents.
graph.debug = -1
```
### train
This is the configuration for training and evaluation. The default training config can be found in `configs/common/train.py`.
The convention of training / test specific parameters is as follows:
```python
from libai.config import LazyCall
train = dict(
# Directory where output files are written
output_dir="./output",
    # `train_micro_batch_size` is the number of samples per batch on each GPU.
# train_mini_batch_size = train_micro_batch_size * num_accumulation_steps.
# This is also the number of training samples per step (i.e. per iteration).
# If we use 8 GPUs for data parallel groups, `train_micro_batch_size = 2` and
# `num_accumulation_steps = 4`, then each GPU will see 2 samples per batch and
# 8 samples per iteration.
# Total 64 samples will be trained per iteration across all GPUs.
# global_batch_size = micro_batch_size * num_grad_acc * data_parallel_groups
train_micro_batch_size=32,
global_batch_size=None,
num_accumulation_steps=None,
# The total training iterations
train_iter=10000,
    # The total training epochs, which will be scaled to training iterations automatically.
# The actual total training iterations will be calculated by the
# formula `max(train_iter, train_epoch * iter_per_epoch)`.
train_epoch=0,
consumed_train_samples=0,
consumed_valid_samples=0,
train_samples=None,
    # Fraction of the total training iterations to use for lr warmup (as a float).
warmup_ratio=0,
# The start iteration, usually needn't set it manually.
# It can be computed automatically when resuming training.
start_iter=0,
    # Enable automatic mixed precision for training, which does not
    # change the model's inference behavior.
amp=dict(enabled=False),
# Enable activation checkpointing to allow for training
# with larger models, sequences, and batch sizes.
    # If enabled, checkpoint the input activations of each transformer layer by default.
activation_checkpoint=dict(enabled=False),
    # NCCL fusion threshold in megabytes; set to 0 to be
    # compatible with previous versions of OneFlow.
    nccl_fusion_threshold_mb=16,
    # Maximum number of ops for NCCL fusion; set to 0 to be
    # compatible with previous versions of OneFlow.
nccl_fusion_max_ops=24,
    # Enable ZeRO optimization to allow for training with larger models.
    # This optimization reduces optimizer-state memory consumption
    # as described in ZeRO (https://arxiv.org/abs/1910.02054).
zero_optimization=dict(
enabled=False,
stage=1,
),
    # Save a model checkpoint every `period` iterations,
    # and keep at most `max_to_keep` checkpoints.
checkpointer=dict(period=5000, max_to_keep=100),
# Options for evaluation
    # `test_micro_batch_size` is the number of samples per batch on each GPU for testing.
    # If we use 8 GPUs for data parallel groups and `test_micro_batch_size = 2`, then
    # a total of 16 samples will be used per iteration across all GPUs.
test_micro_batch_size=32,
    # Enable evaluation during training; the evaluation process runs
    # every `eval_period` iterations.
    # You can set the maximum number of evaluation iterations to run for validation/test.
    # You can also set a customized evaluator.
evaluation=dict(
enabled=True,
# evaluator for calculating top-k acc
evaluator=LazyCall(ClsEvaluator)(topk=(1, 5)),
eval_period=5000,
eval_iter=1e9, # running steps for validation/test
# Metrics to be used for best model checkpoint.
eval_metric="Acc@1",
eval_mode="max",
),
# Path to a checkpoint file to be loaded to the model for training or evaluation.
load_weight="",
    # Output logs to the console every `log_period` iterations.
log_period=20,
# lr_scheduler arguments
# See libai/scheduler/lr_scheduler.py for definition.
scheduler=LazyCall(WarmupCosineLR)(
# In DefaultTrainer we will automatically set `max_iter`
# and `warmup_iter` by the given train cfg.
warmup_factor=0.001,
alpha=0.01,
warmup_method="linear",
),
# Distributed arguments
# See https://libai.readthedocs.io/en/latest/tutorials/basics/Distributed_Configuration.html for more details.
dist=dict(
data_parallel_size=1,
tensor_parallel_size=1,
pipeline_parallel_size=1,
# users must set the `pipeline_num_layers` attribute when `pipeline_parallel_size > 1`
pipeline_num_layers=None,
        # Users can customize the number of layers in different stages
        # by setting the `custom_pipeline_stage_id` attribute, which is used to
        # manually balance the computation between stages when running pipeline parallelism.
        # E.g., you can set `custom_pipeline_stage_id=[0, 0, 0, 1]`
        # for `pipeline_num_layers=4` and `pipeline_parallel_size=2`,
        # which means the first 3 layers will be placed on stage 0 and
        # the last layer will be placed on stage 1.
        # NOTE: if it is None, LiBai will set the pipeline stage ids automatically;
        # `auto_pipeline_stage_id` and `actual_pipeline_stage_id` will be saved in `config.yaml`.
custom_pipeline_stage_id=None,
),
    # The device type of input tensors for the model, defaults to "cuda".
    # If you want to accelerate model training when pipeline_parallel > 1,
    # you can set `input_placement_device="cpu"` and then call input_tensor.to_global()
    # inside your model.forward() method.
    # See `libai/models/bert_model.py` as a reference.
    input_placement_device="cuda",
    # Set to `True` to enable RDMA, which improves the speed of pipeline parallelism.
rdma_enabled=True,
# Set seed to positive to use a fixed seed. Note that a fixed seed increases
# reproducibility but does not guarantee fully deterministic behavior.
# Disabling all parallelism further increases reproducibility.
seed=1234,
)
```
**Note:** ``warmup_ratio`` is the ratio of warmup iterations to the total training iterations, and the real ``warmup iterations`` will be calculated automatically as ``warmup_ratio * train_iter``.
**Example:** If you need to train 300 epochs with 5 warmup epochs, update the config as follows:
```python
# config.py
train.train_epoch = 300
train.warmup_ratio = 5 / 300
```
If you need to train 1000 iters with 200 warmup iters, set the training config like this:
```python
# config.py
train.train_iter = 1000
train.warmup_ratio = 200 / 1000
```
### optim
This is the configuration for the optimizer. The default configuration can be found in `configs/common/optim.py`.
LiBai utilizes the function `get_default_optimizer_params`, which takes the `nn.Module` as an argument and returns the parameter groups.
With `LazyConfig`, you can set other arguments in advance and pass the `model` argument later. For more details, refer to [API docs of libai optim](../libai.optim.html#libai.optim.get_default_optimizer_params).
```python
# optim.py:
import oneflow as flow
from libai.config import LazyCall
from libai.optim import get_default_optimizer_params
optim = LazyCall(flow.optim.AdamW)(
params=LazyCall(get_default_optimizer_params)(
# params.model is meant to be set to the model object,
# before instantiating the optimizer.
clip_grad_max_norm=1.0,
clip_grad_norm_type=2.0,
weight_decay_norm=0.0,
weight_decay_bias=0.0,
),
lr=1e-4,
weight_decay=0.01,
betas=(0.9, 0.999),
eps=1e-8,
do_bias_correction=True,
)
# my_config.py:
import oneflow as flow
optim._target_ = flow.optim.SGD
# Remove the incompatible arguments in optim
del optim.do_bias_correction
# Set the needed arguments
optim.momentum = 0.9
```
### dataloader
This is the configuration for dataset/dataloader. This component provides data to the model. A dataloader usually takes raw information and processes it into the format required by the model.
See example datasets in `configs/common/data/`, including `cifar100`, `imagenet`, `bert_dataset` and so on. You can also define your customized dataset config as you like.
Take `bert_dataset.py` as an example:
```python
# bert_dataset.py:
from libai.config import LazyCall
from omegaconf import OmegaConf
from libai.data import build_nlp_test_loader, build_nlp_train_val_test_loader
from libai.data.datasets import BertDataset
from libai.data.data_utils import get_indexed_dataset
dataloader = OmegaConf.create()
dataloader.train = LazyCall(build_nlp_train_val_test_loader)(
dataset=[
LazyCall(BertDataset)(
data_prefix="/your/data_prefix/path",
indexed_dataset=LazyCall(get_indexed_dataset)(
data_prefix="/your/data_prefix/path",
data_impl="mmap",
skip_warmup=False,
),
max_seq_length=512,
mask_lm_prob=0.15,
short_seq_prob=0.1,
),
],
splits=[[949.0, 50.0, 1.0]],
weights=[1.0],
num_workers=4,
)
# my_config.py:
dataloader.train.dataset[0].max_seq_length = 256
dataloader.train.num_workers = 2
```
LiBai provides two functions, `build_nlp_train_val_test_loader` and `build_image_train_loader`, to create a default train data loader from a given config. It takes a list of dataset classes (e.g., `BertDataset`) and combines them using `flow.utils.data.dataset.ConcatDataset`.
It is recommended to check out [API docs of libai.data](../libai.data.html#libai.data.build.build_nlp_train_loader) to learn more about the APIs of `build_nlp_train_val_test_loader`.
### tokenization (optional)
You need to configure a tokenizer if you want to train an NLP task. Each NLP dataset has its own tokenizer config in the corresponding data config file.
Here we use:
```python
# bert_dataset.py:
from libai.config import LazyCall
from omegaconf import OmegaConf
from libai.tokenizer import BertTokenizer
tokenization = OmegaConf.create()
tokenization.tokenizer = LazyCall(BertTokenizer)(
vocab_file="bert-base-chinese-vocab.txt",
do_lower_case=True,
do_chinese_wwm=True,
)
tokenization.append_eod = False
tokenization.make_vocab_size_divisible_by = 128
# my_config.py:
tokenization.tokenizer.do_lower_case = False
```
The tokenization config must contain a tokenizer (e.g., `BertTokenizer`); `append_eod` and `make_vocab_size_divisible_by` are optional.
`make_vocab_size_divisible_by` pads the vocab size so that it is divisible by this value, which improves computational efficiency for tensor parallelism.
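As a rough illustration of the idea (a sketch, not the exact LiBai implementation; with tensor parallelism the multiple is typically also scaled by the tensor-parallel size):
```python
def pad_vocab_size(vocab_size, make_vocab_size_divisible_by=128, tensor_parallel_size=1):
    # Round the vocab size up to the nearest usable multiple.
    multiple = make_vocab_size_divisible_by * tensor_parallel_size
    return ((vocab_size + multiple - 1) // multiple) * multiple


print(pad_vocab_size(30522))  # 30592
```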
## Get the Default Config
You don't need to rewrite all the contents of the config every time. You can import a config file as a python module or use the [`get_config`](../libai.config.html#libai.config.get_config) function.
If you build LiBai from source, you can find all the default config files in `configs/*`. Then you can import them as follows:
```python
# import config
from .common.models.bert import pretrain_model as model
from .common.models.graph import graph
from .common.train import train
from .common.optim import optim
from .common.data.bert_dataset import dataloader, tokenization
# modify it
train.train_iter = 100
...
```
If you install LiBai by `pip`, you can use the `get_config` function to get all the default config files as follows:
```python
from libai.config import get_config
# get config
model = get_config("common/models/bert.py").pretrain_model
graph = get_config("common/models/graph.py").graph
train = get_config("common/train.py").train
optim = get_config("common/optim.py").optim
dataloader = get_config("common/data/bert_dataset.py").dataloader
tokenization = get_config("common/data/bert_dataset.py").tokenization
# modify it
train.train_iter = 100
...
```
## LazyConfig Best Practices
1. Treat the configs you write as actual "code": Avoid copying them or duplicating them. Import the common parts between configs.
2. Keep the configs you write simple: Don't include keys that do not affect the experimental setting.