libai.scheduler
##############################
.. currentmodule:: libai.scheduler
.. automodule:: libai.scheduler
:members:
WarmupCosineAnnealingLR,
WarmupCosineLR,
WarmupExponentialLR,
WarmupMultiStepLR,
WarmupPolynomialLR
libai.tokenizer
##############################
.. currentmodule:: libai.tokenizer
.. automodule:: libai.tokenizer
:member-order: bysource
:members:
BertTokenizer,
RobertaTokenizer,
GPT2Tokenizer,
T5Tokenizer,
PreTrainedTokenizer,
libai.utils
##############################
libai.utils.distributed module
---------------------------------
.. currentmodule:: libai.utils
.. automodule:: libai.utils.distributed
:members:
get_dist_util,
setup_dist_util,
ttol,
tton,
synchronize,
convert_to_distributed_default_setting,
get_nd_sbp,
get_layer_placement,
get_world_size,
get_num_nodes,
get_rank,
get_local_rank,
same_sbp,
get_data_parallel_size,
get_data_parallel_rank,
get_hidden_sbp
libai.utils.events module
---------------------------------
.. currentmodule:: libai.utils
.. automodule:: libai.utils.events
:members:
get_event_storage,
JSONWriter,
CommonMetricPrinter,
EventStorage,
libai.utils.logger module
---------------------------------
.. currentmodule:: libai.utils
.. automodule:: libai.utils.logger
:members:
setup_logger,
log_first_n,
log_every_n,
log_every_n_seconds
libai.utils.checkpoint module
---------------------------------
.. currentmodule:: libai.utils
.. automodule:: libai.utils.checkpoint
:members:
Checkpointer,
PeriodicCheckpointer
# Frequently Asked Questions
We list some common problems encountered by users and the corresponding solutions here. Feel free to enrich the list if you find any frequent issues and have ways to help others solve them.
## Training
- "Loss goes NaN or very large"
1. Check if the dataset annotations are valid. The mask must be `{0, 1}`, where `1` is for tokens that are **not masked** and `0` is for tokens that are **masked**.
2. Check `initializer_range` in the config file. It can be safely set to `0.02` in most cases. If the model is very large, decreasing `initializer_range` is a good choice. For example, `initializer_range` can be set to `0.006` when training the 175-billion-parameter GPT-3 configuration (see the config sketch at the end of this section).
- "AMP enabled goes NaN"
Set `ONEFLOW_DEBUG_KERNEL_SYNC_CHECK_NUMERICS=1` to check what triggers an overflow of the value range in fp16.
- "GPU out of memory when validation"
Decrease `test_micro_batch_size` and use `--fast-dev-run` for quickly running through training and evaluation to check if memory is sufficient.
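The snippet below is a minimal sketch of how these knobs might be adjusted in a LiBai-style config file; the import paths and field names follow LiBai's common configs but should be treated as illustrative:
```python
# config.py: a sketch of lowering initializer_range and the evaluation batch size
# (import paths and field names follow LiBai's common configs; adjust to your setup)
from .common.models.bert import pretrain_model as model
from .common.train import train

model.cfg.initializer_range = 0.006  # smaller init helps very large models stay stable
train.test_micro_batch_size = 8      # lower this if evaluation runs out of GPU memory
```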
## Model
- "`apply_query_key_layer_scaling` in MultiheadAttention"
As the number of attention heads increases, some of the GEMMs inside the self-attention layer become smaller, and the number of elements in the self-attention softmax also increases.
- "QKV implementation is not consistent with Hugging Face in self attention"
```python
# query_key_value:[batch_size, seq_len, 3*hidden_size]
# QKV in LiBai
query_key_value = query_key_value.view(bsz, -1, self.num_heads, 3 * self.head_size)
query_key_value = query_key_value.permute(0, 2, 1, 3)
query, key, value = flow.chunk(query_key_value, chunks=3, dim=-1)
# QKV in Huggingface
query, key, value = flow.chunk(query_key_value, chunks=3, dim=-1)
query = query.view(query.size(0), query.size(1), self.num_heads, -1).permute(0, 2, 1, 3)
key = key.view(key.size(0), key.size(1), self.num_heads, -1).permute(0, 2, 1, 3)
value = value.view(value.size(0), value.size(1), self.num_heads, -1).permute(0, 2, 1, 3)
```
In tensor parallelism, the `chunk` dimension and the `flow.sbp.split` dimension are the same in Huggingface's implementation, which can cause unexpected behaviors (e.g., changing the tensor's SBP unexpectedly).
We also provide a tutorial about how to load Huggingface weights correctly. Please refer to [How to use Huggingface's pretrained weights in LiBai](https://libai.readthedocs.io/en/latest/notes/How_to_implement_huggingface%27s_weights_in_LiBai.html) for more details.
- "the order of layer normalization and the residual connection"
This is critical to enable the scaling of BERT-style models beyond BERT-Large. The architecture with `apply_residual_post_layernorm=False` eliminates the instabilities observed with the original BERT architecture (`apply_residual_post_layernorm=True`) and also has a lower training loss according to [Megatron-LM](https://arxiv.org/pdf/1909.08053.pdf).
If you run into problems that are hard to understand, feel free to open an issue in [OneFlow](https://github.com/Oneflow-Inc/oneflow) to collect feedback.
# Detailed instruction on building Vision Transformer models in LiBai
It's easy for users to build `transformer-based` models using LiBai's built-in [layers](https://libai.readthedocs.io/en/latest/modules/libai.layers.html). Let's take a deep dive into the process of building a Vision Transformer model in LiBai.
## Model Architecture
**Vision Transformer** was released in the paper [An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale](https://arxiv.org/abs/2010.11929) by Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby.
A **Vision Transformer** model contains three parts: `Patch Embedding` + `Transformer Block` + `Linear Classification Head`, which can be summarized in the following picture:
![](./assets/vision_transformer.png)
## A simple Torch implementation of Vision Transformer
The following code shows the PyTorch implementation of ViT models modified from [timm.models.vision_transformer](https://github.com/rwightman/pytorch-image-models/blob/master/timm/models/vision_transformer.py):
```python
import torch
import torch.nn as nn
from timm.models.layers import trunc_normal_, PatchEmbed, Mlp, DropPath
"""
1. Build a self-attention module
"""
class Attention(nn.Module):
def __init__(self, dim, num_heads=8, qkv_bias=False, attn_drop=0.0, proj_drop=0.0):
super().__init__()
self.num_heads = num_heads
head_dim = dim // num_heads
self.scale = head_dim ** -0.5
self.qkv = nn.Linear(dim, dim * 3, bias=qkv_bias)
self.attn_drop = nn.Dropout(attn_drop)
self.proj = nn.Linear(dim, dim)
self.proj_drop = nn.Dropout(proj_drop)
def forward(self, x):
B, N, C = x.shape
qkv = (
self.qkv(x)
.reshape(B, N, 3, self.num_heads, C // self.num_heads)
.permute(2, 0, 3, 1, 4)
)
q, k, v = qkv[0], qkv[1], qkv[2]
attn = (q @ k.transpose(-2, -1)) * self.scale
attn = attn.softmax(dim=-1)
attn = self.attn_drop(attn)
x = (attn @ v).transpose(1, 2).reshape(B, N, C)
x = self.proj(x)
x = self.proj_drop(x)
return x
"""
2. Build a transformer block, which contains:
self-attention layer + mlp layer
"""
class Block(nn.Module):
def __init__(
self,
dim,
num_heads,
mlp_ratio=4.0,
qkv_bias=False,
drop=0.0,
attn_drop=0.0,
drop_path=0.0,
act_layer=nn.GELU,
norm_layer=nn.LayerNorm,
):
super().__init__()
self.norm1 = norm_layer(dim)
self.attn = Attention(
dim,
num_heads=num_heads,
qkv_bias=qkv_bias,
attn_drop=attn_drop,
proj_drop=drop,
)
# Use drop_path here
self.drop_path = DropPath(drop_path) if drop_path > 0.0 else nn.Identity()
self.norm2 = norm_layer(dim)
mlp_hidden_dim = int(dim * mlp_ratio)
self.mlp = Mlp(
in_features=dim,
hidden_features=mlp_hidden_dim,
act_layer=act_layer,
drop=drop,
)
def forward(self, x):
x = x + self.drop_path(self.attn(self.norm1(x)))
x = x + self.drop_path(self.mlp(self.norm2(x)))
return x
"""
3. Build a Vision Transformer model which contains three parts:
patch embedding + transformer block + mlp classification head
"""
class VisionTransformer(nn.Module):
def __init__(
self,
img_size=224,
patch_size=16,
in_chans=3,
embed_dim=192,
depth=12,
num_heads=3,
mlp_ratio=4.0,
drop_rate=0.0,
attn_drop_rate=0.0,
drop_path_rate=0.0,
num_classes=1000,
):
super().__init__()
self.num_classes = num_classes
# patch embedding
self.patch_embed = PatchEmbed(
img_size=img_size,
patch_size=patch_size,
in_chans=in_chans,
embed_dim=embed_dim
)
num_patches = self.patch_embed.num_patches
# cls token and position embedding
self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))
self.pos_drop = nn.Dropout(p=drop_rate)
# stochastic depth decay rule
dpr = [x.item() for x in torch.linspace(0, drop_path_rate, depth)]
# transformer block
self.blocks = nn.Sequential(*[
Block(
dim=embed_dim, num_heads=num_heads, mlp_ratio=mlp_ratio, qkv_bias=True, drop=drop_rate,
attn_drop=attn_drop_rate, drop_path=dpr[i], norm_layer=nn.LayerNorm, act_layer=nn.GELU)
for i in range(depth)])
# classification head
self.norm = nn.LayerNorm(embed_dim)
self.head = nn.Linear(embed_dim, num_classes)
# weight init
trunc_normal_(self.pos_embed, std=0.02)
trunc_normal_(self.cls_token, std=0.02)
self.apply(self._init_weights)
def _init_weights(self, m):
if isinstance(m, nn.Linear):
trunc_normal_(m.weight, std=0.02)
if m.bias is not None:
nn.init.constant_(m.bias, 0)
elif isinstance(m, nn.LayerNorm):
nn.init.constant_(m.bias, 0)
nn.init.constant_(m.weight, 1.0)
def forward_features(self, x):
# patch embedding
x = self.patch_embed(x)
cls_token = self.cls_token.expand(x.shape[0], -1, -1)
x = torch.cat((cls_token, x), dim=1)
# position embedding
pos_embed = self.pos_embed.expand(x.shape[0], -1, -1)
x = self.pos_drop(x + pos_embed)
# transformer block
x = self.blocks(x)
return x
def forward_head(self, x):
# only use cls token for classification
x = self.norm(x)
outcome = x[:, 0]
return self.head(outcome)
def forward(self, x):
x = self.forward_features(x)
x = self.forward_head(x)
return x
```
We have further decoupled the forward function into `forward_features` and `forward_head`:
- `forward_features`: extract the image features using the `patch_embed` layer and a stack of `transformer` blocks
- `forward_head`: take the `cls_token` of each sample and use `nn.Linear` for classification
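As a quick sanity check, the model defined above can be instantiated and run on a dummy batch (the shapes below are only illustrative):
```python
# Quick smoke test of the Torch ViT defined above
model = VisionTransformer(img_size=224, patch_size=16, embed_dim=192, depth=12, num_heads=3)
dummy = torch.randn(2, 3, 224, 224)
logits = model(dummy)
print(logits.shape)  # torch.Size([2, 1000])
```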
## Implement 3D parallel Vision Transformer in LiBai
In this section, we will show users how to use [libai.layers](https://libai.readthedocs.io/en/latest/modules/libai.layers.html) to build a 3D parallel Vision Transformer model with only 100+ lines of code, which is modified from [libai.models.vision_transformer](https://github.com/Oneflow-Inc/libai/blob/main/libai/models/vision_transformer.py).
Here is the LiBai implementation of Vision Transformer models, and users only need to replace the PyTorch modules with the corresponding `libai.layers` as follows:
```python
# LiBai's implementation of Vision Transformer
import oneflow as flow
import oneflow.nn as nn
from flowvision.layers.weight_init import trunc_normal_
import libai.utils.distributed as dist
from libai.config.config import configurable
from libai.layers import LayerNorm, Linear, PatchEmbedding, TransformerLayer
"""
LiBai has already implemented:
1. PatchEmbedding Layer
2. Transformer Layer: Self-Attention + MLP + DropPath + LayerNorm
3. Linear Layer
We can directly build a Vision Transformer model with the built-in layers in LiBai as follows:
"""
class VisionTransformer(nn.Module):
def __init__(
self,
img_size=224,
patch_size=16,
in_chans=3,
embed_dim=192,
depth=12,
num_heads=3,
mlp_ratio=4.0,
drop_rate=0.0,
attn_drop_rate=0.0,
drop_path_rate=0.0,
num_classes=1000,
loss_func=None
):
super().__init__()
self.num_classes = num_classes
# patch embedding
self.patch_embed = PatchEmbedding(
img_size=img_size,
patch_size=patch_size,
in_chans=in_chans,
embed_dim=embed_dim,
)
num_patches = self.patch_embed.num_patches
# cls token and position embedding with sbp signature
self.cls_token = nn.Parameter(
flow.zeros(1, 1, embed_dim,
sbp=dist.get_nd_sbp([flow.sbp.broadcast, flow.sbp.broadcast]),
placement=dist.get_layer_placement(0),)
)
self.pos_embed = nn.Parameter(
flow.zeros(1, num_patches+1, embed_dim,
sbp=dist.get_nd_sbp([flow.sbp.broadcast, flow.sbp.broadcast]),
placement=dist.get_layer_placement(0),)
)
self.pos_drop = nn.Dropout(p=drop_rate)
# stochastic depth decay rule
dpr = [x.item() for x in flow.linspace(0, drop_path_rate, depth)]
# a stack of transformer block
ffn_size = int(embed_dim * mlp_ratio)
self.blocks = nn.Sequential(
*[
TransformerLayer(
hidden_size=embed_dim,
ffn_hidden_size=ffn_size,
num_attention_heads=num_heads,
attention_dropout_prob=attn_drop_rate,
output_dropout_prob=drop_rate,
drop_path_prob=dpr[i],
layer_idx=i,
)
for i in range(depth)
]
)
self.norm = LayerNorm(embed_dim, layer_idx=-1)
self.head = Linear(embed_dim, num_classes, layer_idx=-1)
# implement loss function in nn.Module to match LiBai style
self.loss_func = nn.CrossEntropyLoss() if loss_func is None else loss_func
# weight init function
trunc_normal_(self.pos_embed, std=0.02)
trunc_normal_(self.cls_token, std=0.02)
self.apply(self._init_weights)
def _init_weights(self, m):
if isinstance(m, Linear):
trunc_normal_(m.weight, std=0.02)
if m.bias is not None:
nn.init.constant_(m.bias, 0)
elif isinstance(m, LayerNorm):
nn.init.constant_(m.bias, 0)
nn.init.constant_(m.weight, 1.0)
def forward_features(self, x):
# patch embedding
x = self.patch_embed(x)
cls_token = self.cls_token.expand(
x.shape[0], -1, -1
) # stole cls_tokens impl from Phil Wang, thanks
cls_token = cls_token.to_global(sbp=x.sbp, placement=cls_token.placement)
x = flow.cat((cls_token, x), dim=1)
# position embedding
pos_embed = self.pos_embed.expand(x.shape[0], -1, -1)
pos_embed = pos_embed.to_global(sbp=x.sbp, placement=pos_embed.placement)
x = self.pos_drop(x + pos_embed)
# transformer block
x = self.blocks(x)
return x
def forward_head(self, x):
x = self.norm(x)
outcome = x[:, 0]
outcome = self.head(outcome)
return outcome
def forward(self, images, labels=None):
x = self.forward_features(images)
x = self.forward_head(x)
if labels is not None and self.training:
losses = self.loss_func(x, labels)
return {"losses": losses}
else:
return {"prediction_scores": x}
@staticmethod
def set_pipeline_stage_id(model):
dist_utils = dist.get_dist_util()
# Set pipeline parallelism stage_id
for module_block in model.modules():
if isinstance(module_block.to(nn.Module), PatchEmbedding):
module_block.to(nn.graph.GraphModule).set_stage(dist_utils.get_layer_stage_id(0))
elif isinstance(module_block.to(nn.Module), TransformerLayer):
module_block.to(nn.graph.GraphModule).set_stage(dist_utils.get_layer_stage_id(module_block.layer_idx))
# Set pos_embed and cls_token stage id
model.pos_embed.to(nn.graph.GraphModule).set_stage(dist_utils.get_layer_stage_id(0))
model.cls_token.to(nn.graph.GraphModule).set_stage(dist_utils.get_layer_stage_id(0))
model.pos_drop.to(nn.graph.GraphModule).set_stage(dist_utils.get_layer_stage_id(0))
model.norm.to(nn.graph.GraphModule).set_stage(dist_utils.get_layer_stage_id(-1))
model.head.to(nn.graph.GraphModule).set_stage(dist_utils.get_layer_stage_id(-1))
model.loss_func.to(nn.graph.GraphModule).set_stage(dist_utils.get_layer_stage_id(-1))
```
## Details about LiBai's implementation of the Vision Transformer model
**1. Replace nn.Module with libai.layers**
LiBai has already implemented the `PatchEmbedding`, `TransformerLayer`, `Linear`, and `LayerNorm` layers, so users only need to replace the corresponding modules in the Torch Vision Transformer model to convert it into LiBai's style:
- `Block` -> `libai.layers.TransformerLayer`
- `nn.Linear` -> `libai.layers.Linear`
- `nn.LayerNorm` -> `libai.layers.LayerNorm`
- `PatchEmbed` -> `libai.layers.PatchEmbedding`
**2. Manually set the SBP signature of `cls_token` and `pos_embed`**
In order to fit different parallel modes in LiBai, users must manually set the [SBP signature](https://docs.oneflow.org/en/master/parallelism/02_sbp.html#spb-signature) for all the parameters and buffers of those layers not implemented in LiBai, like `cls_token` and `pos_embed` in Vision Transformer:
```python
import oneflow as flow
import oneflow.nn as nn
import libai.utils.distributed as dist
self.cls_token = nn.Parameter(
flow.zeros(
1, 1, embed_dim,
sbp=dist.get_nd_sbp([flow.sbp.broadcast, flow.sbp.broadcast]),
placement=dist.get_layer_placement(0),
)
)
self.pos_embed = nn.Parameter(
flow.zeros(
1, num_patches+1, embed_dim,
sbp=dist.get_nd_sbp([flow.sbp.broadcast, flow.sbp.broadcast]),
placement=dist.get_layer_placement(0),)
)
```
- The SBP signature returned by `dist.get_nd_sbp([flow.sbp.broadcast, flow.sbp.broadcast])` means to broadcast `cls_token` and `pos_embed` across each GPU group.
**3. Use the `to_global()` function to update the SBP signatures of `cls_token` and `pos_embed` during the forward pass**
In the forward function, `cls_token` and `pos_embed` will be expanded to fit the input size. For efficiency, we can use the `to_global()` function to match the SBP signatures of `cls_token` and `pos_embed` with the input's SBP signature like this:
```python
def forward_features(self, x):
cls_token = self.cls_token.expand(
x.shape[0], -1, -1
)
# use to_global to update the sbp signature of cls_token
cls_token = cls_token.to_global(sbp=x.sbp, placement=cls_token.placement)
x = flow.cat((cls_token, x), dim=1)
# use to_global to update the sbp signature of pos_embed
pos_embed = self.pos_embed.expand(x.shape[0], -1, -1)
pos_embed = pos_embed.to_global(sbp=x.sbp, placement=pos_embed.placement)
```
**4. Manually set the stage id for pipeline parallel training**
Most of the built-in layers in LiBai have an argument named `layer_idx` for pipeline-parallel settings. To configure a **1F1B pipeline parallel** model, users should manually set the stage id for each layer in the model, which will automatically assign different layers to different stages and insert buffers into the forward & backward computation for 1F1B pipeline-parallel training. With the help of `layer_idx`, we can simply get a pipeline-parallel Vision Transformer model like:
```python
import libai.utils.distributed as dist
"""
This is a staticmethod for class inherited from nn.Module,
"""
@staticmethod
def set_pipeline_stage_id(model):
dist_utils = dist.get_dist_util()
# Set pipeline parallelism stage_id
for module_block in model.modules():
# module_block.to(nn.Module) can get the original module
if isinstance(module_block.to(nn.Module), PatchEmbedding):
module_block.to(nn.graph.GraphModule).set_stage(dist_utils.get_layer_stage_id(0))
elif isinstance(module_block.to(nn.Module), TransformerLayer):
module_block.to(nn.graph.GraphModule).set_stage(dist_utils.get_layer_stage_id(module_block.layer_idx))
# Set pos_embed and cls_token stage id
    model.pos_embed.to(nn.graph.GraphModule).set_stage(dist_utils.get_layer_stage_id(0))
model.cls_token.to(nn.graph.GraphModule).set_stage(dist_utils.get_layer_stage_id(0))
model.pos_drop.to(nn.graph.GraphModule).set_stage(dist_utils.get_layer_stage_id(0))
model.norm.to(nn.graph.GraphModule).set_stage(dist_utils.get_layer_stage_id(-1))
model.head.to(nn.graph.GraphModule).set_stage(dist_utils.get_layer_stage_id(-1))
model.loss_func.to(nn.graph.GraphModule).set_stage(dist_utils.get_layer_stage_id(-1))
```
Manually set the stage id:
- `PatchEmbedding` should be on the first stage
- Automatically assign the stage for each `TransformerLayer` with the `layer_idx` argument
- `cls_token`, `pos_embed`, `pos_drop` should be on the first stage
- `norm`, `head` and `loss_func` should be on the last stage
Please see [Write your own pipeline parallel model](https://libai.readthedocs.io/en/latest/tutorials/advanced_tutorials/customize_parallel.html#write-your-own-pipeline-parallel-model) for more details about the settings of pipeline parallel training in LiBai.
# How to load pretrained model in LiBai
In this tutorial, we will introduce how to instantiate a pretrained OneFlow model.
## Steps
First, prepare the pretrained model weights, which can be in either `OneFlow` or `HuggingFace` format.
- `OneFlow`'s pretrained model weights are saved using `oneflow.save()` (see the sketch after this list).
- `Huggingface`'s pretrained model weights file (`pytorch_model.bin`) can be downloaded from https://huggingface.co/models.
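For the `OneFlow` format, a minimal sketch of saving a (local) state dict into the expected layout might look like the following; the model here is just a stand-in for your own trained model:
```python
# Sketch: save a trained model's weights into the expected folder layout
import oneflow as flow
import oneflow.nn as nn

model = nn.Linear(4, 4)  # stand-in for your trained LiBai model
flow.save(model.state_dict(), "path/to/pretrained_model_dir/oneflow_model")
```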
Second, prepare the config file.
> The config file is required when loading the `HuggingFace` model.
> `OneFlow`'s config file can be imported directly from `configs/common/models`.
- `Huggingface`'s config file (`config.json`) can be downloaded from https://huggingface.co/models.
Finally, the structure of the pretrained model folder should look like:
```bash
# OneFlow pretrained model
$ tree pretrained_model_dir
path/to/pretrained_model_dir/
└── oneflow_model
# Huggingface pretrained model
$ tree pretrained_model_dir
path/to/pretrained_model_dir/
├── pytorch_model.bin
└── config.json
```
## Start Loading
You can load a pretrained BERT as follows:
```python
import libai
from libai.models.utils import BertLoaderHuggerFace, BertLoaderLiBai
from libai.config.configs.common.models.bert import cfg
# load huggingface weight
loader = BertLoaderHuggerFace(
model=libai.models.BertModel,
libai_cfg=cfg,
pretrained_model_path='path/to/huggingface_pretrained_model_directory',
hidden_dropout_prob=0,
)
bert = loader.load()
# load libai weight
loader = BertLoaderLiBai(
model=libai.models.BertModel,
libai_cfg=cfg,
pretrained_model_path='path/to/libai_pretrained_model_directory',
hidden_dropout_prob=0,
)
bert = loader.load()
```
# Use Custom ModelLoader
## Model Loader for HuggerFace
If you want to define your own HuggingFace model loader, you can inherit from the base `ModelLoaderHuggerFace` class in `libai.models.utils.model_loader.base_loader`.
Then you need to overwrite the `_convert_state_dict` and `_load_config_from_json` methods to load HuggingFace's pretrained model in LiBai.
Finally, you need to set the `base_model_prefix_1` and `base_model_prefix_2` attributes, which represent the base model names for HuggingFace and LiBai respectively.
The following code shows how to write a custom `ModelLoaderHuggerFace`:
```python
from libai.models.utils import ModelLoaderHuggerFace
class ToyModelLoaderHuggerFace(ModelLoaderHuggerFace):
def __init__(self, model, libai_cfg, pretrained_model_path, **kwargs):
super().__init__(model, libai_cfg, pretrained_model_path, **kwargs)
"""NOTE: base_model_prefix_1 is ToyModel's prefix in Transformers.
base_model_prefix_2 is ToyModel's prefix in LiBai."""
self.base_model_prefix_1 = "toy_model"
self.base_model_prefix_2 = "toy_model"
def _convert_state_dict(self, flow_state_dict, cfg):
"""Convert state_dict's keys to match model.
Args:
flow_state_dict (OrderedDict): model state dict.
cfg (dict): model's default config dict in LiBai.
Returns:
OrderedDict: flow state dict.
"""
...
def _load_config_from_json(self, config_file):
"""load config from `config.json`, and update default config.
Args:
config_file (str): Path of config file.
"""
...
```
## Model Loader for LiBai
If you want to define your own LiBai model loader, you can inherit from the base `ModelLoaderLiBai` class in `libai.models.utils.model_loader.base_loader`.
You just need to set the `base_model_prefix_2` attribute to load LiBai's pretrained model.
The following code shows how to write a custom `ModelLoaderLiBai`:
```python
from libai.models.utils import ModelLoaderLiBai
class ToyModelLoaderLiBai(ModelLoaderLiBai):
def __init__(self, model, libai_cfg, pretrained_model_path, **kwargs):
super().__init__(model, libai_cfg, pretrained_model_path, **kwargs)
self.base_model_prefix_2 = "toy_model"
```
# Detailed instruction for using distributed inference in LiBai
If you want to use distributed inference in LiBai with a pretrained `pytorch` model, you can refer to the [DALLE2 inference doc](https://github.com/Oneflow-Inc/libai/blob/main/docs/source/notes/How_to_use_model_parallel_in_LiBai.md). A [Chinese doc for distributed inference](https://github.com/Oneflow-Inc/libai/discussions/386) is also available.
Here we introduce how to use distributed inference in LiBai:
## Check `model.py`
Check your `model.py` first:
1. Ensure there are `libai.layers` in your `model.py`:
```python
# NOTE: you don't need to import all layers from libai. If you only use libai.layers.Linear
# in your `model.py`, your model will run model/pipeline parallel only in the `Linear` layers
from libai.layers import (
Linear,
LayerNorm,
...
)
```
2. If you want to run pipeline parallel in LiBai, you should additionally insert code `x = x.to_global(placement=target_tensor.placement)` in your `model.forward()`.
It is equivalent to the torch code `x.to(cuda_device)`, which moves a tensor from one GPU to another. There are many examples in LiBai: [example1](https://github.com/Oneflow-Inc/libai/blob/92dbe7c1b1496290e32e595f8473f9288ea1886e/projects/MT5/layers/attention_layer.py#L220), [example2](https://github.com/Oneflow-Inc/libai/blob/92dbe7c1b1496290e32e595f8473f9288ea1886e/projects/MT5/layers/attention_layer.py#L156) ...
If you don't know where to insert the code, you can run your code first, and it will raise an error on the line that needs `to_global`.
For example:
```shell
File "libai/libai/layers/layer_norm.py", line 129, in forward
return flow._C.rms_layer_norm(hidden_states, self.weight, self.l2norm_epsilon) RuntimeError: return flow._C.rms_layer_norm(hidden_states, self.weight, self.l2norm_epsilon)RuntimeErrorExpected all tensors to be on the same placement, but found at least two placements, oneflow.placement(type="cuda", ranks=[0, 1]) (positional 0) and oneflow.placement(type="cuda", ranks=[2, 3]) (positional 1)!
```
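A minimal sketch of where such a `to_global` call could go, following the same pattern as the examples linked above (the layer itself is hypothetical):
```python
import oneflow.nn as nn
import libai.utils.distributed as dist
from libai.layers import Linear

class MyLayer(nn.Module):
    def __init__(self, hidden_size, *, layer_idx=0):
        super().__init__()
        self.layer_idx = layer_idx
        self.proj = Linear(hidden_size, hidden_size, layer_idx=layer_idx)

    def forward(self, hidden_states):
        # move the input to this pipeline stage's placement before using it
        hidden_states = hidden_states.to_global(
            placement=dist.get_layer_placement(self.layer_idx)
        )
        return self.proj(hidden_states)
```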
## Build `config.py`
If your model is trained with LiBai, you can use the same `config.py` as in training; refer to [Couplets](https://github.com/Oneflow-Inc/libai/tree/main/projects/Couplets#inference) for more details.
If your model is trained with another framework, you should build your own `inference_config.py`; you can refer to [`dalle2_config.py`](https://github.com/Oneflow-Inc/libai/blob/main/projects/DALLE2/configs/dalle2_config.py) and [`t5_inference_config.py`](https://github.com/Oneflow-Inc/libai/blob/main/projects/MT5/configs/t5_inference.py).
## Refine `pipeline_inference.py`
The base class [libai/inference/basic.py](https://github.com/Oneflow-Inc/libai/blob/main/libai/inference/basic.py) is already provided in `LiBai`.
Users only need to overload the functions they need; refer to [text_generation.py](https://github.com/Oneflow-Inc/libai/blob/main/libai/inference/text_generation.py).
If your model is trained with `LiBai`, it is easy to use; you can refer to [distribute_infer.py](https://github.com/Oneflow-Inc/libai/blob/main/projects/Couplets/distribute_infer.py) in [Couplets](https://github.com/Oneflow-Inc/libai/tree/main/projects/Couplets).
If your model is trained with another framework, you need to build your own `model_loader` to load your model weights in LiBai; refer to [model_loader](https://libai.readthedocs.io/en/latest/notes/How_to_load_huggingface%27s_pretrained_model_in_libai.html) for more details.
Here is a simple example of overloading these functions in `LiBai`:
```python
from libai.inference.basic import BasePipeline
from libai.utils import distributed as dist
class MyPipeline(BasePipeline):
def _parse_parameters(self, **pipeline_parameters):
        # By overloading this function, the input parameters of MyPipeline.__call__() are dispatched to the preprocess/forward/postprocess stages of inference.
preprocess_params = {
"preprocess_param1": pipeline_parameters["preprocess_param1"],
"preprocess_param2": pipeline_parameters["preprocess_param2"],
}
forward_params = {
"forward_param": pipeline_parameters["forward_param"]
}
postprocess_params = {
"postprocess_param": pipeline_parameters["postprocess_param"]
}
return preprocess_params, forward_params, postprocess_params
def load_pretrain_weight(self, libai_cfg_model, model_path, mode="myloader"):
        # load your pretrained weights in this function
        # use your own "MyLoader" if your model is pretrained with another framework
        # set mode to "libai" if your model is pretrained with libai
if mode == "myloader":
import MyLoader
model_loader = MyLoader(
libai_cfg_model,
libai_cfg_model.cfg,
model_path,
...,
)
return model_loader.load()
else:
return super().load_pretrain_weight(
libai_cfg_model,
model_path,
mode=mode,
)
def preprocess(self, inputs, preprocess_param1, preprocess_param2, **kwargs):
...
# model_input_dict: {"key1": flow.Tensor1, ...}
return model_input_dict
def forward(self, model_input_dict, forward_param, **kwargs):
...
model_output_dict = self.model(**model_input_dict)
return model_output_dict
def postprocess(self, model_output_dict, postprocess_param, **kwargs):
...
return out_dict
if __name__ == "__main__":
pipeline = MyPipeline(
"path/to/myconfig.py",
data_parallel=1,
tensor_parallel=...,
pipeline_parallel=...,
pipeline_stage_id=...,
pipeline_num_layers=...,
model_path=...,
mode=...,
)
out = pipeline(
input_text=...,
preprocess_param1=...,
preprocess_param2=...,
forward_param=...,
postprocess_param=...,
)
if dist.is_main_process():
print(out)
```
## Distributed run `pipeline_inference.py`
To run a model on 2 nodes with 4 GPUs in total:
On `node0`, run:
```bash
NODE=2 NODE_RANK=0 ADDR=192.168.0.1 PORT=12345 bash tools/infer.sh pipeline_inference.py 2
```
`NODE=2` means the total number of nodes
`NODE_RANK=0` means the current node is node0
`ADDR=192.168.0.1` means the IP address of node0
`PORT=12345` means the port of node0
On `node1`, run:
```bash
NODE=2 NODE_RANK=1 ADDR=192.168.0.1 PORT=12345 bash tools/infer.sh pipeline_inference.py 2
```
`NODE=2` means the total number of nodes
`NODE_RANK=1` means the current node is node1
`ADDR=192.168.0.1` means the IP address of node0
`PORT=12345` means the port of node0
# How to use Huggingface's pretrained weights in LiBai
The built-in layers in [LiBai](https://github.com/Oneflow-Inc/libai) adopt a structure that is more suitable for parallel training, so the implementation in LiBai may be slightly different from that in Huggingface. In this tutorial, we will introduce how to correctly load Huggingface's pretrained weights into LiBai's model. Let's take BERT as an example.
## LiBai Transformer vs Huggingface Transformer
There are subtle differences in the BERT structure as shown in the following figure (left: LiBai, right: Huggingface), which can be summarized as:
- Location of layernorm: The location of layernorm is different, but the calculation order is the same.
- A different slicing way to get the `query`, `key` and `value` matrix.
- LiBai follows [Megatron-LM](https://github.com/NVIDIA/Megatron-LM) to use the order of the layernorm and the residual connections by default. Megatron-LM shows that this structure will eliminate instabilities and bring a lower training loss. LiBai can also support the original BERT architecture mentioned in [Paper](https://arxiv.org/pdf/1810.04805.pdf) by setting `apply_residual_post_layernorm=True`.
![architecture](./assets/architecture.jpg)
## QKV slicing logic
LiBai's QKV slicing logic is different from that in Huggingface.
```python
# LiBai's QKV slicing logic
query_key_value = query_key_value.view(batch_size, -1, num_heads, 3 * head_size)
query_key_value = query_key_value.permute(0, 2, 1, 3)
query, key, value = flow.chunk(query_key_value, chunks=3, dim=-1)
# Huggingface's QKV slicing logic
query, key, value = flow.chunk(query_key_value, chunks=3, dim=-1)
query = query.view(query.size(0), query.size(1), num_heads, -1).permute(0, 2, 1, 3)
key = key.view(key.size(0), key.size(1), num_heads, -1).permute(0, 2, 1, 3)
value = value.view(value.size(0), value.size(1), num_heads, -1).permute(0, 2, 1, 3)
```
## How to correctly load QKV weights
- To correctly load Huggingface's transformer weights, you only need to rearrange the loaded weights as follows:
```python
def convert_qkv_weight(cfg, qkv_weight, qkv_bias):
qkv_weight = qkv_weight.view([3, cfg.num_heads, cfg.head_size, cfg.hidden_size])
qkv_weight = qkv_weight.permute(1, 0, 2, 3).contiguous().view(3*cfg.hidden_size, cfg.hidden_size)
qkv_bias = qkv_bias.view(3, cfg.num_heads, cfg.head_size)
qkv_bias = qkv_bias.permute(1,0,2).contiguous().view(-1)
return qkv_weight, qkv_bias
```
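- The sketch below shows one way this helper might be used: HuggingFace's separate `query`/`key`/`value` projection weights are concatenated and then rearranged into LiBai's interleaved layout (the concatenation order, the `cfg` fields, and the dummy tensors here are assumptions for illustration):
```python
# Illustrative usage of convert_qkv_weight (cfg fields and tensors are dummies)
import oneflow as flow
from omegaconf import DictConfig

cfg = DictConfig(dict(num_heads=12, head_size=64, hidden_size=768))
q_w, k_w, v_w = (flow.randn(cfg.hidden_size, cfg.hidden_size) for _ in range(3))
q_b, k_b, v_b = (flow.randn(cfg.hidden_size) for _ in range(3))

qkv_weight = flow.cat([q_w, k_w, v_w], dim=0)  # [3 * hidden_size, hidden_size]
qkv_bias = flow.cat([q_b, k_b, v_b], dim=0)    # [3 * hidden_size]
qkv_weight, qkv_bias = convert_qkv_weight(cfg, qkv_weight, qkv_bias)
```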
- For detailed examples, please refer to [load-huggingface-bert](https://github.com/Oneflow-Inc/libai/tree/test_bert_load_huggingface_weight/projects/test_bert_load_huggingface_weight). You can verify this by running:
```bash
bash test.sh
```
# Detailed instruction on using model parallel in LiBai
This document is a tutorial on how to transfer a pytorch model to oneflow and use model parallelism in LiBai for inference. We will first take the DALLE2 model as an example, and then show how model parallelism can be easily enabled in LiBai.
**Note**: the code of DALLE2 is adapted from [this repo](https://github.com/lucidrains/DALLE2-pytorch), which is an unofficial implementation. The final result may differ from the original generated images in the [paper](https://arxiv.org/abs/2204.06125). You can also try the model in [google colab](https://colab.research.google.com/github/LAION-AI/dalle2-laion/blob/main/notebooks/dalle2_laion_alpha.ipynb).
## Transfer the pytorch model to oneflow.
It's easy for users to transfer a pytorch model to oneflow, since most of oneflow's API is consistent with pytorch. First we change `import torch` to `import oneflow as flow`, and then we can replace all `torch` in the code with `flow`. If the model works correctly in the original pytorch code, it's likely to work correctly in oneflow. Sometimes the program may raise an error like
```
AttributeError: module 'oneflow' has no attribute 'xxx'
```
Try installing the latest version of oneflow, which might help; you can find more details [here](https://github.com/Oneflow-Inc/oneflow#install-oneflow).
**1、Download the pytorch DALLE2 model**:
As shown in the [google colab](https://colab.research.google.com/github/LAION-AI/dalle2-laion/blob/main/notebooks/dalle2_laion_alpha.ipynb), we will use version 0.15.4:
```
git clone -b v0.15.4 https://github.com/lucidrains/DALLE2-pytorch.git
```
The pretrained model weights can be found on huggingface: [the prior weight](https://huggingface.co/zenglishuci/conditioned-prior/resolve/main/vit-l-14/prior_aes_finetune.pth) and [the decoder weight](https://huggingface.co/laion/DALLE2-PyTorch/resolve/main/decoder/1.5B_laion2B/latest.pth).
A simple inference script can be written as:
```python
# inference_dalle2.py
import numpy as np
import torch
import os,sys
from dalle2_pytorch import tokenizer
from dalle2_pytorch import OpenAIClipAdapter, DALLE2, DiffusionPriorNetwork, DiffusionPrior, Unet, Decoder
def generate_images_from_text(text):
clip=OpenAIClipAdapter("ViT-L-14.pt").to("cuda")
tokens = tokenizer.tokenize(text).to("cuda")
_, text_encodings, text_mask = clip.embed_text(tokens)
prior_network = DiffusionPriorNetwork(
dim = 768,
depth = 24,
num_timesteps = 1000,
num_time_embeds = 1,
num_image_embeds=1,
num_text_embeds = 1,
dim_head = 64,
heads = 32,
ff_mult = 4,
attn_dropout = 0.05,
ff_dropout = 0.05,
normformer = True,
)
diffusion_prior = DiffusionPrior(
net = prior_network,
clip = clip,
image_embed_dim = 768,
timesteps = 1000,
cond_drop_prob = 0.1,
loss_type="l2",
condition_on_text_encodings = True
)
state_dict = torch.load("prior_aes_finetune.pth", map_location="cpu")['ema_model']
diffusion_prior.load_state_dict(state_dict, strict=True)
diffusion_prior.to("cuda")
image_embed = diffusion_prior.sample(tokens, num_samples_per_batch = 2, cond_scale = 1.)
unet = Unet(
dim = 320,
image_embed_dim = 768,
text_embed_dim = 768,
cond_dim = 512,
channels = 3,
dim_mults=(1, 2, 3, 4),
num_resnet_blocks = 4,
attn_heads = 8,
attn_dim_head = 64,
sparse_attn = True,
memory_efficient = True,
cond_on_text_encodings = True, # set to True for any unets that need to be conditioned on text encodings
self_attn = [False, True, True, True]
)
decoder = Decoder(
unet = (unet,),
image_sizes = [64],
clip = clip,
channels = 3,
timesteps = 1000,
loss_type = "l2",
beta_schedule = ["cosine"],
learned_variance = True
)
state_dict = torch.load("latest.pth", map_location = "cpu")
new_dict = {}
for k,v in state_dict.items():
if 'clip.' in k: continue
if ('cross_attn' in k or 'fn.fn.' in k) and k.endswith(".g"):
k = k[:-1] + "gamma"
new_dict[k] = v
assert k in decoder.state_dict().keys(), k
decoder.load_state_dict(new_dict, strict=False)
decoder.to("cuda")
images = decoder.sample(image_embed = image_embed, text_encodings = text_encodings, text_mask = text_mask, cond_scale = 3.5)
return images
def save_images(images):
import torchvision.transforms as T
to_pil = T.ToPILImage()
images = list(map(to_pil,images.unbind(dim = 0)))
for i,image in enumerate(images):
image.save(f"./result_{i}.png")
def main():
text = ["a dolphin in an astronaut suit on saturn, artstation"]
    images = generate_images_from_text(text)
save_images(images)
if __name__ == "__main__":
main()
```
Run `python inference_dalle2.py`; this should work.
## 2、Change the deep learning framework to oneflow.
As mentioned above, we replace every `torch` symbol with `flow` by first changing `import torch` to `import oneflow as flow` in all python files.
It should be noted that the original pytorch code also imports other python packages using the pytorch backend, like [einops](https://github.com/arogozhnikov/einops), [einops_ext](https://github.com/lucidrains/einops-exts), [kornia](https://github.com/kornia/kornia), etc., which should also be modified at the same time.
Fortunately, only a few APIs of these packages are used, so we can take the relevant code out of the github repos and merge it into a separate file.
For example, we can simply create an einops_ext.py file adapted from [here](https://github.com/lucidrains/einops-exts/blob/main/einops_exts/einops_exts.py), and then import einops_ext from the python file that uses oneflow instead of the torch-based python package.
```python
# einops_ext.py
import re
from oneflow import nn  # here change `from torch import nn` to `from oneflow import nn`
from functools import wraps, partial
from einops import rearrange, reduce, repeat
```
## 3、Using LiBai's API.
[LiBai](https://github.com/Oneflow-Inc/libai) is a large-scale open-source model training toolbox based on OneFlow.
LiBai provides many efficient APIs which can be easily used for distributed training and evaluation. It also supports some popular models under the projects folder, such as [CLIP](https://github.com/Oneflow-Inc/libai/tree/main/projects/CLIP). To avoid duplicating work, we directly use the CLIP model implemented in LiBai. The relevant code in the original pytorch code is the `OpenAIClipAdapter` class, which can be written as follows:
```python
# _clip.py
import os
import sys
import oneflow as flow
import oneflow.nn.functional as F
from oneflow import nn
from collections import namedtuple
def import_flow_clip(fn):
def wrapper(*args, **kwargs):
sys.path.append(os.path.join(os.path.abspath(os.path.join(os.path.dirname(__file__), "../..")), "CLIP"))
fn(*args, **kwargs)
sys.path.pop()
return wrapper
class BaseClipAdapter(nn.Module):
pass
class OpenAIClipAdapter(BaseClipAdapter):
@import_flow_clip
def __init__(
self,
name = 'ViT-L/14'
):
import clip
openai_clip, preprocess = clip.load(name)
super().__init__(openai_clip)
```
[DiffusionPrior](https://github.com/lucidrains/DALLE2-pytorch/blob/v0.15.4/dalle2_pytorch/dalle2_pytorch.py#L873) and [Decoder](https://github.com/lucidrains/DALLE2-pytorch/blob/v0.15.4/dalle2_pytorch/dalle2_pytorch.py#L1802) follow their original implementation.
**Using libai.layers**
LiBai provides multiple parallelisms such as Data Parallelism, Tensor Parallelism, and Pipeline Parallelism. To experience these features, we will use libai.layers like Linear and LayerNorm:
```python
from libai.layers import Linear, LayerNorm
```
The `nn.Linear` layers will be replaced with `libai.layers.Linear`.
**Compare the outputs** To make sure the code is correctly modified from `torch` to `flow`, it's necessary to compare the outputs to see if they are the same after the change. A notable point here is that in the sampling stage, the noise is randomly generated, like
```python
noise = flow.randn(shape)
# or noise = torch.randn(shape) in torch code
```
torch and oneflow will generate different numbers here even if they are given the same random seed. An alternative is to make the transition through numpy:
```python
import numpy as np
np.random.seed(6666)
noise = flow.tensor(np.random.randn(*shape))
# or noise = torch.tensor(np.random.randn(*shape)) in torch code
```
When the model is fed the same input text, the output images produced by the oneflow and torch code should be the same.
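A small helper like the one below (just a sketch; the tolerance is arbitrary) can be handy for comparing intermediate tensors from the two implementations:
```python
import numpy as np

def assert_allclose(flow_tensor, torch_tensor, atol=1e-4):
    """Compare a oneflow tensor with a torch tensor by converting both to numpy."""
    a = flow_tensor.numpy()
    b = torch_tensor.detach().cpu().numpy()
    assert np.allclose(a, b, atol=atol), f"max abs diff: {np.abs(a - b).max()}"
```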
**LazyConfig and LazyCall**
LiBai provides the LazyConfig system for more flexible syntax without predefined structures; find out more [here](https://libai.readthedocs.io/en/latest/tutorials/basics/Config_System.html). As for DALLE2, the config file can be written as:
```python
from omegaconf import DictConfig
from libai.config import LazyCall
from dalle2.models import DiffusionPrior, DiffusionPriorNetwork, Unet, Decoder, DALLE2
from dalle2._clip import OpenAIClipAdapter
clip = LazyCall(OpenAIClipAdapter)(name="./dalle2/model_weights/ViT-L-14.pt")
prior = LazyCall(DiffusionPrior)(
net = LazyCall(DiffusionPriorNetwork)(
dim=768,
depth=24,
num_timesteps=1000,
max_text_len=77,
num_time_embeds=1,
num_image_embeds=1,
num_text_embeds=1,
dim_head=64,
heads=32,
ff_mult=4,
attn_dropout=0.05,
ff_dropout=0.05,
normformer=True,
),
clip=clip,
image_embed_dim=768,
timesteps=1000,
cond_drop_prob=0.1,
loss_type="l2",
condition_on_text_encodings=True
)
unet1 = LazyCall(Unet)(
dim=320,
image_embed_dim=768,
text_embed_dim=768,
cond_dim=512,
channels=3,
dim_mults=(1, 2, 3, 4),
num_resnet_blocks=4,
attn_heads=8,
attn_dim_head=64,
sparse_attn=True,
memory_efficient=True,
cond_on_text_encodings=True, # set to True for any unets that need to be conditioned on text encodings
self_attn=[False, True, True, True]
)
decoder = LazyCall(Decoder)(
unet=(unet1,),
image_sizes=[64, ],
clip=None,
channels=3,
timesteps=1000,
loss_type="l2",
beta_schedule=["cosine"],
learned_variance=True
)
dalle2_model = LazyCall(DALLE2)(
prior=prior,
decoder=decoder,
prior_weight_path='',
decoder_weight_path=''
)
```
## 4、Model parallel in LiBai.
In order to achieve model-parallel inference with LiBai, we should set the parallel mode according to our needs. The default value of the `parallel` argument in `libai.layers.Linear` is `data`, which means data parallel. To achieve model parallel, we need to change `parallel` to `col` or `row`. The most efficient way is to arrange the Linear layers in each submodule in the col -> row order.
A transformer block contains an attention submodule and a feedforward submodule, and each submodule contains exactly 2 Linear layers.
The attention module contains the qkv projection and the output projection. Thus we set the qkv projection to `col` and the output projection to `row`:
```python
#attention
class Attention(nn.Module):
def __init__(self, *args, **kwargs):
super().__init__()
# 1、 qkv projection
self.to_q = Linear(dim, inner_dim, bias = False, parallel='col')
self.to_kv = Linear(dim, dim_head * 2, bias = False, parallel='col')
#2、 output projection
self.to_out = nn.Sequential(
Linear(inner_dim, dim, bias = False, parallel='row'), #'row'
LayerNorm(dim)
)
```
The feedforward submodule contains an input projection and an output projection; the former is set to `col` and the latter to `row`.
```python
def FeedForward(
dim,
mult = 4,
dropout = 0.,
post_activation_norm = False
):
inner_dim = int(mult * dim)
return nn.Sequential(
LayerNorm(dim),
Linear(dim, inner_dim * 2, bias = False, parallel='col'),
SwiGLU(),
LayerNorm(inner_dim) if post_activation_norm else nn.Identity(),
nn.Dropout(dropout),
Linear(inner_dim, dim, bias = False, parallel='row')
)
```
For a single machine with 4 GPUs, model parallelism can be configured like:
```python
import libai.utils.distributed as dist
dist.setup_dist_util(
DictConfig(
dict(
data_parallel_size=1,
tensor_parallel_size=4,
pipeline_parallel_size=1,
)
)
)
```
If you successfully complete the above steps, you can now have fun with the (unofficial) DALLE2 model.
Notes
=============
.. toctree::
:glob:
:maxdepth: 2
How_to_use_huggingface's_weights_in_LiBai.md
How_to_build_vision_transformer_model_in_LiBai.md
How_to_load_huggingface's_pretrained_model_in_libai.md
How_to_use_model_parallel_in_LiBai.md
How_to_use_distributed_inference_in_LiBai.md
FAQ.md
# How to Customize Dataloader
Dataloader is the component that provides data to models. A dataloader usually (but not necessarily) takes raw information from datasets (see [write dataloaders](https://libai.readthedocs.io/en/latest/tutorials/basics/Write_Dataloaders.html)) and processes it into the format needed by the model.
## How the Existing Dataloader Works
LiBai contains a built-in data loading pipeline. It's beneficial to understand how it works, in case you need to write a custom one.
LiBai provides some functions [build_{image,nlp}_{train,test}_loader](https://libai.readthedocs.io/en/latest/modules/libai.data.html#libai.data.build.build_nlp_train_loader) that create a default dataloader from a given config. Here is how `build_{image,nlp}_{train,test}_loader` work:
1. It instantiates the `list[flow.utils.Dataset]` (e.g., `BertDataset`) by loading some dataset items in a lightweight format. These dataset items are not yet ready to be used by the model (e.g., images are not loaded into memory, random augmentations have not been applied, etc.).
2. The output format of the dataset (`__getitem__(...)`) must be a dict whose keys are consistent with the argument names of the dataloader's consumer (usually `model.forward(...)`). The role of this step is to transform the lightweight representation of a dataset item into a format that is ready for the model to consume (including, e.g., reading images, performing random data augmentation, and converting to oneflow Tensors). If you would like to perform custom transformations on the data, you often want to rewrite it. Details about the dataset format can be found in [write dataloaders](https://libai.readthedocs.io/en/latest/tutorials/basics/Write_Dataloaders.html).
3. The outputs of the dataset are simply batched with the following function.
```python
def trivial_batch_collator(batch):
assert isinstance(batch[0], Instance), "batch[0] must be `instance` for trivial batch collator"
batch = Instance.stack(batch)
return batch
```
4. This batched data is the output of the dataloader. Typically, it's also the input of `get_batch`. After `get_batch(...)`, it becomes the input of `model.forward()`. `get_batch` simply changes the local tensors to global tensors with the given `sbp` and `placement` meta information.
```python
@classmethod
def get_batch(cls, data, mixup_func = None):
...
ret_dict = {}
for key, value in data.get_fields().items():
value.to_global()
ret_dict[key] = value.tensor
return ret_dict
```
## Use Custom Dataloader
If you use `DefaultTrainer`, you can overwrite its `build_train_loader` method to use your own dataloader, which can be implemented with any tools you like. But you need to make sure that each rank reads the data correctly under different parallelism settings.
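A minimal sketch of what that override might look like (the `build_my_dataloader` helper is hypothetical, and the return value should match whatever your trainer expects):
```python
from libai.engine import DefaultTrainer

class MyTrainer(DefaultTrainer):
    @classmethod
    def build_train_loader(cls, cfg):
        # Return your own dataloader(s) here; make sure every rank reads the
        # correct shard of data under the current parallelism setting.
        return build_my_dataloader(cfg)  # hypothetical builder
```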
Then you need to overwrite the `get_batch` method. The `data` argument in `get_batch` is the output of your dataloader. You need to change the local tensors to global tensors manually, which means you should set the `sbp` and `placement` correctly.
Here is an example where the rank0 process gets all the data and redistributes it to the other ranks.
```python
@classmethod
def get_batch(cls, data, mixup_func=None):
if data is None:
# not rank0, set placeholders for data
# Note: make sure imgs and labels have the same shape and dtype on all ranks
imgs = flow.empty(16, 3, 224, 224, dtype=flow.float32)
labels = flow.empty(16, dtype=flow.int64)
else:
# rank0
imgs, labels = data
dist.synchronize()
    imgs = imgs.to_global(sbp=flow.sbp.broadcast, placement=flow.env.all_device_placement("cuda"))
    imgs = imgs.to_global(
        sbp=dist.get_nd_sbp([flow.sbp.split(0),
                             flow.sbp.broadcast]),
        placement=dist.get_layer_placement(0))
    labels = labels.to_global(sbp=flow.sbp.broadcast, placement=flow.env.all_device_placement("cuda"))
    labels = labels.to_global(
        sbp=dist.get_nd_sbp([flow.sbp.split(0),
                             flow.sbp.broadcast]),
        placement=dist.get_layer_placement(-1))
return {
"images": imgs,
"labels": labels
}
```
# How to Customize Parallelism
Common parallelism strategies have already been implemented in LiBai, such as data parallel, tensor parallel, and pipeline parallel. But there is also a need for user-customized parallelism. In this tutorial, we will show you how to customize your own parallelism.
## Define your own Parallel Model with LiBai.layers
### Large-scale FC
Suppose you have a huge fully-connected layer for large-scale classification (e.g., 10 million classes), which makes it impossible to fit into a single GPU.
Don't worry: with the help of `LiBai.layers`, you can write models in the same familiar way that you would for a single GPU. Here is a simple example showing how to write a tensor-parallel fully-connected layer on 2 GPUs.
```python
# huge_fc_example.py
import oneflow as flow
from omegaconf import DictConfig
from oneflow import nn
from libai.layers import Linear
from libai.utils import distributed as dist
cfg = DictConfig(dict(data_parallel_size=1, tensor_parallel_size=2, pipeline_parallel_size=1))
dist.setup_dist_util(cfg)
class Huge_FC(nn.Module):
def __init__(self):
super().__init__()
self.fc = Linear(2048, 32768, parallel="col")
def forward(self, x):
return self.fc(x)
huge_fc = Huge_FC()
x = flow.rand(32, 2048, sbp=flow.sbp.broadcast, placement=flow.placement("cuda", ranks=[0, 1]))
y = huge_fc(x)
print(f"rank: {flow.env.get_rank()}, tensor shape: {y.to_local().shape}")
```
You can run this toy example with command line as follows:
```shell
python3 -m oneflow.distributed.launch --nproc_per_node 2 huge_fc_example.py
>> rank: 0, tensor shape: oneflow.Size([32, 16384])
>> rank: 1, tensor shape: oneflow.Size([32, 16384])
```
From the result, you can see that `y` has been split along `axis=1` across the 2 GPUs.
### Large MLP models
Suppose you have a huge MLP model which is very popular in transformer-based models, with a huge hidden size that makes it difficult to fit into a single GPU.
You can then split the model weights across GPUs in a hybrid parallel mode while you can still write your model in a familiar way.
Here is a simple example of a 2D parallel MLP in the LiBai context.
```python
import oneflow as flow
from omegaconf import DictConfig
from oneflow import nn
from libai.layers import Linear
from libai.utils import distributed as dist
cfg = DictConfig(dict(data_parallel_size=2, tensor_parallel_size=2, pipeline_parallel_size=1))
dist.setup_dist_util(cfg)
# Write a Simple 2D Parallel MLP
class MLP_2D(nn.Module):
def __init__(self):
super().__init__()
self.linear_1 = Linear(in_features=1024, out_features=16384, parallel="col")
self.relu = nn.GELU()
self.linear_2 = Linear(in_features=16384, out_features=1024, parallel="row")
def forward(self, x):
x = self.linear_1(x)
x = self.relu(x)
x = self.linear_2(x)
return x
# define a model
mlp = MLP_2D()
# define input with 2D sbp
x = flow.rand(
32,
1024,
sbp=dist.get_nd_sbp([flow.sbp.split(0), flow.sbp.broadcast]),
placement=dist.get_layer_placement(0)
)
y = mlp(x)
print(f"rank: {flow.env.get_rank()}, tensor shape: {y.to_local().shape}")
```
You can run it with command line as follows:
```shell
python3 -m oneflow.distributed.launch --nproc_per_node 4 huge_mlp_example.py
>> rank: 2, tensor shape: oneflow.Size([16, 1024])
>> rank: 3, tensor shape: oneflow.Size([16, 1024])
>> rank: 1, tensor shape: oneflow.Size([16, 1024])
>> rank: 0, tensor shape: oneflow.Size([16, 1024])
```
From the above, you can see that the data are split into 2 groups for data parallelism, and the weights are split into 2 groups for tensor model parallelism. So this simple example implements 2D parallelism.
For your convenience, we provide some prevalent models such as BERT, GPT-2, and ViT in the Model Zoo. Feel free to customize them into different sizes to fit your specific needs.
## Write your own Pipeline Parallel Model
This tutorial describes how to use pipeline parallelism in your own model. LiBai has two pipeline-parallel modes: naive pipeline parallel and a 1F1B pipeline parallel similar to the one introduced by [Megatron-LM](https://arxiv.org/abs/1909.08053).
### Introduction of Naive Pipeline Parallel
In LiBai, naive pipeline parallel can be implemented by setting the `placement` of layers and parameters.
You can easily configure their `placement` with `dist.get_layer_placement(idx)`.
Here is an example for `placement` configuration.
```python
# set a free tensor placement to first stage
self.pos_embed = nn.Parameter(
flow.zeros(
1,
num_patches + 1,
embed_dim,
sbp=dist.get_nd_sbp([flow.sbp.broadcast, flow.sbp.broadcast]),
placement=dist.get_layer_placement(0),
)
)
# set a Linear placement to last stage
# set it manually
self.head = Linear(embed_dim, num_classes, layer_idx=-1).to_global(placement=dist.get_layer_placement(-1))
# use `layer_idx` API
self.head = Linear(embed_dim, num_classes, layer_idx=-1)
```
After configuring the model's placement, add the input placement transition across different stages. LiBai sets a `layer_idx` attribute in each `nn.Module`, so you can simply add `to_global` in `forward` to implement the input placement transition.
```python
class MyModule(nn.Module):
def __init__(self, ... *, layer_idx):
...
self.layer_idx = layer_idx
...
def forward(self, hidden_states):
hidden_states = hidden_states.to_global(placement=dist.get_layer_placement(self.layer_idx))
...
```
After configuring models and data placement, you only need to set the distributed configuration before training.
```python
# set pipeline stages to 2
train.dist.pipeline_parallel_size = 2
# set model layers for pipeline
train.dist.pipeline_num_layers = hidden_layers
```
### Introduction of 1F1B Pipeline Parallel
First, we introduce GPipe to give you a better understanding of pipeline parallelism. In GPipe, the backward passes are executed only after the forward passes of all micro-batches finish (as shown below).
![gpipe](../assets/gpipe.png)
1F1B performs one forward pass followed by one backward pass. At the end of a batch, it completes the backward passes for all remaining in-flight micro-batches. In general, 1F1B is more efficient than GPipe.
There are two schedules of 1F1B pipeline: the non-interleaved and the interleaved. The figures are shown below.
![1f1b](../assets/1f1b.png)
In LiBai, the non-interleaved schedule is supported currently, and this mode is more memory-efficient than GPipe.
In addition to the placement configuration used in naive pipeline parallel, you only need to set the model's stage ids, which help create stashed buffers for activations.
This example shows how to configure the BERT model's stage ids:
```python
class BertForPreTraining(nn.Module):
def __init__(self, ...):
...
def forward(self, ...):
...
@staticmethod
def set_pipeline_stage_id(model):
dist_utils = dist.get_dist_util()
# Set pipeline parallelism stage_id
for module_block in model.modules():
# module_block.to(nn.Module) can get the original module
if isinstance(module_block.to(nn.Module), BertEmbeddings):
module_block.to(nn.graph.GraphModule).set_stage(dist_utils.get_layer_stage_id(0))
elif isinstance(module_block.to(nn.Module), BertExtendedAttnMask):
module_block.to(nn.graph.GraphModule).set_stage(dist_utils.get_layer_stage_id(0))
elif isinstance(module_block.to(nn.Module), TransformerLayer):
module_block.to(nn.graph.GraphModule).set_stage(dist_utils.get_layer_stage_id(module_block.layer_idx))
elif isinstance(module_block.to(nn.Module), BertPooler):
module_block.to(nn.graph.GraphModule).set_stage(dist_utils.get_layer_stage_id(-1))
elif isinstance(module_block.to(nn.Module), BertPreTrainingHeads):
module_block.to(nn.graph.GraphModule).set_stage(dist_utils.get_layer_stage_id(-1))
# Set the last layernorm stage id
model.bert.final_layernorm.config.stage_id = dist_utils.get_layer_stage_id(-1)
```
In `set_pipeline_stage_id`, `BertEmbeddings` and `BertExtendedAttnMask` are placed in the first stage, then the `TransformerLayer`s are distributed uniformly across the stages. Finally, `BertPooler` and `BertPreTrainingHeads` are placed in the last stage. But don't forget to also place the last `layernorm` in `BertEncoder`, which does not belong to any `TransformerLayer`, in the last stage.
After adding the `set_pipeline_stage_id` function in a pre-defined `nn.Module`, `GraphBase` will invoke it automatically as below:
```python
def set_pipeline_stage_id(self):
if hasattr(type(self.model.to(nn.Module)), "set_pipeline_stage_id"):
type(self.model.to(nn.Module)).set_pipeline_stage_id(self.model)
```
The last thing left is to set the training configuration as below:
```python
# set pipeline stages to 2
train.dist.pipeline_parallel_size = 2
# set model layers for pipeline
train.dist.pipeline_num_layers = hidden_layers
# enable activation checkpointing
train.activation_checkpoint.enabled = True
# enable gradient accumulation with 8 micro-batches
train.num_accumulation_steps = 8
```
Advanced Tutorials
===================
.. toctree::
:glob:
:maxdepth: 2
customize_dataloader.md
customize_parallel.md
# Auto Parallel Training
LiBai supports **auto-parallel training**, which means LiBai will automatically find **an efficient parallel training strategy** for a specific model during training. Users can try out auto-parallel training with the following steps.
## Installation
Install OneFlow nightly
```shell
python3 -m pip install --pre oneflow -f https://staging.oneflow.info/branch/master/[PLATFORM]
```
- All available `[PLATFORM]`:
<table class="docutils">
<tbody>
<tr>
<th width="80"> Platform </th>
<th valign="bottom" align="left" width="120">CUDA Driver Version</th>
<th valign="bottom" align="left" width="120">Supported GPUs</th>
</tr>
<tr>
<td align="left"> cu112 </td>
<td align="left"> >= 450.80.02 </td>
<td align="left"> GTX 10xx, RTX 20xx, A100, RTX 30xx</td>
</tr>
<tr>
<td align="left"> cu102 </td>
<td align="left"> >= 440.33 </td>
<td align="left"> GTX 10xx, RTX 20xx</td>
</tr>
<tr>
<td align="left"> cpu </td>
<td align="left"> N/A </td>
<td align="left"> N/A </td>
</tr>
</tbody>
</table>
## Train/Evaluate model in auto-parallel mode
You can train your own model in auto-parallel mode by simply updating the config as follows:
### Modify config file
```python
# your config
from .common.models.graph import graph
graph.auto_parallel.enabled = True
```
Training model with auto-parallel on 4 GPUs:
```shell
bash ./tools/train.sh tools/train_net.py configs/your_own_config.py 4
```
### Directly modify the training command line
- auto-parallel training:
```shell
bash ./tools/train.sh tools/train_net.py configs/your_own_config.py 4 graph.auto_parallel.enabled=True
```
- auto-parallel evaluation:
```shell
bash ./tools/train.sh tools/train_net.py configs/your_own_config.py 4 --eval graph.auto_parallel.enabled=True
```
### More details with instructions and interface
See [OneFlow Auto-Parallelism](https://oneflow.readthedocs.io/en/master/auto_parallel.html).
# Build New Project on LiBai
This is a basic guide to building new projects based on LiBai. The advantages of using LiBai to start a new project (such as a paper reproduction or a finetuning task) are as follows:
- Avoid redundant work. Developers can directly inherit many built-in modules from LiBai.
- Easily reproduce the experiments already run, because LiBai will save the configuration file automatically.
- Automatically output useful information during training time, such as remaining training time, current iter, throughput, loss information and current learning rate, etc.
- Set a few config params to enjoy distributed training techniques.
## Introduction
Take the [Bert Finetune](https://github.com/Oneflow-Inc/libai/tree/main/projects/QQP) task as an example to introduce LiBai.
The complete file structure of the project is:
```
projects/my_project
├── configs
│ └── config_custom.py
│ └── ...
├── dataset
│ ├── custom_dataset.py
│ └── ...
├── modeling
│ ├── custom_model.py
│ └── ...
├── README.md
```
To start a new project based on LiBai step by step:
Step 1. Prepare an independent config file (such as [config.py](https://github.com/Oneflow-Inc/libai/blob/main/projects/QQP/configs/config_qqp.py)) which contains:
- The relevant parameters of the task.
- The pre-defined related classes, such as `Model`, `Optimizer`, `Scheduler`, and `Dataset`.
- You can inherit the default config in `configs/common` and rewrite it, which can greatly reduce the workload.
- Related classes are defined with LazyCall, which returns a dict instead of calling the object.
Step 2. Prepare a model file (such as [model.py](https://github.com/Oneflow-Inc/libai/blob/main/projects/QQP/modeling/model.py)) :
- Build related models in this file. The construction method is similar to OneFlow.
- Because LiBai builds a static graph by default, the loss calculation needs to be inside the model (see the model sketch after these steps).
- The function `forward` must return a dict.
- When defining a tensor in the model, you need to use `to_global` to turn it into a global tensor.
- When defining layers, you can import them directly from `libai.layers`, because it has already pre-defined the SBP signature.
Step 3. Prepare a dataset file (such as [dataset.py](https://github.com/Oneflow-Inc/libai/tree/main/projects/QQP/dataset)) :
- Build `Dataset` in this file. The construction method is similar to OneFlow.
- The difference is that you need to use `DistTensorData` and `Instance`.
- The shape of each batch must be global.
- In `__getitem__` function, the `key` returned by the method must be consistent with the parameter name of the `forward` function in the `model`.
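As a reference for Steps 2 and 3, here is a minimal sketch of a LiBai-style model (the names are illustrative): the loss is computed inside the model, and `forward` returns a dict whose parameter names match the keys produced by the dataset:
```python
import oneflow.nn as nn
from libai.layers import Linear

class ToyModel(nn.Module):
    def __init__(self, hidden_size=768, num_classes=2):
        super().__init__()
        self.classifier = Linear(hidden_size, num_classes, layer_idx=-1)
        self.loss_func = nn.CrossEntropyLoss()

    def forward(self, features, labels=None):
        logits = self.classifier(features)
        if labels is not None and self.training:
            # loss is computed inside the model and returned in a dict
            return {"losses": self.loss_func(logits, labels)}
        return {"prediction_scores": logits}
```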
## Main Function Entry
[tools/train_net.py](https://github.com/Oneflow-Inc/libai/blob/main/tools/train_net.py) is the default main function entry provided in LiBai.
## Build Config
The `config.py` in LiBai is special: it takes the form of a LazyConfig and will be saved as `.yaml` at runtime. The config has several necessary fields, such as `train`, `model`, `optim`, `lr_scheduler`, `graph`. For more information, please refer to [Config_System.md](https://libai.readthedocs.io/en/latest/tutorials/Config_System.html).
> All imported modules must take LiBai as the root directory. Otherwise, the saved `yaml` file cannot save the correct path of the module, resulting in an error when reading `yaml`, and the experiment cannot be reproduced.
After building the `config.py`, if you want to get the corresponding fields in the project, you access them like `cfg.my_cfg.***`.
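A minimal sketch of such a project config (the module paths and the `my_cfg` fields are illustrative; the required `train`, `optim`, `lr_scheduler`, and `graph` fields would be inherited from `configs/common` as described above):
```python
# projects/my_project/configs/config_custom.py (illustrative)
from libai.config import LazyCall
from projects.my_project.modeling.custom_model import ToyModel

model = LazyCall(ToyModel)(hidden_size=768, num_classes=2)

# project-specific fields, later accessed as cfg.my_cfg.task_name, etc.
my_cfg = dict(
    task_name="my_finetune_task",
    pretrained_model_path="path/to/weights",
)
```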
## Start Training
The `train.sh` file contains some parameters, such as `GPUS`, `NODE`, etc.
```bash
#!/usr/bin/env bash
FILE=$1
CONFIG=$2
GPUS=$3
NODE=${NODE:-1}
NODE_RANK=${NODE_RANK:-0}
ADDR=${ADDR:-127.0.0.1}
PORT=${PORT:-12345}
python3 -m oneflow.distributed.launch \
--nproc_per_node $GPUS --nnodes $NODE --node_rank $NODE_RANK --master_addr $ADDR --master_port $PORT \
$FILE --config-file $CONFIG ${@:4}
```
After building the above modules, you can start training with a single GPU.
> Config can support both `py` files and generated `yaml` files.
```bash
bash projects/my_projects/train.sh tools/train_net.py projects/my_projects/config.py 1
```