Commit 2384a2ca authored by chenxj
initial commit

# stable_diffusion
## Paper
[High-Resolution Image Synthesis with Latent Diffusion Models](https://arxiv.org/pdf/2112.10752)
## Model Architecture
Stable Diffusion is built on the latent diffusion model (LDM), whose structure is shown below:
![image](http://10.6.10.68/modelzoo/stable_diffusion_ait/-/raw/master/resources/sd_model.png)
## Algorithm
Based on the model structure, the algorithm works in outline as follows:
![image](http://10.6.10.68/modelzoo/stable_diffusion_ait/-/raw/master/resources/sd_principle.png)
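As a rough illustration of the diffusion process that a latent diffusion model builds on, the sketch below adds noise to a latent vector at a chosen timestep. The linear beta schedule, the step count, the sample latent and the helper names are all illustrative assumptions; none of them are taken from this repository.

```python
import math
import random


def make_alpha_bars(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for an assumed linear beta schedule."""
    alpha_bars = []
    running = 1.0
    for step in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * step / (num_steps - 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def add_noise(z0, t, alpha_bars, rng):
    """Sample z_t = sqrt(a_bar_t) * z_0 + sqrt(1 - a_bar_t) * eps with eps ~ N(0, 1)."""
    a_bar = alpha_bars[t]
    eps = [rng.gauss(0.0, 1.0) for _ in z0]
    zt = [math.sqrt(a_bar) * z + math.sqrt(1.0 - a_bar) * e for z, e in zip(z0, eps)]
    return zt, eps


alpha_bars = make_alpha_bars()
latent = [0.5, -1.0, 0.25, 2.0]  # stand-in for a flattened VAE latent
noisy_latent, noise = add_noise(latent, t=500, alpha_bars=alpha_bars, rng=random.Random(0))
```

During sampling the UNet is asked to predict `noise` from `noisy_latent`, and the scheduler uses that prediction to step back toward a clean latent, which the VAE then decodes into an image.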
## Dataset
## Environment Setup
Docker images for inference can be pulled from [光源](https://sourcefind.cn/#/service-list) (SourceFind). The recommended image for stable_diffusion_ait is:
```
docker pull image.sourcefind.cn:5000/dcu/admin/base/custom:ait-0.0.1_dtk24.04_py310
docker run -d -t -v /opt/hyhal:/opt/hyhal:ro --privileged --device=/dev/kfd --device=/dev/dri/ --network=host --group-add video --name sd-test image.sourcefind.cn:5000/dcu/admin/base/custom:ait-0.0.1_dtk24.04_py310
docker exec -it sd-test bash
source /opt/dtk/env.sh
```
## Inference
**install ait**
```
cd stable_diffusion_ait
pip3 install dist/aitemplate-0.0.1-py3-none-any.whl
```
#### 01_resnet-50
```
cd examples/01_resnet-50
```
Download the resnet50 weights (https://github.com/rwightman/pytorch-image-models/releases/download/v0.1-rsb-weights/resnet50_a1_0-14fe96d1.pth)
**benchmark**
```
python3 benchmark_ait.py
python3 benchmark_pt.py
```
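Both benchmark scripts append one line per batch size to a text file named after the visible devices (for example `resnet50_ait_benchmark_dev_0.txt`), in the form `batch_size: <n>, latency: <value>`. A minimal sketch for reading those lines back is shown here; `parse_benchmark_log` is a hypothetical helper, not part of the repository, and the sample text uses rounded values from the performance table in this README.

```python
import re

# Matches lines such as "batch_size: 8, latency: 4.5176"
LINE_PATTERN = re.compile(r"batch_size:\s*(\d+),\s*latency:\s*([0-9.eE+-]+)")


def parse_benchmark_log(text):
    """Map each batch size to the last latency recorded for it in the log text."""
    latencies = {}
    for line in text.splitlines():
        match = LINE_PATTERN.search(line)
        if match:
            latencies[int(match.group(1))] = float(match.group(2))
    return latencies


sample_log = "batch_size: 1, latency: 2.79\nbatch_size: 2, latency: 3.02\n"
print(parse_benchmark_log(sample_log))
```

Because the scripts open the file in append mode, repeated runs accumulate lines; this sketch keeps only the most recent value per batch size.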
**infer**
```
python3 infer_with_torch.py
```
#### 02_bert
```
cd examples/02_bert
```
Download the bert-base-uncased weights (https://huggingface.co/google-bert/bert-base-uncased)
**benchmark**
```
python3 benchmark_ait.py
python3 benchmark_pt.py
```
**infer**
```
python3 demo.py
```
#### 03_vit
```
cd examples/03_vit
```
Download the vit_base_patch16_224 weights (https://huggingface.co/timm/vit_base_patch16_224.augreg2_in21k_ft_in1k)
**benchmark**
```
python3 benchmark_ait.py
python3 benchmark_pt.py
```
**verification**
```
python3 verification.py
```
#### 04_stable_diffusion
Download the stable-diffusion-2-1-base weights (https://huggingface.co/stabilityai/stable-diffusion-2-1-base)
Download the clip-vit-large-patch14 weights (https://huggingface.co/openai/clip-vit-large-patch14)
**compile**
```
cd examples/04_stable_diffusion
python3 scripts/compile.py --local-dir stable-diffusion-2-1-base_path
```
**benchmark**
```
python3 src/benchmark.py --local-dir stable-diffusion-2-1-base_path --clip-dir clip-vit-large-patch14_path --benchmark-pt True
```
**infer**
```
python3 scripts/demo.py --local-dir stable-diffusion-2-1-base_path
python3 scripts/demo_pt.py --local-dir stable-diffusion-2-1-base_path
```
## Results
![image](http://10.6.10.68/modelzoo/stable_diffusion_ait/-/raw/master/resources/example_ait.png)
### Accuracy
### Performance
01_resnet-50
| batch size | pt latency(ms) | ait latency(ms) |
| :------: | :------: |:------: |
| 1 | 3.50665771484375 | 2.7900346517562866 |
| 2 | 4.198978271484375 | 3.022238612174988 |
| 4 | 5.242999877929687 | 3.6645140647888184 |
| 8 | 7.416472778320313 | 4.517657279968262 |
| 16 | 11.60461181640625 | 6.50670599937439 |
| 32 | 19.8466064453125 | 10.511437177658081 |
| 64 | 36.08590576171875 | 18.35416030883789 |
| 128 | 67.2965625 | 32.82508373260498 |
| 256 | 133.891044921875 | 59.32628345489502 |
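To make the table easier to compare, the speedup of AIT over PyTorch can be derived row by row as the ratio of the two latencies. The sketch below does this for three rows copied from the table above; the helper name `speedup` is only illustrative.

```python
# (batch size, PyTorch latency in ms, AIT latency in ms), copied from the table above
RESNET50_LATENCY_MS = [
    (1, 3.50665771484375, 2.7900346517562866),
    (16, 11.60461181640625, 6.50670599937439),
    (256, 133.891044921875, 59.32628345489502),
]


def speedup(pt_ms, ait_ms):
    """Return how many times faster AIT is than PyTorch for one row."""
    return pt_ms / ait_ms


for batch_size, pt_ms, ait_ms in RESNET50_LATENCY_MS:
    print(f"batch {batch_size}: {speedup(pt_ms, ait_ms):.2f}x")
```

For these rows the ratio is roughly 1.26x at batch size 1 and 2.26x at batch size 256.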
02_bert
bert sequence length 64
| batch size | pt latency(ms) | ait latency(ms) |
| :------: | :------: |:------: |
| 1 | 5.492328491210937 | 3.763094484806061 |
| 2 | 5.851014404296875 | 3.934549331665039 |
| 4 | 5.8462060546875 | 7.370500087738037 |
| 8 | 7.20282958984375 | 7.630655765533447 |
| 16 | 10.13709716796875 | 6.997292518615723 |
| 32 | 14.629547119140625 | 15.192972660064697 |
| 64 | 24.83916259765625 | 18.988140106201172 |
| 128 | 45.0836083984375 | 33.51811981201172 |
| 256 | 85.2006640625 | 91.8479995727539 |
bert sequence length 128
| batch size | pt latency(ms) | ait latency(ms) |
| :------: | :------: |:------: |
| 1 | 5.583170776367187 | 3.8969525694847107 |
| 2 | 5.851030883789062 | 7.2915791273117065 |
| 4 | 7.507911376953125 | 7.635279178619385 |
| 8 | 10.716405029296874 | 6.723778605461121 |
| 16 | 16.03172607421875 | 14.665886878967285 |
| 32 | 27.00265869140625 | 18.18143320083618 |
| 64 | 49.812158203125 | 32.23751640319824 |
| 128 | 94.589228515625 | 87.85263633728027 |
| 256 | 179.57365234375 | 107.5546760559082 |
bert sequence length 256
| batch size | pt latency(ms) | ait latency(ms) |
| :------: | :------: |:------: |
| 1 | 5.536416625976562 | 4.418213129043579 |
| 2 | 7.61077392578125 | 5.24817168712616 |
| 4 | 11.207763671875 | 14.24576473236084 |
| 8 | 16.431724853515625 | 12.16104507446289 |
| 16 | 26.9556640625 | 19.11765956878662 |
| 32 | 49.54421875 | 33.73731803894043 |
| 64 | 93.535673828125 | 61.45344161987305 |
| 128 | 178.09998046875 | 113.69585227966309 |
| 256 | 347.1721484375 | 2563.2373657226562 |
03_vit
vit_base_patch16_224
| batch size | pt latency(ms) | ait latency(ms) |
| :------: | :------: |:------: |
| 1 | 4.531996154785157 | 8.322586297988892 |
| 2 | 6.666417846679687 | 8.580682277679443 |
| 4 | 10.00460205078125 | 6.754000902175903 |
| 8 | 13.427578125 | 9.69419240951538 |
| 16 | 21.916123046875 | 17.138832092285156 |
| 32 | 40.23213134765625 | 28.402775287628174 |
| 64 | 72.446611328125 | 53.653794288635254 |
| 128 | 136.889541015625 | 99.72106170654297 |
| 256 | 269.488203125 | 186.07625198364258 |
04_stable_diffusion
single batch
| module | pt latency(ms) | ait latency(ms) |
| :------: | :------: |:------: |
| clip | 11.66868896484375 | 13.639567375183105 |
| unet | 106.440107421875 | 71.8858814239502 |
| vae | 95.6298046875 | 74.00970458984375 |
| pipeline | 5429.30386474609375 | 3681.943343162536855 |
batched version
| batch size | pt latency(ms) | ait latency(ms) |
| :------: | :------: |:------: |
| 1 | 5429.30386474609375 | 3681.943343162536855 |
| 2 | 13816.322286962532 | 5283.831155044027 |
| 4 | 23745.903372997418 | 9285.692506004125 |
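The batched latencies can also be read as throughput. The sketch below converts them to images per second, assuming each reported latency covers one full pipeline run that produces one image per batch element; that interpretation and the helper name are assumptions, since the README does not state them.

```python
# (batch size, PyTorch pipeline latency in ms, AIT pipeline latency in ms), from the table above
SD_PIPELINE_LATENCY_MS = [
    (1, 5429.30386474609375, 3681.943343162536855),
    (2, 13816.322286962532, 5283.831155044027),
    (4, 23745.903372997418, 9285.692506004125),
]


def images_per_second(batch_size, latency_ms):
    """Convert a per-batch latency in milliseconds into images per second."""
    return batch_size / (latency_ms / 1000.0)


for batch_size, pt_ms, ait_ms in SD_PIPELINE_LATENCY_MS:
    pt_rate = images_per_second(batch_size, pt_ms)
    ait_rate = images_per_second(batch_size, ait_ms)
    print(f"batch {batch_size}: pt {pt_rate:.3f} img/s, ait {ait_rate:.3f} img/s")
```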
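## Application Scenarios
### Algorithm Category
Text-to-image generation
### Key Application Industries
Art and design, game development, film production
## Source Repository and Issue Reporting
http://10.6.10.68/modelzoo/stable_diffusion_ait
## References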
https://github.com/ROCm/AITemplate
# Copyright (c) Meta Platforms, Inc. and affiliates.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
"""benchmark for resnet50"""
import os
import click
import torch
from aitemplate.compiler import compile_model, Model
from aitemplate.frontend import Tensor
from aitemplate.testing import detect_target
from modeling.resnet import build_resnet_backbone
from weight_utils import export_to_torch_tensor

def mark_output(y):
    """Unlike PyTorch, output tensors must be marked explicitly for optimization.

    Parameters
    ----------
    y : Tensor or Tuple[Tensor]
        Output tensor(s) to mark
    """
    if not isinstance(y, tuple):
        y = (y,)
    for i in range(len(y)):
        y[i]._attrs["is_output"] = True
        y[i]._attrs["name"] = "output_%d" % (i)
        y_shape = [d._attrs["values"][0] for d in y[i]._attrs["shape"]]
        print("output_{} shape: {}".format(i, y_shape))


def compile_module(model_name, batch_size, **kwargs):
    if model_name != "resnet50":
        raise NotImplementedError

    model_name = f"{model_name}_{batch_size}"
    target = detect_target(**kwargs)
    # Create input tensor, need to specify the shape, dtype and is_input flag
    x = Tensor(
        shape=[batch_size, 224, 224, 3], dtype="float16", name="input0", is_input=True
    )
    model = build_resnet_backbone(50, activation="ReLU")
    # Name all parameters following the PyTorch naming convention
    model.name_parameter_tensor()
    # Forward the input tensor to the model, get output tensor
    y = model(x)
    # Mark output tensor
    mark_output(y)
    # Compile the model
    module = compile_model(y, target, "./tmp", model_name, profile_devs=[0, 1, 2, 3])
    return module

def benchmark(model_name, batch_size, cuda_params, mod=None, graph_mode=True):
    # Load compiled model
    if mod is None:
        model_name = f"{model_name}_{batch_size}"
        mod = Model(os.path.join("./tmp", model_name, "test.so"))

    # Set params
    mod.set_many_constants_with_tensors(cuda_params)
    mod.fold_constants(sync=True)

    # prepare input/output tensor
    x_input = torch.randn([batch_size, 224, 224, 3]).cuda().half()
    x_input = x_input.contiguous()
    y_output = torch.zeros([batch_size, 1, 1, 1000]).cuda().half()
    y_output = y_output.contiguous()

    # warm up
    t, _, __ = mod.benchmark_with_tensors(
        [x_input],
        [y_output],
        count=100,
        repeat=4,
        graph_mode=graph_mode,
    )
    # benchmark
    t, _, __ = mod.benchmark_with_tensors(
        [x_input],
        [y_output],
        count=100,
        repeat=4,
        graph_mode=graph_mode,
    )
    print(f"batch_size: {batch_size}, latency: {t}")
    dev_flag = os.environ.get("HIP_VISIBLE_DEVICES", "-1")
    dev_flag = dev_flag.replace(",", "_")
    with open(f"resnet50_ait_benchmark_dev_{dev_flag}.txt", "a") as f:
        f.write(f"batch_size: {batch_size}, latency: {t}\n")

@click.command()
@click.option(
    "--use-fp16-acc",
    type=bool,
    default=True,
    help="Whether to use FP16 for accumulation (similar to TensorRT)",
)
@click.option("--use-graph", type=bool, default=True, help="Whether to use CUDA graph")
@click.option("--batch-size", type=int, default=0, help="Batch size")
def main(use_fp16_acc=True, use_graph=True, batch_size=0):
    if detect_target().name() == "rocm":
        use_graph = False
    pretrained_path = "./resnet50_a1_0-14fe96d1.pth"
    cuda_params = export_to_torch_tensor(model_name="resnet50", model_path=pretrained_path)
    if batch_size < 1:
        # No batch size given: compile and benchmark every batch size in turn
        for bs in (1, 2, 4, 8, 16, 32, 64, 128, 256):
            compile_module("resnet50", bs, use_fp16_acc=use_fp16_acc)
            benchmark("resnet50", bs, cuda_params, graph_mode=use_graph)
    else:
        # Loads ./tmp/resnet50_<batch_size>/test.so, so it must already be compiled
        benchmark("resnet50", batch_size, cuda_params, graph_mode=use_graph)


if __name__ == "__main__":
    main()
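#!/bin/bash
# Run one benchmark per device in parallel, then wait for both to finish.
# (`fg` needs job control, which is disabled in non-interactive scripts.)
HIP_VISIBLE_DEVICES=0 python3 benchmark_ait.py --batch-size "$1" &
HIP_VISIBLE_DEVICES=1 python3 benchmark_ait.py --batch-size "$1" &
wait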
# Copyright (c) Meta Platforms, Inc. and affiliates.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
import os
import click
import timm
import torch
from aitemplate.testing.benchmark_pt import benchmark_torch_function

def benchmark(model, batch_size):
    with torch.inference_mode():
        input_shape = (batch_size, 3, 224, 224)
        input_data = torch.randn(input_shape).cuda().half()
        # warm up
        benchmark_torch_function(100, model, input_data)
        # benchmark
        t = benchmark_torch_function(100, model, input_data)
        print("batch_size: {}, time: {}".format(batch_size, t))
        dev_flag = os.environ.get("HIP_VISIBLE_DEVICES", "-1")
        dev_flag = dev_flag.replace(",", "_")
        with open(f"resnet50_pt_benchmark_dev_{dev_flag}.txt", "a") as f:
            f.write("batch_size: {}, latency: {}\n".format(batch_size, t))


@click.command()
@click.option("--batch-size", default=0, type=int)
def main(batch_size):
    model = timm.create_model(
        "resnet50", pretrained=True, num_classes=1000,
        pretrained_cfg_overlay=dict(file="./resnet50_a1_0-14fe96d1.pth")
    ).cuda().half()
    model.eval()
    if batch_size == 0:
        for bs in [1, 2, 4, 8, 16, 32, 64, 128, 256]:
            benchmark(model, bs)
    else:
        benchmark(model, batch_size)


if __name__ == "__main__":
    main()
# Copyright (c) Meta Platforms, Inc. and affiliates.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
import os
import numpy as np
import torch
from aitemplate.compiler import compile_model, Model
from aitemplate.frontend import Tensor
from aitemplate.testing import detect_target
from modeling.resnet import build_resnet_backbone
from PIL import Image
from weight_utils import timm_export

def mark_output(y):
    """Unlike PyTorch, output tensors must be marked explicitly for optimization.

    Parameters
    ----------
    y : Tensor or Tuple[Tensor]
        Output tensor(s) to mark
    """
    if not isinstance(y, tuple):
        y = (y,)
    for i in range(len(y)):
        y[i]._attrs["is_output"] = True
        y[i]._attrs["name"] = "output_%d" % (i)
        y_shape = [d._attrs["values"][0] for d in y[i]._attrs["shape"]]
        print("output_{} shape: {}".format(i, y_shape))


def compile_module(model_name, **kwargs):
    batch_size = 1
    if model_name != "resnet50":
        raise NotImplementedError

    model_name = f"{model_name}_{batch_size}"
    target = detect_target(**kwargs)
    # Create input tensor, need to specify the shape, dtype and is_input flag
    x = Tensor(
        shape=[batch_size, 224, 224, 3], dtype="float16", name="input0", is_input=True
    )
    model = build_resnet_backbone(50, activation="ReLU")
    # Name all parameters following the PyTorch naming convention
    model.name_parameter_tensor()
    # Forward the input tensor to the model, get output tensor
    y = model(x)
    # Mark output tensor
    mark_output(y)
    # Compile the model
    module = compile_model(y, target, "./tmp", model_name)
    return module

def prepare_data(img_path=None):
    # A 224x224 image found online, used for demo purposes
    img_url = "https://github.com/dmlc/mxnet.js/blob/main/data/cat.png?raw=true"
    if img_path is None:
        if not os.path.exists("cat.png"):
            os.system(f"wget -O cat.png {img_url}")
        img_path = "cat.png"
    image = Image.open(img_path).resize((224, 224))
    image = torch.as_tensor(np.array(image).astype("float32")).cuda().half()
    # Add the batch dimension: (H, W, C) -> (1, H, W, C)
    image = image.unsqueeze(0)
    # Normalize with the ImageNet mean/std, broadcast over the channel axis (NHWC)
    mean = torch.tensor([0.485, 0.456, 0.406]).cuda().half()
    std = torch.tensor([0.229, 0.224, 0.225]).cuda().half()
    image = (image / 255.0 - mean[None, None, None, :]) / std[None, None, None, :]
    return image


def export_to_torch_tensor(model_path, model_name="resnet50"):
    if model_name != "resnet50":
        raise NotImplementedError
    timm2ait = timm_export(model_name, pretrained_path=model_path)
    params = timm2ait.export_model(half=True)
    return params, timm2ait.pt_model

def inference(model_name, mod=None):
    # Load params
    pretrained_path = "./resnet50_a1_0-14fe96d1.pth"
    cuda_params, pt_model = export_to_torch_tensor(model_name=model_name, model_path=pretrained_path)
    # Load compiled model
    if mod is None:
        # compile_module() appends the batch size (1) to the work directory name
        mod = Model(os.path.join("./tmp", f"{model_name}_1", "test.so"))

    # Set torch tensor params to runtime
    mod.set_many_constants_with_tensors(cuda_params)
    mod.fold_constants(sync=True)

    # prepare input/output tensor; cat.png is downloaded if it is missing
    x_input = prepare_data()
    x_input = x_input.contiguous()
    y_output = torch.zeros([1, 1, 1, 1000]).cuda().half()
    y_output = y_output.contiguous()

    # execute
    mod.run_with_tensors([x_input], [y_output])

    # process output with pytorch
    y_label = torch.argmax(y_output, dim=-1)
    y_cpu = y_label.cpu().numpy()
    print(y_cpu)

    # run pytorch; the AIT input is NHWC, the timm model expects NCHW
    pt_model.eval()
    pt_model = pt_model.cuda().half()
    pt_output = pt_model(x_input.permute([0, 3, 1, 2]))
    y_label = torch.argmax(pt_output, dim=-1)
    y_cpu = y_label.cpu().numpy()
    print(y_cpu)

    # verify outputs (rtol=1e-1, atol=1e-1)
    assert torch.allclose(y_output, pt_output, 1e-1, 1e-1)
    print("Verification done!")

if __name__ == "__main__":
    np.random.seed(4896)
    model_name = "resnet50"
    mod = compile_module(model_name, use_fp16_acc=True)
    inference(model_name, mod)
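The verification step above passes `1e-1` as both the third and fourth positional arguments of `torch.allclose`, which are `rtol` and `atol`. The pure-Python sketch below spells out the element-wise rule that such a check applies; `is_close` and `all_close` are illustrative names, not functions from this repository.

```python
def is_close(actual, expected, rtol=1e-1, atol=1e-1):
    """Element-wise rule: |actual - expected| <= atol + rtol * |expected|."""
    return abs(actual - expected) <= atol + rtol * abs(expected)


def all_close(actual_values, expected_values, rtol=1e-1, atol=1e-1):
    """True only if every pair of values satisfies the rule above."""
    return all(is_close(a, e, rtol, atol) for a, e in zip(actual_values, expected_values))


print(all_close([1.0, 10.9], [1.05, 10.0]))  # both pairs fall inside the tolerance band
print(all_close([12.0], [10.0]))             # off by 2.0, allowed at most 1.1
```

With both tolerances at 0.1 the check is deliberately loose, which suits FP16 logits but would not catch small numerical drift.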
# Copyright (c) Meta Platforms, Inc. and affiliates.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
import numpy as np
from aitemplate.frontend import nn
from aitemplate.testing import detect_target

class CNNBlockBase(nn.Module):
    """
    A CNN block is assumed to have input channels, output channels and a stride.
    The input and output of `forward()` method must be NHWC tensors.
    The method can perform arbitrary computation but must match the given
    channels and stride specification.

    Attribute:
        in_channels (int):
        out_channels (int):
        stride (int):
    """

    def __init__(self, in_channels, out_channels, stride):
        """
        The `__init__` method of any subclass should also contain these arguments.

        Args:
            in_channels (int):
            out_channels (int):
            stride (int):
        """
        super().__init__()
        self.in_channels = in_channels
        self.out_channels = out_channels
        self.stride = stride

class BasicStem(CNNBlockBase):
    """
    The standard ResNet stem (layers before the first residual block),
    with a conv, relu and max_pool.
    """

    def __init__(self, in_channels=3, out_channels=64, norm="BN", activation="ReLU"):
        super().__init__(in_channels, out_channels, 4)
        conv_op = None
        if detect_target().name() == "cuda":
            if activation == "ReLU":
                conv_op = nn.Conv2dBiasReluFewChannels
            elif activation == "Hardswish":
                conv_op = nn.Conv2dBiasHardswishFewChannels
            else:
                raise NotImplementedError
        else:
            if activation == "ReLU":
                conv_op = nn.Conv2dBiasRelu
            elif activation == "Hardswish":
                conv_op = nn.Conv2dBiasHardswish
            else:
                raise NotImplementedError
        self.conv1 = conv_op(in_channels, out_channels, 7, 2, 7 // 2)
        self.pool = nn.MaxPool2d(3, 2, 1)

    def forward(self, x):
        x = self.conv1(x)
        x = self.pool(x)
        return x


class BasicBlock(CNNBlockBase):
    """
    The basic residual block for ResNet-18 and ResNet-34 defined in :paper:`ResNet`,
    with two 3x3 conv layers and a projection shortcut if needed.
    """

    def __init__(self, in_channels, out_channels, *, stride=1, norm="BN"):
        super().__init__(in_channels, out_channels, stride)

    def forward(self, x):
        raise NotImplementedError()

class BottleneckBlock(CNNBlockBase):
    """
    The standard bottleneck residual block used by ResNet-50, 101 and 152
    defined in :paper:`ResNet`. It contains 3 conv layers with kernels
    1x1, 3x3, 1x1, and a projection shortcut if needed.
    """

    def __init__(
        self,
        in_channels,
        out_channels,
        *,
        bottleneck_channels,
        stride=1,
        num_groups=1,
        norm="BN",
        activation="ReLU",
        stride_in_1x1=False,
        dilation=1,
    ):
        """
        Args:
            bottleneck_channels (int): number of output channels for the 3x3
                "bottleneck" conv layers.
            num_groups (int): number of groups for the 3x3 conv layer.
            norm (str or callable): normalization for all conv layers.
                See :func:`layers.get_norm` for supported format.
            stride_in_1x1 (bool): when stride>1, whether to put stride in the
                first 1x1 convolution or the bottleneck 3x3 convolution.
            dilation (int): the dilation rate of the 3x3 conv layer.
        """
        super().__init__(in_channels, out_channels, stride)

        if in_channels != out_channels:
            self.downsample_0 = nn.Conv2dBias(in_channels, out_channels, 1, stride, 0)
        else:
            self.downsample_0 = None

        # The original MSRA ResNet models have stride in the first 1x1 conv
        # The subsequent fb.torch.resnet and Caffe2 ResNe[X]t implementations have
        # stride in the 3x3 conv
        stride_1x1, stride_3x3 = (stride, 1) if stride_in_1x1 else (1, stride)
        conv_op = None
        conv_op_add = None
        if activation == "ReLU":
            conv_op = nn.Conv2dBiasRelu
            conv_op_add = nn.Conv2dBiasAddRelu
        elif activation == "Hardswish":
            conv_op = nn.Conv2dBiasHardswish
            conv_op_add = nn.Conv2dBiasAddHardswish
        else:
            raise NotImplementedError
        self.conv1 = conv_op(in_channels, bottleneck_channels, 1, stride_1x1, 0)

        self.conv2 = conv_op(
            bottleneck_channels,
            bottleneck_channels,
            3,
            stride_3x3,
            1 * dilation,
            dilation,
        )

        self.conv3 = conv_op_add(bottleneck_channels, out_channels, 1, 1, 0)

        # for layer in [self.conv1, self.conv2, self.conv3, self.shortcut]:
        #     if layer is not None:  # shortcut can be None
        #         weight_init.c2_msra_fill(layer)

        # Zero-initialize the last normalization in each residual branch,
        # so that at the beginning, the residual branch starts with zeros,
        # and each residual block behaves like an identity.
        # See Sec 5.1 in "Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour":
        # "For BN layers, the learnable scaling coefficient γ is initialized
        # to be 1, except for each residual block's last BN
        # where γ is initialized to be 0."

        # nn.init.constant_(self.conv3.norm.weight, 0)
        # TODO this somehow hurts performance when training GN models from scratch.
        # Add it as an option when we need to use this code to train a backbone.

    def forward(self, x):
        out = self.conv1(x)
        out = self.conv2(out)

        if self.downsample_0 is not None:
            downsample = self.downsample_0(x)
        else:
            downsample = x

        # conv3 fuses the residual add and the activation
        out = self.conv3(out, downsample)
        return out

class ResNet(nn.Module):
    """
    Implement :paper:`ResNet`.
    """

    def __init__(self, stem, stages, num_classes=None, out_features=None, freeze_at=0):
        """
        Args:
            stem (nn.Module): a stem module
            stages (list[list[CNNBlockBase]]): several (typically 4) stages,
                each contains multiple :class:`CNNBlockBase`.
            num_classes (None or int): if None, will not perform classification.
                Otherwise, will create a linear layer.
            out_features (list[str]): name of the layers whose outputs should
                be returned in forward. Can be anything in "stem", "linear", or "res2" ...
                If None, will return the output of the last layer.
            freeze_at (int): The number of stages at the beginning to freeze.
                see :meth:`freeze` for detailed explanation.
        """
        super().__init__()
        self.stem = stem
        self.num_classes = num_classes

        current_stride = self.stem.stride
        self._out_feature_strides = {"stem": current_stride}
        self._out_feature_channels = {"stem": self.stem.out_channels}

        self.stage_names, self.stages = [], []

        if out_features is not None:
            # Avoid keeping unused layers in this module. They consume extra memory
            # and may cause allreduce to fail
            num_stages = max(
                [
                    {"layer1": 1, "layer2": 2, "layer3": 3, "layer4": 4}.get(f, 0)
                    for f in out_features
                ]
            )
            stages = stages[:num_stages]
        for i, blocks in enumerate(stages):
            assert len(blocks) > 0, len(blocks)
            for block in blocks:
                assert isinstance(block, CNNBlockBase), block

            name = "layer" + str(i + 1)
            stage = nn.Sequential(*blocks)

            self.add_module(name, stage)
            self.stage_names.append(name)
            self.stages.append(stage)

            self._out_feature_strides[name] = current_stride = int(
                current_stride * np.prod([k.stride for k in blocks])
            )
            self._out_feature_channels[name] = curr_channels = blocks[-1].out_channels
        self.stage_names = tuple(self.stage_names)  # Make it static for scripting

        if num_classes is not None:
            self.avgpool = nn.AvgPool2d(7, 1, 0)
            self.fc = nn.Linear(curr_channels, num_classes)

        if out_features is None:
            out_features = [name]
        self._out_features = out_features
        assert len(self._out_features)
        children = [x[0] for x in self.named_children()]
        for out_feature in self._out_features:
            assert out_feature in children, "Available children: {}".format(
                ", ".join(children)
            )
        self.reshape = nn.Reshape()

    def forward(self, x):
        """
        Args:
            x: Tensor of shape (N,H,W,C), matching the NHWC layout required by
                :class:`CNNBlockBase`.

        Returns:
            Tensor: the classification output when ``num_classes`` is set
                (a rank-2 result is reshaped to (N, 1, 1, num_classes)).
            dict[str->Tensor]: otherwise, names and the corresponding features.
        """
        outputs = {}
        x = self.stem(x)
        if "stem" in self._out_features:
            outputs["stem"] = x
        for name, stage in zip(self.stage_names, self.stages):
            x = stage(x)
            if name in self._out_features:
                outputs[name] = x
        if self.num_classes is not None:
            x = self.avgpool(x)
            x = self.fc(x)
            if x._rank() == 2:
                x = self.reshape(x, [x._size(0), 1, 1, x._size(1)])
            return x
        return outputs

    @staticmethod
    def make_stage(block_class, num_blocks, *, in_channels, out_channels, **kwargs):
        """
        Create a list of blocks of the same type that forms one ResNet stage.

        Args:
            block_class (type): a subclass of CNNBlockBase that's used to create all blocks in this
                stage. A module of this type must not change spatial resolution of inputs unless its
                stride != 1.
            num_blocks (int): number of blocks in this stage
            in_channels (int): input channels of the entire stage.
            out_channels (int): output channels of **every block** in the stage.
            kwargs: other arguments passed to the constructor of
                `block_class`. If the argument name is "xx_per_block", the
                argument is a list of values to be passed to each block in the
                stage. Otherwise, the same argument is passed to every block
                in the stage.

        Returns:
            list[CNNBlockBase]: a list of block modules.

        Examples:
        ::
            stage = ResNet.make_stage(
                BottleneckBlock, 3, in_channels=16, out_channels=64,
                bottleneck_channels=16, num_groups=1,
                stride_per_block=[2, 1, 1],
                dilations_per_block=[1, 1, 2]
            )

        Usually, layers that produce the same feature map spatial size are defined as one
        "stage" (in :paper:`FPN`). Under such definition, ``stride_per_block[1:]`` should
        all be 1.
        """
        blocks = []
        for i in range(num_blocks):
            curr_kwargs = {}
            for k, v in kwargs.items():
                if k.endswith("_per_block"):
                    assert len(v) == num_blocks, (
                        f"Argument '{k}' of make_stage should have the "
                        f"same length as num_blocks={num_blocks}."
                    )
                    newk = k[: -len("_per_block")]
                    assert (
                        newk not in kwargs
                    ), f"Cannot call make_stage with both {k} and {newk}!"
                    curr_kwargs[newk] = v[i]
                else:
                    curr_kwargs[k] = v

            blocks.append(
                block_class(
                    in_channels=in_channels, out_channels=out_channels, **curr_kwargs
                )
            )
            in_channels = out_channels
        return blocks

    @staticmethod
    def make_default_stages(depth, block_class=None, **kwargs):
        """
        Create a list of ResNet stages from a pre-defined depth (one of 18, 34, 50, 101, 152).
        If it doesn't create the ResNet variant you need, please use :meth:`make_stage`
        instead for fine-grained customization.

        Args:
            depth (int): depth of ResNet
            block_class (type): the CNN block class. Has to accept
                `bottleneck_channels` argument for depth >= 50.
                By default it is BasicBlock or BottleneckBlock, based on the
                depth.
            kwargs:
                other arguments to pass to `make_stage`. Should not contain
                stride and channels, as they are predefined for each depth.

        Returns:
            list[list[CNNBlockBase]]: modules in all stages; see arguments of
                :class:`ResNet.__init__`.
        """
        num_blocks_per_stage = {
            18: [2, 2, 2, 2],
            34: [3, 4, 6, 3],
            50: [3, 4, 6, 3],
            101: [3, 4, 23, 3],
            152: [3, 8, 36, 3],
        }[depth]
        if block_class is None:
            block_class = BasicBlock if depth < 50 else BottleneckBlock
        if depth < 50:
            in_channels = [64, 64, 128, 256]
            out_channels = [64, 128, 256, 512]
        else:
            in_channels = [64, 256, 512, 1024]
            out_channels = [256, 512, 1024, 2048]
        ret = []
        for (n, s, i, o) in zip(
            num_blocks_per_stage, [1, 2, 2, 2], in_channels, out_channels
        ):
            if depth >= 50:
                kwargs["bottleneck_channels"] = o // 4
            ret.append(
                ResNet.make_stage(
                    block_class=block_class,
                    num_blocks=n,
                    stride_per_block=[s] + [1] * (n - 1),
                    in_channels=i,
                    out_channels=o,
                    **kwargs,
                )
            )
        return ret

def make_stage(*args, **kwargs):
    """
    Deprecated alias for backward compatibility.
    """
    return ResNet.make_stage(*args, **kwargs)

def build_resnet_backbone(depth, activation):
    """
    Create a ResNet instance for the given depth and activation.

    Returns:
        ResNet: a :class:`ResNet` instance.
    """
    norm = "BN"
    num_groups = 1
    stride_in_1x1 = False
    width_per_group = 64
    bottleneck_channels = num_groups * width_per_group
    in_channels = 64
    out_channels = 256

    stem = BasicStem(in_channels=3, out_channels=64, norm=norm, activation=activation)

    num_blocks_per_stage = {
        18: [2, 2, 2, 2],
        34: [3, 4, 6, 3],
        50: [3, 4, 6, 3],
        101: [3, 4, 23, 3],
        152: [3, 8, 36, 3],
    }[depth]

    stages = []
    for idx, stage_idx in enumerate(range(2, 6)):
        # res5_dilation is used this way as a convention in R-FCN & Deformable Conv paper
        dilation = 1
        first_stride = 1 if idx == 0 or (stage_idx == 5 and dilation == 2) else 2
        stage_kargs = {
            "num_blocks": num_blocks_per_stage[idx],
            "stride_per_block": [first_stride] + [1] * (num_blocks_per_stage[idx] - 1),
            "in_channels": in_channels,
            "out_channels": out_channels,
            "norm": norm,
            "activation": activation,
        }
        # Use BasicBlock for R18 and R34.
        if depth in [18, 34]:
            stage_kargs["block_class"] = BasicBlock
        else:
            stage_kargs["bottleneck_channels"] = bottleneck_channels
            stage_kargs["stride_in_1x1"] = stride_in_1x1
            stage_kargs["dilation"] = dilation
            stage_kargs["num_groups"] = num_groups
            stage_kargs["block_class"] = BottleneckBlock
        blocks = ResNet.make_stage(**stage_kargs)
        # Each stage doubles the channel widths of the previous one
        in_channels = out_channels
        out_channels *= 2
        bottleneck_channels *= 2
        stages.append(blocks)
    return ResNet(stem, stages, num_classes=1000)
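To see what the stage loop in `build_resnet_backbone` produces for depth 50, the sketch below replays only its channel and stride bookkeeping in plain Python, without constructing any AITemplate modules. The function `resnet50_stage_plan` is a hypothetical helper written for this illustration.

```python
def resnet50_stage_plan(num_blocks_per_stage=(3, 4, 6, 3)):
    """Replay the channel and stride bookkeeping of the stage loop for depth 50."""
    in_channels, out_channels, bottleneck_channels = 64, 256, 64
    plan = []
    for idx, num_blocks in enumerate(num_blocks_per_stage):
        # Only the first stage keeps stride 1; dilation is fixed to 1 in the original loop
        first_stride = 1 if idx == 0 else 2
        plan.append(
            {
                "name": f"layer{idx + 1}",
                "in_channels": in_channels,
                "out_channels": out_channels,
                "bottleneck_channels": bottleneck_channels,
                "stride_per_block": [first_stride] + [1] * (num_blocks - 1),
            }
        )
        in_channels = out_channels
        out_channels *= 2
        bottleneck_channels *= 2
    return plan


for stage in resnet50_stage_plan():
    print(stage)
```

The last stage ends at 2048 output channels, which is the input width of the final `nn.Linear` classifier.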
# Copyright (c) Meta Platforms, Inc. and affiliates.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
"""
script for converting model from timm to aitemplate
Only tested on resnet50
"""
import pickle
import re
import click
import numpy as np
import timm
import torch
from aitemplate.testing import detect_target
CONV_WEIGHT_PATTERN = re.compile(r"conv\d+\.weight")

class timm_export:
    def __init__(self, model_name, pretrained_path, pretrained=True):
        self.model_name = model_name
        if model_name != "resnet50":
            raise NotImplementedError

        with torch.no_grad():
            self.pt_model = timm.create_model(
                model_name, pretrained=pretrained, num_classes=1000,
                pretrained_cfg_overlay=dict(file=pretrained_path)
            )
        self.pt_state = self.pt_model.state_dict()

    def export_model(self, half=True):
        fused_model = {}
        for param_name in self.pt_state.keys():
            self.transform_params(param_name, fused_model)
        # AIT parameter names use "_" where PyTorch names use "."
        ait_model = {k.replace(".", "_"): weight for k, weight in fused_model.items()}
        if detect_target().name() == "cuda":
            self.export_conv0(ait_model, fused_model)
        if half:
            half_params = {}
            for k, v in ait_model.items():
                half_params[k] = v.detach().cuda().half().contiguous()
            return half_params
        return ait_model

    def fuse_conv_bn_weights(
        self, conv_w, conv_b, bn_rm, bn_rv, bn_eps, bn_w, bn_b, transpose=False
    ):
        conv_w = torch.tensor(conv_w)
        bn_rm = torch.tensor(bn_rm)
        bn_rv = torch.tensor(bn_rv)
        conv_b = torch.tensor(conv_b) if conv_b is not None else torch.zeros_like(bn_rm)
        bn_w = torch.tensor(bn_w) if bn_w is not None else torch.ones_like(bn_rm)
        bn_b = torch.tensor(bn_b) if bn_b is not None else torch.zeros_like(bn_rm)
        bn_eps = torch.tensor(bn_eps)

        bn_var_rsqrt = torch.rsqrt(bn_rv + bn_eps)
        if transpose:
            shape = [1, -1] + [1] * (len(conv_w.shape) - 2)
        else:
            shape = [-1, 1] + [1] * (len(conv_w.shape) - 2)

        # Fold the BN scale into the weights and the BN shift into the bias
        conv_w = conv_w * (bn_w * bn_var_rsqrt).reshape(shape)
        conv_b = (conv_b - bn_rm) * bn_var_rsqrt * bn_w + bn_b

        # NCHW -> NHWC
        conv_w = conv_w.permute(0, 2, 3, 1).contiguous()
        for arr in [conv_w.numpy(), conv_b.numpy()]:
            if np.isnan(arr).any():
                print("fuse bn error")
        return conv_w, conv_b

    def transform_conv0(self):
        conv_w = self.pt_state["conv1.weight"]
        bn_w = self.pt_state["bn1.weight"]
        bn_b = self.pt_state["bn1.bias"]
        bn_rm = self.pt_state["bn1.running_mean"]
        bn_rv = self.pt_state["bn1.running_var"]
        fused_w, fused_b = self.fuse_conv_bn_weights(
            conv_w, None, bn_rm, bn_rv, 1e-5, bn_w, bn_b
        )
        return fused_w, fused_b

    def transform_params(self, param_name, fused_model):
        if param_name == "conv1.weight":
            fused_w, fused_b = self.transform_conv0()
            fused_model["stem.conv1.weight"] = fused_w
            fused_model["stem.conv1.bias"] = fused_b
        elif "downsample.0.weight" in param_name:
            fused_w, fused_b = self.transform_downsample(param_name)
            fused_model[param_name] = fused_w
            fused_model[param_name.replace("weight", "bias")] = fused_b
        elif param_name == "fc.weight":
            fused_model["fc.weight"] = self.pt_state["fc.weight"]
            fused_model["fc.bias"] = self.pt_state["fc.bias"]
        elif CONV_WEIGHT_PATTERN.search(param_name) is not None:
            bn_w_name = param_name.replace("conv", "bn")
            conv_w = self.pt_state[param_name]
            bn_w = self.pt_state[bn_w_name]
            bn_b = self.pt_state[bn_w_name.replace("weight", "bias")]
            bn_rm = self.pt_state[bn_w_name.replace("weight", "running_mean")]
            bn_rv = self.pt_state[bn_w_name.replace("weight", "running_var")]
            fused_w, fused_b = self.fuse_conv_bn_weights(
                conv_w, None, bn_rm, bn_rv, 1e-5, bn_w, bn_b
            )
            fused_model[param_name] = fused_w
            fused_model[param_name.replace("weight", "bias")] = fused_b
        else:
            pass

    def transform_downsample(self, param_name):
        assert "downsample" in param_name
        tags = param_name.split(".")
        block_tag = ".".join(tags[:2])
        conv_w = self.pt_state[f"{block_tag}.downsample.0.weight"]
        bn_w = self.pt_state[f"{block_tag}.downsample.1.weight"]
        bn_b = self.pt_state[f"{block_tag}.downsample.1.bias"]
        bn_rm = self.pt_state[f"{block_tag}.downsample.1.running_mean"]
        bn_rv = self.pt_state[f"{block_tag}.downsample.1.running_var"]
        fused_w, fused_b = self.fuse_conv_bn_weights(
            conv_w, None, bn_rm, bn_rv, 1e-5, bn_w, bn_b
        )
        return fused_w, fused_b

    def export_conv0(self, ait_model, fuse_model):
        pt_name = "stem.conv1.weight"
        x = fuse_model[pt_name]
conv_w = torch.zeros((64, 7, 7, 4))
conv_w[:, :, :, :3] = x
ait_model[pt_name.replace(".", "_")] = conv_w
def export_to_torch_tensor(model_path, model_name="resnet50"):
if model_name != "resnet50":
raise NotImplementedError
timm2ait = timm_export(model_name, pretrained_path=model_path)
ait_model = timm2ait.export_model(half=True)
return ait_model
@click.command()
@click.option("--param-path", type=str, default="resnet50.pkl")
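@click.option("--model-path", type=str, required=True, help="Path to the timm resnet50 checkpoint (.pth)")
def export_to_numpy(param_path, model_path):
# export_to_torch_tensor requires the checkpoint path as its first argument
ait_model = export_to_torch_tensor(model_path)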
np_weights = {}
for k, v in ait_model.items():
np_weights[k] = v.detach().cpu().numpy().astype(np.float16)
with open(param_path, "wb") as f:
pickle.dump(np_weights, f)
if __name__ == "__main__":
export_to_numpy()
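The `fuse_conv_bn_weights` method above folds the BatchNorm statistics into the preceding convolution, so inference needs a single op with a rescaled weight and a new bias. The following is a minimal pure-Python sketch of the same arithmetic for a 1x1 convolution applied to one pixel. The helper names (`fold_bn`, `conv1x1`, `batchnorm`) and the toy numbers are illustrative assumptions and do not come from the repository.

```python
import math


def fold_bn(conv_w, conv_b, bn_mean, bn_var, bn_eps, bn_gamma, bn_beta):
    """Fold BatchNorm into a 1x1 conv given as conv_w[out_ch][in_ch]."""
    fused_w, fused_b = [], []
    for o, row in enumerate(conv_w):
        # Per-output-channel scale: gamma / sqrt(running_var + eps)
        scale = bn_gamma[o] / math.sqrt(bn_var[o] + bn_eps)
        fused_w.append([w * scale for w in row])
        fused_b.append((conv_b[o] - bn_mean[o]) * scale + bn_beta[o])
    return fused_w, fused_b


def conv1x1(x, w, b):
    """Apply a 1x1 conv to one pixel x[in_ch]."""
    return [sum(wi * xi for wi, xi in zip(row, x)) + bo for row, bo in zip(w, b)]


def batchnorm(y, mean, var, eps, gamma, beta):
    """Inference-mode BatchNorm on one pixel y[ch]."""
    return [
        (yi - m) / math.sqrt(v + eps) * g + bt
        for yi, m, v, g, bt in zip(y, mean, var, gamma, beta)
    ]


# Toy 2-in, 2-out example (hypothetical values).
w = [[0.5, -1.0], [2.0, 0.25]]
b = [0.0, 0.0]
mean, var, eps = [0.1, -0.2], [1.5, 0.7], 1e-5
gamma, beta = [1.2, 0.8], [0.05, -0.3]
x = [3.0, -2.0]

reference = batchnorm(conv1x1(x, w, b), mean, var, eps, gamma, beta)
fused_w, fused_b = fold_bn(w, b, mean, var, eps, gamma, beta)
fused = conv1x1(x, fused_w, fused_b)
max_err = max(abs(r - f) for r, f in zip(reference, fused))
print(f"max abs difference: {max_err:.2e}")
```

The fused path and the conv-then-BN path should agree up to floating-point rounding, which is the property the export script relies on when it drops the BatchNorm layers.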
# Copyright (c) Meta Platforms, Inc. and affiliates.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
import os
from typing import Dict, List
import click
import numpy as np
import torch
from aitemplate.compiler import compile_model, Model
from aitemplate.frontend import Tensor
from aitemplate.testing import detect_target
from modeling.bert import BertBaseEncodersOnly, BertBaseUncased
from modeling.torch_model import BertBaseUncased as BertPt
def mark_output(y: Tensor) -> None:
if type(y) is not tuple:
y = (y,)
for i in range(len(y)):
y[i]._attrs["is_output"] = True
y[i]._attrs["name"] = "output_%d" % (i)
y_shape = [d._attrs["values"][0] for d in y[i]._attrs["shape"]]
print("output_{} shape: {}".format(i, y_shape))
def create_bert_inputs(
batch_size: int, seq_length: int, dtype: str = "int64"
) -> List[Tensor]:
input_ids = Tensor(
shape=[batch_size, seq_length],
name="input_ids",
dtype=dtype,
is_input=True,
)
token_type_ids = Tensor(
shape=[batch_size, seq_length],
name="token_type_ids",
dtype=dtype,
is_input=True,
)
position_ids = Tensor(
shape=[batch_size, seq_length],
name="position_ids",
dtype=dtype,
is_input=True,
)
return [input_ids, token_type_ids, position_ids]
def create_bert_encoders_input(
batch_size: int, seq_length: int, hidden: int, dtype: str = "float16"
):
encoder_input = Tensor(
shape=[batch_size, seq_length, hidden],
name="input",
dtype=dtype,
is_input=True,
)
return [encoder_input]
def create_bert_inputs_pt(
batch_size: int,
seq_length: int,
vocab_size: int = 30522,
type_vocab_size: int = 2,
dtype: torch.dtype = torch.int64,
) -> Dict[str, torch.Tensor]:
input_ids = torch.randint(
0, vocab_size, (batch_size, seq_length), dtype=dtype
).cuda()
token_type_ids = torch.randint(
0, type_vocab_size, input_ids.size(), dtype=dtype
).cuda()
position_ids = (
torch.arange(seq_length, dtype=dtype)
.reshape((1, -1))
.expand(batch_size, -1)
.contiguous()
.cuda()
)
return {
"input_ids": input_ids,
"token_type_ids": token_type_ids,
"position_ids": position_ids,
}
def create_bert_encoders_inputs_pt(
batch_size: int, seq_length: int, hidden_size: int
) -> Dict[str, torch.Tensor]:
encoder_input = torch.randn([batch_size, seq_length, hidden_size]).cuda().half()
return {"input": encoder_input}
def map_pt_params(
ait_bert, pt_bert, batch_size: int, seq_length: int
) -> Dict[str, torch.Tensor]:
pt_params = dict(pt_bert.named_parameters())
mapped_pt_params = {}
for name, _ in ait_bert.named_parameters():
ait_name = name.replace(".", "_")
if name in pt_params:
mapped_pt_params[ait_name] = pt_params[name]
continue
if name.endswith("self.qkv.weight"):
prefix = name[: -len("qkv.weight")]
q_weight = pt_params[prefix + "query.weight"]
k_weight = pt_params[prefix + "key.weight"]
v_weight = pt_params[prefix + "value.weight"]
qkv_weight = torch.cat([q_weight, k_weight, v_weight], dim=0)
mapped_pt_params[ait_name] = qkv_weight
elif name.endswith("self.qkv.bias"):
prefix = name[: -len("qkv.bias")]
q_bias = pt_params[prefix + "query.bias"]
k_bias = pt_params[prefix + "key.bias"]
v_bias = pt_params[prefix + "value.bias"]
qkv_bias = torch.cat([q_bias, k_bias, v_bias], dim=0)
mapped_pt_params[ait_name] = qkv_bias
elif name.endswith("self.proj.weight"):
prefix = name[: -len("self.proj.weight")]
pt_name = prefix + "output.dense.weight"
mapped_pt_params[ait_name] = pt_params[pt_name]
elif name.endswith("self.proj.bias"):
prefix = name[: -len("self.proj.bias")]
pt_name = prefix + "output.dense.bias"
mapped_pt_params[ait_name] = pt_params[pt_name]
elif name.endswith("cu_length"):
cu_len = np.cumsum([0] + [seq_length] * batch_size).astype("int32")
mapped_pt_params[ait_name] = torch.from_numpy(cu_len).cuda()
else:
pt_param = pt_bert.get_parameter(name)
mapped_pt_params[ait_name] = pt_param
return mapped_pt_params
def benchmark(
batch_size: int,
seq_length: int,
hidden_size: int,
mod: Model,
graph_mode: bool,
encoders_only: bool,
):
if encoders_only:
inputs = create_bert_encoders_inputs_pt(batch_size, seq_length, hidden_size)
else:
inputs = create_bert_inputs_pt(batch_size, seq_length)
outputs = [torch.empty(mod.get_output_maximum_shape(0)).cuda().half()]
# warm up
t, _, __ = mod.benchmark_with_tensors(
inputs,
outputs,
count=100,
repeat=4,
graph_mode=graph_mode,
)
# benchmark
t, _, __ = mod.benchmark_with_tensors(
inputs,
outputs,
count=100,
repeat=4,
graph_mode=graph_mode,
)
print(f"batch_size: {batch_size}, seq_length: {seq_length}, latency: {t}")
dev_flag = os.environ.get("HIP_VISIBLE_DEVICES", "-1")
dev_flag = dev_flag.replace(",", "_")
with open(f"bert_ait_benchmark_dev_{dev_flag}.txt", "a") as f:
f.write(f"batch_size: {batch_size}, seq_length: {seq_length}, latency: {t}\n")
def compile_module(
batch_size: int,
seq_length: int,
hidden_size: int,
activation: str,
use_fp16_acc: bool,
encoders_only: bool,
pt_model: torch.nn.Module,
) -> Model:
model_name = f"BERT_{activation}_{batch_size}_{seq_length}"
target = detect_target(use_fp16_acc=use_fp16_acc)
if encoders_only:
inputs = create_bert_encoders_input(batch_size, seq_length, hidden_size)
else:
inputs = create_bert_inputs(batch_size, seq_length)
if encoders_only:
model = BertBaseEncodersOnly(batch_size, seq_length, hidden_act=activation)
else:
model = BertBaseUncased(batch_size, seq_length, hidden_act=activation)
# Mark all parameters with name same to PyTorch name convention
model.name_parameter_tensor()
# Forward the input tensor to the model, get output tensor
y = model(*inputs)
# Mark output tensor
mark_output(y)
params = map_pt_params(model, pt_model, batch_size, seq_length)
mod = compile_model(y, target, "./tmp", model_name, profile_devs=[0,1,2,3])
mod.set_many_constants_with_tensors(params)
mod.fold_constants(sync=True)
return mod
@click.command()
@click.option("--batch-size", type=int, default=0, help="Inference batch size")
@click.option("--seq-length", type=int, default=0, help="Inference sequence length")
@click.option(
"--activation",
type=str,
default="fast_gelu",
help="Activation function applied in BERT. ROCm currently supports only fast_gelu; CUDA supports both gelu and fast_gelu. No effect if framework is pt.",
)
@click.option(
"--graph-mode",
type=bool,
default=True,
help="Use CUDA graph or not. hipGraph is not supported yet. No effect if framework is pt.",
)
@click.option(
"--use-fp16-acc",
type=bool,
default=True,
help="Use fp16 accumulation or not (TensorRT is using fp16_acc). No effect if framework is pt.",
)
@click.option(
"--use-pretrained-pt-model",
type=bool,
default=True,
help="Whether or not to use the pretrained BERT model weights.",
)
@click.option(
"--encoders-only",
type=bool,
default=True,
help="Whether or not to run the BERT benchmark with encoders only. If enabled, only the transformer blocks without BERT embeddings are benchmarked.",
)
def compile_and_benchmark(
batch_size: int,
seq_length: int,
activation: str,
graph_mode: bool,
use_fp16_acc: bool,
use_pretrained_pt_model: bool,
encoders_only: bool,
):
if detect_target().name() == "rocm":
graph_mode = False
# One-element tuple: a bare ("fast_gelu") is a string, so `in` would do a substring match
assert activation in (
"fast_gelu",
), f"Unsupported activation: {activation} on rocm"
pt_model = BertPt(pretrained=use_pretrained_pt_model, model_path="./bert-base-uncased")._model
pt_model.eval()
hidden_size = pt_model.config.hidden_size
if batch_size < 1:
batch_sizes = [1, 2, 4, 8, 16, 32, 64, 128, 256]
else:
batch_sizes = [batch_size]
if seq_length < 1:
seq_lengths = (
[64, 128, 256] if encoders_only else [64, 128]
)
else:
seq_lengths = [seq_length]
for seq_length in seq_lengths:
for bs in batch_sizes:
mod = compile_module(
bs,
seq_length,
hidden_size,
activation,
use_fp16_acc,
encoders_only,
pt_model,
)
benchmark(bs, seq_length, hidden_size, mod, graph_mode, encoders_only)
if __name__ == "__main__":
torch.manual_seed(4896)
compile_and_benchmark()
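`map_pt_params` above handles two parameters that have no one-to-one PyTorch counterpart: the fused `qkv` weight, built by concatenating the query, key and value weights along dimension 0, and `cu_length`, the cumulative sequence offsets for a batch of equal-length sequences. A small sketch of both with plain lists, assuming hypothetical helper names (`cumulative_lengths`, `pack_qkv`) and a toy hidden size of 2:

```python
from itertools import accumulate


def cumulative_lengths(batch_size, seq_length):
    """Offsets where each sequence starts in a packed batch, plus the total."""
    return list(accumulate([0] + [seq_length] * batch_size))


def pack_qkv(q_rows, k_rows, v_rows):
    """Stack Q, K, V weight rows along dim 0, like torch.cat(..., dim=0)."""
    return q_rows + k_rows + v_rows


offsets = cumulative_lengths(4, 64)
print(offsets)

# Tiny hidden size of 2 (hypothetical): each projection is a 2x2 matrix.
q = [[1, 0], [0, 1]]
k = [[2, 0], [0, 2]]
v = [[3, 0], [0, 3]]
qkv = pack_qkv(q, k, v)
print(len(qkv), len(qkv[0]))
```

For hidden size `h`, the packed weight has shape `[3 * h, h]`, and the offsets list has `batch_size + 1` entries.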
#!/bin/bash
#profile
HIP_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python3 benchmark_ait.py
#1GCD
HIP_VISIBLE_DEVICES=0 python3 benchmark_ait.py --batch-size "$1"
#2GCD
HIP_VISIBLE_DEVICES=0 python3 benchmark_ait.py --batch-size "$1" &
HIP_VISIBLE_DEVICES=1 python3 benchmark_ait.py --batch-size "$1" &
# fg needs job control, which is off in non-interactive scripts; wait blocks until both jobs exit
wait
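The script above pins each benchmark process to one GCD through `HIP_VISIBLE_DEVICES` and runs the two processes concurrently. The same pattern can be sketched in Python with `subprocess`. The `run_per_device` helper is a hypothetical name, and the worker command here is a stand-in that only echoes the device it was given.

```python
import os
import subprocess
import sys


def run_per_device(cmd, devices):
    """Start one copy of cmd per device id, then wait for all of them."""
    procs = []
    for dev in devices:
        env = dict(os.environ, HIP_VISIBLE_DEVICES=str(dev))
        procs.append(
            subprocess.Popen(cmd, env=env, stdout=subprocess.PIPE, text=True)
        )
    results = []
    for proc in procs:
        out, _ = proc.communicate()  # blocks until this worker exits
        results.append((proc.returncode, out.strip()))
    return results


# Stand-in worker that only echoes the device it was given; replace with
# e.g. [sys.executable, "benchmark_ait.py", "--batch-size", "8"].
worker = [sys.executable, "-c", "import os; print(os.environ['HIP_VISIBLE_DEVICES'])"]
results = run_per_device(worker, devices=[0, 1])
print(results)
```

All workers are started before any of them is waited on, so they overlap in time just like the backgrounded shell jobs.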
# Copyright (c) Meta Platforms, Inc. and affiliates.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
import click
import torch
from aitemplate.testing.benchmark_pt import benchmark_torch_function
from modeling.torch_model import BertBaseUncased
def benchmark_pt(pretrained=True, batchsize=0):
bert = BertBaseUncased(pretrained=pretrained)
model = bert._model
model.eval()
if batchsize == 0:
candidate_batch_sizes = [1, 2, 4, 8, 16, 32, 64, 128, 256]
else:
candidate_batch_sizes = [batchsize]
with torch.inference_mode():
for seq_length in [64, 128]:
for batch_size in candidate_batch_sizes:
try:
input_ids, token_type_ids, position_ids = bert.generate_inputs(
batch_size, seq_length
)
bert.forward(
input_ids=input_ids,
token_type_ids=token_type_ids,
position_ids=position_ids,
)
# warmup
t = benchmark_torch_function(
100,
bert.forward,
input_ids=input_ids,
token_type_ids=token_type_ids,
position_ids=position_ids,
)
# benchmark
t = benchmark_torch_function(
100,
bert.forward,
input_ids=input_ids,
token_type_ids=token_type_ids,
position_ids=position_ids,
)
print(
f"bert pt: batch_size: {batch_size}, seq_length: {seq_length}, {t} ms",
)
with open("bert_pt_benchmark.txt", "a") as f:
f.write(
f"batch_size: {batch_size}, seq_length: {seq_length} latency: {t} ms\n"
)
except RuntimeError:
# pt runs out of memory
break
def benchmark_pt_encoders_only(pretrained=True, batchsize=0):
model = BertBaseUncased(pretrained=pretrained)
pt_bert = model._model
pt_bert.eval()
encoder = pt_bert.bert.encoder
hidden_size = pt_bert.config.hidden_size
if batchsize == 0:
candidate_batch_sizes = [1, 2, 4, 8, 16, 32, 64, 128, 256]
else:
candidate_batch_sizes = [batchsize]
for seq_length in [64, 128, 256]:
for batch_size in candidate_batch_sizes:
try:
encoder_input = (
torch.randn([batch_size, seq_length, hidden_size]).cuda().half()
)
encoder.forward(encoder_input)
# warmup
t = benchmark_torch_function(
100,
encoder.forward,
encoder_input,
)
# benchmark
t = benchmark_torch_function(
100,
encoder.forward,
encoder_input,
)
print(
f"bert encoders pt: batch_size: {batch_size}, seq_length: {seq_length}, {t} ms",
)
with open("bert_encoders_pt_benchmark.txt", "a") as f:
f.write(
f"batch_size: {batch_size}, seq_length: {seq_length} latency: {t} ms\n"
)
except RuntimeError:
# pt runs out of memory
break
@click.command()
@click.option(
"--use-pretrained-pt-model",
type=bool,
default=True,
help="Whether or not to use the pretrained BERT model weights.",
)
@click.option(
"--encoders-only",
type=bool,
default=True,
help="Whether or not to run the BERT benchmark with encoders only. If enabled, only the transformer blocks without BERT embeddings are benchmarked.",
)
@click.option(
"--batch-size",
type=int,
default=0,
help="The batch size to use for the benchmark. If 0, sweeps the default batch sizes 1, 2, 4, ..., 256.",
)
def benchmark(
use_pretrained_pt_model: bool,
encoders_only: bool,
batch_size: int,
):
if encoders_only:
benchmark_pt_encoders_only(use_pretrained_pt_model, batch_size)
else:
benchmark_pt(use_pretrained_pt_model, batch_size)
if __name__ == "__main__":
torch.manual_seed(4896)
benchmark()
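Both benchmark functions above follow the same pattern: run the workload once, time a warm-up pass whose result is discarded, then time the pass that gets reported. A CPU-only sketch of that pattern is shown below. `time_function` is a hypothetical stand-in with the same argument order as `benchmark_torch_function(iters, fn, ...)`. It uses wall-clock time and does no device synchronization, which real GPU timing would need.

```python
import time


def time_function(iters, fn, *args, **kwargs):
    """Return mean wall-clock milliseconds per call over `iters` calls."""
    start = time.perf_counter()
    for _ in range(iters):
        fn(*args, **kwargs)
    return (time.perf_counter() - start) * 1000.0 / iters


calls = {"n": 0}


def workload(n):
    """Toy workload that also counts how often it was invoked."""
    calls["n"] += 1
    return sum(i * i for i in range(n))


time_function(10, workload, 1000)  # warm-up, result discarded
latency_ms = time_function(100, workload, 1000)  # measured run
print(f"{latency_ms:.4f} ms per call")
```

Discarding the first pass keeps one-time costs such as lazy initialization and cache fills out of the reported latency.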
# Copyright (c) Meta Platforms, Inc. and affiliates.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
import click
import torch
from transformers import BertTokenizer
from benchmark_ait import compile_module
from modeling.torch_model import BertBaseUncased as BertPt
def prepare_data(prompt: str, model_path: str):
tokenizer = BertTokenizer.from_pretrained(model_path)
result = tokenizer(prompt, return_attention_mask=False, return_tensors="pt")
target_size = result["input_ids"].size()
if target_size[1] > 512:
raise ValueError("Sequence length > 512 is not supported")
result["position_ids"] = (
torch.arange(target_size[1], dtype=torch.int64)
.reshape(result["input_ids"].size())
.contiguous()
.cuda()
)
return result
def run_model(
prompt: str,
activation: str,
graph_mode: bool,
use_fp16_acc: bool,
verify: bool,
model_path="bert-base-uncased",
):
inputs = prepare_data(prompt, model_path)
inputs_pt = {name: data.cuda() for name, data in inputs.items()}
batch_size, seq_len = inputs["input_ids"].size()
pt_model = BertPt(model_path=model_path, pretrained=True)._model
pt_model.eval()
hidden_size = pt_model.config.hidden_size
mod = compile_module(
batch_size, seq_len, hidden_size, activation, use_fp16_acc, False, pt_model
)
outputs = [torch.empty(mod.get_output_maximum_shape(0)).half().cuda()]
mod.run_with_tensors(inputs_pt, outputs, graph_mode=graph_mode)
print(f"Logits: {outputs[0]}")
if verify:
pt_outputs = pt_model.bert(**inputs_pt)
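# Keep the comparison result instead of discarding it, and report it
is_close = torch.allclose(outputs[0], pt_outputs.last_hidden_state, 1e-1, 1e-1)
print(f"Verification {'passed' if is_close else 'FAILED'}!")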
@click.command()
@click.option(
"--prompt",
type=str,
default="The quick brown fox jumps over the lazy dog.",
help="The prompt to give BERT.",
)
@click.option(
"--activation",
type=str,
default="fast_gelu",
help="Activation function applied in BERT; currently only gelu and fast_gelu are supported",
)
@click.option(
"--graph_mode",
type=bool,
default=True,
help="Use CUDA graph or not. (hipGraph is not supported yet)",
)
@click.option(
"--use_fp16_acc",
type=bool,
default=True,
help="Use fp16 accumulation or not (TensorRT is using fp16_acc)",
)
@click.option(
"--verify",
type=bool,
default=True,
help="Verify AIT outputs against PT",
)
def run_demo(
prompt: str,
activation: str,
graph_mode: bool,
use_fp16_acc: bool,
verify: bool,
):
run_model(prompt, activation, graph_mode, use_fp16_acc, verify, model_path="./bert-base-uncased")
if __name__ == "__main__":
torch.manual_seed(4896)
run_demo()
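The `--verify` step in the demo compares the AIT output with the PyTorch output through `torch.allclose`, passing `1e-1` for both `rtol` and `atol`. The criterion is elementwise `|actual - expected| <= atol + rtol * |expected|`. A pure-Python sketch of that check on flat lists, with a hypothetical `allclose` helper and made-up values:

```python
def allclose(actual, expected, rtol=1e-1, atol=1e-1):
    """True if |a - e| <= atol + rtol * |e| for every pair."""
    if len(actual) != len(expected):
        return False
    return all(abs(a - e) <= atol + rtol * abs(e) for a, e in zip(actual, expected))


expected = [0.50, -1.25, 3.00]
ok = allclose([0.52, -1.20, 3.10], expected)   # every element within tolerance
bad = allclose([0.52, -1.20, 4.00], expected)  # last element is too far off
print(ok, bad)
```

With tolerances this loose the check only catches gross mismatches, which fits a comparison between fp16 outputs and a reference.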