initial commit

Commit 2384a2ca authored Jun 03, 2024 by chenxj
20 changed files
--- a/README.md
+++ b/README.md
+# stable_diffusion
+## Paper
+[High-Resolution Image Synthesis with Latent Diffusion Models](https://arxiv.org/pdf/2112.10752)
+## Model Architecture
+The core of Stable Diffusion is the latent diffusion model, whose structure is shown below:
+
+![image](http://10.6.10.68/modelzoo/stable_diffusion_ait/-/raw/master/resources/sd_model.png)
+## Algorithm
+Based on the model architecture, the algorithm works in outline as follows:
+
+![image](http://10.6.10.68/modelzoo/stable_diffusion_ait/-/raw/master/resources/sd_principle.png)
+## Dataset
+None
+## Environment Setup
+The Docker image for inference can be pulled from [光源](https://sourcefind.cn/#/service-list). The recommended image for stable_diffusion_ait is:
+```
+docker pull image.sourcefind.cn:5000/dcu/admin/base/custom:ait-0.0.1_dtk24.04_py310
+docker run -d -t -v /opt/hyhal:/opt/hyhal:ro --privileged --device=/dev/kfd --device=/dev/dri/ --network=host --group-add video --name sd-test image.sourcefind.cn:5000/dcu/admin/base/custom:ait-0.0.1_dtk24.04_py310
+docker exec -it sd-test bash
+source /opt/dtk/env.sh
+```
+## Inference
+**install ait**
+```
+cd stable_diffusion_ait
+pip3 install dist/aitemplate-0.0.1-py3-none-any.whl
+```
+#### 01_resnet-50
+```
+cd examples/01_resnet-50
+```
+Download the resnet50 weights (https://github.com/rwightman/pytorch-image-models/releases/download/v0.1-rsb-weights/resnet50_a1_0-14fe96d1.pth) into this directory; the scripts load them from `./resnet50_a1_0-14fe96d1.pth`.
+
+**benchmark**
+```
+python3 benchmark_ait.py
+python3 benchmark_pt.py
+```
+**infer**
+```
+python3 infer_with_torch.py
+```
+#### 02_bert
+```
+cd examples/02_bert
+```
+Download the bert-base-uncased weights (https://huggingface.co/google-bert/bert-base-uncased)
+
+**benchmark**
+```
+python3 benchmark_ait.py
+python3 benchmark_pt.py
+```
+**infer**
+```
+python3 demo.py
+```
+#### 03_vit
+```
+cd examples/03_vit
+```
+Download the vit_base_patch16_224 weights (https://huggingface.co/timm/vit_base_patch16_224.augreg2_in21k_ft_in1k)
+
+**benchmark**
+```
+python3 benchmark_ait.py
+python3 benchmark_pt.py
+```
+**verification**
+```
+python3 verification.py
+```
+#### 04_stable_diffusion
+
+Download the stable-diffusion-2-1-base weights (https://huggingface.co/stabilityai/stable-diffusion-2-1-base)
+
+Download the clip-vit-large-patch14 weights (https://huggingface.co/openai/clip-vit-large-patch14)
+
+**compile**
+```
+cd examples/04_stable_diffusion
+python3 scripts/compile.py --local-dir stable-diffusion-2-1-base_path
+```
+**benchmark**
+```
+python3 src/benchmark.py --local-dir stable-diffusion-2-1-base_path --clip-dir clip-vit-large-patch14_path --benchmark-pt True
+```
+**infer**
+```
+python3 scripts/demo.py --local-dir stable-diffusion-2-1-base_path
+python3 scripts/demo_pt.py --local-dir stable-diffusion-2-1-base_path
+```
+## Results
+![image](http://10.6.10.68/modelzoo/stable_diffusion_ait/-/raw/master/resources/example_ait.png)
+### Accuracy
+None
+### Performance Data
+01_resnet-50
+| batch size | pt latency(ms) | ait latency(ms) | 
+| :------: | :------: |:------: |
+| 1 | 3.50665771484375 | 2.7900346517562866 | 
+| 2 | 4.198978271484375 | 3.022238612174988 | 
+| 4 | 5.242999877929687 | 3.6645140647888184 | 
+| 8 | 7.416472778320313 | 4.517657279968262 | 
+| 16 | 11.60461181640625 | 6.50670599937439 | 
+| 32 | 19.8466064453125 | 10.511437177658081 | 
+| 64 | 36.08590576171875 | 18.35416030883789 | 
+| 128 | 67.2965625 | 32.82508373260498 | 
+| 256 | 133.891044921875 | 59.32628345489502 | 
+
+02_bert
+bert sequence length 64
+| batch size | pt latency(ms) | ait latency(ms) | 
+| :------: | :------: |:------: |
+| 1 | 5.492328491210937 | 3.763094484806061 | 
+| 2 | 5.851014404296875 | 3.934549331665039 | 
+| 4 | 5.8462060546875 | 7.370500087738037 | 
+| 8 | 7.20282958984375 | 7.630655765533447 | 
+| 16 | 10.13709716796875 | 6.997292518615723 | 
+| 32 | 14.629547119140625 | 15.192972660064697 | 
+| 64 | 24.83916259765625 | 18.988140106201172 | 
+| 128 | 45.0836083984375 | 33.51811981201172 | 
+| 256 | 85.2006640625 | 91.8479995727539 | 
+
+bert sequence length 128
+| batch size | pt latency(ms) | ait latency(ms) | 
+| :------: | :------: |:------: |
+| 1 | 5.583170776367187 | 3.8969525694847107 | 
+| 2 | 5.851030883789062 | 7.2915791273117065 | 
+| 4 | 7.507911376953125 | 7.635279178619385 | 
+| 8 | 10.716405029296874 | 6.723778605461121 | 
+| 16 | 16.03172607421875 | 14.665886878967285 | 
+| 32 | 27.00265869140625 | 18.18143320083618 | 
+| 64 | 49.812158203125 | 32.23751640319824 | 
+| 128 | 94.589228515625 | 87.85263633728027 | 
+| 256 | 179.57365234375 | 107.5546760559082 | 
+
+bert sequence length 256
+| batch size | pt latency(ms) | ait latency(ms) | 
+| :------: | :------: |:------: |
+| 1 | 5.536416625976562 | 4.418213129043579 | 
+| 2 | 7.61077392578125 | 5.24817168712616 | 
+| 4 | 11.207763671875 | 14.24576473236084 | 
+| 8 | 16.431724853515625 | 12.16104507446289 | 
+| 16 | 26.9556640625 | 19.11765956878662 | 
+| 32 | 49.54421875 | 33.73731803894043 | 
+| 64 | 93.535673828125 | 61.45344161987305 | 
+| 128 | 178.09998046875 | 113.69585227966309 | 
+| 256 | 347.1721484375 | 2563.2373657226562 | 
+
+03_vit
+vit_base_patch16_224
+| batch size | pt latency(ms) | ait latency(ms) | 
+| :------: | :------: |:------: |
+| 1 | 4.531996154785157 | 8.322586297988892 | 
+| 2 | 6.666417846679687 | 8.580682277679443 | 
+| 4 | 10.00460205078125 | 6.754000902175903 | 
+| 8 | 13.427578125 | 9.69419240951538 | 
+| 16 | 21.916123046875 | 17.138832092285156 | 
+| 32 | 40.23213134765625 | 28.402775287628174 | 
+| 64 | 72.446611328125 | 53.653794288635254 | 
+| 128 | 136.889541015625 | 99.72106170654297 | 
+| 256 | 269.488203125 | 186.07625198364258 | 
+
+04_stable_diffusion
+single batch
+| module | pt latency(ms) | ait latency(ms) | 
+| :------: | :------: |:------: |
+| clip | 11.66868896484375 | 13.639567375183105 | 
+| unet | 106.440107421875 | 71.8858814239502 | 
+| vae | 95.6298046875 | 74.00970458984375 | 
+| pipeline | 5429.30386474609375 | 3681.943343162536855 | 
+
+batched version
+| batch size | pt latency(ms) | ait latency(ms) | 
+| :------: | :------: |:------: |
+| 1 | 5429.30386474609375 | 3681.943343162536855 | 
+| 2 | 13816.322286962532 | 5283.831155044027 | 
+| 4 | 23745.903372997418 | 9285.692506004125 | 
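+## Application Scenarios
+### Algorithm Category
+Text-to-image
+### Key Application Industries
+Art and design, game development, film production
+## Source Repository and Issue Feedback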
+http://10.6.10.68/modelzoo/stable_diffusion_ait
+## References
+https://github.com/ROCm/AITemplate
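
The latency tables in the README are easier to compare as ratios. The following is a minimal sketch, assuming "speedup" means PyTorch latency divided by AIT latency (the README does not define the term). The helper names `speedup` and `summarize` are hypothetical, and the values are a few rows copied from the tables and rounded to three decimals.

```python
# Latencies in ms as (pt, ait), rounded from the README tables.
RESNET50 = {
    1: (3.507, 2.790),
    256: (133.891, 59.326),
}
BERT_SEQ256 = {
    128: (178.100, 113.696),
    256: (347.172, 2563.237),
}


def speedup(pt_ms: float, ait_ms: float) -> float:
    """Return pt/ait; values above 1.0 mean AIT is faster."""
    if ait_ms <= 0:
        raise ValueError("latency must be positive")
    return pt_ms / ait_ms


def summarize(table: dict) -> dict:
    """Map each batch size to its speedup, rounded to two decimals."""
    return {bs: round(speedup(pt, ait), 2) for bs, (pt, ait) in table.items()}


if __name__ == "__main__":
    print("resnet50:", summarize(RESNET50))
    print("bert seq 256:", summarize(BERT_SEQ256))
```

On these numbers the BERT row with sequence length 256 and batch size 256 is the one sampled case where AIT comes out much slower than PyTorch, so that measurement may be worth repeating.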
+
--- a/dist/aitemplate-0.0.1-py3-none-any.whl
+++ b/dist/aitemplate-0.0.1-py3-none-any.whl
--- a/examples/01_resnet-50/benchmark_ait.py
+++ b/examples/01_resnet-50/benchmark_ait.py
+#  Copyright (c) Meta Platforms, Inc. and affiliates.
+#
+#  Licensed under the Apache License, Version 2.0 (the "License");
+#  you may not use this file except in compliance with the License.
+#  You may obtain a copy of the License at
+#
+#      http://www.apache.org/licenses/LICENSE-2.0
+#
+#  Unless required by applicable law or agreed to in writing, software
+#  distributed under the License is distributed on an "AS IS" BASIS,
+#  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+#  See the License for the specific language governing permissions and
+#  limitations under the License.
+#
+"""benchmark for resnet50"""
+
+import os
+
+import click
+
+import torch
+from aitemplate.compiler import compile_model, Model
+
+from aitemplate.frontend import Tensor
+from aitemplate.testing import detect_target
+from modeling.resnet import build_resnet_backbone
+from weight_utils import export_to_torch_tensor
+
+
+def mark_output(y):
+    """Unlike PyTorch, AIT requires output tensors to be marked explicitly for optimization.
+
+    Parameters
+    ----------
+    y : Tensor or Tuple[Tensor]
+        Output tensor, or tuple of output tensors
+    """
+    if not isinstance(y, tuple):
+        y = (y,)
+    for i in range(len(y)):
+        y[i]._attrs["is_output"] = True
+        y[i]._attrs["name"] = "output_%d" % (i)
+        y_shape = [d._attrs["values"][0] for d in y[i]._attrs["shape"]]
+        print("output_{} shape: {}".format(i, y_shape))
+
+
+def compile_module(model_name, batch_size, **kwargs):
+
+    if model_name != "resnet50":
+        raise NotImplementedError
+
+    model_name = f"{model_name}_{batch_size}"
+    target = detect_target(**kwargs)
+    # Create input tensor, need to specify the shape, dtype and is_input flag
+    x = Tensor(
+        shape=[batch_size, 224, 224, 3], dtype="float16", name="input0", is_input=True
+    )
+    model = build_resnet_backbone(50, activation="ReLU")
+    # Name all parameters following the PyTorch naming convention
+    model.name_parameter_tensor()
+    # Forward the input tensor to the model, get output tensor
+    y = model(x)
+    # Mark output tensor
+    mark_output(y)
+    # Compile the model
+    module = compile_model(y, target, "./tmp", model_name, profile_devs=[0,1,2,3])
+    return module
+
+
+def benchmark(model_name, batch_size, cuda_params, mod=None, graph_mode=True):
+    # Load compiled model
+    if mod is None:
+        model_name = f"{model_name}_{batch_size}"
+        mod = Model(os.path.join("./tmp", model_name, "test.so"))
+
+    # Set params
+    mod.set_many_constants_with_tensors(cuda_params)
+    mod.fold_constants(sync=True)
+
+    # prepare input/output tensor
+    x_input = torch.randn([batch_size, 224, 224, 3]).cuda().half()
+    x_input = x_input.contiguous()
+    y_output = torch.zeros([batch_size, 1, 1, 1000]).cuda().half()
+    y_output = y_output.contiguous()
+
+    # warm up
+    t, _, __ = mod.benchmark_with_tensors(
+        [x_input],
+        [y_output],
+        count=100,
+        repeat=4,
+        graph_mode=graph_mode,
+    )
+    # benchmark
+    t, _, __ = mod.benchmark_with_tensors(
+        [x_input],
+        [y_output],
+        count=100,
+        repeat=4,
+        graph_mode=graph_mode,
+    )
+    print(f"batch_size: {batch_size}, latency: {t}")
+    dev_flag = os.environ.get("HIP_VISIBLE_DEVICES", "-1")
+    dev_flag = dev_flag.replace(",", "_")
+    with open(f"resnet50_ait_benchmark_dev_{dev_flag}.txt", "a") as f:
+        f.write(f"batch_size: {batch_size}, latency: {t}\n")
+
+
+@click.command()
+@click.option(
+    "--use-fp16-acc",
+    type=bool,
+    default=True,
+    help="Whether to use FP16 for accumulation (similar to TensorRT)",
+)
+@click.option("--use-graph", type=bool, default=True, help="Whether to use CUDA graph")
+@click.option("--batch-size", type=int, default=0, help="Batch size")
+def main(use_fp16_acc=True, use_graph=True, batch_size=0):
+    if detect_target().name() == "rocm":
+        use_graph = False
+        
+    pretrained_path = "./resnet50_a1_0-14fe96d1.pth"
+    cuda_params = export_to_torch_tensor(model_name="resnet50", model_path=pretrained_path)
+    
+    if batch_size < 1:
+        for bs in (1, 2, 4, 8, 16, 32, 64, 128, 256):
+            compile_module("resnet50", bs, use_fp16_acc=use_fp16_acc)
+            benchmark("resnet50", bs, cuda_params, graph_mode=use_graph)
+    else:
+        benchmark("resnet50", batch_size, cuda_params, graph_mode=use_graph)
+
+
+if __name__ == "__main__":
+    main()
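
`benchmark()` calls `benchmark_with_tensors` twice with identical arguments and keeps only the second result, so the first call serves as a warm-up. The pattern can be sketched with the standard library alone. `time_ms` is a hypothetical helper, and treating the total number of calls as `count * repeat` is an assumption made for illustration, not a description of how AITemplate measures latency internally. The sketch times a CPU callable by wall clock; GPU work would additionally need device synchronization before reading the clock.

```python
import time
from typing import Callable


def time_ms(fn: Callable[[], object], count: int = 100, repeat: int = 4) -> float:
    """Mean wall-clock latency of fn in milliseconds over count * repeat calls."""
    total_calls = count * repeat
    start = time.perf_counter()
    for _ in range(total_calls):
        fn()
    elapsed = time.perf_counter() - start
    return elapsed * 1000.0 / total_calls


def benchmark(fn: Callable[[], object], count: int = 100, repeat: int = 4) -> float:
    # Warm-up pass: the result is discarded; it only absorbs one-time costs
    # such as lazy initialization and cache population.
    time_ms(fn, count, repeat)
    # Measured pass: this is the number that gets reported.
    return time_ms(fn, count, repeat)


if __name__ == "__main__":
    latency = benchmark(lambda: sum(range(1000)), count=50, repeat=2)
    print(f"latency: {latency:.4f} ms")
```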
--- a/examples/01_resnet-50/benchmark_mi250.sh
+++ b/examples/01_resnet-50/benchmark_mi250.sh
+#!/bin/bash
+
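+# Run one benchmark per device in parallel, then wait for both to finish.
+# `fg` needs job control, which non-interactive shells disable, so use `wait`.
+HIP_VISIBLE_DEVICES=0 python3 benchmark_ait.py --batch-size "$1" &
+HIP_VISIBLE_DEVICES=1 python3 benchmark_ait.py --batch-size "$1"
+wait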
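
The script starts one benchmark process per device and selects the device through `HIP_VISIBLE_DEVICES`. The same idea can be sketched in Python, which generalizes to any number of devices. `build_jobs` and `run_jobs` are hypothetical helpers, not part of this repository; only the script name and the `--batch-size` flag are taken from the commit.

```python
import os
import subprocess
import sys
from typing import Dict, List, Sequence, Tuple

Job = Tuple[List[str], Dict[str, str]]


def build_jobs(
    devices: Sequence[int], batch_size: int, script: str = "benchmark_ait.py"
) -> List[Job]:
    """Build one (command, environment) pair per device index."""
    jobs = []
    for dev in devices:
        env = dict(os.environ)
        env["HIP_VISIBLE_DEVICES"] = str(dev)  # restrict the process to one device
        cmd = [sys.executable, script, "--batch-size", str(batch_size)]
        jobs.append((cmd, env))
    return jobs


def run_jobs(jobs: Sequence[Job]) -> List[int]:
    """Start all jobs concurrently, then wait for every one and return exit codes."""
    procs = [subprocess.Popen(cmd, env=env) for cmd, env in jobs]
    return [proc.wait() for proc in procs]


if __name__ == "__main__":
    # Only print the planned jobs here; call run_jobs(...) to execute them.
    for cmd, env in build_jobs([0, 1], batch_size=8):
        print(env["HIP_VISIBLE_DEVICES"], " ".join(cmd))
```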
--- a/examples/01_resnet-50/benchmark_pt.py
+++ b/examples/01_resnet-50/benchmark_pt.py
+#  Copyright (c) Meta Platforms, Inc. and affiliates.
+#
+#  Licensed under the Apache License, Version 2.0 (the "License");
+#  you may not use this file except in compliance with the License.
+#  You may obtain a copy of the License at
+#
+#      http://www.apache.org/licenses/LICENSE-2.0
+#
+#  Unless required by applicable law or agreed to in writing, software
+#  distributed under the License is distributed on an "AS IS" BASIS,
+#  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+#  See the License for the specific language governing permissions and
+#  limitations under the License.
+#
+import os
+
+import click
+import timm
+import torch
+from aitemplate.testing.benchmark_pt import benchmark_torch_function
+
+
+def benchmark(model, batch_size):
+    with torch.inference_mode():
+        input_shape = (batch_size, 3, 224, 224)
+        input_data = torch.randn(input_shape).cuda().half()
+        # warm up
+        benchmark_torch_function(100, model, input_data)
+        # benchmark
+        t = benchmark_torch_function(100, model, input_data)
+        print("batch_size: {}, time: {}".format(batch_size, t))
+        dev_flag = os.environ.get("HIP_VISIBLE_DEVICES", "-1")
+        dev_flag = dev_flag.replace(",", "_")
+        with open(f"resnet50_pt_benchmark_dev_{dev_flag}.txt", "a") as f:
+            f.write("batch_size: {}, latency: {}\n".format(batch_size, t))
+
+
+@click.command()
+@click.option("--batch-size", default=0, type=int)
+def main(batch_size):
+    model = timm.create_model(
+                "resnet50", pretrained=True, num_classes=1000, 
+                pretrained_cfg_overlay=dict(file="./resnet50_a1_0-14fe96d1.pth")
+            ).cuda().half()
+    model.eval()
+    if batch_size == 0:
+        for batch_size in [1, 2, 4, 8, 16, 32, 64, 128, 256]:
+            benchmark(model, batch_size)
+    else:
+        benchmark(model, batch_size)
+
+
+if __name__ == "__main__":
+    main()
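
Both benchmark scripts append lines of the form `batch_size: <n>, latency: <t>` to a per-device text file. Because the files are opened in append mode, repeated runs accumulate entries. A minimal sketch of reading such a file back, assuming the latency is written as a plain float; `parse_results` is a hypothetical helper and is not part of this commit.

```python
import re
from typing import Dict, Iterable

# Matches the line format written by benchmark_ait.py and benchmark_pt.py.
LINE_RE = re.compile(r"batch_size:\s*(\d+),\s*latency:\s*([0-9.eE+-]+)")


def parse_results(lines: Iterable[str]) -> Dict[int, float]:
    """Map batch size to latency; a later line overwrites an earlier one."""
    results: Dict[int, float] = {}
    for line in lines:
        match = LINE_RE.search(line)
        if match:
            results[int(match.group(1))] = float(match.group(2))
    return results


if __name__ == "__main__":
    sample = [
        "batch_size: 1, latency: 3.6\n",
        "batch_size: 2, latency: 4.2\n",
        "batch_size: 1, latency: 3.5\n",  # a second run appended later
    ]
    print(parse_results(sample))
```

Keeping the last entry per batch size is one possible policy; averaging repeated runs would be an equally reasonable choice.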
--- a/examples/01_resnet-50/cat.png
+++ b/examples/01_resnet-50/cat.png
--- a/examples/01_resnet-50/infer_with_torch.py
+++ b/examples/01_resnet-50/infer_with_torch.py
+#  Copyright (c) Meta Platforms, Inc. and affiliates.
+#
+#  Licensed under the Apache License, Version 2.0 (the "License");
+#  you may not use this file except in compliance with the License.
+#  You may obtain a copy of the License at
+#
+#      http://www.apache.org/licenses/LICENSE-2.0
+#
+#  Unless required by applicable law or agreed to in writing, software
+#  distributed under the License is distributed on an "AS IS" BASIS,
+#  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+#  See the License for the specific language governing permissions and
+#  limitations under the License.
+#
+import os
+
+import numpy as np
+import torch
+from aitemplate.compiler import compile_model, Model
+
+from aitemplate.frontend import Tensor
+from aitemplate.testing import detect_target
+from modeling.resnet import build_resnet_backbone
+from PIL import Image
+from weight_utils import timm_export
+
+
+def mark_output(y):
+    """Unlike PyTorch, AIT requires output tensors to be marked explicitly for optimization.
+
+    Parameters
+    ----------
+    y : Tensor or Tuple[Tensor]
+        Output tensor, or tuple of output tensors
+    """
+    if not isinstance(y, tuple):
+        y = (y,)
+    for i in range(len(y)):
+        y[i]._attrs["is_output"] = True
+        y[i]._attrs["name"] = "output_%d" % (i)
+        y_shape = [d._attrs["values"][0] for d in y[i]._attrs["shape"]]
+        print("output_{} shape: {}".format(i, y_shape))
+
+
+def compile_module(model_name, **kwargs):
+    batch_size = 1
+
+    if model_name != "resnet50":
+        raise NotImplementedError
+
+    model_name = f"{model_name}_{batch_size}"
+    target = detect_target(**kwargs)
+    # Create input tensor, need to specify the shape, dtype and is_input flag
+    x = Tensor(
+        shape=[batch_size, 224, 224, 3], dtype="float16", name="input0", is_input=True
+    )
+    model = build_resnet_backbone(50, activation="ReLU")
+    # Name all parameters following the PyTorch naming convention
+    model.name_parameter_tensor()
+    # Forward the input tensor to the model, get output tensor
+    y = model(x)
+    # Mark output tensor
+    mark_output(y)
+    # Compile the model
+    module = compile_model(y, target, "./tmp", model_name)
+    return module
+
+
+def prepare_data(img_path=None):
+    # we find a 224x224 image online for demo purpose:
+    img_url = "https://github.com/dmlc/mxnet.js/blob/main/data/cat.png?raw=true"
+    if img_path is None:
+        if not os.path.exists("cat.png"):
+            os.system(f"wget -O cat.png {img_url}")
+        img_path = "cat.png"
+    image = Image.open(img_path).resize((224, 224))
+    image = torch.as_tensor(np.array(image).astype("float32")).cuda().half()
+    image = image.unsqueeze(0)
+    mean = torch.tensor([0.485, 0.456, 0.406]).cuda().half()
+    std = torch.tensor([0.229, 0.224, 0.225]).cuda().half()
+    image = (image / 255.0 - mean[None, None, None, :]) / std[None, None, None, :]
+    return image
+
+
+def export_to_torch_tensor(model_path, model_name="resnet50"):
+    if model_name != "resnet50":
+        raise NotImplementedError
+    timm2ait = timm_export(model_name, pretrained_path=model_path)
+    params = timm2ait.export_model(half=True)
+    return params, timm2ait.pt_model
+
+
+def inference(model_name, mod=None):
+    # Load params
+    pretrained_path = "./resnet50_a1_0-14fe96d1.pth"
+    cuda_params, pt_model = export_to_torch_tensor(model_name=model_name, model_path=pretrained_path)
+    # Load compiled model
+    if mod is None:
+        mod = Model(os.path.join("./tmp", model_name, "test.so"))
+
+    # Set torch tensor params to runtime
+    mod.set_many_constants_with_tensors(cuda_params)
+    mod.fold_constants(sync=True)
+
+    # prepare input/output tensor
+    x_input = prepare_data("cat.png")
+    x_input = x_input.contiguous()
+    y_output = torch.zeros([1, 1, 1, 1000]).cuda().half()
+    y_output = y_output.contiguous()
+
+    # execute
+    mod.run_with_tensors([x_input], [y_output])
+
+    # process output with pytorch
+    y_label = torch.argmax(y_output, dim=-1)
+    y_cpu = y_label.cpu().numpy()
+    print(y_cpu)
+
+    # run pytorch
+    pt_model.eval()
+    pt_model = pt_model.cuda().half()
+    pt_output = pt_model(x_input.permute([0, 3, 1, 2]))
+    # pt_output = pt_model(x_input)
+    y_label = torch.argmax(pt_output, dim=-1)
+    y_cpu = y_label.cpu().numpy()
+    print(y_cpu)
+
+    # verify outputs
+    assert torch.allclose(y_output, pt_output, rtol=1e-1, atol=1e-1)
+    print("Verification done!")
+    
+
+if __name__ == "__main__":
+    np.random.seed(4896)
+    model_name = "resnet50"
+    mod = compile_module(model_name, use_fp16_acc=True)
+    inference(model_name, mod)
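
The final assertion uses `1e-1` for both `rtol` and `atol`. `torch.allclose` documents its element-wise check as `|input - other| <= atol + rtol * |other|`, and a pure-Python sketch makes that bound concrete. `allclose_py` is a hypothetical stand-in for illustration only; it ignores NaN handling and broadcasting.

```python
from typing import Sequence


def allclose_py(
    actual: Sequence[float],
    expected: Sequence[float],
    rtol: float = 1e-1,
    atol: float = 1e-1,
) -> bool:
    """Element-wise check of |actual - expected| <= atol + rtol * |expected|."""
    if len(actual) != len(expected):
        raise ValueError("sequences must have the same length")
    return all(
        abs(a - e) <= atol + rtol * abs(e) for a, e in zip(actual, expected)
    )


if __name__ == "__main__":
    reference = [10.0, -2.0, 0.0]
    # Each element is inside its bound: 0.9 <= 1.1, 0.2 <= 0.3, 0.05 <= 0.1.
    print(allclose_py([10.9, -2.2, 0.05], reference))
    # The first element is off by 1.5, which exceeds 0.1 + 0.1 * 10.0 = 1.1.
    print(allclose_py([11.5, -2.0, 0.0], reference))
```

With these tolerances a logit of magnitude 10 may differ by up to 1.1, a fairly loose bound that is probably intended to absorb FP16 rounding differences between the two runtimes.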
--- a/examples/01_resnet-50/modeling/__init__.py
+++ b/examples/01_resnet-50/modeling/__init__.py
+#  Copyright (c) Meta Platforms, Inc. and affiliates.
+#
+#  Licensed under the Apache License, Version 2.0 (the "License");
+#  you may not use this file except in compliance with the License.
+#  You may obtain a copy of the License at
+#
+#      http://www.apache.org/licenses/LICENSE-2.0
+#
+#  Unless required by applicable law or agreed to in writing, software
+#  distributed under the License is distributed on an "AS IS" BASIS,
+#  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+#  See the License for the specific language governing permissions and
+#  limitations under the License.
+#
--- a/examples/01_resnet-50/modeling/__pycache__/__init__.cpython-310.pyc
+++ b/examples/01_resnet-50/modeling/__pycache__/__init__.cpython-310.pyc
--- a/examples/01_resnet-50/modeling/__pycache__/resnet.cpython-310.pyc
+++ b/examples/01_resnet-50/modeling/__pycache__/resnet.cpython-310.pyc
--- a/examples/01_resnet-50/modeling/resnet.py
+++ b/examples/01_resnet-50/modeling/resnet.py
+#  Copyright (c) Meta Platforms, Inc. and affiliates.
+#
+#  Licensed under the Apache License, Version 2.0 (the "License");
+#  you may not use this file except in compliance with the License.
+#  You may obtain a copy of the License at
+#
+#      http://www.apache.org/licenses/LICENSE-2.0
+#
+#  Unless required by applicable law or agreed to in writing, software
+#  distributed under the License is distributed on an "AS IS" BASIS,
+#  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+#  See the License for the specific language governing permissions and
+#  limitations under the License.
+#
+import numpy as np
+from aitemplate.frontend import nn
+from aitemplate.testing import detect_target
+
+
+class CNNBlockBase(nn.Module):
+    """
+    A CNN block is assumed to have input channels, output channels and a stride.
+    The input and output of `forward()` method must be NHWC tensors.
+    The method can perform arbitrary computation but must match the given
+    channels and stride specification.
+    Attribute:
+        in_channels (int):
+        out_channels (int):
+        stride (int):
+    """
+
+    def __init__(self, in_channels, out_channels, stride):
+        """
+        The `__init__` method of any subclass should also contain these arguments.
+        Args:
+            in_channels (int):
+            out_channels (int):
+            stride (int):
+        """
+        super().__init__()
+        self.in_channels = in_channels
+        self.out_channels = out_channels
+        self.stride = stride
+
+
+class BasicStem(CNNBlockBase):
+    """
+    The standard ResNet stem (layers before the first residual block),
+    with a conv, relu and max_pool.
+    """
+
+    def __init__(self, in_channels=3, out_channels=64, norm="BN", activation="ReLU"):
+        super().__init__(in_channels, out_channels, 4)
+        conv_op = None
+        if detect_target().name() == "cuda":
+            if activation == "ReLU":
+                conv_op = nn.Conv2dBiasReluFewChannels
+            elif activation == "Hardswish":
+                conv_op = nn.Conv2dBiasHardswishFewChannels
+            else:
+                raise NotImplementedError
+        else:
+            if activation == "ReLU":
+                conv_op = nn.Conv2dBiasRelu
+            elif activation == "Hardswish":
+                conv_op = nn.Conv2dBiasHardswish
+            else:
+                raise NotImplementedError
+        self.conv1 = conv_op(in_channels, out_channels, 7, 2, 7 // 2)
+        self.pool = nn.MaxPool2d(3, 2, 1)
+
+    def forward(self, x):
+        x = self.conv1(x)
+        x = self.pool(x)
+        return x
+
+
+class BasicBlock(CNNBlockBase):
+    """
+    The basic residual block for ResNet-18 and ResNet-34 defined in :paper:`ResNet`,
+    with two 3x3 conv layers and a projection shortcut if needed.
+    """
+
+    def __init__(self, in_channels, out_channels, *, stride=1, norm="BN"):
+        super().__init__(in_channels, out_channels, stride)
+
+    def forward(self, x):
+        raise NotImplementedError()
+
+
+class BottleneckBlock(CNNBlockBase):
+    """
+    The standard bottleneck residual block used by ResNet-50, 101 and 152
+    defined in :paper:`ResNet`.  It contains 3 conv layers with kernels
+    1x1, 3x3, 1x1, and a projection shortcut if needed.
+    """
+
+    def __init__(
+        self,
+        in_channels,
+        out_channels,
+        *,
+        bottleneck_channels,
+        stride=1,
+        num_groups=1,
+        norm="BN",
+        activation="ReLU",
+        stride_in_1x1=False,
+        dilation=1,
+    ):
+        """
+        Args:
+            bottleneck_channels (int): number of output channels for the 3x3
+                "bottleneck" conv layers.
+            num_groups (int): number of groups for the 3x3 conv layer.
+            norm (str or callable): normalization for all conv layers.
+                See :func:`layers.get_norm` for supported format.
+            stride_in_1x1 (bool): when stride>1, whether to put stride in the
+                first 1x1 convolution or the bottleneck 3x3 convolution.
+            dilation (int): the dilation rate of the 3x3 conv layer.
+        """
+        super().__init__(in_channels, out_channels, stride)
+
+        if in_channels != out_channels:
+            self.downsample_0 = nn.Conv2dBias(in_channels, out_channels, 1, stride, 0)
+        else:
+            self.downsample_0 = None
+
+        # The original MSRA ResNet models have stride in the first 1x1 conv
+        # The subsequent fb.torch.resnet and Caffe2 ResNe[X]t implementations have
+        # stride in the 3x3 conv
+        stride_1x1, stride_3x3 = (stride, 1) if stride_in_1x1 else (1, stride)
+
+        conv_op = None
+        conv_op_add = None
+        if activation == "ReLU":
+            conv_op = nn.Conv2dBiasRelu
+            conv_op_add = nn.Conv2dBiasAddRelu
+        elif activation == "Hardswish":
+            conv_op = nn.Conv2dBiasHardswish
+            conv_op_add = nn.Conv2dBiasAddHardswish
+        else:
+            raise NotImplementedError
+
+        self.conv1 = conv_op(in_channels, bottleneck_channels, 1, stride_1x1, 0)
+
+        self.conv2 = conv_op(
+            bottleneck_channels,
+            bottleneck_channels,
+            3,
+            stride_3x3,
+            1 * dilation,
+            dilation,
+        )
+
+        self.conv3 = conv_op_add(bottleneck_channels, out_channels, 1, 1, 0)
+
+        # for layer in [self.conv1, self.conv2, self.conv3, self.shortcut]:
+        #     if layer is not None:  # shortcut can be None
+        #         weight_init.c2_msra_fill(layer)
+
+        # Zero-initialize the last normalization in each residual branch,
+        # so that at the beginning, the residual branch starts with zeros,
+        # and each residual block behaves like an identity.
+        # See Sec 5.1 in "Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour":
+        # "For BN layers, the learnable scaling coefficient γ is initialized
+        # to be 1, except for each residual block's last BN
+        # where γ is initialized to be 0."
+
+        # nn.init.constant_(self.conv3.norm.weight, 0)
+        # TODO this somehow hurts performance when training GN models from scratch.
+        # Add it as an option when we need to use this code to train a backbone.
+
+    def forward(self, x):
+        out = self.conv1(x)
+        out = self.conv2(out)
+
+        if self.downsample_0 is not None:
+            downsample = self.downsample_0(x)
+        else:
+            downsample = x
+
+        out = self.conv3(out, downsample)
+        return out
+
+
+class ResNet(nn.Module):
+    """
+    Implement :paper:`ResNet`.
+    """
+
+    def __init__(self, stem, stages, num_classes=None, out_features=None, freeze_at=0):
+        """
+        Args:
+            stem (nn.Module): a stem module
+            stages (list[list[CNNBlockBase]]): several (typically 4) stages,
+                each contains multiple :class:`CNNBlockBase`.
+            num_classes (None or int): if None, will not perform classification.
+                Otherwise, will create a linear layer.
+            out_features (list[str]): name of the layers whose outputs should
+                be returned in forward. Can be anything in "stem", "linear", or "res2" ...
+                If None, will return the output of the last layer.
+            freeze_at (int): The number of stages at the beginning to freeze.
+                see :meth:`freeze` for detailed explanation.
+        """
+        super().__init__()
+        self.stem = stem
+        self.num_classes = num_classes
+
+        current_stride = self.stem.stride
+        self._out_feature_strides = {"stem": current_stride}
+        self._out_feature_channels = {"stem": self.stem.out_channels}
+
+        self.stage_names, self.stages = [], []
+
+        if out_features is not None:
+            # Avoid keeping unused layers in this module. They consume extra memory
+            # and may cause allreduce to fail
+            num_stages = max(
+                [
+                    {"layer1": 1, "layer2": 2, "layer3": 3, "layer4": 4}.get(f, 0)
+                    for f in out_features
+                ]
+            )
+            stages = stages[:num_stages]
+
+        for i, blocks in enumerate(stages):
+            assert len(blocks) > 0, len(blocks)
+            for block in blocks:
+                assert isinstance(block, CNNBlockBase), block
+
+            name = "layer" + str(i + 1)
+            stage = nn.Sequential(*blocks)
+
+            self.add_module(name, stage)
+            self.stage_names.append(name)
+            self.stages.append(stage)
+
+            self._out_feature_strides[name] = current_stride = int(
+                current_stride * np.prod([k.stride for k in blocks])
+            )
+            self._out_feature_channels[name] = curr_channels = blocks[-1].out_channels
+
+        self.stage_names = tuple(self.stage_names)  # Make it static for scripting
+
+        if num_classes is not None:
+            self.avgpool = nn.AvgPool2d(7, 1, 0)
+            self.fc = nn.Linear(curr_channels, num_classes)
+
+        if out_features is None:
+            out_features = [name]
+        self._out_features = out_features
+        assert len(self._out_features)
+        children = [x[0] for x in self.named_children()]
+        for out_feature in self._out_features:
+            assert out_feature in children, "Available children: {}".format(
+                ", ".join(children)
+            )
+        self.reshape = nn.Reshape()
+
+    def forward(self, x):
+        """
+        Args:
+            x: Tensor of shape (N, H, W, C), i.e. NHWC layout.
+        Returns:
+            Tensor: classification output of shape (N, 1, 1, num_classes) if
+                ``num_classes`` is set; otherwise dict[str->Tensor] mapping
+                names to the corresponding features.
+        """
+        # assert x.dim() == 4, f"ResNet takes an input of shape (N, H, W, C). Got {x.shape} instead!"
+        outputs = {}
+        x = self.stem(x)
+        if "stem" in self._out_features:
+            outputs["stem"] = x
+        for name, stage in zip(self.stage_names, self.stages):
+            x = stage(x)
+            if name in self._out_features:
+                outputs[name] = x
+        if self.num_classes is not None:
+            x = self.avgpool(x)
+            x = self.fc(x)
+            if x._rank() == 2:
+                x = self.reshape(x, [x._size(0), 1, 1, x._size(1)])
+            return x
+        return outputs
+
+    @staticmethod
+    def make_stage(block_class, num_blocks, *, in_channels, out_channels, **kwargs):
+        """
+        Create a list of blocks of the same type that forms one ResNet stage.
+        Args:
+            block_class (type): a subclass of CNNBlockBase that's used to create all blocks in this
+                stage. A module of this type must not change spatial resolution of inputs unless its
+                stride != 1.
+            num_blocks (int): number of blocks in this stage
+            in_channels (int): input channels of the entire stage.
+            out_channels (int): output channels of **every block** in the stage.
+            kwargs: other arguments passed to the constructor of
+                `block_class`. If the argument name is "xx_per_block", the
+                argument is a list of values to be passed to each block in the
+                stage. Otherwise, the same argument is passed to every block
+                in the stage.
+        Returns:
+            list[CNNBlockBase]: a list of block module.
+        Examples:
+        ::
+            stage = ResNet.make_stage(
+                BottleneckBlock, 3, in_channels=16, out_channels=64,
+                bottleneck_channels=16, num_groups=1,
+                stride_per_block=[2, 1, 1],
+                dilations_per_block=[1, 1, 2]
+            )
+        Usually, layers that produce the same feature map spatial size are defined as one
+        "stage" (in :paper:`FPN`). Under such definition, ``stride_per_block[1:]`` should
+        all be 1.
+        """
+        blocks = []
+        for i in range(num_blocks):
+            curr_kwargs = {}
+            for k, v in kwargs.items():
+                if k.endswith("_per_block"):
+                    assert len(v) == num_blocks, (
+                        f"Argument '{k}' of make_stage should have the "
+                        f"same length as num_blocks={num_blocks}."
+                    )
+                    newk = k[: -len("_per_block")]
+                    assert (
+                        newk not in kwargs
+                    ), f"Cannot call make_stage with both {k} and {newk}!"
+                    curr_kwargs[newk] = v[i]
+                else:
+                    curr_kwargs[k] = v
+
+            blocks.append(
+                block_class(
+                    in_channels=in_channels, out_channels=out_channels, **curr_kwargs
+                )
+            )
+            in_channels = out_channels
+        return blocks
+
+    @staticmethod
+    def make_default_stages(depth, block_class=None, **kwargs):
+        """
+        Create a list of ResNet stages from a pre-defined depth (one of 18, 34, 50, 101, 152).
+        If it doesn't create the ResNet variant you need, please use :meth:`make_stage`
+        instead for fine-grained customization.
+        Args:
+            depth (int): depth of ResNet
+            block_class (type): the CNN block class. Has to accept
+                `bottleneck_channels` argument for depth >= 50.
+                By default it is BasicBlock or BottleneckBlock, based on the
+                depth.
+            kwargs:
+                other arguments to pass to `make_stage`. Should not contain
+                stride and channels, as they are predefined for each depth.
+        Returns:
+            list[list[CNNBlockBase]]: modules in all stages; see arguments of
+                :class:`ResNet.__init__`.
+        """
+        num_blocks_per_stage = {
+            18: [2, 2, 2, 2],
+            34: [3, 4, 6, 3],
+            50: [3, 4, 6, 3],
+            101: [3, 4, 23, 3],
+            152: [3, 8, 36, 3],
+        }[depth]
+        if block_class is None:
+            block_class = BasicBlock if depth < 50 else BottleneckBlock
+        if depth < 50:
+            in_channels = [64, 64, 128, 256]
+            out_channels = [64, 128, 256, 512]
+        else:
+            in_channels = [64, 256, 512, 1024]
+            out_channels = [256, 512, 1024, 2048]
+        ret = []
+        for (n, s, i, o) in zip(
+            num_blocks_per_stage, [1, 2, 2, 2], in_channels, out_channels
+        ):
+            if depth >= 50:
+                kwargs["bottleneck_channels"] = o // 4
+            ret.append(
+                ResNet.make_stage(
+                    block_class=block_class,
+                    num_blocks=n,
+                    stride_per_block=[s] + [1] * (n - 1),
+                    in_channels=i,
+                    out_channels=o,
+                    **kwargs,
+                )
+            )
+        return ret
+
+
+def make_stage(*args, **kwargs):
+    """
+    Deprecated alias for backward compatibility.
+    """
+    return ResNet.make_stage(*args, **kwargs)
+
+
+def build_resnet_backbone(depth, activation):
+    """
+    Create a ResNet instance for the given depth and activation.
+    Returns:
+        ResNet: a :class:`ResNet` instance.
+    """
+    norm = "BN"
+    num_groups = 1
+    stride_in_1x1 = False
+    width_per_group = 64
+    bottleneck_channels = num_groups * width_per_group
+    in_channels = 64
+    out_channels = 256
+
+    stem = BasicStem(in_channels=3, out_channels=64, norm=norm, activation=activation)
+
+    num_blocks_per_stage = {
+        18: [2, 2, 2, 2],
+        34: [3, 4, 6, 3],
+        50: [3, 4, 6, 3],
+        101: [3, 4, 23, 3],
+        152: [3, 8, 36, 3],
+    }[depth]
+
+    stages = []
+
+    for idx, stage_idx in enumerate(range(2, 6)):
+        # res5_dilation is used this way as a convention in R-FCN & Deformable Conv paper
+        dilation = 1
+        first_stride = 1 if idx == 0 or (stage_idx == 5 and dilation == 2) else 2
+        stage_kargs = {
+            "num_blocks": num_blocks_per_stage[idx],
+            "stride_per_block": [first_stride] + [1] * (num_blocks_per_stage[idx] - 1),
+            "in_channels": in_channels,
+            "out_channels": out_channels,
+            "norm": norm,
+            "activation": activation,
+        }
+        # Use BasicBlock for R18 and R34.
+        if depth in [18, 34]:
+            stage_kargs["block_class"] = BasicBlock
+        else:
+            stage_kargs["bottleneck_channels"] = bottleneck_channels
+            stage_kargs["stride_in_1x1"] = stride_in_1x1
+            stage_kargs["dilation"] = dilation
+            stage_kargs["num_groups"] = num_groups
+            stage_kargs["block_class"] = BottleneckBlock
+        blocks = ResNet.make_stage(**stage_kargs)
+        in_channels = out_channels
+        out_channels *= 2
+        bottleneck_channels *= 2
+        stages.append(blocks)
+
+    return ResNet(stem, stages, num_classes=1000)
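
Review note: the `_per_block` handling in `make_stage` can be hard to read inside the nested loop. The following is a minimal standalone sketch of that expansion in plain Python. The helper name `expand_per_block_kwargs` is hypothetical and is not part of this commit, and it raises `ValueError` where the original uses `assert`.

```python
def expand_per_block_kwargs(num_blocks, **kwargs):
    """Return one kwargs dict per block, splitting '*_per_block' lists."""
    suffix = "_per_block"
    per_block = []
    for i in range(num_blocks):
        current = {}
        for key, value in kwargs.items():
            if key.endswith(suffix):
                # A per-block argument must supply exactly one value per block.
                if len(value) != num_blocks:
                    raise ValueError(f"'{key}' must have length {num_blocks}")
                name = key[: -len(suffix)]
                # Passing both 'stride' and 'stride_per_block' is ambiguous.
                if name in kwargs:
                    raise ValueError(f"Cannot pass both '{key}' and '{name}'")
                current[name] = value[i]
            else:
                # Shared arguments are copied unchanged to every block.
                current[key] = value
        per_block.append(current)
    return per_block


expanded = expand_per_block_kwargs(3, stride_per_block=[2, 1, 1], num_groups=1)
print(expanded)
```

Each resulting dict is what one block constructor would receive in addition to `in_channels` and `out_channels`.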
--- a/examples/01_resnet-50/weight_utils.py
+++ b/examples/01_resnet-50/weight_utils.py
+#  Copyright (c) Meta Platforms, Inc. and affiliates.
+#
+#  Licensed under the Apache License, Version 2.0 (the "License");
+#  you may not use this file except in compliance with the License.
+#  You may obtain a copy of the License at
+#
+#      http://www.apache.org/licenses/LICENSE-2.0
+#
+#  Unless required by applicable law or agreed to in writing, software
+#  distributed under the License is distributed on an "AS IS" BASIS,
+#  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+#  See the License for the specific language governing permissions and
+#  limitations under the License.
+#
+"""
+Script for converting a timm model to AITemplate weights.
+Only tested on resnet50.
+"""
+
+
+import pickle
+import re
+
+import click
+import numpy as np
+import timm
+import torch
+from aitemplate.testing import detect_target
+
+CONV_WEIGHT_PATTERN = re.compile(r"conv\d+\.weight")
+
+
+class timm_export:
+    def __init__(self, model_name, pretrained_path, pretrained=True):
+        self.model_name = model_name
+        if model_name != "resnet50":
+            raise NotImplementedError
+
+        with torch.no_grad():
+            self.pt_model = timm.create_model(
+                model_name, pretrained=pretrained, num_classes=1000,
+                pretrained_cfg_overlay=dict(file=pretrained_path)
+            )
+        self.pt_state = self.pt_model.state_dict()
+
+    def export_model(self, half=True):
+        fused_model = {}
+        for param_name in self.pt_state.keys():
+            self.transform_params(param_name, fused_model)
+        ait_model = {k.replace(".", "_"): weight for k, weight in fused_model.items()}
+        if detect_target().name() == "cuda":
+            self.export_conv0(ait_model, fused_model)
+        if half:
+            half_params = {}
+            for k, v in ait_model.items():
+                half_params[k] = v.detach().cuda().half().contiguous()
+            return half_params
+        return ait_model
+
+    def fuse_conv_bn_weights(
+        self, conv_w, conv_b, bn_rm, bn_rv, bn_eps, bn_w, bn_b, transpose=False
+    ):
+        conv_w = torch.tensor(conv_w)
+        bn_rm = torch.tensor(bn_rm)
+        bn_rv = torch.tensor(bn_rv)
+        conv_b = torch.tensor(conv_b) if conv_b is not None else torch.zeros_like(bn_rm)
+        bn_w = torch.tensor(bn_w) if bn_w is not None else torch.ones_like(bn_rm)
+        bn_b = torch.tensor(bn_b) if bn_b is not None else torch.zeros_like(bn_rm)
+        bn_eps = torch.tensor(bn_eps)
+
+        bn_var_rsqrt = torch.rsqrt(bn_rv + bn_eps)
+
+        if transpose:
+            shape = [1, -1] + [1] * (len(conv_w.shape) - 2)
+        else:
+            shape = [-1, 1] + [1] * (len(conv_w.shape) - 2)
+
+        conv_w = conv_w * (bn_w * bn_var_rsqrt).reshape(shape)
+        conv_b = (conv_b - bn_rm) * bn_var_rsqrt * bn_w + bn_b
+
+        # NCHW -> NHWC
+        conv_w = conv_w.permute(0, 2, 3, 1).contiguous()
+        for arr in [conv_w.numpy(), conv_b.numpy()]:
+            if np.isnan(arr).any():
+                print("fuse bn error")
+        return conv_w, conv_b
+
+    def transform_conv0(self):
+        conv_w = self.pt_state["conv1.weight"]
+        bn_w = self.pt_state["bn1.weight"]
+        bn_b = self.pt_state["bn1.bias"]
+        bn_rm = self.pt_state["bn1.running_mean"]
+        bn_rv = self.pt_state["bn1.running_var"]
+        fused_w, fused_b = self.fuse_conv_bn_weights(
+            conv_w, None, bn_rm, bn_rv, 1e-5, bn_w, bn_b
+        )
+        return fused_w, fused_b
+
+    def transform_params(self, param_name, fused_model):
+        if param_name == "conv1.weight":
+            fused_w, fused_b = self.transform_conv0()
+            fused_model["stem.conv1.weight"] = fused_w
+            fused_model["stem.conv1.bias"] = fused_b
+        elif "downsample.0.weight" in param_name:
+            fused_w, fused_b = self.transform_downsample(param_name)
+            fused_model[param_name] = fused_w
+            fused_model[param_name.replace("weight", "bias")] = fused_b
+        elif param_name == "fc.weight":
+            fused_model["fc.weight"] = self.pt_state["fc.weight"]
+            fused_model["fc.bias"] = self.pt_state["fc.bias"]
+        elif CONV_WEIGHT_PATTERN.search(param_name) is not None:
+            bn_w_name = param_name.replace("conv", "bn")
+            conv_w = self.pt_state[param_name]
+            bn_w = self.pt_state[bn_w_name]
+            bn_b = self.pt_state[bn_w_name.replace("weight", "bias")]
+            bn_rm = self.pt_state[bn_w_name.replace("weight", "running_mean")]
+            bn_rv = self.pt_state[bn_w_name.replace("weight", "running_var")]
+            fused_w, fused_b = self.fuse_conv_bn_weights(
+                conv_w, None, bn_rm, bn_rv, 1e-5, bn_w, bn_b
+            )
+            fused_model[param_name] = fused_w
+            fused_model[param_name.replace("weight", "bias")] = fused_b
+        else:
+            pass
+
+    def transform_downsample(self, param_name):
+        assert "downsample" in param_name
+        tags = param_name.split(".")
+        block_tag = ".".join(tags[:2])
+        conv_w = self.pt_state[f"{block_tag}.downsample.0.weight"]
+        bn_w = self.pt_state[f"{block_tag}.downsample.1.weight"]
+        bn_b = self.pt_state[f"{block_tag}.downsample.1.bias"]
+        bn_rm = self.pt_state[f"{block_tag}.downsample.1.running_mean"]
+        bn_rv = self.pt_state[f"{block_tag}.downsample.1.running_var"]
+        fused_w, fused_b = self.fuse_conv_bn_weights(
+            conv_w, None, bn_rm, bn_rv, 1e-5, bn_w, bn_b
+        )
+        return fused_w, fused_b
+
+    def export_conv0(self, ait_model, fuse_model):
+        pt_name = "stem.conv1.weight"
+        x = fuse_model[pt_name]
+        conv_w = torch.zeros((64, 7, 7, 4))
+        conv_w[:, :, :, :3] = x
+        ait_model[pt_name.replace(".", "_")] = conv_w
+
+
+def export_to_torch_tensor(model_path, model_name="resnet50"):
+    if model_name != "resnet50":
+        raise NotImplementedError
+    timm2ait = timm_export(model_name, pretrained_path=model_path)
+    ait_model = timm2ait.export_model(half=True)
+    return ait_model
+
+
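+@click.command()
+@click.option("--param-path", type=str, default="resnet50.pkl")
+@click.option("--model-path", type=str, required=True, help="Path to the timm resnet50 checkpoint.")
+def export_to_numpy(param_path, model_path):
+    # export_to_torch_tensor requires the checkpoint path; calling it without one raises TypeError.
+    ait_model = export_to_torch_tensor(model_path)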
+    np_weights = {}
+    for k, v in ait_model.items():
+        np_weights[k] = v.detach().cpu().numpy().astype(np.float16)
+
+    with open(param_path, "wb") as f:
+        pickle.dump(np_weights, f)
+
+
+if __name__ == "__main__":
+    export_to_numpy()
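
Review note: `fuse_conv_bn_weights` folds the BatchNorm statistics into the preceding convolution. The following is a minimal sketch of the same algebra in plain Python, assuming a 1x1 convolution applied to a single pixel so that the convolution reduces to a matrix-vector product. All function names and sample values here are illustrative and are not part of this commit.

```python
import math


def fuse_conv_bn(weight, bias, mean, var, gamma, beta, eps=1e-5):
    """Fold BatchNorm (mean, var, gamma, beta) into conv weight and bias."""
    fused_w, fused_b = [], []
    for o, row in enumerate(weight):
        # Per-output-channel scale: gamma / sqrt(var + eps)
        scale = gamma[o] / math.sqrt(var[o] + eps)
        fused_w.append([w * scale for w in row])
        fused_b.append((bias[o] - mean[o]) * scale + beta[o])
    return fused_w, fused_b


def conv1x1(weight, bias, x):
    """1x1 convolution on one pixel: y[o] = sum_i w[o][i] * x[i] + b[o]."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]


def batch_norm(y, mean, var, gamma, beta, eps=1e-5):
    """Inference-mode BatchNorm applied per channel."""
    return [
        g * (v - m) / math.sqrt(s + eps) + b
        for v, m, s, g, b in zip(y, mean, var, gamma, beta)
    ]


weight = [[0.5, -1.0], [2.0, 0.25]]
bias = [0.0, 0.0]
mean, var = [0.1, -0.2], [0.9, 1.5]
gamma, beta = [1.2, 0.8], [0.05, -0.3]
x = [1.0, 2.0]

reference = batch_norm(conv1x1(weight, bias, x), mean, var, gamma, beta)
fused = conv1x1(*fuse_conv_bn(weight, bias, mean, var, gamma, beta), x)
print(reference, fused)
```

The two printed vectors should agree up to floating-point rounding, which is why the fused weights can replace the conv + BN pair at inference time.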
--- a/examples/02_bert/benchmark_ait.py
+++ b/examples/02_bert/benchmark_ait.py
+#  Copyright (c) Meta Platforms, Inc. and affiliates.
+#
+#  Licensed under the Apache License, Version 2.0 (the "License");
+#  you may not use this file except in compliance with the License.
+#  You may obtain a copy of the License at
+#
+#      http://www.apache.org/licenses/LICENSE-2.0
+#
+#  Unless required by applicable law or agreed to in writing, software
+#  distributed under the License is distributed on an "AS IS" BASIS,
+#  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+#  See the License for the specific language governing permissions and
+#  limitations under the License.
+#
+import os
+
+from typing import Dict, List
+
+import click
+import numpy as np
+import torch
+from aitemplate.compiler import compile_model, Model
+
+from aitemplate.frontend import Tensor
+from aitemplate.testing import detect_target
+
+from modeling.bert import BertBaseEncodersOnly, BertBaseUncased
+from modeling.torch_model import BertBaseUncased as BertPt
+
+
+def mark_output(y: Tensor) -> None:
+    if type(y) is not tuple:
+        y = (y,)
+    for i in range(len(y)):
+        y[i]._attrs["is_output"] = True
+        y[i]._attrs["name"] = "output_%d" % (i)
+        y_shape = [d._attrs["values"][0] for d in y[i]._attrs["shape"]]
+        print("output_{} shape: {}".format(i, y_shape))
+
+
+def create_bert_inputs(
+    batch_size: int, seq_length: int, dtype: str = "int64"
+) -> List[Tensor]:
+    input_ids = Tensor(
+        shape=[batch_size, seq_length],
+        name="input_ids",
+        dtype=dtype,
+        is_input=True,
+    )
+    token_type_ids = Tensor(
+        shape=[batch_size, seq_length],
+        name="token_type_ids",
+        dtype=dtype,
+        is_input=True,
+    )
+    position_ids = Tensor(
+        shape=[batch_size, seq_length],
+        name="position_ids",
+        dtype=dtype,
+        is_input=True,
+    )
+    return [input_ids, token_type_ids, position_ids]
+
+
+def create_bert_encoders_input(
+    batch_size: int, seq_length: int, hidden: int, dtype: str = "float16"
+):
+    encoder_input = Tensor(
+        shape=[batch_size, seq_length, hidden],
+        name="input",
+        dtype=dtype,
+        is_input=True,
+    )
+    return [encoder_input]
+
+
+def create_bert_inputs_pt(
+    batch_size: int,
+    seq_length: int,
+    vocab_size: int = 30522,
+    type_vocab_size: int = 2,
+    dtype: torch.dtype = torch.int64,
+) -> Dict[str, torch.Tensor]:
+    input_ids = torch.randint(
+        0, vocab_size, (batch_size, seq_length), dtype=dtype
+    ).cuda()
+    token_type_ids = torch.randint(
+        0, type_vocab_size, input_ids.size(), dtype=dtype
+    ).cuda()
+    position_ids = (
+        torch.arange(seq_length, dtype=dtype)
+        .reshape((1, -1))
+        .expand(batch_size, -1)
+        .contiguous()
+        .cuda()
+    )
+    return {
+        "input_ids": input_ids,
+        "token_type_ids": token_type_ids,
+        "position_ids": position_ids,
+    }
+
+
+def create_bert_encoders_inputs_pt(
+    batch_size: int, seq_length: int, hidden_size: int
+) -> Dict[str, torch.Tensor]:
+    encoder_input = torch.randn([batch_size, seq_length, hidden_size]).cuda().half()
+    return {"input": encoder_input}
+
+
+def map_pt_params(
+    ait_bert, pt_bert, batch_size: int, seq_length: int
+) -> Dict[str, torch.Tensor]:
+    pt_params = dict(pt_bert.named_parameters())
+    mapped_pt_params = {}
+    for name, _ in ait_bert.named_parameters():
+        ait_name = name.replace(".", "_")
+        if name in pt_params:
+            mapped_pt_params[ait_name] = pt_params[name]
+            continue
+
+        if name.endswith("self.qkv.weight"):
+            prefix = name[: -len("qkv.weight")]
+            q_weight = pt_params[prefix + "query.weight"]
+            k_weight = pt_params[prefix + "key.weight"]
+            v_weight = pt_params[prefix + "value.weight"]
+            qkv_weight = torch.cat([q_weight, k_weight, v_weight], dim=0)
+            mapped_pt_params[ait_name] = qkv_weight
+        elif name.endswith("self.qkv.bias"):
+            prefix = name[: -len("qkv.bias")]
+            q_bias = pt_params[prefix + "query.bias"]
+            k_bias = pt_params[prefix + "key.bias"]
+            v_bias = pt_params[prefix + "value.bias"]
+            qkv_bias = torch.cat([q_bias, k_bias, v_bias], dim=0)
+            mapped_pt_params[ait_name] = qkv_bias
+        elif name.endswith("self.proj.weight"):
+            prefix = name[: -len("self.proj.weight")]
+            pt_name = prefix + "output.dense.weight"
+            mapped_pt_params[ait_name] = pt_params[pt_name]
+        elif name.endswith("self.proj.bias"):
+            prefix = name[: -len("self.proj.bias")]
+            pt_name = prefix + "output.dense.bias"
+            mapped_pt_params[ait_name] = pt_params[pt_name]
+        elif name.endswith("cu_length"):
+            cu_len = np.cumsum([0] + [seq_length] * batch_size).astype("int32")
+            mapped_pt_params[ait_name] = torch.from_numpy(cu_len).cuda()
+        else:
+            pt_param = pt_bert.get_parameter(name)
+            mapped_pt_params[ait_name] = pt_param
+
+    return mapped_pt_params
+
+
+def benchmark(
+    batch_size: int,
+    seq_length: int,
+    hidden_size: int,
+    mod: Model,
+    graph_mode: bool,
+    encoders_only: bool,
+):
+    if encoders_only:
+        inputs = create_bert_encoders_inputs_pt(batch_size, seq_length, hidden_size)
+    else:
+        inputs = create_bert_inputs_pt(batch_size, seq_length)
+
+    outputs = [torch.empty(mod.get_output_maximum_shape(0)).cuda().half()]
+
+    # warm up
+    t, _, __ = mod.benchmark_with_tensors(
+        inputs,
+        outputs,
+        count=100,
+        repeat=4,
+        graph_mode=graph_mode,
+    )
+    # benchmark
+    t, _, __ = mod.benchmark_with_tensors(
+        inputs,
+        outputs,
+        count=100,
+        repeat=4,
+        graph_mode=graph_mode,
+    )
+    print(f"batch_size: {batch_size}, seq_length: {seq_length}, latency: {t}")
+    dev_flag = os.environ.get("HIP_VISIBLE_DEVICES", "-1")
+    dev_flag = dev_flag.replace(",", "_")
+    with open(f"bert_ait_benchmark_dev_{dev_flag}.txt", "a") as f:
+        f.write(f"batch_size: {batch_size}, seq_length: {seq_length}, latency: {t}\n")
+
+
+def compile_module(
+    batch_size: int,
+    seq_length: int,
+    hidden_size: int,
+    activation: str,
+    use_fp16_acc: bool,
+    encoders_only: bool,
+    pt_model: torch.nn.Module,
+) -> Model:
+    model_name = f"BERT_{activation}_{batch_size}_{seq_length}"
+    target = detect_target(use_fp16_acc=use_fp16_acc)
+
+    if encoders_only:
+        inputs = create_bert_encoders_input(batch_size, seq_length, hidden_size)
+    else:
+        inputs = create_bert_inputs(batch_size, seq_length)
+
+    if encoders_only:
+        model = BertBaseEncodersOnly(batch_size, seq_length, hidden_act=activation)
+    else:
+        model = BertBaseUncased(batch_size, seq_length, hidden_act=activation)
+
+    # Mark all parameters with name same to PyTorch name convention
+    model.name_parameter_tensor()
+    # Forward the input tensor to the model, get output tensor
+    y = model(*inputs)
+    # Mark output tensor
+    mark_output(y)
+
+    params = map_pt_params(model, pt_model, batch_size, seq_length)
+
+    mod = compile_model(y, target, "./tmp", model_name, profile_devs=[0, 1, 2, 3])
+
+    mod.set_many_constants_with_tensors(params)
+    mod.fold_constants(sync=True)
+
+    return mod
+
+
+
+
+@click.command()
+@click.option("--batch-size", type=int, default=0, help="Inference batch size")
+@click.option("--seq-length", type=int, default=0, help="Inference sequence length")
+@click.option(
+    "--activation",
+    type=str,
+    default="fast_gelu",
+    help="Activation function applied on BERT. ROCm currently supports only fast_gelu; CUDA supports both gelu and fast_gelu. No effect if framework is pt.",
+)
+@click.option(
+    "--graph-mode",
+    type=bool,
+    default=True,
+    help="Use CUDA graph or not. hipGraph is not supported yet. No effect if framework is pt.",
+)
+@click.option(
+    "--use-fp16-acc",
+    type=bool,
+    default=True,
+    help="Use fp16 accumulation or not (TensorRT is using fp16_acc). No effect if framework is pt.",
+)
+@click.option(
+    "--use-pretrained-pt-model",
+    type=bool,
+    default=True,
+    help="Whether or not to use the pretrained BERT model weights.",
+)
+@click.option(
+    "--encoders-only",
+    type=bool,
+    default=True,
+    help="Whether or not to run the BERT benchmark with encoders only. If enabled, only the transformer blocks without BERT embeddings are benchmarked.",
+)
+def compile_and_benchmark(
+    batch_size: int,
+    seq_length: int,
+    activation: str,
+    graph_mode: bool,
+    use_fp16_acc: bool,
+    use_pretrained_pt_model: bool,
+    encoders_only: bool,
+):
+    if detect_target().name() == "rocm":
+        graph_mode = False
+        assert activation in (
+            "fast_gelu",
+        ), f"Unsupported activation: {activation} on rocm"
+
+    pt_model = BertPt(pretrained=use_pretrained_pt_model, model_path="./bert-base-uncased")._model
+    pt_model.eval()
+    hidden_size = pt_model.config.hidden_size
+
+    if batch_size < 1:
+        batch_sizes = [1, 2, 4, 8, 16, 32, 64, 128, 256]
+    else:
+        batch_sizes = [batch_size]
+
+    if seq_length < 1:
+        seq_lengths = (
+            [64, 128, 256] if encoders_only else [64, 128]
+        )
+    else:
+        seq_lengths = [seq_length]
+
+    for seq_length in seq_lengths:
+        for bs in batch_sizes:
+            mod = compile_module(
+                bs,
+                seq_length,
+                hidden_size,
+                activation,
+                use_fp16_acc,
+                encoders_only,
+                pt_model,
+            )
+            benchmark(bs, seq_length, hidden_size, mod, graph_mode, encoders_only)
+
+
+if __name__ == "__main__":
+    torch.manual_seed(4896)
+    compile_and_benchmark()
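
Review note: `map_pt_params` builds two derived tensors that have no direct PyTorch counterpart: the fused `qkv` weight (query, key, and value concatenated along dim 0) and `cu_length` (cumulative sequence offsets). The following is a minimal sketch of both in plain Python lists. The helper names are hypothetical, and nested lists stand in for tensors.

```python
from itertools import accumulate


def cumulative_lengths(batch_size, seq_length):
    """Offsets [0, s, 2s, ..., batch_size * s], like np.cumsum([0] + [s] * batch_size)."""
    return list(accumulate([0] + [seq_length] * batch_size))


def fuse_qkv(query, key, value):
    """Concatenate Q, K, V weight rows along the output dimension (dim 0)."""
    return query + key + value


offsets = cumulative_lengths(batch_size=3, seq_length=4)
print(offsets)

# Toy 2x2 projection weights; a real BERT-base layer uses 768x768.
q = [[1, 0], [0, 1]]
k = [[2, 0], [0, 2]]
v = [[3, 0], [0, 3]]
qkv = fuse_qkv(q, k, v)
print(len(qkv), len(qkv[0]))
```

With these toy inputs the fused weight has three times as many rows as each individual projection and the same number of columns.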
--- a/examples/02_bert/benchmark_mi250.sh
+++ b/examples/02_bert/benchmark_mi250.sh
+#!/bin/bash
+
+#profile
+HIP_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python3 benchmark_ait.py
+
+#1GCD
+HIP_VISIBLE_DEVICES=0 python3 benchmark_ait.py --batch-size "$1"
+
+#2GCD
+HIP_VISIBLE_DEVICES=0 python3 benchmark_ait.py --batch-size "$1" &
+HIP_VISIBLE_DEVICES=1 python3 benchmark_ait.py --batch-size "$1"
+wait
--- a/examples/02_bert/benchmark_pt.py
+++ b/examples/02_bert/benchmark_pt.py
+#  Copyright (c) Meta Platforms, Inc. and affiliates.
+#
+#  Licensed under the Apache License, Version 2.0 (the "License");
+#  you may not use this file except in compliance with the License.
+#  You may obtain a copy of the License at
+#
+#      http://www.apache.org/licenses/LICENSE-2.0
+#
+#  Unless required by applicable law or agreed to in writing, software
+#  distributed under the License is distributed on an "AS IS" BASIS,
+#  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+#  See the License for the specific language governing permissions and
+#  limitations under the License.
+#
+
+import click
+import torch
+from aitemplate.testing.benchmark_pt import benchmark_torch_function
+from modeling.torch_model import BertBaseUncased
+
+
+def benchmark_pt(pretrained=True, batchsize=0):
+    bert = BertBaseUncased(pretrained=pretrained)
+    model = bert._model
+    model.eval()
+
+    if batchsize == 0:
+        candidate_batch_sizes = [1, 2, 4, 8, 16, 32, 64, 128, 256]
+    else:
+        candidate_batch_sizes = [batchsize]
+
+    with torch.inference_mode():
+        for seq_length in [64, 128]:
+            for batch_size in candidate_batch_sizes:
+                try:
+                    input_ids, token_type_ids, position_ids = bert.generate_inputs(
+                        batch_size, seq_length
+                    )
+                    bert.forward(
+                        input_ids=input_ids,
+                        token_type_ids=token_type_ids,
+                        position_ids=position_ids,
+                    )
+                    # warmup
+                    t = benchmark_torch_function(
+                        100,
+                        bert.forward,
+                        input_ids=input_ids,
+                        token_type_ids=token_type_ids,
+                        position_ids=position_ids,
+                    )
+                    # benchmark
+                    t = benchmark_torch_function(
+                        100,
+                        bert.forward,
+                        input_ids=input_ids,
+                        token_type_ids=token_type_ids,
+                        position_ids=position_ids,
+                    )
+                    print(
+                        f"bert pt: batch_size: {batch_size}, seq_length: {seq_length}, {t} ms",
+                    )
+                    with open("bert_pt_benchmark.txt", "a") as f:
+                        f.write(
+                            f"batch_size: {batch_size}, seq_length: {seq_length} latency: {t} ms\n"
+                        )
+                except RuntimeError:
+                    # pt runs out of memory
+                    break
+
+
+def benchmark_pt_encoders_only(pretrained=True, batchsize=0):
+    model = BertBaseUncased(pretrained=pretrained)
+    pt_bert = model._model
+    pt_bert.eval()
+
+    encoder = pt_bert.bert.encoder
+    hidden_size = pt_bert.config.hidden_size
+
+    if batchsize == 0:
+        candidate_batch_sizes = [1, 2, 4, 8, 16, 32, 64, 128, 256]
+    else:
+        candidate_batch_sizes = [batchsize]
+
+    for seq_length in [64, 128, 256]:
+        for batch_size in candidate_batch_sizes:
+            try:
+                encoder_input = (
+                    torch.randn([batch_size, seq_length, hidden_size]).cuda().half()
+                )
+                encoder.forward(encoder_input)
+                # warmup
+                t = benchmark_torch_function(
+                    100,
+                    encoder.forward,
+                    encoder_input,
+                )
+                # benchmark
+                t = benchmark_torch_function(
+                    100,
+                    encoder.forward,
+                    encoder_input,
+                )
+                print(
+                    f"bert encoders pt: batch_size: {batch_size}, seq_length: {seq_length}, {t} ms",
+                )
+                with open("bert_encoders_pt_benchmark.txt", "a") as f:
+                    f.write(
+                        f"batch_size: {batch_size}, seq_length: {seq_length} latency: {t} ms\n"
+                    )
+            except RuntimeError:
+                # pt runs out of memory
+                break
+
+
+@click.command()
+@click.option(
+    "--use-pretrained-pt-model",
+    type=bool,
+    default=True,
+    help="Whether or not to use the pretrained BERT model weights.",
+)
+@click.option(
+    "--encoders-only",
+    type=bool,
+    default=True,
+    help="Whether or not to run the BERT benchmark with encoders only. If enabled, only the transformer blocks without BERT embeddings are benchmarked.",
+)
+@click.option(
+    "--batch-size",
+    type=int,
+    default=0,
+    help="The batch size to use for the benchmark. If 0, sweep batch sizes 1, 2, 4, ..., 256.",
+)
+def benchmark(
+    use_pretrained_pt_model: bool,
+    encoders_only: bool,
+    batch_size: int,
+):
+    if encoders_only:
+        benchmark_pt_encoders_only(use_pretrained_pt_model, batch_size)
+    else:
+        benchmark_pt(use_pretrained_pt_model, batch_size)
+
+
+if __name__ == "__main__":
+    torch.manual_seed(4896)
+    benchmark()
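
Review note: both benchmark scripts time each configuration twice and keep only the second measurement, so one-time setup costs do not skew the reported latency. The following is a minimal CPU-only sketch of that pattern. `benchmark_function` and `warmup_then_measure` are hypothetical stand-ins, not the `aitemplate.testing.benchmark_pt` API, and GPU timing would additionally need device synchronization, which this sketch omits.

```python
import time


def benchmark_function(iters, func, *args, **kwargs):
    """Return the mean wall-clock latency in milliseconds over `iters` calls."""
    start = time.perf_counter()
    for _ in range(iters):
        func(*args, **kwargs)
    return (time.perf_counter() - start) * 1000.0 / iters


def warmup_then_measure(func, *args, iters=100, **kwargs):
    """Run one discarded warm-up pass, then return the measured latency."""
    benchmark_function(iters, func, *args, **kwargs)  # warm-up, result ignored
    return benchmark_function(iters, func, *args, **kwargs)


calls = []
latency_ms = warmup_then_measure(lambda n: calls.append(sum(range(n))), 1000, iters=10)
print(f"mean latency: {latency_ms:.4f} ms over {len(calls)} total calls")
```

The workload runs `2 * iters` times in total: once for the warm-up pass and once for the measured pass.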
--- a/examples/02_bert/demo.py
+++ b/examples/02_bert/demo.py
+#  Copyright (c) Meta Platforms, Inc. and affiliates.
+#
+#  Licensed under the Apache License, Version 2.0 (the "License");
+#  you may not use this file except in compliance with the License.
+#  You may obtain a copy of the License at
+#
+#      http://www.apache.org/licenses/LICENSE-2.0
+#
+#  Unless required by applicable law or agreed to in writing, software
+#  distributed under the License is distributed on an "AS IS" BASIS,
+#  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+#  See the License for the specific language governing permissions and
+#  limitations under the License.
+#
+import click
+
+import torch
+
+from transformers import BertTokenizer
+
+from benchmark_ait import compile_module
+from modeling.torch_model import BertBaseUncased as BertPt
+
+
+def prepare_data(prompt: str, model_path: str):
+    tokenizer = BertTokenizer.from_pretrained(model_path)
+    result = tokenizer(prompt, return_attention_mask=False, return_tensors="pt")
+    target_size = result["input_ids"].size()
+    if target_size[1] > 512:
+        raise ValueError("Sequence length > 512 is not supported")
+
+    result["position_ids"] = (
+        torch.arange(target_size[1], dtype=torch.int64)
+        .reshape(result["input_ids"].size())
+        .contiguous()
+        .cuda()
+    )
+    return result
+
+
+def run_model(
+    prompt: str,
+    activation: str,
+    graph_mode: bool,
+    use_fp16_acc: bool,
+    verify: bool,
+    model_path="bert-base-uncased",
+):
+    inputs = prepare_data(prompt, model_path)
+    inputs_pt = {name: data.cuda() for name, data in inputs.items()}
+    batch_size, seq_len = inputs["input_ids"].size()
+
+    pt_model = BertPt(model_path=model_path, pretrained=True)._model
+    pt_model.eval()
+    hidden_size = pt_model.config.hidden_size
+
+    mod = compile_module(
+        batch_size, seq_len, hidden_size, activation, use_fp16_acc, False, pt_model
+    )
+
+    outputs = [torch.empty(mod.get_output_maximum_shape(0)).half().cuda()]
+    mod.run_with_tensors(inputs_pt, outputs, graph_mode=graph_mode)
+
+    print(f"Logits: {outputs[0]}")
+    if verify:
+        pt_outputs = pt_model.bert(**inputs_pt)
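+        # torch.allclose only returns a bool, so report the result instead of discarding it.
+        is_close = torch.allclose(outputs[0], pt_outputs.last_hidden_state, 1e-1, 1e-1)
+        print(f"Verification {'passed' if is_close else 'FAILED'}!")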
+
+
+@click.command()
+@click.option(
+    "--prompt",
+    type=str,
+    default="The quick brown fox jumps over the lazy dog.",
+    help="The prompt to give BERT.",
+)
+@click.option(
+    "--activation",
+    type=str,
+    default="fast_gelu",
+    help="Activation function applied on BERT; currently only gelu and fast_gelu are supported",
+)
+@click.option(
+    "--graph_mode",
+    type=bool,
+    default=True,
+    help="Use CUDA graph or not. (hipGraph is not supported yet)",
+)
+@click.option(
+    "--use_fp16_acc",
+    type=bool,
+    default=True,
+    help="Use fp16 accumulation or not (TensorRT is using fp16_acc)",
+)
+@click.option(
+    "--verify",
+    type=bool,
+    default=True,
+    help="Verify AIT outputs against PT",
+)
+def run_demo(
+    prompt: str,
+    activation: str,
+    graph_mode: bool,
+    use_fp16_acc: bool,
+    verify: bool,
+):
+    run_model(prompt, activation, graph_mode, use_fp16_acc, verify, model_path="./bert-base-uncased")
+
+
+if __name__ == "__main__":
+    torch.manual_seed(4896)
+    run_demo()
--- a/examples/02_bert/modeling/__init__.py
+++ b/examples/02_bert/modeling/__init__.py
+#  Copyright (c) Meta Platforms, Inc. and affiliates.
+#
+#  Licensed under the Apache License, Version 2.0 (the "License");
+#  you may not use this file except in compliance with the License.
+#  You may obtain a copy of the License at
+#
+#      http://www.apache.org/licenses/LICENSE-2.0
+#
+#  Unless required by applicable law or agreed to in writing, software
+#  distributed under the License is distributed on an "AS IS" BASIS,
+#  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+#  See the License for the specific language governing permissions and
+#  limitations under the License.
+#
--- a/examples/02_bert/modeling/__pycache__/__init__.cpython-310.pyc
+++ b/examples/02_bert/modeling/__pycache__/__init__.cpython-310.pyc
--- a/examples/02_bert/modeling/__pycache__/bert.cpython-310.pyc
+++ b/examples/02_bert/modeling/__pycache__/bert.cpython-310.pyc
--- a/examples/02_bert/modeling/__pycache__/torch_model.cpython-310.pyc
+++ b/examples/02_bert/modeling/__pycache__/torch_model.cpython-310.pyc