<div align="center">
<img src="Graphcore-Chinese-Wordmark-Horizontal.svg">
</div>
[ [中文](README.zh_CN.md) ]
# Graphcore® C600
The Graphcore® C600 IPU-Processor PCIe Card is a high-performance acceleration server card targeted for machine learning inference and training. Powered by the Graphcore Mk2 IPU Processor with FP8 support, the C600 is a dual-slot, full height PCI Express Gen4 card designed for mounting in industry standard server chassis to accelerate machine intelligence workloads.
Up to eight C600 IPU-Processor PCIe Cards can be networked together using IPU-Link™ high-bandwidth interconnect cables, delivering enhanced IPU compute capability.
## Product Specs
| Name | Description |
| :-----| :-----|
| IPU Processor | Graphcore Mk2 IPU Processor with FP8 support |
| IPU-Cores™ | 1,472 IPU-Cores, each one a high-performance processor capable of multi-thread, independent code execution |
| In-Processor Memory™ | Each IPU-Core is paired with fast, local, tightly-coupled In-Processor Memory. The C600 accelerator includes 900MB of In-Processor Memory |
| Compute | Up to 560 teraFLOPS of FP8 compute <br> Up to 280 teraFLOPS of FP16 compute <br> Up to 70 teraFLOPS of FP32 compute |
| System Interface | Dual PCIe Gen4 8-lane interfaces |
| Thermal Solution | Passive |
| Form Factor | PCIe full-height/length; double-slot |
| System Dimensions | Length: 267mm (10.50”); Height: 111mm (4.37”); Width: 27.6mm (1.09”); Mass: 1.27kg (2.8lbs) |
| IPU-Link™ | 32 IPU-Link lanes providing 128 GB/s bandwidth (64 GB/s in each direction) |
| TDP | 185W |
| Auxiliary Power Supply | 8-pin |
| Quality Level | Server grade |
For more information about the Graphcore® C600, please refer to [C600 cards](https://docs.graphcore.ai/en/latest/hardware.html#c600-cards).
# PopRT
PopRT is a high-performance inference framework built specifically for Graphcore IPUs. It deeply optimizes trained models, generates executable programs that run on Graphcore IPUs, and performs low-latency, high-throughput inference.
You can get PopRT and its documentation from [graphcore/PopRT](https://graphcore.github.io/PopRT/1.4.0/).
Docker images are provided at [graphcorecn/poprt](https://hub.docker.com/r/graphcorecn/poprt).
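The snippet below is a minimal sketch of the convert/compile/run flow that this backend drives through the PopRT Python API (the full implementation lives in `compile_backend_ipu.py` and `engine_poprt.py`); the model path, input name and shape, and PopEF path are placeholders, and the runner is created with its default configuration.
```
# Minimal sketch of the PopRT flow used by this backend; paths, input names,
# and shapes below are placeholders.
import numpy as np
import onnx

from poprt import runtime
from poprt.compiler import Compiler, CompilerOptions
from poprt.converter import Converter

# 1. Convert: optimize the trained ONNX model for IPU inference (e.g. cast to FP16).
model = onnx.load("model.onnx")
converted = Converter(precision="fp16", input_shape={"input": [1, 3, 224, 224]}).convert(model)

# 2. Compile: build a PopEF executable targeting the attached IPU.
options = CompilerOptions()
options.ipu_version = runtime.DeviceManager().ipu_hardware_version()
outputs = [o.name for o in converted.graph.output]
Compiler.compile_and_export(converted.SerializeToString(), outputs, "executable.popef", options)

# 3. Run: execute the PopEF with the PopRT runtime.
runner = runtime.Runner("executable.popef")
inputs, results = {}, {}
for desc in runner.get_model_inputs():
    inputs[desc.name] = np.random.randn(*desc.shape).astype(desc.numpy_data_type())
for desc in runner.get_model_outputs():
    results[desc.name] = np.zeros(desc.shape, dtype=desc.numpy_data_type())
runner.execute(inputs, results)
```
The backend in this repository follows the same flow, additionally selecting FP8 or FP16 precision per model and, for the BERT-family models, compiling and serving them through PopRT's pack mode.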
# Models supported
| Model name | Precision | QPS | Dataset | Metric name | Metric value | report |
| ---- | ---- | ---- | ---- | ---- | ---- | ---- |
| albert-torch-fp32 | FP16 | 3,280 | Open Squad 1.1 | F1 Score | 87.69675 | [report](../../reports/IPU/albert-torch-fp32/) |
| bert-torch-fp32 | FP8 | 4,464 | Open Squad 1.1 | F1 Score | 85.71465 | [report](../../reports/IPU/bert-torch-fp32/) |
| bert-torch-fp32 | FP16 | 3,134 | Open Squad 1.1 | F1 Score | 85.85797 | [report](../../reports/IPU/bert-torch-fp32/) |
| clip-onnx-fp32 | FP16 | 7,305 | Fake Dataset | Mean Diff | 0.00426 | [report](../../reports/IPU/clip-onnx-fp32/) |
| conformer-encoder-onnx-fp32 | FP16 | 9,341 | Fake Dataset | Mean Diff | 0.00161 | [report](../../reports/IPU/conformer-encoder-onnx-fp32/) |
| deberta-torch-fp32 | FP16 | 1,702 | Open Squad 1.1 | F1 Score | 81.24629 | [report](../../reports/IPU/deberta-torch-fp32/) |
| resnet50-torch-fp32 | FP8 | 18,851 | Open Imagenet | Top-1 | 0.76824 | [report](../../reports/IPU/resnet50-torch-fp32/) |
| resnet50-torch-fp32 | FP16 | 13,499 | Open Imagenet | Top-1 | 0.76963 | [report](../../reports/IPU/resnet50-torch-fp32/) |
| roberta-torch-fp32 | FP16 | 3,088 | Open Squad 1.1 | F1 Score | 83.1606 | [report](../../reports/IPU/roberta-torch-fp32/) |
| roformer-tf-fp32 | FP16 | 2,520 | OPEN_CAIL2019 | Top-1 | 0.64323 | [report](../../reports/IPU/roformer-tf-fp32/) |
| swin-large-torch-fp32 | FP8 | 480 | Open Imagenet | Top-1 | 0.8552 | [report](../../reports/IPU/swin-large-torch-fp32/) |
| swin-large-torch-fp32 | FP16 | 315 | Open Imagenet | Top-1 | 0.8536 | [report](../../reports/IPU/swin-large-torch-fp32/) |
| videobert-onnx-fp32 | FP16 | 3,691 | OPEN_CIFAR | Top-1 | 0.6169 | [report](../../reports/IPU/videobert-onnx-fp32/) |
| widedeep-tf-fp32 | FP16 | 31,446,195 | Open Criteo Kaggle | Top-1 | 0.77392 | [report](../../reports/IPU/widedeep-tf-fp32/) |
# How to run
## Download and enable Poplar SDK
```
wget -O 'poplar_sdk-ubuntu_20_04-3.3.0-208993bbb7.tar.gz' 'https://downloads.graphcore.ai/direct?package=poplar-poplar_sdk_ubuntu_20_04_3.3.0_208993bbb7-3.3.0&file=poplar_sdk-ubuntu_20_04-3.3.0-208993bbb7.tar.gz'
tar xzf poplar_sdk-ubuntu_20_04-3.3.0-208993bbb7.tar.gz
source poplar_sdk-ubuntu_20_04-3.3.0+1403-208993bbb7/enable
```
## Start PopRT docker container
```
docker pull graphcorecn/poprt:1.4.0
gc-docker -- -it \
-v `pwd -P`:/workspace \
-w /workspace \
--entrypoint /bin/bash \
graphcorecn/poprt:1.4.0
```
## Install dependencies in docker container
```
apt-get update && \
apt-get install wget libglib2.0-0 -y
```
## Run byte-mlperf task
For example,
```
python3 launch.py --task widedeep-tf-fp32 --hardware IPU
```
For more information about the command used to run the task, please refer to [ByteMLPerf](../../../README.md#usage).
<div align="center">
<img src="Graphcore-Chinese-Wordmark-Horizontal.svg">
</div>
[ [English](README.md) ]
# Graphcore® C600
The C600 is a high-end accelerator card built by Graphcore for cloud and data-center deployments. It is designed primarily for inference while also supporting training, serves mainstream AI applications, and is particularly well suited to workloads such as search and recommendation. The C600 delivers low latency and high throughput without sacrificing accuracy, helping AI developers resolve the trade-off between precision and speed and offering a new path to unlock the compute power of the IPU for AI applications, meeting the strong demand from customers and machine-intelligence practitioners for inference products that are easy to use, efficient, and offer a better TCO.
The C600 is a dual-slot PCIe Gen 4 card with a single IPU. Each IPU has 1,472 processing cores and can run 8,832 independent program threads in parallel, and each IPU carries 900MB of on-chip SRAM. Up to eight cards can be directly connected in a single chassis and bridged through high-bandwidth IPU-Links. The C600 can be used with mainstream AI servers on the market, such as the Inspur NF5468M6.
## Product Specs
| Name | Description |
| :-----| :-----|
| **IPU Processor** | Graphcore® Mk2 IPU Processor with FP8 support |
| **IPU-Cores™** | 1,472 IPU-Cores, each one a high-performance processor capable of multi-thread, independent code execution |
| **In-Processor Memory™** | Each IPU-Core is paired with fast, tightly coupled local In-Processor Memory <br> The C600 accelerator includes 900MB of In-Processor Memory |
| **Compute** | Up to 560 teraFLOPS of FP8 compute <br> Up to 280 teraFLOPS of FP16 compute |
| **System Interface** | Two 8-lane ports bifurcated from a 16-lane PCIe interface |
| **Thermal Solution** | Passive |
| **Form Factor** | PCIe full-height/full-length; double-slot |
| **Dimensions** | Length: 267mm (10.5") <br> Height: 111mm (4.37") <br> Width: 27.6mm (1.09") |
| **Mass** | 1.27kg (2.8lbs) |
| **IPU-Link™ Support** | 64 lanes, 256GB/s of dual IPU-Links |
| **TDP** | 185W |
| **Auxiliary Power Supply** | 8-pin |
| **Quality Level** | Server grade |
For more information about the Graphcore® C600, please visit the [Graphcore website](https://www.graphcore.cn/c600-pcie%e5%8d%a1/).
# Graphcore® PopRT
PopRT is a high-performance inference engine for IPU processors. It takes a trained and exported model, applies deep compile-time optimizations for inference, generates a PopEF executable that can run on the IPU, and provides a flexible runtime for low-latency, high-throughput inference on the PopEF.
PopRT provides easy-to-integrate Python and C++ APIs; ByteMLPerf models are optimized, compiled, and run on the IPU through the PopRT Python API.
For more information about PopRT, please visit the [PopRT User Guide](https://graphcore.github.io/PopRT/1.4.0/).
To get PopRT Docker images, please visit [graphcorecn/poprt](https://hub.docker.com/r/graphcorecn/poprt).
# Models supported
| Model name | Precision | QPS | Dataset | Metric name | Metric value | report |
| ---- | ---- | ---- | ---- | ---- | ---- | ---- |
| albert-torch-fp32 | FP16 | 3,280 | Open Squad 1.1 | F1 Score | 87.69675 | [report](../../reports/IPU/albert-torch-fp32/) |
| bert-torch-fp32 | FP8 | 4,464 | Open Squad 1.1 | F1 Score | 85.71465 | [report](../../reports/IPU/bert-torch-fp32/) |
| bert-torch-fp32 | FP16 | 3,134 | Open Squad 1.1 | F1 Score | 85.85797 | [report](../../reports/IPU/bert-torch-fp32/) |
| clip-onnx-fp32 | FP16 | 7,305 | Fake Dataset | Mean Diff | 0.00426 | [report](../../reports/IPU/clip-onnx-fp32/) |
| conformer-encoder-onnx-fp32 | FP16 | 9,341 | Fake Dataset | Mean Diff | 0.00161 | [report](../../reports/IPU/conformer-encoder-onnx-fp32/) |
| deberta-torch-fp32 | FP16 | 1,702 | Open Squad 1.1 | F1 Score | 81.24629 | [report](../../reports/IPU/deberta-torch-fp32/) |
| resnet50-torch-fp32 | FP8 | 18,851 | Open Imagenet | Top-1 | 0.76824 | [report](../../reports/IPU/resnet50-torch-fp32/) |
| resnet50-torch-fp32 | FP16 | 13,499 | Open Imagenet | Top-1 | 0.76963 | [report](../../reports/IPU/resnet50-torch-fp32/) |
| roberta-torch-fp32 | FP16 | 3,088 | Open Squad 1.1 | F1 Score | 83.1606 | [report](../../reports/IPU/roberta-torch-fp32/) |
| roformer-tf-fp32 | FP16 | 2,520 | OPEN_CAIL2019 | Top-1 | 0.64323 | [report](../../reports/IPU/roformer-tf-fp32/) |
| swin-large-torch-fp32 | FP8 | 480 | Open Imagenet | Top-1 | 0.8552 | [report](../../reports/IPU/swin-large-torch-fp32/) |
| swin-large-torch-fp32 | FP16 | 315 | Open Imagenet | Top-1 | 0.8536 | [report](../../reports/IPU/swin-large-torch-fp32/) |
| videobert-onnx-fp32 | FP16 | 3,691 | OPEN_CIFAR | Top-1 | 0.6169 | [report](../../reports/IPU/videobert-onnx-fp32/) |
| widedeep-tf-fp32 | FP16 | 31,446,195 | Open Criteo Kaggle | Top-1 | 0.77392 | [report](../../reports/IPU/widedeep-tf-fp32/) |
# How to run
## Download and enable Poplar SDK
```
wget -O 'poplar_sdk-ubuntu_20_04-3.3.0-208993bbb7.tar.gz' 'https://downloads.graphcore.ai/direct?package=poplar-poplar_sdk_ubuntu_20_04_3.3.0_208993bbb7-3.3.0&file=poplar_sdk-ubuntu_20_04-3.3.0-208993bbb7.tar.gz'
tar xzf poplar_sdk-ubuntu_20_04-3.3.0-208993bbb7.tar.gz
source poplar_sdk-ubuntu_20_04-3.3.0+1403-208993bbb7/enable
```
## Start PopRT Docker container
```
docker pull graphcorecn/poprt:1.4.0
gc-docker -- -it \
-v `pwd -P`:/workspace \
-w /workspace \
--entrypoint /bin/bash \
graphcorecn/poprt:1.4.0
```
## Install ByteMLPerf dependencies
```
apt-get update && \
apt-get install wget libglib2.0-0 -y
```
## Run a ByteMLPerf task
Run it with the following command:
```
python3 launch.py --task widedeep-tf-fp32 --hardware IPU
```
For more information about the ByteMLPerf run commands, please refer to [ByteMLPerf](../../../README.zh_CN.md#usage).
# Copyright 2023 Graphcore Ltd.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
import sys
current_dir = os.path.split(os.path.abspath(__file__))[0]
byte_mlperf_dir = current_dir.rsplit("/", 2)[0]
sys.path.append(byte_mlperf_dir)
# Copyright 2023 Graphcore Ltd.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import copy
import json
import logging
import os
from pathlib import Path
from typing import Any, Dict
import onnx
import poprt
from poprt import runtime
from poprt.compiler import Compiler, CompilerOptions
from poprt.converter import Converter
from tools import saved_to_onnx, torch_to_onnx
from general_perf.backends import compile_backend
from general_perf.backends.IPU.passes import *
log = logging.getLogger("CompileBackendIPU")
class CompileBackendIPU(compile_backend.CompileBackend):
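    """ByteMLPerf compile backend for the Graphcore IPU, built on top of PopRT."""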
def __init__(self):
super(CompileBackendIPU, self).__init__()
self.hardware_type = "IPU"
self.need_reload = False
self.model_runtimes = []
self.current_dir = os.path.split(os.path.abspath(__file__))[0]
self.model_config = None
self.packrunner = False
self.precision = "fp32"
def version(self) -> str:
"""Return compile backend version details."""
return poprt.__version__
def pre_optimize(self, configs: Dict[str, Any]):
"""Model pre-optimization interface.
Requirements: Model pre-optimization
cannot change the model format. Torch model export to ONNX is allowed.
"""
self._update_model_config(configs.get("interact_info", {}))
self.precision = (
self.model_config.get("converter_options", {})
.get("precision", "FP32")
.upper()
)
if self.model_config.get("pack_config"):
self.packrunner = True
model_info = configs["model_info"]
model_type = model_info["model_format"]
model_name = model_info["model"]
pre_optimized_root = Path(self.current_dir) / "pre_optimized_models"
if not pre_optimized_root.exists():
pre_optimized_root.mkdir(parents=True)
model_path = os.path.abspath(configs["model_info"]["model_path"])
onnx_path = pre_optimized_root / (model_name + ".onnx")
if not self.model_config:
self.model_config = configs.get("interact_info", {})
        # convert the model to ONNX if it is not already
        # configs['workload'] is the content of workloads/<task_name>.json and
        # configs['model_info'] is the content of model_zoo/<task_name>.json
if model_type != "onnx":
if onnx_path.exists():
onnx_path = self._update_pack_model(onnx_path, model_info)
model_info["model_path"] = onnx_path
log.info("{} file exists, skip ONNX conversion".format(onnx_path.name))
else:
# convert the model to onnx
log.info(
"Convert the model: {} from format: {} to onnx".format(
model_name, model_type
)
)
if model_type == "saved_model":
saved_to_onnx.savedmodel_to_onnx(model_path, onnx_path)
onnx_path = self._update_pack_model(onnx_path, model_info)
elif model_type == "pt":
torch_to_onnx.torch_to_onnx(model_path, str(onnx_path))
onnx_path = self._update_pack_model(onnx_path, model_info)
else:
log.error(
"Wrong model type: {}, which must be saved_model, pt, or onnx".format(
model_type
)
)
raise TypeError("Model type must be saved_model, pt, or onnx")
if os.path.exists(onnx_path):
model_info["model_path"] = onnx_path
log.info(
"Converted the model: {} from format: {} to onnx".format(
model_name, model_type
)
)
else:
log.error(
"{} not exists, failed to convert the model: {} to onnx".format(
onnx_path, model_name
)
)
raise RuntimeError("Failed to convert model to onnx")
else:
log.info("{} is onnx model, skip ONNX conversion".format(model_name))
return configs
def compile(self, config, dataloader=None):
self.model_info = config["model_info"]
if not self.model_config:
self.model_config = config["interact_info"]
# precision not in model_config (multiple precisions available) and user
# skipped precision selection prompt
if not self.precision and not config["interact_info"].get("precision"):
self.precision = "fp16"
if "converter_options" not in self.model_config:
self.model_config["converter_options"] = {}
self.model_config["converter_options"]["precision"] = self.precision
if self.model_config.get("pack_config"):
self.packrunner = True
log.info("The interaction info is:\n {}".format(self.model_config))
result = {
"model": config["model_info"]["model"],
"framework": config["model_info"]["framework"],
"compile_precision": self.precision,
"input_type": config["model_info"]["input_type"].split(","),
"max_batch_size": config["model_info"]["max_batch_size"],
"compile_status": "success",
"optimizations": {},
"instance_count": 1,
"device_count": 1,
"sg_percent": 100,
"segments": [
{
"sg_idx": 0,
"is_fallback": False,
"input_tensor_map": config["model_info"]["input_shape"],
"output_tensor_map": config["model_info"]["outputs"],
"compiled_model": [
{
"compiled_bs": 1,
"compiled_obj": config["model_info"]["model_path"],
},
],
},
],
"interact_info": self.model_config,
}
        # The pack runner takes single samples as input and performs batching
        # asynchronously inside itself.
        # pack_config["batch_size"] is the batch size used to compile the packed model,
        # while the regular "batch_size" is the dataloader/inference batch size and should
        # always be 1, since the pack runner only takes a single sample at a time.
if self.packrunner:
pack_config = self.model_config["pack_config"]
            assert (
                "batch_size" in pack_config
            ), "pack mode requires pack_config['batch_size'] as the compile batch size of the packed model"
            assert isinstance(
                pack_config["batch_size"], int
            ), "pack_config['batch_size'] should be a positive integer"
compile_bs = [pack_config["batch_size"]]
else:
compile_bs = config["workload"]["batch_sizes"]
for batch_size in compile_bs:
self._compile(batch_size)
return result
def get_interact_profile(self, config):
"""Collect information for core engine to let user interactively fill in configurations."""
model_profile = []
# load the interact_info by model name
interact_info_file = os.path.join(
self.current_dir, "interact_infos", config["model_info"]["model"] + ".json"
)
file_path = os.path.join(self.current_dir, self.hardware_type + ".json")
with open(file_path, "r") as f:
interact_info = json.load(f)
if os.path.exists(interact_info_file):
            # the model has its own config file, but it may not provide all required options
with open(interact_info_file, "r") as f:
self.model_config = json.load(f)
log.info("interact_info set by file: {}".format(interact_info_file))
if not self.model_config.get("converter_options", {}).get("precision"):
for _, v in enumerate(interact_info):
if v["name"] == "precision":
model_profile.append(v)
else:
file_path = os.path.join(self.current_dir, self.hardware_type + ".json")
if os.path.exists(file_path):
with open(file_path, "r") as f:
model_profile = json.load(f)
else:
log.info("File path: {} does not exist, please check".format(file_path))
return model_profile
def get_best_batch_size(self):
"""Get Best Batch Size for the model.
Usually take the max batch size can be loaded to IPU as the best batch size to
get highest throughput.
"""
return self.model_config.get("batch_sizes", None)
def _update_model_config(self, interact_info):
# update poprt configuration based on interact_info
if not self.model_config:
self.model_config = {}
self.model_config["converter_options"] = interact_info.get(
"converter_options", {}
)
self.model_config["clients"] = int(interact_info.get("clients", "1"))
        batch_sizes = [
            x.strip()
            for x in interact_info.get("batch_sizes", "").split(",")
            if x.strip()
        ]
        if batch_sizes:
            self.model_config["batch_sizes"] = [int(x) for x in batch_sizes]
self.model_config["compiler_options"] = json.loads(
interact_info.get("compiler_options", "{}")
)
self.model_config["clients"] = int(self.model_config.get("clients", "1"))
        batch_sizes = self.model_config.get("batch_sizes", "")
        if isinstance(batch_sizes, str) and batch_sizes:
            self.model_config["batch_sizes"] = [
                int(x.strip()) for x in batch_sizes.split(",") if x.strip().isdigit()
            ]
for key, value in self.model_config.items():
if "_options" in key and isinstance(value, str):
self.model_config[key] = json.loads(value)
if interact_info.get("precision"):
self.model_config["converter_options"]["precision"] = interact_info[
"precision"
]
            # update the converter config when the user selected fp8 in the interact
            # section and the interact_info config file contains fp8_configs
if interact_info["precision"] == "fp8" and self.model_config.get(
"fp8_configs"
):
for config_name, config_section in self.model_config[
"fp8_configs"
].items():
if isinstance(self.model_config[config_name], dict):
self.model_config[config_name].update(config_section)
else:
self.model_config[config_name] = config_section
del self.model_config["fp8_configs"]
def _compile(self, batch_size):
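        """Compile the ONNX model into a PopEF executable for the given batch size."""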
self.batch_size = batch_size
# differentiate popef based on precision
self.popef_path = os.path.join(
self.current_dir,
"compiled_models",
self.model_info["model"],
str(batch_size),
"executable_{}.popef".format(self.precision),
)
self.popef_path = os.path.abspath(self.popef_path)
if os.path.exists(self.popef_path):
log.info(
"The PopEF file {} already exist, skip compile".format(
os.path.abspath(self.popef_path)
)
)
return self.popef_path
log.info("Create the directory {}".format(os.path.dirname(self.popef_path)))
os.makedirs(os.path.dirname(self.popef_path), exist_ok=True)
converter_options = self.model_config.get("converter_options", {})
compiler_options = self.model_config.get("compiler_options", {})
converted_model = self._convert(converter_options)
self._poprt_compile(converted_model, compiler_options, self.popef_path)
return self.popef_path
def _convert(self, converter_options: Dict) -> onnx.ModelProto:
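        """Apply the batch size to the model input shapes and run the PopRT Converter."""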
model_proto = onnx.load(self.model_info["model_path"])
input_shape = {}
not_extended_with_batch = self.model_config.get("not_extended_with_batch", [])
for name, shape in self.model_info["input_shape"].items():
if name in not_extended_with_batch:
batched_shape = [shape[0]] + shape[1:]
elif name == "text" and "videobert" in self.model_info["model"]:
batched_shape = [shape[0]] + shape[1:]
else:
batched_shape = [shape[0] * self.batch_size] + shape[1:]
log.info(
"The model input {} with shape {} in the model information, and shape with batch size is {}.".format(
name, shape, batched_shape
)
)
input_shape[name] = batched_shape
converter_options["input_shape"] = input_shape
converter = Converter(**converter_options)
converted_model = converter.convert(model_proto)
return converted_model
def _poprt_compile(
self, converted_model: onnx.ModelProto, compiler_options: dict, popef_path: str
):
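        """Compile the converted ONNX model to a PopEF file with the given compiler options."""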
options = CompilerOptions()
options.ipu_version = runtime.DeviceManager().ipu_hardware_version()
options.num_io_tiles = compiler_options.get("num_iotiles", 0)
options.batches_per_step = compiler_options.get("batches_per_step", 1)
options.enable_prefetch_datastreams = compiler_options.get(
"enable_prefetch_datastreams", False
)
options.stream_buffering_depth = compiler_options.get(
"stream_buffering_depth", 1
)
options.available_memory_proportion = compiler_options.get(
"available_memory_proportion", 0.6
)
options.partials_type = compiler_options.get("partials_type", "half")
options.use_128bit_conv_unit_load = compiler_options.get(
"use_128bit_conv_unit_load", False
)
options.enable_fast_reduce = compiler_options.get("enable_fast_reduce", False)
options.group_host_sync = compiler_options.get("group_host_sync", False)
options.rearrange_anchors_on_host = compiler_options.get(
"rearrange_anchors_on_host", False
)
options.enable_outlining = compiler_options.get("enable_outlining", True)
options.outline_threshold = compiler_options.get("outline_threshold", 1.0)
outputs = [o.name for o in converted_model.graph.output]
Compiler.compile_and_export(
converted_model.SerializeToString(), outputs, popef_path, options
)
return popef_path
def _update_pack_model(self, model_path, model_info):
"""bert like model conversion for pack mode, update corresponded configs as well."""
if not self.packrunner:
return model_path
model = onnx.load(model_path)
        # update the actual model_path to point at the packed model
model_path = (
Path(self.current_dir)
/ "pre_optimized_models"
/ (Path(model_path).stem + "_pack.onnx")
)
assert "input_shape" in model_info
assert "inputs" in model_info
assert "dataset_name" in model_info
assert "input_type" in model_info
model_info["inputs"] += ",position_ids"
model_info["input_type"] += ",LONG"
model_info["model_path"] = str(model_path)
if "deberta" in model_info["model"]:
model_info["input_shape"]["unpack_info"] = [1, 1]
else:
model_info["input_shape"]["position_ids"] = [1, 384]
self.model_info = model_info
if self.model_info["model"] == "roberta-torch-fp32":
rm_node_names = [
"/model/roberta/embeddings/Equal",
"/model/roberta/embeddings/Not",
"/model/roberta/embeddings/Cast",
"/model/roberta/embeddings/CumSum",
"/model/roberta/embeddings/Mul",
"/model/roberta/embeddings/Cast_1",
]
rm_nodes = []
for node in model.graph.node:
if node.name in rm_node_names:
rm_nodes.append(node)
assert len(rm_node_names) == len(rm_nodes)
for node in rm_nodes:
model.graph.node.remove(node)
position_ids = copy.deepcopy(model.graph.input[0])
position_ids.name = "position_ids"
model.graph.input.append(position_ids)
for node in model.graph.node:
if (
node.op_type == "Add"
and node.name == "/model/roberta/embeddings/Add"
):
node.input[0] = position_ids.name
elif self.model_info["model"] in ("bert-torch-fp32", "albert-torch-fp32"):
            # for packed bert, we need to expose position_ids as a model input
            # step 1: remove unneeded nodes
model_name = "albert" if "albert" in model_path.name else "bert"
rm_node_names = [
"/model/{0}/embeddings/Shape".format(model_name),
"/model/{0}/embeddings/Gather".format(model_name),
"/model/{0}/embeddings/Unsqueeze".format(model_name),
"/model/{0}/embeddings/Slice".format(model_name),
]
rm_nodes = []
for node in model.graph.node:
if node.name in rm_node_names:
rm_nodes.append(node)
assert len(rm_node_names) == len(rm_nodes)
for node in rm_nodes:
model.graph.node.remove(node)
# step 2: add position_ids to model's input
position_ids = copy.deepcopy(model.graph.input[0])
position_ids.name = "position_ids"
model.graph.input.append(position_ids)
for node in model.graph.node:
if (
node.op_type == "Gather"
and node.name
== "/model/{0}/embeddings/position_embeddings/Gather".format(
model_name
)
):
node.input[1] = position_ids.name
print("Save preprocessed model to {}".format(model_path))
onnx.save(model, model_path)
return model_path
# Copyright 2023 Graphcore Ltd.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
class Engine:
    def __init__(self):
raise NotImplementedError
def predict(self, feeds):
raise NotImplementedError
def benchmark(self, clients, batch_size, iterations):
raise NotImplementedError
# Copyright 2023 Graphcore Ltd.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import logging
import random
import threading as th
import time
from queue import Queue
import numpy as np
import torch
from poprt import runtime
from . import engine
log = logging.getLogger("engine_poprt")
class PopRT(engine.Engine):
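    """Engine that executes a compiled PopEF through the PopRT runtime Runner (optionally in pack mode)."""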
def __init__(self, popef_path, config):
self.runner = runtime.Runner(popef_path, config)
        self.packrunner = isinstance(config, runtime.PackRunnerConfig)
def predict(self, feeds):
input_descriptions = self.runner.get_model_inputs()
for desc in input_descriptions:
if isinstance(feeds[desc.name], list):
feeds[desc.name] = np.array(
feeds[desc.name], dtype=desc.numpy_data_type()
)
elif isinstance(feeds[desc.name], np.ndarray):
feeds[desc.name] = feeds[desc.name].astype(desc.numpy_data_type())
elif isinstance(feeds[desc.name], torch.Tensor):
feeds[desc.name] = (
feeds[desc.name].numpy().astype(desc.numpy_data_type())
)
else:
raise TypeError(
"The feeds[value] must be list, np.ndarray or torch.Tensor"
)
# create the output numpy arrays
output_descriptions = self.runner.get_model_outputs()
results = {}
for output_desc in output_descriptions:
output_shape = output_desc.shape
results[output_desc.name] = np.zeros(
output_shape, dtype=output_desc.numpy_data_type()
)
if self.packrunner:
fut = self.runner.executeAsync(dict(feeds), dict(results))
fut.wait()
else:
self.runner.execute(feeds, results)
return results
def benchmark(self, clients, batch_size, iterations):
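        """Measure QPS and latency using `clients` threads, each running `iterations` executions."""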
input_view = runtime.InputMemoryView()
input_descriptions = self.runner.get_model_inputs()
output_descriptions = self.runner.get_model_outputs()
inputs = {}
outputs = {}
for input_desc in input_descriptions:
inputs[input_desc.name] = np.random.randn(*input_desc.shape).astype(
input_desc.numpy_data_type()
)
for output_desc in output_descriptions:
outputs[output_desc.name] = np.zeros(
output_desc.shape, dtype=output_desc.numpy_data_type()
)
log.info("Warm up")
for _ in range(5):
self.runner.execute(inputs, outputs)
log.info("Warm up completed, start the time counting")
q = Queue()
def perf_count(model_runner, iteration, input_view):
durations = []
for _ in range(iteration):
start_time = time.time()
self.runner.execute(inputs, outputs)
end_time = time.time()
durations.append((start_time, end_time))
# remove the first and last 20
if iteration > 40:
durations = durations[20:-20]
q.put(durations, timeout=10)
thp = [
th.Thread(target=perf_count, args=(self.runner, iterations, input_view))
for _ in range(clients)
]
for t in thp:
t.start()
for t in thp:
t.join()
durations_from_th = []
while not q.empty():
durations_from_th += q.get()
max_timestamp = max(y for _, y in durations_from_th)
min_timestamp = min(x for x, _ in durations_from_th)
if iterations > 40:
            iterations -= 40  # account for the first/last 20 iterations dropped in perf_count
qps = clients * batch_size * iterations / (max_timestamp - min_timestamp)
times_range = [y - x for x, y in durations_from_th]
times_range.sort()
tail_latency = round(times_range[int(len(times_range) * 0.99)] * 1000, 2)
avg_latency = round(sum(times_range) / len(times_range) * 1000, 2)
log.info(
"Batch size is {}, QPS: {}, Avg Latency:{}, Tail Latency:{}".format(
batch_size, int(qps), avg_latency, tail_latency
)
)
np_latency = np.array(times_range) * 1000.0
log.info(
f"====== Latency P50: {np.percentile(np_latency, 50)}, P90: {np.percentile(np_latency, 90)}, P99: {np.percentile(np_latency, 99)} ======"
)
return qps, avg_latency, tail_latency
def benchmark_pack(self, pack_config, iterations):
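        """Benchmark pack mode: each client thread submits single variable-length samples for the runner to pack."""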
output_descriptions = self.runner.get_model_outputs()
outputs = {}
for output_desc in output_descriptions:
shape = output_desc.shape
shape[0] = 1
outputs[output_desc.name] = np.zeros(
shape, dtype=output_desc.numpy_data_type()
)
# average sequence length in squad is ~172
avg_len = 172
max_valid_seq = 384
bs = pack_config.get("batch_size", 20)
sample_num = iterations * bs
input_len = np.random.normal(avg_len, avg_len, size=sample_num).astype(np.int32)
input_len = np.clip(input_len, 1, max_valid_seq)
datasets = []
for s_len in input_len:
sample = {}
            # setting the values to 1 does not affect performance; in pack mode the
            # attention_mask is required to be 1
for input_name in pack_config["input_names"]:
sample[input_name] = np.ones(s_len).astype(np.int32)
datasets.append(sample)
        # each client sends a single sample; one pack batch can hold more than 2*bs samples
clients = int(bs * 3.5)
count_percent = 0.6
q = Queue()
def perf_count(model_runner, iteration):
durations = []
for i in range(sample_num):
start_time = time.time()
sample_idx = random.randint(0, sample_num - 1)
self.runner.execute(datasets[sample_idx], outputs)
end_time = time.time()
durations.append((start_time, end_time))
            # drop the timings of the first and last samples
ignored_samples = int(sample_num * (1 - count_percent) / 2)
durations = durations[ignored_samples:-ignored_samples]
q.put(durations, timeout=10)
thp = [
th.Thread(target=perf_count, args=(self.runner, iterations))
for _ in range(clients)
]
for t in thp:
t.start()
for t in thp:
t.join()
durations_from_th = []
while not q.empty():
durations_from_th += q.get()
max_timestamp = max(y for _, y in durations_from_th)
min_timestamp = min(x for x, _ in durations_from_th)
        # only count_percent of the samples are counted, matching the trimming in perf_count
qps = clients * (sample_num * count_percent) / (max_timestamp - min_timestamp)
times_range = [y - x for x, y in durations_from_th]
times_range.sort()
tail_latency = round(times_range[int(len(times_range) * 0.99)] * 1000, 2)
avg_latency = round(sum(times_range) / len(times_range) * 1000, 2)
log.info(
"Batch size is {}, QPS: {}, Avg Latency:{}, Tail Latency:{}".format(
bs, int(qps), avg_latency, tail_latency
)
)
np_latency = np.array(times_range) * 1000.0
log.info(
f"====== Latency P50: {np.percentile(np_latency, 50)}, P90: {np.percentile(np_latency, 90)}, P99: {np.percentile(np_latency, 99)} ======"
)
return qps, avg_latency, tail_latency
{
"clients": 1,
"batch_sizes": [
1
],
"pack_config": {
"batch_size": 40,
"input_names": ["input_ids.1", "attention_mask.1", "token_type_ids.1"],
"dynamic_input_name" : "input_ids.1",
"mask_name": "attention_mask.1",
"max_pack_num": 100,
"timeout_microseconds": 15000
},
"converter_options": {
"used_passes": [
"insert_attention_mask"
],
"precision": "fp16",
"disable_fast_norm": true
},
"compiler_options": {
"available_memory_proportion": 0.5
}
}
{
"batch_sizes": [
1
],
"pack_config": {
"batch_size": 40,
"input_names": [
"input_ids.1",
"attention_mask.1",
"token_type_ids.1"
],
"dynamic_input_name": "input_ids.1",
"mask_name": "attention_mask.1",
"max_pack_num": 100,
"timeout_microseconds": 15000
},
"converter_options": {
"used_passes": [
"insert_attention_mask"
],
"disable_fast_norm": true,
"enable_insert_remap": false,
"precision": ""
},
"compiler_options": {
"available_memory_proportion": 0.4
},
"fp8_configs": {
"pack_config": {
"batch_size": 45,
"max_pack_num": 120
},
"compiler_options": {
"available_memory_proportion": 0.6
},
"converter_options": {
"fp8_params": "F143,F143,0,0",
"fp8_skip_op_names": "/model/bert/embeddings/word_embeddings/Gather"
}
}
}
{
"batch_sizes": [4,40],
"clients":3,
"converter_options":{
"precision": "fp16",
"infer_shape_ahead": true
},
"compiler_options": {
"num_iotiles": 32,
"batches_per_step": 128,
"enable_prefetch_datastreams": true,
"use_128bit_conv_unit_load": true,
"stream_buffering_depth": 2,
"enable_fast_reduce": true,
"rearrange_anchors_on_host": true,
"group_host_sync": true,
"enable_outlining": true,
"outline_threshold": 5000,
"available_memory_proportion": 0.8
}
}
{
"clients": 4,
"batch_sizes": [
4, 44
],
"converter_options": {
"precision": "fp16",
"enable_insert_remap": true
},
"compiler_options": {
"use_128bit_conv_unit_load": true,
"enable_fast_reduce": true,
"num_iotiles":32,
"batches_per_step":128,
"enable_prefetch_datastreams":true,
"stream_buffering_depth":2
}
}
{
"clients": 1,
"batch_sizes": [
1
],
"pack_config": {
"batch_size": 10,
"input_names": [
"input_ids.1",
"attention_mask.1"
],
"dynamic_input_name": "input_ids.1",
"mask_name": "attention_mask.1",
"max_pack_num": 20,
"timeout_microseconds": 15000
},
"converter_options": {
"disable_fast_norm": true,
"enable_insert_remap": true,
"remap_mode": ["after_matmul"," before_add"],
"max_tensor_size": 6291456,
"precision": "fp16",
"used_passes": ["deberta_pack"]
},
"compiler_options": {
"use_128bit_conv_unit_load": true,
"enable_fast_reduce": true,
"group_host_sync": true,
"available_memory_proportion": 0.4
}
}
{
"clients": 4,
"batch_sizes": [
4,
32
],
"converter_options": {
"precision": ""
},
"compiler_options": {
"num_iotiles": 64,
"batches_per_step": 128,
"enable_prefetch_datastreams": true,
"stream_buffering_depth": 2
},
"fp8_configs": {
"converter_options": {
"fp8_params": "F143,F143,-3,-3",
"fp8_skip_op_names": "/layer1/0/downsample/0/Conv"
}
}
}
{
"clients": 1,
"batch_sizes": [
1
],
"pack_config": {
"batch_size": 40,
"input_names": ["input_ids.1", "attention_mask.1", "token_type_ids.1"],
"dynamic_input_name" : "input_ids.1",
"mask_name": "attention_mask.1",
"max_pack_num": 100,
"timeout_microseconds": 15000
},
"converter_options": {
"used_passes": [
"insert_attention_mask"
],
"precision": "fp16",
"disable_fast_norm": true
},
"compiler_options": {
"available_memory_proportion": 0.4
}
}
{
"clients": 1,
"batch_sizes": [
2, 12
],
"converter_options": {
"precision": "fp16",
"used_passes": [
"pre_scale",
"remove_input_cast",
"matmul_rotary_embedding",
"fused_attention",
"replace_groupnorm_with_fast_norm"
],
"disable_fast_norm": true
},
"compiler_options": {
"use_128bit_conv_unit_load": true,
"enable_fast_reduce": true
}
}
{
"clients": 4,
"batch_sizes": [
2
],
"converter_options": {
"precision": "",
"enable_insert_remap": false
},
"compiler_options": {
"batches_per_step": 128,
"enable_prefetch_datastreams": true,
"use_128bit_conv_unit_load": true,
"group_host_sync": false,
"stream_buffering_depth": 2,
"enable_fast_reduce": true,
"rearrange_anchors_on_host": false,
"available_memory_proportion": 0.2
},
"fp8_configs": {
"converter_options": {
"enable_insert_remap": true
},
"batch_sizes": [
4
]
}
}
{
"batch_sizes": [4, 42],
"clients":3,
"converter_options":{
"precision": "fp16"
},
"compiler_options": {
"batches_per_step": 128,
"enable_prefetch_datastreams": true,
"use_128bit_conv_unit_load": true,
"stream_buffering_depth": 2,
"enable_fast_reduce": true,
"rearrange_anchors_on_host": false,
"enable_outlining": true,
"outline_threshold": 5000,
"available_memory_proportion": 0.2
}
}
{
"batch_sizes": [1024, 20000],
"clients":5,
"converter_options":{
"precision": "fp16"
},
"compiler_options": {
"num_iotiles": 32,
"batches_per_step": 1024,
"enable_prefetch_datastreams": true,
"use_128bit_conv_unit_load": true,
"stream_buffering_depth": 2,
"enable_fast_reduce": true,
"rearrange_anchors_on_host": false
}
}
from . import custom_final_check # noqa
from . import deberta_pack # noqa
# Copyright (c) 2023 Graphcore Ltd. All rights reserved.
from typing import Any, Dict
import onnx
import poprt
from poprt.passes.apply_ir_pass import ApplyIrPass
from poprt.passes.base_pass import ImmutablePass
from poprt.passes.onnx_helper import clean_info
from poprt.passes.remove_duplicated_initializer import RemoveDuplicatedInitializer
# skip register here
# @register('final_check')
class CustomFinalCheck(ImmutablePass):
"""Final check for data type and shape of the converted model."""
name = 'final_check'
def run_transform(
self, graph: onnx.GraphProto, is_main_graph: bool
) -> onnx.GraphProto:
# Check if all tensors have valid dtype and shape
output_tensors = []
for n in graph.node:
output_tensors.extend(n.output)
output_tensors = set(output_tensors)
for t in list(graph.value_info) + list(graph.output):
tensor_type = t.type.tensor_type
has_dtype = tensor_type.HasField("elem_type")
has_shape = (
tensor_type.HasField("shape")
and len(tensor_type.shape.ListFields()) > 0
)
if has_dtype and has_shape:
dtype = tensor_type.elem_type
# If the dtype < 1 (onnx.TensorProto.FLOAT) or dtype > 16 (onnx.TensorProto.BFLOAT16),
# the dtype is invalid.
is_valid_dtype = (
dtype >= onnx.TensorProto.FLOAT
and dtype <= onnx.TensorProto.BFLOAT16
)
shape = [dim.dim_value for dim in tensor_type.shape.dim]
is_valid_shape = 0 not in shape
if (not is_valid_dtype) or (not is_valid_shape):
self.logger.warning(
f"{t.name} has no inferred elem_type {dtype} or shape {shape}"
)
if t.name in output_tensors:
output_tensors.remove(t.name)
for t_name in output_tensors:
self.logger.warning(
f"Graph {graph.name} tensor {t_name} has no elem_type or shape."
)
return graph
def run(self, model: onnx.ModelProto) -> onnx.ModelProto:
# NOTE: skip SortGraph here
# Ensure topological for subgraph
# model = SortGraph().run(model)
# Infer shape and dtype to make sure all passes process validly.
model = clean_info(model)
# Remove duplicated initializer
model = RemoveDuplicatedInitializer().run(model)
model.graph.CopyFrom(self.traverse_graph(model.graph, self.run_transform))
# Ensure each node has a unique name
model = ApplyIrPass(["unique_name_for_nodes"])(model)
return model
def custom_get_all_named_subclasses(cls: Any) -> Dict[str, Any]:
subclasses = {}
def visit(cls):
for subclass in cls.__subclasses__():
if hasattr(subclass, 'name'):
subclasses[subclass.name] = subclass
visit(subclass)
visit(cls)
# patch
subclasses['final_check'] = CustomFinalCheck
return subclasses
# monkey patch
poprt.passes.base_pass.get_all_named_subclasses = custom_get_all_named_subclasses
# Copyright (c) 2023 Graphcore Ltd. All rights reserved.
import onnx
from poprt import Pass
from poprt.passes import register
from poprt.passes.onnx_helper import clean_info, get_dtype, topological_sort
from poprt.passes.pattern_helper import PatternMatcher
from poprt.passes.shape_inference import infer_shapes
@register("deberta_pack")
class PackedDeberta(Pass):
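    """Rewrite a DeBERTa ONNX graph for pack mode: replace its mask handling with AttentionMask nodes and insert an Unpack node."""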
@staticmethod
def _find(items, search_func, return_all=False):
results = []
for i, item in enumerate(items):
if search_func(item):
results.append((i, item))
if not return_all:
break
return results if return_all else (-1, None) if not results else results[0]
def __init__(self):
super().__init__()
def _modify_mask_before_mul_to_input(self, model):
pattern = ["s:0->Unsqueeze:Unsqueeze->Cast:Cast->Mul:Mul->e:5"]
pattern_matcher = PatternMatcher(pattern)
ops = pattern_matcher.next_pattern(model.graph)
if ops:
Cast = onnx.helper.make_node(
"Cast",
name="{}_Cast".format(ops["Unsqueeze"].node.name),
inputs=[ops["Unsqueeze"].node.input[0]],
outputs=["{}_Cast:0".format(ops["Unsqueeze"].node.name)],
to=onnx.TensorProto.BOOL,
)
ops["Unsqueeze"].node.input[0] = Cast.output[0]
model.graph.node.insert(ops["Unsqueeze"].index, Cast)
return model
def _modify_attentionmask(self, model):
pattern = [
"s:0->Reshape:Reshape->Squeeze:Squeeze->Unsqueeze:Unsqueeze->Mul:Mul->Cast:Cast->Not:Not->e:1",
" Reshape:Reshape ->Mul:Mul",
]
pattern_matcher = PatternMatcher(pattern)
ops = pattern_matcher.next_pattern(model.graph)
if ops:
input = ops["Reshape"].node.input[0]
for node in [
ops[key].node
for key in ["Reshape", "Squeeze", "Unsqueeze", "Mul", "Cast", "Not"]
]:
model.graph.node.remove(node)
else:
return model
pattern = [
"s:0 ->WhereV2:WhereV2_1->Softmax:Softmax->WhereV2:WhereV2_2->MatMul:MatMul->e:1",
"s:2->Add:Add->WhereV2:WhereV2_1",
]
pattern_matcher = PatternMatcher(pattern)
ops = pattern_matcher.next_pattern(model.graph)
AttentionMask, AttentionMaskNot = None, None
while ops:
if AttentionMask is None:
dtype = get_dtype(model.graph, ops["Add"].node.output[0])
kwargs = {
"dataType": "FLOAT"
if dtype == onnx.TensorProto.FLOAT
else "FLOAT16"
}
AttentionMask = onnx.helper.make_node(
"AttentionMask",
name="AttentionMask",
inputs=[input, ops["Add"].node.output[0]],
outputs=["{}_AttentionMask".format(ops["Add"].node.output[0])],
domain="ai.graphcore",
**kwargs,
)
Cast = onnx.helper.make_node(
"Cast",
name="{}_Cast".format(AttentionMask.name),
inputs=[AttentionMask.output[0]],
outputs=["{}_Cast:0".format(AttentionMask.name)],
to=onnx.TensorProto.BOOL,
)
Not = onnx.helper.make_node(
"Not",
name="{}_Not".format(Cast.name),
inputs=[Cast.output[0]],
outputs=["{}_Not:0".format(Cast.name)],
)
AttentionMaskNot = onnx.helper.make_node(
"Cast",
name="{}_Cast".format(Not.name),
inputs=[Not.output[0]],
outputs=["{}_Cast:0".format(Not.name)],
to=onnx.TensorProto.FLOAT16,
)
model.graph.node.insert(ops["Softmax"].index, AttentionMaskNot)
model.graph.node.insert(ops["Softmax"].index, Not)
model.graph.node.insert(ops["Softmax"].index, Cast)
model.graph.node.insert(ops["Softmax"].index, AttentionMask)
Add = onnx.helper.make_node(
"Add",
name="{}_Add".format(ops["Add"].node.output[0]),
inputs=[AttentionMask.output[0], ops["Add"].node.output[0]],
outputs=["{}_Add:0".format(ops["Add"].node.output[0])],
)
ops["Softmax"].node.input[0] = Add.output[0]
Mul = onnx.helper.make_node(
"Mul",
name="{}_Mul".format(ops["Softmax"].node.output[0]),
inputs=[AttentionMaskNot.output[0], ops["Softmax"].node.output[0]],
outputs=["{}_Mul".format(ops["Softmax"].node.output[0])],
)
ops["MatMul"].node.input[0] = Mul.output[0]
softmax_index, _ = self._find(
model.graph.node, lambda n: n.name == ops["Softmax"].node.name
)
model.graph.node.insert(softmax_index + 1, Mul)
model.graph.node.insert(softmax_index, Add)
for key in ("WhereV2_1", "WhereV2_2"):
model.graph.node.remove(ops[key].node)
ops = pattern_matcher.next_pattern(model.graph)
return model
def _add_unpack(self, model):
max_valid_num, segment_max_size, segment_num = 2 * 10, 384, 1
pattern = [
"s:0->Reshape:1->MatMul:2->Add:3->Split:4->e:5",
]
pattern_matcher = PatternMatcher(pattern)
ops = pattern_matcher.next_pattern(model.graph)
if ops:
unpack_info = onnx.helper.make_tensor_value_info(
"unpack_info",
onnx.TensorProto.INT32,
(max_valid_num, segment_num),
)
model.graph.input.append(unpack_info)
unpack_attributes = {
"max_valid_num": max_valid_num,
"segment_max_size": [segment_max_size],
}
Unpack = onnx.helper.make_node(
"Unpack",
name="Unpack",
inputs=[ops["1"].node.output[0], unpack_info.name],
outputs=["{}_Unpack:0".format(ops["1"].node.output[0])],
domain="ai.graphcore",
**unpack_attributes,
)
ops["2"].node.input[0] = Unpack.output[0]
model.graph.node.insert(ops["2"].index, Unpack)
return model
def _add_pack(self, model):
model = self._modify_mask_before_mul_to_input(model)
model = self._modify_attentionmask(model)
sorted_nodes = topological_sort(model.graph)
model.graph.ClearField("node")
for node in sorted_nodes:
model.graph.node.append(node)
return model
def __call__(self, model: onnx.ModelProto) -> onnx.ModelProto:
model = self._add_pack(model)
model = infer_shapes(clean_info(model))
model = self._add_unpack(model)
model = infer_shapes(clean_info(model))
return model
def run(self, onnx_model: onnx.ModelProto) -> onnx.ModelProto:
onnx_model.CopyFrom(self.traverse_graph(onnx_model.graph, self.__call__))
return onnx_model