<div align="center">
<img src="Graphcore-Chinese-Wordmark-Horizontal.svg">
</div>
[ [中文](README.zh_CN.md) ]
# Graphcore® C600
The Graphcore® C600 IPU-Processor PCIe Card is a high-performance acceleration server card targeted for machine learning inference and training. Powered by the Graphcore Mk2 IPU Processor with FP8 support, the C600 is a dual-slot, full height PCI Express Gen4 card designed for mounting in industry standard server chassis to accelerate machine intelligence workloads.
Up to eight C600 IPU-Processor PCIe Cards can be networked together using IPU-Link™ high-bandwidth interconnect cables, delivering enhanced IPU compute capability.
## Product Specs
| Name | Description |
| :-----| :-----|
| IPU Processor | Graphcore Mk2 IPU Processor with FP8 support |
| IPU-Cores™ | 1,472 IPU-Cores, each one a high-performance processor capable of multi-thread, independent code execution |
| In-Processor Memory™ | Each IPU-Core is paired with fast, local, tightly-coupled In-Processor Memory. The C600 accelerator includes 900MB of In-Processor Memory |
| Compute | Up to 560 teraFLOPS of FP8 compute <br> Up to 280 teraFLOPS of FP16 compute <br> Up to 70 teraFLOPS of FP32 compute |
| System Interface | Dual PCIe Gen4 8-lane interfaces |
| Thermal Solution | Passive |
| Form Factor | PCIe full-height/length; double-slot |
| System Dimensions | Length: 267mm (10.50”); Height: 111mm (4.37”); Width: 27.6mm (1.09”); Mass: 1.27kg (2.8lbs) |
| IPU-Link™ | 32 IPU-Link lanes providing 128 GB/s bandwidth (64 GB/s in each direction) |
| TDP | 185W |
| Auxiliary Power Supply | 8-pin |
| Quality Level | Server grade |
For more information about the Graphcore® C600, please refer to [C600 cards](https://docs.graphcore.ai/en/latest/hardware.html#c600-cards).
# PopRT
PopRT is a high-performance inference framework built specifically for Graphcore IPUs. It deeply optimizes trained models, generates executable programs that run on Graphcore IPUs, and performs low-latency, high-throughput inference.
You can get PopRT and its documentation from [graphcore/PopRT](https://graphcore.github.io/PopRT/1.4.0/).
Docker images are provided at [graphcorecn/poprt](https://hub.docker.com/r/graphcorecn/poprt).
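The snippet below is a minimal sketch of the convert/compile/run flow that this backend drives through the PopRT Python API (the full implementation lives in `compile_backend_ipu.py` and `engine_poprt.py`); the model path, input name and shape, and PopEF path are placeholders, and the runner is created with its default configuration.
```
# Minimal sketch of the PopRT flow used by this backend; paths, input names,
# and shapes below are placeholders.
import numpy as np
import onnx

from poprt import runtime
from poprt.compiler import Compiler, CompilerOptions
from poprt.converter import Converter

# 1. Convert: optimize the trained ONNX model for IPU inference (e.g. cast to FP16).
model = onnx.load("model.onnx")
converted = Converter(precision="fp16", input_shape={"input": [1, 3, 224, 224]}).convert(model)

# 2. Compile: build a PopEF executable targeting the attached IPU.
options = CompilerOptions()
options.ipu_version = runtime.DeviceManager().ipu_hardware_version()
outputs = [o.name for o in converted.graph.output]
Compiler.compile_and_export(converted.SerializeToString(), outputs, "executable.popef", options)

# 3. Run: execute the PopEF with the PopRT runtime.
runner = runtime.Runner("executable.popef")
inputs, results = {}, {}
for desc in runner.get_model_inputs():
    inputs[desc.name] = np.random.randn(*desc.shape).astype(desc.numpy_data_type())
for desc in runner.get_model_outputs():
    results[desc.name] = np.zeros(desc.shape, dtype=desc.numpy_data_type())
runner.execute(inputs, results)
```
The backend in this repository follows the same flow, additionally selecting FP8 or FP16 precision per model and, for the BERT-family models, compiling and serving them through PopRT's pack mode.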
# Models supported
| Model name | Precision | QPS | Dataset | Metric name | Metric value | report |
| ---- | ---- | ---- | ---- | ---- | ---- | ---- |
| albert-torch-fp32 | FP16 | 3,280 | Open Squad 1.1 | F1 Score | 87.69675 | [report](../../reports/IPU/albert-torch-fp32/) |
| bert-torch-fp32 | FP8 | 4,464 | Open Squad 1.1 | F1 Score | 85.71465 | [report](../../reports/IPU/bert-torch-fp32/) |
| bert-torch-fp32 | FP16 | 3,134 | Open Squad 1.1 | F1 Score | 85.85797 | [report](../../reports/IPU/bert-torch-fp32/) |
| clip-onnx-fp32 | FP16 | 7,305 | Fake Dataset | Mean Diff | 0.00426 | [report](../../reports/IPU/clip-onnx-fp32/) |
| conformer-encoder-onnx-fp32 | FP16 | 9,341 | Fake Dataset | Mean Diff | 0.00161 | [report](../../reports/IPU/conformer-encoder-onnx-fp32/) |
| deberta-torch-fp32 | FP16 | 1,702 | Open Squad 1.1 | F1 Score | 81.24629 | [report](../../reports/IPU/deberta-torch-fp32/) |
| resnet50-torch-fp32 | FP8 | 18,851 | Open Imagenet | Top-1 | 0.76824 | [report](../../reports/IPU/resnet50-torch-fp32/) |
| resnet50-torch-fp32 | FP16 | 13,499 | Open Imagenet | Top-1 | 0.76963 | [report](../../reports/IPU/resnet50-torch-fp32/) |
| roberta-torch-fp32 | FP16 | 3,088 | Open Squad 1.1 | F1 Score | 83.1606 | [report](../../reports/IPU/roberta-torch-fp32/) |
| roformer-tf-fp32 | FP16 | 2,520 | OPEN_CAIL2019 | Top-1 | 0.64323 | [report](../../reports/IPU/roformer-tf-fp32/) |
| swin-large-torch-fp32 | FP8 | 480 | Open Imagenet | Top-1 | 0.8552 | [report](../../reports/IPU/swin-large-torch-fp32/) |
| swin-large-torch-fp32 | FP16 | 315 | Open Imagenet | Top-1 | 0.8536 | [report](../../reports/IPU/swin-large-torch-fp32/) |
| videobert-onnx-fp32 | FP16 | 3,691 | OPEN_CIFAR | Top-1 | 0.6169 | [report](../../reports/IPU/videobert-onnx-fp32/) |
| widedeep-tf-fp32 | FP16 | 31,446,195 | Open Criteo Kaggle | Top-1 | 0.77392 | [report](../../reports/IPU/widedeep-tf-fp32/) |
# How to run
## Download and enable Poplar SDK
```
wget -O 'poplar_sdk-ubuntu_20_04-3.3.0-208993bbb7.tar.gz' 'https://downloads.graphcore.ai/direct?package=poplar-poplar_sdk_ubuntu_20_04_3.3.0_208993bbb7-3.3.0&file=poplar_sdk-ubuntu_20_04-3.3.0-208993bbb7.tar.gz'
tar xzf poplar_sdk-ubuntu_20_04-3.3.0-208993bbb7.tar.gz
source poplar_sdk-ubuntu_20_04-3.3.0+1403-208993bbb7/enable
```
## Start PopRT docker container
```
docker pull graphcorecn/poprt:1.4.0
gc-docker -- -it \
-v `pwd -P`:/workspace \
-w /workspace \
--entrypoint /bin/bash \
graphcorecn/poprt:1.4.0
```
## Install dependencies in docker container
```
apt-get update && \
apt-get install wget libglib2.0-0 -y
```
## Run byte-mlperf task
For example,
```
python3 launch.py --task widedeep-tf-fp32 --hardware IPU
```
For more information about the command used to run the task, please refer to [ByteMLPerf](../../../README.md#usage).
<div align="center">
<img src="Graphcore-Chinese-Wordmark-Horizontal.svg">
</div>
[ [English](README.md) ]
# Graphcore® C600
The C600 is a high-end accelerator card built by Graphcore for cloud and data-center deployments. It is designed primarily for inference while also supporting training, serves mainstream AI applications, and is particularly well suited to workloads such as search and recommendation. The C600 delivers low latency and high throughput without sacrificing accuracy, helping AI developers resolve the trade-off between precision and speed and offering a new path to unlock the compute power of the IPU for AI applications, meeting the strong demand from customers and machine-intelligence practitioners for inference products that are easy to use, efficient, and offer a better TCO.
The C600 is a dual-slot PCIe Gen 4 card with a single IPU. Each IPU has 1,472 processing cores and can run 8,832 independent program threads in parallel, and each IPU carries 900MB of on-chip SRAM. Up to eight cards can be directly connected in a single chassis and bridged through high-bandwidth IPU-Links. The C600 can be used with mainstream AI servers on the market, such as the Inspur NF5468M6.
## Product Specs
| Name | Description |
| :-----| :-----|
| **IPU Processor** | Graphcore® Mk2 IPU Processor with FP8 support |
| **IPU-Cores™** | 1,472 IPU-Cores, each one a high-performance processor capable of multi-thread, independent code execution |
| **In-Processor Memory™** | Each IPU-Core is paired with fast, tightly coupled local In-Processor Memory <br> The C600 accelerator includes 900MB of In-Processor Memory |
| **Compute** | Up to 560 teraFLOPS of FP8 compute <br> Up to 280 teraFLOPS of FP16 compute |
| **System Interface** | Two 8-lane ports bifurcated from a 16-lane PCIe interface |
| **Thermal Solution** | Passive |
| **Form Factor** | PCIe full-height/full-length; double-slot |
| **Dimensions** | Length: 267mm (10.5") <br> Height: 111mm (4.37") <br> Width: 27.6mm (1.09") |
| **Mass** | 1.27kg (2.8lbs) |
| **IPU-Link™ Support** | 64 lanes, 256GB/s of dual IPU-Links |
| **TDP** | 185W |
| **Auxiliary Power Supply** | 8-pin |
| **Quality Level** | Server grade |
For more information about the Graphcore® C600, please visit the [Graphcore website](https://www.graphcore.cn/c600-pcie%e5%8d%a1/).
# Graphcore® PopRT
PopRT is a high-performance inference engine for IPU processors. It takes a trained and exported model, applies deep compile-time optimizations for inference, generates a PopEF executable that can run on the IPU, and provides a flexible runtime for low-latency, high-throughput inference on the PopEF.
PopRT provides easy-to-integrate Python and C++ APIs; ByteMLPerf models are optimized, compiled, and run on the IPU through the PopRT Python API.
For more information about PopRT, please visit the [PopRT User Guide](https://graphcore.github.io/PopRT/1.4.0/).
To get PopRT Docker images, please visit [graphcorecn/poprt](https://hub.docker.com/r/graphcorecn/poprt).
# Models supported
| Model name | Precision | QPS | Dataset | Metric name | Metric value | report |
| ---- | ---- | ---- | ---- | ---- | ---- | ---- |
| albert-torch-fp32 | FP16 | 3,280 | Open Squad 1.1 | F1 Score | 87.69675 | [report](../../reports/IPU/albert-torch-fp32/) |
| bert-torch-fp32 | FP8 | 4,464 | Open Squad 1.1 | F1 Score | 85.71465 | [report](../../reports/IPU/bert-torch-fp32/) |
| bert-torch-fp32 | FP16 | 3,134 | Open Squad 1.1 | F1 Score | 85.85797 | [report](../../reports/IPU/bert-torch-fp32/) |
| clip-onnx-fp32 | FP16 | 7,305 | Fake Dataset | Mean Diff | 0.00426 | [report](../../reports/IPU/clip-onnx-fp32/) |
| conformer-encoder-onnx-fp32 | FP16 | 9,341 | Fake Dataset | Mean Diff | 0.00161 | [report](../../reports/IPU/conformer-encoder-onnx-fp32/) |
| deberta-torch-fp32 | FP16 | 1,702 | Open Squad 1.1 | F1 Score | 81.24629 | [report](../../reports/IPU/deberta-torch-fp32/) |
| resnet50-torch-fp32 | FP8 | 18,851 | Open Imagenet | Top-1 | 0.76824 | [report](../../reports/IPU/resnet50-torch-fp32/) |
| resnet50-torch-fp32 | FP16 | 13,499 | Open Imagenet | Top-1 | 0.76963 | [report](../../reports/IPU/resnet50-torch-fp32/) |
| roberta-torch-fp32 | FP16 | 3,088 | Open Squad 1.1 | F1 Score | 83.1606 | [report](../../reports/IPU/roberta-torch-fp32/) |
| roformer-tf-fp32 | FP16 | 2,520 | OPEN_CAIL2019 | Top-1 | 0.64323 | [report](../../reports/IPU/roformer-tf-fp32/) |
| swin-large-torch-fp32 | FP8 | 480 | Open Imagenet | Top-1 | 0.8552 | [report](../../reports/IPU/swin-large-torch-fp32/) |
| swin-large-torch-fp32 | FP16 | 315 | Open Imagenet | Top-1 | 0.8536 | [report](../../reports/IPU/swin-large-torch-fp32/) |
| videobert-onnx-fp32 | FP16 | 3,691 | OPEN_CIFAR | Top-1 | 0.6169 | [report](../../reports/IPU/videobert-onnx-fp32/) |
| widedeep-tf-fp32 | FP16 | 31,446,195 | Open Criteo Kaggle | Top-1 | 0.77392 | [report](../../reports/IPU/widedeep-tf-fp32/) |
# How to run
## Download and enable Poplar SDK
```
wget -O 'poplar_sdk-ubuntu_20_04-3.3.0-208993bbb7.tar.gz' 'https://downloads.graphcore.ai/direct?package=poplar-poplar_sdk_ubuntu_20_04_3.3.0_208993bbb7-3.3.0&file=poplar_sdk-ubuntu_20_04-3.3.0-208993bbb7.tar.gz'
tar xzf poplar_sdk-ubuntu_20_04-3.3.0-208993bbb7.tar.gz
source poplar_sdk-ubuntu_20_04-3.3.0+1403-208993bbb7/enable
```
## Start PopRT Docker container
```
docker pull graphcorecn/poprt:1.4.0
gc-docker -- -it \
-v `pwd -P`:/workspace \
-w /workspace \
--entrypoint /bin/bash \
graphcorecn/poprt:1.4.0
```
## Install ByteMLPerf dependencies
```
apt-get update && \
apt-get install wget libglib2.0-0 -y
```
## Run a ByteMLPerf task
Run it with the following command:
```
python3 launch.py --task widedeep-tf-fp32 --hardware IPU
```
For more information about the ByteMLPerf run commands, please refer to [ByteMLPerf](../../../README.zh_CN.md#usage).
# Copyright 2023 Graphcore Ltd.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
import sys
current_dir = os.path.split(os.path.abspath(__file__))[0]
byte_mlperf_dir = current_dir.rsplit("/", 2)[0]
sys.path.append(byte_mlperf_dir)
# Copyright 2023 Graphcore Ltd.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import copy
import json
import logging
import os
from pathlib import Path
from typing import Any, Dict
import onnx
import poprt
from poprt import runtime
from poprt.compiler import Compiler, CompilerOptions
from poprt.converter import Converter
from tools import saved_to_onnx, torch_to_onnx
from general_perf.backends import compile_backend
from general_perf.backends.IPU.passes import *
log = logging.getLogger("CompileBackendIPU")
class CompileBackendIPU(compile_backend.CompileBackend):
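    """ByteMLPerf compile backend for the Graphcore IPU, built on top of PopRT."""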
def __init__(self):
super(CompileBackendIPU, self).__init__()
self.hardware_type = "IPU"
self.need_reload = False
self.model_runtimes = []
self.current_dir = os.path.split(os.path.abspath(__file__))[0]
self.model_config = None
self.packrunner = False
self.precision = "fp32"
def version(self) -> str:
"""Return compile backend version details."""
return poprt.__version__
def pre_optimize(self, configs: Dict[str, Any]):
"""Model pre-optimization interface.
Requirements: Model pre-optimization
cannot change the model format. Torch model export to ONNX is allowed.
"""
self._update_model_config(configs.get("interact_info", {}))
self.precision = (
self.model_config.get("converter_options", {})
.get("precision", "FP32")
.upper()
)
if self.model_config.get("pack_config"):
self.packrunner = True
model_info = configs["model_info"]
model_type = model_info["model_format"]
model_name = model_info["model"]
pre_optimized_root = Path(self.current_dir) / "pre_optimized_models"
if not pre_optimized_root.exists():
pre_optimized_root.mkdir(parents=True)
model_path = os.path.abspath(configs["model_info"]["model_path"])
onnx_path = pre_optimized_root / (model_name + ".onnx")
if not self.model_config:
self.model_config = configs.get("interact_info", {})
        # convert the model to ONNX if it is not already
        # configs['workload'] is the content of workloads/<task_name>.json and
        # configs['model_info'] is the content of model_zoo/<task_name>.json
if model_type != "onnx":
if onnx_path.exists():
onnx_path = self._update_pack_model(onnx_path, model_info)
model_info["model_path"] = onnx_path
log.info("{} file exists, skip ONNX conversion".format(onnx_path.name))
else:
# convert the model to onnx
log.info(
"Convert the model: {} from format: {} to onnx".format(
model_name, model_type
)
)
if model_type == "saved_model":
saved_to_onnx.savedmodel_to_onnx(model_path, onnx_path)
onnx_path = self._update_pack_model(onnx_path, model_info)
elif model_type == "pt":
torch_to_onnx.torch_to_onnx(model_path, str(onnx_path))
onnx_path = self._update_pack_model(onnx_path, model_info)
else:
log.error(
"Wrong model type: {}, which must be saved_model, pt, or onnx".format(
model_type
)
)
raise TypeError("Model type must be saved_model, pt, or onnx")
if os.path.exists(onnx_path):
model_info["model_path"] = onnx_path
log.info(
"Converted the model: {} from format: {} to onnx".format(
model_name, model_type
)
)
else:
log.error(
"{} not exists, failed to convert the model: {} to onnx".format(
onnx_path, model_name
)
)
raise RuntimeError("Failed to convert model to onnx")
else:
log.info("{} is onnx model, skip ONNX conversion".format(model_name))
return configs
def compile(self, config, dataloader=None):
self.model_info = config["model_info"]
if not self.model_config:
self.model_config = config["interact_info"]
# precision not in model_config (multiple precisions available) and user
# skipped precision selection prompt
if not self.precision and not config["interact_info"].get("precision"):
self.precision = "fp16"
if "converter_options" not in self.model_config:
self.model_config["converter_options"] = {}
self.model_config["converter_options"]["precision"] = self.precision
if self.model_config.get("pack_config"):
self.packrunner = True
log.info("The interaction info is:\n {}".format(self.model_config))
result = {
"model": config["model_info"]["model"],
"framework": config["model_info"]["framework"],
"compile_precision": self.precision,
"input_type": config["model_info"]["input_type"].split(","),
"max_batch_size": config["model_info"]["max_batch_size"],
"compile_status": "success",
"optimizations": {},
"instance_count": 1,
"device_count": 1,
"sg_percent": 100,
"segments": [
{
"sg_idx": 0,
"is_fallback": False,
"input_tensor_map": config["model_info"]["input_shape"],
"output_tensor_map": config["model_info"]["outputs"],
"compiled_model": [
{
"compiled_bs": 1,
"compiled_obj": config["model_info"]["model_path"],
},
],
},
],
"interact_info": self.model_config,
}
        # The pack runner takes single samples as input and performs batching
        # asynchronously inside itself.
        # pack_config["batch_size"] is the batch size used to compile the packed model,
        # while the regular "batch_size" is the dataloader/inference batch size and should
        # always be 1, since the pack runner only takes a single sample at a time.
if self.packrunner:
pack_config = self.model_config["pack_config"]
            assert (
                "batch_size" in pack_config
            ), "pack mode requires pack_config['batch_size'] as the compile batch size of the packed model"
            assert isinstance(
                pack_config["batch_size"], int
            ), "pack_config['batch_size'] should be a positive integer"
compile_bs = [pack_config["batch_size"]]
else:
compile_bs = config["workload"]["batch_sizes"]
for batch_size in compile_bs:
self._compile(batch_size)
return result
def get_interact_profile(self, config):
"""Collect information for core engine to let user interactively fill in configurations."""
model_profile = []
# load the interact_info by model name
interact_info_file = os.path.join(
self.current_dir, "interact_infos", config["model_info"]["model"] + ".json"
)
file_path = os.path.join(self.current_dir, self.hardware_type + ".json")
with open(file_path, "r") as f:
interact_info = json.load(f)
if os.path.exists(interact_info_file):
            # the model has its own config file, but it may not provide all required options
with open(interact_info_file, "r") as f:
self.model_config = json.load(f)
log.info("interact_info set by file: {}".format(interact_info_file))
if not self.model_config.get("converter_options", {}).get("precision"):
for _, v in enumerate(interact_info):
if v["name"] == "precision":
model_profile.append(v)
else:
file_path = os.path.join(self.current_dir, self.hardware_type + ".json")
if os.path.exists(file_path):
with open(file_path, "r") as f:
model_profile = json.load(f)
else:
log.info("File path: {} does not exist, please check".format(file_path))
return model_profile
def get_best_batch_size(self):
"""Get Best Batch Size for the model.
Usually take the max batch size can be loaded to IPU as the best batch size to
get highest throughput.
"""
return self.model_config.get("batch_sizes", None)
def _update_model_config(self, interact_info):
# update poprt configuration based on interact_info
if not self.model_config:
self.model_config = {}
self.model_config["converter_options"] = interact_info.get(
"converter_options", {}
)
self.model_config["clients"] = int(interact_info.get("clients", "1"))
        batch_sizes = [
            x.strip()
            for x in interact_info.get("batch_sizes", "").split(",")
            if x.strip()
        ]
        if batch_sizes:
            self.model_config["batch_sizes"] = [int(x) for x in batch_sizes]
self.model_config["compiler_options"] = json.loads(
interact_info.get("compiler_options", "{}")
)
self.model_config["clients"] = int(self.model_config.get("clients", "1"))
        batch_sizes = self.model_config.get("batch_sizes", "")
        if isinstance(batch_sizes, str) and batch_sizes:
            self.model_config["batch_sizes"] = [
                int(x.strip()) for x in batch_sizes.split(",") if x.strip().isdigit()
            ]
for key, value in self.model_config.items():
if "_options" in key and isinstance(value, str):
self.model_config[key] = json.loads(value)
if interact_info.get("precision"):
self.model_config["converter_options"]["precision"] = interact_info[
"precision"
]
            # update the converter config when the user selected fp8 in the interact
            # section and the interact_info config file contains fp8_configs
if interact_info["precision"] == "fp8" and self.model_config.get(
"fp8_configs"
):
for config_name, config_section in self.model_config[
"fp8_configs"
].items():
if isinstance(self.model_config[config_name], dict):
self.model_config[config_name].update(config_section)
else:
self.model_config[config_name] = config_section
del self.model_config["fp8_configs"]
def _compile(self, batch_size):
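        """Compile the ONNX model into a PopEF executable for the given batch size."""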
self.batch_size = batch_size
# differentiate popef based on precision
self.popef_path = os.path.join(
self.current_dir,
"compiled_models",
self.model_info["model"],
str(batch_size),
"executable_{}.popef".format(self.precision),
)
self.popef_path = os.path.abspath(self.popef_path)
if os.path.exists(self.popef_path):
log.info(
"The PopEF file {} already exist, skip compile".format(
os.path.abspath(self.popef_path)
)
)
return self.popef_path
log.info("Create the directory {}".format(os.path.dirname(self.popef_path)))
os.makedirs(os.path.dirname(self.popef_path), exist_ok=True)
converter_options = self.model_config.get("converter_options", {})
compiler_options = self.model_config.get("compiler_options", {})
converted_model = self._convert(converter_options)
self._poprt_compile(converted_model, compiler_options, self.popef_path)
return self.popef_path
def _convert(self, converter_options: Dict) -> onnx.ModelProto:
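        """Apply the batch size to the model input shapes and run the PopRT Converter."""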
model_proto = onnx.load(self.model_info["model_path"])
input_shape = {}
not_extended_with_batch = self.model_config.get("not_extended_with_batch", [])
for name, shape in self.model_info["input_shape"].items():
if name in not_extended_with_batch:
batched_shape = [shape[0]] + shape[1:]
elif name == "text" and "videobert" in self.model_info["model"]:
batched_shape = [shape[0]] + shape[1:]
else:
batched_shape = [shape[0] * self.batch_size] + shape[1:]
log.info(
"The model input {} with shape {} in the model information, and shape with batch size is {}.".format(
name, shape, batched_shape
)
)
input_shape[name] = batched_shape
converter_options["input_shape"] = input_shape
converter = Converter(**converter_options)
converted_model = converter.convert(model_proto)
return converted_model
def _poprt_compile(
self, converted_model: onnx.ModelProto, compiler_options: dict, popef_path: str
):
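        """Compile the converted ONNX model to a PopEF file with the given compiler options."""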
options = CompilerOptions()
options.ipu_version = runtime.DeviceManager().ipu_hardware_version()
options.num_io_tiles = compiler_options.get("num_iotiles", 0)
options.batches_per_step = compiler_options.get("batches_per_step", 1)
options.enable_prefetch_datastreams = compiler_options.get(
"enable_prefetch_datastreams", False
)
options.stream_buffering_depth = compiler_options.get(
"stream_buffering_depth", 1
)
options.available_memory_proportion = compiler_options.get(
"available_memory_proportion", 0.6
)
options.partials_type = compiler_options.get("partials_type", "half")
options.use_128bit_conv_unit_load = compiler_options.get(
"use_128bit_conv_unit_load", False
)
options.enable_fast_reduce = compiler_options.get("enable_fast_reduce", False)
options.group_host_sync = compiler_options.get("group_host_sync", False)
options.rearrange_anchors_on_host = compiler_options.get(
"rearrange_anchors_on_host", False
)
options.enable_outlining = compiler_options.get("enable_outlining", True)
options.outline_threshold = compiler_options.get("outline_threshold", 1.0)
outputs = [o.name for o in converted_model.graph.output]
Compiler.compile_and_export(
converted_model.SerializeToString(), outputs, popef_path, options
)
return popef_path
def _update_pack_model(self, model_path, model_info):
"""bert like model conversion for pack mode, update corresponded configs as well."""
if not self.packrunner:
return model_path
model = onnx.load(model_path)
        # update the actual model_path to point at the packed model
model_path = (
Path(self.current_dir)
/ "pre_optimized_models"
/ (Path(model_path).stem + "_pack.onnx")
)
assert "input_shape" in model_info
assert "inputs" in model_info
assert "dataset_name" in model_info
assert "input_type" in model_info
model_info["inputs"] += ",position_ids"
model_info["input_type"] += ",LONG"
model_info["model_path"] = str(model_path)
if "deberta" in model_info["model"]:
model_info["input_shape"]["unpack_info"] = [1, 1]
else:
model_info["input_shape"]["position_ids"] = [1, 384]
self.model_info = model_info
if self.model_info["model"] == "roberta-torch-fp32":
rm_node_names = [
"/model/roberta/embeddings/Equal",
"/model/roberta/embeddings/Not",
"/model/roberta/embeddings/Cast",
"/model/roberta/embeddings/CumSum",
"/model/roberta/embeddings/Mul",
"/model/roberta/embeddings/Cast_1",
]
rm_nodes = []
for node in model.graph.node:
if node.name in rm_node_names:
rm_nodes.append(node)
assert len(rm_node_names) == len(rm_nodes)
for node in rm_nodes:
model.graph.node.remove(node)
position_ids = copy.deepcopy(model.graph.input[0])
position_ids.name = "position_ids"
model.graph.input.append(position_ids)
for node in model.graph.node:
if (
node.op_type == "Add"
and node.name == "/model/roberta/embeddings/Add"
):
node.input[0] = position_ids.name
elif self.model_info["model"] in ("bert-torch-fp32", "albert-torch-fp32"):
            # for packed bert, we need to expose position_ids as a model input
            # step 1: remove unneeded nodes
model_name = "albert" if "albert" in model_path.name else "bert"
rm_node_names = [
"/model/{0}/embeddings/Shape".format(model_name),
"/model/{0}/embeddings/Gather".format(model_name),
"/model/{0}/embeddings/Unsqueeze".format(model_name),
"/model/{0}/embeddings/Slice".format(model_name),
]
rm_nodes = []
for node in model.graph.node:
if node.name in rm_node_names:
rm_nodes.append(node)
assert len(rm_node_names) == len(rm_nodes)
for node in rm_nodes:
model.graph.node.remove(node)
# step 2: add position_ids to model's input
position_ids = copy.deepcopy(model.graph.input[0])
position_ids.name = "position_ids"
model.graph.input.append(position_ids)
for node in model.graph.node:
if (
node.op_type == "Gather"
and node.name
== "/model/{0}/embeddings/position_embeddings/Gather".format(
model_name
)
):
node.input[1] = position_ids.name
print("Save preprocessed model to {}".format(model_path))
onnx.save(model, model_path)
return model_path
# Copyright 2023 Graphcore Ltd.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
class Engine:
    def __init__(self):
raise NotImplementedError
def predict(self, feeds):
raise NotImplementedError
def benchmark(self, clients, batch_size, iterations):
raise NotImplementedError
# Copyright 2023 Graphcore Ltd.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import logging
import random
import threading as th
import time
from queue import Queue
import numpy as np
import torch
from poprt import runtime
from . import engine
log = logging.getLogger("engine_poprt")
class PopRT(engine.Engine):
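    """Engine that executes a compiled PopEF through the PopRT runtime Runner (optionally in pack mode)."""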
def __init__(self, popef_path, config):
self.runner = runtime.Runner(popef_path, config)
        self.packrunner = isinstance(config, runtime.PackRunnerConfig)
def predict(self, feeds):
input_descriptions = self.runner.get_model_inputs()
for desc in input_descriptions:
if isinstance(feeds[desc.name], list):
feeds[desc.name] = np.array(
feeds[desc.name], dtype=desc.numpy_data_type()
)
elif isinstance(feeds[desc.name], np.ndarray):
feeds[desc.name] = feeds[desc.name].astype(desc.numpy_data_type())
elif isinstance(feeds[desc.name], torch.Tensor):
feeds[desc.name] = (
feeds[desc.name].numpy().astype(desc.numpy_data_type())
)
else:
raise TypeError(
"The feeds[value] must be list, np.ndarray or torch.Tensor"
)
# create the output numpy arrays
output_descriptions = self.runner.get_model_outputs()
results = {}
for output_desc in output_descriptions:
output_shape = output_desc.shape
results[output_desc.name] = np.zeros(
output_shape, dtype=output_desc.numpy_data_type()
)
if self.packrunner:
fut = self.runner.executeAsync(dict(feeds), dict(results))
fut.wait()
else:
self.runner.execute(feeds, results)
return results
def benchmark(self, clients, batch_size, iterations):
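        """Measure QPS and latency using `clients` threads, each running `iterations` executions."""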
input_view = runtime.InputMemoryView()
input_descriptions = self.runner.get_model_inputs()
output_descriptions = self.runner.get_model_outputs()
inputs = {}
outputs = {}
for input_desc in input_descriptions:
inputs[input_desc.name] = np.random.randn(*input_desc.shape).astype(
input_desc.numpy_data_type()
)
for output_desc in output_descriptions:
outputs[output_desc.name] = np.zeros(
output_desc.shape, dtype=output_desc.numpy_data_type()
)
log.info("Warm up")
for _ in range(5):
self.runner.execute(inputs, outputs)
log.info("Warm up completed, start the time counting")
q = Queue()
def perf_count(model_runner, iteration, input_view):
durations = []
for _ in range(iteration):
start_time = time.time()
self.runner.execute(inputs, outputs)
end_time = time.time()
durations.append((start_time, end_time))
# remove the first and last 20
if iteration > 40:
durations = durations[20:-20]
q.put(durations, timeout=10)
thp = [
th.Thread(target=perf_count, args=(self.runner, iterations, input_view))
for _ in range(clients)
]
for t in thp:
t.start()
for t in thp:
t.join()
durations_from_th = []
while not q.empty():
durations_from_th += q.get()
max_timestamp = max(y for _, y in durations_from_th)
min_timestamp = min(x for x, _ in durations_from_th)
if iterations > 40:
            iterations -= 40  # account for the first/last 20 iterations dropped in perf_count
qps = clients * batch_size * iterations / (max_timestamp - min_timestamp)
times_range = [y - x for x, y in durations_from_th]
times_range.sort()
tail_latency = round(times_range[int(len(times_range) * 0.99)] * 1000, 2)
avg_latency = round(sum(times_range) / len(times_range) * 1000, 2)
log.info(
"Batch size is {}, QPS: {}, Avg Latency:{}, Tail Latency:{}".format(
batch_size, int(qps), avg_latency, tail_latency
)
)
np_latency = np.array(times_range) * 1000.0
log.info(
f"====== Latency P50: {np.percentile(np_latency, 50)}, P90: {np.percentile(np_latency, 90)}, P99: {np.percentile(np_latency, 99)} ======"
)
return qps, avg_latency, tail_latency
def benchmark_pack(self, pack_config, iterations):
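        """Benchmark pack mode: each client thread submits single variable-length samples for the runner to pack."""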
output_descriptions = self.runner.get_model_outputs()
outputs = {}
for output_desc in output_descriptions:
shape = output_desc.shape
shape[0] = 1
outputs[output_desc.name] = np.zeros(
shape, dtype=output_desc.numpy_data_type()
)
# average sequence length in squad is ~172
avg_len = 172
max_valid_seq = 384
bs = pack_config.get("batch_size", 20)
sample_num = iterations * bs
input_len = np.random.normal(avg_len, avg_len, size=sample_num).astype(np.int32)
input_len = np.clip(input_len, 1, max_valid_seq)
datasets = []
for s_len in input_len:
sample = {}
            # setting the values to 1 does not affect performance; in pack mode the
            # attention_mask is required to be 1
for input_name in pack_config["input_names"]:
sample[input_name] = np.ones(s_len).astype(np.int32)
datasets.append(sample)
        # each client sends a single sample; one pack batch can hold more than 2*bs samples
clients = int(bs * 3.5)
count_percent = 0.6
q = Queue()
def perf_count(model_runner, iteration):
durations = []
for i in range(sample_num):
start_time = time.time()
sample_idx = random.randint(0, sample_num - 1)
self.runner.execute(datasets[sample_idx], outputs)
end_time = time.time()
durations.append((start_time, end_time))
            # drop the timings of the first and last samples
ignored_samples = int(sample_num * (1 - count_percent) / 2)
durations = durations[ignored_samples:-ignored_samples]
q.put(durations, timeout=10)
thp = [
th.Thread(target=perf_count, args=(self.runner, iterations))
for _ in range(clients)
]
for t in thp:
t.start()
for t in thp:
t.join()
durations_from_th = []
while not q.empty():
durations_from_th += q.get()
max_timestamp = max(y for _, y in durations_from_th)
min_timestamp = min(x for x, _ in durations_from_th)
        # only count_percent of the samples are counted, matching the trimming in perf_count
qps = clients * (sample_num * count_percent) / (max_timestamp - min_timestamp)
times_range = [y - x for x, y in durations_from_th]
times_range.sort()
tail_latency = round(times_range[int(len(times_range) * 0.99)] * 1000, 2)
avg_latency = round(sum(times_range) / len(times_range) * 1000, 2)
log.info(
"Batch size is {}, QPS: {}, Avg Latency:{}, Tail Latency:{}".format(
bs, int(qps), avg_latency, tail_latency
)
)
np_latency = np.array(times_range) * 1000.0
log.info(
f"====== Latency P50: {np.percentile(np_latency, 50)}, P90: {np.percentile(np_latency, 90)}, P99: {np.percentile(np_latency, 99)} ======"
)
return qps, avg_latency, tail_latency
{
"clients": 1,
"batch_sizes": [
1
],
"pack_config": {
"batch_size": 40,
"input_names": ["input_ids.1", "attention_mask.1", "token_type_ids.1"],
"dynamic_input_name" : "input_ids.1",
"mask_name": "attention_mask.1",
"max_pack_num": 100,
"timeout_microseconds": 15000
},
"converter_options": {
"used_passes": [
"insert_attention_mask"
],
"precision": "fp16",
"disable_fast_norm": true
},
"compiler_options": {
"available_memory_proportion": 0.5
}
}
{
"batch_sizes": [
1
],
"pack_config": {
"batch_size": 40,
"input_names": [
"input_ids.1",
"attention_mask.1",
"token_type_ids.1"
],
"dynamic_input_name": "input_ids.1",
"mask_name": "attention_mask.1",
"max_pack_num": 100,
"timeout_microseconds": 15000
},
"converter_options": {
"used_passes": [
"insert_attention_mask"
],
"disable_fast_norm": true,
"enable_insert_remap": false,
"precision": ""
},
"compiler_options": {
"available_memory_proportion": 0.4
},
"fp8_configs": {
"pack_config": {
"batch_size": 45,
"max_pack_num": 120
},
"compiler_options": {
"available_memory_proportion": 0.6
},
"converter_options": {
"fp8_params": "F143,F143,0,0",
"fp8_skip_op_names": "/model/bert/embeddings/word_embeddings/Gather"
}
}
}
{
"batch_sizes": [4,40],
"clients":3,
"converter_options":{
"precision": "fp16",
"infer_shape_ahead": true
},
"compiler_options": {
"num_iotiles": 32,
"batches_per_step": 128,
"enable_prefetch_datastreams": true,
"use_128bit_conv_unit_load": true,
"stream_buffering_depth": 2,
"enable_fast_reduce": true,
"rearrange_anchors_on_host": true,
"group_host_sync": true,
"enable_outlining": true,
"outline_threshold": 5000,
"available_memory_proportion": 0.8
}
}
{
"clients": 4,
"batch_sizes": [
4, 44
],
"converter_options": {
"precision": "fp16",
"enable_insert_remap": true
},
"compiler_options": {
"use_128bit_conv_unit_load": true,
"enable_fast_reduce": true,
"num_iotiles":32,
"batches_per_step":128,
"enable_prefetch_datastreams":true,
"stream_buffering_depth":2
}
}
{
"clients": 1,
"batch_sizes": [
1
],
"pack_config": {
"batch_size": 10,
"input_names": [
"input_ids.1",
"attention_mask.1"
],
"dynamic_input_name": "input_ids.1",
"mask_name": "attention_mask.1",
"max_pack_num": 20,
"timeout_microseconds": 15000
},
"converter_options": {
"disable_fast_norm": true,
"enable_insert_remap": true,
"remap_mode": ["after_matmul"," before_add"],
"max_tensor_size": 6291456,
"precision": "fp16",
"used_passes": ["deberta_pack"]
},
"compiler_options": {
"use_128bit_conv_unit_load": true,
"enable_fast_reduce": true,
"group_host_sync": true,
"available_memory_proportion": 0.4
}
}
{
"clients": 4,
"batch_sizes": [
4,
32
],
"converter_options": {
"precision": ""
},
"compiler_options": {
"num_iotiles": 64,
"batches_per_step": 128,
"enable_prefetch_datastreams": true,
"stream_buffering_depth": 2
},
"fp8_configs": {
"converter_options": {
"fp8_params": "F143,F143,-3,-3",
"fp8_skip_op_names": "/layer1/0/downsample/0/Conv"
}
}
}
{
"clients": 1,
"batch_sizes": [
1
],
"pack_config": {
"batch_size": 40,
"input_names": ["input_ids.1", "attention_mask.1", "token_type_ids.1"],
"dynamic_input_name" : "input_ids.1",
"mask_name": "attention_mask.1",
"max_pack_num": 100,
"timeout_microseconds": 15000
},
"converter_options": {
"used_passes": [
"insert_attention_mask"
],
"precision": "fp16",
"disable_fast_norm": true
},
"compiler_options": {
"available_memory_proportion": 0.4
}
}
{
"clients": 1,
"batch_sizes": [
2, 12
],
"converter_options": {
"precision": "fp16",
"used_passes": [
"pre_scale",
"remove_input_cast",
"matmul_rotary_embedding",
"fused_attention",
"replace_groupnorm_with_fast_norm"
],
"disable_fast_norm": true
},
"compiler_options": {
"use_128bit_conv_unit_load": true,
"enable_fast_reduce": true
}
}
{
"clients": 4,
"batch_sizes": [
2
],
"converter_options": {
"precision": "",
"enable_insert_remap": false
},
"compiler_options": {
"batches_per_step": 128,
"enable_prefetch_datastreams": true,
"use_128bit_conv_unit_load": true,
"group_host_sync": false,
"stream_buffering_depth": 2,
"enable_fast_reduce": true,
"rearrange_anchors_on_host": false,
"available_memory_proportion": 0.2
},
"fp8_configs": {
"converter_options": {
"enable_insert_remap": true
},
"batch_sizes": [
4
]
}
}
{
"batch_sizes": [4, 42],
"clients":3,
"converter_options":{
"precision": "fp16"
},
"compiler_options": {
"batches_per_step": 128,
"enable_prefetch_datastreams": true,
"use_128bit_conv_unit_load": true,
"stream_buffering_depth": 2,
"enable_fast_reduce": true,
"rearrange_anchors_on_host": false,
"enable_outlining": true,
"outline_threshold": 5000,
"available_memory_proportion": 0.2
}
}
{
"batch_sizes": [1024, 20000],
"clients":5,
"converter_options":{
"precision": "fp16"
},
"compiler_options": {
"num_iotiles": 32,
"batches_per_step": 1024,
"enable_prefetch_datastreams": true,
"use_128bit_conv_unit_load": true,
"stream_buffering_depth": 2,
"enable_fast_reduce": true,
"rearrange_anchors_on_host": false
}
}
from . import custom_final_check # noqa
from . import deberta_pack # noqa
# Copyright (c) 2023 Graphcore Ltd. All rights reserved.
from typing import Any, Dict
import onnx
import poprt
from poprt.passes.apply_ir_pass import ApplyIrPass
from poprt.passes.base_pass import ImmutablePass
from poprt.passes.onnx_helper import clean_info
from poprt.passes.remove_duplicated_initializer import RemoveDuplicatedInitializer
# skip register here
# @register('final_check')
class CustomFinalCheck(ImmutablePass):
"""Final check for data type and shape of the converted model."""
name = 'final_check'
def run_transform(
self, graph: onnx.GraphProto, is_main_graph: bool
) -> onnx.GraphProto:
# Check if all tensors have valid dtype and shape
output_tensors = []
for n in graph.node:
output_tensors.extend(n.output)
output_tensors = set(output_tensors)
for t in list(graph.value_info) + list(graph.output):
tensor_type = t.type.tensor_type
has_dtype = tensor_type.HasField("elem_type")
has_shape = (
tensor_type.HasField("shape")
and len(tensor_type.shape.ListFields()) > 0
)
if has_dtype and has_shape:
dtype = tensor_type.elem_type
# If the dtype < 1 (onnx.TensorProto.FLOAT) or dtype > 16 (onnx.TensorProto.BFLOAT16),
# the dtype is invalid.
is_valid_dtype = (
dtype >= onnx.TensorProto.FLOAT
and dtype <= onnx.TensorProto.BFLOAT16
)
shape = [dim.dim_value for dim in tensor_type.shape.dim]
is_valid_shape = 0 not in shape
if (not is_valid_dtype) or (not is_valid_shape):
self.logger.warning(
f"{t.name} has no inferred elem_type {dtype} or shape {shape}"
)
if t.name in output_tensors:
output_tensors.remove(t.name)
for t_name in output_tensors:
self.logger.warning(
f"Graph {graph.name} tensor {t_name} has no elem_type or shape."
)
return graph
def run(self, model: onnx.ModelProto) -> onnx.ModelProto:
# NOTE: skip SortGraph here
# Ensure topological for subgraph
# model = SortGraph().run(model)
# Infer shape and dtype to make sure all passes process validly.
model = clean_info(model)
# Remove duplicated initializer
model = RemoveDuplicatedInitializer().run(model)
model.graph.CopyFrom(self.traverse_graph(model.graph, self.run_transform))
# Ensure each node has a unique name
model = ApplyIrPass(["unique_name_for_nodes"])(model)
return model
def custom_get_all_named_subclasses(cls: Any) -> Dict[str, Any]:
subclasses = {}
def visit(cls):
for subclass in cls.__subclasses__():
if hasattr(subclass, 'name'):
subclasses[subclass.name] = subclass
visit(subclass)
visit(cls)
# patch
subclasses['final_check'] = CustomFinalCheck
return subclasses
# monkey patch
poprt.passes.base_pass.get_all_named_subclasses = custom_get_all_named_subclasses
# Copyright (c) 2023 Graphcore Ltd. All rights reserved.
import onnx
from poprt import Pass
from poprt.passes import register
from poprt.passes.onnx_helper import clean_info, get_dtype, topological_sort
from poprt.passes.pattern_helper import PatternMatcher
from poprt.passes.shape_inference import infer_shapes
@register("deberta_pack")
class PackedDeberta(Pass):
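    """Rewrite a DeBERTa ONNX graph for pack mode: replace its mask handling with AttentionMask nodes and insert an Unpack node."""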
@staticmethod
def _find(items, search_func, return_all=False):
results = []
for i, item in enumerate(items):
if search_func(item):
results.append((i, item))
if not return_all:
break
return results if return_all else (-1, None) if not results else results[0]
def __init__(self):
super().__init__()
def _modify_mask_before_mul_to_input(self, model):
pattern = ["s:0->Unsqueeze:Unsqueeze->Cast:Cast->Mul:Mul->e:5"]
pattern_matcher = PatternMatcher(pattern)
ops = pattern_matcher.next_pattern(model.graph)
if ops:
Cast = onnx.helper.make_node(
"Cast",
name="{}_Cast".format(ops["Unsqueeze"].node.name),
inputs=[ops["Unsqueeze"].node.input[0]],
outputs=["{}_Cast:0".format(ops["Unsqueeze"].node.name)],
to=onnx.TensorProto.BOOL,
)
ops["Unsqueeze"].node.input[0] = Cast.output[0]
model.graph.node.insert(ops["Unsqueeze"].index, Cast)
return model
def _modify_attentionmask(self, model):
pattern = [
"s:0->Reshape:Reshape->Squeeze:Squeeze->Unsqueeze:Unsqueeze->Mul:Mul->Cast:Cast->Not:Not->e:1",
" Reshape:Reshape ->Mul:Mul",
]
pattern_matcher = PatternMatcher(pattern)
ops = pattern_matcher.next_pattern(model.graph)
if ops:
input = ops["Reshape"].node.input[0]
for node in [
ops[key].node
for key in ["Reshape", "Squeeze", "Unsqueeze", "Mul", "Cast", "Not"]
]:
model.graph.node.remove(node)
else:
return model
pattern = [
"s:0 ->WhereV2:WhereV2_1->Softmax:Softmax->WhereV2:WhereV2_2->MatMul:MatMul->e:1",
"s:2->Add:Add->WhereV2:WhereV2_1",
]
pattern_matcher = PatternMatcher(pattern)
ops = pattern_matcher.next_pattern(model.graph)
AttentionMask, AttentionMaskNot = None, None
while ops:
if AttentionMask is None:
dtype = get_dtype(model.graph, ops["Add"].node.output[0])
kwargs = {
"dataType": "FLOAT"
if dtype == onnx.TensorProto.FLOAT
else "FLOAT16"
}
AttentionMask = onnx.helper.make_node(
"AttentionMask",
name="AttentionMask",
inputs=[input, ops["Add"].node.output[0]],
outputs=["{}_AttentionMask".format(ops["Add"].node.output[0])],
domain="ai.graphcore",
**kwargs,
)
Cast = onnx.helper.make_node(
"Cast",
name="{}_Cast".format(AttentionMask.name),
inputs=[AttentionMask.output[0]],
outputs=["{}_Cast:0".format(AttentionMask.name)],
to=onnx.TensorProto.BOOL,
)
Not = onnx.helper.make_node(
"Not",
name="{}_Not".format(Cast.name),
inputs=[Cast.output[0]],
outputs=["{}_Not:0".format(Cast.name)],
)
AttentionMaskNot = onnx.helper.make_node(
"Cast",
name="{}_Cast".format(Not.name),
inputs=[Not.output[0]],
outputs=["{}_Cast:0".format(Not.name)],
to=onnx.TensorProto.FLOAT16,
)
model.graph.node.insert(ops["Softmax"].index, AttentionMaskNot)
model.graph.node.insert(ops["Softmax"].index, Not)
model.graph.node.insert(ops["Softmax"].index, Cast)
model.graph.node.insert(ops["Softmax"].index, AttentionMask)
Add = onnx.helper.make_node(
"Add",
name="{}_Add".format(ops["Add"].node.output[0]),
inputs=[AttentionMask.output[0], ops["Add"].node.output[0]],
outputs=["{}_Add:0".format(ops["Add"].node.output[0])],
)
ops["Softmax"].node.input[0] = Add.output[0]
Mul = onnx.helper.make_node(
"Mul",
name="{}_Mul".format(ops["Softmax"].node.output[0]),
inputs=[AttentionMaskNot.output[0], ops["Softmax"].node.output[0]],
outputs=["{}_Mul".format(ops["Softmax"].node.output[0])],
)
ops["MatMul"].node.input[0] = Mul.output[0]
softmax_index, _ = self._find(
model.graph.node, lambda n: n.name == ops["Softmax"].node.name
)
model.graph.node.insert(softmax_index + 1, Mul)
model.graph.node.insert(softmax_index, Add)
for key in ("WhereV2_1", "WhereV2_2"):
model.graph.node.remove(ops[key].node)
ops = pattern_matcher.next_pattern(model.graph)
return model
def _add_unpack(self, model):
max_valid_num, segment_max_size, segment_num = 2 * 10, 384, 1
pattern = [
"s:0->Reshape:1->MatMul:2->Add:3->Split:4->e:5",
]
pattern_matcher = PatternMatcher(pattern)
ops = pattern_matcher.next_pattern(model.graph)
if ops:
unpack_info = onnx.helper.make_tensor_value_info(
"unpack_info",
onnx.TensorProto.INT32,
(max_valid_num, segment_num),
)
model.graph.input.append(unpack_info)
unpack_attributes = {
"max_valid_num": max_valid_num,
"segment_max_size": [segment_max_size],
}
Unpack = onnx.helper.make_node(
"Unpack",
name="Unpack",
inputs=[ops["1"].node.output[0], unpack_info.name],
outputs=["{}_Unpack:0".format(ops["1"].node.output[0])],
domain="ai.graphcore",
**unpack_attributes,
)
ops["2"].node.input[0] = Unpack.output[0]
model.graph.node.insert(ops["2"].index, Unpack)
return model
def _add_pack(self, model):
model = self._modify_mask_before_mul_to_input(model)
model = self._modify_attentionmask(model)
sorted_nodes = topological_sort(model.graph)
model.graph.ClearField("node")
for node in sorted_nodes:
model.graph.node.append(node)
return model
def __call__(self, model: onnx.ModelProto) -> onnx.ModelProto:
model = self._add_pack(model)
model = infer_shapes(clean_info(model))
model = self._add_unpack(model)
model = infer_shapes(clean_info(model))
return model
def run(self, onnx_model: onnx.ModelProto) -> onnx.ModelProto:
onnx_model.CopyFrom(self.traverse_graph(onnx_model.graph, self.__call__))
return onnx_model