# ONNX+TensorRT
This article uses binary sentiment classification as an example of deploying a model with ONNX + TensorRT.

## 1. Converting PyTorch weights to ONNX
1. First, run the [sentiment classification task](https://github.com/Tongjilibo/bert4torch/blob/master/examples/sentence_classfication/task_sentiment_classification.py) and save the PyTorch weights.

2. The conversion uses PyTorch's built-in `torch.onnx.export()`; the full conversion script is in [ONNX conversion of BERT weights](https://github.com/Tongjilibo/bert4torch/blob/master/examples/serving/task_bert_cls_onnx.py). A minimal sketch of the call is shown below.
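The following sketch illustrates what such an export looks like; `model` is assumed to be the trained classifier from step 1, and the input/output names and dummy shapes are assumptions chosen to match the `trtexec` flags used later in this article:
```python
import torch

# Sketch only: `model` is assumed to be the trained sentiment classifier
# from step 1; the names input_ids/segment_ids are assumptions that
# match the trtexec shapes in section 3.
model.eval()
dummy_input_ids = torch.zeros(1, 512, dtype=torch.long)
dummy_segment_ids = torch.zeros(1, 512, dtype=torch.long)
torch.onnx.export(
    model,
    (dummy_input_ids, dummy_segment_ids),
    'bert_cls.onnx',
    input_names=['input_ids', 'segment_ids'],
    output_names=['output'],
    # keep only the batch dimension dynamic, mirroring the TensorRT setup
    dynamic_axes={'input_ids': {0: 'batch_size'},
                  'segment_ids': {0: 'batch_size'},
                  'output': {0: 'batch_size'}},
    opset_version=12,
)
```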

## 2. Installing the TensorRT environment
This follows the semi-automatic installation procedure from [TensorRT 8.2.1.8 installation notes (very detailed) | Quickly setting up a TensorRT environment with Docker](https://zhuanlan.zhihu.com/p/446477459); you can also read the original article directly.

1. Pull the matching image from the official registry (choose according to your CUDA version)
```shell
docker pull nvidia/cuda:11.3.0-cudnn8-devel-ubuntu20.04
```
2. Run the image / create a container
```shell
docker run -it --name trt_test --gpus all -v /home/tensorrt:/tensorrt nvidia/cuda:11.3.0-cudnn8-devel-ubuntu20.04 /bin/bash
```
3. [Download the TensorRT package](https://developer.nvidia.com/zh-cn/tensorrt); this step requires registering an account. I downloaded `TensorRT-8.4.1.5.Linux.x86_64-gnu.cuda-11.6.cudnn8.4.tar.gz`.
4. Back in the container, install TensorRT (`cd` to the tensorrt path inside the container and unpack the tar package downloaded above)
```shell
tar -zxvf TensorRT-8.4.1.5.Linux.x86_64-gnu.cuda-11.6.cudnn8.4.tar.gz
```
5. Add environment variables
```shell
# install vim
apt-get update
apt-get install vim

# add the following export line to ~/.bashrc, then reload the file
vim ~/.bashrc
export LD_LIBRARY_PATH=/tensorrt/TensorRT-8.4.1.5/lib:$LD_LIBRARY_PATH
source ~/.bashrc
```
6. Install Python (after installing, run `python` to check the installed version; you will need it in the next step)
```shell
apt-get install -y --no-install-recommends \
python3 \
python3-pip \
python3-dev \
python3-wheel &&\
cd /usr/local/bin &&\
ln -s /usr/bin/python3 python &&\
ln -s /usr/bin/pip3 pip;
```
7. pip-install the matching TensorRT library
Note: be sure to locally install, with pip, the whl bundled inside the tar that matches your Python version
```shell
cd TensorRT-8.4.1.5/python/
pip3 install tensorrt-8.4.1.5-cp38-none-linux_x86_64.whl   # pick the cp3x whl matching your Python version
```
8. Test the TensorRT Python API
```python
import tensorrt
print(tensorrt.__version__)
```

## 3. Converting ONNX to a TensorRT engine
- Conversion command (run `trtexec` from the `bin/` directory of the unpacked TensorRT package)
```shell
./trtexec --onnx=/tensorrt/bert_cls.onnx --saveEngine=/tensorrt/bert_cls.trt --minShapes=input_ids:1x512,segment_ids:1x512 --optShapes=input_ids:1x512,segment_ids:1x512 --maxShapes=input_ids:20x512,segment_ids:20x512 --device=0
```
- Notes: 1) in testing, making both the batch_size and seq_len dimensions dynamic was very slow (100ms+), so only the batch_size dimension is kept dynamic here and seq_len is always padded to 512 (the sketch below shows how to verify this on the built engine); 2) [reference](https://github.com/NVIDIA/TENSORRT/issues/976)
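As a sanity check, here is a small sketch (assuming the TensorRT 8.x binding API) that prints the binding shapes baked into `bert_cls.trt`; `-1` marks a dynamic dimension, so only the batch dimension should show as `-1` while seq_len stays fixed at 512:
```python
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

# Deserialize the engine built by trtexec above.
with open('bert_cls.trt', 'rb') as f, trt.Runtime(TRT_LOGGER) as runtime:
    engine = runtime.deserialize_cuda_engine(f.read())

# With the flags above, expect e.g. input_ids (-1, 512), segment_ids (-1, 512).
for i in range(engine.num_bindings):
    print(engine.get_binding_name(i), tuple(engine.get_binding_shape(i)))
```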

## 4. Loading the model and running inference with TensorRT
- Reference: [Accelerating BERT pretrained-model inference with TensorRT (very detailed, with core code and pitfall guide)](https://zhuanlan.zhihu.com/p/446477075)
- Inference code
```python
import numpy as np
from bert4torch.tokenizers import Tokenizer
import tensorrt as trt
import common   # common.py from the TensorRT samples (see link below)
import time
from tqdm import tqdm

"""
a. Load the engine and build the execution context
"""
TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

def get_engine(engine_file_path):
    print("Reading engine from file {}".format(engine_file_path))
    with open(engine_file_path, "rb") as f, trt.Runtime(TRT_LOGGER) as runtime:
        engine = runtime.deserialize_cuda_engine(f.read())
        return engine

engine_model_path = "bert_cls.trt"
# Build a TensorRT engine.
engine = get_engine(engine_model_path)
# Contexts are used to perform inference.
context = engine.create_execution_context()

"""
b、从engine中获取inputs, outputs, bindings, stream 的格式以及分配缓存
"""
def to_numpy(tensor):
    for i, item in enumerate(tensor):
        tensor[i] = item + [0] * (512-len(item))
    return np.array(tensor, np.int32)

dict_path = '/tensorrt/vocab.txt'
tokenizer = Tokenizer(dict_path, do_lower_case=True)
sentences = ['你在干嘛呢?这几天外面的天气真不错啊,万里无云,阳光明媚的,我的心情也特别的好,我特别想出门去转转呢。你在干嘛呢?这几天外面的天气真不错啊,万里无云,阳光明媚的,我的心情也特别的好,我特别想出门去转转呢。你在干嘛呢?这几天外面的天气真不错啊,万里无云,阳光明媚的,我的心情也特别的好,我特别想出门去转转呢。你在干嘛呢?这几天外面的天气真不错啊,万里无云,阳光明媚的,我的心情也特别的好,我特别想出门。']
input_ids, segment_ids = tokenizer.encode(sentences)
tokens_id = to_numpy(input_ids)
segment_ids = to_numpy(segment_ids)

context.active_optimization_profile = 0                          # deprecated; newer TensorRT uses set_optimization_profile_async
origin_inputshape = context.get_binding_shape(0)                 # (1, -1)
origin_inputshape[0], origin_inputshape[1] = tokens_id.shape     # (batch_size, max_sequence_length)
context.set_binding_shape(0, origin_inputshape)
context.set_binding_shape(1, origin_inputshape)

"""
c、输入数据填充
"""
inputs, outputs, bindings, stream = common.allocate_buffers_v2(engine, context)
inputs[0].host = tokens_id
inputs[1].host = segment_ids

"""
d、tensorrt推理
"""
trt_outputs = common.do_inference_v2(context, bindings=bindings, inputs=inputs, outputs=outputs, stream=stream)
preds = np.argmax(trt_outputs, axis=1)
print("====preds====:",preds)

"""
e、测试耗时
"""
steps = 100
start = time.time()
for i in tqdm(range(steps)):
    trt_outputs = common.do_inference_v2(context, bindings=bindings, inputs=inputs, outputs=outputs, stream=stream)
    preds = np.argmax(trt_outputs, axis=1)
print('onnx+tensorrt: ',  (time.time()-start)*1000/steps, ' ms')
```

- Required [common.py](https://github.com/NVIDIA/TensorRT/blob/96e23978cd6e4a8fe869696d3d8ec2b47120629b/samples/python/common.py); a sketch of the `allocate_buffers_v2` helper used above follows the run output below
- Run output
```shell
Reading engine from file bert_cls.trt
onnx_tensorrt.py:44: DeprecationWarning: Use set_optimization_profile_async instead.
  context.active_optimization_profile = 0
====preds====: [1]
100%|██████████████████████████████████████████████████████████████████████████| 100/100 [00:01<00:00, 79.81it/s]
onnx+tensorrt:  12.542836666107178  ms
```
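The linked common.py provides `do_inference_v2` and `allocate_buffers`, but not `allocate_buffers_v2`. Below is a hypothetical sketch of such a helper, assuming it differs from the stock `allocate_buffers` only in sizing buffers from the context's resolved binding shapes, which is necessary once dynamic dimensions have been fixed with `set_binding_shape`:
```python
import pycuda.driver as cuda
import pycuda.autoinit
import tensorrt as trt
from common import HostDeviceMem  # helper class from NVIDIA's common.py

def allocate_buffers_v2(engine, context):
    """Hypothetical variant of common.allocate_buffers for dynamic shapes:
    buffer sizes come from the context (concrete shapes), not the engine
    (which still reports -1 for dynamic dimensions)."""
    inputs, outputs, bindings = [], [], []
    stream = cuda.Stream()
    for i in range(engine.num_bindings):
        shape = context.get_binding_shape(i)            # concrete, no -1
        size = trt.volume(shape)
        dtype = trt.nptype(engine.get_binding_dtype(i))
        host_mem = cuda.pagelocked_empty(size, dtype)   # pinned host memory
        device_mem = cuda.mem_alloc(host_mem.nbytes)    # device memory
        bindings.append(int(device_mem))
        if engine.binding_is_input(i):
            inputs.append(HostDeviceMem(host_mem, device_mem))
        else:
            outputs.append(HostDeviceMem(host_mem, device_mem))
    return inputs, outputs, bindings, stream
```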

## 5. Speed comparison
- Test setup: batch_size=1, seq_len=202 (for TensorRT, both seq_len=202 and seq_len=512 were tested), 100 iterations; a hypothetical sketch of the plain-ONNX baseline follows the table

| Scheme | CPU | GPU |
|----|----|----|
|pytorch|144ms|29ms|
|onnx|66ms|——|
|onnx+tensorrt|——|7ms (len=202), 12ms (len=512)|
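
For reference, a hypothetical sketch of how the plain-ONNX CPU row could be measured with onnxruntime; the CPU provider, the int64 input dtype, and a seq_len-dynamic export are assumptions, not details given in the original scripts:
```python
import time
import numpy as np
import onnxruntime as ort

# Hypothetical baseline; assumes bert_cls.onnx was exported with int64
# inputs named input_ids/segment_ids and a dynamic sequence length.
sess = ort.InferenceSession('bert_cls.onnx', providers=['CPUExecutionProvider'])
input_ids = np.zeros((1, 202), dtype=np.int64)    # btz=1, seq_len=202
segment_ids = np.zeros((1, 202), dtype=np.int64)

steps = 100
start = time.time()
for _ in range(steps):
    sess.run(None, {'input_ids': input_ids, 'segment_ids': segment_ids})
print('onnx cpu:', (time.time() - start) * 1000 / steps, 'ms')
```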

## 6. Experiment files
- [File tree](https://pan.baidu.com/s/1vX3yK7BWQScnK_5Zb-pAkQ?pwd=rhq9)
```shell
tensorrt
├─common.py
├─onnx_tensorrt.py
├─bert_cls.onnx
├─bert_cls.trt
├─TensorRT-8.4.1.5
```
- Docker image: 1) build it yourself following the steps above, or 2) pull the image uploaded by the author directly
```shell
docker pull tongjilibo/tensorrt:11.3.0-cudnn8-devel-ubuntu20.04-tensorrt8.4.1.5

docker run -it --name trt_torch --gpus all -v /home/libo/tensorrt:/tensorrt tongjilibo/tensorrt:11.3.0-cudnn8-devel-ubuntu20.04-tensorrt8.4.1.5 /bin/bash
```