# ONNX+TensorRT
This article uses binary sentiment classification as an example of deploying a model with ONNX + TensorRT.

## 1. Converting PyTorch weights to ONNX
1. First, run the [sentiment classification task](https://github.com/Tongjilibo/bert4torch/blob/master/examples/sentence_classfication/task_sentiment_classification.py) and save the PyTorch weights.

2. The conversion uses PyTorch's built-in `torch.onnx.export()`; the full conversion script is in [ONNX conversion of BERT weights](https://github.com/Tongjilibo/bert4torch/blob/master/examples/serving/task_bert_cls_onnx.py). A minimal sketch of the call is shown below.
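The following sketch illustrates what such an export looks like; `model` is assumed to be the trained classifier from step 1, and the input/output names and dummy shapes are assumptions chosen to match the `trtexec` flags used later in this article:
```python
import torch

# Sketch only: `model` is assumed to be the trained sentiment classifier
# from step 1; the names input_ids/segment_ids are assumptions that
# match the trtexec shapes in section 3.
model.eval()
dummy_input_ids = torch.zeros(1, 512, dtype=torch.long)
dummy_segment_ids = torch.zeros(1, 512, dtype=torch.long)
torch.onnx.export(
    model,
    (dummy_input_ids, dummy_segment_ids),
    'bert_cls.onnx',
    input_names=['input_ids', 'segment_ids'],
    output_names=['output'],
    # keep only the batch dimension dynamic, mirroring the TensorRT setup
    dynamic_axes={'input_ids': {0: 'batch_size'},
                  'segment_ids': {0: 'batch_size'},
                  'output': {0: 'batch_size'}},
    opset_version=12,
)
```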

## 2. Installing the TensorRT environment
This follows the semi-automatic installation procedure from [TensorRT 8.2.1.8 installation notes (very detailed) | Quickly setting up a TensorRT environment with Docker](https://zhuanlan.zhihu.com/p/446477459); you can also read the original article directly.

1. Pull the matching image from the official registry (choose according to your CUDA version)
```shell
docker pull nvidia/cuda:11.3.0-cudnn8-devel-ubuntu20.04
```
2. Run the image / create a container
```shell
docker run -it --name trt_test --gpus all -v /home/tensorrt:/tensorrt nvidia/cuda:11.3.0-cudnn8-devel-ubuntu20.04 /bin/bash
```
3. [Download the TensorRT package](https://developer.nvidia.com/zh-cn/tensorrt); this step requires registering an account. I downloaded `TensorRT-8.4.1.5.Linux.x86_64-gnu.cuda-11.6.cudnn8.4.tar.gz`.
4. Back in the container, install TensorRT (`cd` to the tensorrt path inside the container and unpack the tar package downloaded above)
```shell
tar -zxvf TensorRT-8.4.1.5.Linux.x86_64-gnu.cuda-11.6.cudnn8.4.tar.gz
```
5. Add environment variables
```shell
# install vim
apt-get update
apt-get install vim

# add the following export line to ~/.bashrc, then reload the file
vim ~/.bashrc
export LD_LIBRARY_PATH=/tensorrt/TensorRT-8.4.1.5/lib:$LD_LIBRARY_PATH
source ~/.bashrc
```
6. Install Python (after installing, run `python` to check the installed version; you will need it in the next step)
```shell
apt-get install -y --no-install-recommends \
python3 \
python3-pip \
python3-dev \
python3-wheel &&\
cd /usr/local/bin &&\
ln -s /usr/bin/python3 python &&\
ln -s /usr/bin/pip3 pip;
```
7. pip-install the matching TensorRT library
Note: be sure to locally install, with pip, the whl bundled inside the tar that matches your Python version
```shell
cd TensorRT-8.4.1.5/python/
pip3 install tensorrt-8.4.1.5-cp38-none-linux_x86_64.whl   # pick the cp3x whl matching your Python version
```
8. Test the TensorRT Python API
```python
import tensorrt
print(tensorrt.__version__)
```

## 3. Converting ONNX to a TensorRT engine
- Conversion command (run `trtexec` from the `bin/` directory of the unpacked TensorRT package)
```shell
./trtexec --onnx=/tensorrt/bert_cls.onnx --saveEngine=/tensorrt/bert_cls.trt --minShapes=input_ids:1x512,segment_ids:1x512 --optShapes=input_ids:1x512,segment_ids:1x512 --maxShapes=input_ids:20x512,segment_ids:20x512 --device=0
```
- Notes: 1) in testing, making both the batch_size and seq_len dimensions dynamic was very slow (100ms+), so only the batch_size dimension is kept dynamic here and seq_len is always padded to 512 (the sketch below shows how to verify this on the built engine); 2) [reference](https://github.com/NVIDIA/TENSORRT/issues/976)
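As a sanity check, here is a small sketch (assuming the TensorRT 8.x binding API) that prints the binding shapes baked into `bert_cls.trt`; `-1` marks a dynamic dimension, so only the batch dimension should show as `-1` while seq_len stays fixed at 512:
```python
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

# Deserialize the engine built by trtexec above.
with open('bert_cls.trt', 'rb') as f, trt.Runtime(TRT_LOGGER) as runtime:
    engine = runtime.deserialize_cuda_engine(f.read())

# With the flags above, expect e.g. input_ids (-1, 512), segment_ids (-1, 512).
for i in range(engine.num_bindings):
    print(engine.get_binding_name(i), tuple(engine.get_binding_shape(i)))
```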

## 4. Loading the model and running inference with TensorRT
- Reference: [Accelerating BERT pretrained-model inference with TensorRT (very detailed, with core code and pitfall guide)](https://zhuanlan.zhihu.com/p/446477075)
- Inference code
```python
import numpy as np
from bert4torch.tokenizers import Tokenizer
import tensorrt as trt
import common   # common.py from the TensorRT samples (see link below)
import time
from tqdm import tqdm

"""
a. Load the engine and build the execution context
"""
TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

def get_engine(engine_file_path):
    print("Reading engine from file {}".format(engine_file_path))
    with open(engine_file_path, "rb") as f, trt.Runtime(TRT_LOGGER) as runtime:
        engine = runtime.deserialize_cuda_engine(f.read())
        return engine

engine_model_path = "bert_cls.trt"
# Build a TensorRT engine.
engine = get_engine(engine_model_path)
# Contexts are used to perform inference.
context = engine.create_execution_context()

"""
b、从engine中获取inputs, outputs, bindings, stream 的格式以及分配缓存
"""
def to_numpy(tensor):
    for i, item in enumerate(tensor):
        tensor[i] = item + [0] * (512-len(item))
    return np.array(tensor, np.int32)

dict_path = '/tensorrt/vocab.txt'
tokenizer = Tokenizer(dict_path, do_lower_case=True)
sentences = ['你在干嘛呢?这几天外面的天气真不错啊,万里无云,阳光明媚的,我的心情也特别的好,我特别想出门去转转呢。你在干嘛呢?这几天外面的天气真不错啊,万里无云,阳光明媚的,我的心情也特别的好,我特别想出门去转转呢。你在干嘛呢?这几天外面的天气真不错啊,万里无云,阳光明媚的,我的心情也特别的好,我特别想出门去转转呢。你在干嘛呢?这几天外面的天气真不错啊,万里无云,阳光明媚的,我的心情也特别的好,我特别想出门。']
input_ids, segment_ids = tokenizer.encode(sentences)
tokens_id = to_numpy(input_ids)
segment_ids = to_numpy(segment_ids)

context.active_optimization_profile = 0                          # deprecated; newer TensorRT uses set_optimization_profile_async
origin_inputshape = context.get_binding_shape(0)                 # (1, -1)
origin_inputshape[0], origin_inputshape[1] = tokens_id.shape     # (batch_size, max_sequence_length)
context.set_binding_shape(0, origin_inputshape)
context.set_binding_shape(1, origin_inputshape)

"""
c、输入数据填充
"""
inputs, outputs, bindings, stream = common.allocate_buffers_v2(engine, context)
inputs[0].host = tokens_id
inputs[1].host = segment_ids

"""
d、tensorrt推理
"""
trt_outputs = common.do_inference_v2(context, bindings=bindings, inputs=inputs, outputs=outputs, stream=stream)
preds = np.argmax(trt_outputs, axis=1)
print("====preds====:",preds)

"""
e、测试耗时
"""
steps = 100
start = time.time()
for i in tqdm(range(steps)):
    trt_outputs = common.do_inference_v2(context, bindings=bindings, inputs=inputs, outputs=outputs, stream=stream)
    preds = np.argmax(trt_outputs, axis=1)
print('onnx+tensorrt: ',  (time.time()-start)*1000/steps, ' ms')
```

- Required [common.py](https://github.com/NVIDIA/TensorRT/blob/96e23978cd6e4a8fe869696d3d8ec2b47120629b/samples/python/common.py); a sketch of the `allocate_buffers_v2` helper used above follows the run output below
- Run output
```shell
Reading engine from file bert_cls.trt
onnx_tensorrt.py:44: DeprecationWarning: Use set_optimization_profile_async instead.
  context.active_optimization_profile = 0
====preds====: [1]
100%|██████████████████████████████████████████████████████████████████████████| 100/100 [00:01<00:00, 79.81it/s]
onnx+tensorrt:  12.542836666107178  ms
```
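The linked common.py provides `do_inference_v2` and `allocate_buffers`, but not `allocate_buffers_v2`. Below is a hypothetical sketch of such a helper, assuming it differs from the stock `allocate_buffers` only in sizing buffers from the context's resolved binding shapes, which is necessary once dynamic dimensions have been fixed with `set_binding_shape`:
```python
import pycuda.driver as cuda
import pycuda.autoinit
import tensorrt as trt
from common import HostDeviceMem  # helper class from NVIDIA's common.py

def allocate_buffers_v2(engine, context):
    """Hypothetical variant of common.allocate_buffers for dynamic shapes:
    buffer sizes come from the context (concrete shapes), not the engine
    (which still reports -1 for dynamic dimensions)."""
    inputs, outputs, bindings = [], [], []
    stream = cuda.Stream()
    for i in range(engine.num_bindings):
        shape = context.get_binding_shape(i)            # concrete, no -1
        size = trt.volume(shape)
        dtype = trt.nptype(engine.get_binding_dtype(i))
        host_mem = cuda.pagelocked_empty(size, dtype)   # pinned host memory
        device_mem = cuda.mem_alloc(host_mem.nbytes)    # device memory
        bindings.append(int(device_mem))
        if engine.binding_is_input(i):
            inputs.append(HostDeviceMem(host_mem, device_mem))
        else:
            outputs.append(HostDeviceMem(host_mem, device_mem))
    return inputs, outputs, bindings, stream
```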

## 5. Speed comparison
- Test setup: batch_size=1, seq_len=202 (for TensorRT, both seq_len=202 and seq_len=512 were tested), 100 iterations; a hypothetical sketch of the plain-ONNX baseline follows the table

| Scheme | CPU | GPU |
|----|----|----|
|pytorch|144ms|29ms|
|onnx|66ms|——|
|onnx+tensorrt|——|7ms (len=202), 12ms (len=512)|
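
For reference, a hypothetical sketch of how the plain-ONNX CPU row could be measured with onnxruntime; the CPU provider, the int64 input dtype, and a seq_len-dynamic export are assumptions, not details given in the original scripts:
```python
import time
import numpy as np
import onnxruntime as ort

# Hypothetical baseline; assumes bert_cls.onnx was exported with int64
# inputs named input_ids/segment_ids and a dynamic sequence length.
sess = ort.InferenceSession('bert_cls.onnx', providers=['CPUExecutionProvider'])
input_ids = np.zeros((1, 202), dtype=np.int64)    # btz=1, seq_len=202
segment_ids = np.zeros((1, 202), dtype=np.int64)

steps = 100
start = time.time()
for _ in range(steps):
    sess.run(None, {'input_ids': input_ids, 'segment_ids': segment_ids})
print('onnx cpu:', (time.time() - start) * 1000 / steps, 'ms')
```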

## 6. Experiment files
- [File tree](https://pan.baidu.com/s/1vX3yK7BWQScnK_5Zb-pAkQ?pwd=rhq9)
```shell
tensorrt
├─common.py
├─onnx_tensorrt.py
├─bert_cls.onnx
├─bert_cls.trt
├─TensorRT-8.4.1.5
```
- Docker image: 1) build it yourself following the steps above, or 2) pull the image uploaded by the author directly
```shell
docker pull tongjilibo/tensorrt:11.3.0-cudnn8-devel-ubuntu20.04-tensorrt8.4.1.5

docker run -it --name trt_torch --gpus all -v /home/libo/tensorrt:/tensorrt tongjilibo/tensorrt:11.3.0-cudnn8-devel-ubuntu20.04-tensorrt8.4.1.5 /bin/bash
```