Commit 4353fa59 authored by limm

add part code
# TorchScript support
## Introduction to TorchScript
**TorchScript** is a way to create serializable and optimizable models from PyTorch code. Any TorchScript program can be saved from a Python process and loaded in a process where there is no Python dependency. Check the [Introduction to TorchScript](https://pytorch.org/tutorials/beginner/Intro_to_TorchScript_tutorial.html) for more details.
## Build custom ops
### Prerequisite
- Download libtorch from the official website [here](https://pytorch.org/get-started/locally/).
*Please note that only **Pre-cxx11 ABI** and **version 1.8.1+** on the Linux platform are supported for now.*
Download links for previous versions of libtorch can be found in this [issue comment](https://github.com/pytorch/pytorch/issues/40961#issuecomment-1017317786). Taking libtorch 1.8.1+cu111 as an example, extract it, expose `Torch_DIR` and add the lib path to `LD_LIBRARY_PATH` as below:
```bash
wget https://download.pytorch.org/libtorch/cu111/libtorch-shared-with-deps-1.8.1%2Bcu111.zip
unzip libtorch-shared-with-deps-1.8.1+cu111.zip
cd libtorch
export Torch_DIR=$(pwd)
export LD_LIBRARY_PATH=$Torch_DIR/lib:$LD_LIBRARY_PATH
```
Note:
- If you want to persist the libtorch env variables in `~/.bashrc`, you could run
```bash
echo '# set env for libtorch' >> ~/.bashrc
echo "export Torch_DIR=${Torch_DIR}" >> ~/.bashrc
echo 'export LD_LIBRARY_PATH=$Torch_DIR/lib:$LD_LIBRARY_PATH' >> ~/.bashrc
source ~/.bashrc
```
### Build on Linux
```bash
cd ${MMDEPLOY_DIR} # To MMDeploy root directory
mkdir -p build && cd build
cmake -DMMDEPLOY_TARGET_BACKENDS=torchscript -DTorch_DIR=${Torch_DIR} ..
make -j$(nproc) && make install
```
## How to convert a model
- Follow the instructions in the tutorial [How to convert model](../02-how-to-run/convert_model.md).
## SDK backend
The TorchScript SDK backend may be built by passing `-DMMDEPLOY_TORCHSCRIPT_SDK_BACKEND=ON` to `cmake`.
Notice that `libtorch` is sensitive to C++ ABI versions. On platforms that default to the C++11 ABI (e.g. Ubuntu 16+), one may
pass `-DCMAKE_CXX_FLAGS="-D_GLIBCXX_USE_CXX11_ABI=0"` to `cmake` to build with the pre-C++11 ABI. In this case, all
dependencies with ABI-sensitive interfaces (e.g. OpenCV) must also be built with the pre-C++11 ABI.
## FAQs
- Error: `projects/thirdparty/libtorch/share/cmake/Caffe2/Caffe2Config.cmake:96 (message):Your installed Caffe2 version uses cuDNN but I cannot find the cuDNN libraries. Please set the proper cuDNN prefixes and / or install cuDNN.`
Export `CUDNN_ROOT=/root/path/to/cudnn` to resolve the build error.
# TVM feature support
MMDeploy has integrated TVM into both the model converter and the SDK. The available features include:
- AutoTVM 调优器
- Ansor 调优器
- Graph Executor 运行时
- Virtual Machine 运行时
# VACC Backend
The following environment is required:
- cmake 3.10.0+
- gcc/g++ 7.5.0
- llvm 9.0.1
- ubuntu 18.04
## PCIe
### 1. Packages
- dkms (>=1.95)
- linux-headers
- dpkg (Ubuntu)
- rpm (CentOS)
- python2
- python3
Check whether a Vastai inference card is present: `lspci -d:0100`
1. Prepare the environment
```bash
sudo apt-get install dkms dpkg python2 python3
```
2. Install the driver
```bash
sudo dpkg -i vastai-pci_xx.xx.xx.xx_xx.deb
```
3. Verify the installation
```bash
# 1. Check whether the deb package is installed successfully
dpkg --status vastai-pci-xxx
#output
Package: vastai-pci-dkms
Status: install ok installed
……
Version: xx.xx.xx.xx
Provides: vastai-pci-modules (= xx.xx.xx.xx)
Depends: dkms (>= 1.95)
Description: vastai-pci driver in DKMS format.
# 2. Check whether the driver is loaded into the kernel
lsmod | grep vastai_pci
#output
vastai_pci xxx x
```
4. Upgrade the driver
```bash
sudo dpkg -i vastai-pci_dkms_xx.xx.xx.xx_xx.deb
```
5. Uninstall the driver
```bash
sudo dpkg -r vastai-pci_dkms_xx.xx.xx.xx_xx
```
### 2. Reboot the PCIe device
```bash
sudo chmod 666 /dev/kchar:0 && sudo echo reboot > /dev/kchar:0
```
## SDK
### Step 1. Install Python dependencies
```bash
pip install torch==1.8.0 torchvision==0.9.0 torchaudio==0.8.0
pip install onnx==1.10.0 tqdm==4.64.1
pip install h5py==3.8.0
pip install decorator==5.1.1 scipy==1.7.3
```
### Step 2. Set environment variables
Append the following to `~/.bashrc`, then run `source ~/.bashrc`:
```bash
export VASTSTREAM_PIPELINE=true
export VACC_IRTEXT_ENABLE=1
export TVM_HOME="/opt/vastai/vaststream/tvm"
export VASTSTREAM_HOME="/opt/vastai/vaststream/vacl"
export LD_LIBRARY_PATH=$TVM_HOME/lib:$VASTSTREAM_HOME/lib:$LD_LIBRARY_PATH
export PYTHONPATH=$TVM_HOME/python:$TVM_HOME/vacc/python:$TVM_HOME/topi/python:${PYTHONPATH}:$VASTSTREAM_HOME/python
```
## ncnn custom ops
<!-- TOC -->
- [ncnn Ops](#ncnn-ops)
  - [Expand](#expand)
    - [Description](#description)
    - [Parameters](#parameters)
    - [Inputs](#inputs)
    - [Outputs](#outputs)
    - [Type Constraints](#type-constraints)
  - [Gather](#gather)
    - [Description](#description-1)
    - [Parameters](#parameters-1)
    - [Inputs](#inputs-1)
    - [Outputs](#outputs-1)
    - [Type Constraints](#type-constraints-1)
  - [Shape](#shape)
    - [Description](#description-2)
    - [Parameters](#parameters-2)
    - [Inputs](#inputs-2)
    - [Outputs](#outputs-2)
    - [Type Constraints](#type-constraints-2)
  - [TopK](#topk)
    - [Description](#description-3)
    - [Parameters](#parameters-3)
    - [Inputs](#inputs-3)
    - [Outputs](#outputs-3)
    - [Type Constraints](#type-constraints-3)
<!-- TOC -->
### Expand
#### Description
Broadcast the input blob following the given shape and the broadcast rule of ncnn.
#### Parameters
Expand has no parameters.
#### Inputs
<dl>
<dt><tt>inputs[0]</tt>: ncnn.Mat</dt>
<dd>bottom_blobs[0]; An ncnn.Mat of input data.</dd>
<dt><tt>inputs[1]</tt>: ncnn.Mat</dt>
<dd>bottom_blobs[1]; A 1-dim ncnn.Mat specifying a valid target shape.</dd>
</dl>
#### Outputs
<dl>
<dt><tt>outputs[0]</tt>: T</dt>
<dd>top_blob; The ncnn.Mat expanded from the input following the given shape and the broadcast rule of ncnn.</dd>
</dl>
#### Type Constraints
- ncnn.Mat: Mat(float32)
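For intuition, the broadcast behaviour can be illustrated with numpy (a sketch of the semantics, not MMDeploy code; numpy's broadcasting is analogous to ncnn's rule for the supported dims):

```python
import numpy as np

# inputs[0]: the data blob; inputs[1] would supply the target shape (3, 4)
data = np.array([[1.0], [2.0], [3.0]])   # shape (3, 1)
out = np.broadcast_to(data, (3, 4))      # broadcast along the last dim
# each column of `out` repeats the single input column
```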
### Gather
#### Description
Given the data and indices blobs, gather entries along the `axis` dimension of data, indexed by indices.
#### Parameters
| Type | Parameter | Description |
| ----- | --------- | -------------------------------------- |
| `int` | `axis` | Which axis to gather on. Default is 0. |
#### Inputs
<dl>
<dt><tt>inputs[0]</tt>: ncnn.Mat</dt>
<dd>bottom_blobs[0]; An ncnn.Mat of input data.</dd>
<dt><tt>inputs[1]</tt>: ncnn.Mat</dt>
<dd>bottom_blobs[1]; A 1-dim ncnn.Mat of indices on the given axis.</dd>
</dl>
#### Outputs
<dl>
<dt><tt>outputs[0]</tt>: T</dt>
<dd>top_blob; The ncnn.Mat gathered from the given data and indices blobs.</dd>
</dl>
#### Type Constraints
- ncnn.Mat: Mat(float32)
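The gather semantics match numpy's `take` along the chosen axis (an illustrative sketch, not MMDeploy code):

```python
import numpy as np

data = np.array([[0.0, 1.0, 2.0],
                 [3.0, 4.0, 5.0]])
indices = np.array([2, 0])
# pick entries 2 and 0 along axis=1 for every row
out = np.take(data, indices, axis=1)
```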
### Shape
#### Description
Get the shape of the ncnn blobs.
#### Parameters
Shape has no parameters.
#### Inputs
<dl>
<dt><tt>inputs[0]</tt>: ncnn.Mat</dt>
<dd>bottom_blob; An ncnn.Mat of input data.</dd>
</dl>
#### Outputs
<dl>
<dt><tt>outputs[0]</tt>: T</dt>
<dd>top_blob; 1-D ncnn.Mat of shape (bottom_blob.dims,), where `bottom_blob.dims` is the number of input blob dimensions.</dd>
</dl>
#### Type Constraints
- ncnn.Mat: Mat(float32)
### TopK
#### Description
Get the indices and values (optional) of the largest or smallest k elements along the given axis. This op maps to the ONNX ops `TopK`, `ArgMax`, and `ArgMin`.
#### Parameters
| Type | Parameter | Description |
| ----- | ----------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `int` | `axis` | The axis of data which topk calculate on. Default is -1, indicates the last dimension. |
| `int` | `largest` | The binary value which indicates the TopK operator selects the largest or smallest K values. Default is 1, the TopK selects the largest K values. |
| `int` | `sorted` | The binary value of whether returning sorted topk value or not. If not, the topk returns topk values in any order. Default is 1, this operator returns sorted topk values. |
| `int` | `keep_dims` | The binary value of whether keep the reduced dimension or not. Default is 1, each output blob has the same dimension as input blob. |
#### Inputs
<dl>
<dt><tt>inputs[0]</tt>: ncnn.Mat</dt>
<dd>bottom_blob[0]; An ncnn.Mat of input data.</dd>
<dt><tt>inputs[1] (optional)</tt>: ncnn.Mat</dt>
<dd>bottom_blob[1]; An optional ncnn.Mat holding the K of TopK. If this blob does not exist, K defaults to 1.</dd>
</dl>
#### Outputs
<dl>
<dt><tt>outputs[0]</tt>: T</dt>
<dd>top_blob[0]; If outputs has only 1 blob, outputs[0] is the indices blob of topk; if outputs has 2 blobs, outputs[0] is the value blob of topk. This blob is in ncnn.Mat format, with the shape of bottom_blob[0] or the reduced shape of bottom_blob[0].</dd>
<dt><tt>outputs[1]</tt>: T</dt>
<dd>top_blob[1] (optional); If outputs has 2 blobs, outputs[1] is the indices blob of topk. This blob is in ncnn.Mat format, with the shape of bottom_blob[0] or the reduced shape of bottom_blob[0].</dd>
</dl>
#### Type Constraints
- ncnn.Mat: Mat(float32)
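The core behaviour (largest k, sorted, along an axis) can be sketched in numpy. The `topk` helper below is illustrative only and ignores the `keep_dims` parameter:

```python
import numpy as np

def topk(x, k=1, axis=-1, largest=True):
    """Return the (sorted) top-k values and their indices along `axis`."""
    order = np.argsort(x, axis=axis)           # ascending order
    if largest:
        order = np.flip(order, axis=axis)      # descending order
    idx = np.take(order, np.arange(k), axis=axis)
    val = np.take_along_axis(x, idx, axis=axis)
    return val, idx

val, idx = topk(np.array([1.0, 5.0, 3.0, 2.0]), k=2)
# val: the two largest values; idx: their positions in the input
```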
## ONNX Runtime custom ops
<!-- TOC -->
- [ONNX Runtime Ops](#onnx-runtime-ops)
  - [grid_sampler](#grid_sampler)
    - [Description](#description)
    - [Parameters](#parameters)
    - [Inputs](#inputs)
    - [Outputs](#outputs)
    - [Type Constraints](#type-constraints)
  - [MMCVModulatedDeformConv2d](#mmcvmodulateddeformconv2d)
    - [Description](#description-1)
    - [Parameters](#parameters-1)
    - [Inputs](#inputs-1)
    - [Outputs](#outputs-1)
    - [Type Constraints](#type-constraints-1)
  - [NMSRotated](#nmsrotated)
    - [Description](#description-2)
    - [Parameters](#parameters-2)
    - [Inputs](#inputs-2)
    - [Outputs](#outputs-2)
    - [Type Constraints](#type-constraints-2)
  - [RoIAlignRotated](#roialignrotated)
    - [Description](#description-3)
    - [Parameters](#parameters-3)
    - [Inputs](#inputs-3)
    - [Outputs](#outputs-3)
    - [Type Constraints](#type-constraints-3)
  - [NMSMatch](#nmsmatch)
    - [Description](#description-4)
    - [Parameters](#parameters-4)
    - [Inputs](#inputs-4)
    - [Outputs](#outputs-4)
    - [Type Constraints](#type-constraints-4)
<!-- TOC -->
### grid_sampler
#### Description
Sample `input` at the pixel locations specified by `grid`.
#### Parameters
| Type | Parameter | Description |
| ----- | -------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `int` | `interpolation_mode` | Interpolation mode to calculate output values. (0: `bilinear` , 1: `nearest`) |
| `int` | `padding_mode` | Padding mode for outside grid values. (0: `zeros`, 1: `border`, 2: `reflection`) |
| `int` | `align_corners` | If `align_corners=1`, the extrema (`-1` and `1`) are considered as referring to the center points of the input's corner pixels. If `align_corners=0`, they are instead considered as referring to the corner points of the input's corner pixels, making the sampling more resolution agnostic. |
#### Inputs
<dl>
<dt><tt>input</tt>: T</dt>
<dd>Input feature; 4-D tensor of shape (N, C, inH, inW), where N is the batch size, C is the numbers of channels, inH and inW are the height and width of the data.</dd>
<dt><tt>grid</tt>: T</dt>
<dd>Input offset; 4-D tensor of shape (N, outH, outW, 2), where outH and outW are the height and width of offset and output. </dd>
</dl>
#### Outputs
<dl>
<dt><tt>output</tt>: T</dt>
<dd>Output feature; 4-D tensor of shape (N, C, outH, outW).</dd>
</dl>
#### Type Constraints
- T:tensor(float32, Linear)
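The effect of `align_corners` is easiest to see in the coordinate mapping from normalized grid values in [-1, 1] to pixel indices. The helper below is an illustrative sketch mirroring PyTorch's `grid_sample` convention, not MMDeploy code:

```python
def unnormalize(coord, size, align_corners):
    """Map a normalized grid coordinate in [-1, 1] to pixel index space."""
    if align_corners:
        # -1 and 1 refer to the centers of the corner pixels
        return (coord + 1) / 2 * (size - 1)
    # -1 and 1 refer to the outer corners of the corner pixels
    return ((coord + 1) * size - 1) / 2

# for a width of 4: align_corners=1 maps -1 -> 0.0 and 1 -> 3.0,
# while align_corners=0 maps -1 -> -0.5 and 1 -> 3.5
```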
### MMCVModulatedDeformConv2d
#### Description
Perform Modulated Deformable Convolution on the input feature; read [Deformable ConvNets v2: More Deformable, Better Results](https://arxiv.org/abs/1811.11168?from=timeline) for details.
#### Parameters
| Type | Parameter | Description |
| -------------- | ------------------- | ------------------------------------------------------------------------------------- |
| `list of ints` | `stride` | The stride of the convolving kernel. (sH, sW) |
| `list of ints` | `padding` | Paddings on both sides of the input. (padH, padW) |
| `list of ints` | `dilation` | The spacing between kernel elements. (dH, dW) |
| `int` | `deformable_groups` | Groups of deformable offset. |
| `int` | `groups` | Split input into groups. `input_channel` should be divisible by the number of groups. |
#### Inputs
<dl>
<dt><tt>inputs[0]</tt>: T</dt>
<dd>Input feature; 4-D tensor of shape (N, C, inH, inW), where N is the batch size, C is the number of channels, inH and inW are the height and width of the data.</dd>
<dt><tt>inputs[1]</tt>: T</dt>
<dd>Input offset; 4-D tensor of shape (N, deformable_group* 2* kH* kW, outH, outW), where kH and kW are the height and width of weight, outH and outW are the height and width of offset and output.</dd>
<dt><tt>inputs[2]</tt>: T</dt>
<dd>Input mask; 4-D tensor of shape (N, deformable_group* kH* kW, outH, outW), where kH and kW are the height and width of weight, outH and outW are the height and width of offset and output.</dd>
<dt><tt>inputs[3]</tt>: T</dt>
<dd>Input weight; 4-D tensor of shape (output_channel, input_channel, kH, kW).</dd>
<dt><tt>inputs[4]</tt>: T, optional</dt>
<dd>Input bias; 1-D tensor of shape (output_channel).</dd>
</dl>
#### Outputs
<dl>
<dt><tt>outputs[0]</tt>: T</dt>
<dd>Output feature; 4-D tensor of shape (N, output_channel, outH, outW).</dd>
</dl>
#### Type Constraints
- T:tensor(float32, Linear)
### NMSRotated
#### Description
Non Max Suppression for rotated bboxes.
#### Parameters
| Type | Parameter | Description |
| ------- | --------------- | -------------------------- |
| `float` | `iou_threshold` | The IoU threshold for NMS. |
#### Inputs
<dl>
<dt><tt>inputs[0]</tt>: T</dt>
<dd>Input boxes; 2-D tensor of shape (N, 5), where N is the number of rotated bboxes.</dd>
<dt><tt>inputs[1]</tt>: T</dt>
<dd>Input scores; 1-D tensor of shape (N, ), where N is the number of rotated bboxes.</dd>
</dl>
#### Outputs
<dl>
<dt><tt>outputs[0]</tt>: T</dt>
<dd>Output; 1-D tensor of shape (K, ), where K is the number of kept bboxes.</dd>
</dl>
#### Type Constraints
- T:tensor(float32, Linear)
### RoIAlignRotated
#### Description
Perform RoIAlignRotated on the input feature map; used in the bbox_head of most two-stage rotated object detectors.
#### Parameters
| Type | Parameter | Description |
| ------- | ---------------- | ----------------------------------------------------------------------------------------------------------------------------------------- |
| `int` | `output_height` | height of output roi |
| `int` | `output_width` | width of output roi |
| `float` | `spatial_scale` | used to scale the input boxes |
| `int` | `sampling_ratio` | number of input samples to take for each output sample. `0` means to take samples densely for current models. |
| `int` | `aligned` | If `aligned=0`, use the legacy implementation in MMDetection. Else, align the results more perfectly. |
| `int`   | `clockwise`      | If 1, the angle in each proposal follows a clockwise fashion in image space; otherwise, the angle is counterclockwise. Default: 0.        |
#### Inputs
<dl>
<dt><tt>input</tt>: T</dt>
<dd>Input feature map; 4D tensor of shape (N, C, H, W), where N is the batch size, C is the numbers of channels, H and W are the height and width of the data.</dd>
<dt><tt>rois</tt>: T</dt>
<dd>RoIs (Regions of Interest) to pool over; 2-D tensor of shape (num_rois, 6) given as [[batch_index, cx, cy, w, h, theta], ...]. The RoIs' coordinates are the coordinate system of input.</dd>
</dl>
#### Outputs
<dl>
<dt><tt>feat</tt>: T</dt>
<dd>RoI pooled output, 4-D tensor of shape (num_rois, C, output_height, output_width). The r-th batch element feat[r-1] is a pooled feature map corresponding to the r-th RoI RoIs[r-1].</dd>
</dl>
#### Type Constraints
- T:tensor(float32)
### NMSMatch
#### Description
Non Max Suppression with the suppression box match.
#### Parameters
| Type | Parameter | Description |
| ------- | ----------- | --------------------------------- |
| `float` | `iou_thr` | The IoU threshold for NMSMatch. |
| `float` | `score_thr` | The score threshold for NMSMatch. |
#### Inputs
<dl>
<dt><tt>inputs[0]</tt>: T</dt>
<dd>Input boxes; 3-D tensor of shape (b, N, 4), where b is the batch size, N is the number of boxes and 4 means the coordinate.</dd>
<dt><tt>inputs[1]</tt>: T</dt>
<dd>Input scores; 3-D tensor of shape (b, c, N), where b is the batch size, c is the class size and N is the number of boxes.</dd>
</dl>
#### Outputs
<dl>
<dt><tt>outputs[0]</tt>: T</dt>
<dd>Output; 2-D tensor of shape (K, 4), where K is the number of matched boxes and the 4 columns are the batch id, class id, selected-box index and suppressed-box index.</dd>
</dl>
#### Type Constraints
- T:tensor(float32)
## TensorRT custom ops
<!-- TOC -->
- [TensorRT custom ops](#tensorrt-custom-ops)
  - [TRTBatchedNMS](#trtbatchednms)
    - [Description](#description)
    - [Parameters](#parameters)
    - [Inputs](#inputs)
    - [Outputs](#outputs)
    - [Type Constraints](#type-constraints)
  - [grid_sampler](#grid_sampler)
    - [Description](#description-1)
    - [Parameters](#parameters-1)
    - [Inputs](#inputs-1)
    - [Outputs](#outputs-1)
    - [Type Constraints](#type-constraints-1)
  - [MMCVInstanceNormalization](#mmcvinstancenormalization)
    - [Description](#description-2)
    - [Parameters](#parameters-2)
    - [Inputs](#inputs-2)
    - [Outputs](#outputs-2)
    - [Type Constraints](#type-constraints-2)
  - [MMCVModulatedDeformConv2d](#mmcvmodulateddeformconv2d)
    - [Description](#description-3)
    - [Parameters](#parameters-3)
    - [Inputs](#inputs-3)
    - [Outputs](#outputs-3)
    - [Type Constraints](#type-constraints-3)
  - [MMCVMultiLevelRoiAlign](#mmcvmultilevelroialign)
    - [Description](#description-4)
    - [Parameters](#parameters-4)
    - [Inputs](#inputs-4)
    - [Outputs](#outputs-4)
    - [Type Constraints](#type-constraints-4)
  - [MMCVRoIAlign](#mmcvroialign)
    - [Description](#description-5)
    - [Parameters](#parameters-5)
    - [Inputs](#inputs-5)
    - [Outputs](#outputs-5)
    - [Type Constraints](#type-constraints-5)
  - [ScatterND](#scatternd)
    - [Description](#description-6)
    - [Parameters](#parameters-6)
    - [Inputs](#inputs-6)
    - [Outputs](#outputs-6)
    - [Type Constraints](#type-constraints-6)
  - [TRTBatchedRotatedNMS](#trtbatchedrotatednms)
    - [Description](#description-7)
    - [Parameters](#parameters-7)
    - [Inputs](#inputs-7)
    - [Outputs](#outputs-7)
    - [Type Constraints](#type-constraints-7)
  - [GridPriorsTRT](#gridpriorstrt)
    - [Description](#description-8)
    - [Parameters](#parameters-8)
    - [Inputs](#inputs-8)
    - [Outputs](#outputs-8)
    - [Type Constraints](#type-constraints-8)
  - [ScaledDotProductAttentionTRT](#scaleddotproductattentiontrt)
    - [Description](#description-9)
    - [Parameters](#parameters-9)
    - [Inputs](#inputs-9)
    - [Outputs](#outputs-9)
    - [Type Constraints](#type-constraints-9)
  - [GatherTopk](#gathertopk)
    - [Description](#description-10)
    - [Parameters](#parameters-10)
    - [Inputs](#inputs-10)
    - [Outputs](#outputs-10)
    - [Type Constraints](#type-constraints-10)
<!-- TOC -->
### TRTBatchedNMS
#### Description
Batched NMS with a fixed number of output bounding boxes.
#### Parameters
| Type | Parameter | Description |
| ------- | --------------------- | --------------------------------------------------------------------------------------------------------------------------------------- |
| `int` | `background_label_id` | The label ID for the background class. If there is no background class, set it to `-1`. |
| `int` | `num_classes` | The number of classes. |
| `int` | `topK` | The number of bounding boxes to be fed into the NMS step. |
| `int` | `keepTopK` | The number of total bounding boxes to be kept per-image after the NMS step. Should be less than or equal to the `topK` value. |
| `float` | `scoreThreshold` | The scalar threshold for score (low scoring boxes are removed). |
| `float` | `iouThreshold` | The scalar threshold for IoU (new boxes that have high IoU overlap with previously selected boxes are removed). |
| `int` | `isNormalized` | Set to `false` if the box coordinates are not normalized, meaning they are not in the range `[0,1]`. Defaults to `true`. |
| `int` | `clipBoxes` | Forcibly restrict bounding boxes to the normalized range `[0,1]`. Only applicable if `isNormalized` is also `true`. Defaults to `true`. |
#### Inputs
<dl>
<dt><tt>inputs[0]</tt>: T</dt>
<dd>boxes; 4-D tensor of shape (N, num_boxes, num_classes, 4), where N is the batch size; `num_boxes` is the number of boxes; `num_classes` is the number of classes, which could be 1 if the boxes are shared between all classes.</dd>
<dt><tt>inputs[1]</tt>: T</dt>
<dd>scores; 4-D tensor of shape (N, num_boxes, 1, num_classes). </dd>
</dl>
#### Outputs
<dl>
<dt><tt>outputs[0]</tt>: T</dt>
<dd>dets; 3-D tensor of shape (N, valid_num_boxes, 5), `valid_num_boxes` is the number of boxes after NMS. For each row `dets[i,j,:] = [x0, y0, x1, y1, score]`</dd>
<dt><tt>outputs[1]</tt>: tensor(int32, Linear)</dt>
<dd>labels; 2-D tensor of shape (N, valid_num_boxes). </dd>
</dl>
#### Type Constraints
- T:tensor(float32, Linear)
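The NMS step itself follows the classic greedy algorithm. Below is a minimal single-class numpy sketch (illustrative only; the plugin additionally batches over images and classes and pads the output to `keepTopK` boxes):

```python
import numpy as np

def iou(box, boxes):
    """IoU between one [x0, y0, x1, y1] box and an array of boxes."""
    x0 = np.maximum(box[0], boxes[:, 0]); y0 = np.maximum(box[1], boxes[:, 1])
    x1 = np.minimum(box[2], boxes[:, 2]); y1 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x1 - x0, 0, None) * np.clip(y1 - y0, 0, None)
    area_a = (box[2] - box[0]) * (box[3] - box[1])
    area_b = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_threshold=0.5, score_threshold=0.0):
    order = np.argsort(-scores)                      # highest score first
    order = order[scores[order] > score_threshold]
    keep = []
    while order.size:
        i = order[0]
        keep.append(int(i))
        rest = order[1:]
        order = rest[iou(boxes[i], boxes[rest]) <= iou_threshold]
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 1, 10, 10], [20, 20, 30, 30]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
# the second box overlaps the first heavily and is suppressed
```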
### grid_sampler
#### Description
Sample `input` at the pixel locations specified by `grid`.
#### Parameters
| Type | Parameter | Description |
| ----- | -------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `int` | `interpolation_mode` | Interpolation mode to calculate output values. (0: `bilinear` , 1: `nearest`) |
| `int` | `padding_mode` | Padding mode for outside grid values. (0: `zeros`, 1: `border`, 2: `reflection`) |
| `int` | `align_corners` | If `align_corners=1`, the extrema (`-1` and `1`) are considered as referring to the center points of the input's corner pixels. If `align_corners=0`, they are instead considered as referring to the corner points of the input's corner pixels, making the sampling more resolution agnostic. |
#### Inputs
<dl>
<dt><tt>inputs[0]</tt>: T</dt>
<dd>Input feature; 4-D tensor of shape (N, C, inH, inW), where N is the batch size, C is the numbers of channels, inH and inW are the height and width of the data.</dd>
<dt><tt>inputs[1]</tt>: T</dt>
<dd>Input offset; 4-D tensor of shape (N, outH, outW, 2), where outH and outW are the height and width of offset and output. </dd>
</dl>
#### Outputs
<dl>
<dt><tt>outputs[0]</tt>: T</dt>
<dd>Output feature; 4-D tensor of shape (N, C, outH, outW).</dd>
</dl>
#### Type Constraints
- T:tensor(float32, Linear)
### MMCVInstanceNormalization
#### Description
Carry out instance normalization as described in the paper [Instance Normalization: The Missing Ingredient for Fast Stylization](https://arxiv.org/abs/1607.08022).
y = scale * (x - mean) / sqrt(variance + epsilon) + B, where mean and variance are computed per instance per channel.
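The formula above can be written out directly in numpy (an illustrative sketch of the op's semantics for the 4-D image case, not MMDeploy code):

```python
import numpy as np

def instance_norm(x, scale, B, epsilon=1e-5):
    """x: (N, C, H, W); mean and variance are computed per instance, per channel."""
    mean = x.mean(axis=(2, 3), keepdims=True)
    var = x.var(axis=(2, 3), keepdims=True)
    y = scale[None, :, None, None] * (x - mean) / np.sqrt(var + epsilon)
    return y + B[None, :, None, None]

x = np.arange(24, dtype=np.float64).reshape(1, 2, 3, 4)
y = instance_norm(x, scale=np.ones(2), B=np.zeros(2))
# with unit scale and zero bias, each (n, c) slice has mean ~0 and variance ~1
```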
#### Parameters
| Type | Parameter | Description |
| ------- | --------- | -------------------------------------------------------------------- |
| `float` | `epsilon` | The epsilon value to use to avoid division by zero. Default is 1e-05 |
#### Inputs
<dl>
<dt><tt>input</tt>: T</dt>
<dd>Input data tensor from the previous operator; dimensions for image case are (N x C x H x W), where N is the batch size, C is the number of channels, and H and W are the height and the width of the data. For non image case, the dimensions are in the form of (N x C x D1 x D2 ... Dn), where N is the batch size.</dd>
<dt><tt>scale</tt>: T</dt>
<dd>The input 1-dimensional scale tensor of size C.</dd>
<dt><tt>B</tt>: T</dt>
<dd>The input 1-dimensional bias tensor of size C.</dd>
</dl>
#### Outputs
<dl>
<dt><tt>output</tt>: T</dt>
<dd>The output tensor of the same shape as input.</dd>
</dl>
#### Type Constraints
- T:tensor(float32, Linear)
### MMCVModulatedDeformConv2d
#### Description
Perform Modulated Deformable Convolution on the input feature. Read [Deformable ConvNets v2: More Deformable, Better Results](https://arxiv.org/abs/1811.11168?from=timeline) for details.
#### Parameters
| Type | Parameter | Description |
| -------------- | ------------------ | ------------------------------------------------------------------------------------- |
| `list of ints` | `stride` | The stride of the convolving kernel. (sH, sW) |
| `list of ints` | `padding` | Paddings on both sides of the input. (padH, padW) |
| `list of ints` | `dilation` | The spacing between kernel elements. (dH, dW) |
| `int` | `deformable_group` | Groups of deformable offset. |
| `int` | `group` | Split input into groups. `input_channel` should be divisible by the number of groups. |
#### Inputs
<dl>
<dt><tt>inputs[0]</tt>: T</dt>
<dd>Input feature; 4-D tensor of shape (N, C, inH, inW), where N is the batch size, C is the number of channels, inH and inW are the height and width of the data.</dd>
<dt><tt>inputs[1]</tt>: T</dt>
<dd>Input offset; 4-D tensor of shape (N, deformable_group* 2* kH* kW, outH, outW), where kH and kW are the height and width of weight, outH and outW are the height and width of offset and output.</dd>
<dt><tt>inputs[2]</tt>: T</dt>
<dd>Input mask; 4-D tensor of shape (N, deformable_group* kH* kW, outH, outW), where kH and kW are the height and width of weight, outH and outW are the height and width of offset and output.</dd>
<dt><tt>inputs[3]</tt>: T</dt>
<dd>Input weight; 4-D tensor of shape (output_channel, input_channel, kH, kW).</dd>
<dt><tt>inputs[4]</tt>: T, optional</dt>
<dd>Input bias; 1-D tensor of shape (output_channel).</dd>
</dl>
#### Outputs
<dl>
<dt><tt>outputs[0]</tt>: T</dt>
<dd>Output feature; 4-D tensor of shape (N, output_channel, outH, outW).</dd>
</dl>
#### Type Constraints
- T:tensor(float32, Linear)
### MMCVMultiLevelRoiAlign
#### Description
Perform RoIAlign on features from multiple levels. Used in bbox_head of most two-stage detectors.
#### Parameters
| Type | Parameter | Description |
| ---------------- | ------------------ | ------------------------------------------------------------------------------------------------------------- |
| `int` | `output_height` | height of output roi. |
| `int` | `output_width` | width of output roi. |
| `list of floats` | `featmap_strides` | feature map stride of each level. |
| `int` | `sampling_ratio` | number of input samples to take for each output sample. `0` means to take samples densely for current models. |
| `float` | `roi_scale_factor` | RoIs will be scaled by this factor before RoI Align. |
| `int` | `finest_scale` | Scale threshold of mapping to level 0. Default: 56. |
| `int` | `aligned` | If `aligned=0`, use the legacy implementation in MMDetection. Else, align the results more perfectly. |
#### Inputs
<dl>
<dt><tt>inputs[0]</tt>: T</dt>
<dd>RoIs (Regions of Interest) to pool over; 2-D tensor of shape (num_rois, 5) given as [[batch_index, x1, y1, x2, y2], ...].</dd>
<dt><tt>inputs[1~]</tt>: T</dt>
<dd>Input feature map; 4D tensor of shape (N, C, H, W), where N is the batch size, C is the numbers of channels, H and W are the height and width of the data.</dd>
</dl>
#### Outputs
<dl>
<dt><tt>outputs[0]</tt>: T</dt>
<dd>RoI pooled output, 4-D tensor of shape (num_rois, C, output_height, output_width). The r-th batch element output[0][r-1] is a pooled feature map corresponding to the r-th RoI inputs[1][r-1].</dd>
</dl>
#### Type Constraints
- T:tensor(float32, Linear)
### MMCVRoIAlign
#### Description
Perform RoIAlign on output feature, used in bbox_head of most two-stage detectors.
#### Parameters
| Type | Parameter | Description |
| ------- | ---------------- | ------------------------------------------------------------------------------------------------------------- |
| `int` | `output_height` | height of output roi |
| `int` | `output_width` | width of output roi |
| `float` | `spatial_scale` | used to scale the input boxes |
| `int` | `sampling_ratio` | number of input samples to take for each output sample. `0` means to take samples densely for current models. |
| `str` | `mode` | pooling mode in each bin. `avg` or `max` |
| `int` | `aligned` | If `aligned=0`, use the legacy implementation in MMDetection. Else, align the results more perfectly. |
#### Inputs
<dl>
<dt><tt>inputs[0]</tt>: T</dt>
<dd>Input feature map; 4D tensor of shape (N, C, H, W), where N is the batch size, C is the numbers of channels, H and W are the height and width of the data.</dd>
<dt><tt>inputs[1]</tt>: T</dt>
<dd>RoIs (Regions of Interest) to pool over; 2-D tensor of shape (num_rois, 5) given as [[batch_index, x1, y1, x2, y2], ...]. The RoIs' coordinates are the coordinate system of inputs[0].</dd>
</dl>
#### Outputs
<dl>
<dt><tt>outputs[0]</tt>: T</dt>
<dd>RoI pooled output, 4-D tensor of shape (num_rois, C, output_height, output_width). The r-th batch element output[0][r-1] is a pooled feature map corresponding to the r-th RoI inputs[1][r-1].</dd>
</dl>
#### Type Constraints
- T:tensor(float32, Linear)
### ScatterND
#### Description
ScatterND takes three inputs: a `data` tensor of rank r >= 1, an `indices` tensor of rank q >= 1, and an `updates` tensor of rank q + r - indices.shape[-1] - 1. The output of the operation is produced by creating a copy of the input `data`, and then updating its values to those specified by `updates` at the index positions specified by `indices`. Its output shape is the same as the shape of `data`. Note that `indices` should not have duplicate entries; that is, two or more updates for the same index location are not supported.
The `output` is calculated via the following equation:
```python
output = np.copy(data)
update_indices = indices.shape[:-1]
for idx in np.ndindex(update_indices):
    output[indices[idx]] = updates[idx]
```
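Concretely, with `indices.shape[-1] == 1` the update rows replace whole slices of `data`. The following is a runnable instance of the pseudocode above (values chosen for illustration):

```python
import numpy as np

data = np.zeros((4, 3), dtype=np.float32)
indices = np.array([[1], [3]])                  # q = 2, indices.shape[-1] = 1
updates = np.array([[1.0, 1.0, 1.0],
                    [2.0, 2.0, 2.0]], dtype=np.float32)

output = np.copy(data)
for idx in np.ndindex(indices.shape[:-1]):      # iterates over (0,) and (1,)
    output[tuple(indices[idx])] = updates[idx]
# rows 1 and 3 of `output` now hold the update rows
```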
#### Parameters
None
#### Inputs
<dl>
<dt><tt>inputs[0]</tt>: T</dt>
<dd>Tensor of rank r>=1.</dd>
<dt><tt>inputs[1]</tt>: tensor(int32, Linear)</dt>
<dd>Tensor of rank q>=1.</dd>
<dt><tt>inputs[2]</tt>: T</dt>
<dd>Tensor of rank q + r - indices_shape[-1] - 1.</dd>
</dl>
#### Outputs
<dl>
<dt><tt>outputs[0]</tt>: T</dt>
<dd>Tensor of rank r >= 1.</dd>
</dl>
#### Type Constraints
- T:tensor(float32, Linear), tensor(int32, Linear)
### TRTBatchedRotatedNMS
#### Description
Batched rotated NMS with a fixed number of output bounding boxes.
#### Parameters
| Type | Parameter | Description |
| ------- | --------------------- | --------------------------------------------------------------------------------------------------------------------------------------- |
| `int` | `background_label_id` | The label ID for the background class. If there is no background class, set it to `-1`. |
| `int` | `num_classes` | The number of classes. |
| `int` | `topK` | The number of bounding boxes to be fed into the NMS step. |
| `int` | `keepTopK` | The number of total bounding boxes to be kept per-image after the NMS step. Should be less than or equal to the `topK` value. |
| `float` | `scoreThreshold` | The scalar threshold for score (low scoring boxes are removed). |
| `float` | `iouThreshold` | The scalar threshold for IoU (new boxes that have high IoU overlap with previously selected boxes are removed). |
| `int` | `isNormalized` | Set to `false` if the box coordinates are not normalized, meaning they are not in the range `[0,1]`. Defaults to `true`. |
| `int` | `clipBoxes` | Forcibly restrict bounding boxes to the normalized range `[0,1]`. Only applicable if `isNormalized` is also `true`. Defaults to `true`. |
#### Inputs
<dl>
<dt><tt>inputs[0]</tt>: T</dt>
<dd>boxes; 4-D tensor of shape (N, num_boxes, num_classes, 5), where N is the batch size; `num_boxes` is the number of boxes; `num_classes` is the number of classes, which could be 1 if the boxes are shared between all classes.</dd>
<dt><tt>inputs[1]</tt>: T</dt>
<dd>scores; 4-D tensor of shape (N, num_boxes, 1, num_classes). </dd>
</dl>
#### Outputs
<dl>
<dt><tt>outputs[0]</tt>: T</dt>
<dd>dets; 3-D tensor of shape (N, valid_num_boxes, 6), `valid_num_boxes` is the number of boxes after NMS. For each row `dets[i,j,:] = [x0, y0, width, height, theta, score]`</dd>
<dt><tt>outputs[1]</tt>: tensor(int32, Linear)</dt>
<dd>labels; 2-D tensor of shape (N, valid_num_boxes). </dd>
</dl>
#### Type Constraints
- T:tensor(float32, Linear)
### GridPriorsTRT
#### Description
Generate the anchors for object detection task.
#### Parameters
| Type | Parameter | Description |
| ----- | ---------- | --------------------------------- |
| `int` | `stride_w` | The stride of the feature width. |
| `int` | `stride_h` | The stride of the feature height. |
#### Inputs
<dl>
<dt><tt>inputs[0]</tt>: T</dt>
<dd>The base anchors; 2-D tensor with shape [num_base_anchor, 4].</dd>
<dt><tt>inputs[1]</tt>: TAny</dt>
<dd>height provider; 1-D tensor with shape [featmap_height]. The data itself is never used; only the shape matters.</dd>
<dt><tt>inputs[2]</tt>: TAny</dt>
<dd>width provider; 1-D tensor with shape [featmap_width]. The data itself is never used; only the shape matters.</dd>
</dl>
#### Outputs
<dl>
<dt><tt>outputs[0]</tt>: T</dt>
<dd>output anchors; 2-D tensor of shape (num_base_anchor*featmap_height*featmap_width, 4).</dd>
</dl>
#### Type Constraints
- T:tensor(float32, Linear)
- TAny: Any
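A pure-Python sketch of what this op computes, inferred from the shapes above. The real plugin runs on the GPU; `grid_priors` here is an illustrative helper, and the plugin's exact output ordering may differ:

```python
def grid_priors(base_anchors, featmap_h, featmap_w, stride_h, stride_w):
    """Shift each base anchor to every feature-map location.

    Returns anchors of shape (num_base_anchor * featmap_h * featmap_w, 4),
    matching the output shape documented above.
    """
    anchors = []
    for y in range(featmap_h):
        for x in range(featmap_w):
            shift_x, shift_y = x * stride_w, y * stride_h
            for (x1, y1, x2, y2) in base_anchors:
                anchors.append([x1 + shift_x, y1 + shift_y,
                                x2 + shift_x, y2 + shift_y])
    return anchors

base = [[-8.0, -8.0, 8.0, 8.0]]  # one 16x16 base anchor centered at origin
out = grid_priors(base, featmap_h=2, featmap_w=2, stride_h=16, stride_w=16)
print(len(out))  # 1 * 2 * 2 = 4 anchors
```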
### ScaledDotProductAttentionTRT
#### Description
Dot-product attention used to support multi-head attention; read [Attention Is All You Need](https://arxiv.org/abs/1706.03762?context=cs) for more details.
#### Parameters
None
#### Inputs
<dl>
<dt><tt>inputs[0]</tt>: T</dt>
<dd>query; 3-D tensor with shape [batch_size, sequence_length, embedding_size].</dd>
<dt><tt>inputs[1]</tt>: T</dt>
<dd>key; 3-D tensor with shape [batch_size, sequence_length, embedding_size].</dd>
<dt><tt>inputs[2]</tt>: T</dt>
<dd>value; 3-D tensor with shape [batch_size, sequence_length, embedding_size].</dd>
<dt><tt>inputs[3]</tt>: T</dt>
<dd>mask; 2-D/3-D tensor with shape [sequence_length, sequence_length] or [batch_size, sequence_length, sequence_length]. Optional.</dd>
</dl>
#### Outputs
<dl>
<dt><tt>outputs[0]</tt>: T</dt>
<dd>3-D tensor of shape [batch_size, sequence_length, embedding_size]. `softmax(q@k.T)@v`</dd>
<dt><tt>outputs[1]</tt>: T</dt>
<dd>3-D tensor of shape [batch_size, sequence_length, sequence_length]. `softmax(q@k.T)`</dd>
</dl>
#### Type Constraints
- T:tensor(float32, Linear)
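The two outputs can be sketched for a single batch item in plain Python. Note this mirrors the `softmax(q@k.T)@v` formula given above verbatim; the `1/sqrt(embedding_size)` scaling suggested by the op name, the batching, and the optional mask are omitted for brevity:

```python
import math

def attention(q, k, v):
    """Sketch of the op for one batch item: softmax(q @ k.T) @ v.

    q, k, v are [sequence_length][embedding_size] nested lists.
    """
    seq, dim = len(q), len(v[0])
    # scores[i][j] = dot(q[i], k[j])
    scores = [[sum(a * b for a, b in zip(q[i], k[j])) for j in range(seq)]
              for i in range(seq)]
    # row-wise softmax -> outputs[1] of the plugin
    weights = []
    for row in scores:
        m = max(row)
        exp = [math.exp(s - m) for s in row]
        z = sum(exp)
        weights.append([e / z for e in exp])
    # weighted sum of value rows -> outputs[0] of the plugin
    out = [[sum(w * v[j][d] for j, w in enumerate(row)) for d in range(dim)]
           for row in weights]
    return out, weights

q = k = v = [[1.0, 0.0], [0.0, 1.0]]
out, weights = attention(q, k, v)
```

With identity-like inputs, each query attends most to its matching key, so the diagonal of `weights` dominates.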
### GatherTopk
#### Description
TensorRT 8.2~8.4 may give unexpected results for multi-index gather:
```python
data[batch_index, bbox_index, ...]
```
Read [this](https://github.com/NVIDIA/TensorRT/issues/2299) for more details.
#### Parameters
None
#### Inputs
<dl>
<dt><tt>inputs[0]</tt>: T</dt>
<dd>Tensor to be gathered, with shape (A0, ..., An, G0, C0, ...).</dd>
<dt><tt>inputs[1]</tt>: tensor(int32, Linear)</dt>
<dd>Tensor of indices, with shape (A0, ..., An, G1).</dd>
</dl>
#### Outputs
<dl>
<dt><tt>outputs[0]</tt>: T</dt>
<dd>Output tensor, with shape (A0, ..., An, G1, C0, ...).</dd>
</dl>
#### Type Constraints
- T:tensor(float32, Linear), tensor(int32, Linear)
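A pure-Python sketch of the gather semantics, with a single batch dimension (`A0`) and one trailing dimension for brevity; `gather_topk` here is an illustrative helper, not the plugin API:

```python
def gather_topk(data, index):
    """Gather along the first non-batch axis.

    data:  [A0][G0][...]   index: [A0][G1]   ->   out: [A0][G1][...]
    Equivalent to data[batch_index, bbox_index, ...] in the snippet above.
    """
    return [[data[b][i] for i in index[b]] for b in range(len(data))]

data = [[[0, 0], [1, 1], [2, 2]],
        [[3, 3], [4, 4], [5, 5]]]  # shape (2, 3, 2)
index = [[2, 0], [1, 1]]           # shape (2, 2)
print(gather_topk(data, index))
# [[[2, 2], [0, 0]], [[4, 4], [4, 4]]]
```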
# Add unit tests for backend ops
This tutorial describes how to add unit tests for backend ops. When adding a custom op under the backend_ops directory, a corresponding unit test is required. The unit tests for ops live in `tests/test_ops/test_ops.py`.
After adding a new custom op, the project must be rebuilt; refer to [build.md](../01-how-to-build/build_from_source.md).
## Example of an op unit test
```python
@pytest.mark.parametrize('backend', [TEST_TENSORRT, TEST_ONNXRT]) # 1.1 backend test class
@pytest.mark.parametrize('pool_h,pool_w,spatial_scale,sampling_ratio', # 1.2 set parameters of op
[(2, 2, 1.0, 2), (4, 4, 2.0, 4)]) # [(# Examples of op test parameters),...]
def test_roi_align(backend,
pool_h, # set parameters of op
pool_w,
spatial_scale,
sampling_ratio,
input_list=None,
save_dir=None):
backend.check_env()
if input_list is None:
input = torch.rand(1, 1, 16, 16, dtype=torch.float32) # 1.3 op input data initialization
single_roi = torch.tensor([[0, 0, 0, 4, 4]], dtype=torch.float32)
else:
input = torch.tensor(input_list[0], dtype=torch.float32)
single_roi = torch.tensor(input_list[1], dtype=torch.float32)
from mmcv.ops import roi_align
def wrapped_function(torch_input, torch_rois): # 1.4 initialize op model to be tested
return roi_align(torch_input, torch_rois, (pool_w, pool_h),
spatial_scale, sampling_ratio, 'avg', True)
wrapped_model = WrapFunction(wrapped_function).eval()
with RewriterContext(cfg={}, backend=backend.backend_name, opset=11): # 1.5 call the backend test class interface
backend.run_and_validate(
wrapped_model, [input, single_roi],
'roi_align',
input_names=['input', 'rois'],
output_names=['roi_feat'],
save_dir=save_dir)
```
mmdeploy supports models in two formats:
- torch model: see the roi_align unit test; the op's related Python code is required
- onnx model: see the multi_level_roi_align unit test; the model is built via the onnx API
Call `run_and_validate` to run the test:
```python
def run_and_validate(self,
model,
input_list,
model_name='tmp',
tolerate_small_mismatch=False,
do_constant_folding=True,
dynamic_axes=None,
output_names=None,
input_names=None,
expected_result=None,
save_dir=None):
```
#### Parameter Description
| Parameter | Description |
| :---------------------: | :-----------------------------------: |
| model | the input model to be tested |
| input_list | list of test data, mapped to the order of input_names |
| tolerate_small_mismatch | whether to tolerate small numerical mismatches in validation |
| do_constant_folding | whether to apply constant folding |
| output_names | names of the output nodes |
| input_names | names of the input nodes |
| expected_result | expected ground truth |
| save_dir | directory to save results |
## Run the tests
Invoke the op tests with `pytest`:
```bash
pytest tests/test_ops/test_ops.py::test_XXXX
```
# What each mmdeploy directory does
This article explains what each mmdeploy directory does and how a model travels all the way to a concrete inference backend.
## 1. A quick look at the directory layout
The whole mmdeploy repo can be seen as two fairly independent parts: model conversion and the SDK.
We walk through the repo layout and the role of each directory; there is no need to dig into the source code, a rough impression is enough.
Top-level directories:
```bash
$ cd /path/to/mmdeploy
$ tree -L 1
.
├── CMakeLists.txt # cmake config for building the SDK and the custom ops used in model conversion
├── configs # algorithm-repo configs used by model conversion
├── csrc # SDK and custom ops
├── demo # ffi sample apps in each language, e.g. csharp, java, python
├── docker # docker build
├── mmdeploy # the python package used for model conversion
├── requirements # python package dependencies
├── service # some small boards cannot run python, so conversion uses a client/server mode; this directory holds the server
├── tests # unit tests
├── third_party # third-party dependencies needed by the SDK and the ffi
└── tools # tools, the entry point for everything. Besides deploy.py there are onnx2xx.py, profiler.py and test.py
```
This should make it roughly clear that
- model conversion mainly involves tools + mmdeploy + a small part of the csrc directory;
- while the SDK itself lives in the csrc + third_party + demo directories.
## 2. Model conversion
We take mmpretrain's ViT as the example model and ncnn as the example inference backend; other models and backends work similarly.
Let's look at the mmdeploy/mmdeploy directory layout; a rough impression is enough:
```bash
.
├── apis # implementations of the apis used by the tools scripts, e.g. onnx2ncnn.py
│   ├── calibration.py # trt-only, collects quantization data
│   ├── core # software scaffolding
│   ├── extract_model.py # for exporting only part of an onnx model, i.e. slicing onnx
│   ├── inference.py # abstract function; actually calls the concrete torch/ncnn inference
│   ├── ncnn # references the functions in backend/ncnn, just a thin wrapper
│   └── visualize.py # again an abstract function; actually calls the concrete torch/ncnn inference and visualize
..
├── backend # the concrete backend wrappers
│   ├── base # with multiple backends, an OO design with a base class is needed
│   ├── ncnn # calls the ncnn python interface for model conversion
│   │   ├── init_plugins.py # finds the paths of the ncnn custom ops and the ncnn tools
│   │   ├── onnx2ncnn.py # wraps `mmdeploy_onnx2ncnn` as a python interface
│   │   ├── quant.py # wraps the `ncnn2int8` tool as a python interface
│   │   └── wrapper.py # wraps the pyncnn forward interface
..
├── codebase # forward rewrites for the mm-series algorithms
│   ├── base # with multiple algorithm repos, some OO design is needed
│   ├── mmpretrain # rewrites related to mmpretrain models
│   │   ├── deploy # mmpretrain implementations of the abstract task/model/codebase in base
│   │   └── models # the actual model rewrites start here
│   │   ├── backbones # backbone rewrites, e.g. multiheadattention
│   │   ├── heads # e.g. MultiLabelClsHead
│   │   ├── necks # e.g. GlobalAveragePooling
│..
├── core # software scaffolding: how the rewrite mechanism is implemented
├── mmcv # some mmcv ops also need rewriting
├── pytorch # torch op rewrites for ncnn, e.g. Gemm
..
```
Every line above is worth reading; please do not skip any.
When you run `tools/deploy.py` to convert ViT, three things happen at the core:
1. rewriting the forward pass of mmpretrain ViT
2. implementing the gather op, which ncnn does not support, as a custom op loaded together with libncnn.so
3. actually running an inference and rendering the result to make sure everything is correct
### 1. Forward rewriting
Because onnx export produces fragmented operators and ncnn does not support onnx perfectly, mmdeploy's approach is to hijack the problematic forward code and change it into onnx output that suits ncnn.
For example, the `conv -> shape -> concat_const -> reshape` sequence is rewritten into `conv -> reshape`, trimming the redundant `shape` and `concat` operators.
All mmpretrain rewrites live in the `mmdeploy/codebase/mmpretrain/models` directory.
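That hijacking is essentially registry-driven monkey patching. A toy stand-alone sketch of the idea (all names here are hypothetical stand-ins, not mmdeploy's real API — the real mechanism is `FUNCTION_REWRITER` in `mmdeploy/core`):

```python
# A registry of replacement functions, keyed by the name of the function
# they override.
REWRITERS = {}

def register_rewriter(func_name):
    def wrap(new_func):
        REWRITERS[func_name] = new_func
        return new_func
    return wrap

def original_forward(x):
    # imagine this path would export the redundant shape -> concat -> reshape
    return ('dynamic', x)

@register_rewriter('original_forward')
def static_forward(x):
    # ncnn-friendly path: a single static reshape instead
    return ('static', x)

# at export time, the registered rewrite is looked up and called in place of
# the original function
forward = REWRITERS.get('original_forward', original_forward)
print(forward(3))  # ('static', 3)
```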
### 2. Custom operators
The operators customized for ncnn live in the `csrc/mmdeploy/backend_ops/ncnn/` directory; after compilation they are loaded together with libncnn.so. Essentially this is hot-fixing ncnn. Currently implemented are:
- topk
- tensorslice
- shape
- gather
- expand
- constantofshape
### 3. Conversion and testing
ncnn compatibility is good: conversion uses the modified `mmdeploy_onnx2ncnn`, and inference wraps `pyncnn` plus the custom ops.
For frameworks like snpe that cannot run python, a client/server mode is used: a server is wrapped with a protocol such as gRPC, which forwards the real inference results.
Rendering uses the upstream algorithm repo's rendering API; mmdeploy does no drawing of its own.
## 3. SDK
After model conversion, the C++-compiled SDK can run the model on different platforms.
Let's look at the csrc/mmdeploy directory layout:
```bash
.
├── apis # ffi interfaces for Csharp, java, go, Rust, etc.
├── backend_ops # custom operators for each inference backend
├── CMakeLists.txt
├── codebase # result types preferred by each mm algorithm repo, e.g. detection tasks mostly use bbox
├── core # scaffolding: abstractions of graph, operator and device
├── device # implementations of the CPU/GPU device abstraction
├── execution # implementation of the execution abstraction
├── graph # implementation of the graph abstraction
├── model # two working-directory formats: zip-compressed and uncompressed
├── net # the concrete net implementations, e.g. wrapping the ncnn forward C API
├── preprocess # preprocessing implementations
└── utils # OCV utilities
```
In essence the SDK designs a computation-graph abstraction that schedules, for **multiple models**,
- preprocessing
- inference
- postprocessing
while providing ffi for multiple languages.
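A toy Python stand-in for that scheduling idea (purely illustrative; the SDK implements it in C++ with the core/graph/execution abstractions listed above):

```python
# Chain preprocess -> inference -> postprocess stages, the way the SDK's
# graph schedules them for each model in a pipeline. All stages here are
# toy stand-ins for real transforms and nets.
def run_pipeline(stages, data):
    for stage in stages:
        data = stage(data)
    return data

def preprocess(img):
    # e.g. normalization to [0, 1]
    return {'tensor': [p / 255.0 for p in img]}

def infer(blob):
    # stand-in for a backend net's forward
    return {'logits': [sum(blob['tensor'])]}

def postprocess(out):
    # stand-in for a codebase-specific result type
    return {'score': out['logits'][0]}

result = run_pipeline([preprocess, infer, postprocess], [51, 102, 153])
print(result)
```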
# How to split an onnx model
MMDeploy supports exporting a PyTorch model to onnx and splitting it into several onnx files. Users can freely mark nodes in the model graph and build arbitrary splitting strategies from those marked nodes. In this tutorial we show how to split an onnx model through a concrete example: our goal is to split the YOLOV3 model into two parts, keeping the onnx model without postprocessing and discarding the postprocessing part that contains anchor generation and NMS.
## Step 1: add model mark points
For graph splitting we define a `Mark` op type that marks the export boundaries. Implementation-wise, the `mark` decorator marks a function's input and output `Tensor`s. Note that a mark function only takes effect when it is executed inside some rewritten function.
To split YOLOV3, we first mark the model input. For generality, we mark the `img` `Tensor` in the `forward` method of the detector base class `BaseDetector`; to support other splitting schemes, the outputs of `forward` are marked too, namely `dets`, `labels` and `masks`. The code below is an excerpt from [mmdeploy/codebase/mmdet/models/detectors/single_stage.py](https://github.com/open-mmlab/mmdeploy/blob/4fc8828af84281b62be143012cd9f9dafd1e7cc2/mmdeploy/codebase/mmdet/models/detectors/single_stage.py): the `mark` decorator marks the inputs and outputs of `__forward_impl`, which is called inside the rewritten function `base_detector__forward`, thereby marking the detector input.
```python
from mmdeploy.core import FUNCTION_REWRITER, mark
@mark(
'detector_forward', inputs=['input'], outputs=['dets', 'labels', 'masks'])
def __forward_impl(self, img, img_metas=None, **kwargs):
...
@FUNCTION_REWRITER.register_rewriter(
'mmdet.models.detectors.base.BaseDetector.forward')
def base_detector__forward(self, img, img_metas=None, **kwargs):
...
# call the mark function
return __forward_impl(...)
```
Next, we only need to mark the output feature `Tensor`s of the last layer of `YOLOV3Head` to split the whole `YOLOV3` model into two parts. Reading the `mmdet` source shows that the input argument `pred_maps` of `YOLOV3Head`'s `get_bboxes` method is exactly the split point we want, so we can add an inner function inside the rewritten function [`yolov3_head__get_bboxes`](https://github.com/open-mmlab/mmdeploy/blob/4fc8828af84281b62be143012cd9f9dafd1e7cc2/mmdeploy/codebase/mmdet/models/dense_heads/yolo_head.py#L16) to mark `pred_maps`; see the sample code below. Note that `pred_maps` is a list of three `Tensor`s, so three `Mark` nodes are added to the onnx model.
```python
from mmdeploy.core import FUNCTION_REWRITER, mark
@FUNCTION_REWRITER.register_rewriter(
func_name='mmdet.models.dense_heads.YOLOV3Head.get_bboxes')
def yolov3_head__get_bboxes(self,
pred_maps,
img_metas,
cfg=None,
rescale=False,
with_nms=True):
# mark pred_maps
@mark('yolo_head', inputs=['pred_maps'])
def __mark_pred_maps(pred_maps):
return pred_maps
pred_maps = __mark_pred_maps(pred_maps)
...
```
## Step 2: add a deployment config file
After marking the nodes, we need to create a deployment config file. Assume the deployment backend is `onnxruntime` and the model input is a fixed `608x608`, so we add the file `configs/mmdet/detection/yolov3_partition_onnxruntime_static.py`. Basic settings such as `onnx_config` go into the config; if you are not yet familiar with writing config files, see [write_config.md](../02-how-to-run/write_config.md).
In this deployment config we need to add a special partition field `partition_config`. In it we can give the partition policy a type name, e.g. `yolov3_partition`, and set `apply_marks=True`. In `partition_cfg` we specify each sub-model's start points `start`, end points `end`, and the filename for the saved partitioned onnx. Note that each sub-model's `start` and `end` consist of several `Mark` nodes; for example, `'detector_forward:input'` refers to the mark node produced at the input of the `detector_forward` mark. See the code below for the full config:
```python
_base_ = ['./detection_onnxruntime_static.py']
onnx_config = dict(input_shape=[608, 608])
partition_config = dict(
type='yolov3_partition', # the partition policy name
apply_marks=True, # should always be set to True
partition_cfg=[
dict(
save_file='yolov3.onnx', # filename to save the partitioned onnx model
start=['detector_forward:input'], # [mark_name:input/output, ...]
end=['yolo_head:input'], # [mark_name:input/output, ...]
output_names=[f'pred_maps.{i}' for i in range(3)]) # output names
])
```
## Step 3: split the onnx model
With the marks and the deployment config in place, we can use the `tools/torch2onnx.py` tool to export the complete onnx model carrying the `Mark` nodes and extract the partitioned onnx files according to the partition policy. Running the script below yields `yolov3.onnx`, the `YOLOV3` onnx model without postprocessing; the output directory also contains `end2end.onnx`, the complete model with the `Mark` nodes. You can inspect and verify the structure of the output onnx models with the web-based visualizer [netron](https://netron.app/).
```shell
python tools/torch2onnx.py \
configs/mmdet/detection/yolov3_partition_onnxruntime_static.py \
../mmdetection/configs/yolo/yolov3_d53_8xb8-ms-608-273e_coco.py \
https://download.openmmlab.com/mmdetection/v2.0/yolo/yolov3_d53_mstrain-608_273e_coco/yolov3_d53_mstrain-608_273e_coco_20210518_115020-a2c3acb8.pth \
../mmdetection/demo/demo.jpg \
--work-dir ./work-dirs/mmdet/yolov3/ort/partition
```
With the partitioned onnx models in hand, we can continue deployment with the other tools mmdeploy provides, such as `mmdeploy_onnx2ncnn` and `onnx2tensorrt`.
# How to run regression tests
This tutorial describes how to run regression tests. The deployment configuration consists of `each codebase's regression config file` together with `the inference backend config info`.
<!-- TOC -->
- [How to run regression tests](#how-to-run-regression-tests)
  - [1. Environment setup](#1-environment-setup)
    - [Installing and configuring MMDeploy](#installing-and-configuring-mmdeploy)
    - [Python dependencies](#python-dependencies)
  - [2. Usage](#2-usage)
    - [Arguments](#arguments)
    - [Notes](#notes)
    - [Examples](#examples)
  - [3. Regression test config files](#3-regression-test-config-files)
    - [Example and explanation](#example-and-explanation)
  - [4. Generated report](#4-generated-report)
    - [Template](#template)
    - [Example](#example)
  - [5. Supported backends](#5-supported-backends)
  - [6. Supported codebases and their metrics](#6-supported-codebases-and-their-metrics)
  - [7. Notes](#7-notes)
  - [8. FAQ](#8-faq)
<!-- TOC -->
## 1. Environment setup
### Installing and configuring MMDeploy
Everything in this section requires MMDeploy to be installed and configured in advance, following the [build docs](../01-how-to-build/build_from_source.md).
### Python dependencies
Install the test requirements:
```shell
pip install -r requirements/tests.txt
```
If numpy raises errors during use, update it:
```shell
pip install -U numpy
```
## 2. Usage
```shell
python ./tools/regression_test.py \
--codebase "${CODEBASE_NAME}" \
--backends "${BACKEND}" \
[--models "${MODELS}"] \
--work-dir "${WORK_DIR}" \
--device "${DEVICE}" \
--log-level INFO \
[--performance-p] \
[--checkpoint-dir "$CHECKPOINT_DIR"]
```
### Arguments
- `--codebase` : the codebase(s) to test, e.g. `mmdet`; for multiple codebases: `mmpretrain mmdet ...`
- `--backends` : the backends to test. By default all `backend`s are tested; you may also pass several, e.g. `onnxruntime tensorrt`. To also run SDK tests, configure `sdk_config` in `tests/regression/${codebase}.yml`.
- `--models` : the models to test. By default all models in the `yml` are tested; you may also pass several model names as listed in the corresponding yml config, e.g. `ResNet SE-ResNet "Mask R-CNN"`. Names composed of only letters and digits are also accepted, e.g. `resnet seresnet maskrcnn`.
- `--work-dir` : directory for model conversion and report generation, default `../mmdeploy_regression_working_dir`; avoid spaces and other special characters in the path.
- `--checkpoint-dir`: download/save directory for PyTorch checkpoints, default `../mmdeploy_checkpoints`; avoid spaces and other special characters in the path.
- `--device` : the device to use, default `cuda`.
- `--log-level` : log level, one of `'CRITICAL', 'FATAL', 'ERROR', 'WARN', 'WARNING', 'INFO', 'DEBUG', 'NOTSET'`; default `INFO`.
- `-p` or `--performance` : whether to test accuracy. With the flag, conversion + accuracy are tested; without it, only conversion.
### Notes
For Windows users:
1. To use the `&&` connector in shell commands, download and use `PowerShell 7 Preview 5+`.
2. If you use a conda env, you may need to change `python3` to `python` in regression_test.py, since there is a `python3.exe` in the `%USERPROFILE%\AppData\Local\Microsoft\WindowsApps` directory.
## Examples
1. Test **conversion + accuracy** on all backends of mmdet and mmpose
```shell
python ./tools/regression_test.py \
--codebase mmdet mmpose \
--work-dir "../mmdeploy_regression_working_dir" \
--device "cuda" \
--log-level INFO \
--performance
```
2. Test **conversion + accuracy** on selected backends of mmdet and mmpose
```shell
python ./tools/regression_test.py \
--codebase mmdet mmpose \
--backends onnxruntime tensorrt \
--work-dir "../mmdeploy_regression_working_dir" \
--device "cuda" \
--log-level INFO \
-p
```
3. Test selected backends of mmdet and mmpose, **conversion only**
```shell
python ./tools/regression_test.py \
--codebase mmdet mmpose \
--backends onnxruntime tensorrt \
--work-dir "../mmdeploy_regression_working_dir" \
--device "cuda" \
--log-level INFO
```
4. Test selected models of mmdet and mmpretrain, **conversion only**
```shell
python ./tools/regression_test.py \
--codebase mmdet mmpose \
--models ResNet SE-ResNet "Mask R-CNN" \
--work-dir "../mmdeploy_regression_working_dir" \
--device "cuda" \
--log-level INFO
```
## 3. Regression test config files
### Example and explanation
```yaml
globals:
  codebase_dir: ../mmocr # path of the codebase under regression test
  checkpoint_force_download: False # whether to re-download checkpoints even if they already exist
  images: # images used by the tests
    img_densetext_det: &img_densetext_det ../mmocr/demo/demo_densetext_det.jpg
    img_demo_text_det: &img_demo_text_det ../mmocr/demo/demo_text_det.jpg
    img_demo_text_ocr: &img_demo_text_ocr ../mmocr/demo/demo_text_ocr.jpg
    img_demo_text_recog: &img_demo_text_recog ../mmocr/demo/demo_text_recog.jpg
  metric_info: &metric_info # metric parameters
    hmean-iou: # named after metafile.Results.Metrics
      eval_name: hmean-iou # named after the test.py --metrics argument
      metric_key: 0_hmean-iou:hmean # the key name that eval writes to the log
      tolerance: 0.1 # tolerated threshold interval
      task_name: Text Detection # named after the model's metafile.Results.Task
      dataset: ICDAR2015 # named after the model's metafile.Results.Dataset
    word_acc: # same as above
      eval_name: acc
      metric_key: 0_word_acc_ignore_case
      tolerance: 0.2
      task_name: Text Recognition
      dataset: IIIT5K
  convert_image_det: &convert_image_det # images used by det conversion
    input_img: *img_densetext_det
    test_img: *img_demo_text_det
  convert_image_rec: &convert_image_rec
    input_img: *img_demo_text_recog
    test_img: *img_demo_text_recog
  backend_test: &default_backend_test True # whether to run accuracy tests on the backend
  sdk: # SDK config files
    sdk_detection_dynamic: &sdk_detection_dynamic configs/mmocr/text-detection/text-detection_sdk_dynamic.py
    sdk_recognition_dynamic: &sdk_recognition_dynamic configs/mmocr/text-recognition/text-recognition_sdk_dynamic.py

onnxruntime:
  pipeline_ort_recognition_static_fp32: &pipeline_ort_recognition_static_fp32
    convert_image: *convert_image_rec # image used during conversion
    backend_test: *default_backend_test # whether to run the backend test; treated as False when absent
    sdk_config: *sdk_recognition_dynamic # whether to run the SDK test; when present the given SDK config is used, otherwise no SDK test is run
    deploy_config: configs/mmocr/text-recognition/text-recognition_onnxruntime_static.py # the deploy cfg path, relative to the mmdeploy root
  pipeline_ort_recognition_dynamic_fp32: &pipeline_ort_recognition_dynamic_fp32
    convert_image: *convert_image_rec
    backend_test: *default_backend_test
    sdk_config: *sdk_recognition_dynamic
    deploy_config: configs/mmocr/text-recognition/text-recognition_onnxruntime_dynamic.py
  pipeline_ort_detection_dynamic_fp32: &pipeline_ort_detection_dynamic_fp32
    convert_image: *convert_image_det
    deploy_config: configs/mmocr/text-detection/text-detection_onnxruntime_dynamic.py

tensorrt:
  pipeline_trt_recognition_dynamic_fp16: &pipeline_trt_recognition_dynamic_fp16
    convert_image: *convert_image_rec
    backend_test: *default_backend_test
    sdk_config: *sdk_recognition_dynamic
    deploy_config: configs/mmocr/text-recognition/text-recognition_tensorrt-fp16_dynamic-1x32x32-1x32x640.py
  pipeline_trt_detection_dynamic_fp16: &pipeline_trt_detection_dynamic_fp16
    convert_image: *convert_image_det
    backend_test: *default_backend_test
    sdk_config: *sdk_detection_dynamic
    deploy_config: configs/mmocr/text-detection/text-detection_tensorrt-fp16_dynamic-320x320-2240x2240.py

openvino:
  # omitted here, same pattern as above
ncnn:
  # omitted here, same pattern as above
pplnn:
  # omitted here, same pattern as above
torchscript:
  # omitted here, same pattern as above

models:
  - name: crnn # model name
    metafile: configs/textrecog/crnn/metafile.yml # path of the model's metafile, relative to the codebase
    codebase_model_config_dir: configs/textrecog/crnn # parent directory of `model_configs`, relative to the codebase
    model_configs: # names of the configs to test
      - crnn_academic_dataset.py
    pipelines: # pipelines to use
      - *pipeline_ort_recognition_dynamic_fp32

  - name: dbnet
    metafile: configs/textdet/dbnet/metafile.yml
    codebase_model_config_dir: configs/textdet/dbnet
    model_configs:
      - dbnet_r18_fpnc_1200e_icdar2015.py
    pipelines:
      - *pipeline_ort_detection_dynamic_fp32
      - *pipeline_trt_detection_dynamic_fp16

      # a special pipeline can be added like this
      - convert_image: xxx
        backend_test: xxx
        sdk_config: xxx
        deploy_config: configs/mmocr/text-detection/xxx
```
## 4. Generated report
### Template
| | Model | Model Config | Task | Checkpoint | Dataset | Backend | Deploy Config | Static or Dynamic | Precision Type | Conversion Result | metric_1 | metric_2 | metric_n | Test Pass |
| ---- | -------- | ----------------- | ---------------- | -------------- | ---------- | -------- | --------------- | ----------------- | -------------- | ----------------- | ----------- | ----------- | ----------- | ------------ |
| index | model name | model config path | executed task name | `.pth` checkpoint path | dataset name | backend name | deploy cfg path | dynamic or static | precision used in testing | model conversion result | value of metric 1 | value of metric 2 | value of metric n | backend test result |
### Example
This is a report generated for MMOCR:
| | Model | Model Config | Task | Checkpoint | Dataset | Backend | Deploy Config | Static or Dynamic | Precision Type | Conversion Result | hmean-iou | word_acc | Test Pass |
| --- | ----- | ---------------------------------------------------------------- | ---------------- | ------------------------------------------------------------------------------------------------------------ | --------- | --------------- | -------------------------------------------------------------------------------------- | ----------------- | -------------- | ----------------- | --------- | -------- | --------- |
| 0 | crnn | ../mmocr/configs/textrecog/crnn/crnn_academic_dataset.py | Text Recognition | ../mmdeploy_checkpoints/mmocr/crnn/crnn_academic-a723a1c5.pth | IIIT5K | Pytorch | - | - | - | - | - | 80.5 | - |
| 1 | crnn | ../mmocr/configs/textrecog/crnn/crnn_academic_dataset.py | Text Recognition | ${WORK_DIR}/mmocr/crnn/onnxruntime/static/crnn_academic-a723a1c5/end2end.onnx | x | onnxruntime | configs/mmocr/text-recognition/text-recognition_onnxruntime_dynamic.py | static | fp32 | True | - | 80.67 | True |
| 2 | crnn | ../mmocr/configs/textrecog/crnn/crnn_academic_dataset.py | Text Recognition | ${WORK_DIR}/mmocr/crnn/onnxruntime/static/crnn_academic-a723a1c5 | x | SDK-onnxruntime | configs/mmocr/text-recognition/text-recognition_sdk_dynamic.py | static | fp32 | True | - | x | False |
| 3 | dbnet | ../mmocr/configs/textdet/dbnet/dbnet_r18_fpnc_1200e_icdar2015.py | Text Detection | ../mmdeploy_checkpoints/mmocr/dbnet/dbnet_r18_fpnc_sbn_1200e_icdar2015_20210329-ba3ab597.pth | ICDAR2015 | Pytorch | - | - | - | - | 0.795 | - | - |
| 4 | dbnet | ../mmocr/configs/textdet/dbnet/dbnet_r18_fpnc_1200e_icdar2015.py | Text Detection | ../mmdeploy_checkpoints/mmocr/dbnet/dbnet_r18_fpnc_sbn_1200e_icdar2015_20210329-ba3ab597.pth | ICDAR | onnxruntime | configs/mmocr/text-detection/text-detection_onnxruntime_dynamic.py | dynamic | fp32 | True | - | - | True |
| 5 | dbnet | ../mmocr/configs/textdet/dbnet/dbnet_r18_fpnc_1200e_icdar2015.py | Text Detection | ${WORK_DIR}/mmocr/dbnet/tensorrt/dynamic/dbnet_r18_fpnc_sbn_1200e_icdar2015_20210329-ba3ab597/end2end.engine | ICDAR | tensorrt | configs/mmocr/text-detection/text-detection_tensorrt-fp16_dynamic-320x320-2240x2240.py | dynamic | fp16 | True | 0.793302 | - | True |
| 6 | dbnet | ../mmocr/configs/textdet/dbnet/dbnet_r18_fpnc_1200e_icdar2015.py | Text Detection | ${WORK_DIR}/mmocr/dbnet/tensorrt/dynamic/dbnet_r18_fpnc_sbn_1200e_icdar2015_20210329-ba3ab597 | ICDAR | SDK-tensorrt | configs/mmocr/text-detection/text-detection_sdk_dynamic.py | dynamic | fp16 | True | 0.795073 | - | True |
## 5. Supported backends
- [x] ONNX Runtime
- [x] TensorRT
- [x] PPLNN
- [x] ncnn
- [x] OpenVINO
- [x] TorchScript
- [x] SNPE
- [x] MMDeploy SDK
## 6. Supported codebases and their metrics
| Codebase | Metric | Support |
| ---------- | -------- | ------------------ |
| mmdet | bbox | :heavy_check_mark: |
| | segm | :heavy_check_mark: |
| | PQ | :x: |
| mmpretrain | accuracy | :heavy_check_mark: |
| mmseg | mIoU | :heavy_check_mark: |
| mmpose | AR | :heavy_check_mark: |
| | AP | :heavy_check_mark: |
| mmocr | hmean | :heavy_check_mark: |
| | acc | :heavy_check_mark: |
| mmagic | PSNR | :heavy_check_mark: |
| | SSIM | :heavy_check_mark: |
## 7. Notes
None at the moment.
## 8. FAQ
None at the moment.
# How to support new backends
MMDeploy supports a number of backend inference engines, and contributions of new backends are very welcome. This tutorial describes the general process of supporting a new backend in MMDeploy.
## Prerequisites
Before adding a new backend engine to MMDeploy, check that the backend meets a few requirements:
- The backend must support ONNX as its IR.
- If the backend requires model or weight files other than the ".onnx" file, a tool that converts the ".onnx" file into those files is needed; the tool can be a Python API, a script, or an executable.
- It is strongly recommended that the new backend provide a Python interface for loading backend files and running inference, for validation purposes.
## Support backend conversion
Backends in MMDeploy must support ONNX, so a backend either loads the ".onnx" file directly or converts it into its own format with a conversion tool. This section describes the steps to support backend conversion.
1. Add a new backend variable in the `mmdeploy/utils/constants.py` file to represent the supported backend name.
**Example**:
```Python
# mmdeploy/utils/constants.py
class Backend(AdvancedEnum):
    # take the existing TensorRT backend as an example
TENSORRT = 'tensorrt'
```
2.`mmdeploy/backend/` 目录下添加相应的库(一个包括 `__init__.py` 的文件夹),例如, `mmdeploy/backend/tensorrt` 。在 `__init__.py` 中,必须有一个名为 `is_available` 的函数检查用户是否安装了后端库。如果检查通过,则将加载库的剩余文件。
**例子**:
```Python
# mmdeploy/backend/tensorrt/__init__.py
def is_available():
return importlib.util.find_spec('tensorrt') is not None
if is_available():
from .utils import from_onnx, load, save
from .wrapper import TRTWrapper
__all__ = [
'from_onnx', 'save', 'load', 'TRTWrapper'
]
```
3.`configs/_base_/backends` 目录中创建一个配置文件(例如, `configs/_base_/backends/tensorrt.py` )。如果新后端引擎只是将“.onnx”文件作为输入,那么新的配置可以很简单,对应配置只需包含一个表示后端名称的字段(但也应该与 `mmdeploy/utils/constants.py` 中的名称相同)。
**例子**
```python
backend_config = dict(type='tensorrt')
```
But if the backend needs other files, then the parameters required to convert from the ".onnx" file to the backend files should also be included in the config file.
**Example**:
```Python
backend_config = dict(
type='tensorrt',
common_config=dict(
fp16_mode=False, max_workspace_size=0))
```
With a basic backend config file in place, you can already build a complete deployment config easily through inheritance; see our [config tutorial](../02-how-to-run/write_config.md) for details. Here is an example:
```Python
_base_ = ['../_base_/backends/tensorrt.py']
codebase_config = dict(type='mmpretrain', task='Classification')
onnx_config = dict(input_shape=None)
```
4. If the new backend needs model or weight files other than the ".onnx" file, create an `onnx2backend.py` file in the corresponding folder (e.g. `mmdeploy/backend/tensorrt/onnx2tensorrt.py`), and add a conversion function `onnx2backend` that converts the given ".onnx" file into the required backend files in the given working directory. There are no requirements on the other arguments or the implementation details; use any tool you like. Some examples:
**Using a Python script**:
```Python
def onnx2openvino(input_info: Dict[str, Union[List[int], torch.Size]],
output_names: List[str], onnx_path: str, work_dir: str):
input_names = ','.join(input_info.keys())
input_shapes = ','.join(str(list(elem)) for elem in input_info.values())
output = ','.join(output_names)
mo_args = f'--input_model="{onnx_path}" '\
f'--output_dir="{work_dir}" ' \
f'--output="{output}" ' \
f'--input="{input_names}" ' \
f'--input_shape="{input_shapes}" ' \
f'--disable_fusing '
command = f'mo.py {mo_args}'
mo_output = run(command, stdout=PIPE, stderr=PIPE, shell=True, check=True)
```
**Using an executable**:
```Python
def onnx2ncnn(onnx_path: str, work_dir: str):
onnx2ncnn_path = get_onnx2ncnn_path()
save_param, save_bin = get_output_model_file(onnx_path, work_dir)
    call([onnx2ncnn_path, onnx_path, save_param, save_bin])
```
5.`mmdeploy/apis` 中创建新后端库并声明对应 APIs
**例子**
```Python
# mmdeploy/apis/ncnn/__init__.py
from mmdeploy.backend.ncnn import is_available
__all__ = ['is_available']
if is_available():
from mmdeploy.backend.ncnn.onnx2ncnn import (onnx2ncnn,
get_output_model_file)
__all__ += ['onnx2ncnn', 'get_output_model_file']
```
Derive a class from `BaseBackendManager` and implement the `to_backend` class method.
**Example**:
```Python
@classmethod
def to_backend(cls,
ir_files: Sequence[str],
deploy_cfg: Any,
work_dir: str,
log_level: int = logging.INFO,
device: str = 'cpu',
**kwargs) -> Sequence[str]:
return ir_files
```
6. Convert the OpenMMLab model (if necessary) and run inference on the backend engine. If you find incompatible operators during testing, you can try rewriting the original model for the backend following the [rewriter tutorial](support_new_model.md), or adding custom operators.
7. Add documentation and unit tests for the new backend engine code :).
## Support backend inference
Although backend engines are usually implemented in C/C++, testing and debugging are much more convenient if the backend provides a Python inference interface. We encourage contributors to support new backend inference in MMDeploy's Python interface. This section describes the steps to do so.
1. Add a file named `wrapper.py` to the corresponding backend folder under `mmdeploy/backend/{backend}`, e.g. `mmdeploy/backend/tensorrt/wrapper.py`. This module should implement and register a wrapper class derived from the base class `BaseWrapper` in `mmdeploy/backend/base/base_wrapper.py`.
**Example**:
```Python
from mmdeploy.utils import Backend
from ..base import BACKEND_WRAPPER, BaseWrapper
@BACKEND_WRAPPER.register_module(Backend.TENSORRT.value)
class TRTWrapper(BaseWrapper):
```
2. The wrapper class can initialize the engine in `__init__` and run inference in `forward`. Note that `__init__` must accept an `output_names` argument and pass it to the base class to fix the order of the output tensors. The inputs and outputs of `forward` should be dicts mapping tensor names to values.
3. For convenient performance testing, the class should define an `execute` function that only calls the backend engine's inference interface. `forward` should call `execute` after preprocessing the data.
**Example**:
```Python
from mmdeploy.utils import Backend
from mmdeploy.utils.timer import TimeCounter
from ..base import BACKEND_WRAPPER, BaseWrapper
@BACKEND_WRAPPER.register_module(Backend.ONNXRUNTIME.value)
class ORTWrapper(BaseWrapper):
def __init__(self,
onnx_file: str,
device: str,
output_names: Optional[Sequence[str]] = None):
# Initialization
#
# ...
super().__init__(output_names)
def forward(self, inputs: Dict[str,
torch.Tensor]) -> Dict[str, torch.Tensor]:
# Fetch data
# ...
self.__ort_execute(self.io_binding)
# Postprocess data
# ...
@TimeCounter.count_time('onnxruntime')
def __ort_execute(self, io_binding: ort.IOBinding):
# Only do the inference
self.sess.run_with_iobinding(io_binding)
```
4.`BaseBackendManager` 派生接口类,实现 `build_wrapper` 静态方法
**例子**
```Python
@BACKEND_MANAGERS.register('onnxruntime')
class ONNXRuntimeManager(BaseBackendManager):
@classmethod
def build_wrapper(cls,
backend_files: Sequence[str],
device: str = 'cpu',
input_names: Optional[Sequence[str]] = None,
output_names: Optional[Sequence[str]] = None,
deploy_cfg: Optional[Any] = None,
**kwargs):
from .wrapper import ORTWrapper
return ORTWrapper(
onnx_file=backend_files[0],
device=device,
output_names=output_names)
```
5. Add documentation and unit tests for the new backend engine code :).
## Adding a new backend when using MMDeploy as a third-party library
The previous sections show how to add a new backend inside MMDeploy, which requires changing its source code. If MMDeploy is used as a third-party package, those methods no longer work. In that case, adding a new backend requires a package named `aenum`, which can be installed with `pip install aenum`.
With `aenum` installed, a new backend can be added as follows:
```python
from mmdeploy.utils.constants import Backend
from aenum import extend_enum
try:
Backend.get('backend_name')
except Exception:
extend_enum(Backend, 'BACKEND', 'backend_name')
```
Run the code above before using MMDeploy's rewriting logic, and the addition of the new backend is complete.
# How to support new models
We provide several tools to support model conversion.
## Function rewriter
PyTorch neural networks are written in Python, which simplifies algorithm development, but Python control flow and third-party libraries make it hard to export the network to an intermediate representation. For this we provide a "monkey patch" tool that rewrites unsupported functionality into an alternative that supports intermediate-representation export. Here is a concrete example:
```python
from mmdeploy.core import FUNCTION_REWRITER
@FUNCTION_REWRITER.register_rewriter(
func_name='torch.Tensor.repeat', backend='tensorrt')
def repeat_static(input, *size):
ctx = FUNCTION_REWRITER.get_context()
origin_func = ctx.origin_func
if input.dim() == 1 and len(size) == 1:
return origin_func(input.unsqueeze(0), *([1] + list(size))).squeeze(0)
else:
return origin_func(input, *size)
```
Using the function rewriter is easy: just add a decorator with arguments:
- `func_name` is the function to override. It can be a PyTorch function or a custom function; methods in modules can also be overridden with this tool.
- `backend` is the inference engine. The function is overridden when the model is exported to that engine. If it is not given, the rewrite is registered as the default; if no backend-specific rewrite exists for a backend, the default one is used as a fallback.
The rewritten function keeps the same arguments as the original. The context obtained via `FUNCTION_REWRITER.get_context()` provides useful information, e.g. the deployment config `ctx.cfg` and the original function (which has been overridden) `ctx.origin_func`.
See [these sample codes](https://github.com/open-mmlab/mmdeploy/blob/main/mmdeploy/codebase/mmpretrain/models/backbones/shufflenet_v2.py) for reference.
## Module rewriter
If you want to replace a whole module with another, we have a module rewriter, used as follows:
```python
@MODULE_REWRITER.register_rewrite_module(
'mmagic.models.backbones.sr_backbones.SRCNN', backend='tensorrt')
class SRCNNWrapper(nn.Module):
def __init__(self,
module,
cfg,
channels=(3, 64, 32, 3),
kernel_sizes=(9, 1, 5),
upscale_factor=4):
super(SRCNNWrapper, self).__init__()
self._module = module
module.img_upsampler = nn.Upsample(
scale_factor=module.upscale_factor,
mode='bilinear',
align_corners=False)
def forward(self, *args, **kwargs):
"""Run forward."""
return self._module(*args, **kwargs)
def init_weights(self, *args, **kwargs):
"""Initialize weights."""
return self._module.init_weights(*args, **kwargs)
```
Just like the function rewriter, add a decorator with arguments:
- `module_type` the module class to rewrite.
- `backend` is the inference engine. The module is replaced when the model is exported to that engine. If it is not given, the rewrite is registered as the default; if no backend-specific rewrite exists for a backend, the default one is used as a fallback.
All instances of the module in the network will be replaced with instances of this new class. The original module and the deployment config are passed in as the first two arguments.
## Custom symbolic functions
The mapping between PyTorch and ONNX is defined through symbolic functions in PyTorch. A custom symbolic function can help us bypass ONNX nodes that an inference engine does not support.
```python
@SYMBOLIC_REWRITER.register_symbolic('squeeze', is_pytorch=True)
def squeeze_default(g, self, dim=None):
if dim is None:
dims = []
for i, size in enumerate(self.type().sizes()):
if size == 1:
dims.append(i)
else:
dims = [sym_help._get_const(dim, 'i', 'dim')]
return g.op('Squeeze', self, axes_i=dims)
```
Arguments of the decorator:
- `func_name` the name of the function to register the symbolic for. Use the full path if it is a custom `torch.autograd.Function`; if it is a PyTorch built-in function, the bare name is enough.
- `backend` is the inference engine. The symbolic is used when the model is exported to that engine. If it is not given, the symbolic is registered as the default; if no backend-specific symbolic exists for a backend, the default one is used as a fallback.
- `is_pytorch` True if the function is a PyTorch built-in function.
- `arg_descriptors` descriptors of the symbolic function's arguments, which will be passed to `torch.onnx.symbolic_helper._parse_arg`.
Just like the function rewriter's `ctx`, the context provides useful information, e.g. the deployment config `ctx.cfg` and the original function (which has been overridden) `ctx.origin_func`. Note that `ctx.origin_func` can only be used when `is_pytorch==False`.
[Here](https://github.com/open-mmlab/mmdeploy/tree/6420e2044515ff2052960c0f8bb9e351e6a7f2c2/mmdeploy/pytorch/symbolics) are plenty of implementations to refer to.
# Test model rewriting
Once a model [rewriter](support_new_model.md) is finished, a corresponding test case is needed to verify that the rewrite takes effect. Usually we compare the outputs of the original model and the rewritten one. The original outputs can be obtained by calling the model's forward function directly; how to generate the rewritten outputs depends on the complexity of the rewrite.
## Test a simple rewrite
If the change to the model is small (e.g. only one or two variables change with no side effects), you can construct inputs for the rewritten function/module, run inference in `RewriteContext`, and check the result.
```python
# mmpretrain.models.classifiers.base.py
class BaseClassifier(BaseModule, metaclass=ABCMeta):
def forward(self, img, return_loss=True, **kwargs):
if return_loss:
return self.forward_train(img, **kwargs)
else:
return self.forward_test(img, **kwargs)
# Custom rewritten function
@FUNCTION_REWRITER.register_rewriter(
'mmpretrain.models.classifiers.BaseClassifier.forward', backend='default')
def forward_of_base_classifier(self, img, *args, **kwargs):
"""Rewrite `forward` for default backend."""
return self.simple_test(img, {})
```
In the example, we only change the forward function. We can test this rewrite by writing the following function:
```python
import torch

from mmdeploy.core import RewriterContext


def test_baseclassfier_forward():
    input = torch.rand(1)
    from mmpretrain.models.classifiers import BaseClassifier

    class DummyClassifier(BaseClassifier):

        def __init__(self, init_cfg=None):
            super().__init__(init_cfg=init_cfg)

        def extract_feat(self, imgs):
            pass

        def forward_train(self, imgs):
            return 'train'

        def simple_test(self, img, tmp, **kwargs):
            return 'simple_test'

    model = DummyClassifier().eval()
    model_output = model(input)
    with RewriterContext(cfg=dict()), torch.no_grad():
        backend_output = model(input)

    assert model_output == 'train'
    assert backend_output == 'simple_test'
```
In this test function, we construct a class derived from `BaseClassifier` to test whether the rewrite works. The original output is obtained by calling `model(input)` directly, and the rewritten output by calling `model(input)` inside `RewriterContext`. Finally we assert on the outputs.
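The mechanism behind `RewriterContext` can be illustrated with a minimal sketch that temporarily swaps a method inside a `with` block and restores it on exit. This is a deliberate simplification (names like `toy_rewriter_context` are made up for illustration), not MMDeploy's actual implementation:

```python
from contextlib import contextmanager


@contextmanager
def toy_rewriter_context(cls, method_name, new_method):
    """Temporarily replace `cls.method_name` with `new_method`.

    The original method is restored when the `with` block exits,
    even if an exception is raised inside it.
    """
    origin = getattr(cls, method_name)
    setattr(cls, method_name, new_method)
    try:
        yield
    finally:
        setattr(cls, method_name, origin)


class Model:

    def forward(self, x):
        return 'train'


def rewritten_forward(self, x):
    return 'simple_test'


model = Model()
model_output = model.forward(1)          # original behavior
with toy_rewriter_context(Model, 'forward', rewritten_forward):
    backend_output = model.forward(1)    # rewritten behavior
```

After the `with` block, `Model.forward` behaves as before, which is why the same test can observe both the original and the rewritten outputs.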
## Test a complex rewrite

Sometimes we may make major changes to an original model function (e.g., eliminating branch statements to generate a correct computation graph). Even if the rewritten model produces correct outputs when running in Python, we cannot guarantee that it works as expected on the backend. Therefore, we need to test the rewritten model on the backend.
```python
import torch

from mmdeploy.core import FUNCTION_REWRITER
from mmdeploy.utils import is_dynamic_shape


# Custom rewritten function
@FUNCTION_REWRITER.register_rewriter(
    func_name='mmseg.models.segmentors.BaseSegmentor.forward')
def base_segmentor__forward(self, img, img_metas=None, **kwargs):
    ctx = FUNCTION_REWRITER.get_context()
    if img_metas is None:
        img_metas = {}
    assert isinstance(img_metas, dict)
    assert isinstance(img, torch.Tensor)

    deploy_cfg = ctx.cfg
    is_dynamic_flag = is_dynamic_shape(deploy_cfg)
    img_shape = img.shape[2:]
    if not is_dynamic_flag:
        img_shape = [int(val) for val in img_shape]
    img_metas['img_shape'] = img_shape
    return self.simple_test(img, img_metas, **kwargs)
```
The behavior of this rewritten function is complex, so we should test it as follows:
```python
def test_basesegmentor_forward():
    from mmdeploy.utils.test import (WrapModel, get_model_outputs,
                                     get_rewrite_outputs)

    segmentor = get_model()
    segmentor.cpu().eval()

    # Prepare data
    # ...

    # Get the outputs of original model
    model_inputs = {
        'img': [imgs],
        'img_metas': [img_metas],
        'return_loss': False
    }
    model_outputs = get_model_outputs(segmentor, 'forward', model_inputs)

    # Get the outputs of rewritten model
    wrapped_model = WrapModel(segmentor, 'forward',
                              img_metas=None, return_loss=False)
    rewrite_inputs = {'img': imgs}
    rewrite_outputs, is_backend_output = get_rewrite_outputs(
        wrapped_model=wrapped_model,
        model_inputs=rewrite_inputs,
        deploy_cfg=deploy_cfg)
    if is_backend_output:
        # If the backend plugins have been installed, the rewrite outputs are
        # generated by backend.
        rewrite_outputs = torch.tensor(rewrite_outputs)
        model_outputs = torch.tensor(model_outputs)
        model_outputs = model_outputs.unsqueeze(0).unsqueeze(0)
        assert torch.allclose(rewrite_outputs, model_outputs)
    else:
        # Otherwise, the outputs are generated by python.
        assert rewrite_outputs is not None
```
We have provided some utility functions for testing. For example, you can first build the model and use `get_model_outputs` to get the original outputs; then wrap the rewritten function with `WrapModel` and use `get_rewrite_outputs` to get the results. This example returns both the outputs and a flag indicating whether they come from the backend.

Since we cannot be sure that users have installed the backend correctly, we have to check whether the results come from Python or from real backend inference. The unit test must cover both cases; finally, `torch.allclose` is used to compare the two kinds of results.

The complete usage of these test utilities can be found in the API documentation.
# Minimal makefile for Sphinx documentation
#
# You can set these variables from the command line.
SPHINXOPTS =
SPHINXBUILD = sphinx-build
SOURCEDIR = .
BUILDDIR = _build
# Put it first so that "make" without argument is like "make help".
help:
	@$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
.PHONY: help Makefile
# Catch-all target: route all unknown targets to Sphinx using the new
# "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS).
%: Makefile
	@$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
.header-logo {
  background-image: url("../image/mmdeploy-logo.png");
  background-size: 150px 60px;
  height: 60px;
  width: 150px;
}
apis
-------
.. automodule:: mmdeploy.apis
   :members:

apis/tensorrt
-------------
.. automodule:: mmdeploy.apis.tensorrt
   :members:

apis/onnxruntime
----------------
.. automodule:: mmdeploy.apis.onnxruntime
   :members:

apis/ncnn
---------
.. automodule:: mmdeploy.apis.ncnn
   :members:

apis/pplnn
----------
.. automodule:: mmdeploy.apis.pplnn
   :members:
# Cross-compile the snpe inference service with the NDK on Ubuntu 18.04

mmdeploy provides prebuilt packages. Refer to this document if you want to build them yourself or need to modify the .proto interface.

Note that the official gRPC documentation does not fully cover NDK support.

## 1. Environment
| Item     | Version        | Remarks                                                   |
| -------- | -------------- | --------------------------------------------------------- |
| snpe     | 1.59           | 1.60 uses clang-8.0, which may cause compatibility issues |
| host OS  | ubuntu18.04    | version required by snpe1.59                              |
| NDK      | r17c           | version required by snpe1.59                              |
| gRPC     | commit 6f698b5 | -                                                         |
| Hardware | qcom888        | a qcom chip is required                                   |
## 2. Cross-compile gRPC with the NDK

1. Pull the gRPC repo and build `protoc` and `grpc_cpp_plugin` on the host
```bash
# Install dependencies
$ apt-get update && apt-get install -y libssl-dev

# Build
$ git clone https://github.com/grpc/grpc --recursive --depth=1
$ cd grpc
$ mkdir -p cmake/build
$ pushd cmake/build

$ cmake \
  -DCMAKE_BUILD_TYPE=Release \
  -DgRPC_INSTALL=ON \
  -DgRPC_BUILD_TESTS=OFF \
  -DgRPC_SSL_PROVIDER=package \
  ../..

# Install into the host environment
$ make -j
$ sudo make install
```
2. Download the NDK and cross-compile the static libraries needed for android aarch64
```bash
$ wget https://dl.google.com/android/repository/android-ndk-r17c-linux-x86_64.zip
$ unzip android-ndk-r17c-linux-x86_64.zip

# Set environment variable
$ export ANDROID_NDK=/path/to/android-ndk-r17c

# Build
$ cd /path/to/grpc
$ mkdir -p cmake/build_aarch64 && pushd cmake/build_aarch64

$ cmake ../.. \
  -DCMAKE_TOOLCHAIN_FILE=${ANDROID_NDK}/build/cmake/android.toolchain.cmake \
  -DANDROID_ABI=arm64-v8a \
  -DANDROID_PLATFORM=android-26 \
  -DANDROID_TOOLCHAIN=clang \
  -DANDROID_STL=c++_shared \
  -DCMAKE_BUILD_TYPE=Release \
  -DCMAKE_INSTALL_PREFIX=/tmp/android_grpc_install_shared

$ make -j
$ make install
```
3. At this point `/tmp/android_grpc_install_shared` should contain the complete installation files
```bash
$ cd /tmp/android_grpc_install_shared
$ tree -L 1
.
├── bin
├── include
├── lib
└── share
```
## 3. [Optional] Self-test whether NDK gRPC works

1. Build the helloworld example that ships with gRPC
```bash
$ cd /path/to/grpc/examples/cpp/helloworld/
$ mkdir cmake/build_aarch64 -p && pushd cmake/build_aarch64
$ cmake ../.. \
-DCMAKE_TOOLCHAIN_FILE=${ANDROID_NDK}/build/cmake/android.toolchain.cmake \
-DANDROID_ABI=arm64-v8a \
-DANDROID_PLATFORM=android-26 \
-DANDROID_STL=c++_shared \
-DANDROID_TOOLCHAIN=clang \
-DCMAKE_BUILD_TYPE=Release \
-Dabsl_DIR=/tmp/android_grpc_install_shared/lib/cmake/absl \
-DProtobuf_DIR=/tmp/android_grpc_install_shared/lib/cmake/protobuf \
-DgRPC_DIR=/tmp/android_grpc_install_shared/lib/cmake/grpc
$ make -j
$ ls greeter*
greeter_async_client greeter_async_server greeter_callback_server greeter_server
greeter_async_client2 greeter_callback_client greeter_client
```
2. Enable debugging mode on the phone and push the build artifacts to the `/data/local/tmp` directory

Tips: on some Chinese-brand phones, tap the version number 7 times in Settings to enter developer mode, after which USB debugging can be enabled.
```bash
$ adb push greeter* /data/local/tmp
```
3. `adb shell` into the phone and run the client/server
```bash
/data/local/tmp $ ./greeter_client
Greeter received: Hello world
```
## 4. Cross-compile the snpe inference service

1. Open the [snpe tools website](https://developer.qualcomm.com/software/qualcomm-neural-processing-sdk/tools), download version 1.59, extract it and set the environment variable

**Note that snpe >= 1.60 uses `clang-8.0`, which may cause incompatibility with `libc++_shared.so` on older devices.**
```bash
$ export SNPE_ROOT=/path/to/snpe-1.59.0.3230
```
2. Open the mmdeploy snpe server directory and build with the same options used when cross-compiling gRPC
```bash
$ cd /path/to/mmdeploy
$ cd service/snpe/server
$ mkdir -p build && cd build
$ export ANDROID_NDK=/path/to/android-ndk-r17c
$ cmake .. \
  -DCMAKE_TOOLCHAIN_FILE=${ANDROID_NDK}/build/cmake/android.toolchain.cmake \
-DANDROID_ABI=arm64-v8a \
-DANDROID_PLATFORM=android-26 \
-DANDROID_STL=c++_shared \
-DANDROID_TOOLCHAIN=clang \
-DCMAKE_BUILD_TYPE=Release \
-Dabsl_DIR=/tmp/android_grpc_install_shared/lib/cmake/absl \
-DProtobuf_DIR=/tmp/android_grpc_install_shared/lib/cmake/protobuf \
-DgRPC_DIR=/tmp/android_grpc_install_shared/lib/cmake/grpc
$ make -j
$ file inference_server
inference_server: ELF 64-bit LSB shared object, ARM aarch64, version 1 (SYSV), dynamically linked, interpreter /system/bin/linker64, BuildID[sha1]=252aa04e2b982681603dacb74b571be2851176d2, with debug_info, not stripped
```
The resulting `inference_server` can be executed on the device after `adb push`.
## 5. Regenerate the proto interface

If you have modified `inference.proto`, you need to regenerate the .cpp and .py communication interfaces:
```Shell
$ python3 -m pip install grpcio-tools --user
$ python3 -m grpc_tools.protoc -I./ --python_out=./client/ --grpc_python_out=./client/ inference.proto
$ ln -s `which protoc-gen-grpc`
$ protoc --cpp_out=./ --grpc_out=./ --plugin=protoc-gen-grpc=grpc_cpp_plugin inference.proto
```
## References
- snpe tutorial: https://developer.qualcomm.com/sites/default/files/docs/snpe/cplus_plus_tutorial.html
- gRPC cross-build script: https://raw.githubusercontent.com/grpc/grpc/master/test/distrib/cpp/run_distrib_test_cmake_aarch64_cross.sh
- stackoverflow: https://stackoverflow.com/questions/54052229/build-grpc-c-for-android-using-ndk-arm-linux-androideabi-clang-compiler
#
# Configuration file for the Sphinx documentation builder.
#
# This file does only contain a selection of the most common options. For a
# full list see the documentation:
# http://www.sphinx-doc.org/en/master/config
# -- Path setup --------------------------------------------------------------
# If extensions (or modules to document with autodoc) are in another directory,
# add these directories to sys.path here. If the directory is relative to the
# documentation root, use os.path.abspath to make it absolute, like shown here.
#
import os
import subprocess
import sys
import pytorch_sphinx_theme
from m2r import MdInclude
from recommonmark.transform import AutoStructify
from sphinx.builders.html import StandaloneHTMLBuilder
sys.path.insert(0, os.path.abspath('../..'))
version_file = '../../mmdeploy/version.py'
with open(version_file, 'r') as f:
    exec(compile(f.read(), version_file, 'exec'))
__version__ = locals()['__version__']
# -- Project information -----------------------------------------------------
project = 'mmdeploy'
copyright = '2021-2024, OpenMMLab'
author = 'MMDeploy Authors'
# The short X.Y version
version = __version__
# The full version, including alpha/beta/rc tags
release = __version__
# -- General configuration ---------------------------------------------------
# If your documentation needs a minimal Sphinx version, state it here.
#
# needs_sphinx = '1.0'
# Add any Sphinx extension module names here, as strings. They can be
# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
# ones.
extensions = [
    'breathe',
    'sphinx.ext.autodoc',
    'sphinx.ext.doctest',
    'sphinx.ext.napoleon',
    'sphinx.ext.viewcode',
    'sphinx.ext.autosectionlabel',
    'sphinx_markdown_tables',
    'myst_parser',
    'sphinx_copybutton',
    'sphinxcontrib.mermaid'
]  # yapf: disable
breathe_default_project = 'mmdeployapi'
breathe_projects = {'mmdeployapi': '../cppapi/docs/xml'}
def generate_doxygen_xml(app):
    try:
        folder = '../cppapi'
        retcode = subprocess.call('cd %s; doxygen' % folder, shell=True)
        if retcode < 0:
            sys.stderr.write('doxygen terminated by signal %s' % (-retcode))
    except Exception as e:
        sys.stderr.write('doxygen execution failed: %s' % e)
autodoc_mock_imports = ['tensorrt']
autosectionlabel_prefix_document = True
# Add any paths that contain templates here, relative to this directory.
templates_path = ['_templates']
# The suffix(es) of source filenames.
# You can specify multiple suffix as a list of string:
#
source_suffix = {
    '.rst': 'restructuredtext',
    '.md': 'markdown',
}
# The master toctree document.
master_doc = 'index'
# The language for content autogenerated by Sphinx. Refer to documentation
# for a list of supported languages.
#
# This is also used if you do content translation via gettext catalogs.
# Usually you set "language" from the command line for these cases.
language = 'zh_CN'
# List of patterns, relative to source directory, that match files and
# directories to ignore when looking for source files.
# This pattern also affects html_static_path and html_extra_path.
exclude_patterns = ['_build', 'Thumbs.db', '.DS_Store']
# The name of the Pygments (syntax highlighting) style to use.
pygments_style = 'sphinx'
# -- Options for HTML output -------------------------------------------------
# The theme to use for HTML and HTML Help pages. See the documentation for
# a list of builtin themes.
#
# html_theme = 'sphinx_rtd_theme'
html_theme = 'pytorch_sphinx_theme'
html_theme_path = [pytorch_sphinx_theme.get_html_theme_path()]
# Theme options are theme-specific and customize the look and feel of a theme
# further. For a list of options available for each theme, see the
# documentation.
#
html_theme_options = {
    'logo_url': 'https://mmdeploy.readthedocs.io/zh_CN/latest/',
    'menu': [{
        'name': 'GitHub',
        'url': 'https://github.com/open-mmlab/mmdeploy'
    }],
    'menu_lang': 'cn',
}
# Add any paths that contain custom static files (such as style sheets) here,
# relative to this directory. They are copied after the builtin static files,
# so a file named "default.css" will overwrite the builtin "default.css".
html_static_path = ['_static']
html_css_files = ['css/readthedocs.css']
# Enable ::: for my_st
myst_enable_extensions = ['colon_fence']
myst_heading_anchors = 5
# Custom sidebar templates, must be a dictionary that maps document names
# to template names.
#
# The default sidebars (for documents that don't match any pattern) are
# defined by theme itself. Builtin themes are using these templates by
# default: ``['localtoc.html', 'relations.html', 'sourcelink.html',
# 'searchbox.html']``.
#
# html_sidebars = {}
# -- Options for HTMLHelp output ---------------------------------------------
# Output file base name for HTML help builder.
htmlhelp_basename = 'mmdeploydoc'
# -- Options for LaTeX output ------------------------------------------------
latex_elements = {
    # The paper size ('letterpaper' or 'a4paper').
    #
    # 'papersize': 'letterpaper',

    # The font size ('10pt', '11pt' or '12pt').
    #
    # 'pointsize': '10pt',

    # Additional stuff for the LaTeX preamble.
    #
    # 'preamble': '',

    # Latex figure (float) alignment
    #
    # 'figure_align': 'htbp',
}
# Grouping the document tree into LaTeX files. List of tuples
# (source start file, target name, title,
# author, documentclass [howto, manual, or own class]).
latex_documents = [
    (master_doc, 'mmdeploy.tex', 'mmdeploy Documentation',
     'MMDeploy Contributors', 'manual'),
]
# -- Options for manual page output ------------------------------------------
# One entry per manual page. List of tuples
# (source start file, name, description, authors, manual section).
man_pages = [(master_doc, 'mmdeploy', 'mmdeploy Documentation', [author], 1)]
# -- Options for Texinfo output ----------------------------------------------
# Grouping the document tree into Texinfo files. List of tuples
# (source start file, target name, title, author,
# dir menu entry, description, category)
texinfo_documents = [
    (master_doc, 'mmdeploy', 'mmdeploy Documentation', author, 'mmdeploy',
     'One line description of project.', 'Miscellaneous'),
]
# -- Options for Epub output -------------------------------------------------
# Bibliographic Dublin Core info.
epub_title = project
# The unique identifier of the text. This can be a ISBN number
# or the project homepage.
#
# epub_identifier = ''
# A unique identification for the text.
#
# epub_uid = ''
# A list of files that should not be packed into the epub file.
epub_exclude_files = ['search.html']
# set priority when building html
StandaloneHTMLBuilder.supported_image_types = [
    'image/svg+xml', 'image/gif', 'image/png', 'image/jpeg'
]
# -- Extension configuration -------------------------------------------------
# Ignore >>> when copying code
copybutton_prompt_text = r'>>> |\.\.\. '
copybutton_prompt_is_regexp = True
def setup(app):
    # Add hook for building doxygen xml when needed
    app.connect('builder-inited', generate_doxygen_xml)
    app.add_config_value('no_underscore_emphasis', False, 'env')
    app.add_config_value('m2r_parse_relative_links', False, 'env')
    app.add_config_value('m2r_anonymous_references', False, 'env')
    app.add_config_value('m2r_disable_inline_math', False, 'env')
    app.add_directive('mdinclude', MdInclude)
    app.add_config_value('recommonmark_config', {
        'auto_toc_tree_section': 'Contents',
        'enable_eval_rst': True,
    }, True)
    app.add_transform(AutoStructify)