"pcdet/models/backbones_2d/base_bev_backbone.py" did not exist on "3fdecc873991c7696a55eb518225b2bd85cbaac2"
Commit 4353fa59 authored by limm's avatar limm
Browse files

add part code

parents
Pipeline #2807 canceled with stages
## TensorRT Ops
<!-- TOC -->
- [TensorRT Ops](#tensorrt-ops)
- [TRTBatchedNMS](#trtbatchednms)
- [Description](#description)
- [Parameters](#parameters)
- [Inputs](#inputs)
- [Outputs](#outputs)
- [Type Constraints](#type-constraints)
- [grid_sampler](#grid_sampler)
- [Description](#description-1)
- [Parameters](#parameters-1)
- [Inputs](#inputs-1)
- [Outputs](#outputs-1)
- [Type Constraints](#type-constraints-1)
- [MMCVInstanceNormalization](#mmcvinstancenormalization)
- [Description](#description-2)
- [Parameters](#parameters-2)
- [Inputs](#inputs-2)
- [Outputs](#outputs-2)
- [Type Constraints](#type-constraints-2)
- [MMCVModulatedDeformConv2d](#mmcvmodulateddeformconv2d)
- [Description](#description-3)
- [Parameters](#parameters-3)
- [Inputs](#inputs-3)
- [Outputs](#outputs-3)
- [Type Constraints](#type-constraints-3)
- [MMCVMultiLevelRoiAlign](#mmcvmultilevelroialign)
- [Description](#description-4)
- [Parameters](#parameters-4)
- [Inputs](#inputs-4)
- [Outputs](#outputs-4)
- [Type Constraints](#type-constraints-4)
- [MMCVRoIAlign](#mmcvroialign)
- [Description](#description-5)
- [Parameters](#parameters-5)
- [Inputs](#inputs-5)
- [Outputs](#outputs-5)
- [Type Constraints](#type-constraints-5)
- [ScatterND](#scatternd)
- [Description](#description-6)
- [Parameters](#parameters-6)
- [Inputs](#inputs-6)
- [Outputs](#outputs-6)
- [Type Constraints](#type-constraints-6)
- [TRTBatchedRotatedNMS](#trtbatchedrotatednms)
- [Description](#description-7)
- [Parameters](#parameters-7)
- [Inputs](#inputs-7)
- [Outputs](#outputs-7)
- [Type Constraints](#type-constraints-7)
- [GridPriorsTRT](#gridpriorstrt)
- [Description](#description-8)
- [Parameters](#parameters-8)
- [Inputs](#inputs-8)
- [Outputs](#outputs-8)
- [Type Constraints](#type-constraints-8)
- [ScaledDotProductAttentionTRT](#scaleddotproductattentiontrt)
- [Description](#description-9)
- [Parameters](#parameters-9)
- [Inputs](#inputs-9)
- [Outputs](#outputs-9)
- [Type Constraints](#type-constraints-9)
- [GatherTopk](#gathertopk)
- [Description](#description-10)
- [Parameters](#parameters-10)
- [Inputs](#inputs-10)
- [Outputs](#outputs-10)
- [Type Constraints](#type-constraints-10)
- [MMCVMultiScaleDeformableAttention](#mmcvmultiscaledeformableattention)
- [Description](#description-11)
- [Parameters](#parameters-11)
- [Inputs](#inputs-11)
- [Outputs](#outputs-11)
- [Type Constraints](#type-constraints-11)
<!-- TOC -->
### TRTBatchedNMS
#### Description
Batched NMS with a fixed number of output bounding boxes.
#### Parameters
| Type | Parameter | Description |
| ------- | --------------------- | --------------------------------------------------------------------------------------------------------------------------------------- |
| `int` | `background_label_id` | The label ID for the background class. If there is no background class, set it to `-1`. |
| `int` | `num_classes` | The number of classes. |
| `int` | `topK` | The number of bounding boxes to be fed into the NMS step. |
| `int` | `keepTopK` | The number of total bounding boxes to be kept per-image after the NMS step. Should be less than or equal to the `topK` value. |
| `float` | `scoreThreshold` | The scalar threshold for score (low scoring boxes are removed). |
| `float` | `iouThreshold` | The scalar threshold for IoU (new boxes that have high IoU overlap with previously selected boxes are removed). |
| `int` | `isNormalized` | Set to `false` if the box coordinates are not normalized, meaning they are not in the range `[0,1]`. Defaults to `true`. |
| `int` | `clipBoxes` | Forcibly restrict bounding boxes to the normalized range `[0,1]`. Only applicable if `isNormalized` is also `true`. Defaults to `true`. |
#### Inputs
<dl>
<dt><tt>inputs[0]</tt>: T</dt>
<dd>boxes; 4-D tensor of shape (N, num_boxes, num_classes, 4), where N is the batch size; `num_boxes` is the number of boxes; `num_classes` is the number of classes, which could be 1 if the boxes are shared between all classes.</dd>
<dt><tt>inputs[1]</tt>: T</dt>
<dd>scores; 4-D tensor of shape (N, num_boxes, 1, num_classes). </dd>
</dl>
#### Outputs
<dl>
<dt><tt>outputs[0]</tt>: T</dt>
<dd>dets; 3-D tensor of shape (N, valid_num_boxes, 5), `valid_num_boxes` is the number of boxes after NMS. For each row `dets[i,j,:] = [x0, y0, x1, y1, score]`</dd>
<dt><tt>outputs[1]</tt>: tensor(int32, Linear)</dt>
<dd>labels; 2-D tensor of shape (N, valid_num_boxes). </dd>
</dl>
#### Type Constraints
- T:tensor(float32, Linear)
### grid_sampler
#### Description
Perform sampling of `input` at the pixel locations given by `grid`.
#### Parameters
| Type | Parameter | Description |
| ----- | -------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `int` | `interpolation_mode` | Interpolation mode to calculate output values. (0: `bilinear` , 1: `nearest`) |
| `int` | `padding_mode` | Padding mode for outside grid values. (0: `zeros`, 1: `border`, 2: `reflection`) |
| `int` | `align_corners` | If `align_corners=1`, the extrema (`-1` and `1`) are considered as referring to the center points of the input's corner pixels. If `align_corners=0`, they are instead considered as referring to the corner points of the input's corner pixels, making the sampling more resolution agnostic. |
#### Inputs
<dl>
<dt><tt>inputs[0]</tt>: T</dt>
<dd>Input feature; 4-D tensor of shape (N, C, inH, inW), where N is the batch size, C is the number of channels, inH and inW are the height and width of the data.</dd>
<dt><tt>inputs[1]</tt>: T</dt>
<dd>Input grid; 4-D tensor of shape (N, outH, outW, 2) containing normalized sampling locations, where outH and outW are the height and width of the grid and the output. </dd>
</dl>
#### Outputs
<dl>
<dt><tt>outputs[0]</tt>: T</dt>
<dd>Output feature; 4-D tensor of shape (N, C, outH, outW).</dd>
</dl>
#### Type Constraints
- T:tensor(float32, Linear)
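The semantics match PyTorch's `torch.nn.functional.grid_sample`; a minimal reference sketch, with shapes and parameter values chosen purely for illustration:

```python
import torch
import torch.nn.functional as F

feat = torch.rand(1, 3, 8, 8)            # (N, C, inH, inW)
grid = torch.rand(1, 4, 4, 2) * 2 - 1    # (N, outH, outW, 2), normalized to [-1, 1]

# interpolation_mode=0 -> 'bilinear', padding_mode=0 -> 'zeros', align_corners=0 -> False
out = F.grid_sample(feat, grid, mode='bilinear', padding_mode='zeros',
                    align_corners=False)
print(out.shape)  # torch.Size([1, 3, 4, 4])
```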
### MMCVInstanceNormalization
#### Description
Carry out instance normalization as described in the paper https://arxiv.org/abs/1607.08022.
y = scale * (x - mean) / sqrt(variance + epsilon) + B, where mean and variance are computed per instance per channel.
#### Parameters
| Type | Parameter | Description |
| ------- | --------- | -------------------------------------------------------------------- |
| `float` | `epsilon` | The epsilon value to use to avoid division by zero. Default is 1e-05 |
#### Inputs
<dl>
<dt><tt>input</tt>: T</dt>
<dd>Input data tensor from the previous operator; dimensions for image case are (N x C x H x W), where N is the batch size, C is the number of channels, and H and W are the height and the width of the data. For non image case, the dimensions are in the form of (N x C x D1 x D2 ... Dn), where N is the batch size.</dd>
<dt><tt>scale</tt>: T</dt>
<dd>The input 1-dimensional scale tensor of size C.</dd>
<dt><tt>B</tt>: T</dt>
<dd>The input 1-dimensional bias tensor of size C.</dd>
</dl>
#### Outputs
<dl>
<dt><tt>output</tt>: T</dt>
<dd>The output tensor of the same shape as input.</dd>
</dl>
#### Type Constraints
- T:tensor(float32, Linear)
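As a sanity check, the formula above can be reproduced with plain tensor ops and compared against PyTorch; a minimal sketch with illustrative shapes:

```python
import torch
import torch.nn.functional as F

epsilon = 1e-05
x = torch.rand(2, 3, 8, 8)   # (N, C, H, W)
scale = torch.rand(3)        # per-channel scale
B = torch.rand(3)            # per-channel bias

# mean and variance are computed per instance, per channel (over H, W)
mean = x.mean(dim=(2, 3), keepdim=True)
var = x.var(dim=(2, 3), unbiased=False, keepdim=True)
y = scale.view(1, -1, 1, 1) * (x - mean) / torch.sqrt(var + epsilon) + B.view(1, -1, 1, 1)

ref = F.instance_norm(x, weight=scale, bias=B, eps=epsilon)
print(torch.allclose(y, ref, atol=1e-5))  # True
```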
### MMCVModulatedDeformConv2d
#### Description
Perform Modulated Deformable Convolution on input feature. Read [Deformable ConvNets v2: More Deformable, Better Results](https://arxiv.org/abs/1811.11168?from=timeline) for detail.
#### Parameters
| Type | Parameter | Description |
| -------------- | ------------------ | ------------------------------------------------------------------------------------- |
| `list of ints` | `stride` | The stride of the convolving kernel. (sH, sW) |
| `list of ints` | `padding` | Paddings on both sides of the input. (padH, padW) |
| `list of ints` | `dilation` | The spacing between kernel elements. (dH, dW) |
| `int` | `deformable_group` | Groups of deformable offset. |
| `int` | `group` | Split input into groups. `input_channel` should be divisible by the number of groups. |
#### Inputs
<dl>
<dt><tt>inputs[0]</tt>: T</dt>
<dd>Input feature; 4-D tensor of shape (N, C, inH, inW), where N is the batch size, C is the number of channels, inH and inW are the height and width of the data.</dd>
<dt><tt>inputs[1]</tt>: T</dt>
<dd>Input offset; 4-D tensor of shape (N, deformable_group * 2 * kH * kW, outH, outW), where kH and kW are the height and width of weight, outH and outW are the height and width of offset and output.</dd>
<dt><tt>inputs[2]</tt>: T</dt>
<dd>Input mask; 4-D tensor of shape (N, deformable_group * kH * kW, outH, outW), where kH and kW are the height and width of weight, outH and outW are the height and width of offset and output.</dd>
<dt><tt>inputs[3]</tt>: T</dt>
<dd>Input weight; 4-D tensor of shape (output_channel, input_channel, kH, kW).</dd>
<dt><tt>inputs[4]</tt>: T, optional</dt>
<dd>Input bias; 1-D tensor of shape (output_channel).</dd>
</dl>
#### Outputs
<dl>
<dt><tt>outputs[0]</tt>: T</dt>
<dd>Output feature; 4-D tensor of shape (N, output_channel, outH, outW).</dd>
</dl>
#### Type Constraints
- T:tensor(float32, Linear)
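To make the shape requirements above concrete, the expected offset/mask shapes can be derived from the parameters; a small sketch with illustrative values, not tied to any particular model:

```python
# Illustrative values.
N, C, inH, inW = 1, 16, 32, 32
output_channel, kH, kW = 32, 3, 3
stride, padding, dilation = (1, 1), (1, 1), (1, 1)
deformable_group = 1

# Standard convolution output-size formula.
outH = (inH + 2 * padding[0] - dilation[0] * (kH - 1) - 1) // stride[0] + 1
outW = (inW + 2 * padding[1] - dilation[1] * (kW - 1) - 1) // stride[1] + 1

offset_shape = (N, deformable_group * 2 * kH * kW, outH, outW)  # inputs[1]
mask_shape = (N, deformable_group * kH * kW, outH, outW)        # inputs[2]
weight_shape = (output_channel, C, kH, kW)                      # inputs[3]
print(offset_shape, mask_shape, weight_shape)
```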
### MMCVMultiLevelRoiAlign
#### Description
Perform RoIAlign on features from multiple levels. Used in bbox_head of most two-stage detectors.
#### Parameters
| Type | Parameter | Description |
| ---------------- | ------------------ | ------------------------------------------------------------------------------------------------------------- |
| `int` | `output_height` | height of output roi. |
| `int` | `output_width` | width of output roi. |
| `list of floats` | `featmap_strides` | feature map stride of each level. |
| `int` | `sampling_ratio` | number of input samples to take for each output sample. `0` means to take samples densely for current models. |
| `float` | `roi_scale_factor` | RoIs will be scaled by this factor before RoI Align. |
| `int` | `finest_scale` | Scale threshold of mapping to level 0. Default: 56. |
| `int` | `aligned` | If `aligned=0`, use the legacy implementation in MMDetection. Else, align the results more perfectly. |
#### Inputs
<dl>
<dt><tt>inputs[0]</tt>: T</dt>
<dd>RoIs (Regions of Interest) to pool over; 2-D tensor of shape (num_rois, 5) given as [[batch_index, x1, y1, x2, y2], ...].</dd>
<dt><tt>inputs[1~]</tt>: T</dt>
<dd>Input feature maps; 4-D tensors of shape (N, C, H, W), where N is the batch size, C is the number of channels, H and W are the height and width of the data.</dd>
</dl>
#### Outputs
<dl>
<dt><tt>outputs[0]</tt>: T</dt>
<dd>RoI pooled output, 4-D tensor of shape (num_rois, C, output_height, output_width). The r-th batch element output[0][r-1] is a pooled feature map corresponding to the r-th RoI inputs[0][r-1].</dd>
</dl>
#### Type Constraints
- T:tensor(float32, Linear)
### MMCVRoIAlign
#### Description
Perform RoIAlign on output feature, used in bbox_head of most two-stage detectors.
#### Parameters
| Type | Parameter | Description |
| ------- | ---------------- | ------------------------------------------------------------------------------------------------------------- |
| `int` | `output_height` | height of output roi |
| `int` | `output_width` | width of output roi |
| `float` | `spatial_scale` | used to scale the input boxes |
| `int` | `sampling_ratio` | number of input samples to take for each output sample. `0` means to take samples densely for current models. |
| `str` | `mode` | pooling mode in each bin. `avg` or `max` |
| `int` | `aligned` | If `aligned=0`, use the legacy implementation in MMDetection. Else, align the results more perfectly. |
#### Inputs
<dl>
<dt><tt>inputs[0]</tt>: T</dt>
<dd>Input feature map; 4-D tensor of shape (N, C, H, W), where N is the batch size, C is the number of channels, H and W are the height and width of the data.</dd>
<dt><tt>inputs[1]</tt>: T</dt>
<dd>RoIs (Regions of Interest) to pool over; 2-D tensor of shape (num_rois, 5) given as [[batch_index, x1, y1, x2, y2], ...]. The RoIs' coordinates are the coordinate system of inputs[0].</dd>
</dl>
#### Outputs
<dl>
<dt><tt>outputs[0]</tt>: T</dt>
<dd>RoI pooled output, 4-D tensor of shape (num_rois, C, output_height, output_width). The r-th batch element output[0][r-1] is a pooled feature map corresponding to the r-th RoI inputs[1][r-1].</dd>
</dl>
#### Type Constraints
- T:tensor(float32, Linear)
### ScatterND
#### Description
ScatterND takes three inputs `data` tensor of rank r >= 1, `indices` tensor of rank q >= 1, and `updates` tensor of rank q + r - indices.shape[-1] - 1. The output of the operation is produced by creating a copy of the input `data`, and then updating its value to values specified by updates at specific index positions specified by `indices`. Its output shape is the same as the shape of `data`. Note that `indices` should not have duplicate entries. That is, two or more updates for the same index-location is not supported.
The `output` is calculated via the following equation:
```python
output = np.copy(data)
update_indices = indices.shape[:-1]
for idx in np.ndindex(update_indices):
    output[indices[idx]] = updates[idx]
```
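A tiny worked example of the update rule above (values are illustrative):

```python
import numpy as np

data = np.zeros((4, 3), dtype=np.float32)       # rank r = 2
indices = np.array([[1], [3]], dtype=np.int32)  # rank q = 2, indices.shape[-1] = 1
updates = np.ones((2, 3), dtype=np.float32)     # rank q + r - indices.shape[-1] - 1 = 2

output = np.copy(data)
for idx in np.ndindex(indices.shape[:-1]):
    # each index vector selects a position along the first indices.shape[-1] dimensions
    output[tuple(indices[idx])] = updates[idx]
print(output)  # rows 1 and 3 are now all ones, the rest stay zero
```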
#### Parameters
None
#### Inputs
<dl>
<dt><tt>inputs[0]</tt>: T</dt>
<dd>Tensor of rank r>=1.</dd>
<dt><tt>inputs[1]</tt>: tensor(int32, Linear)</dt>
<dd>Tensor of rank q>=1.</dd>
<dt><tt>inputs[2]</tt>: T</dt>
<dd>Tensor of rank q + r - indices_shape[-1] - 1.</dd>
</dl>
#### Outputs
<dl>
<dt><tt>outputs[0]</tt>: T</dt>
<dd>Tensor of rank r >= 1.</dd>
</dl>
#### Type Constraints
- T:tensor(float32, Linear), tensor(int32, Linear)
### TRTBatchedRotatedNMS
#### Description
Batched rotated NMS with a fixed number of output bounding boxes.
#### Parameters
| Type | Parameter | Description |
| ------- | --------------------- | --------------------------------------------------------------------------------------------------------------------------------------- |
| `int` | `background_label_id` | The label ID for the background class. If there is no background class, set it to `-1`. |
| `int` | `num_classes` | The number of classes. |
| `int` | `topK` | The number of bounding boxes to be fed into the NMS step. |
| `int` | `keepTopK` | The number of total bounding boxes to be kept per-image after the NMS step. Should be less than or equal to the `topK` value. |
| `float` | `scoreThreshold` | The scalar threshold for score (low scoring boxes are removed). |
| `float` | `iouThreshold` | The scalar threshold for IoU (new boxes that have high IoU overlap with previously selected boxes are removed). |
| `int` | `isNormalized` | Set to `false` if the box coordinates are not normalized, meaning they are not in the range `[0,1]`. Defaults to `true`. |
| `int` | `clipBoxes` | Forcibly restrict bounding boxes to the normalized range `[0,1]`. Only applicable if `isNormalized` is also `true`. Defaults to `true`. |
#### Inputs
<dl>
<dt><tt>inputs[0]</tt>: T</dt>
<dd>boxes; 4-D tensor of shape (N, num_boxes, num_classes, 5), where N is the batch size; `num_boxes` is the number of boxes; `num_classes` is the number of classes, which could be 1 if the boxes are shared between all classes.</dd>
<dt><tt>inputs[1]</tt>: T</dt>
<dd>scores; 4-D tensor of shape (N, num_boxes, 1, num_classes). </dd>
</dl>
#### Outputs
<dl>
<dt><tt>outputs[0]</tt>: T</dt>
<dd>dets; 3-D tensor of shape (N, valid_num_boxes, 6), `valid_num_boxes` is the number of boxes after NMS. For each row `dets[i,j,:] = [x0, y0, width, height, theta, score]`</dd>
<dt><tt>outputs[1]</tt>: tensor(int32, Linear)</dt>
<dd>labels; 2-D tensor of shape (N, valid_num_boxes). </dd>
</dl>
#### Type Constraints
- T:tensor(float32, Linear)
### GridPriorsTRT
#### Description
Generate the anchors for object detection task.
#### Parameters
| Type | Parameter | Description |
| ----- | ---------- | --------------------------------- |
| `int` | `stride_w` | The stride of the feature width. |
| `int` | `stride_h` | The stride of the feature height. |
#### Inputs
<dl>
<dt><tt>inputs[0]</tt>: T</dt>
<dd>The base anchors; 2-D tensor with shape [num_base_anchor, 4].</dd>
<dt><tt>inputs[1]</tt>: TAny</dt>
<dd>height provider; 1-D tensor with shape [featmap_height]. The data will never be used.</dd>
<dt><tt>inputs[2]</tt>: TAny</dt>
<dd>width provider; 1-D tensor with shape [featmap_width]. The data will never be used.</dd>
</dl>
#### Outputs
<dl>
<dt><tt>outputs[0]</tt>: T</dt>
<dd>output anchors; 2-D tensor of shape (num_base_anchor*featmap_height*featmap_width, 4).</dd>
</dl>
#### Type Constraints
- T:tensor(float32, Linear)
- TAny: Any
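Conceptually, the output is every base anchor shifted to every feature-map location; a rough numpy sketch of these semantics (not the plugin implementation):

```python
import numpy as np

stride_w, stride_h = 8, 8
base_anchors = np.array([[-8., -8., 8., 8.]])   # (num_base_anchor, 4), illustrative
featmap_height, featmap_width = 4, 4

shift_x = np.arange(featmap_width) * stride_w
shift_y = np.arange(featmap_height) * stride_h
xx, yy = np.meshgrid(shift_x, shift_y)
shifts = np.stack([xx.ravel(), yy.ravel(), xx.ravel(), yy.ravel()], axis=1)

# every base anchor shifted to every grid location
anchors = (base_anchors[None, :, :] + shifts[:, None, :]).reshape(-1, 4)
print(anchors.shape)  # (num_base_anchor * featmap_height * featmap_width, 4)
```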
### ScaledDotProductAttentionTRT
#### Description
Dot-product attention used to support multi-head attention. Read [Attention Is All You Need](https://arxiv.org/abs/1706.03762?context=cs) for more details.
#### Parameters
None
#### Inputs
<dl>
<dt><tt>inputs[0]</tt>: T</dt>
<dd>query; 3-D tensor with shape [batch_size, sequence_length, embedding_size].</dd>
<dt><tt>inputs[1]</tt>: T</dt>
<dd>key; 3-D tensor with shape [batch_size, sequence_length, embedding_size].</dd>
<dt><tt>inputs[2]</tt>: T</dt>
<dd>value; 3-D tensor with shape [batch_size, sequence_length, embedding_size].</dd>
<dt><tt>inputs[3]</tt>: T</dt>
<dd>mask; 2-D/3-D tensor with shape [sequence_length, sequence_length] or [batch_size, sequence_length, sequence_length]. optional.</dd>
</dl>
#### Outputs
<dl>
<dt><tt>outputs[0]</tt>: T</dt>
<dd>3-D tensor of shape [batch_size, sequence_length, embedding_size]. `softmax(q@k.T)@v`</dd>
<dt><tt>outputs[1]</tt>: T</dt>
<dd>3-D tensor of shape [batch_size, sequence_length, sequence_length]. `softmax(q@k.T)`</dd>
</dl>
#### Type Constraints
- T:tensor(float32, Linear)
### GatherTopk
#### Description
TensorRT 8.2~8.4 may give unexpected results for a multi-index gather such as:
```python
data[batch_index, bbox_index, ...]
```
Read [this](https://github.com/NVIDIA/TensorRT/issues/2299) for more details.
#### Parameters
None
#### Inputs
<dl>
<dt><tt>inputs[0]</tt>: T</dt>
<dd>Tensor to be gathered, with shape (A0, ..., An, G0, C0, ...).</dd>
<dt><tt>inputs[1]</tt>: tensor(int32, Linear)</dt>
<dd>Index tensor with shape (A0, ..., An, G1).</dd>
</dl>
#### Outputs
<dl>
<dt><tt>outputs[0]</tt>: T</dt>
<dd>Output tensor with shape (A0, ..., An, G1, C0, ...).</dd>
</dl>
#### Type Constraints
- T:tensor(float32, Linear), tensor(int32, Linear)
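A numpy sketch of the intended batched gather, assuming a single leading batch dimension (A0 = N):

```python
import numpy as np

N, G0, C0, G1 = 2, 10, 4, 3
data = np.random.rand(N, G0, C0).astype(np.float32)          # (A0, G0, C0)
index = np.random.randint(0, G0, (N, G1)).astype(np.int32)   # (A0, G1)

# per-batch gather: output[i, j, :] = data[i, index[i, j], :]
output = np.take_along_axis(data, index[..., None], axis=1)
print(output.shape)  # (2, 3, 4) -> (A0, G1, C0)
```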
### MMCVMultiScaleDeformableAttention
#### Description
Perform attention computation over a small set of key sampling points around a reference point rather than looking over all possible spatial locations. Read [Deformable DETR: Deformable Transformers for End-to-End Object Detection](https://arxiv.org/abs/2010.04159) for detail.
#### Parameters
None
#### Inputs
<dl>
<dt><tt>inputs[0]</tt>: T</dt>
<dd>Input feature; 4-D tensor of shape (N, S, M, D), where N is the batch size, S is the length of feature maps, M is the number of attention heads, and D is hidden_dim.</dd>
<dt><tt>inputs[1]</tt>: T</dt>
<dd>Input spatial shapes; 2-D tensor of shape (L, 2), where L is the number of feature maps and each row is the spatial shape (H, W) of one feature map.</dd>
<dt><tt>inputs[2]</tt>: T</dt>
<dd>Input level start index; 1-D tensor of shape (L, ), used to locate the start of each feature level in the flattened input feature tensor.</dd>
<dt><tt>inputs[3]</tt>: T</dt>
<dd>Input sampling locations; 6-D tensor of shape (N, Lq, M, L, P, 2), where Lq is the length of feature maps (encoder) / length of queries (decoder) and P is the number of points.</dd>
<dt><tt>inputs[4]</tt>: T, optional</dt>
<dd>Input attention weights; 5-D tensor of shape (N, Lq, M, L, P).</dd>
</dl>
#### Outputs
<dl>
<dt><tt>outputs[0]</tt>: T</dt>
<dd>Output feature; 3-D tensor of shape (N, Lq, M*D).</dd>
</dl>
#### Type Constraints
- T:tensor(float32, Linear)
# How to add test units for backend ops
This tutorial introduces how to add unit tests for backend ops. When you add a custom op under `backend_ops`, you need to add the corresponding unit test. Unit tests for ops are included in `tests/test_ops/test_ops.py`.
## Prerequisite
- `Compile new ops`: After adding a new custom op, you need to recompile the relevant backend; refer to [build.md](../01-how-to-build/build_from_source.md).
## 1. Add the test program test_XXXX()
You can put unit tests for ops in `tests/test_ops/`. Usually, the following template can be used for your custom op.
### example of ops unit test
```python
@pytest.mark.parametrize('backend', [TEST_TENSORRT, TEST_ONNXRT])  # 1.1 backend test class
@pytest.mark.parametrize('pool_h,pool_w,spatial_scale,sampling_ratio',  # 1.2 set parameters of op
                         [(2, 2, 1.0, 2), (4, 4, 2.0, 4)])  # [(# Examples of op test parameters),...]
def test_roi_align(backend,
                   pool_h,  # set parameters of op
                   pool_w,
                   spatial_scale,
                   sampling_ratio,
                   input_list=None,
                   save_dir=None):
    backend.check_env()

    if input_list is None:
        input = torch.rand(1, 1, 16, 16, dtype=torch.float32)  # 1.3 op input data initialization
        single_roi = torch.tensor([[0, 0, 0, 4, 4]], dtype=torch.float32)
    else:
        input = torch.tensor(input_list[0], dtype=torch.float32)
        single_roi = torch.tensor(input_list[1], dtype=torch.float32)

    from mmcv.ops import roi_align

    def wrapped_function(torch_input, torch_rois):  # 1.4 initialize op model to be tested
        return roi_align(torch_input, torch_rois, (pool_w, pool_h),
                         spatial_scale, sampling_ratio, 'avg', True)

    wrapped_model = WrapFunction(wrapped_function).eval()

    with RewriterContext(cfg={}, backend=backend.backend_name, opset=11):  # 1.5 call the backend test class interface
        backend.run_and_validate(
            wrapped_model, [input, single_roi],
            'roi_align',
            input_names=['input', 'rois'],
            output_names=['roi_feat'],
            save_dir=save_dir)
```
### 1.1 backend test class
We provide some functions and classes for different backends, such as `TestOnnxRTExporter`, `TestTensorRTExporter`, `TestNCNNExporter`.
### 1.2 set parameters of op
Set some parameters of the op, such as `pool_h`, `pool_w`, `spatial_scale` and `sampling_ratio` in `roi_align`. You can set multiple groups of parameters to test the op.
### 1.3 op input data initialization
Initialize the required input data.
### 1.4 initialize op model to be tested
The model containing the custom op usually comes in one of two forms:
- `torch model`: A torch model with custom operators. Python code related to the op is required; refer to the `roi_align` unit test.
- `onnx model`: An ONNX model with custom operators, built by calling the ONNX API directly; refer to the `multi_level_roi_align` unit test. A minimal sketch of this approach is shown below.
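For the ONNX-model form, the graph can be assembled directly with `onnx.helper`. Below is a minimal, hypothetical sketch; the op type, domain, attributes and shapes are illustrative, not the actual test code:

```python
import onnx
from onnx import TensorProto, helper

# A single node using a custom-domain op; names and shapes are illustrative.
node = helper.make_node(
    'MMCVRoIAlign',
    inputs=['feat', 'rois'],
    outputs=['roi_feat'],
    domain='mmdeploy',
    output_height=2, output_width=2,
    spatial_scale=1.0, sampling_ratio=2)

graph = helper.make_graph(
    [node], 'test_graph',
    inputs=[
        helper.make_tensor_value_info('feat', TensorProto.FLOAT, [1, 1, 16, 16]),
        helper.make_tensor_value_info('rois', TensorProto.FLOAT, [1, 5])
    ],
    outputs=[
        helper.make_tensor_value_info('roi_feat', TensorProto.FLOAT, [1, 1, 2, 2])
    ])

model = helper.make_model(graph, opset_imports=[helper.make_opsetid('', 11)])
onnx.save(model, 'custom_op.onnx')
```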
### 1.5 call the backend test class interface
Call the backend test class `run_and_validate` to run and verify the result output by the op on the backend.
```python
def run_and_validate(self,
                     model,
                     input_list,
                     model_name='tmp',
                     tolerate_small_mismatch=False,
                     do_constant_folding=True,
                     dynamic_axes=None,
                     output_names=None,
                     input_names=None,
                     expected_result=None,
                     save_dir=None):
```
#### Parameter Description
- `model`: Input model to be tested and it can be torch model or any other backend model.
- `input_list`: List of test data, which is mapped to the order of input_names.
- `model_name`: The name of the model.
- `tolerate_small_mismatch`: Whether to allow small errors in the verification of results.
- `do_constant_folding`: Whether to use constant folding to optimize the model.
- `dynamic_axes`: If you need to use dynamic dimensions, enter the dimension information.
- `output_names`: The node name of the output node.
- `input_names`: The node name of the input node.
- `expected_result`: Expected ground truth values for verification.
- `save_dir`: The folder used to save the output files.
## 2. Test Methods
Use pytest to call the test function to test ops.
```bash
pytest tests/test_ops/test_ops.py::test_XXXX
```
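Standard pytest filters also apply, which can be handy when an op has many parameterized cases; for example:

```bash
# run only the roi_align related tests and print each case
pytest tests/test_ops/test_ops.py -k roi_align -v
```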
# mmdeploy Architecture
This article mainly introduces the functions of each directory of mmdeploy and how it works from model conversion to real inference.
## Take a general look at the directory structure
The entire mmdeploy can be seen as two independent parts: model conversion and SDK.
We introduce the entire repo's directory structure and functions; there is no need to study the source code, just form an impression.
Peripheral directory features:
```bash
$ cd /path/to/mmdeploy
$ tree -L 1
.
├── CMakeLists.txt # Compile custom operator and cmake configuration of SDK
├── configs # Algorithm library configuration for model conversion
├── csrc # SDK and custom operator
├── demo # FFI interface examples in various languages, such as csharp, java, python, etc.
├── docker # docker build
├── mmdeploy # python package for model conversion
├── requirements # python requirements
├── service # Some embedded boards do not support Python, so we use a client/server mode for model conversion; this is the server code
├── tests # unittest
├── third_party # 3rd party dependencies required by SDK and FFI
└── tools # Tools are also the entrance to all functions, such as onnx2xx.py, profiler.py, test.py, etc.
```
It should be clear that:
- Model conversion mainly depends on `tools`, `mmdeploy` and a small part of the `csrc` directory;
- The SDK consists of three directories: `csrc`, `third_party` and `demo`.
## Model Conversion
Here we take ViT of mmpretrain as model example, and take ncnn as inference backend example. Other models and inferences are similar.
Let's take a look at the mmdeploy/mmdeploy directory structure and get an impression:
```bash
.
├── apis # The api used by tools is implemented here, such as onnx2ncnn.py
│   ├── calibration.py # Collect calibration data (dedicated to TensorRT quantization)
│   ├── core # Software infrastructure
│   ├── extract_model.py # Use it to export part of onnx
│   ├── inference.py # Abstract function, which will actually call torch/ncnn specific inference
│   ├── ncnn # ncnn Wrapper
│   └── visualize.py # Still an abstract function, which will actually call torch/ncnn specific inference and visualize
..
├── backend # Backend wrapper
│   ├── base # Because there are multiple backends, there must be an OO design for the base class
│   ├── ncnn # This calls the ncnn python interface for model conversion
│   │   ├── init_plugins.py # Find the path of ncnn custom operators and ncnn tools
│   │   ├── onnx2ncnn.py # Wrap `mmdeploy_onnx2ncnn` into a python interface
│   │   ├── quant.py # Wrap `ncnn2int8` as a python interface
│   │   └── wrapper.py # Wrap pyncnn forward API
..
├── codebase # Algorithm rewriter
│   ├── base # There are multiple algorithms here that we need a bit of OO design
│   ├── mmpretrain # mmpretrain related model rewrite
│   │   ├── deploy # mmpretrain implementation of base abstract task/model/codebase
│   │   └── models # Real model rewrite
│   │       ├── backbones # Rewrites of backbone network parts, such as multiheadattention
│   │       ├── heads # Such as MultiLabelClsHead
│   │       └── necks # Such as GlobalAveragePooling
..
├── core # Software infrastructure of rewrite mechanism
├── mmcv # Rewrite mmcv
├── pytorch # Rewrite pytorch operator for ncnn, such as Gemm
..
```
Each line above needs to be read, don't skip it.
When running `tools/deploy.py` to convert ViT, three things happen:
1. Rewrite of the mmpretrain ViT forward
2. ncnn does not support the `gather` operator, so we customize it and load it with libncnn.so
3. Run the exported ncnn model with real inference, render the output, and make sure the result is correct
### 1. Rewrite `forward`
When exporting ViT to onnx, some operators are generated that ncnn doesn't support perfectly. mmdeploy's solution is to hijack the forward code and change it so that the exported onnx is suitable for ncnn.
For example, rewrite the process of `conv -> shape -> concat_const -> reshape` to `conv -> reshape` to trim off the redundant `shape` and `concat` operators.
All mmpretrain algorithm rewriters are in the `mmdeploy/codebase/mmpretrain/models` directory.
### 2. Custom Operator
Operators customized for ncnn are in the `csrc/mmdeploy/backend_ops/ncnn/` directory, and are loaded together with `libncnn.so` after compilation. In essence they are hotfixes for ncnn, which currently implement these operators:
- topk
- tensorslice
- shape
- gather
- expand
- constantofshape
### 3. Model Conversion and testing
We first use the modified `mmdeploy_onnx2ncnn` to convert the model, then run inference with `pyncnn` and the custom ops.
When encountering a framework such as snpe that does not support Python well, we use the C/S mode: wrap a server with a protocol such as gRPC and forward the real inference output.
For rendering, mmdeploy directly uses the rendering API of the upstream algorithm codebase.
## SDK
After the model conversion is completed, the SDK, written in C++, can be used to run models on different platforms.
Let's take a look at the csrc/mmdeploy directory structure:
```bash
.
├── apis # csharp, java, go, Rust and other FFI interfaces
├── backend_ops # Custom operators for each inference framework
├── CMakeLists.txt
├── codebase # Result types preferred by each algorithm framework, e.g. bbox is widely used for the detection task
├── core # Abstraction of graph, operator, device and so on
├── device # Implementation of CPU/GPU device abstraction
├── execution # Implementation of the execution abstraction
├── graph # Implementation of graph abstraction
├── model # Implement both zip-compressed and uncompressed work directory
├── net # Implementation of net, such as wrap ncnn forward C API
├── preprocess # Implement preprocess
└── utils # OCV tools
```
The essence of the SDK is to design a set of abstractions for the computational graph and to combine **multiple models'**
- preprocess
- inference
- postprocess
while providing FFI in multiple languages at the same time.
# How to get partitioned ONNX models
MMDeploy supports exporting PyTorch models to partitioned onnx models. With this feature, users can define their partition policy and get partitioned onnx models at ease. In this tutorial, we will briefly introduce how to support partitioning a model step by step. In the example, we break the YOLOV3 model into two parts and extract the first part, without the post-processing (such as anchor generation and NMS), into the onnx model.
## Step 1: Mark inputs/outputs
To support the model partition, we need to add `Mark` nodes in the ONNX model. This could be done with mmdeploy's `@mark` decorator. Note that to make the `mark` work, the marking operation should be included in a rewriting function.
At first, we would mark the model input, which could be done by marking the input tensor `img` in the `forward` method of `BaseDetector` class, which is the parent class of all detector classes. Thus we name this marking point as `detector_forward` and mark the inputs as `input`. Since there could be three outputs for detectors such as `Mask RCNN`, the outputs are marked as `dets`, `labels`, and `masks`. The following code shows the idea of adding mark functions and calling the mark functions in the rewrite. For source code, you could refer to [mmdeploy/codebase/mmdet/models/detectors/single_stage.py](https://github.com/open-mmlab/mmdeploy/blob/4fc8828af84281b62be143012cd9f9dafd1e7cc2/mmdeploy/codebase/mmdet/models/detectors/single_stage.py)
```python
from mmdeploy.core import FUNCTION_REWRITER, mark


@mark(
    'detector_forward', inputs=['input'], outputs=['dets', 'labels', 'masks'])
def __forward_impl(self, img, img_metas=None, **kwargs):
    ...


@FUNCTION_REWRITER.register_rewriter(
    'mmdet.models.detectors.base.BaseDetector.forward')
def base_detector__forward(self, img, img_metas=None, **kwargs):
    ...
    # call the mark function
    return __forward_impl(...)
```
Then, we have to mark the output feature of `YOLOV3Head`, which is the input argument `pred_maps` in the `get_bboxes` method of the `YOLOV3Head` class. We could add an internal function to only mark the `pred_maps` inside the [`yolov3_head__get_bboxes`](https://github.com/open-mmlab/mmdeploy/blob/4fc8828af84281b62be143012cd9f9dafd1e7cc2/mmdeploy/codebase/mmdet/models/dense_heads/yolo_head.py#L16) function as follows.
```python
from mmdeploy.core import FUNCTION_REWRITER, mark


@FUNCTION_REWRITER.register_rewriter(
    func_name='mmdet.models.dense_heads.YOLOV3Head.get_bboxes')
def yolov3_head__get_bboxes(self,
                            pred_maps,
                            img_metas,
                            cfg=None,
                            rescale=False,
                            with_nms=True):
    # mark pred_maps
    @mark('yolo_head', inputs=['pred_maps'])
    def __mark_pred_maps(pred_maps):
        return pred_maps

    pred_maps = __mark_pred_maps(pred_maps)
    ...
```
Note that `pred_maps` is a list of `Tensor` and it has three elements. Thus, three `Mark` nodes with op name as `pred_maps.0`, `pred_maps.1`, `pred_maps.2` would be added in the onnx model.
## Step 2: Add partition config
After marking necessary nodes that would be used to split the model, we could add a deployment config file `configs/mmdet/detection/yolov3_partition_onnxruntime_static.py`. If you are not familiar with how to write config, you could check [write_config.md](../02-how-to-run/write_config.md).
In the config file, we need to add `partition_config`. The key part is `partition_cfg`, which contains a list of dicts that designate the start nodes and end nodes of each model segment. Since we only want to keep `YOLOV3` without post-processing, we could set the `start` as `['detector_forward:input']` and `end` as `['yolo_head:input']`. Note that `start` and `end` can have multiple marks.
```python
_base_ = ['./detection_onnxruntime_static.py']

onnx_config = dict(input_shape=[608, 608])

partition_config = dict(
    type='yolov3_partition',  # the partition policy name
    apply_marks=True,  # should always be set to True
    partition_cfg=[
        dict(
            save_file='yolov3.onnx',  # filename to save the partitioned onnx model
            start=['detector_forward:input'],  # [mark_name:input/output, ...]
            end=['yolo_head:input'],  # [mark_name:input/output, ...]
            output_names=[f'pred_maps.{i}' for i in range(3)])  # output names
    ])
```
## Step 3: Get partitioned onnx models
Once we have marks of nodes and the deployment config with `partition_config` being set properly, we could use the [tool](../02-how-to-run/useful_tools.md) `torch2onnx` to export the model to onnx and get the partitioned onnx files.
```shell
python tools/torch2onnx.py \
configs/mmdet/detection/yolov3_partition_onnxruntime_static.py \
../mmdetection/configs/yolo/yolov3_d53_8xb8-ms-608-273e_coco.py \
https://download.openmmlab.com/mmdetection/v2.0/yolo/yolov3_d53_mstrain-608_273e_coco/yolov3_d53_mstrain-608_273e_coco_20210518_115020-a2c3acb8.pth \
../mmdetection/demo/demo.jpg \
--work-dir ./work-dirs/mmdet/yolov3/ort/partition
```
After running the script above, we will have the partitioned onnx file `yolov3.onnx` in the `work-dir`. You can use the visualization tool [netron](https://netron.app/) to check the model structure.
With the partitioned onnx file, you could refer to [useful_tools.md](../02-how-to-run/useful_tools.md) to do the following procedures such as `mmdeploy_onnx2ncnn`, `onnx2tensorrt`.
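To double-check the result, the partitioned file can also be loaded with ONNX Runtime and its inputs/outputs inspected; a small sketch, assuming the paths from the example above:

```python
import onnxruntime as ort

sess = ort.InferenceSession(
    './work-dirs/mmdet/yolov3/ort/partition/yolov3.onnx',
    providers=['CPUExecutionProvider'])

# expect the marked input and the three pred_maps outputs
print([i.name for i in sess.get_inputs()])   # e.g. ['input']
print([o.name for o in sess.get_outputs()])  # e.g. ['pred_maps.0', 'pred_maps.1', 'pred_maps.2']
```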
# How to do regression test
This tutorial describes how to do regression tests. The deployment configuration file contains codebase config and inference config.
### 1. Python Environment
```shell
pip install -r requirements/tests.txt
```
If pip throws an exception, try upgrading numpy.
```shell
pip install -U numpy
```
## 2. Usage
```shell
python ./tools/regression_test.py \
--codebase "${CODEBASE_NAME}" \
--backends "${BACKEND}" \
[--models "${MODELS}"] \
--work-dir "${WORK_DIR}" \
--device "${DEVICE}" \
--log-level INFO \
[--performance | -p] \
[--checkpoint-dir "$CHECKPOINT_DIR"]
```
### Description
- `--codebase` : The codebase to test, e.g. `mmdet`. If you want to test multiple codebases, use `mmpretrain mmdet ...`
- `--backends` : The backend to test. By default, all `backend`s would be tested. You can use `onnxruntime tensorrt` to choose several backends. If you also need to test the SDK, you need to configure the `sdk_config` in `tests/regression/${codebase}.yml`.
- `--models` : Specify the model to be tested. All models in `yml` are tested by default. You can also give some model names. For the model name, please refer to the relevant yml configuration file. For example `ResNet SE-ResNet "Mask R-CNN"`. Model name can only contain numbers and letters.
- `--work-dir` : The directory for model conversion results and reports; `../mmdeploy_regression_working_dir` is used by default.
- `--checkpoint-dir`: The path of downloaded torch model, use `../mmdeploy_checkpoints` by default.
- `--device` : device type, use `cuda` by default
- `--log-level` : These options are available:`'CRITICAL', 'FATAL', 'ERROR', 'WARN', 'WARNING', 'INFO', 'DEBUG', 'NOTSET'`. The default value is `INFO`.
- `-p` or `--performance` : Whether to test precision. If not enabled, only model conversion is tested.
### Notes
For Windows user:
1. To use the `&&` connector in shell commands, you need to download `PowerShell 7 Preview 5+`.
2. If you are using conda env, you may need to change `python3` to `python` in regression_test.py because there is `python3.exe` in `%USERPROFILE%\AppData\Local\Microsoft\WindowsApps` directory.
## Example
1. Test all backends of mmdet and mmpose for **model convert and precision**
```shell
python ./tools/regression_test.py \
--codebase mmdet mmpose \
--work-dir "../mmdeploy_regression_working_dir" \
--device "cuda" \
--log-level INFO \
--performance
```
2. Test **model convert and precision** of some backends of mmdet and mmpose
```shell
python ./tools/regression_test.py \
--codebase mmdet mmpose \
--backends onnxruntime tensorrt \
--work-dir "../mmdeploy_regression_working_dir" \
--device "cuda" \
--log-level INFO \
-p
```
3. Test some backends of mmdet and mmpose, **only test model convert**
```shell
python ./tools/regression_test.py \
--codebase mmdet mmpose \
--backends onnxruntime tensorrt \
--work-dir "../mmdeploy_regression_working_dir" \
--device "cuda" \
--log-level INFO
```
4. Test some models of mmdet and mmpretrain, **only test model convert**
```shell
python ./tools/regression_test.py \
--codebase mmdet mmpretrain \
--models ResNet SE-ResNet "Mask R-CNN" \
--work-dir "../mmdeploy_regression_working_dir" \
--device "cuda" \
--log-level INFO
```
## 3. Regression Test Configuration
### Example and parameter description
```yaml
globals:
  codebase_dir: ../mmocr # codebase path to test
  checkpoint_force_download: False # whether to redownload the model even if it already exists
  images:
    img_densetext_det: &img_densetext_det ../mmocr/demo/demo_densetext_det.jpg
    img_demo_text_det: &img_demo_text_det ../mmocr/demo/demo_text_det.jpg
    img_demo_text_ocr: &img_demo_text_ocr ../mmocr/demo/demo_text_ocr.jpg
    img_demo_text_recog: &img_demo_text_recog ../mmocr/demo/demo_text_recog.jpg
  metric_info: &metric_info
    hmean-iou: # metafile.Results.Metrics
      eval_name: hmean-iou # test.py --metrics args
      metric_key: 0_hmean-iou:hmean # the key name of eval log
      tolerance: 0.1 # tolerated threshold interval
      task_name: Text Detection # the name of metafile.Results.Task
      dataset: ICDAR2015 # the name of metafile.Results.Dataset
    word_acc: # same as hmean-iou, also a kind of metric
      eval_name: acc
      metric_key: 0_word_acc_ignore_case
      tolerance: 0.2
      task_name: Text Recognition
      dataset: IIIT5K
  convert_image_det: &convert_image_det # the image that will be used by detection model convert
    input_img: *img_densetext_det
    test_img: *img_demo_text_det
  convert_image_rec: &convert_image_rec
    input_img: *img_demo_text_recog
    test_img: *img_demo_text_recog
  backend_test: &default_backend_test True # whether test model precision for backend
  sdk: # SDK config
    sdk_detection_dynamic: &sdk_detection_dynamic configs/mmocr/text-detection/text-detection_sdk_dynamic.py
    sdk_recognition_dynamic: &sdk_recognition_dynamic configs/mmocr/text-recognition/text-recognition_sdk_dynamic.py

onnxruntime:
  pipeline_ort_recognition_static_fp32: &pipeline_ort_recognition_static_fp32
    convert_image: *convert_image_rec # the image used by model conversion
    backend_test: *default_backend_test # whether inference on the backend
    sdk_config: *sdk_recognition_dynamic # test SDK or not. If it exists, use a specific SDK config for testing
    deploy_config: configs/mmocr/text-recognition/text-recognition_onnxruntime_static.py # the deploy cfg path to use, based on mmdeploy path

  pipeline_ort_recognition_dynamic_fp32: &pipeline_ort_recognition_dynamic_fp32
    convert_image: *convert_image_rec
    backend_test: *default_backend_test
    sdk_config: *sdk_recognition_dynamic
    deploy_config: configs/mmocr/text-recognition/text-recognition_onnxruntime_dynamic.py

  pipeline_ort_detection_dynamic_fp32: &pipeline_ort_detection_dynamic_fp32
    convert_image: *convert_image_det
    deploy_config: configs/mmocr/text-detection/text-detection_onnxruntime_dynamic.py

tensorrt:
  pipeline_trt_recognition_dynamic_fp16: &pipeline_trt_recognition_dynamic_fp16
    convert_image: *convert_image_rec
    backend_test: *default_backend_test
    sdk_config: *sdk_recognition_dynamic
    deploy_config: configs/mmocr/text-recognition/text-recognition_tensorrt-fp16_dynamic-1x32x32-1x32x640.py

  pipeline_trt_detection_dynamic_fp16: &pipeline_trt_detection_dynamic_fp16
    convert_image: *convert_image_det
    backend_test: *default_backend_test
    sdk_config: *sdk_detection_dynamic
    deploy_config: configs/mmocr/text-detection/text-detection_tensorrt-fp16_dynamic-320x320-2240x2240.py

openvino:
  # same as onnxruntime backend configuration

ncnn:
  # same as onnxruntime backend configuration

pplnn:
  # same as onnxruntime backend configuration

torchscript:
  # same as onnxruntime backend configuration

models:
  - name: crnn # model name
    metafile: configs/textrecog/crnn/metafile.yml # the path of model metafile, based on codebase path
    codebase_model_config_dir: configs/textrecog/crnn # the basepath of `model_configs`, based on codebase path
    model_configs: # the config name to test
      - crnn_academic_dataset.py
    pipelines: # pipeline name
      - *pipeline_ort_recognition_dynamic_fp32

  - name: dbnet
    metafile: configs/textdet/dbnet/metafile.yml
    codebase_model_config_dir: configs/textdet/dbnet
    model_configs:
      - dbnet_r18_fpnc_1200e_icdar2015.py
    pipelines:
      - *pipeline_ort_detection_dynamic_fp32
      - *pipeline_trt_detection_dynamic_fp16

      # special pipeline can be added like this
      - convert_image: xxx
        backend_test: xxx
        sdk_config: xxx
        deploy_config: configs/mmocr/text-detection/xxx
```
## 4. Generated Report
This is an example of mmocr regression test report.
| | Model | Model Config | Task | Checkpoint | Dataset | Backend | Deploy Config | Static or Dynamic | Precision Type | Conversion Result | hmean-iou | word_acc | Test Pass |
| --- | ----- | ---------------------------------------------------------------- | ---------------- | ------------------------------------------------------------------------------------------------------------ | --------- | --------------- | -------------------------------------------------------------------------------------- | ----------------- | -------------- | ----------------- | --------- | -------- | --------- |
| 0 | crnn | ../mmocr/configs/textrecog/crnn/crnn_academic_dataset.py | Text Recognition | ../mmdeploy_checkpoints/mmocr/crnn/crnn_academic-a723a1c5.pth | IIIT5K | Pytorch | - | - | - | - | - | 80.5 | - |
| 1 | crnn | ../mmocr/configs/textrecog/crnn/crnn_academic_dataset.py | Text Recognition | ${WORK_DIR}/mmocr/crnn/onnxruntime/static/crnn_academic-a723a1c5/end2end.onnx | x | onnxruntime | configs/mmocr/text-recognition/text-recognition_onnxruntime_dynamic.py | static | fp32 | True | - | 80.67 | True |
| 2 | crnn | ../mmocr/configs/textrecog/crnn/crnn_academic_dataset.py | Text Recognition | ${WORK_DIR}/mmocr/crnn/onnxruntime/static/crnn_academic-a723a1c5 | x | SDK-onnxruntime | configs/mmocr/text-recognition/text-recognition_sdk_dynamic.py | static | fp32 | True | - | x | False |
| 3 | dbnet | ../mmocr/configs/textdet/dbnet/dbnet_r18_fpnc_1200e_icdar2015.py | Text Detection | ../mmdeploy_checkpoints/mmocr/dbnet/dbnet_r18_fpnc_sbn_1200e_icdar2015_20210329-ba3ab597.pth | ICDAR2015 | Pytorch | - | - | - | - | 0.795 | - | - |
| 4 | dbnet | ../mmocr/configs/textdet/dbnet/dbnet_r18_fpnc_1200e_icdar2015.py | Text Detection | ../mmdeploy_checkpoints/mmocr/dbnet/dbnet_r18_fpnc_sbn_1200e_icdar2015_20210329-ba3ab597.pth | ICDAR | onnxruntime | configs/mmocr/text-detection/text-detection_onnxruntime_dynamic.py | dynamic | fp32 | True | - | - | True |
| 5 | dbnet | ../mmocr/configs/textdet/dbnet/dbnet_r18_fpnc_1200e_icdar2015.py | Text Detection | ${WORK_DIR}/mmocr/dbnet/tensorrt/dynamic/dbnet_r18_fpnc_sbn_1200e_icdar2015_20210329-ba3ab597/end2end.engine | ICDAR | tensorrt | configs/mmocr/text-detection/text-detection_tensorrt-fp16_dynamic-320x320-2240x2240.py | dynamic | fp16 | True | 0.793302 | - | True |
| 6 | dbnet | ../mmocr/configs/textdet/dbnet/dbnet_r18_fpnc_1200e_icdar2015.py | Text Detection | ${WORK_DIR}/mmocr/dbnet/tensorrt/dynamic/dbnet_r18_fpnc_sbn_1200e_icdar2015_20210329-ba3ab597 | ICDAR | SDK-tensorrt | configs/mmocr/text-detection/text-detection_sdk_dynamic.py | dynamic | fp16 | True | 0.795073 | - | True |
## 5. Supported Backends
- [x] ONNX Runtime
- [x] TensorRT
- [x] PPLNN
- [x] ncnn
- [x] OpenVINO
- [x] TorchScript
- [x] SNPE
- [x] MMDeploy SDK
## 6. Supported Codebase and Metrics
| Codebase | Metric | Support |
| ---------- | -------- | ------------------ |
| mmdet | bbox | :heavy_check_mark: |
| | segm | :heavy_check_mark: |
| | PQ | :x: |
| mmpretrain | accuracy | :heavy_check_mark: |
| mmseg | mIoU | :heavy_check_mark: |
| mmpose | AR | :heavy_check_mark: |
| | AP | :heavy_check_mark: |
| mmocr | hmean | :heavy_check_mark: |
| | acc | :heavy_check_mark: |
| mmagic | PSNR | :heavy_check_mark: |
| | SSIM | :heavy_check_mark: |
# How to support new backends
MMDeploy supports a number of backend engines. We welcome the contribution of new backends. In this tutorial, we will introduce the general procedures to support a new backend in MMDeploy.
## Prerequisites
Before contributing the codes, there are some requirements for the new backend that need to be checked:
- The backend must support ONNX as IR.
- If the backend requires model files or weight files other than a ".onnx" file, a conversion tool that converts the ".onnx" file to model files and weight files is required. The tool can be a Python API, a script, or an executable program.
- It is highly recommended that the backend provides a Python interface to load the backend files and inference for validation.
## Support backend conversion
The backends in MMDeploy must support ONNX. The backend loads the ".onnx" file directly, or converts the ".onnx" to its own format using the conversion tool. In this section, we will introduce the steps to support backend conversion.
1. Add a backend constant in `mmdeploy/utils/constants.py` that denotes the name of the backend.
**Example**:
```Python
# mmdeploy/utils/constants.py
class Backend(AdvancedEnum):
    # Take TensorRT as an example
    TENSORRT = 'tensorrt'
```
2. Add a corresponding package (a folder with `__init__.py`) in `mmdeploy/backend/`. For example, `mmdeploy/backend/tensorrt`. In the `__init__.py`, there must be a function named `is_available` which checks if users have installed the backend library. If the check is passed, then the remaining files of the package will be loaded.
**Example**:
```Python
# mmdeploy/backend/tensorrt/__init__.py
import importlib.util


def is_available():
    return importlib.util.find_spec('tensorrt') is not None


if is_available():
    from .utils import from_onnx, load, save
    from .wrapper import TRTWrapper

    __all__ = ['from_onnx', 'save', 'load', 'TRTWrapper']
```
3. Create a config file in `configs/_base_/backends` (e.g., `configs/_base_/backends/tensorrt.py`). If the backend just takes the '.onnx' file as input, the new config can be simple. The config of the backend only consists of one field denoting the name of the backend (which should be the same as the name in `mmdeploy/utils/constants.py`).
**Example**:
```python
backend_config = dict(type='onnxruntime')
```
If the backend requires other files, then the arguments for the conversion from ".onnx" file to backend files should be included in the config file.
**Example:**
```Python
backend_config = dict(
    type='tensorrt',
    common_config=dict(
        fp16_mode=False, max_workspace_size=0))
```
After possessing a base backend config file, you can easily construct a complete deploy config through inheritance. Please refer to our [config tutorial](../02-how-to-run/write_config.md) for more details. Here is an example:
```Python
_base_ = ['../_base_/backends/onnxruntime.py']
codebase_config = dict(type='mmpretrain', task='Classification')
onnx_config = dict(input_shape=None)
```
4. If the backend requires model files or weight files other than a ".onnx" file, create an `onnx2backend.py` file in the corresponding folder (e.g., create `mmdeploy/backend/tensorrt/onnx2tensorrt.py`). Then add a conversion function `onnx2backend` in the file. The function should convert a given ".onnx" file to the required backend files in a given work directory. There are no requirements on other parameters of the function or the implementation details. You can use any tools for conversion. Here are some examples:
**Use Python script:**
```Python
from subprocess import PIPE, run
from typing import Dict, List, Union

import torch


def onnx2openvino(input_info: Dict[str, Union[List[int], torch.Size]],
                  output_names: List[str], onnx_path: str, work_dir: str):
    input_names = ','.join(input_info.keys())
    input_shapes = ','.join(str(list(elem)) for elem in input_info.values())
    output = ','.join(output_names)

    mo_args = f'--input_model="{onnx_path}" '\
              f'--output_dir="{work_dir}" ' \
              f'--output="{output}" ' \
              f'--input="{input_names}" ' \
              f'--input_shape="{input_shapes}" ' \
              f'--disable_fusing '
    command = f'mo.py {mo_args}'
    mo_output = run(command, stdout=PIPE, stderr=PIPE, shell=True, check=True)
```
**Use executable program:**
```Python
from subprocess import call


def onnx2ncnn(onnx_path: str, work_dir: str):
    onnx2ncnn_path = get_onnx2ncnn_path()
    save_param, save_bin = get_output_model_file(onnx_path, work_dir)
    call([onnx2ncnn_path, onnx_path, save_param, save_bin])
```
5. Define APIs in a new package in `mmdeploy/apis`.
**Example:**
```Python
# mmdeploy/apis/ncnn/__init__.py
from mmdeploy.backend.ncnn import is_available

__all__ = ['is_available']

if is_available():
    from mmdeploy.backend.ncnn.onnx2ncnn import (onnx2ncnn,
                                                 get_output_model_file)
    __all__ += ['onnx2ncnn', 'get_output_model_file']
```
Create a backend manager class that derives from `BaseBackendManager` and implement its `to_backend` method.
**Example:**
```Python
@classmethod
def to_backend(cls,
               ir_files: Sequence[str],
               deploy_cfg: Any,
               work_dir: str,
               log_level: int = logging.INFO,
               device: str = 'cpu',
               **kwargs) -> Sequence[str]:
    return ir_files
```
6. Convert the models of OpenMMLab to backends (if necessary) and inference on backend engine. If you find some incompatible operators when testing, you can try to rewrite the original model for the backend following the [rewriter tutorial](support_new_model.md) or add custom operators.
7. Add docstring and unit tests for new code :).
## Support backend inference
Although the backend engines are usually implemented in C/C++, it is convenient for testing and debugging if the backend provides a Python inference interface. We encourage the contributors to support backend inference in the Python interface of MMDeploy. In this section we will introduce the steps to support backend inference.
1. Add a file named `wrapper.py` to the corresponding folder in `mmdeploy/backend/{backend}`. For example, `mmdeploy/backend/tensorrt/wrapper.py`. This module should implement and register a wrapper class that inherits the base class `BaseWrapper` in `mmdeploy/backend/base/base_wrapper.py`.
**Example:**
```Python
from mmdeploy.utils import Backend
from ..base import BACKEND_WRAPPER, BaseWrapper
@BACKEND_WRAPPER.register_module(Backend.TENSORRT.value)
class TRTWrapper(BaseWrapper):
```
2. The wrapper class can initialize the engine in `__init__` function and inference in `forward` function. Note that the `__init__` function must take a parameter `output_names` and pass it to base class to determine the orders of output tensors. The input and output variables of `forward` should be dictionaries denoting the name and value of the tensors.
3. For the convenience of performance testing, the class should define an "execute" function that only calls the inference interface of the backend engine. The `forward` function should call the "execute" function after preprocessing the data.
**Example:**
```Python
from typing import Dict, Optional, Sequence

import onnxruntime as ort
import torch

from mmdeploy.utils import Backend
from mmdeploy.utils.timer import TimeCounter
from ..base import BACKEND_WRAPPER, BaseWrapper


@BACKEND_WRAPPER.register_module(Backend.ONNXRUNTIME.value)
class ORTWrapper(BaseWrapper):

    def __init__(self,
                 onnx_file: str,
                 device: str,
                 output_names: Optional[Sequence[str]] = None):
        # Initialization
        # ...
        super().__init__(output_names)

    def forward(self, inputs: Dict[str,
                                   torch.Tensor]) -> Dict[str, torch.Tensor]:
        # Fetch data
        # ...
        self.__ort_execute(self.io_binding)
        # Postprocess data
        # ...

    @TimeCounter.count_time('onnxruntime')
    def __ort_execute(self, io_binding: ort.IOBinding):
        # Only do the inference
        self.sess.run_with_iobinding(io_binding)
```
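A hypothetical usage sketch of such a wrapper; the file name and tensor names are illustrative, and the wrapper's `__init__` is assumed to create the underlying session as in the real implementation:

```python
import torch

# assuming an exported `end2end.onnx` whose input tensor is named 'input'
wrapper = ORTWrapper('end2end.onnx', device='cpu', output_names=['output'])
outputs = wrapper({'input': torch.rand(1, 3, 224, 224)})
print(outputs['output'].shape)
```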
4. Create a backend manager class that derives from `BaseBackendManager` and implement its `build_wrapper` method.
**Example:**
```Python
@BACKEND_MANAGERS.register('onnxruntime')
class ONNXRuntimeManager(BaseBackendManager):

    @classmethod
    def build_wrapper(cls,
                      backend_files: Sequence[str],
                      device: str = 'cpu',
                      input_names: Optional[Sequence[str]] = None,
                      output_names: Optional[Sequence[str]] = None,
                      deploy_cfg: Optional[Any] = None,
                      **kwargs):
        from .wrapper import ORTWrapper
        return ORTWrapper(
            onnx_file=backend_files[0],
            device=device,
            output_names=output_names)
```
5. Add docstring and unit tests for new code :).
## Support new backends using MMDeploy as a third party
Previous parts show how to add a new backend in MMDeploy, which requires changing its source code. However, if we treat MMDeploy as a third party, the methods above are no longer efficient. To this end, adding a new backend requires us to pre-install another package named `aenum`. We can install it directly through `pip install aenum`.
After installing `aenum` successfully, we can use it to add a new backend through:
```python
from mmdeploy.utils.constants import Backend
from aenum import extend_enum

try:
    Backend.get('backend_name')
except Exception:
    extend_enum(Backend, 'BACKEND', 'backend_name')
```
We can run the codes above before we use the rewrite logic of MMDeploy.
# How to support new models
We provide several tools to support model conversion.
## Function Rewriter
PyTorch neural networks are written in Python, which eases algorithm development, but the use of Python control flow and third-party libraries makes it difficult to export the network to an intermediate representation. We provide a 'monkey patch' tool to rewrite an unsupported function into another one that can be exported. Here is an example:
```python
from mmdeploy.core import FUNCTION_REWRITER


@FUNCTION_REWRITER.register_rewriter(
    func_name='torch.Tensor.repeat', backend='tensorrt')
def repeat_static(input, *size):
    ctx = FUNCTION_REWRITER.get_context()
    origin_func = ctx.origin_func
    if input.dim() == 1 and len(size) == 1:
        return origin_func(input.unsqueeze(0), *([1] + list(size))).squeeze(0)
    else:
        return origin_func(input, *size)
```
It is easy to use the function rewriter. Just add a decorator with arguments:
- `func_name` is the function to override. It can be either a PyTorch function or a custom function. Methods in modules can also be overridden by this tool.
- `backend` is the inference engine. The function will be overridden when the model is exported to this engine. If it is not given, this rewrite will be the default rewrite. The default rewrite will be used if the rewrite of the given backend does not exist.
The arguments are the same as those of the original function. The context `ctx`, obtained via `FUNCTION_REWRITER.get_context()`, provides some useful information such as the deployment config `ctx.cfg` and the original function (which has been overridden) `ctx.origin_func`.
## Module Rewriter
If you want to replace a whole module with another one, we have another rewriter as follows:
```python
import torch.nn as nn

from mmdeploy.core import MODULE_REWRITER


@MODULE_REWRITER.register_rewrite_module(
    'mmagic.models.backbones.sr_backbones.SRCNN', backend='tensorrt')
class SRCNNWrapper(nn.Module):

    def __init__(self,
                 module,
                 cfg,
                 channels=(3, 64, 32, 3),
                 kernel_sizes=(9, 1, 5),
                 upscale_factor=4):
        super(SRCNNWrapper, self).__init__()

        self._module = module

        module.img_upsampler = nn.Upsample(
            scale_factor=module.upscale_factor,
            mode='bilinear',
            align_corners=False)

    def forward(self, *args, **kwargs):
        """Run forward."""
        return self._module(*args, **kwargs)

    def init_weights(self, *args, **kwargs):
        """Initialize weights."""
        return self._module.init_weights(*args, **kwargs)
```
Just like function rewriter, add a decorator with arguments:
- `module_type` the module class to rewrite.
- `backend` is the inference engine. The function will be overridden when the model is exported to this engine. If it is not given, this rewrite will be the default rewrite. The default rewrite will be used if the rewrite of the given backend does not exist.
All instances of the module in the network will be replaced with instances of this new class. The original module and the deployment config will be passed as the first two arguments.
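A minimal sketch of how the replacement takes effect (assuming module rewrites are applied by patching the model before export; the `build_srcnn()` helper below is hypothetical):
```python
import torch

from mmdeploy.core import patch_model

# deploy_cfg and the model are assumed to be prepared elsewhere
deploy_cfg = dict(backend_config=dict(type='tensorrt'))
model = build_srcnn()  # hypothetical helper returning an SRCNN instance

# every SRCNN submodule is replaced with SRCNNWrapper(module, deploy_cfg, ...)
patched_model = patch_model(model, cfg=deploy_cfg, backend='tensorrt')
torch.onnx.export(patched_model, torch.rand(1, 3, 32, 32), 'srcnn.onnx')
```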
## Custom Symbolic
The mappings between PyTorch and ONNX are defined in PyTorch with symbolic functions. A custom symbolic function can help us bypass ONNX nodes that are unsupported by the inference engine.
```python
from torch.onnx import symbolic_helper as sym_help

from mmdeploy.core import SYMBOLIC_REWRITER


@SYMBOLIC_REWRITER.register_symbolic('squeeze', is_pytorch=True)
def squeeze_default(g, self, dim=None):
if dim is None:
dims = []
for i, size in enumerate(self.type().sizes()):
if size == 1:
dims.append(i)
else:
dims = [sym_help._get_const(dim, 'i', 'dim')]
return g.op('Squeeze', self, axes_i=dims)
```
The decorator arguments:
- `func_name` is the name of the function to add a symbolic function to. Use the full path if it is a custom `torch.autograd.Function`, or just the name if it is a PyTorch built-in function.
- `backend` is the inference engine. The symbolic function will be overridden when the model is exported to this engine. If it is not given, this rewrite will be the default rewrite. The default rewrite will be used if the rewrite of the given backend does not exist.
- `is_pytorch` is `True` if the function is a PyTorch built-in function.
- `arg_descriptors` are the descriptors of the symbolic function arguments. They will be fed to `torch.onnx.symbolic_helper._parse_arg`.
Just like the function rewriter, a context `ctx` is available to the rewrite. It provides useful information such as the deployment config `ctx.cfg` and the original function (which has been overridden) `ctx.origin_func`. Note that `ctx.origin_func` can be used only when `is_pytorch==False`.
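For a custom `torch.autograd.Function`, a hedged sketch of registering a symbolic with `is_pytorch=False` and `arg_descriptors` might look like this (the module path `my_project.ops.CustomRelu` is hypothetical):
```python
import torch

from mmdeploy.core import SYMBOLIC_REWRITER


class CustomRelu(torch.autograd.Function):
    """A toy custom Function, assumed to live at my_project.ops.CustomRelu."""

    @staticmethod
    def forward(ctx, x):
        return x.clamp(min=0)


@SYMBOLIC_REWRITER.register_symbolic(
    'my_project.ops.CustomRelu', is_pytorch=False, arg_descriptors=['v'])
def custom_relu__symbolic(g, x):
    # export the custom Function as a standard ONNX Relu node
    return g.op('Relu', x)
```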
# How to test rewritten models
After you create a rewritten model using our [rewriter](support_new_model.md), it's better to write a unit test for the model to validate if the model rewrite would come into effect. Generally, we need to get outputs of the original model and rewritten model, then compare them. The outputs of the original model can be acquired directly by calling the forward function of the model, whereas the way to generate the outputs of the rewritten model depends on the complexity of the rewritten model.
## Test rewritten model with small changes
If the changes to the model are small (e.g., only changing the behavior of one or two variables without introducing side effects), you can construct the input arguments for the rewritten functions/modules, run the model's inference in `RewriterContext` and check the results.
```python
# mmpretrain/models/classifiers/base.py
class BaseClassifier(BaseModule, metaclass=ABCMeta):
def forward(self, img, return_loss=True, **kwargs):
if return_loss:
return self.forward_train(img, **kwargs)
else:
return self.forward_test(img, **kwargs)
from mmdeploy.core import FUNCTION_REWRITER


# Custom rewritten function
@FUNCTION_REWRITER.register_rewriter(
'mmpretrain.models.classifiers.BaseClassifier.forward', backend='default')
def forward_of_base_classifier(self, img, *args, **kwargs):
"""Rewrite `forward` for default backend."""
return self.simple_test(img, {})
```
In the example, we only change the function that `forward` calls. We can test this rewritten function by writing the following test function:
```python
def test_baseclassifier_forward():
    import torch
    from mmdeploy.core import RewriterContext
    from mmpretrain.models.classifiers import BaseClassifier

    input = torch.rand(1)
class DummyClassifier(BaseClassifier):
def __init__(self, init_cfg=None):
super().__init__(init_cfg=init_cfg)
def extract_feat(self, imgs):
pass
def forward_train(self, imgs):
return 'train'
def simple_test(self, img, tmp, **kwargs):
return 'simple_test'
model = DummyClassifier().eval()
model_output = model(input)
with RewriterContext(cfg=dict()), torch.no_grad():
backend_output = model(input)
assert model_output == 'train'
assert backend_output == 'simple_test'
```
In this test function, we construct a derived class of `BaseClassifier` to test if the rewritten model would work in the rewrite context. We get outputs of the original model by directly calling `model(input)` and get the outputs of the rewritten model by calling `model(input)` in `RewriteContext`. Finally, we can check the outputs by asserting their value.
## Test rewritten model with big changes
In the first example, the output is generated in Python. Sometimes we may make big changes to original model functions (e.g., eliminate branch statements to generate correct computing graph). Even if the outputs of a rewritten model running in Python are correct, we cannot assure that the rewritten model can work as expected in the backend. Therefore, we need to test the rewritten model in the backend.
```python
import torch

from mmdeploy.core import FUNCTION_REWRITER
from mmdeploy.utils import is_dynamic_shape


# Custom rewritten function
@FUNCTION_REWRITER.register_rewriter(
    func_name='mmseg.models.segmentors.BaseSegmentor.forward')
def base_segmentor__forward(self, img, img_metas=None, **kwargs):
ctx = FUNCTION_REWRITER.get_context()
if img_metas is None:
img_metas = {}
assert isinstance(img_metas, dict)
assert isinstance(img, torch.Tensor)
deploy_cfg = ctx.cfg
is_dynamic_flag = is_dynamic_shape(deploy_cfg)
img_shape = img.shape[2:]
if not is_dynamic_flag:
img_shape = [int(val) for val in img_shape]
img_metas['img_shape'] = img_shape
return self.simple_test(img, img_metas, **kwargs)
```
The behavior of this rewritten function is complex. We should test it as follows:
```python
def test_basesegmentor_forward():
from mmdeploy.utils.test import (WrapModel, get_model_outputs,
get_rewrite_outputs)
segmentor = get_model()
segmentor.cpu().eval()
# Prepare data
# ...
# Get the outputs of original model
model_inputs = {
'img': [imgs],
'img_metas': [img_metas],
'return_loss': False
}
model_outputs = get_model_outputs(segmentor, 'forward', model_inputs)
# Get the outputs of rewritten model
    wrapped_model = WrapModel(segmentor, 'forward', img_metas=None, return_loss=False)
rewrite_inputs = {'img': imgs}
rewrite_outputs, is_backend_output = get_rewrite_outputs(
wrapped_model=wrapped_model,
model_inputs=rewrite_inputs,
deploy_cfg=deploy_cfg)
if is_backend_output:
# If the backend plugins have been installed, the rewrite outputs are
# generated by backend.
rewrite_outputs = torch.tensor(rewrite_outputs)
model_outputs = torch.tensor(model_outputs)
model_outputs = model_outputs.unsqueeze(0).unsqueeze(0)
assert torch.allclose(rewrite_outputs, model_outputs)
else:
# Otherwise, the outputs are generated by python.
assert rewrite_outputs is not None
```
We provide some utilities to test rewritten functions. At first, you can construct a model and call `get_model_outputs` to get outputs of the original model. Then you can wrap the rewritten function with `WrapModel`, which serves as a partial function, and get the results with `get_rewrite_outputs`. `get_rewrite_outputs` returns two values that indicate the content of outputs and whether the outputs come from the backend. Because we cannot assume that everyone has installed the backend, we should check if the results are generated by a Python or backend engine. The unit test must cover both conditions. Finally, we should compare the original and rewritten outputs, which may be done simply by calling `torch.allclose`.
## Note
To learn the complete usage of the test utilities, please refer to our apis document.
# Minimal makefile for Sphinx documentation
#
# You can set these variables from the command line.
SPHINXOPTS =
SPHINXBUILD = sphinx-build
SOURCEDIR = .
BUILDDIR = _build
# Put it first so that "make" without argument is like "make help".
help:
@$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
.PHONY: help Makefile
# Catch-all target: route all unknown targets to Sphinx using the new
# "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS).
%: Makefile
@$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
.header-logo {
background-image: url("../image/mmdeploy-logo.png");
background-size: 150px 60px;
height: 60px;
width: 150px;
}
apis
-------
.. automodule:: mmdeploy.apis
:members:
apis/tensorrt
-------------
.. automodule:: mmdeploy.apis.tensorrt
:members:
apis/onnxruntime
----------------
.. automodule:: mmdeploy.apis.onnxruntime
:members:
apis/ncnn
---------
.. automodule:: mmdeploy.apis.ncnn
:members:
apis/pplnn
----------
.. automodule:: mmdeploy.apis.pplnn
:members:
# Cross compile snpe inference server on Ubuntu 18
MMDeploy provides a prebuilt package. If you want to compile it yourself, or need to modify the `.proto` file, you can refer to this document.
Note that the official gRPC documentation does not fully cover building with the NDK.
## 1. Environment
| Item | Version | Remarks |
| ------------------ | -------------- | --------------------------------------------------------- |
| snpe | 1.59 | 1.60 uses clang-8.0, which may cause compatibility issues |
| host OS | ubuntu18.04 | snpe1.59 specified version |
| NDK | r17c | snpe1.59 specified version |
| gRPC | commit 6f698b5 | - |
| Hardware equipment | qcom888 | qcom chip required |
## 2. Cross compile gRPC with NDK
1. Pull gRPC repo, compile `protoc` and `grpc_cpp_plugin` on host
```bash
# Install dependencies
$ apt-get update && apt-get install -y libssl-dev
# Compile
$ git clone https://github.com/grpc/grpc --recursive=1 --depth=1
$ cd grpc
$ mkdir -p cmake/build
$ pushd cmake/build
$ cmake \
-DCMAKE_BUILD_TYPE=Release \
-DgRPC_INSTALL=ON \
-DgRPC_BUILD_TESTS=OFF \
-DgRPC_SSL_PROVIDER=package \
../..
# Install to host
$ make -j
$ sudo make install
```
2. Download the NDK and cross-compile the static libraries with android aarch64 format
```bash
$ wget https://dl.google.com/android/repository/android-ndk-r17c-linux-x86_64.zip
$ unzip android-ndk-r17c-linux-x86_64.zip
$ export ANDROID_NDK=/path/to/android-ndk-r17c
$ cd /path/to/grpc
$ mkdir -p cmake/build_aarch64 && pushd cmake/build_aarch64
$ cmake ../.. \
-DCMAKE_TOOLCHAIN_FILE=${ANDROID_NDK}/build/cmake/android.toolchain.cmake \
-DANDROID_ABI=arm64-v8a \
-DANDROID_PLATFORM=android-26 \
-DANDROID_TOOLCHAIN=clang \
-DANDROID_STL=c++_shared \
-DCMAKE_BUILD_TYPE=Release \
-DCMAKE_INSTALL_PREFIX=/tmp/android_grpc_install_shared
$ make -j
$ make install
```
3. At this point `/tmp/android_grpc_install_shared` should contain the complete installation files
```bash
$ cd /tmp/android_grpc_install_shared
$ tree -L 1
.
├── bin
├── include
├── lib
└── share
```
## 3. (Optional) Self-test whether NDK gRPC is available
1. Compile the helloworld that comes with gRPC
```bash
$ cd /path/to/grpc/examples/cpp/helloworld/
$ mkdir cmake/build_aarch64 -p && pushd cmake/build_aarch64
$ cmake ../.. \
-DCMAKE_TOOLCHAIN_FILE=${ANDROID_NDK}/build/cmake/android.toolchain.cmake \
-DANDROID_ABI=arm64-v8a \
-DANDROID_PLATFORM=android-26 \
-DANDROID_STL=c++_shared \
-DANDROID_TOOLCHAIN=clang \
-DCMAKE_BUILD_TYPE=Release \
-Dabsl_DIR=/tmp/android_grpc_install_shared/lib/cmake/absl \
-DProtobuf_DIR=/tmp/android_grpc_install_shared/lib/cmake/protobuf \
-DgRPC_DIR=/tmp/android_grpc_install_shared/lib/cmake/grpc
$ make -j
$ ls greeter*
greeter_async_client greeter_async_server greeter_callback_server greeter_server
greeter_async_client2 greeter_callback_client greeter_client
```
2. Enable developer mode and USB debugging on your phone, then push the binaries to `/data/local/tmp`
```bash
$ adb push greeter* /data/local/tmp
```
3. `adb shell` into the phone, execute client/server
```bash
/data/local/tmp $ ./greeter_client
Greeter received: Hello world
```
## 4. Cross compile snpe inference server
1. Open the [snpe tools website](https://developer.qualcomm.com/software/qualcomm-neural-processing-sdk/tools) and download version 1.59. Unzip and set environment variables
> Note that snpe >= 1.60 starts using `clang-8.0`, which may cause incompatibility with `libc++_shared.so` on older devices.
```bash
$ export SNPE_ROOT=/path/to/snpe-1.59.0.3230
```
2. Open the snpe server directory in mmdeploy and use the same options as when cross-compiling gRPC
```bash
$ cd /path/to/mmdeploy
$ cd service/snpe/server
$ mkdir -p build && cd build
$ export ANDROID_NDK=/path/to/android-ndk-r17c
$ cmake .. \
-DCMAKE_TOOLCHAIN_FILE=${ANDROID_NDK}/build/cmake/android.toolchain.cmake \
-DANDROID_ABI=arm64-v8a \
-DANDROID_PLATFORM=android-26 \
-DANDROID_STL=c++_shared \
-DANDROID_TOOLCHAIN=clang \
-DCMAKE_BUILD_TYPE=Release \
-Dabsl_DIR=/tmp/android_grpc_install_shared/lib/cmake/absl \
-DProtobuf_DIR=/tmp/android_grpc_install_shared/lib/cmake/protobuf \
-DgRPC_DIR=/tmp/android_grpc_install_shared/lib/cmake/grpc
$ make -j
$ file inference_server
inference_server: ELF 64-bit LSB shared object, ARM aarch64, version 1 (SYSV), dynamically linked, interpreter /system/bin/linker64, BuildID[sha1]=252aa04e2b982681603dacb74b571be2851176d2, with debug_info, not stripped
```
Finally, you will get `inference_server`; `adb push` it to the device and execute it.
## 5. Regenerate the proto interface
If you have changed `inference.proto`, you need to regenerate the .cpp and .py interfaces
```Shell
$ python3 -m pip install grpcio-tools --user
$ python3 -m grpc_tools.protoc -I./ --python_out=./client/ --grpc_python_out=./client/ inference.proto
$ ln -s `which protoc-gen-grpc`
$ protoc --cpp_out=./ --grpc_out=./ --plugin=protoc-gen-grpc=grpc_cpp_plugin inference.proto
```
## Reference
- snpe tutorial https://developer.qualcomm.com/sites/default/files/docs/snpe/cplus_plus_tutorial.html
- gRPC cross build script https://raw.githubusercontent.com/grpc/grpc/master/test/distrib/cpp/run_distrib_test_cmake_aarch64_cross.sh
- stackoverflow https://stackoverflow.com/questions/54052229/build-grpc-c-for-android-using-ndk-arm-linux-androideabi-clang-compiler
#
# Configuration file for the Sphinx documentation builder.
#
# This file does only contain a selection of the most common options. For a
# full list see the documentation:
# http://www.sphinx-doc.org/en/master/config
# -- Path setup --------------------------------------------------------------
# If extensions (or modules to document with autodoc) are in another directory,
# add these directories to sys.path here. If the directory is relative to the
# documentation root, use os.path.abspath to make it absolute, like shown here.
#
import os
import subprocess
import sys
import pytorch_sphinx_theme
from m2r import MdInclude
from recommonmark.transform import AutoStructify
from sphinx.builders.html import StandaloneHTMLBuilder
sys.path.insert(0, os.path.abspath('../..'))
version_file = '../../mmdeploy/version.py'
with open(version_file, 'r') as f:
exec(compile(f.read(), version_file, 'exec'))
__version__ = locals()['__version__']
# -- Project information -----------------------------------------------------
project = 'mmdeploy'
copyright = '2021-2024, OpenMMLab'
author = 'MMDeploy Authors'
# The short X.Y version
version = __version__
# The full version, including alpha/beta/rc tags
release = __version__
# -- General configuration ---------------------------------------------------
# If your documentation needs a minimal Sphinx version, state it here.
#
# needs_sphinx = '1.0'
# Add any Sphinx extension module names here, as strings. They can be
# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
# ones.
extensions = [
'breathe',
'sphinx.ext.autodoc',
'sphinx.ext.napoleon',
'sphinx.ext.viewcode',
'sphinx.ext.autosectionlabel',
'sphinx_markdown_tables',
'myst_parser',
'sphinx_copybutton',
'sphinxcontrib.mermaid'
] # yapf: disable
breathe_default_project = 'mmdeployapi'
breathe_projects = {'mmdeployapi': '../cppapi/docs/xml'}
def generate_doxygen_xml(app):
try:
folder = '../cppapi'
retcode = subprocess.call('cd %s; doxygen' % folder, shell=True)
if retcode < 0:
sys.stderr.write('doxygen terminated by signal %s' % (-retcode))
except Exception as e:
sys.stderr.write('doxygen execution failed: %s' % e)
autodoc_mock_imports = ['tensorrt']
autosectionlabel_prefix_document = True
# Add any paths that contain templates here, relative to this directory.
templates_path = ['_templates']
# The suffix(es) of source filenames.
# You can specify multiple suffix as a list of string:
#
source_suffix = {
'.rst': 'restructuredtext',
'.md': 'markdown',
}
# The master toctree document.
master_doc = 'index'
# The language for content autogenerated by Sphinx. Refer to documentation
# for a list of supported languages.
#
# This is also used if you do content translation via gettext catalogs.
# Usually you set "language" from the command line for these cases.
language = 'en'
# List of patterns, relative to source directory, that match files and
# directories to ignore when looking for source files.
# This pattern also affects html_static_path and html_extra_path.
exclude_patterns = ['_build', 'Thumbs.db', '.DS_Store']
# The name of the Pygments (syntax highlighting) style to use.
pygments_style = 'sphinx'
# -- Options for HTML output -------------------------------------------------
# The theme to use for HTML and HTML Help pages. See the documentation for
# a list of builtin themes.
#
# html_theme = 'sphinx_rtd_theme'
html_theme = 'pytorch_sphinx_theme'
html_theme_path = [pytorch_sphinx_theme.get_html_theme_path()]
# Theme options are theme-specific and customize the look and feel of a theme
# further. For a list of options available for each theme, see the
# documentation.
#
html_theme_options = {
'logo_url': 'https://mmdeploy.readthedocs.io/en/latest/',
'menu': [{
'name': 'GitHub',
'url': 'https://github.com/open-mmlab/mmdeploy'
}],
'menu_lang': 'en'
}
# Add any paths that contain custom static files (such as style sheets) here,
# relative to this directory. They are copied after the builtin static files,
# so a file named "default.css" will overwrite the builtin "default.css".
html_static_path = ['_static']
html_css_files = ['css/readthedocs.css']
# Enable ::: for my_st
myst_enable_extensions = ['colon_fence']
myst_heading_anchors = 5
# Custom sidebar templates, must be a dictionary that maps document names
# to template names.
#
# The default sidebars (for documents that don't match any pattern) are
# defined by theme itself. Builtin themes are using these templates by
# default: ``['localtoc.html', 'relations.html', 'sourcelink.html',
# 'searchbox.html']``.
#
# html_sidebars = {}
# -- Options for HTMLHelp output ---------------------------------------------
# Output file base name for HTML help builder.
htmlhelp_basename = 'mmdeploydoc'
# -- Options for LaTeX output ------------------------------------------------
latex_elements = {
# The paper size ('letterpaper' or 'a4paper').
#
# 'papersize': 'letterpaper',
# The font size ('10pt', '11pt' or '12pt').
#
# 'pointsize': '10pt',
# Additional stuff for the LaTeX preamble.
#
# 'preamble': '',
# Latex figure (float) alignment
#
# 'figure_align': 'htbp',
}
# Grouping the document tree into LaTeX files. List of tuples
# (source start file, target name, title,
# author, documentclass [howto, manual, or own class]).
latex_documents = [
(master_doc, 'mmdeploy.tex', 'mmdeploy Documentation',
'MMDeploy Contributors', 'manual'),
]
# -- Options for manual page output ------------------------------------------
# One entry per manual page. List of tuples
# (source start file, name, description, authors, manual section).
man_pages = [(master_doc, 'mmdeploy', 'mmdeploy Documentation', [author], 1)]
# -- Options for Texinfo output ----------------------------------------------
# Grouping the document tree into Texinfo files. List of tuples
# (source start file, target name, title, author,
# dir menu entry, description, category)
texinfo_documents = [
(master_doc, 'mmdeploy', 'mmdeploy Documentation', author, 'mmdeploy',
'One line description of project.', 'Miscellaneous'),
]
# -- Options for Epub output -------------------------------------------------
# Bibliographic Dublin Core info.
epub_title = project
# The unique identifier of the text. This can be a ISBN number
# or the project homepage.
#
# epub_identifier = ''
# A unique identification for the text.
#
# epub_uid = ''
# A list of files that should not be packed into the epub file.
epub_exclude_files = ['search.html']
# set priority when building html
StandaloneHTMLBuilder.supported_image_types = [
'image/svg+xml', 'image/gif', 'image/png', 'image/jpeg'
]
# -- Extension configuration -------------------------------------------------
# Ignore >>> when copying code
copybutton_prompt_text = r'>>> |\.\.\. '
copybutton_prompt_is_regexp = True
def setup(app):
# Add hook for building doxygen xml when needed
app.connect('builder-inited', generate_doxygen_xml)
app.add_config_value('no_underscore_emphasis', False, 'env')
app.add_config_value('m2r_parse_relative_links', False, 'env')
app.add_config_value('m2r_anonymous_references', False, 'env')
app.add_config_value('m2r_disable_inline_math', False, 'env')
app.add_directive('mdinclude', MdInclude)
app.add_config_value('recommonmark_config', {
'auto_toc_tree_section': 'Contents',
'enable_eval_rst': True,
}, True)
app.add_transform(AutoStructify)
# ONNX export Optimizer
This is a tool to optimize ONNX model when exporting from PyTorch.
## Installation
Build MMDeploy with `torchscript` support:
```shell
export Torch_DIR=$(python -c "import torch;print(torch.utils.cmake_prefix_path + '/Torch')")
cmake \
-DTorch_DIR=${Torch_DIR} \
-DMMDEPLOY_TARGET_BACKENDS="${your_backend};torchscript" \
.. # You can also add other build flags if you need
cmake --build . -- -j$(nproc) && cmake --install .
```
## Usage
```python
# importing model_to_graph__custom_optimizer hijacks torch.onnx.export
# so that custom ONNX passes can run during export
import torch

from mmdeploy.apis.onnx.optimizer import model_to_graph__custom_optimizer  # noqa
from mmdeploy.apis.onnx.passes import optimize_onnx
from mmdeploy.core import RewriterContext
# load your model here
model = create_model()
# export with ONNX Optimizer
x = create_dummy_input()
with RewriterContext({}, onnx_custom_passes=optimize_onnx):
torch.onnx.export(model, x, output_path)
```
The model will be optimized during the export.
You can also define your own optimizer:
```python
# create the optimize callback
def _optimize_onnx(graph, params_dict, torch_out):
from mmdeploy.backend.torchscript import ts_optimizer
ts_optimizer.onnx._jit_pass_onnx_peephole(graph)
return graph, params_dict, torch_out
with RewriterContext({}, onnx_custom_passes=_optimize_onnx):
    # export your model as in the example above
    torch.onnx.export(model, x, output_path)
```
## Frequently Asked Questions
### TensorRT
- "WARNING: Half2 support requested on hardware without native FP16 support, performance will be negatively affected."
Fp16 mode requires a device with full-rate fp16 support.
- "error: parameter check failed at: engine.cpp::setBindingDimensions::1046, condition: profileMinDims.d[i] \<= dimensions.d[i]"
When building an `ICudaEngine` from an `INetworkDefinition` that has dynamically resizable inputs, users need to specify at least one optimization profile, which can be set in the deploy config:
```python
backend_config = dict(
common_config=dict(max_workspace_size=1 << 30),
model_inputs=[
dict(
input_shapes=dict(
input=dict(
min_shape=[1, 3, 320, 320],
opt_shape=[1, 3, 800, 1344],
max_shape=[1, 3, 1344, 1344])))
])
```
The input tensor shape should be limited between `min_shape` and `max_shape`.
- "error: [TensorRT] INTERNAL ERROR: Assertion failed: cublasStatus == CUBLAS_STATUS_SUCCESS"
TRT 7.2.1 switches to use cuBLASLt (previously it was cuBLAS). cuBLASLt is the default choice for SM version >= 7.0. You may need CUDA-10.2 Patch 1 (released Aug 26, 2020) to resolve some cuBLASLt issues. Another option is to use the new TacticSource API and disable cuBLASLt tactics if you don't want to upgrade.
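If you build the engine yourself, the snippet below is a minimal sketch (using the TensorRT Python API directly, not an MMDeploy option) of how cuBLASLt tactics might be disabled via the TacticSource API:
```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
config = builder.create_builder_config()
# enable only cuBLAS and cuDNN tactics; cuBLASLt is left out of the bit mask
tactic_sources = (1 << int(trt.TacticSource.CUBLAS)) | \
                 (1 << int(trt.TacticSource.CUDNN))
config.set_tactic_sources(tactic_sources)
```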
### Libtorch
- Error: `libtorch/share/cmake/Caffe2/Caffe2Config.cmake:96 (message):Your installed Caffe2 version uses cuDNN but I cannot find the cuDNN libraries. Please set the proper cuDNN prefixes and / or install cuDNN.`
You may `export CUDNN_ROOT=/root/path/to/cudnn` to resolve the build error.
### Windows
- Error: similar like this `OSError: [WinError 1455] The paging file is too small for this operation to complete. Error loading "C:\Users\cx\miniconda3\lib\site-packages\torch\lib\cudnn_cnn_infer64_8.dll" or one of its dependencies`
Solution: according to this [post](https://stackoverflow.com/questions/64837376/how-to-efficiently-run-multiple-pytorch-processes-models-at-once-traceback), the issue is on NVIDIA's side and is expected to be fixed in *CUDA release 11.7*. For now, you can use the [fixNvPe.py](https://gist.github.com/cobryan05/7d1fe28dd370e110a372c4d268dcb2e5) script to modify the NVIDIA DLLs in the PyTorch lib directory.
`python fixNvPe.py --input=C:\Users\user\AppData\Local\Programs\Python\Python38\lib\site-packages\torch\lib\*.dll`
You can find your pytorch installation path with:
```python
import torch
print(torch.__file__)
```
- enable_language(CUDA) error
```
-- Selecting Windows SDK version 10.0.19041.0 to target Windows 10.0.19044.
-- Found CUDA: C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v11.1 (found version "11.1")
CMake Error at C:/Software/cmake/cmake-3.23.1-windows-x86_64/share/cmake-3.23/Modules/CMakeDetermineCompilerId.cmake:491 (message):
No CUDA toolset found.
Call Stack (most recent call first):
C:/Software/cmake/cmake-3.23.1-windows-x86_64/share/cmake-3.23/Modules/CMakeDetermineCompilerId.cmake:6 (CMAKE_DETERMINE_COMPILER_ID_BUILD)
C:/Software/cmake/cmake-3.23.1-windows-x86_64/share/cmake-3.23/Modules/CMakeDetermineCompilerId.cmake:59 (__determine_compiler_id_test)
C:/Software/cmake/cmake-3.23.1-windows-x86_64/share/cmake-3.23/Modules/CMakeDetermineCUDACompiler.cmake:339 (CMAKE_DETERMINE_COMPILER_ID)
C:/workspace/mmdeploy-0.6.0-windows-amd64-cuda11.1-tensorrt8.2.3.0/sdk/lib/cmake/MMDeploy/MMDeployConfig.cmake:27 (enable_language)
CMakeLists.txt:5 (find_package)
```
**Cause:** CUDA Toolkit 11.1 was installed before Visual Studio, so the VS plugin was not installed. Or the version of VS is too new, so the installation of the VS plugin was skipped during the installation of the CUDA Toolkit.
**Solution:** This problem can be solved by manually copying the four files in `C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.1\extras\visual_studio_integration\MSBuildExtensions` to `C:\Software\Microsoft Visual Studio\2022\Community\Msbuild\Microsoft\VC\v170\BuildCustomizations`. The specific paths should be adjusted to your actual installation.
### ONNX Runtime
- On Windows, visualizing the model inference result fails with the following error:
```
onnxruntime.capi.onnxruntime_pybind11_state.Fail: [ONNXRuntimeError] : 1 : FAIL : Failed to load library, error code: 193
```
**Cause:** In recent Windows systems, there are two `onnxruntime.dll` files under the system path, and they are loaded first, causing a conflict.
```
C:\Windows\SysWOW64\onnxruntime.dll
C:\Windows\System32\onnxruntime.dll
```
**Solution:** Choose one of the following two options
1. Copy the DLL from the lib directory of the downloaded onnxruntime to the directory where `mmdeploy_onnxruntime_ops.dll` is located (it is recommended to use Everything to locate the ops DLL)
2. Rename the two DLLs in the system path so that they cannot be loaded.
### Pip
- pip installed a package but you could not `import` it.
Make sure you are using the pip of your conda environment.
```bash
$ which pip
# /path/to/.local/bin/pip
/path/to/miniconda3/lib/python3.9/site-packages/pip
```
# Get Started
MMDeploy provides useful tools for deploying OpenMMLab models to various platforms and devices.
With their help, you can not only deploy models using our pre-defined pipelines but also customize your own deployment pipeline.
## Introduction
In MMDeploy, the deployment pipeline can be illustrated as a sequence of modules, i.e., Model Converter, MMDeploy Model and Inference SDK.
![deploy-pipeline](https://user-images.githubusercontent.com/4560679/172306700-31b4c922-2f04-42ed-a1d6-c360f2f3048c.png)
### Model Converter
Model Converter aims at converting training models from OpenMMLab into backend models that can be run on target devices.
It is able to transform a PyTorch model into an IR model, i.e., ONNX or TorchScript, as well as convert an IR model into a backend model. By combining them, we can achieve one-click **end-to-end** model deployment.
### MMDeploy Model
MMDeploy Model is the result package exported by Model Converter.
Besides the backend models, it also includes the model meta info, which will be used by the Inference SDK.
### Inference SDK
The Inference SDK is developed in C/C++, wrapping the preprocessing, model forward and postprocessing modules of model inference.
It provides FFIs for languages such as C, C++, Python, C#, Java and so on.
## Prerequisites
In order to do an end-to-end model deployment, MMDeploy requires Python 3.6+ and PyTorch 1.8+.
**Step 0.** Download and install Miniconda from the [official website](https://docs.conda.io/en/latest/miniconda.html).
**Step 1.** Create a conda environment and activate it.
```shell
conda create --name mmdeploy python=3.8 -y
conda activate mmdeploy
```
**Step 2.** Install PyTorch following [official instructions](https://pytorch.org/get-started/locally/), e.g.
On GPU platforms:
```shell
conda install pytorch=={pytorch_version} torchvision=={torchvision_version} cudatoolkit={cudatoolkit_version} -c pytorch -c conda-forge
```
On CPU platforms:
```shell
conda install pytorch=={pytorch_version} torchvision=={torchvision_version} cpuonly -c pytorch
```
```{note}
On GPU platforms, please ensure that {cudatoolkit_version} matches your host CUDA toolkit version. Otherwise, it will probably cause conflicts when deploying models with TensorRT.
```
## Installation
We recommend that users follow our best practices to install MMDeploy.
**Step 0.** Install [MMCV](https://github.com/open-mmlab/mmcv).
```shell
pip install -U openmim
mim install mmengine
mim install "mmcv>=2.0.0rc2"
```
**Step 1.** Install MMDeploy and inference engine
We recommend using the MMDeploy precompiled package as our best practice. Currently, the model converter and SDK inference are provided as PyPI packages, and the SDK C/C++ library is provided [here](https://github.com/open-mmlab/mmdeploy/releases). You can download them according to your target platform and device.
The supported platform and device matrix is presented as follows:
<table>
<thead>
<tr>
<th>OS-Arch</th>
<th>Device</th>
<th>ONNX Runtime</th>
<th>TensorRT</th>
</tr>
</thead>
<tbody>
<tr>
<td rowspan="2">Linux-x86_64</td>
<td>CPU</td>
<td>Y</td>
<td>N/A</td>
</tr>
<tr>
<td>CUDA</td>
<td>Y</td>
<td>Y</td>
</tr>
<tr>
<td rowspan="2">Windows-x86_64</td>
<td>CPU</td>
<td>Y</td>
<td>N/A</td>
</tr>
<tr>
<td>CUDA</td>
<td>Y</td>
<td>Y</td>
</tr>
</tbody>
</table>
**Note: if the MMDeploy prebuilt package doesn't cover your target platform or device, please [build MMDeploy from source](01-how-to-build/build_from_source.md)**
Taking the latest precompiled package as an example, you can install it as follows:
<details open>
<summary><b>Linux-x86_64</b></summary>
```shell
# 1. install MMDeploy model converter
pip install mmdeploy==1.3.1
# 2. install MMDeploy sdk inference
# install one of the following according to whether you need GPU inference
# 2.1 support onnxruntime
pip install mmdeploy-runtime==1.3.1
# 2.2 support onnxruntime-gpu, tensorrt
pip install mmdeploy-runtime-gpu==1.3.1
# 3. install inference engine
# 3.1 install TensorRT
# !!! If you want to convert a tensorrt model or inference with tensorrt,
# download TensorRT-8.2.3.0 CUDA 11.x tar package from NVIDIA, and extract it to the current directory
pip install TensorRT-8.2.3.0/python/tensorrt-8.2.3.0-cp38-none-linux_x86_64.whl
pip install pycuda
export TENSORRT_DIR=$(pwd)/TensorRT-8.2.3.0
export LD_LIBRARY_PATH=${TENSORRT_DIR}/lib:$LD_LIBRARY_PATH
# !!! Moreover, download cuDNN 8.2.1 CUDA 11.x tar package from NVIDIA, and extract it to the current directory
export CUDNN_DIR=$(pwd)/cuda
export LD_LIBRARY_PATH=$CUDNN_DIR/lib64:$LD_LIBRARY_PATH
# 3.2 install ONNX Runtime
# install one of the following according to whether you need GPU inference
# 3.2.1 onnxruntime
wget https://github.com/microsoft/onnxruntime/releases/download/v1.8.1/onnxruntime-linux-x64-1.8.1.tgz
tar -zxvf onnxruntime-linux-x64-1.8.1.tgz
export ONNXRUNTIME_DIR=$(pwd)/onnxruntime-linux-x64-1.8.1
export LD_LIBRARY_PATH=$ONNXRUNTIME_DIR/lib:$LD_LIBRARY_PATH
# 3.2.2 onnxruntime-gpu
pip install onnxruntime-gpu==1.8.1
wget https://github.com/microsoft/onnxruntime/releases/download/v1.8.1/onnxruntime-linux-x64-gpu-1.8.1.tgz
tar -zxvf onnxruntime-linux-x64-gpu-1.8.1.tgz
export ONNXRUNTIME_DIR=$(pwd)/onnxruntime-linux-x64-gpu-1.8.1
export LD_LIBRARY_PATH=$ONNXRUNTIME_DIR/lib:$LD_LIBRARY_PATH
```
</details>
<details open>
<summary><b>Windows-x86_64</b></summary>
</details>
Please refer to [this](02-how-to-run/prebuilt_package_windows.md) guide to learn how to use the prebuilt package on Windows.
## Convert Model
After the installation, you can start your model deployment journey by converting a PyTorch model to a backend model with `tools/deploy.py`.
Based on the above settings, we provide an example to convert the Faster R-CNN in [MMDetection](https://github.com/open-mmlab/mmdetection) to TensorRT as below:
```shell
# clone mmdeploy to get the deployment config. `--recursive` is not necessary
git clone -b main https://github.com/open-mmlab/mmdeploy.git
# clone mmdetection repo. We have to use the config file to build PyTorch nn module
git clone -b 3.x https://github.com/open-mmlab/mmdetection.git
cd mmdetection
mim install -v -e .
cd ..
# download Faster R-CNN checkpoint
wget -P checkpoints https://download.openmmlab.com/mmdetection/v2.0/faster_rcnn/faster_rcnn_r50_fpn_1x_coco/faster_rcnn_r50_fpn_1x_coco_20200130-047c8118.pth
# run the command to start model conversion
python mmdeploy/tools/deploy.py \
mmdeploy/configs/mmdet/detection/detection_tensorrt_dynamic-320x320-1344x1344.py \
mmdetection/configs/faster_rcnn/faster-rcnn_r50_fpn_1x_coco.py \
checkpoints/faster_rcnn_r50_fpn_1x_coco_20200130-047c8118.pth \
mmdetection/demo/demo.jpg \
--work-dir mmdeploy_model/faster-rcnn \
--device cuda \
--dump-info
```
The converted model and its meta info can be found in the path specified by `--work-dir`.
Together they make up the MMDeploy Model that can be fed to the MMDeploy SDK for model inference.
For more details about model conversion, you can read [how_to_convert_model](02-how-to-run/convert_model.md). If you want to customize the conversion pipeline, you can edit the config file by following [this](02-how-to-run/write_config.md) tutorial.
```{tip}
You can convert the above model to an ONNX model and perform ONNX Runtime inference
just by changing 'detection_tensorrt_dynamic-320x320-1344x1344.py' to 'detection_onnxruntime_dynamic.py' and setting '--device' to 'cpu'.
```
## Inference Model
After model conversion, we can perform inference not only by Model Converter but also by Inference SDK.
### Inference by Model Converter
Model Converter provides a unified API named `inference_model` to do the job, making the APIs of all inference backends transparent to users.
Take the previously converted Faster R-CNN TensorRT model as an example:
```python
from mmdeploy.apis import inference_model
result = inference_model(
model_cfg='mmdetection/configs/faster_rcnn/faster_rcnn_r50_fpn_1x_coco.py',
deploy_cfg='mmdeploy/configs/mmdet/detection/detection_tensorrt_dynamic-320x320-1344x1344.py',
backend_files=['mmdeploy_model/faster-rcnn/end2end.engine'],
img='mmdetection/demo/demo.jpg',
device='cuda:0')
```
```{note}
'backend_files' in this API refers to backend engine file path, which MUST be put in a list, since some inference engines like OpenVINO and ncnn separate the network structure and its weights into two files.
```
### Inference by SDK
You can directly run MMDeploy demo programs in the precompiled package to get inference results.
```shell
wget https://github.com/open-mmlab/mmdeploy/releases/download/v1.3.1/mmdeploy-1.3.1-linux-x86_64-cuda11.8.tar.gz
tar xf mmdeploy-1.3.1-linux-x86_64-cuda11.8.tar.gz
cd mmdeploy-1.3.1-linux-x86_64-cuda11.8
# run python demo
python example/python/object_detection.py cuda ../mmdeploy_model/faster-rcnn ../mmdetection/demo/demo.jpg
# run C/C++ demo
# build the demo according to the README.md in the folder.
./bin/object_detection cuda ../mmdeploy_model/faster-rcnn ../mmdetection/demo/demo.jpg
```
```{note}
In the above commands, the input model is the SDK Model path. It is NOT the engine file path but the path passed to --work-dir, which contains not only the engine files but also meta information such as 'deploy.json' and 'pipeline.json'.
```
In the next section, we provide examples of deploying the converted Faster R-CNN model above with the SDK's different FFIs (Foreign Function Interfaces).
#### Python API
```python
from mmdeploy_runtime import Detector
import cv2
img = cv2.imread('mmdetection/demo/demo.jpg')
# create a detector
detector = Detector(model_path='mmdeploy_model/faster-rcnn', device_name='cuda', device_id=0)
# run the inference
bboxes, labels, _ = detector(img)
# Filter the result according to threshold
indices = [i for i in range(len(bboxes))]
for index, bbox, label_id in zip(indices, bboxes, labels):
[left, top, right, bottom], score = bbox[0:4].astype(int), bbox[4]
if score < 0.3:
continue
cv2.rectangle(img, (left, top), (right, bottom), (0, 255, 0))
cv2.imwrite('output_detection.png', img)
```
You can find more examples from [here](https://github.com/open-mmlab/mmdeploy/tree/main/demo/python).
#### C++ API
Using the SDK C++ API should follow the pattern below:
![image](https://user-images.githubusercontent.com/4560679/182554739-7fff57fc-5c84-44ed-b139-4749fae27404.png)
Now let's apply this procedure to the above Faster R-CNN model.
```C++
#include <cstdlib>
#include <opencv2/opencv.hpp>
#include "mmdeploy/detector.hpp"
int main() {
const char* device_name = "cuda";
int device_id = 0;
std::string model_path = "mmdeploy_model/faster-rcnn";
std::string image_path = "mmdetection/demo/demo.jpg";
// 1. load model
mmdeploy::Model model(model_path);
// 2. create predictor
mmdeploy::Detector detector(model, mmdeploy::Device{device_name, device_id});
// 3. read image
cv::Mat img = cv::imread(image_path);
// 4. inference
auto dets = detector.Apply(img);
// 5. deal with the result. Here we choose to visualize it
for (int i = 0; i < dets.size(); ++i) {
const auto& box = dets[i].bbox;
fprintf(stdout, "box %d, left=%.2f, top=%.2f, right=%.2f, bottom=%.2f, label=%d, score=%.4f\n",
i, box.left, box.top, box.right, box.bottom, dets[i].label_id, dets[i].score);
if (dets[i].score < 0.3) {
continue;
}
cv::rectangle(img, cv::Point{(int)box.left, (int)box.top},
cv::Point{(int)box.right, (int)box.bottom}, cv::Scalar{0, 255, 0});
}
cv::imwrite("output_detection.png", img);
return 0;
}
```
When you build this example, add the MMDeploy package to your CMake project as follows. Then pass `-DMMDeploy_DIR` to cmake, indicating the path where `MMDeployConfig.cmake` is located. You can find it in the prebuilt package.
```cmake
find_package(MMDeploy REQUIRED)
target_link_libraries(${name} PRIVATE mmdeploy ${OpenCV_LIBS})
```
For more SDK C++ API usages, please read these [samples](https://github.com/open-mmlab/mmdeploy/tree/main/demo/csrc/cpp).
For the remaining C, C# and Java API usages, please read [C demos](https://github.com/open-mmlab/mmdeploy/tree/main/demo/csrc/c), [C# demos](https://github.com/open-mmlab/mmdeploy/tree/main/demo/csharp) and [Java demos](https://github.com/open-mmlab/mmdeploy/tree/main/demo/java) respectively.
We'll talk about them more in our next release.
#### Accelerate preprocessing (Experimental)
If you want to fuse preprocessing for acceleration, please refer to this [doc](./02-how-to-run/fuse_transform.md).
## Evaluate Model
You can test the performance of the deployed model using `tools/test.py`. For example:
```shell
python ${MMDEPLOY_DIR}/tools/test.py \
${MMDEPLOY_DIR}/configs/mmdet/detection/detection_tensorrt_dynamic-320x320-1344x1344.py \
${MMDET_DIR}/configs/faster_rcnn/faster_rcnn_r50_fpn_1x_coco.py \
--model ${BACKEND_MODEL_FILES} \
--metrics ${METRICS} \
--device cuda:0
```
```{note}
Regarding the --model option: it is the path to the converted engine files when using Model Converter for a performance test; when testing metrics with the Inference SDK, it refers to the directory of the MMDeploy Model.
```
You can read [how to evaluate a model](02-how-to-run/profile_model.md) for more details.
Welcome to MMDeploy's documentation!
====================================
You can switch between Chinese and English documents in the lower-left corner of the layout.
.. toctree::
:maxdepth: 2
:caption: Get Started
get_started.md
.. toctree::
:maxdepth: 1
:caption: Build
01-how-to-build/build_from_source.md
01-how-to-build/build_from_docker.md
01-how-to-build/build_from_script.md
01-how-to-build/cmake_option.md
.. toctree::
:maxdepth: 1
:caption: Run & Test
02-how-to-run/convert_model.md
02-how-to-run/write_config.md
02-how-to-run/profile_model.md
02-how-to-run/quantize_model.md
02-how-to-run/useful_tools.md
.. toctree::
:maxdepth: 1
:caption: SDK Usage
sdk_usage/index.rst
.. toctree::
:maxdepth: 1
:caption: Benchmark
03-benchmark/supported_models.md
03-benchmark/benchmark.md
03-benchmark/benchmark_edge.md
03-benchmark/benchmark_tvm.md
03-benchmark/quantization.md
.. toctree::
:maxdepth: 1
:caption: OpenMMLab Codebase Support
04-supported-codebases/mmpretrain.md
04-supported-codebases/mmdet.md
04-supported-codebases/mmseg.md
04-supported-codebases/mmagic.md
04-supported-codebases/mmocr.md
04-supported-codebases/mmpose.md
04-supported-codebases/mmdet3d.md
04-supported-codebases/mmrotate.md
04-supported-codebases/mmaction2.md
.. toctree::
:maxdepth: 1
:caption: Backend Support
05-supported-backends/ncnn.md
05-supported-backends/onnxruntime.md
05-supported-backends/openvino.md
05-supported-backends/pplnn.md
05-supported-backends/snpe.md
05-supported-backends/tensorrt.md
05-supported-backends/torchscript.md
05-supported-backends/rknn.md
05-supported-backends/tvm.md
05-supported-backends/coreml.md
.. toctree::
:maxdepth: 1
:caption: Custom Ops
06-custom-ops/onnxruntime.md
06-custom-ops/tensorrt.md
06-custom-ops/ncnn.md
.. toctree::
:maxdepth: 1
:caption: Developer Guide
07-developer-guide/architecture.md
07-developer-guide/support_new_model.md
07-developer-guide/support_new_backend.md
07-developer-guide/add_backend_ops_unittest.md
07-developer-guide/test_rewritten_models.md
07-developer-guide/partition_model.md
07-developer-guide/regression_test.md
.. toctree::
:maxdepth: 1
:caption: Experimental feature
experimental/onnx_optimizer.md
.. toctree::
:maxdepth: 1
:caption: Appendix
appendix/cross_build_snpe_service.md
.. toctree::
:maxdepth: 1
:caption: FAQ
faq.md
.. toctree::
:caption: Switch Language
switch_language.md
.. toctree::
:maxdepth: 1
:caption: API Reference
api.rst
Indices and tables
==================
* :ref:`genindex`
* :ref:`search`
@ECHO OFF
pushd %~dp0
REM Command file for Sphinx documentation
if "%SPHINXBUILD%" == "" (
set SPHINXBUILD=sphinx-build
)
set SOURCEDIR=.
set BUILDDIR=_build
if "%1" == "" goto help
%SPHINXBUILD% >NUL 2>NUL
if errorlevel 9009 (
echo.
echo.The 'sphinx-build' command was not found. Make sure you have Sphinx
echo.installed, then set the SPHINXBUILD environment variable to point
echo.to the full path of the 'sphinx-build' executable. Alternatively you
echo.may add the Sphinx directory to PATH.
echo.
echo.If you don't have Sphinx installed, grab it from
echo.http://sphinx-doc.org/
exit /b 1
)
%SPHINXBUILD% -M %1 %SOURCEDIR% %BUILDDIR% %SPHINXOPTS%
goto end
:help
%SPHINXBUILD% -M help %SOURCEDIR% %BUILDDIR% %SPHINXOPTS%
:end
popd