Unverified Commit 733e6ff8 authored by bdf and committed by GitHub

Pick MLU modifications from master (1.x) to main (2.x) (#2704)



* [Feature] Support Voxelization with cambricon MLU device (#2500)

* [Feature] Support hard_voxelize with cambricon MLU backend

* [Feature](bangc-ops): add voxelization op

* [Enhance] Optimize the performance of ms_deform_attn for MLU device (#2510)

* ms_opt

* [Feature] ms_deform_attn performance optimization

* [Feature] Support ball_query with cambricon MLU backend and mlu-ops library. (#2520)

* [Feature] Support ball_query with cambricon MLU backend and mlu-ops library.

* [Fix] update operator data layout setting.

* [Fix] add cxx compile option to avoid symbol conflict.

* [Fix] fix lint errors.

* [Fix] update ops.md with info of ball_query support by MLU backend.

* [Feature] Fix typo.

* [Fix] Remove print.

* [Fix] get mlu-ops from MMCV_MLU_OPS_PATH env.

* [Fix] update MMCV_MLU_OPS_PATH check logic.

* [Fix] update error info when failed to download mlu-ops.

* [Fix] check mlu-ops version matching info in mmcv.

* [Fix] revise wrong filename.

* [Fix] remove f.close and re.

* [Docs] Steps to compile mmcv-full on MLU machine (#2571)

* [Docs] Steps to compile mmcv-full on MLU machine

* [Docs] Adjust paragraph order

* Update docs/zh_cn/get_started/build.md
Co-authored-by: Zaida Zhou <58739961+zhouzaida@users.noreply.github.com>

* Update docs/en/get_started/build.md
Co-authored-by: Zaida Zhou <58739961+zhouzaida@users.noreply.github.com>

* [Docs] Modify the format

---------
Co-authored-by: budefei <budefei@cambricon.com>
Co-authored-by: Zaida Zhou <58739961+zhouzaida@users.noreply.github.com>

* [Fix] Fix tensor descriptor setting in MLU ball_query. (#2579)

* [Feature] Add MLU support for Sparse Convolution op (#2589)

* [Feature] Add sparse convolution MLU API

* [Feature] update cpp code style

* end-of-file

* delete libext.a

* code style

* update ops.md

---------
Co-authored-by: budefei <budefei@cambricon.com>

* [Enhancement] Replace the implementation of deform_roi_pool with mlu-ops (#2598)

* [Feature] Replace the implementation of deform_roi_pool with mlu-ops

* [Feature] Modify code

---------
Co-authored-by: budefei <budefei@cambricon.com>

* [Enhancement] ms_deform_attn performance optimization (#2616)

* ms_opt_v2

* ms_opt_v2_1

* optimize MultiScaleDeformableAttention ops for MLU

* [Feature] ms_deform_attn performance optimization V2

---------
Co-authored-by: dongchengwei <dongchengwei@cambricon.com>

* [Feature] Support NmsRotated with cambricon MLU backend (#2643)

* [Feature] Support NmsRotated with cambricon MLU backend

* [Feature] remove foolproofs in nms_rotated_mlu.cpp

* [Feature] fix lint in test_nms_rotated.py

* [Feature] fix kMLU not found in nms_rotated.cpp

* [Feature] modify mlu support in nms.py

* [Feature] modify nms_rotated support in ops.md

* [Feature] modify ops/nms.py

* [Enhance] Add a default value for MMCV_MLU_ARGS (#2688)

* add mlu_args

* add mlu_args

* Modify the code

---------
Co-authored-by: budefei <budefei@cambricon.com>

* [Enhance] Ignore mlu-ops files (#2691)
Co-authored-by: budefei <budefei@cambricon.com>

---------
Co-authored-by: ZShaopeng <108382403+ZShaopeng@users.noreply.github.com>
Co-authored-by: BinZheng <38182684+Wickyzheng@users.noreply.github.com>
Co-authored-by: liuduanhui <103939338+DanieeelLiu@users.noreply.github.com>
Co-authored-by: budefei <budefei@cambricon.com>
Co-authored-by: Zaida Zhou <58739961+zhouzaida@users.noreply.github.com>
Co-authored-by: duzekun <108381389+duzekunKTH@users.noreply.github.com>
Co-authored-by: dongchengwei <dongchengwei@cambricon.com>
Co-authored-by: liuyuan1-v <125547457+liuyuan1-v@users.noreply.github.com>
parent 1f161f68
......@@ -27,6 +27,8 @@ wheels/
.installed.cfg
*.egg
MANIFEST
mlu-ops/
mlu-ops.*
# PyInstaller
# Usually these files are written by a python script from a template
......
......@@ -290,3 +290,60 @@ If you need to use PyTorch-related modules, make sure PyTorch has been successfu
```bash
python -c 'import mmcv;print(mmcv.__version__)'
```
### Build mmcv-full on Cambricon MLU Devices
#### Install torch_mlu
##### Option 1: Install mmcv-full based on the Cambricon docker image
First, pull the Cambricon docker image (please email service@cambricon.com for the latest release docker):
```bash
docker pull ${docker image}
```
Run and attach to the docker, then [install mmcv-full on the MLU device](#install-mmcv\-full-on-cambricon-mlu-device) and [make sure you've installed mmcv-full on the MLU device successfully](#test-code). A minimal way to run and attach is sketched below.
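The exact `docker run` flags depend on your MLU driver setup and are not covered by this commit; as a minimal sketch (the image name is the placeholder from the pull step):
```bash
# Start an interactive container from the Cambricon image and attach to it
docker run -it ${docker image} /bin/bash
```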
##### Option 2: Install mmcv-full by compiling the Cambricon PyTorch source code
Please email service@cambricon.com or contact Cambricon engineers for a suitable version of the CATCH package. Once you have it, follow the steps in ${CATCH-path}/CONTRIBUTING.md to install Cambricon PyTorch.
#### Install mmcv-full on Cambricon MLU device
Clone the repo
```bash
git clone https://github.com/open-mmlab/mmcv.git
```
The mlu-ops library will be downloaded to the default directory (mmcv/mlu-ops) while building MMCV. You can also set `MMCV_MLU_OPS_PATH` to an existing mlu-ops library before building as follows:
```bash
export MMCV_MLU_OPS_PATH=/xxx/xxx/mlu-ops
```
Install mmcv-full
```bash
cd mmcv
export MMCV_WITH_OPS=1
export FORCE_MLU=1
python setup.py install
```
#### Test Code
After finishing the previous steps, you can run the following Python code to make sure that you've installed mmcv-full on the MLU device successfully:
```python
import torch
import torch_mlu
from mmcv.ops import sigmoid_focal_loss
x = torch.randn(3, 10).mlu()
x.requires_grad = True
y = torch.tensor([1, 5, 3]).mlu()
w = torch.ones(10).float().mlu()
output = sigmoid_focal_loss(x, y, 2.0, 0.25, w, 'none')
print(output)
```
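If the installation succeeded, this should print a 3 x 10 tensor of per-element focal-loss values computed on the MLU device.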
......@@ -6,7 +6,7 @@ We implement common ops used in detection, segmentation, etc.
| ---------------------------- | --- | ---- | --- | --- | ------ |
| ActiveRotatedFilter | √ | √ | | | |
| AssignScoreWithK | | √ | | | |
| BallQuery | | √ | | | |
| BallQuery | | √ | √ | | |
| BBoxOverlaps | | √ | √ | √ | √ |
| BorderAlign | | √ | | | |
| BoxIouRotated | √ | √ | | | |
......@@ -35,7 +35,7 @@ We implement common ops used in detection, segmentation, etc.
| ModulatedDeformConv2d | √ | √ | | | √ |
| MultiScaleDeformableAttn | | √ | √ | | |
| NMS | √ | √ | √ | | √ |
| NMSRotated | √ | √ | | | √ |
| NMSRotated | √ | √ | √ | | √ |
| NMSQuadri | √ | √ | | | |
| PixelGroup | √ | | | | |
| PointsInBoxes | √ | √ | | | |
......@@ -52,13 +52,13 @@ We implement common ops used in detection, segmentation, etc.
| SigmoidFocalLoss | | √ | √ | | √ |
| SoftmaxFocalLoss | | √ | | | √ |
| SoftNMS | | √ | | | |
| Sparse Convolution | | √ | | | |
| Sparse Convolution | | √ | √ | | |
| Synchronized BatchNorm | | √ | | | |
| ThreeInterpolate | | √ | | | |
| ThreeNN | | √ | √ | | |
| TINShift | | √ | √ | | |
| UpFirDn2d | | √ | | | |
| Voxelization | √ | √ | | | √ |
| Voxelization | √ | √ | √ | | √ |
| PrRoIPool | | √ | | | |
| BezierAlign | √ | √ | | | |
| BiasAct | | √ | | | |
......
......@@ -298,3 +298,59 @@ mmcv has two versions:
```bash
python -c 'import mmcv;print(mmcv.__version__)'
```
### Build mmcv-full on a Cambricon MLU Machine
#### Install torch_mlu
##### Option 1: Install based on the Cambricon docker image
First, download and pull the Cambricon docker image (please email service@cambricon.com for the latest Cambricon PyTorch release docker).
```bash
docker pull ${docker image}
```
Enter the docker, then [build MMCV on the MLU](#build-mmcv) and [verify the installation](#verify-the-installation).
##### Option 2: Install by compiling the Cambricon PyTorch source code
Please email service@cambricon.com or contact Cambricon engineers for a suitable version of the CATCH package. Once you have it, follow the steps in ${CATCH-path}/CONTRIBUTING.md to install CATCH.
#### Build MMCV
Clone the repository
```bash
git clone https://github.com/open-mmlab/mmcv.git
```
The mlu-ops library is downloaded automatically to the default path (mmcv/mlu-ops) when building MMCV. You can also set the environment variable MMCV_MLU_OPS_PATH to an existing mlu-ops library before building.
```bash
export MMCV_MLU_OPS_PATH=/xxx/xxx/mlu-ops
```
Start building
```bash
cd mmcv
export MMCV_WITH_OPS=1
export FORCE_MLU=1
python setup.py install
```
#### Verify the Installation
After finishing the steps above, you can try running the following Python code to check whether you have successfully installed mmcv-full on the MLU device:
```python
import torch
import torch_mlu
from mmcv.ops import sigmoid_focal_loss
x = torch.randn(3, 10).mlu()
x.requires_grad = True
y = torch.tensor([1, 5, 3]).mlu()
w = torch.ones(10).float().mlu()
output = sigmoid_focal_loss(x, y, 2.0, 0.25, w, 'none')
```
......@@ -6,7 +6,7 @@ MMCV provides commonly used operators for detection, segmentation, and other tasks
| ---------------------------- | --- | ---- | --- | --- | ------ |
| ActiveRotatedFilter | √ | √ | | | |
| AssignScoreWithK | | √ | | | |
| BallQuery | | √ | | | |
| BallQuery | | √ | √ | | |
| BBoxOverlaps | | √ | √ | √ | √ |
| BorderAlign | | √ | | | |
| BoxIouRotated | √ | √ | | | |
......@@ -35,7 +35,7 @@ MMCV provides commonly used operators for detection, segmentation, and other tasks
| ModulatedDeformConv2d | √ | √ | | | √ |
| MultiScaleDeformableAttn | | √ | √ | | |
| NMS | √ | √ | √ | | √ |
| NMSRotated | √ | √ | | | √ |
| NMSRotated | √ | √ | √ | | √ |
| NMSQuadri | √ | √ | | | |
| PixelGroup | √ | | | | |
| PointsInBoxes | √ | √ | | | |
......@@ -52,13 +52,13 @@ MMCV provides commonly used operators for detection, segmentation, and other tasks
| SigmoidFocalLoss | | √ | √ | | √ |
| SoftmaxFocalLoss | | √ | | | √ |
| SoftNMS | | √ | | | |
| Sparse Convolution | | √ | | | |
| Sparse Convolution | | √ | √ | | |
| Synchronized BatchNorm | | √ | | | |
| ThreeInterpolate | | √ | | | |
| ThreeNN | | √ | √ | | |
| TINShift | | √ | √ | | |
| UpFirDn2d | | √ | | | |
| Voxelization | √ | √ | | | √ |
| Voxelization | √ | √ | √ | | √ |
| PrRoIPool | | √ | | | |
| BezierAlign | √ | √ | | | |
| BiasAct | | √ | | | |
......
/*************************************************************************
* Copyright (C) 2022 Cambricon.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS
* OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
* MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
* IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
* CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
* TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
* SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
*************************************************************************/
#include "mlu_common_helper.h"
void ball_query_forward_mlu(int b, int n, int m, float min_radius,
float max_radius, int nsample, const Tensor new_xyz,
const Tensor xyz, Tensor idx) {
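// Make the inputs contiguous in new_xyz's suggested memory format, build
// mlu-ops tensor descriptors for them, fetch the raw device pointers, and
// dispatch mluOpBallQuery on the current mlu-ops handle.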
auto new_xyz_contiguous = torch_mlu::cnnl::ops::cnnl_contiguous(
new_xyz, new_xyz.suggest_memory_format());
auto xyz_contiguous = torch_mlu::cnnl::ops::cnnl_contiguous(
xyz, new_xyz.suggest_memory_format());
auto idx_contiguous = torch_mlu::cnnl::ops::cnnl_contiguous(
idx, new_xyz.suggest_memory_format());
MluOpTensorDescriptor new_xyz_desc, xyz_desc, idx_desc;
new_xyz_desc.set(new_xyz_contiguous);
xyz_desc.set(xyz_contiguous);
idx_desc.set(idx_contiguous);
auto new_xyz_impl = torch_mlu::getMluTensorImpl(new_xyz_contiguous);
auto xyz_impl = torch_mlu::getMluTensorImpl(xyz_contiguous);
auto idx_impl = torch_mlu::getMluTensorImpl(idx_contiguous);
auto new_xyz_ptr = new_xyz_impl->cnnlMalloc();
auto xyz_ptr = xyz_impl->cnnlMalloc();
auto idx_ptr = idx_impl->cnnlMalloc();
auto handle = mluOpGetCurrentHandle();
mluOpBallQuery(handle, new_xyz_desc.desc(), new_xyz_ptr, xyz_desc.desc(),
xyz_ptr, min_radius, max_radius, nsample, idx_desc.desc(),
idx_ptr);
}
void ball_query_forward_impl(int b, int n, int m, float min_radius,
float max_radius, int nsample,
const Tensor new_xyz, const Tensor xyz,
Tensor idx);
REGISTER_DEVICE_IMPL(ball_query_forward_impl, MLU, ball_query_forward_mlu);
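// REGISTER_DEVICE_IMPL binds ball_query_forward_mlu to the generic
// ball_query_forward_impl dispatch entry for the MLU device type.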
......@@ -9,254 +9,59 @@
* TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
* SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
*************************************************************************/
#include "pytorch_device_registry.hpp"
#include "pytorch_mlu_helper.hpp"
void KernelDeformRoIPoolForward(cnrtDim3_t k_dim, cnrtFunctionType_t k_type,
cnrtQueue_t queue, cnrtDataType_t data_type,
const void *input, const void *rois,
const void *offset, void *output,
const int channels, const int height,
const int width, const int num_rois,
const int pooled_height, const int pooled_width,
const float spatial_scale,
const int sampling_ratio, const float gamma);
void KernelDeformRoIPoolBackward(
cnrtDim3_t k_dim, cnrtFunctionType_t k_type, cnrtQueue_t queue,
cnrtDataType_t data_type, const void *grad_output, const void *input,
const void *rois, const void *offset, void *grad_input, void *grad_offset,
const int channels, const int height, const int width, const int num_rois,
const int pooled_height, const int pooled_width, const float spatial_scale,
const int sampling_ratio, const float gamma);
// policy function for forward and backward
static void policyFunc(const int bin_num, cnrtDim3_t *k_dim,
cnrtFunctionType_t *k_type) {
const size_t cluster_limit = torch_mlu::getDeviceAttr(cnrtAttrClusterCount);
const size_t core_limit = torch_mlu::getDeviceAttr(cnrtAttrMcorePerCluster);
const size_t bin_num_align = CEIL_ALIGN(bin_num, core_limit);
k_dim->x = core_limit;
k_dim->y = (bin_num_align / core_limit) > cluster_limit
? cluster_limit
: (bin_num_align / core_limit);
k_dim->z = 1;
*k_type = CNRT_FUNC_TYPE_UNION1;
}
#include "mlu_common_helper.h"
void DeformRoIPoolForwardMLUKernelLauncher(Tensor input, Tensor rois,
Tensor offset, Tensor output,
int pooled_height, int pooled_width,
float spatial_scale,
int sampling_ratio, float gamma) {
// Check dtype.
TORCH_CHECK(
input.scalar_type() == at::kFloat || input.scalar_type() == at::kHalf,
"input type should be Float or Half, got ", input.scalar_type());
TORCH_CHECK(input.scalar_type() == rois.scalar_type(),
"rois should have the same type as input");
// Check shape.
TORCH_CHECK(input.dim() == 4, "input should be 4d tensor, got ", input.dim(),
"D.");
TORCH_CHECK(rois.dim() == 2, "rois should be 2d tensor, got ", rois.dim(),
"D.");
if (offset.defined() && offset.numel() > 0) {
TORCH_CHECK(input.scalar_type() == offset.scalar_type(),
"offset should have the same type as input");
TORCH_CHECK(offset.dim() == 4, "offset should be 4d tensor, got ",
offset.dim(), "D.");
TORCH_CHECK(
(offset.size(0) == rois.size(0)), "offset.size(0) = ", offset.size(0),
"while rois.size(0)) = ", rois.size(0), ". They should be the same.");
TORCH_CHECK((offset.size(1) == 2), "offset.size(1) should be 2, ",
"but now offset.size(1) = ", offset.size(1), ".");
TORCH_CHECK((offset.size(2) == output.size(2)),
"offset.size(2) = ", offset.size(2),
"while output.size(2)) = ", output.size(2),
". They should be the same.");
TORCH_CHECK((offset.size(3) == output.size(3)),
"offset.size(3) = ", offset.size(3),
"while output.size(3)) = ", output.size(3),
". They should be the same.");
}
TORCH_CHECK(spatial_scale > 0 && spatial_scale <= 1,
"spatial_scale should be within (0, 1], got ", spatial_scale,
".");
// compute kernel params
auto height = input.size(2);
auto width = input.size(3);
auto channels = input.size(1);
auto num_rois = output.size(0);
if (output.numel() == 0) {
output = at::zeros({num_rois, channels, pooled_height, pooled_width},
input.options());
return;
}
// zero element check
TORCH_CHECK(input.size(0) != 0, "input.size(0) should not be zero, got ",
input.size(0));
TORCH_CHECK(rois.numel() != 0, "rois.numel() should not be zero, got ",
rois.numel());
if (input.numel() == 0 || output.numel() == 0) {
return;
}
// large tensor check
const size_t max_input_num = 2147483648; // 2^31, 2G num
TORCH_CHECK(input.numel() < max_input_num,
"input.numel() should be less than 2147483648, got ",
input.numel());
TORCH_CHECK(rois.numel() < max_input_num,
"rois.numel() should be less than 2147483648, got ",
rois.numel());
TORCH_CHECK(output.numel() < max_input_num,
"output.numel() should be less than 2147483648, got ",
output.numel());
TORCH_CHECK(!offset.defined() || offset.numel() < max_input_num,
"offset.numel() should be less than 2147483648, got ",
offset.numel());
auto memory_format =
torch_mlu::cnnl::ops::get_channels_last_memory_format(input.dim());
auto input_ = torch_mlu::cnnl::ops::cnnl_contiguous(input, memory_format);
at::Tensor output_ =
at::empty({num_rois, channels, pooled_height, pooled_width},
input.options(), memory_format);
// calculate task dimension
cnrtDim3_t k_dim;
cnrtFunctionType_t k_type;
policyFunc(num_rois * pooled_height * pooled_width, &k_dim, &k_type);
// get compute queue
auto queue = torch_mlu::getCurQueue();
auto rois_contiguous =
torch_mlu::cnnl::ops::cnnl_contiguous(rois, rois.suggest_memory_format());
auto output_contiguous =
torch_mlu::cnnl::ops::cnnl_contiguous(output, memory_format);
MluOpTensorDescriptor input_desc, rois_desc, offset_desc, output_desc;
input_desc.set_with_layout(input_, MLUOP_LAYOUT_NHWC);
rois_desc.set(rois_contiguous);
output_desc.set_with_layout(output_contiguous, MLUOP_LAYOUT_NHWC);
mluOpTensorDescriptor_t offset_real_desc = NULL;
void *offset_ptr = NULL;
if (offset.defined() && offset.numel() > 0) {
auto offset_contiguous = torch_mlu::cnnl::ops::cnnl_contiguous(
offset, offset.suggest_memory_format());
offset_desc.set(offset_contiguous);
offset_real_desc = offset_desc.desc();
auto offset_impl = torch_mlu::getMluTensorImpl(offset_contiguous);
offset_ptr = offset_impl->cnnlMalloc();
}
// get ptr of tensors
auto input_impl = torch_mlu::getMluTensorImpl(input_);
auto input_ptr = input_impl->cnnlMalloc();
auto rois_impl = torch_mlu::getMluTensorImpl(rois);
auto rois_impl = torch_mlu::getMluTensorImpl(rois_contiguous);
auto rois_ptr = rois_impl->cnnlMalloc();
auto offset_impl = torch_mlu::getMluTensorImpl(offset);
auto offset_ptr = offset_impl->cnnlMalloc();
auto output_impl = torch_mlu::getMluTensorImpl(output_);
auto output_impl = torch_mlu::getMluTensorImpl(output_contiguous);
auto output_ptr = output_impl->cnnlMalloc();
// get compute dtype of input
cnrtDataType_t data_type = torch_mlu::toCnrtDtype(input_.dtype());
// launch kernel
CNLOG(INFO) << "Launch Kernel MLUKernelDeformRoIPoolForward<<<" << k_dim.x
<< ", " << k_dim.y << ", " << k_dim.z << ">>>";
// get compute handle
auto handle = mluOpGetCurrentHandle();
mluOpDeformRoiPoolForward(
handle, input_desc.desc(), input_ptr, rois_desc.desc(), rois_ptr,
offset_real_desc, offset_ptr, pooled_height, pooled_width, spatial_scale,
sampling_ratio, gamma, output_desc.desc(), output_ptr);
KernelDeformRoIPoolForward(k_dim, k_type, queue, data_type, input_ptr,
rois_ptr, offset_ptr, output_ptr, channels, height,
width, num_rois, pooled_height, pooled_width,
spatial_scale, sampling_ratio, gamma);
output.copy_(output_);
output.copy_(output_contiguous);
}
void DeformRoIPoolBackwardMLUKernelLauncher(
Tensor grad_output, Tensor input, Tensor rois, Tensor offset,
Tensor grad_input, Tensor grad_offset, int pooled_height, int pooled_width,
float spatial_scale, int sampling_ratio, float gamma) {
// Check dtype.
TORCH_CHECK(
input.scalar_type() == at::kFloat || input.scalar_type() == at::kHalf,
"input type should be Float or Half, got ", input.scalar_type());
TORCH_CHECK(input.scalar_type() == grad_output.scalar_type(),
"grad_output should have the same type as input");
TORCH_CHECK(input.scalar_type() == rois.scalar_type(),
"rois should have the same type as input");
TORCH_CHECK(input.scalar_type() == grad_input.scalar_type(),
"grad_input should have the same type as input");
// Check shape.
TORCH_CHECK(grad_output.dim() == 4, "grad_output should be 4d tensor, got ",
grad_output.dim(), "D.");
TORCH_CHECK(input.dim() == 4, "input should be 4d tensor, got ", input.dim(),
"D.");
TORCH_CHECK(rois.dim() == 2, "rois should be 2d tensor, got ", rois.dim(),
"D.");
if (offset.defined() && offset.numel() > 0) {
TORCH_CHECK(input.scalar_type() == offset.scalar_type(),
"offset should have the same type as input");
TORCH_CHECK(offset.dim() == 4, "offset should be 4d tensor, got ",
offset.dim(), "D.");
TORCH_CHECK(
(offset.size(0) == rois.size(0)), "offset.size(0) = ", offset.size(0),
"while rois.size(0)) = ", rois.size(0), ". They should be the same.");
TORCH_CHECK((offset.size(1) == 2), "offset.size(1) should be 2, ",
"but now offset.size(1) = ", offset.size(1), ".");
TORCH_CHECK((offset.size(2) == grad_output.size(2)),
"offset.size(2) = ", offset.size(2),
"while grad_output.size(2)) = ", grad_output.size(2),
". They should be the same.");
TORCH_CHECK((offset.size(3) == grad_output.size(3)),
"offset.size(3) = ", offset.size(3),
"while grad_output.size(3)) = ", grad_output.size(3),
". They should be the same.");
}
TORCH_CHECK(spatial_scale > 0 && spatial_scale <= 1,
"spatial_scale should be within (0, 1], got ", spatial_scale);
// Check relationship between tensor.
TORCH_CHECK((grad_output.size(0) == rois.size(0)),
"grad_output.size(0) = ", grad_output.size(0),
"while rois.size(0)) = ", rois.size(0),
". They should be the same.");
TORCH_CHECK((grad_output.size(1) == input.size(1)),
"grad_output.size(1) = ", grad_output.size(1),
"while input.size(1)) = ", input.size(1),
". They should be the same.");
TORCH_CHECK((grad_output.size(2) == pooled_height),
"grad_output.size(2) = ", grad_output.size(2),
"while pooled_height = ", pooled_height,
". They should be the same.");
TORCH_CHECK((grad_output.size(3) == pooled_width),
"grad_output.size(3) = ", grad_output.size(3),
"while pooled_width = ", pooled_width,
". They should be the same.");
// compute kernel params
auto batch = input.size(0);
auto channels = input.size(1);
auto height = input.size(2);
auto width = input.size(3);
auto num_rois = grad_output.size(0);
// zero element check
TORCH_CHECK(input.size(0) != 0, "input.size(0) should not be zero, got ",
input.size(0));
TORCH_CHECK(rois.numel() != 0, "rois.numel() should not be zero, got ",
rois.numel());
if (input.numel() == 0 || grad_output.numel() == 0) {
return;
}
// large tensor check
const size_t max_input_num = 2147483648; // 2^31, 2G num
TORCH_CHECK(input.numel() < max_input_num,
"input.numel() should be less than 2147483648, got ",
input.numel());
TORCH_CHECK(rois.numel() < max_input_num,
"rois.numel() should be less than 2147483648, got ",
rois.numel());
TORCH_CHECK(grad_output.numel() < max_input_num,
"grad_output.numel() should be less than 2147483648, got ",
grad_output.numel());
TORCH_CHECK(!offset.defined() || offset.numel() < max_input_num,
"offset.numel() should be less than 2147483648, got ",
offset.numel());
auto memory_format =
torch_mlu::cnnl::ops::get_channels_last_memory_format(grad_output.dim());
auto grad_output_ =
......@@ -264,45 +69,56 @@ void DeformRoIPoolBackwardMLUKernelLauncher(
memory_format =
torch_mlu::cnnl::ops::get_channels_last_memory_format(input.dim());
auto input_ = torch_mlu::cnnl::ops::cnnl_contiguous(input, memory_format);
at::Tensor grad_input_ = at::empty({batch, channels, height, width},
input.options(), memory_format)
.zero_();
// calculate task dimension
cnrtDim3_t k_dim;
cnrtFunctionType_t k_type;
policyFunc(num_rois * pooled_height * pooled_width, &k_dim, &k_type);
// get compute queue
auto queue = torch_mlu::getCurQueue();
auto rois_contiguous =
torch_mlu::cnnl::ops::cnnl_contiguous(rois, rois.suggest_memory_format());
auto grad_input_ =
torch_mlu::cnnl::ops::cnnl_contiguous(grad_input, memory_format);
// get ptr of tensors
auto grad_output_impl = torch_mlu::getMluTensorImpl(grad_output_);
auto grad_output_ptr = grad_output_impl->cnnlMalloc();
auto input_impl = torch_mlu::getMluTensorImpl(input_);
auto input_ptr = input_impl->cnnlMalloc();
auto rois_impl = torch_mlu::getMluTensorImpl(rois);
auto rois_impl = torch_mlu::getMluTensorImpl(rois_contiguous);
auto rois_ptr = rois_impl->cnnlMalloc();
auto offset_impl = torch_mlu::getMluTensorImpl(offset);
auto offset_ptr = offset_impl->cnnlMalloc();
auto grad_input_impl = torch_mlu::getMluTensorImpl(grad_input_);
auto grad_input_ptr = grad_input_impl->cnnlMalloc();
auto grad_offset_impl = torch_mlu::getMluTensorImpl(grad_offset);
auto grad_offset_ptr = grad_offset_impl->cnnlMalloc();
// get compute dtype of input
cnrtDataType_t data_type = torch_mlu::toCnrtDtype(input.dtype());
// launch kernel
CNLOG(INFO) << "Launch Kernel KernelDeformRoIPoolBackward<<<" << k_dim.x
<< ", " << k_dim.y << ", " << k_dim.z << ">>>";
KernelDeformRoIPoolBackward(k_dim, k_type, queue, data_type, grad_output_ptr,
input_ptr, rois_ptr, offset_ptr, grad_input_ptr,
grad_offset_ptr, channels, height, width,
num_rois, pooled_height, pooled_width,
spatial_scale, sampling_ratio, gamma);
MluOpTensorDescriptor grad_output_desc, input_desc, rois_desc, offset_desc,
grad_input_desc, grad_offset_desc;
grad_output_desc.set_with_layout(grad_output_, MLUOP_LAYOUT_NHWC);
input_desc.set_with_layout(input_, MLUOP_LAYOUT_NHWC);
rois_desc.set(rois_contiguous);
grad_input_desc.set_with_layout(grad_input_, MLUOP_LAYOUT_NHWC);
mluOpTensorDescriptor_t offset_real_desc = NULL;
void *offset_ptr = NULL;
if (offset.defined() && offset.numel() > 0) {
auto offset_contiguous = torch_mlu::cnnl::ops::cnnl_contiguous(
offset, offset.suggest_memory_format());
offset_desc.set(offset_contiguous);
offset_real_desc = offset_desc.desc();
auto offset_impl = torch_mlu::getMluTensorImpl(offset_contiguous);
offset_ptr = offset_impl->cnnlMalloc();
}
mluOpTensorDescriptor_t grad_offset_real_desc = NULL;
void *grad_offset_ptr = NULL;
if (grad_offset.defined() && grad_offset.numel() > 0) {
auto grad_offset_contiguous = torch_mlu::cnnl::ops::cnnl_contiguous(
grad_offset, grad_offset.suggest_memory_format());
grad_offset_desc.set(grad_offset_contiguous);
grad_offset_real_desc = grad_offset_desc.desc();
auto grad_offset_impl = torch_mlu::getMluTensorImpl(grad_offset_contiguous);
grad_offset_ptr = grad_offset_impl->cnnlMalloc();
}
// get compute handle
auto handle = mluOpGetCurrentHandle();
mluOpDeformRoiPoolBackward(
handle, grad_output_desc.desc(), grad_output_ptr, input_desc.desc(),
input_ptr, rois_desc.desc(), rois_ptr, offset_real_desc, offset_ptr,
pooled_height, pooled_width, spatial_scale, sampling_ratio, gamma,
grad_input_desc.desc(), grad_input_ptr, grad_offset_real_desc,
grad_offset_ptr);
grad_input.copy_(grad_input_);
}
......
/*************************************************************************
* Copyright (C) 2022 Cambricon.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS
* OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
* MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
* IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
* CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
* TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
* SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
*************************************************************************/
#include "mlu_common_helper.h"
// Descriptors
mluOpDataType_t getMluOpDataType(const caffe2::TypeMeta& data_type) {
const std::map<std::string, mluOpDataType_t> mapping_type = {
{std::string("c10::Half"), MLUOP_DTYPE_HALF},
{std::string("float"), MLUOP_DTYPE_FLOAT},
{std::string("double"), MLUOP_DTYPE_DOUBLE},
{std::string("int8"), MLUOP_DTYPE_INT8},
{std::string("signed char"), MLUOP_DTYPE_INT8},
{std::string("short int"), MLUOP_DTYPE_INT16},
{std::string("short"), MLUOP_DTYPE_INT16},
{std::string("int"), MLUOP_DTYPE_INT32},
{std::string("long int"), MLUOP_DTYPE_INT64},
{std::string("long"), MLUOP_DTYPE_INT64},
{std::string("unsigned char"), MLUOP_DTYPE_UINT8},
{std::string("bool"), MLUOP_DTYPE_BOOL},
{std::string("c10::complex<c10::Half>"), MLUOP_DTYPE_COMPLEX_HALF},
{std::string("c10::complex<float>"), MLUOP_DTYPE_COMPLEX_FLOAT}};
if (mapping_type.find(std::string(data_type.name())) != mapping_type.end()) {
return mapping_type.find(std::string(data_type.name()))->second;
}
return MLUOP_DTYPE_INVALID;
}
// layout
mluOpTensorLayout_t getMluOpSuggestLayout(const at::Tensor& input) {
auto suggest_memory_format = input.suggest_memory_format();
mluOpTensorLayout_t layout = MLUOP_LAYOUT_ARRAY;
switch (input.dim()) {
case 4:
layout = (suggest_memory_format == at::MemoryFormat::ChannelsLast)
? MLUOP_LAYOUT_NHWC
: MLUOP_LAYOUT_NCHW;
break;
case 5:
layout = (suggest_memory_format == at::MemoryFormat::ChannelsLast3d)
? MLUOP_LAYOUT_NDHWC
: MLUOP_LAYOUT_NCDHW;
break;
default:
layout = MLUOP_LAYOUT_ARRAY;
}
return layout;
}
void MluOpTensorDescriptor::set(Tensor t) {
mluOpDataType_t data_type = getMluOpDataType(t.dtype());
mluOpTensorLayout_t layout = getMluOpSuggestLayout(t);
int t_dim = t.dim();
std::vector<int> dim_array;
if (t_dim == 0) {
// A scalar tensor (0-dim, 1-item) is viewed as size 1 by default.
dim_array.push_back(1);
} else {
for (int i = 0; i < t_dim; i++) {
dim_array.push_back(static_cast<int>(t.sizes().vec()[i]));
}
}
set_desc(t, layout, data_type, dim_array);
}
void MluOpTensorDescriptor::set_with_layout(Tensor t,
mluOpTensorLayout_t layout) {
mluOpDataType_t data_type = getMluOpDataType(t.dtype());
int t_dim = t.dim();
std::vector<int> shape_info = checkUpperBoundAndCastTo<int>(t.sizes().vec());
std::vector<int> stride_info =
checkUpperBoundAndCastTo<int>(t.strides().vec());
if (layout == MLUOP_LAYOUT_NHWC || layout == MLUOP_LAYOUT_NDHWC ||
layout == MLUOP_LAYOUT_NLC) {
convertShapeAndStride(shape_info, stride_info);
} else if (layout == MLUOP_LAYOUT_HWCN) {
auto convertDepthWiseConvShapeStride = [](const std::vector<int64_t>& vec,
std::vector<int>& target_vec,
std::vector<int>& stride_vec) {
// NCHW --> HWCN
target_vec[0] = static_cast<int>(vec[2]);
target_vec[1] = static_cast<int>(vec[3]);
target_vec[2] = static_cast<int>(vec[1]);
target_vec[3] = static_cast<int>(vec[0]);
// Calculate Stride just like contiguous of HWCN.
stride_vec[3] = 1;
stride_vec[2] = target_vec[3] * stride_vec[3];
stride_vec[1] = target_vec[2] * stride_vec[2];
stride_vec[0] = target_vec[1] * stride_vec[1];
};
convertDepthWiseConvShapeStride(t.sizes().vec(), shape_info, stride_info);
}
TORCH_CHECK(mluOpSetTensorDescriptorEx(
desc_, layout, data_type, t_dim, shape_info.data(),
stride_info.data()) == MLUOP_STATUS_SUCCESS,
"mluOpSetTensorDescriptorEx execution failed.");
}
void MluOpTensorDescriptor::set_desc(const at::Tensor& t,
mluOpTensorLayout_t layout,
mluOpDataType_t dtype,
std::vector<int>& dims) {
int dimNb = dims.size();
mluOpSetTensorDescriptor(desc_, layout, dtype, dimNb, dims.data());
}
// Handles
std::once_flag mmcv_mluop_init_flag;
std::mutex mmcv_mluop_mutex;
static std::vector<MluOpHandle> mmcv_mluop_handles;
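// One cached mluOp handle per device; the current queue is re-bound to the
// handle on every lookup.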
mluOpHandle_t mluOpGetCurrentHandle(c10::DeviceIndex device_index) {
std::call_once(mmcv_mluop_init_flag,
[]() // Init mmcv_mluop_handles 1-device <-> 1-handle
{
c10::DeviceIndex num_devices = torch_mlu::device_count();
mmcv_mluop_handles.resize(num_devices);
});
if (device_index == -1) {
device_index = torch_mlu::current_device();
}
std::lock_guard<std::mutex> mmcv_mluop_guard(mmcv_mluop_mutex);
auto queue = torch_mlu::getCurrentQueue(device_index).queue();
mmcv_mluop_handles[device_index].setQueue(queue);
return mmcv_mluop_handles[device_index].handle;
}
/*************************************************************************
* Copyright (C) 2022 Cambricon.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS
* OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
* MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
* IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
* CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
* TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
* SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
*************************************************************************/
#pragma once
#include <ATen/ATen.h>
#include <c10/core/ScalarType.h>
#include "aten.h"
#include "mlu_op.h"
#include "pytorch_device_registry.hpp"
#define MLUOP_MAJOR 0
#define MLUOP_MINOR 5
#define MLUOP_PATCHLEVEL 302
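// Editorial note: the mlu-ops version these bindings expect (0.5.302 here);
// per the commit history above, mmcv checks the fetched mlu-ops library
// against this version at build time.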
mluOpDataType_t getMluOpDataType(const caffe2::TypeMeta& data_type);
mluOpTensorLayout_t getMluOpSuggestLayout(const at::Tensor& input);
class MluOpTensorDescriptor {
public:
MluOpTensorDescriptor() { mluOpCreateTensorDescriptor(&desc_); };
~MluOpTensorDescriptor() { mluOpDestroyTensorDescriptor(desc_); }
void set(at::Tensor);
void set_with_layout(at::Tensor, mluOpTensorLayout_t layout);
mluOpTensorDescriptor_t desc() { return desc_; }
private:
mluOpTensorDescriptor_t desc_;
void set_desc(const at::Tensor&, mluOpTensorLayout_t, mluOpDataType_t,
std::vector<int>& dims);
};
mluOpHandle_t mluOpGetCurrentHandle(c10::DeviceIndex device_index = -1);
class MluOpHandle {
public:
MluOpHandle() : handle(nullptr) { mluOpCreate(&handle); }
~MluOpHandle() {
if (handle) {
mluOpDestroy(handle);
handle = nullptr;
}
}
void setQueue(cnrtQueue_t queue) { mluOpSetQueue(handle, queue); }
mluOpHandle_t handle;
};
// Reorder tensor sizes and strides from channels_first to channels_last
// (or channels_last_3d). Note this is not the same as PyTorch's logical
// layout: the resulting order reflects how the data is actually stored.
// example: modify channels_last tensor dim to nhwc tensor desc.
// N C H W --> N H W C
// C*H*W 1 W C --> C*H*W W C 1
template <typename T>
void convertShapeAndStride(std::vector<T>& shape_info,
std::vector<T>& stride_info) {
TORCH_MLU_CHECK(shape_info.size() == stride_info.size(),
"shape size must equal stride size.");
const int dim = shape_info.size();
std::vector<T> temp_shape_info(dim);
std::vector<T> temp_stride_info(dim);
temp_shape_info[0] = shape_info[0];
temp_stride_info[0] = stride_info[0];
for (size_t i = 0; i < dim - 1; ++i) {
const int index = (i + 1) % (dim - 1) + 1;
temp_shape_info[i + 1] = shape_info[index];
temp_stride_info[i + 1] = stride_info[index];
}
shape_info.assign(temp_shape_info.begin(), temp_shape_info.end());
stride_info.assign(temp_stride_info.begin(), temp_stride_info.end());
}
// torch tensors provide int64_t shapes and strides, but the mlu-ops
// descriptor requires int32. Use this function to cast safely, or report
// an error when a value exceeds the int32 upper bound.
template <typename DST_T, typename SRC_T>
std::vector<DST_T> checkUpperBoundAndCastTo(const std::vector<SRC_T>& input) {
std::vector<DST_T> output;
output.reserve(input.size());
for (const auto& val : input) {
if (val > std::numeric_limits<DST_T>::max()) {
TORCH_MLU_CHECK(false, "Requires dim size not greater than ",
std::numeric_limits<DST_T>::max(), ". But got ", val,
".");
}
output.push_back(static_cast<DST_T>(val));
}
return output;
}
......@@ -14,7 +14,15 @@
#define MIN(a, b) (((a) < (b)) ? (a) : (b))
void KernelMsDeformAttnForward(
typedef enum {
MS_DEFORM_ATTN_FORWARD_INVALID = 0, /*!< Index is invalid. */
MS_DEFORM_ATTN_FORWARD_DEFAULT =
1, /*!< MLUKernelMsDeformAttnForwardDefault */
MS_DEFORM_ATTN_FORWARD_SMALL_CHANNEL =
2, /*!< MLUKernelMsDeformAttnForwardSmallChannel */
} MsDeformAttnForwardPolicy;
void KernelMsDeformAttnForwardDefault(
cnrtDim3_t k_dim, cnrtFunctionType_t k_type, cnrtQueue_t queue,
const cnrtDataType_t d_type, const char* data_value_gdram,
const char* data_spatial_shapes_gdram,
......@@ -23,7 +31,37 @@ void KernelMsDeformAttnForward(
const int32_t batch_size, const int32_t num_keys, const int32_t num_heads,
const int32_t channels, const int32_t num_levels, const int32_t num_queries,
const int32_t num_points, char* data_col_gdram);
void KernelMsDeformAttnBackward(
void KernelMsDeformAttnForwardSmallChannel(
cnrtDim3_t k_dim, cnrtFunctionType_t k_type, cnrtQueue_t queue,
const cnrtDataType_t d_type, const char* data_value_gdram,
const char* data_spatial_shapes_gdram,
const char* data_level_start_index_gdram,
const char* data_sampling_loc_gdram, const char* data_attn_weight_gdram,
const int32_t batch_size, const int32_t num_keys, const int32_t num_heads,
const int32_t channels, const int32_t num_levels, const int32_t num_queries,
const int32_t num_points, char* data_col_gdram);
typedef enum {
MS_DEFORM_ATTN_BACKWARD_DEFAULT = 0,
MS_DEFORM_ATTN_BACKWARD_SMALL_CHANNEL = 1,
} MsDeformAttnBackwardKernelPolicy;
MsDeformAttnBackwardKernelPolicy msDeformAttnBackwardPolicyFunc(
const int32_t channels, const int32_t num_levels, const int32_t num_points,
const int32_t num_heads) {
const int32_t nram_size = torch_mlu::getDeviceAttr(cnrtAttrNramSizePerMcore);
const int num_hlp = num_heads * num_levels * num_points;
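// Editorial note: a rough NRAM capacity estimate. After reserving space for
// the per-level shape (float) and index (3 x int32) buffers, compute how
// many padded (head, level, point) groups fit per core; the small-channel
// kernel is chosen only if at least one group fits.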
int num_per_time_theory = (nram_size - num_levels * sizeof(float) -
3 * num_levels * sizeof(int32_t)) /
sizeof(float) / (8 * PAD_UP(channels, 32) + 28) /
PAD_UP((num_hlp), 32);
if (num_per_time_theory >= 1) {
return MS_DEFORM_ATTN_BACKWARD_SMALL_CHANNEL;
}
return MS_DEFORM_ATTN_BACKWARD_DEFAULT;
}
void KernelMsDeformAttnBackwardDefaultKernel(
cnrtDim3_t k_dim, cnrtFunctionType_t k_type, cnrtQueue_t queue,
const cnrtDataType_t d_type, const float* data_value,
const int32_t* spatial_shapes, const int32_t* data_level_start_index,
......@@ -32,10 +70,23 @@ void KernelMsDeformAttnBackward(
const int32_t num_heads, const int32_t channels, const int32_t num_levels,
const int32_t num_queries, const int32_t num_points, float* grad_value,
float* grad_sampling_loc, float* grad_attn_weight);
void KernelMsDeformAttnBackwardSmallChannelsKernel(
cnrtDim3_t k_dim, cnrtFunctionType_t k_type, cnrtQueue_t queue,
const cnrtDataType_t d_type, const float* data_value,
const int32_t* spatial_shapes, const int32_t* data_level_start_index,
const float* data_sampling_loc, const float* data_attn_weight,
const float* grad_output, const int32_t batch, const int32_t spatial_size,
const int32_t num_heads, const int32_t channels, const int32_t num_levels,
const int32_t num_query, const int32_t num_points, float* grad_value,
float* grad_sampling_loc, float* grad_attn_weight);
// policy function
static void policyFuncForward(cnrtDim3_t* k_dim, cnrtFunctionType_t* k_type,
const int batch_size, const int num_queries,
const int num_heads) {
MsDeformAttnForwardPolicy msDeformAttnForwardPolicyFunc(
cnrtDim3_t* k_dim, cnrtFunctionType_t* k_type, const int32_t batch_size,
const int32_t num_keys, const int32_t num_heads, const int32_t channels,
const int32_t num_levels, const int32_t num_queries,
const int32_t num_points) {
k_dim->x = torch_mlu::getDeviceAttr(cnrtAttrMcorePerCluster);
k_dim->y =
MIN((batch_size * num_queries * num_heads + k_dim->x - 1) / k_dim->x,
......@@ -46,6 +97,16 @@ static void policyFuncForward(cnrtDim3_t* k_dim, cnrtFunctionType_t* k_type,
#else
*k_type = CNRT_FUNC_TYPE_UNION1;
#endif
int32_t nram_size = torch_mlu::getDeviceAttr(cnrtAttrNramSizePerMcore);
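// Editorial note: fall back to the default kernel when the per-point level
// metadata cannot fit in NRAM, or when channels is outside the range the
// small-channel kernel handles (roughly 16 to 96, further bounded by NRAM).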
if (num_levels * num_points * 3 * sizeof(int32_t) > nram_size) {
return MS_DEFORM_ATTN_FORWARD_DEFAULT;
} else if (channels > nram_size / 12 / sizeof(float) || channels > 96 ||
channels < 16) {
return MS_DEFORM_ATTN_FORWARD_DEFAULT;
} else {
return MS_DEFORM_ATTN_FORWARD_SMALL_CHANNEL;
}
}
// policy function for backward
......@@ -196,7 +257,9 @@ Tensor ms_deform_attn_mlu_forward(const Tensor& value,
// calculate task dimension
cnrtDim3_t k_dim;
cnrtFunctionType_t k_type;
policyFuncForward(&k_dim, &k_type, batch_size, num_queries, num_heads);
MsDeformAttnForwardPolicy policy = msDeformAttnForwardPolicyFunc(
&k_dim, &k_type, batch_size, num_keys, num_heads, channels, num_levels,
num_queries, num_points);
// get compute queue
auto queue = torch_mlu::getCurQueue();
......@@ -222,15 +285,33 @@ Tensor ms_deform_attn_mlu_forward(const Tensor& value,
cnrtDataType_t data_type = torch_mlu::toCnrtDtype(value.dtype());
// launch kernel
CNLOG(INFO) << "Launch Kernel MLUKernelMsDeformAttnForward<<<" << k_dim.x
<< ", " << k_dim.y << ", " << k_dim.z << ">>>";
KernelMsDeformAttnForward(
switch (policy) {
default: {
VLOG(5) << "MsDeformAttnForward Policy not supported";
} break;
case MS_DEFORM_ATTN_FORWARD_DEFAULT: {
CNLOG(INFO) << "Launch Kernel MLUKernelMsDeformAttnForwardDefault<<<"
<< k_dim.x << ", " << k_dim.y << ", " << k_dim.z << ">>>";
KernelMsDeformAttnForwardDefault(
k_dim, k_type, queue, data_type, (char*)value_ptr,
(char*)spatial_shapes_ptr, (char*)level_start_index_ptr,
(char*)sampling_loc_ptr, (char*)attn_weight_ptr, batch_size, num_keys,
num_heads, channels, num_levels, num_queries, num_points,
(char*)output_ptr);
break;
}
case MS_DEFORM_ATTN_FORWARD_SMALL_CHANNEL: {
CNLOG(INFO) << "Launch Kernel MLUKernelMsDeformAttnForwardSmallChannel<<<"
<< k_dim.x << ", " << k_dim.y << ", " << k_dim.z << ">>>";
KernelMsDeformAttnForwardSmallChannel(
k_dim, k_type, queue, data_type, (char*)value_ptr,
(char*)spatial_shapes_ptr, (char*)level_start_index_ptr,
(char*)sampling_loc_ptr, (char*)attn_weight_ptr, batch_size, num_keys,
num_heads, channels, num_levels, num_queries, num_points,
(char*)output_ptr);
break;
}
}
output = output.view({batch_size, num_queries, num_heads * channels});
return output;
......@@ -391,14 +472,32 @@ void ms_deform_attn_mlu_backward(
// launch kernel
CNLOG(INFO) << "Launch Kernel MLUKernelMsDeformAttnBackward<<<" << k_dim.x
<< ", " << k_dim.y << ", " << k_dim.z << ">>>";
KernelMsDeformAttnBackward(
MsDeformAttnBackwardKernelPolicy kernelPolicy =
msDeformAttnBackwardPolicyFunc(channels, num_levels, num_points,
num_heads);
switch (kernelPolicy) {
default: {
VLOG(5) << "NotImplemented.";
} break;
case MS_DEFORM_ATTN_BACKWARD_DEFAULT: {
KernelMsDeformAttnBackwardDefaultKernel(
k_dim, k_type, queue, data_type, (float*)value_ptr,
(int32_t*)spatial_shapes_ptr, (int32_t*)level_start_index_ptr,
(float*)sampling_loc_ptr, (float*)attn_weight_ptr,
(float*)grad_output_ptr, batch_size, num_keys, num_heads, channels,
num_levels, num_queries, num_points, (float*)grad_value_ptr,
(float*)grad_sampling_loc_ptr, (float*)grad_attn_weight_ptr);
} break;
case MS_DEFORM_ATTN_BACKWARD_SMALL_CHANNEL: {
KernelMsDeformAttnBackwardSmallChannelsKernel(
k_dim, k_type, queue, data_type, (float*)value_ptr,
(int32_t*)spatial_shapes_ptr, (int32_t*)level_start_index_ptr,
(float*)sampling_loc_ptr, (float*)attn_weight_ptr,
(float*)grad_output_ptr, batch_size, num_keys, num_heads, channels,
num_levels, num_queries, num_points, (float*)grad_value_ptr,
(float*)grad_sampling_loc_ptr, (float*)grad_attn_weight_ptr);
} break;
}
}
Tensor ms_deform_attn_impl_forward(const Tensor& value,
......
/*************************************************************************
* Copyright (C) 2021 Cambricon.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS
* OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
* MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
* IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
* CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
* TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
* SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
*************************************************************************/
#include "mlu_common_helper.h"
Tensor nms_rotated_mlu(Tensor boxes, Tensor scores, float iou_threshold) {
if (boxes.numel() == 0) {
return at::empty({0}, boxes.options().dtype(at::kLong));
}
int boxes_num = boxes.size(0);
auto boxes_ = torch_mlu::cnnl::ops::cnnl_contiguous(boxes);
auto scores_ = torch_mlu::cnnl::ops::cnnl_contiguous(scores);
auto output = at::empty({boxes_num}, boxes.options().dtype(at::kInt));
auto output_size = at::empty({1}, scores.options().dtype(at::kInt));
MluOpTensorDescriptor boxes_desc, scores_desc, output_desc;
boxes_desc.set(boxes_);
scores_desc.set(scores_);
output_desc.set(output);
// workspace
size_t workspace_size = 0;
auto handle = mluOpGetCurrentHandle();
mluOpGetNmsRotatedWorkspaceSize(handle, boxes_desc.desc(), &workspace_size);
auto workspace = at::empty(workspace_size, boxes.options().dtype(at::kByte));
auto boxes_impl = torch_mlu::getMluTensorImpl(boxes_);
auto boxes_ptr = boxes_impl->cnnlMalloc();
auto scores_impl = torch_mlu::getMluTensorImpl(scores_);
auto scores_ptr = scores_impl->cnnlMalloc();
auto workspace_impl = torch_mlu::getMluTensorImpl(workspace);
auto workspace_ptr = workspace_impl->cnnlMalloc();
auto output_impl = torch_mlu::getMluTensorImpl(output);
auto output_ptr = output_impl->cnnlMalloc();
auto output_size_impl = torch_mlu::getMluTensorImpl(output_size);
auto output_size_ptr = output_size_impl->cnnlMalloc();
mluOpNmsRotated(handle, iou_threshold, boxes_desc.desc(), boxes_ptr,
scores_desc.desc(), scores_ptr, workspace_ptr, workspace_size,
output_desc.desc(), output_ptr, (int *)output_size_ptr);
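// Editorial note: reading output_size via .cpu() below implies a blocking
// device-to-host copy, so the kernel result is ready before output_num is
// used.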
int output_num = *static_cast<int *>(output_size.cpu().data_ptr());
auto ret = output.to(boxes.options().dtype(at::kLong));
return ret.slice(0, 0, output_num);
}
/*************************************************************************
* Copyright (C) 2022 Cambricon.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS
* OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
* MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
* IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
* CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
* TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
* SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
*************************************************************************/
#include <torch/script.h>
#include <vector>
#include "mlu_common_helper.h"
#include "pytorch_device_registry.hpp"
#include "pytorch_mlu_helper.hpp"
template <unsigned NDim>
std::vector<torch::Tensor> GetIndicePairsForwardMLUKernelLauncher(
torch::Tensor indices, int64_t batchSize,
std::vector<int64_t> outSpatialShape, std::vector<int64_t> spatialShape,
std::vector<int64_t> kernelSize, std::vector<int64_t> stride,
std::vector<int64_t> padding, std::vector<int64_t> dilation,
std::vector<int64_t> outPadding, int64_t _subM, int64_t _transpose) {
// The following code is copied from
// mmcv/ops/csrc/pytorch/cuda/spconv_ops_cuda.cu to ensure the output is
// available for network training. The outputs of this function have the
// correct shape but wrong values.
auto numAct = indices.size(0);
auto kernelVolume = kernelSize[0];
int sub_m = (int)_subM;
int transpose = (int)_transpose;
int batch = (int)batchSize;
auto coorDim = indices.size(1) - 1;
for (int i = 1; i < kernelSize.size(); ++i) {
kernelVolume *= kernelSize[i];
}
auto outputVolume = outSpatialShape[0];
for (int i = 1; i < outSpatialShape.size(); ++i) {
outputVolume *= outSpatialShape[i];
}
torch::Tensor indicePairs = at::full({kernelVolume, 2, numAct}, -1,
indices.options().dtype(at::kInt));
torch::Tensor indiceNum =
at::zeros({kernelVolume}, indices.options().dtype(at::kInt));
int out_size = sub_m == 1
? numAct
: std::min(numAct * kernelVolume, batch * outputVolume);
torch::Tensor out_indices =
at::zeros({out_size, coorDim + 1}, indices.options().dtype(at::kInt));
auto indices_contiguous = torch_mlu::cnnl::ops::cnnl_contiguous(
indices, at::MemoryFormat::Contiguous);
auto indicePairs_contiguous = torch_mlu::cnnl::ops::cnnl_contiguous(
indicePairs, at::MemoryFormat::Contiguous);
auto indiceNum_contiguous = torch_mlu::cnnl::ops::cnnl_contiguous(
indiceNum, at::MemoryFormat::Contiguous);
auto out_indices_contiguous = torch_mlu::cnnl::ops::cnnl_contiguous(
out_indices, at::MemoryFormat::Contiguous);
std::vector<int> input_space;
std::vector<int> filter_space;
std::vector<int> output_space;
std::vector<int> padding32;
std::vector<int> stride32;
std::vector<int> dilation32;
for (int i = 0; i < NDim; i++) {
input_space.push_back(spatialShape[i]);
filter_space.push_back(kernelSize[i]);
output_space.push_back(outSpatialShape[i]);
padding32.push_back(padding[i]);
stride32.push_back(stride[i]);
dilation32.push_back(dilation[i]);
}
MluOpTensorDescriptor indices_desc, out_indices_desc, indicePairs_desc,
indiceNum_desc;
indices_desc.set(indices_contiguous);
indicePairs_desc.set(indicePairs_contiguous);
indiceNum_desc.set(indiceNum_contiguous);
out_indices_desc.set(out_indices_contiguous);
{
mluOpTensorLayout_t layout = MLUOP_LAYOUT_ARRAY;
mluOpDataType_t dtype = MLUOP_DTYPE_INT32;
std::vector<int> dims;
dims = {numAct, coorDim + 1};
mluOpSetTensorDescriptor(indices_desc.desc(), layout, dtype, dims.size(),
dims.data());
dims = {kernelVolume, 2, numAct};
mluOpSetTensorDescriptor(indicePairs_desc.desc(), layout, dtype,
dims.size(), dims.data());
dims = {kernelVolume};
mluOpSetTensorDescriptor(indiceNum_desc.desc(), layout, dtype, dims.size(),
dims.data());
dims = {out_size, coorDim + 1};
mluOpSetTensorDescriptor(out_indices_desc.desc(), layout, dtype,
dims.size(), dims.data());
}
mluOpSparseConvolutionDescriptor_t sparse_conv_desc;
mluOpCreateSparseConvolutionDescriptor(&sparse_conv_desc);
mluOpSetSparseConvolutionDescriptor(
sparse_conv_desc, NDim + 2, batch, padding32.data(), stride32.data(),
dilation32.data(), input_space.data(), filter_space.data(),
output_space.data(), sub_m, transpose, 0);
auto handle = mluOpGetCurrentHandle();
size_t workspace_size = 0;
mluOpGetIndicePairsWorkspaceSize(
handle, sparse_conv_desc, indices_desc.desc(), indicePairs_desc.desc(),
out_indices_desc.desc(), indiceNum_desc.desc(), &workspace_size);
auto indice_workspace_size =
at::empty(workspace_size, indices.options().dtype(at::kByte));
auto indices_impl = torch_mlu::getMluTensorImpl(indices_contiguous);
auto out_indices_impl = torch_mlu::getMluTensorImpl(out_indices_contiguous);
auto indicePairs_impl = torch_mlu::getMluTensorImpl(indicePairs_contiguous);
auto indiceNum_impl = torch_mlu::getMluTensorImpl(indiceNum_contiguous);
auto indice_workspace_impl =
torch_mlu::getMluTensorImpl(indice_workspace_size);
auto indices_ptr = indices_impl->cnnlMalloc();
auto out_indices_ptr = out_indices_impl->cnnlMalloc();
auto indicePairs_ptr = indicePairs_impl->cnnlMalloc();
auto indiceNum_ptr = indiceNum_impl->cnnlMalloc();
auto indice_workspace_ptr = indice_workspace_impl->cnnlMalloc();
mluOpGetIndicePairs(handle, sparse_conv_desc, indices_desc.desc(),
indices_ptr, indice_workspace_ptr, workspace_size,
indicePairs_desc.desc(), indicePairs_ptr,
out_indices_desc.desc(), out_indices_ptr,
indiceNum_desc.desc(), indiceNum_ptr);
int num_act_out = 0;
mluOpGetSparseConvolutionNumActOut(sparse_conv_desc, &num_act_out);
mluOpDestroySparseConvolutionDescriptor(sparse_conv_desc);
if (!sub_m) {
return {out_indices.slice(0, 0, num_act_out), indicePairs, indiceNum};
} else {
return {indices, indicePairs, indiceNum};
}
}
torch::Tensor IndiceConvForwardMLUKernelLauncher(
torch::Tensor features, torch::Tensor filters, torch::Tensor indicePairs,
torch::Tensor indiceNum, int64_t numActOut, int64_t _inverse,
int64_t _subM) {
auto indice_num_cpu = indiceNum.to({torch::kCPU});
auto indice_num_cpu_64 = indice_num_cpu.data_ptr<int>();
int indice_num_len = indiceNum.numel();
int64_t indice_num[indice_num_len];
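// Editorial note: this variable-length array is a compiler extension, not
// standard C++; std::vector<int64_t> would be the portable alternative.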
for (int i = 0; i < indice_num_len; ++i) {
indice_num[i] = (int64_t)(((int *)indice_num_cpu_64)[i]);
}
// generate empty output
int C = filters.dim() == 4 ? filters.size(3) : filters.size(4);
torch::Tensor output =
at::zeros({numActOut, C}, features.options().dtype(at::kFloat));
// generate descriptor
auto features_contiguous = torch_mlu::cnnl::ops::cnnl_contiguous(
features, at::MemoryFormat::Contiguous);
auto filters_contiguous = torch_mlu::cnnl::ops::cnnl_contiguous(
filters, at::MemoryFormat::Contiguous);
auto indice_pairs_contiguous = torch_mlu::cnnl::ops::cnnl_contiguous(
indicePairs, at::MemoryFormat::Contiguous);
auto output_contiguous = torch_mlu::cnnl::ops::cnnl_contiguous(
output, at::MemoryFormat::Contiguous);
MluOpTensorDescriptor features_desc, filters_desc, indice_pairs_desc,
output_desc;
features_desc.set(features_contiguous);
filters_desc.set(filters_contiguous);
indice_pairs_desc.set(indice_pairs_contiguous);
output_desc.set(output_contiguous);
// set layout
{
mluOpTensorLayout_t layout;
mluOpDataType_t dtype;
int dim;
int dims[8];
// features_desc
mluOpGetTensorDescriptor(features_desc.desc(), &layout, &dtype, &dim, dims);
mluOpSetTensorDescriptor(features_desc.desc(), MLUOP_LAYOUT_ARRAY, dtype,
dim, dims);
// filters_desc
mluOpGetTensorDescriptor(filters_desc.desc(), &layout, &dtype, &dim, dims);
mluOpSetTensorDescriptor(filters_desc.desc(), MLUOP_LAYOUT_ARRAY, dtype,
dim, dims);
// indice_pairs_desc
mluOpGetTensorDescriptor(indice_pairs_desc.desc(), &layout, &dtype, &dim,
dims);
mluOpSetTensorDescriptor(indice_pairs_desc.desc(), MLUOP_LAYOUT_ARRAY,
dtype, dim, dims);
// output_desc
mluOpGetTensorDescriptor(output_desc.desc(), &layout, &dtype, &dim, dims);
mluOpSetTensorDescriptor(output_desc.desc(), MLUOP_LAYOUT_ARRAY, dtype, dim,
dims);
}
auto handle = mluOpGetCurrentHandle();
size_t workspace_size = 0;
mluOpGetIndiceConvolutionForwardWorkspaceSize(
handle, features_desc.desc(), filters_desc.desc(),
indice_pairs_desc.desc(), output_desc.desc(), indice_num, numActOut,
_inverse, _subM, &workspace_size);
auto workspace =
at::empty(workspace_size, features.options().dtype(at::kByte));
auto features_impl = torch_mlu::getMluTensorImpl(features_contiguous);
auto filters_impl = torch_mlu::getMluTensorImpl(filters_contiguous);
auto indice_pairs_impl = torch_mlu::getMluTensorImpl(indice_pairs_contiguous);
auto workspace_impl = torch_mlu::getMluTensorImpl(workspace);
auto features_ptr = features_impl->cnnlMalloc();
auto filters_ptr = filters_impl->cnnlMalloc();
auto indice_pairs_ptr = indice_pairs_impl->cnnlMalloc();
auto workspace_ptr = workspace_impl->cnnlMalloc();
// outputs
auto output_impl = torch_mlu::getMluTensorImpl(output);
auto output_ptr = output_impl->cnnlMalloc();
mluOpIndiceConvolutionForward(
handle, features_desc.desc(), features_ptr, filters_desc.desc(),
filters_ptr, indice_pairs_desc.desc(), indice_pairs_ptr, indice_num,
numActOut, _inverse, _subM, workspace_ptr, workspace_size,
output_desc.desc(), output_ptr);
return output;
}
std::vector<torch::Tensor> IndiceConvBackwardMLUKernelLauncher(
torch::Tensor features, torch::Tensor filters, torch::Tensor outGrad,
torch::Tensor indicePairs, torch::Tensor indiceNum, int64_t _inverse,
int64_t _subM) {
auto indice_num_cpu = indiceNum.to({torch::kCPU});
auto indice_num_cpu_64 = indice_num_cpu.data_ptr<int>();
int indice_num_len = indiceNum.numel();
int64_t indice_num[indice_num_len];
for (int i = 0; i < indice_num_len; ++i) {
indice_num[i] = (int64_t)(((int *)(indice_num_cpu_64))[i]);
}
// generate empty input_grad
torch::Tensor input_grad = at::zeros({features.size(0), features.size(1)},
features.options().dtype(at::kFloat));
torch::Tensor filters_grad;
if (filters.dim() == 4) {
int h = filters.size(0);
int w = filters.size(1);
int c = filters.size(2);
int n = filters.size(3);
filters_grad = at::zeros({h, w, c, n}, filters.options().dtype(at::kFloat));
} else if (filters.dim() == 5) {
int d = filters.size(0);
int h = filters.size(1);
int w = filters.size(2);
int c = filters.size(3);
int n = filters.size(4);
filters_grad =
at::zeros({d, h, w, c, n}, filters.options().dtype(at::kFloat));
}
auto features_contiguous = torch_mlu::cnnl::ops::cnnl_contiguous(
features, at::MemoryFormat::Contiguous);
auto filters_contiguous = torch_mlu::cnnl::ops::cnnl_contiguous(
filters, at::MemoryFormat::Contiguous);
auto output_grad_contiguous = torch_mlu::cnnl::ops::cnnl_contiguous(
outGrad, at::MemoryFormat::Contiguous);
auto indice_pairs_contiguous = torch_mlu::cnnl::ops::cnnl_contiguous(
indicePairs, at::MemoryFormat::Contiguous);
auto input_grad_contiguous = torch_mlu::cnnl::ops::cnnl_contiguous(
features, at::MemoryFormat::Contiguous);
auto filters_grad_contiguous = torch_mlu::cnnl::ops::cnnl_contiguous(
filters, at::MemoryFormat::Contiguous);
MluOpTensorDescriptor features_desc, output_grad_desc, filters_desc,
indice_pairs_desc, input_grad_desc, filters_grad_desc;
features_desc.set(features_contiguous);
filters_desc.set(filters_contiguous);
output_grad_desc.set(output_grad_contiguous);
indice_pairs_desc.set(indice_pairs_contiguous);
input_grad_desc.set(input_grad_contiguous);
filters_grad_desc.set(filters_grad_contiguous);
// need to set desc layout with mluOp functions
{
mluOpTensorLayout_t layout;
mluOpDataType_t dtype;
int dim;
int dims[8];
// features_desc
mluOpGetTensorDescriptor(features_desc.desc(), &layout, &dtype, &dim, dims);
mluOpSetTensorDescriptor(features_desc.desc(), MLUOP_LAYOUT_ARRAY, dtype,
dim, dims);
// filters_desc
mluOpGetTensorDescriptor(filters_desc.desc(), &layout, &dtype, &dim, dims);
if (dim == 4) {
mluOpSetTensorDescriptor(filters_desc.desc(), MLUOP_LAYOUT_HWCN, dtype,
dim, dims);
} else {
mluOpSetTensorDescriptor(filters_desc.desc(), MLUOP_LAYOUT_ARRAY, dtype,
dim, dims);
}
// output_grad_desc
mluOpGetTensorDescriptor(output_grad_desc.desc(), &layout, &dtype, &dim,
dims);
mluOpSetTensorDescriptor(output_grad_desc.desc(), MLUOP_LAYOUT_ARRAY, dtype,
dim, dims);
// indice_pairs_desc
mluOpGetTensorDescriptor(indice_pairs_desc.desc(), &layout, &dtype, &dim,
dims);
mluOpSetTensorDescriptor(indice_pairs_desc.desc(), MLUOP_LAYOUT_ARRAY,
dtype, dim, dims);
// input_grad_desc
mluOpGetTensorDescriptor(input_grad_desc.desc(), &layout, &dtype, &dim,
dims);
mluOpSetTensorDescriptor(input_grad_desc.desc(), MLUOP_LAYOUT_ARRAY, dtype,
dim, dims);
// filters_grad_desc: mirror the filters layout (HWCN for 4-D filters, ARRAY
// otherwise) so the filter-gradient op sees the same data layout
mluOpGetTensorDescriptor(filters_grad_desc.desc(), &layout, &dtype, &dim,
dims);
if (dim == 4) {
mluOpSetTensorDescriptor(filters_grad_desc.desc(), MLUOP_LAYOUT_HWCN, dtype,
dim, dims);
} else {
mluOpSetTensorDescriptor(filters_grad_desc.desc(), MLUOP_LAYOUT_ARRAY, dtype,
dim, dims);
}
}
auto handle = mluOpGetCurrentHandle();
size_t data_workspace_size = 0;
mluOpGetIndiceConvolutionBackwardDataWorkspaceSize(
handle, output_grad_desc.desc(), filters_desc.desc(),
indice_pairs_desc.desc(), input_grad_desc.desc(), indice_num, _inverse,
&data_workspace_size);
size_t filters_workspace_size = 0;
mluOpGetIndiceConvolutionBackwardFilterWorkspaceSize(
handle, features_desc.desc(), output_grad_desc.desc(),
indice_pairs_desc.desc(), filters_grad_desc.desc(), indice_num, _inverse,
_subM, &filters_workspace_size);
auto indice_convbpdata_workspace =
at::empty(data_workspace_size, features.options().dtype(at::kByte));
auto indice_convbpfilter_workspace =
at::empty(filters_workspace_size, filters.options().dtype(at::kByte));
auto features_impl = torch_mlu::getMluTensorImpl(features_contiguous);
auto filters_impl = torch_mlu::getMluTensorImpl(filters_contiguous);
auto output_grad_impl = torch_mlu::getMluTensorImpl(output_grad_contiguous);
auto indice_pairs_impl = torch_mlu::getMluTensorImpl(indice_pairs_contiguous);
auto indice_convbpdata_workspace_impl =
torch_mlu::getMluTensorImpl(indice_convbpdata_workspace);
auto indice_convbpfilter_workspace_impl =
torch_mlu::getMluTensorImpl(indice_convbpfilter_workspace);
auto features_ptr = features_impl->cnnlMalloc();
auto filters_ptr = filters_impl->cnnlMalloc();
auto output_grad_ptr = output_grad_impl->cnnlMalloc();
auto indice_pairs_ptr = indice_pairs_impl->cnnlMalloc();
auto indice_convbpdata_workspace_ptr =
indice_convbpdata_workspace_impl->cnnlMalloc();
auto indice_convbpfilter_workspace_ptr =
indice_convbpfilter_workspace_impl->cnnlMalloc();
// outputs
auto input_grad_impl = torch_mlu::getMluTensorImpl(input_grad);
auto input_grad_ptr = input_grad_impl->cnnlMalloc();
auto filters_grad_impl = torch_mlu::getMluTensorImpl(filters_grad);
auto filters_grad_ptr = filters_grad_impl->cnnlMalloc();
mluOpIndiceConvolutionBackwardData(
handle, output_grad_desc.desc(), output_grad_ptr, filters_desc.desc(),
filters_ptr, indice_pairs_desc.desc(), indice_pairs_ptr, indice_num,
_inverse, _subM, indice_convbpdata_workspace_ptr, data_workspace_size,
input_grad_desc.desc(), input_grad_ptr);
mluOpIndiceConvolutionBackwardFilter(
handle, features_desc.desc(), features_ptr, output_grad_desc.desc(),
output_grad_ptr, indice_pairs_desc.desc(), indice_pairs_ptr, indice_num,
_inverse, _subM, indice_convbpfilter_workspace_ptr,
filters_workspace_size, filters_grad_desc.desc(), filters_grad_ptr);
std::vector<torch::Tensor> result;
result.push_back(input_grad);
result.push_back(filters_grad);
return result;
}
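// Note: the two backward ops above are independent of each other: input_grad
// is computed from (output_grad, filters) and filters_grad from
// (features, output_grad), which is why each call gets its own workspace.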
torch::Tensor indice_conv_forward_mlu(torch::Tensor features,
torch::Tensor filters,
torch::Tensor indicePairs,
torch::Tensor indiceNum,
int64_t numActOut, int64_t _inverse,
int64_t _subM) {
return IndiceConvForwardMLUKernelLauncher(
features, filters, indicePairs, indiceNum, numActOut, _inverse, _subM);
}
std::vector<torch::Tensor> indice_conv_backward_mlu(
torch::Tensor features, torch::Tensor filters, torch::Tensor outGrad,
torch::Tensor indicePairs, torch::Tensor indiceNum, int64_t _inverse,
int64_t _subM) {
return IndiceConvBackwardMLUKernelLauncher(
features, filters, outGrad, indicePairs, indiceNum, _inverse, _subM);
}
torch::Tensor indice_conv_forward_impl(torch::Tensor features,
torch::Tensor filters,
torch::Tensor indicePairs,
torch::Tensor indiceNum,
int64_t numActOut, int64_t _inverse,
int64_t _subM);
std::vector<torch::Tensor> indice_conv_backward_impl(
torch::Tensor features, torch::Tensor filters, torch::Tensor outGrad,
torch::Tensor indicePairs, torch::Tensor indiceNum, int64_t _inverse,
int64_t _subM);
REGISTER_DEVICE_IMPL(indice_conv_forward_impl, MLU, indice_conv_forward_mlu);
REGISTER_DEVICE_IMPL(indice_conv_backward_impl, MLU, indice_conv_backward_mlu);
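// Usage sketch (illustrative, not part of this file; module names assumed
// from mmcv's Python sparse-conv wrappers): once mmcv is built with MLU
// support, the generic entry points dispatch here automatically, e.g.
//   from mmcv.ops import SparseConvTensor, SparseConv3d
//   x = SparseConvTensor(feats.to('mlu'), coords.to('mlu'), spatial_shape, 1)
//   y = SparseConv3d(16, 32, kernel_size=3).to('mlu')(x)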
template std::vector<torch::Tensor> GetIndicePairsForwardMLUKernelLauncher<2>(
torch::Tensor indices, int64_t batchSize,
std::vector<int64_t> outSpatialShape, std::vector<int64_t> spatialShape,
std::vector<int64_t> kernelSize, std::vector<int64_t> stride,
std::vector<int64_t> padding, std::vector<int64_t> dilation,
std::vector<int64_t> outPadding, int64_t _subM, int64_t _transpose);
template std::vector<torch::Tensor> GetIndicePairsForwardMLUKernelLauncher<3>(
torch::Tensor indices, int64_t batchSize,
std::vector<int64_t> outSpatialShape, std::vector<int64_t> spatialShape,
std::vector<int64_t> kernelSize, std::vector<int64_t> stride,
std::vector<int64_t> padding, std::vector<int64_t> dilation,
std::vector<int64_t> outPadding, int64_t _subM, int64_t _transpose);
template std::vector<torch::Tensor> GetIndicePairsForwardMLUKernelLauncher<4>(
torch::Tensor indices, int64_t batchSize,
std::vector<int64_t> outSpatialShape, std::vector<int64_t> spatialShape,
std::vector<int64_t> kernelSize, std::vector<int64_t> stride,
std::vector<int64_t> padding, std::vector<int64_t> dilation,
std::vector<int64_t> outPadding, int64_t _subM, int64_t _transpose);
/*************************************************************************
* Copyright (C) 2022 by Cambricon.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS
* OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
* MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
* IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
* CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
* TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
* SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
*************************************************************************/
#include "pytorch_device_registry.hpp"
#include "pytorch_mlu_helper.hpp"
#define MIN(a, b) (((a) < (b)) ? (a) : (b))
void KernelDynamicVoxelize(
cnrtDim3_t k_dim, cnrtFunctionType_t k_type, cnrtQueue_t queue,
const void *points, void *coors, const float voxel_x, const float voxel_y,
const float voxel_z, const float coors_x_min, const float coors_y_min,
const float coors_z_min, const float coors_x_max, const float coors_y_max,
const float coors_z_max, const int32_t grid_x, const int32_t grid_y,
const int32_t grid_z, const int32_t num_points, const int32_t num_features);
void KernelPoint2Voxel(cnrtDim3_t k_dim, cnrtFunctionType_t k_type,
cnrtQueue_t queue, void *coors, void *point_to_pointidx,
void *point_to_voxelidx, const int32_t num_points,
const int32_t max_points);
void KernelCalcPointsPerVoxel(cnrtDim3_t k_dim, cnrtFunctionType_t k_type,
cnrtQueue_t queue, void *point_to_pointidx,
void *point_to_voxelidx, void *coor_to_voxelidx,
void *num_points_per_voxel, void *voxel_num,
const int32_t max_voxels,
const int32_t num_points);
void KernelAssignVoxelsCoors(cnrtDim3_t k_dim, cnrtFunctionType_t k_type,
cnrtQueue_t queue, const void *points,
void *temp_coors, void *point_to_voxelidx,
void *coor_to_voxelidx, void *voxels, void *coors,
const int32_t max_points, const int32_t num_points,
const int32_t num_features);
// policy function
static void policyFuncDefault(cnrtDim3_t *k_dim, cnrtFunctionType_t *k_type,
const int num_points) {
k_dim->x = torch_mlu::getDeviceAttr(cnrtAttrMcorePerCluster);
k_dim->y = MIN((num_points + k_dim->x - 1) / k_dim->x,
torch_mlu::getDeviceAttr(cnrtAttrClusterCount));
k_dim->z = 1;
*k_type = CNRT_FUNC_TYPE_UNION1;
}
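// Worked example (illustrative device numbers): with 4 cores per cluster and
// 8 clusters, num_points = 1000 gives
//   k_dim = {4, min((1000 + 3) / 4, 8), 1} = {4, 8, 1},
// i.e. a UNION1 task spread over all 32 cores.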
// policy function
static void policyFuncCalcPointsPerVoxel(cnrtDim3_t *k_dim,
cnrtFunctionType_t *k_type,
const int num_points) {
k_dim->x = 1;
k_dim->y = 1;
k_dim->z = 1;
*k_type = CNRT_FUNC_TYPE_BLOCK;
}
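// The per-voxel bookkeeping in step 3 is launched as a single-core BLOCK
// task (k_dim = {1, 1, 1}), presumably because voxel indices must be
// assigned in point order, which makes the scan inherently sequential.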
int HardVoxelizeForwardMLUKernelLauncher(
const at::Tensor &points, at::Tensor &voxels, at::Tensor &coors,
at::Tensor &num_points_per_voxel, const std::vector<float> voxel_size,
const std::vector<float> coors_range, const int max_points,
const int max_voxels, const int NDim = 3) {
// check datatype
TORCH_CHECK(points.scalar_type() == at::kFloat,
"points type should be Float, got ", points.scalar_type(), ".");
TORCH_CHECK(voxels.scalar_type() == at::kFloat,
"voxels type should be Float, got ", voxels.scalar_type(), ".");
TORCH_CHECK(coors.scalar_type() == at::kInt,
"coors type should be Int, got ", coors.scalar_type(), ".");
TORCH_CHECK(num_points_per_voxel.scalar_type() == at::kInt,
"num_points_per_voxel type should be Int, got ",
num_points_per_voxel.scalar_type(), ".");
// check shape
TORCH_CHECK(points.dim() == 2, "points should be a 2d tensor, got ",
points.dim(), "D.");
TORCH_CHECK(voxels.dim() == 3, "voxels should be a 3d tensor, got ",
voxels.dim(), "D.");
TORCH_CHECK(coors.dim() == 2, "coors should be a 2d tensor, got ",
coors.dim(), "D.");
TORCH_CHECK(num_points_per_voxel.dim() == 1,
"num_points_per_voxel should be a 1d tensor, got ",
num_points_per_voxel.dim(), "D.");
const int num_points = points.size(0);
const int num_features = points.size(1);
TORCH_CHECK(points.size(0) == num_points,
"the 1st dimension of points should be num_points, got ",
points.size(0), ".");
TORCH_CHECK(points.size(1) == num_features,
"the 2nd dimension of points should be num_features, got ",
points.size(1), ".");
TORCH_CHECK(voxels.size(0) == max_voxels,
"the 1st dimension of voxels should be max_voxels, got ",
voxels.size(0), ".");
TORCH_CHECK(voxels.size(1) == max_points,
"the 2nd dimension of voxels should be max_points, got ",
voxels.size(1), ".");
TORCH_CHECK(voxels.size(2) == num_features,
"the 3rd dimension of voxels should be num_features, got ",
voxels.size(2), ".");
TORCH_CHECK(coors.size(0) == max_voxels,
"the 1st dimension of coors should be max_voxels, got ",
coors.size(0), ".");
TORCH_CHECK(coors.size(1) == 3,
"the 2nd dimension of coors should be 3, got ", coors.size(1), ".");
TORCH_CHECK(num_points_per_voxel.size(0) == max_voxels,
"the 1st dimension of num_points_per_voxel should be max_voxels, got ",
num_points_per_voxel.size(0), ".");
// large tensor check
const size_t max_input_size = 2147483648;
TORCH_CHECK(points.numel() < max_input_size,
"points element num should be less than 2^31, got ",
points.numel(), ".");
TORCH_CHECK(voxels.numel() < max_input_size,
"voxels element num should be less than 2^31, got ",
voxels.numel(), ".");
TORCH_CHECK(coors.numel() < max_input_size,
"coors element num should be less than 2^31, got ", coors.numel(),
".");
// check zero element
if (max_points == 0 || max_voxels == 0) {
return 0;
}
// get compute queue
auto queue = torch_mlu::getCurQueue();
// get ptr of tensors
auto points_ = points.contiguous();
auto points_impl = torch_mlu::getMluTensorImpl(points_);
auto points_ptr = points_impl->cnnlMalloc();
auto voxels_ = voxels.contiguous();
auto voxels_impl = torch_mlu::getMluTensorImpl(voxels_);
auto voxels_ptr = voxels_impl->cnnlMalloc();
auto coors_ = coors.contiguous();
auto coors_impl = torch_mlu::getMluTensorImpl(coors_);
auto coors_ptr = coors_impl->cnnlMalloc();
auto num_points_per_voxel_ = num_points_per_voxel.contiguous();
auto num_points_per_voxel_impl =
torch_mlu::getMluTensorImpl(num_points_per_voxel_);
auto num_points_per_voxel_ptr = num_points_per_voxel_impl->cnnlMalloc();
// calculate task dimension
cnrtDim3_t k_dim;
cnrtFunctionType_t k_type;
policyFuncDefault(&k_dim, &k_type, num_points);
// 1. link point to corresponding voxel coors
const float voxel_x = voxel_size[0];
const float voxel_y = voxel_size[1];
const float voxel_z = voxel_size[2];
const float coors_x_min = coors_range[0];
const float coors_y_min = coors_range[1];
const float coors_z_min = coors_range[2];
const float coors_x_max = coors_range[3];
const float coors_y_max = coors_range[4];
const float coors_z_max = coors_range[5];
const int grid_x = round((coors_x_max - coors_x_min) / voxel_x);
const int grid_y = round((coors_y_max - coors_y_min) / voxel_y);
const int grid_z = round((coors_z_max - coors_z_min) / voxel_z);
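// Worked example (illustrative values): voxel_size = [0.2, 0.2, 4] with
// coors_range = [-51.2, -51.2, -5, 51.2, 51.2, 3] gives
// grid_x = grid_y = round(102.4 / 0.2) = 512 and grid_z = round(8 / 4) = 2.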
auto temp_coors =
at::zeros({NDim, num_points}, points.options().dtype(at::kInt))
.contiguous();
auto temp_coors_impl = torch_mlu::getMluTensorImpl(temp_coors);
auto temp_coors_ptr = temp_coors_impl->cnnlMalloc();
KernelDynamicVoxelize(k_dim, k_type, queue, points_ptr, temp_coors_ptr,
voxel_x, voxel_y, voxel_z, coors_x_min, coors_y_min,
coors_z_min, coors_x_max, coors_y_max, coors_z_max,
grid_x, grid_y, grid_z, num_points, num_features);
// 2. map each point to the index of its voxel and detect duplicate coors
auto point_to_pointidx =
    at::zeros({num_points}, points.options().dtype(at::kInt)).contiguous();
auto point_to_pointidx_impl = torch_mlu::getMluTensorImpl(point_to_pointidx);
auto point_to_pointidx_ptr = point_to_pointidx_impl->cnnlMalloc();
auto point_to_voxelidx =
    at::zeros({num_points}, points.options().dtype(at::kInt)).contiguous();
auto point_to_voxelidx_impl = torch_mlu::getMluTensorImpl(point_to_voxelidx);
auto point_to_voxelidx_ptr = point_to_voxelidx_impl->cnnlMalloc();
KernelPoint2Voxel(k_dim, k_type, queue, temp_coors_ptr, point_to_pointidx_ptr,
point_to_voxelidx_ptr, num_points, max_points);
// calculate task dimension
cnrtDim3_t k_dim_calc_points_per_voxel;
cnrtFunctionType_t k_type_calc_points_per_voxel;
policyFuncCalcPointsPerVoxel(&k_dim_calc_points_per_voxel,
&k_type_calc_points_per_voxel, num_points);
// 3. determine voxel num and voxel's coor index
auto coor_to_voxelidx =
    at::zeros({num_points}, points.options().dtype(at::kInt)).contiguous();
auto coor_to_voxelidx_impl = torch_mlu::getMluTensorImpl(coor_to_voxelidx);
auto coor_to_voxelidx_ptr = coor_to_voxelidx_impl->cnnlMalloc();
auto voxel_num =
    at::zeros({1}, points.options().dtype(at::kInt)).contiguous();
auto voxel_num_impl = torch_mlu::getMluTensorImpl(voxel_num);
auto voxel_num_ptr = voxel_num_impl->cnnlMalloc();
KernelCalcPointsPerVoxel(
k_dim_calc_points_per_voxel, k_type_calc_points_per_voxel, queue,
point_to_pointidx_ptr, point_to_voxelidx_ptr, coor_to_voxelidx_ptr,
num_points_per_voxel_ptr, voxel_num_ptr, max_voxels, num_points);
// 4. copy the point features and coors of each voxel into the output tensors
KernelAssignVoxelsCoors(k_dim, k_type, queue, points_ptr, temp_coors_ptr,
point_to_voxelidx_ptr, coor_to_voxelidx_ptr,
voxels_ptr, coors_ptr, max_points, num_points,
num_features);
auto voxel_num_cpu = voxel_num.to(at::kCPU);
int voxel_num_int = voxel_num_cpu.data_ptr<int>()[0];
return voxel_num_int;
}
int hard_voxelize_forward_mlu(const at::Tensor &points, at::Tensor &voxels,
at::Tensor &coors,
at::Tensor &num_points_per_voxel,
const std::vector<float> voxel_size,
const std::vector<float> coors_range,
const int max_points, const int max_voxels,
const int NDim) {
return HardVoxelizeForwardMLUKernelLauncher(
points, voxels, coors, num_points_per_voxel, voxel_size, coors_range,
max_points, max_voxels, NDim);
}
int hard_voxelize_forward_impl(const at::Tensor &points, at::Tensor &voxels,
at::Tensor &coors,
at::Tensor &num_points_per_voxel,
const std::vector<float> voxel_size,
const std::vector<float> coors_range,
const int max_points, const int max_voxels,
const int NDim);
REGISTER_DEVICE_IMPL(hard_voxelize_forward_impl, MLU,
hard_voxelize_forward_mlu);
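// Usage sketch (illustrative; assumes mmcv was built with MLU support):
//   from mmcv.ops import Voxelization
//   voxelize = Voxelization(voxel_size=[0.2, 0.2, 4],
//                           point_cloud_range=[-51.2, -51.2, -5, 51.2, 51.2, 3],
//                           max_num_points=32, max_voxels=20000)
//   voxels, coors, num_points_per_voxel = voxelize(points.to('mlu'))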
@@ -17,6 +17,11 @@ Tensor nms_rotated_npu(const Tensor dets, const Tensor scores,
const Tensor labels, const float iou_threshold);
#endif
#ifdef MMCV_WITH_MLU
Tensor nms_rotated_mlu(const Tensor dets, const Tensor scores,
const float iou_threshold);
#endif
// Interface for Python
// inline is needed to prevent multiple function definitions when this header is
// included by different cpps
@@ -36,6 +41,10 @@ Tensor nms_rotated(const Tensor dets, const Tensor scores, const Tensor order,
return nms_rotated_npu(dets, scores, labels, iou_threshold);
#else
AT_ERROR("Not compiled with NPU support");
#endif
#ifdef MMCV_WITH_MLU
} else if (dets.device().type() == at::kMLU) {
return nms_rotated_mlu(dets, scores, iou_threshold);
#endif
}
@@ -35,6 +35,26 @@ std::vector<torch::Tensor> get_indice_pairs_forward_cuda(
padding, dilation, outPadding, _subM, _transpose);
};
template <unsigned NDim>
std::vector<torch::Tensor> GetIndicePairsForwardMLUKernelLauncher(
torch::Tensor indices, int64_t batchSize,
std::vector<int64_t> outSpatialShape, std::vector<int64_t> spatialShape,
std::vector<int64_t> kernelSize, std::vector<int64_t> stride,
std::vector<int64_t> padding, std::vector<int64_t> dilation,
std::vector<int64_t> outPadding, int64_t _subM, int64_t _transpose);
template <unsigned NDim>
std::vector<torch::Tensor> get_indice_pairs_forward_mlu(
torch::Tensor indices, int64_t batchSize,
std::vector<int64_t> outSpatialShape, std::vector<int64_t> spatialShape,
std::vector<int64_t> kernelSize, std::vector<int64_t> stride,
std::vector<int64_t> padding, std::vector<int64_t> dilation,
std::vector<int64_t> outPadding, int64_t _subM, int64_t _transpose) {
return GetIndicePairsForwardMLUKernelLauncher<NDim>(
indices, batchSize, outSpatialShape, spatialShape, kernelSize, stride,
padding, dilation, outPadding, _subM, _transpose);
}
template <unsigned NDim>
std::vector<torch::Tensor> GetIndicePairsBackwardCUDAKernelLauncher(
torch::Tensor indices, torch::Tensor gridOut, int64_t batchSize,
@@ -71,6 +91,12 @@ std::vector<torch::Tensor> get_indice_pairs_forward(
padding, dilation, outPadding, _subM, _transpose);
#else
AT_ERROR("get_indice_pairs is not compiled with GPU support");
#endif
#ifdef MMCV_WITH_MLU
} else if (indices.device().type() == at::kMLU) {
return get_indice_pairs_forward_mlu<NDim>(
indices, batchSize, outSpatialShape, spatialShape, kernelSize, stride,
padding, dilation, outPadding, _subM, _transpose);
#endif
} else {
AT_ERROR("get_indice_pairs is not implemented on CPU");
@@ -410,8 +410,9 @@ def nms_rotated(dets: Tensor,
input_labels = scores.new_empty(0, dtype=torch.int)
else:
input_labels = labels
if dets.device.type in ('npu', 'mlu'):
    order = scores.new_empty(0, dtype=torch.long)
    if dets.device.type == 'npu':
        coefficient = 57.29578  # 180 / PI
        for i in range(dets.size()[0]):
            dets_cw[i][4] *= coefficient  # radians to angle
@@ -211,6 +211,7 @@ def get_extensions():
include_dirs = []
extra_objects = []
is_rocm_pytorch = False
try:
from torch.utils.cpp_extension import ROCM_HOME
@@ -238,16 +239,98 @@ def get_extensions():
torch.is_mlu_available()) or \
os.getenv('FORCE_MLU', '0') == '1':
from torch_mlu.utils.cpp_extension import MLUExtension
def get_mluops_version(file_path):
    with open(file_path) as f:
        for line in f:
            if re.search('MLUOP_MAJOR', line):
                major = line.strip().split(' ')[2]
            if re.search('MLUOP_MINOR', line):
                minor = line.strip().split(' ')[2]
            if re.search('MLUOP_PATCHLEVEL', line):
                patchlevel = line.strip().split(' ')[2]
    mluops_version = f'v{major}.{minor}.{patchlevel}'
    return mluops_version
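# Illustrative example (hypothetical version numbers): a header containing
#     #define MLUOP_MAJOR 0
#     #define MLUOP_MINOR 5
#     #define MLUOP_PATCHLEVEL 0
# makes get_mluops_version() return 'v0.5.0'. Note the parser assumes
# single-space-separated '#define NAME VALUE' lines.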
mmcv_mluops_version = get_mluops_version(
    './mmcv/ops/csrc/pytorch/mlu/mlu_common_helper.h')
mlu_ops_path = os.getenv('MMCV_MLU_OPS_PATH')
if mlu_ops_path:
    exists_mluops_version = get_mluops_version(
        mlu_ops_path + '/bangc-ops/mlu_op.h')
    if exists_mluops_version != mmcv_mluops_version:
        print('the version of mlu-ops provided is %s,'
              ' while %s is needed.' %
              (exists_mluops_version, mmcv_mluops_version))
        exit()
    try:
        if os.path.exists('mlu-ops'):
            if os.path.islink('mlu-ops'):
                os.remove('mlu-ops')
                os.symlink(mlu_ops_path, 'mlu-ops')
            elif os.path.abspath('mlu-ops') != mlu_ops_path:
                os.symlink(mlu_ops_path, 'mlu-ops')
        else:
            os.symlink(mlu_ops_path, 'mlu-ops')
    except Exception:
        raise FileExistsError('mlu-ops already exists, please move it out, '
                              'or rename or remove it.')
else:
    if not os.path.exists('mlu-ops'):
        import requests
        mluops_url = 'https://github.com/Cambricon/mlu-ops/' + \
            'archive/refs/tags/' + mmcv_mluops_version + '.zip'
        req = requests.get(mluops_url)
        with open('./mlu-ops.zip', 'wb') as f:
            try:
                f.write(req.content)
            except Exception:
                raise ImportError('failed to download mlu-ops')
        from zipfile import BadZipFile, ZipFile
        with ZipFile('./mlu-ops.zip', 'r') as archive:
            try:
                archive.extractall()
                dir_name = archive.namelist()[0].split('/')[0]
                os.rename(dir_name, 'mlu-ops')
            except BadZipFile:
                print('invalid mlu-ops.zip file')
    else:
        exists_mluops_version = get_mluops_version(
            './mlu-ops/bangc-ops/mlu_op.h')
        if exists_mluops_version != mmcv_mluops_version:
            print('the version of provided mlu-ops is %s,'
                  ' while %s is needed.' %
                  (exists_mluops_version, mmcv_mluops_version))
            exit()
define_macros += [('MMCV_WITH_MLU', None)]
mlu_args = os.getenv('MMCV_MLU_ARGS', '-DNDEBUG ')
mluops_includes = []
mluops_includes.append('-I' + os.path.abspath('./mlu-ops/bangc-ops'))
mluops_includes.append(
    '-I' + os.path.abspath('./mlu-ops/bangc-ops/kernels'))
extra_compile_args['cncc'] = [mlu_args] + \
    mluops_includes if mlu_args else mluops_includes
extra_compile_args['cxx'] += ['-fno-gnu-unique']
op_files = glob.glob('./mmcv/ops/csrc/pytorch/*.cpp') + \
glob.glob('./mmcv/ops/csrc/pytorch/cpu/*.cpp') + \
glob.glob('./mmcv/ops/csrc/pytorch/mlu/*.cpp') + \
glob.glob('./mmcv/ops/csrc/common/mlu/*.mlu') + \
glob.glob(
'./mlu-ops/bangc-ops/core/**/*.cpp', recursive=True) + \
glob.glob(
'./mlu-ops/bangc-ops/kernels/**/*.cpp', recursive=True) + \
glob.glob(
'./mlu-ops/bangc-ops/kernels/**/*.mlu', recursive=True)
extra_objects = glob.glob(
'./mlu-ops/bangc-ops/kernels/kernel_wrapper/*.o')
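# Assumption: the kernel_wrapper directory holds precompiled .o objects
# (produced by an mlu-ops build) that are linked into the extension rather
# than compiled here.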
extension = MLUExtension
include_dirs.append(os.path.abspath('./mmcv/ops/csrc/common'))
include_dirs.append(os.path.abspath('./mmcv/ops/csrc/common/mlu'))
include_dirs.append(os.path.abspath('./mlu-ops/bangc-ops'))
elif (hasattr(torch.backends, 'mps')
and torch.backends.mps.is_available()) or os.getenv(
'FORCE_MPS', '0') == '1':
@@ -309,6 +392,7 @@ def get_extensions():
sources=op_files,
include_dirs=include_dirs,
define_macros=define_macros,
extra_objects=extra_objects,
extra_compile_args=extra_compile_args)
extensions.append(ext_ops)
return extensions