v2.1.13: better debug,add some spconv 1.x ops

Commit 8b5d2af0 authored Nov 25, 2021 by yan.yan
18 changed files
--- a/.github/workflows/build.yaml
+++ b/.github/workflows/build.yaml
@@ -59,6 +59,7 @@ jobs:
        env:
          CUDA_VERSION: ${{ matrix.cuda-version }}
          PYTHON_VERSION: ${{ matrix.python-version }}
+          BOOST_VERSION: boost_1_77_0
        if: |
          (env.CUDA_VERSION != '') && (
            (github.event_name == 'push' && (startsWith(github.ref, 'refs/tags')) ) || 
@@ -72,6 +73,11 @@ jobs:
          $Env:CUMM_CUDA_ARCH_LIST = "all"
          $Env:SPCONV_DISABLE_JIT = "1"
          pip install pccm pybind11
+          # download boost header only
+          $ProgressPreference = 'SilentlyContinue'
+          Invoke-WebRequest -Uri https://boostorg.jfrog.io/artifactory/main/release/1.77.0/source/${Env:BOOST_VERSION}.zip -UseBasicParsing -OutFile $HOME/boost.zip
+          Expand-Archive $HOME/boost.zip -DestinationPath $HOME/boost
+          $Env:BOOST_ROOT = "$HOME/boost/${Env:BOOST_VERSION}"
          # ls "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v${{ matrix.cuda-version }}\include\thrust"
          python -m build --wheel --outdir dist/ .
        shell: powershell
@@ -111,10 +117,15 @@ jobs:
          python-version: ${{ matrix.python-version }}

      - name: Install pep build
+        env:
+          BOOST_VERSION: boost_1_77_0
        run: |
          python -m pip install build --user
          python -m pip install --upgrade pip twine wheel
          python -m pip install pytest setuptools
+          mkdir -p third_party
+          wget https://boostorg.jfrog.io/artifactory/main/release/1.77.0/source/$BOOST_VERSION.zip -O third_party/boost.zip
+          unzip third_party/boost.zip -d third_party/boost

      - name: Build a cuda wheel
        env:
@@ -122,6 +133,7 @@ jobs:
          PYTHON_VERSION: ${{ matrix.python-version }}
          DOCKER_IMAGE: scrin/manylinux2014-cuda:cu${{ matrix.cuda-version }}-devel-1.0.0
          PLAT: manylinux2014_x86_64
+          BOOST_VERSION: boost_1_77_0
        if: |
          (env.CUDA_VERSION != '') && (
            (github.event_name == 'push' && (startsWith(github.ref, 'refs/tags')) ) || 
@@ -132,7 +144,10 @@ jobs:
          )
        run: |
          chmod +x tools/build-wheels.sh
-          docker run --rm -e PLAT=$PLAT -e CUMM_CUDA_VERSION=${{ matrix.cuda-version }} -e SPCONV_PYTHON_LIST=${{env.PYTHON_VERSION}} -v `pwd`:/io $DOCKER_IMAGE bash -c "source /etc/bashrc && /io/tools/build-wheels.sh"
+          docker run --rm -e PLAT=$PLAT -e CUMM_CUDA_VERSION=${{ matrix.cuda-version }} \
+           -e SPCONV_PYTHON_LIST=${{env.PYTHON_VERSION}} \
+           -e BOOST_ROOT=/io/third_party/boost/$BOOST_VERSION \
+           -v `pwd`:/io $DOCKER_IMAGE bash -c "source /etc/bashrc && /io/tools/build-wheels.sh"

      - name: Build a cpu wheel
        env:
@@ -140,6 +155,7 @@ jobs:
          PYTHON_VERSION: ${{ matrix.python-version }}
          DOCKER_IMAGE: scrin/manylinux2014-cuda:cu114-devel-1.0.0
          PLAT: manylinux2014_x86_64
+          BOOST_VERSION: boost_1_77_0
        if: |
          (env.CUDA_VERSION == '') && (
            (github.event_name == 'push' && (startsWith(github.ref, 'refs/tags')) ) || 
@@ -150,7 +166,10 @@ jobs:
          )
        run: |
          chmod +x tools/build-wheels.sh
-          docker run --rm -e PLAT=$PLAT -e CUMM_CUDA_VERSION=${{ matrix.cuda-version }} -e SPCONV_PYTHON_LIST=${{env.PYTHON_VERSION}} -v `pwd`:/io $DOCKER_IMAGE bash -c "source /etc/bashrc && /io/tools/build-wheels.sh"
+          docker run --rm -e PLAT=$PLAT -e CUMM_CUDA_VERSION=${{ matrix.cuda-version }} \
+            -e SPCONV_PYTHON_LIST=${{env.PYTHON_VERSION}} \
+            -e BOOST_ROOT=/io/third_party/boost/$BOOST_VERSION \
+            -v `pwd`:/io $DOCKER_IMAGE bash -c "source /etc/bashrc && /io/tools/build-wheels.sh"

      - name: Publish a Python distribution to PyPI
        if: github.event_name == 'push' && startsWith(github.ref, 'refs/tags')

--- a/CHANGELOG.md
+++ b/CHANGELOG.md
 # Changelog

+## [2.1.13] - 2021-?-?
+### Added 
+- Add some ops from spconv 1.x; see spconv.utils for details.
+- Add debug tools so users can attach more information to issues.
+
 ## [2.1.12] - 2021-11-23
 ### Added 
 - Add a method for voxel generator to get pc_voxel_id, which is usually used in semantic segmentation

--- a/README.md
+++ b/README.md
@@ -61,7 +61,7 @@ Spconv 1.x users **NEED READ [THIS](docs/SPCONV_2_BREAKING_CHANGEs.md)** before
 * fp32 (not tf32) training/inference speed is increased (+50~80%)
 * fp16 training/inference speed is greatly increased when your layer support tensor core (channel size must be multiple of 8).
 * int8 op is ready, but we still need some time to figure out how to run int8 in pytorch.
-* [doesn't depend on pytorch binary](docs/FAQ.md#What-does-no-dependency-on-pytorch-mean), but you may need at least pytorch >= 1.6.0 to run spconv 2.x.
+* [doesn't depend on pytorch binary](docs/FAQ.md#What-does-no-dependency-on-pytorch-mean), but you may need at least pytorch >= 1.5.0 to run spconv 2.x.
 * since spconv 2.x doesn't depend on pytorch binary (never in future), it's impossible to support torch.jit/libtorch inference.

 ## Spconv 2.x Development and Roadmap
@@ -108,18 +108,32 @@ CUDA 11.1 will be removed in spconv 2.2 because pytorch 1.10 don't provide prebu

 ```pip install spconv-cu114``` for CUDA 11.4

-**NOTE** It's safe to have different **minor** cuda version between system and conda (pytorch) **in Linux**. for example, you can use spconv-cu114 with anaconda version of pytorch cuda 11.1 in a OS with CUDA 11.2 installed.
+**NOTE** It's safe for the system and conda (pytorch) to use different **minor** CUDA versions when **CUDA >= 11.0**, thanks to [CUDA Minor Version Compatibility](https://docs.nvidia.com/deploy/cuda-compatibility/#minor-version-compatibility). For example, you can use spconv-cu114 with the anaconda build of pytorch for CUDA 11.1 on an OS with CUDA 11.2 installed.
+
+For CUDA 10, ```spconv-cu102``` is untested with CUDA 10.0 and 10.1; it may or may not work, and you are welcome to try it.

 **NOTE** In Linux, you can install spconv-cuxxx without install CUDA to system! only suitable NVIDIA driver is required. for CUDA 11, we need driver >= 450.82.

+#### Prebuilt GPU Support Matrix
+
+See [this page](https://arnon.dk/matching-sm-architectures-arch-and-gencode-for-various-nvidia-cards/) to look up which GPUs belong to each arch.
+
+| CUDA version | GPU Arch List  |
+| -------------- |:---------------------:|
+| 10.2 | 50,52,60,61,70,75     | 
+| 11.x       | 52,60,61,70,75,80,86     | 
+| 12.x       | 60,61,70,75,80,86,90     | 
+
 ### Build from source for development (JIT, recommend)

 The c++ code will be built automatically when you change c++ code in project.

-For NVIDIA Embedded Platforms, you need to specify cuda arch before build: ```export CUMM_CUDA_ARCH_LIST="7.2"``` for xavier.
+For NVIDIA Embedded Platforms, you need to specify the CUDA arch before building: ```export CUMM_CUDA_ARCH_LIST="7.2"``` for Xavier, ```export CUMM_CUDA_ARCH_LIST="6.2"``` for TX2, ```export CUMM_CUDA_ARCH_LIST="8.7"``` for Orin.

 You need to remove ```cumm``` in ```requires``` section in pyproject.toml after install editable ```cumm``` and before install spconv due to pyproject limit (can't find editable installed ```cumm```).

+Before installing editable spconv/cumm, make sure that ```pip list | grep spconv``` and ```pip list | grep cumm``` both print nothing.
+
 #### Linux

 0. uninstall spconv and cumm installed by pip
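
As a rough illustration of how the "Prebuilt GPU Support Matrix" added to the README can be read, the sketch below encodes the table rows in a dictionary and checks whether a compute capability is covered by a prebuilt wheel. The names `PREBUILT_ARCHS`, `matrix_key` and `is_prebuilt_supported` are hypothetical and not part of spconv; the arch numbers are copied from the table as written.

```python
# Hypothetical helper, not part of spconv. The arch lists are copied
# from the "Prebuilt GPU Support Matrix" table in the README diff.
PREBUILT_ARCHS = {
    "10.2": (50, 52, 60, 61, 70, 75),
    "11.x": (52, 60, 61, 70, 75, 80, 86),
    "12.x": (60, 61, 70, 75, 80, 86, 90),
}


def matrix_key(cuda_version: str) -> str:
    """Map a CUDA version such as "11.4" to its row in the table."""
    if cuda_version == "10.2":
        return "10.2"  # the only CUDA 10 row; 10.0 and 10.1 are not listed
    major = cuda_version.split(".")[0]
    return f"{major}.x"


def is_prebuilt_supported(cuda_version: str, arch: int) -> bool:
    """Return True if `arch` (e.g. 86 for sm_86) is listed for this CUDA version."""
    return arch in PREBUILT_ARCHS.get(matrix_key(cuda_version), ())


print(is_prebuilt_supported("11.4", 86))  # arch 86 is in the 11.x row
print(is_prebuilt_supported("11.4", 72))  # Xavier (7.2) is not in any row
```

Embedded archs such as 7.2 do not appear in any row, which matches the README's advice to set `CUMM_CUDA_ARCH_LIST` and build from source on those platforms.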

--- a/docs/BENCHMARK.md
+++ b/docs/BENCHMARK.md
@@ -16,7 +16,7 @@

 ## Simple Benchmark

-### Network Benchmark without batchnorm (F32/F16) in RTX 3080 Laptop GPU
+### Network Benchmark without batchnorm (F32/F16) in RTX 3080 Laptop GPU 150W

 Network Code: test/benchmark.py

@@ -25,6 +25,18 @@ Network Code: test/benchmark.py
 | Forward | 43ms     | 21.7ms/13.7ms    | 23.5ms/11.2ms      | 22ms/12.2ms      |
 | Backward | 80ms    | 41.9ms/25.2ms    | 51.0ms/13.8ms      | 41.1ms/12.2ms      |

+| F16 Forward | Native| Implicit Gemm | Implicit Gemm Split Mask  |
+| -------------- |:---------------------:|---------------------:| ---------------------:|
+| RTX 3080 Laptop 150W | 13.7ms     | 11.2ms    | 12.2ms      |
+| RTX A6000 | 19.1ms    |  11.7ms   | 14.0ms      |
+| TESLA V100 | 17.9ms    |  11.4ms   | 13.4ms      |
+
+| F16 Backward | Native| Implicit Gemm | Implicit Gemm Split Mask  |
+| -------------- |:---------------------:|---------------------:| ---------------------:|
+| RTX 3080 Laptop 150W | 25.2ms     | 13.8ms    | 12.2ms      |
+| RTX A6000       | 28.1ms     | 9.2ms     | 8.9ms      |
+| TESLA V100 | 33.9ms    |  12.2ms   | 12.9ms      |
+
 ### Network Gemm Kernel Benchmark FP16 in RTX 3080 Laptop GPU

 Network Code: test/benchmark.py

--- a/docs/PERFORMANCE_GUIDE.md
+++ b/docs/PERFORMANCE_GUIDE.md
@@ -26,30 +26,3 @@
 * spconv 2.x in Windows 10 is 1.5x~2x slower than Linux. use Linux if possible.

 See [benchmark](BENCHMARK.md) for more performance details of different algorithms.
-
-## Algorithm Overview
-
-### Native Explicit (deprecated and removed in spconv 2.x)
-
-native algorithm (explicit, no fused) is standard gather-gemm-scatter algorithm. Assume we compute 3x3 conv, We can split it to 9 of 1x1 conv which can be computed by matmul, then sum them to get final result.
-For sparse convolution, we also do split-gemm-sum to calculate conv, but we need to collect data first because it's sparse.
-
-### Native
-
-Fused version of above algorithm. 1.5x-2x faster than non-fused version.
-
-### Implicit Gemm
-
-```Native``` algorithm do minimal mma (matrix multiply add), but it need to serialize IO. The pipeline of ```Native``` is gather-gemm-scatter-gather-gemm-scatter-...
-
-```Implicit Gemm``` fuse all calculation to one kernel and perform overlapped gather-mma-scatter to save a lot of time. 
-
-![Image Overlapped Gemm](https://raw.githubusercontent.com/NVIDIA/cutlass/master/media/images/software-pipeline.png)
-
-In my test, ```Implicit Gemm``` is almost 2x faster than ```Native```.
-
-### Implicit Gemm Split Mask
-
-TODO
-
-In my test, ```Implicit Gemm Split Mask``` is slightly faster than ```Implicit Gemm```, but the indice generation is slower, so currently we use ```Implicit Gemm``` by default.
\ No newline at end of file
--- a/setup.py
+++ b/setup.py
@@ -156,6 +156,8 @@ if disable_jit is not None and disable_jit == "1":
    from cumm.conv.main import ConvMainUnitTest
    from cumm.constants import CUMM_CPU_ONLY_BUILD
    from spconv.csrc.sparse.all import SpconvOps
+    from spconv.csrc.utils import BoxOps
+
    cu = GemmMainUnitTest(SHUFFLE_SIMT_PARAMS + SHUFFLE_VOLTA_PARAMS + SHUFFLE_TURING_PARAMS)
    convcu = ConvMainUnitTest(IMPLGEMM_SIMT_PARAMS + IMPLGEMM_VOLTA_PARAMS + IMPLGEMM_TURING_PARAMS)
    convcu.namespace = "cumm.conv.main"
@@ -168,9 +170,9 @@ if disable_jit is not None and disable_jit == "1":
            std = "c++14" 
        else:
            std = "c++17"
-    cus = [cu, convcu, SpconvOps()]
+    cus = [cu, convcu, SpconvOps(), BoxOps()]
    if CUMM_CPU_ONLY_BUILD:
-        cus = [SpconvOps()]
+        cus = [SpconvOps(), BoxOps()]
    ext_modules: List[Extension] = [
        PCCMExtension(cus,
                      "spconv/core_cc",

--- a/spconv/build.py
+++ b/spconv/build.py
@@ -28,6 +28,8 @@ if project_is_installed(PACKAGE_NAME) and project_is_editable(
    from cumm.conv.main import ConvMainUnitTest

    from spconv.csrc.sparse.all import SpconvOps
+    from spconv.csrc.utils import BoxOps
+
    cu = GemmMainUnitTest(SHUFFLE_SIMT_PARAMS + SHUFFLE_VOLTA_PARAMS +
                          SHUFFLE_TURING_PARAMS)
    cu.namespace = "cumm.gemm.main"
@@ -38,7 +40,7 @@ if project_is_installed(PACKAGE_NAME) and project_is_editable(
    if InWindows:
        # windows have command line limit, so we use objects_folder to reduce command size.
        objects_folder = "objects"
-    pccm.builder.build_pybind([cu, convcu, SpconvOps()],
+    pccm.builder.build_pybind([cu, convcu, SpconvOps(), BoxOps()],
                              PACKAGE_ROOT / "core_cc",
                              namespace_root=PACKAGE_ROOT,
                              objects_folder=objects_folder,

--- a/spconv/constants.py
+++ b/spconv/constants.py
@@ -27,3 +27,15 @@ _filter_hwio_env = os.getenv("SPCONV_FILTER_HWIO", "0")
 FILTER_HWIO = _filter_hwio_env == "1"
 DISABLE_JIT = os.getenv("SPCONV_DISABLE_JIT", "0") == "1"
 NDIM_DONT_CARE = 3
+
+SPCONV_DEBUG_SAVE_PATH = os.getenv("SPCONV_DEBUG_SAVE_PATH", "")
+
+
+_BOOST_ROOT = os.getenv("BOOST_ROOT", None)
+
+if _BOOST_ROOT is None:
+    BOOST_ROOT = None 
+else:
+    BOOST_ROOT = Path(_BOOST_ROOT)
+    assert BOOST_ROOT.exists(), "BOOST_ROOT is set, but the path does not exist"
+    assert (BOOST_ROOT / "boost" / "geometry").exists(), "BOOST_ROOT is set, but BOOST_ROOT/boost/geometry does not exist"
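
The `BOOST_ROOT` check above expects the directory to contain `boost/geometry`, which is the layout produced by unpacking the Boost source archive in the workflow. A minimal sketch of the same validation, assuming a hypothetical helper `resolve_boost_root` (not part of spconv) that raises an exception instead of using `assert`, so the check still runs under `python -O`:

```python
import tempfile
from pathlib import Path
from typing import Optional


def resolve_boost_root(value: Optional[str]) -> Optional[Path]:
    """Validate a BOOST_ROOT value the way spconv/constants.py does."""
    if value is None:
        return None  # BOOST_ROOT unset: Boost-dependent ops are not built
    root = Path(value)
    if not root.exists():
        raise FileNotFoundError(f"BOOST_ROOT {root} does not exist")
    if not (root / "boost" / "geometry").exists():
        raise FileNotFoundError(f"{root}/boost/geometry does not exist")
    return root


# Build a fake header tree to show which directory layout passes the check.
with tempfile.TemporaryDirectory() as tmp:
    fake_root = Path(tmp) / "boost_1_77_0"
    (fake_root / "boost" / "geometry").mkdir(parents=True)
    print(resolve_boost_root(str(fake_root)) == fake_root)
```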
--- a/spconv/core_cc/csrc/utils/__init__.pyi
+++ b/spconv/core_cc/csrc/utils/__init__.pyi
+# Copyright 2021 Yan Yan
+# 
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+# 
+#     http://www.apache.org/licenses/LICENSE-2.0
+# 
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
--- a/spconv/core_cc/csrc/utils/boxops.pyi
+++ b/spconv/core_cc/csrc/utils/boxops.pyi
+from typing import overload, Any, Callable, Dict, List, Optional, Set, Tuple, Type, Union
+from pccm.stubs import EnumValue, EnumClassValue
+from cumm.tensorview import Tensor
+class BoxOps:
+    @staticmethod
+    def has_boost() -> bool: ...
+    @staticmethod
+    def non_max_suppression_cpu(boxes: Tensor, order: Tensor, thresh: float, eps: float = 0) -> List[int]: 
+        """
+        Args:
+            boxes: 
+            order: 
+            thresh: 
+            eps: 
+        """
+        ...
+    @staticmethod
+    def rotate_non_max_suppression_cpu(box_corners: Tensor, order: Tensor, standup_iou: Tensor, thresh: float, eps: float = 0) -> List[int]: 
+        """
+        Args:
+            box_corners: 
+            order: 
+            standup_iou: 
+            thresh: 
+            eps: 
+        """
+        ...
+    @staticmethod
+    def rbbox_iou(box_corners: Tensor, qbox_corners: Tensor, standup_iou: Tensor, overlaps: Tensor, standup_thresh: float, inter_only: bool) -> None: 
+        """
+        Args:
+            box_corners: 
+            qbox_corners: 
+            standup_iou: 
+            overlaps: 
+            standup_thresh: 
+            inter_only: 
+        """
+        ...
+    @staticmethod
+    def rbbox_iou_aligned(box_corners: Tensor, qbox_corners: Tensor, overlaps: Tensor, inter_only: bool) -> None: 
+        """
+        Args:
+            box_corners: 
+            qbox_corners: 
+            overlaps: 
+            inter_only: 
+        """
+        ...
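
The stub for `non_max_suppression_cpu` leaves its argument descriptions empty. Judging from the C++ body in `spconv/csrc/utils/boxops.py` in this commit, `boxes` rows are read as `[x1, y1, x2, y2]` and `order` lists box indices in the sequence they should be visited. The pure-Python sketch below mirrors that greedy loop for illustration; `nms_reference` is a hypothetical name, and the assumption that `order` is sorted by descending score is ours, not stated in the source.

```python
from typing import List, Sequence


def nms_reference(boxes: Sequence[Sequence[float]], order: Sequence[int],
                  thresh: float, eps: float = 0.0) -> List[int]:
    """Greedy NMS over [x1, y1, x2, y2] boxes visited in `order`."""
    areas = [(b[2] - b[0] + eps) * (b[3] - b[1] + eps) for b in boxes]
    suppressed = [False] * len(boxes)
    keep: List[int] = []
    for pos, i in enumerate(order):
        if suppressed[i]:
            continue
        keep.append(i)  # i survives; suppress later boxes that overlap it
        for j in order[pos + 1:]:
            if suppressed[j]:
                continue
            w = min(boxes[i][2], boxes[j][2]) - max(boxes[i][0], boxes[j][0]) + eps
            h = min(boxes[i][3], boxes[j][3]) - max(boxes[i][1], boxes[j][1]) + eps
            if w > 0 and h > 0:
                inter = w * h
                if inter / (areas[i] + areas[j] - inter) >= thresh:
                    suppressed[j] = True
    return keep


boxes = [
    [0.0, 0.0, 10.0, 10.0],    # visited first
    [1.0, 1.0, 11.0, 11.0],    # IoU with box 0 is 81 / 119, about 0.68
    [20.0, 20.0, 30.0, 30.0],  # disjoint from both
]
order = [0, 1, 2]  # assumed: indices sorted by descending score
print(nms_reference(boxes, order, thresh=0.5))  # [0, 2]
```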
--- a/spconv/cppconstants.py
+++ b/spconv/cppconstants.py
@@ -18,3 +18,7 @@ if hasattr(_ext, "cumm"):
    CPU_ONLY_BUILD = False
 else:
    CPU_ONLY_BUILD = True
+
+from spconv.core_cc.csrc.utils.boxops import BoxOps
+
+HAS_BOOST = BoxOps.has_boost()
\ No newline at end of file
--- a/spconv/csrc/utils/__init__.py
+++ b/spconv/csrc/utils/__init__.py
+# Copyright 2021 Yan Yan
+# 
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+# 
+#     http://www.apache.org/licenses/LICENSE-2.0
+# 
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+from .boxops import BoxOps
\ No newline at end of file
--- a/spconv/csrc/utils/boxops.py
+++ b/spconv/csrc/utils/boxops.py
+# Copyright 2021 Yan Yan
+# 
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+# 
+#     http://www.apache.org/licenses/LICENSE-2.0
+# 
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import pccm 
+from pathlib import Path 
+import os 
+from cumm.common import TensorView, TensorViewCPU, TensorViewKernel, ThrustLib
+from spconv.constants import BOOST_ROOT
+
+
+class BoostGeometryLib(pccm.Class):
+    def __init__(self):
+        super().__init__()
+        assert BOOST_ROOT is not None 
+        self.build_meta.add_includes(BOOST_ROOT)
+        self.add_include("boost/geometry.hpp")
+
+class BoxOps(pccm.Class):
+    def __init__(self):
+        super().__init__()
+        self.add_dependency(TensorView)
+    
+    @pccm.pybind.mark
+    @pccm.static_function
+    def has_boost(self):
+        code = pccm.FunctionCode()
+        code.raw(f"return {pccm.boolean(BOOST_ROOT is not None)};")
+        return code.ret("bool")
+
+    @pccm.pybind.mark(nogil=True)
+    @pccm.static_function
+    def non_max_suppression_cpu(self):
+        code = pccm.FunctionCode()
+        code.arg("boxes, order", "tv::Tensor")
+        code.arg("thresh", "float")
+        code.arg("eps", "float", "0")
+
+        code.raw(f"""
+        auto ndets = boxes.dim(0);
+        std::vector<int> keep;  // filled with push_back below
+
+        tv::dispatch<float, double>(boxes.dtype(), [&](auto I1){{
+            using DType = TV_DECLTYPE(I1);
+            auto boxes_r = boxes.tview<const DType, 2>();
+            tv::dispatch<int, int64_t, uint32_t, uint64_t>(order.dtype(), [&](auto I2){{
+                using T2 = TV_DECLTYPE(I2);
+                auto order_r = order.tview<const T2, 1>();
+                std::vector<DType> areas(ndets);
+                for (int i = 0; i < ndets; ++i){{
+                    areas[i] = (boxes_r(i, 2) - boxes_r(i, 0) + eps) * 
+                               (boxes_r(i, 3) - boxes_r(i, 1) + eps);
+                }}
+                std::vector<int> suppressed(ndets, 0);
+
+                int i, j;
+                DType xx1, xx2, w, h, inter, ovr;
+                for (int _i = 0; _i < ndets; ++_i) {{
+                    i = order_r(_i);
+                    if (suppressed[i] == 1)
+                        continue;
+                    keep.push_back(i);
+                    for (int _j = _i + 1; _j < ndets; ++_j) {{
+                        j = order_r(_j);
+                        if (suppressed[j] == 1)
+                            continue;
+                        xx2 = std::min(boxes_r(i, 2), boxes_r(j, 2));
+                        xx1 = std::max(boxes_r(i, 0), boxes_r(j, 0));
+                        w = xx2 - xx1 + eps;
+                        if (w > 0) {{
+                            xx2 = std::min(boxes_r(i, 3), boxes_r(j, 3));
+                            xx1 = std::max(boxes_r(i, 1), boxes_r(j, 1));
+                            h = xx2 - xx1 + eps;
+                            if (h > 0) {{
+                                inter = w * h;
+                                ovr = inter / (areas[i] + areas[j] - inter);
+                                if (ovr >= thresh)
+                                    suppressed[j] = 1;
+                            }}
+                        }}
+                    }}
+                }}
+            }});
+
+        }});
+        return keep;
+        """)
+        return code.ret("std::vector<int>")
+
+    @pccm.pybind.mark(nogil=True)
+    @pccm.static_function
+    def rotate_non_max_suppression_cpu(self):
+        code = pccm.FunctionCode()
+        code.arg("box_corners, order, standup_iou", "tv::Tensor")
+        code.arg("thresh", "float")
+        code.arg("eps", "float", "0")
+        if BOOST_ROOT is None:
+            return code.make_invalid()
+        code.add_dependency(BoostGeometryLib)
+        code.raw(f"""
+        auto ndets = box_corners.dim(0);
+        std::vector<int> keep;  // filled with push_back below
+
+        tv::dispatch<float, double>(box_corners.dtype(), [&](auto I1){{
+            using DType = TV_DECLTYPE(I1);
+            auto box_corners_r = box_corners.tview<const DType, 3>();
+            auto standup_iou_r = standup_iou.tview<const DType, 2>();
+
+            tv::dispatch<int, int64_t, uint32_t, uint64_t>(order.dtype(), [&](auto I2){{
+                using T2 = TV_DECLTYPE(I2);
+                auto order_r = order.tview<const T2, 1>();
+                std::vector<int> suppressed(ndets, 0);
+                int i, j;
+
+                namespace bg = boost::geometry;
+                typedef bg::model::point<DType, 2, bg::cs::cartesian> point_t;
+                typedef bg::model::polygon<point_t> polygon_t;
+                polygon_t poly, qpoly;
+                std::vector<polygon_t> poly_inter, poly_union;
+                DType inter_area, union_area, overlap;
+
+                for (int _i = 0; _i < ndets; ++_i) {{
+                    i = order_r(_i);
+                    if (suppressed[i] == 1)
+                        continue;
+                    keep.push_back(i);
+                    for (int _j = _i + 1; _j < ndets; ++_j) {{
+                        j = order_r(_j);
+                        if (suppressed[j] == 1)
+                            continue;
+                        if (standup_iou_r(i, j) <= 0.0)
+                            continue;
+                        // std::cout << "pre_poly" << std::endl;
+                        bg::append(poly,
+                                point_t(box_corners_r(i, 0, 0), box_corners_r(i, 0, 1)));
+                        bg::append(poly,
+                                point_t(box_corners_r(i, 1, 0), box_corners_r(i, 1, 1)));
+                        bg::append(poly,
+                                point_t(box_corners_r(i, 2, 0), box_corners_r(i, 2, 1)));
+                        bg::append(poly,
+                                point_t(box_corners_r(i, 3, 0), box_corners_r(i, 3, 1)));
+                        bg::append(poly,
+                                point_t(box_corners_r(i, 0, 0), box_corners_r(i, 0, 1)));
+                        bg::append(qpoly,
+                                point_t(box_corners_r(j, 0, 0), box_corners_r(j, 0, 1)));
+                        bg::append(qpoly,
+                                point_t(box_corners_r(j, 1, 0), box_corners_r(j, 1, 1)));
+                        bg::append(qpoly,
+                                point_t(box_corners_r(j, 2, 0), box_corners_r(j, 2, 1)));
+                        bg::append(qpoly,
+                                point_t(box_corners_r(j, 3, 0), box_corners_r(j, 3, 1)));
+                        bg::append(qpoly,
+                                point_t(box_corners_r(j, 0, 0), box_corners_r(j, 0, 1)));
+                        bg::intersection(poly, qpoly, poly_inter);
+                        if (!poly_inter.empty()) {{
+                            inter_area = bg::area(poly_inter.front());
+                            bg::union_(poly, qpoly, poly_union);
+                            if (!poly_union.empty()) {{ // ignore invalid box
+                                union_area = bg::area(poly_union.front());
+                                overlap = inter_area / union_area;
+                                if (overlap >= thresh)
+                                    suppressed[j] = 1;
+                                poly_union.clear();
+                            }}
+                        }}
+                        poly.clear();
+                        qpoly.clear();
+                        poly_inter.clear();
+                    }}
+                }}
+            }});
+        }});
+        return keep;
+        """)
+        return code.ret("std::vector<int>")
+
+    @pccm.pybind.mark(nogil=True)
+    @pccm.static_function
+    def rbbox_iou(self):
+        code = pccm.FunctionCode()
+        code.arg("box_corners, qbox_corners, standup_iou, overlaps", "tv::Tensor")
+        code.arg("standup_thresh", "float")
+        code.arg("inter_only", "bool")
+
+        if BOOST_ROOT is None:
+            return code.make_invalid()
+        code.add_dependency(BoostGeometryLib)
+        code.raw(f"""
+        auto N = box_corners.dim(0);
+        auto K = qbox_corners.dim(0);
+        if (N == 0 || K == 0) {{
+            return;
+        }}
+        tv::dispatch<float, double>(box_corners.dtype(), [&](auto I1){{
+            using DType = TV_DECLTYPE(I1);
+
+            auto box_corners_r = box_corners.tview<const DType, 3>();
+            auto qbox_corners_r = qbox_corners.tview<const DType, 3>();
+
+            auto standup_iou_r = standup_iou.tview<const DType, 2>();
+            auto overlaps_rw = overlaps.tview<DType, 2>();
+
+            namespace bg = boost::geometry;
+            typedef bg::model::point<DType, 2, bg::cs::cartesian> point_t;
+            typedef bg::model::polygon<point_t> polygon_t;
+            polygon_t poly, qpoly;
+            std::vector<polygon_t> poly_inter, poly_union;
+            DType inter_area, union_area;
+            for (int k = 0; k < K; ++k) {{
+                for (int n = 0; n < N; ++n) {{
+                    if (standup_iou_r(n, k) <= standup_thresh)
+                        continue;
+                    bg::append(poly, point_t(box_corners_r(n, 0, 0), box_corners_r(n, 0, 1)));
+                    bg::append(poly, point_t(box_corners_r(n, 1, 0), box_corners_r(n, 1, 1)));
+                    bg::append(poly, point_t(box_corners_r(n, 2, 0), box_corners_r(n, 2, 1)));
+                    bg::append(poly, point_t(box_corners_r(n, 3, 0), box_corners_r(n, 3, 1)));
+                    bg::append(poly, point_t(box_corners_r(n, 0, 0), box_corners_r(n, 0, 1)));
+                    bg::append(qpoly,
+                                point_t(qbox_corners_r(k, 0, 0), qbox_corners_r(k, 0, 1)));
+                    bg::append(qpoly,
+                                point_t(qbox_corners_r(k, 1, 0), qbox_corners_r(k, 1, 1)));
+                    bg::append(qpoly,
+                                point_t(qbox_corners_r(k, 2, 0), qbox_corners_r(k, 2, 1)));
+                    bg::append(qpoly,
+                                point_t(qbox_corners_r(k, 3, 0), qbox_corners_r(k, 3, 1)));
+                    bg::append(qpoly,
+                                point_t(qbox_corners_r(k, 0, 0), qbox_corners_r(k, 0, 1)));
+
+                    bg::intersection(poly, qpoly, poly_inter);
+
+                    if (!poly_inter.empty()) {{
+                        inter_area = bg::area(poly_inter.front());
+                        if (inter_only){{
+                            overlaps_rw(n, k) = inter_area;
+                        }}else{{
+                            bg::union_(poly, qpoly, poly_union);
+                            if (!poly_union.empty()) {{
+                                union_area = bg::area(poly_union.front());
+                                overlaps_rw(n, k) = inter_area / union_area;
+                            }}
+                            poly_union.clear();
+                        }}
+                    }}
+                    poly.clear();
+                    qpoly.clear();
+                    poly_inter.clear();
+                }}
+            }}
+        }});
+        return;
+        """)
+        return code
+
+    @pccm.pybind.mark(nogil=True)
+    @pccm.static_function
+    def rbbox_iou_aligned(self):
+        code = pccm.FunctionCode()
+        code.arg("box_corners, qbox_corners, overlaps", "tv::Tensor")
+        code.arg("inter_only", "bool")
+
+        if BOOST_ROOT is None:
+            return code.make_invalid()
+        code.add_dependency(BoostGeometryLib)
+        code.raw(f"""
+        auto N = box_corners.dim(0);
+        auto K = qbox_corners.dim(0);
+        TV_ASSERT_RT_ERR(N == K, "aligned iou requires the same number of boxes");
+        if (N == 0 || K == 0) {{
+            return;
+        }}
+        tv::dispatch<float, double>(box_corners.dtype(), [&](auto I1){{
+            using DType = TV_DECLTYPE(I1);
+
+            auto box_corners_r = box_corners.tview<const DType, 3>();
+            auto qbox_corners_r = qbox_corners.tview<const DType, 3>();
+
+            auto overlaps_rw = overlaps.tview<DType, 1>();
+
+            namespace bg = boost::geometry;
+            typedef bg::model::point<DType, 2, bg::cs::cartesian> point_t;
+            typedef bg::model::polygon<point_t> polygon_t;
+            polygon_t poly, qpoly;
+            std::vector<polygon_t> poly_inter, poly_union;
+            DType inter_area, union_area;
+
+            for (int n = 0; n < N; ++n) {{
+                bg::append(poly, point_t(box_corners_r(n, 0, 0), box_corners_r(n, 0, 1)));
+                bg::append(poly, point_t(box_corners_r(n, 1, 0), box_corners_r(n, 1, 1)));
+                bg::append(poly, point_t(box_corners_r(n, 2, 0), box_corners_r(n, 2, 1)));
+                bg::append(poly, point_t(box_corners_r(n, 3, 0), box_corners_r(n, 3, 1)));
+                bg::append(poly, point_t(box_corners_r(n, 0, 0), box_corners_r(n, 0, 1)));
+                bg::append(qpoly,
+                            point_t(qbox_corners_r(n, 0, 0), qbox_corners_r(n, 0, 1)));
+                bg::append(qpoly,
+                            point_t(qbox_corners_r(n, 1, 0), qbox_corners_r(n, 1, 1)));
+                bg::append(qpoly,
+                            point_t(qbox_corners_r(n, 2, 0), qbox_corners_r(n, 2, 1)));
+                bg::append(qpoly,
+                            point_t(qbox_corners_r(n, 3, 0), qbox_corners_r(n, 3, 1)));
+                bg::append(qpoly,
+                            point_t(qbox_corners_r(n, 0, 0), qbox_corners_r(n, 0, 1)));
+
+                bg::intersection(poly, qpoly, poly_inter);
+
+                if (!poly_inter.empty()) {{
+                    inter_area = bg::area(poly_inter.front());
+                    if (inter_only){{
+                        overlaps_rw(n) = inter_area;
+                    }}else{{
+                        bg::union_(poly, qpoly, poly_union);
+                        if (!poly_union.empty()) {{
+                            union_area = bg::area(poly_union.front());
+                            overlaps_rw(n) = inter_area / union_area;
+                        }}
+                        poly_union.clear();
+                    }}
+                }}
+                poly.clear();
+                qpoly.clear();
+                poly_inter.clear();
+            }}
+        }});
+        return;
+        """)
+        return code
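
The rotated-box ops above build two closed polygons from four corners each and use Boost.Geometry's `intersection` and `union_` to get the overlap. For readers without Boost, the same quantity can be sketched in pure Python. Note the swap: this sketch uses Sutherland-Hodgman clipping, which is only valid for convex polygons, and computes the union area as `area(a) + area(b) - inter` instead of calling a union routine. All function names here are hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def signed_area(poly: List[Point]) -> float:
    """Shoelace formula; positive for counter-clockwise vertex order."""
    total = 0.0
    for (x1, y1), (x2, y2) in zip(poly, poly[1:] + poly[:1]):
        total += x1 * y2 - x2 * y1
    return total / 2.0


def clip_convex(subject: List[Point], clip: List[Point]) -> List[Point]:
    """Sutherland-Hodgman: clip `subject` by a convex, counter-clockwise `clip`."""
    output = subject
    for (ax, ay), (bx, by) in zip(clip, clip[1:] + clip[:1]):
        if not output:
            break
        current, output = output, []

        def side(p: Point) -> float:
            # > 0 when p lies left of edge a->b, i.e. inside a CCW clip polygon
            return (bx - ax) * (p[1] - ay) - (by - ay) * (p[0] - ax)

        for p, q in zip(current, current[1:] + current[:1]):
            sp, sq = side(p), side(q)
            if sp >= 0:
                output.append(p)
            if (sp > 0 > sq) or (sp < 0 < sq):
                t = sp / (sp - sq)  # where segment p->q crosses the edge line
                output.append((p[0] + t * (q[0] - p[0]), p[1] + t * (q[1] - p[1])))
    return output


def rotated_iou(box: List[Point], qbox: List[Point], inter_only: bool = False) -> float:
    """IoU (or intersection area) of two convex quadrilaterals given as corners."""
    # Normalise both to counter-clockwise order so the inside test is valid.
    if signed_area(box) < 0:
        box = box[::-1]
    if signed_area(qbox) < 0:
        qbox = qbox[::-1]
    inter_poly = clip_convex(box, qbox)
    inter = abs(signed_area(inter_poly)) if len(inter_poly) >= 3 else 0.0
    if inter_only:
        return inter
    union = signed_area(box) + signed_area(qbox) - inter
    return inter / union if union > 0 else 0.0


square = [(0.0, 0.0), (2.0, 0.0), (2.0, 2.0), (0.0, 2.0)]
shifted = [(1.0, 1.0), (3.0, 1.0), (3.0, 3.0), (1.0, 3.0)]
print(rotated_iou(square, shifted))  # overlap is a 1x1 square, so about 1/7
```

Unlike the C++ ops, this sketch takes plain corner lists rather than tensors and returns the value instead of writing into a preallocated `overlaps` tensor.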
--- a/spconv/debug_utils.py
+++ b/spconv/debug_utils.py
+# Copyright 2021 Yan Yan
+# 
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+# 
+#     http://www.apache.org/licenses/LICENSE-2.0
+# 
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import pickle 
+from pathlib import Path 
+
+from spconv.constants import SPCONV_DEBUG_SAVE_PATH
+
+def spconv_save_debug_data(data):
+    if SPCONV_DEBUG_SAVE_PATH:
+        try:
+            save_path = Path(SPCONV_DEBUG_SAVE_PATH)
+            assert save_path.parent.exists(), "parent of SPCONV_DEBUG_SAVE_PATH must exist"
+            with save_path.open("wb") as f:
+                pickle.dump(data, f)
+            print((f"spconv saved debug data to {SPCONV_DEBUG_SAVE_PATH}, "
+                    "you can submit an issue with the log and this debug data attached."))
+        except Exception as e:
+            print((f"spconv tried to save debug data to {SPCONV_DEBUG_SAVE_PATH}, "
+                    f"but failed with exception {e}. Please check your SPCONV_DEBUG_SAVE_PATH."))
+
+    else:
+        print(("SPCONV_DEBUG_SAVE_PATH not found, "
+                "you can specify SPCONV_DEBUG_SAVE_PATH as the debug data save path "
+                "to save debug data which can be attached to an issue."))
--- a/spconv/pytorch/conv.py
+++ b/spconv/pytorch/conv.py
@@ -14,6 +14,7 @@

 import math
 import time
+import sys
 from typing import List, Optional, Tuple, Union

 import numpy as np
@@ -24,6 +25,7 @@ from torch.nn.parameter import Parameter

 from spconv import pytorch as spconv
 from spconv.core import ConvAlgo
+from spconv.debug_utils import spconv_save_debug_data
 from spconv.pytorch import functional as Fsp
 from spconv.pytorch import ops
 from spconv.cppconstants import CPU_ONLY_BUILD
@@ -291,11 +293,21 @@ class SparseConvolution(SparseModule):
                        if input.benchmark:
                            torch.cuda.synchronize()
                            t = time.time()
+                        try:
                            outids, indice_pairs, indice_pair_num = ops.get_indice_pairs(
                                indices, batch_size, spatial_shape, algo,
                                self.kernel_size, self.stride, self.padding,
                                self.dilation, self.output_padding, self.subm,
                                self.transposed)
+                        except Exception as e:
+                            msg = "[Exception|native_pair]"
+                            msg += f"indices={indices.shape},bs={batch_size},ss={spatial_shape},"
+                            msg += f"algo={algo},ksize={self.kernel_size},stride={self.stride},"
+                            msg += f"padding={self.padding},dilation={self.dilation},subm={self.subm},"
+                            msg += f"transpose={self.transposed}"
+                            print(msg, file=sys.stderr)
+                            spconv_save_debug_data(indices)
+                            raise e 
                        if input.benchmark:
                            torch.cuda.synchronize()
                            interval = time.time() - t
@@ -367,9 +379,11 @@ class SparseConvolution(SparseModule):
                        mask_argsort_bwd_splits = datas.mask_argsort_bwd_splits
                        masks = datas.masks
                    else:
+
                        with input._timer.namespace("gen_pairs"):
                            # we need to gen bwd indices for regular conv
                            # because it may be inversed.
+                            try:
                                res = ops.get_indice_pairs_implicit_gemm(
                                    indices,
                                    batch_size,
@@ -385,6 +399,16 @@ class SparseConvolution(SparseModule):
                                    is_train=(not self.subm) or self.training,
                                    alloc=input.thrust_allocator,
                                    timer=input._timer)
+                            except Exception as e:
+                                msg = "[Exception|implicit_gemm_pair]"
+                                msg += f"indices={indices.shape},bs={batch_size},ss={spatial_shape},"
+                                msg += f"algo={algo},ksize={self.kernel_size},stride={self.stride},"
+                                msg += f"padding={self.padding},dilation={self.dilation},subm={self.subm},"
+                                msg += f"transpose={self.transposed}"
+                                print(msg, file=sys.stderr)
+                                spconv_save_debug_data(indices)
+                                raise e 
+
                        outids = res[0]
                        num_inds_per_loc = res[1]
                        pair_fwd = res[2]

--- a/spconv/pytorch/functional.py
+++ b/spconv/pytorch/functional.py
@@ -12,6 +12,9 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.

+import sys
+import pickle 
+
 import torch
 from torch import nn
 from torch.autograd import Function
@@ -19,9 +22,10 @@ from typing import Optional, TypeVar
 from spconv.tools import CUDAKernelTimer
 from spconv.pytorch import ops
 from spconv.pytorch.constants import PYTORCH_VERSION
-
+from spconv.debug_utils import spconv_save_debug_data
 from torch.autograd.function import once_differentiable
 import numpy as np
+from pathlib import Path

 from typing import List

@@ -53,6 +57,7 @@ class SparseConvFunction(Function):
        ctx.save_for_backward(indice_pairs, indice_pair_num, features, filters)
        ctx.algo = algo
        ctx.timer = timer
+        try:
            return ops.indice_conv(features,
                                filters,
                                indice_pairs,
@@ -61,6 +66,13 @@ class SparseConvFunction(Function):
                                False,
                                algo=algo,
                                timer=timer)
+        except Exception as e:
+            msg = "[Exception|indice_conv]"
+            msg += f"feat={features.shape},w={filters.shape},pair={indice_pairs.shape},"
+            msg += f"pairnum={indice_pair_num},act={num_activate_out},algo={algo}"
+            print(msg, file=sys.stderr)
+            spconv_save_debug_data((indice_pairs, indice_pair_num))
+            raise e 

    @staticmethod
    @once_differentiable
@@ -68,7 +80,7 @@ class SparseConvFunction(Function):
    def backward(ctx, grad_output):
        indice_pairs, indice_pair_num, features, filters = ctx.saved_tensors
        timer = ctx.timer
-
+        try:
            input_bp, filters_bp = ops.indice_conv_backward(features,
                                                            filters,
                                                            grad_output,
@@ -77,6 +89,13 @@ class SparseConvFunction(Function):
                                                            False,
                                                            algo=ctx.algo,
                                                            timer=timer)
+        except Exception as e:
+            msg = "[Exception|indice_conv_backward]"
+            msg += f"feat={features.shape},w={filters.shape},pair={indice_pairs.shape},"
+            msg += f"pairnum={indice_pair_num},do={grad_output.shape}"
+            print(msg, file=sys.stderr)
+            spconv_save_debug_data((indice_pairs, indice_pair_num))
+            raise e 

        return input_bp, filters_bp, None, None, None, None, None

@@ -95,7 +114,7 @@ class SparseInverseConvFunction(Function):
        ctx.save_for_backward(indice_pairs, indice_pair_num, features, filters)
        ctx.algo = algo
        ctx.timer = timer
-
+        try:
            return ops.indice_conv(features,
                                filters,
                                indice_pairs,
@@ -105,6 +124,13 @@ class SparseInverseConvFunction(Function):
                                False,
                                algo=algo,
                                timer=timer)
+        except Exception as e:
+            msg = "[Exception|indice_conv|inverse]"
+            msg += f"feat={features.shape},w={filters.shape},pair={indice_pairs.shape},"
+            msg += f"pairnum={indice_pair_num},act={num_activate_out},algo={algo}"
+            print(msg, file=sys.stderr)
+            spconv_save_debug_data((indice_pairs, indice_pair_num))
+            raise e 

    @staticmethod
    @once_differentiable
@@ -112,7 +138,7 @@ class SparseInverseConvFunction(Function):
    def backward(ctx, grad_output):
        indice_pairs, indice_pair_num, features, filters = ctx.saved_tensors
        timer = ctx.timer
-
+        try:
            input_bp, filters_bp = ops.indice_conv_backward(features,
                                                            filters,
                                                            grad_output,
@@ -122,6 +148,13 @@ class SparseInverseConvFunction(Function):
                                                            False,
                                                            algo=ctx.algo,
                                                            timer=timer)
+        except Exception as e:
+            msg = "[Exception|indice_conv_backward|inverse]"
+            msg += f"feat={features.shape},w={filters.shape},pair={indice_pairs.shape},"
+            msg += f"pairnum={indice_pair_num},do={grad_output.shape}"
+            print(msg, file=sys.stderr)
+            spconv_save_debug_data((indice_pairs, indice_pair_num))
+            raise e 

        return input_bp, filters_bp, None, None, None, None, None

@@ -143,13 +176,23 @@ class SparseImplicitGemmFunction(Function):
                is_train: bool,
                is_subm: bool,
                timer: CUDAKernelTimer = CUDAKernelTimer(False)):
-
+        try:
            out, mask_out, mask_width = ops.implicit_gemm(features, filters,
                                                        pair_fwd,
                                                        pair_mask_fwd_splits,
                                                        mask_argsort_fwd_splits,
                                                        num_activate_out, masks,
                                                        is_train, is_subm, timer)
+        except Exception as e:
+            msg = "[Exception|implicit_gemm]"
+            msg += f"feat={features.shape},w={filters.shape},pair={pair_fwd.shape},"
+            msg += f"act={num_activate_out},issubm={is_subm},istrain={is_train}"
+            print(msg, file=sys.stderr)
+            spconv_save_debug_data((pair_fwd, pair_bwd, pair_mask_fwd_splits, 
+                pair_mask_bwd_splits, mask_argsort_fwd_splits, mask_argsort_bwd_splits,
+                masks))
+            raise e 
+            
        ctx.save_for_backward(features, filters, pair_fwd, pair_bwd)
        ctx.mask_width = mask_width
        ctx.mask_out = mask_out
@@ -178,6 +221,7 @@ class SparseImplicitGemmFunction(Function):
        masks = ctx.masks
        is_subm = ctx.is_subm
        timer = ctx.timer
+        try:
            input_bp, filters_bp = ops.implicit_gemm_backward(
                features,
                filters,
@@ -193,6 +237,16 @@ class SparseImplicitGemmFunction(Function):
                mask_width=mask_width,
                is_subm=is_subm,
                timer=timer)
+        except Exception as e:
+            msg = "[Exception|implicit_gemm_backward]"
+            msg += f"feat={features.shape},w={filters.shape},pair={pair_fwd.shape},"
+            msg += f"issubm={is_subm},do={grad_output.shape}"
+            print(msg, file=sys.stderr)
+            spconv_save_debug_data((pair_fwd, pair_bwd, pair_mask_fwd_splits, 
+                pair_mask_bwd_splits, mask_argsort_fwd_splits, mask_argsort_bwd_splits,
+                masks))
+            raise e 
+
        None_9 = [None] * 11
        return (input_bp, filters_bp, *None_9)

@@ -211,6 +265,7 @@ class SubMConvFunction(Function):
        ctx.save_for_backward(indice_pairs, indice_pair_num, features, filters)
        ctx.algo = algo
        ctx.timer = timer
+        try:
            return ops.indice_conv(features,
                                filters,
                                indice_pairs,
@@ -220,6 +275,13 @@ class SubMConvFunction(Function):
                                True,
                                algo=algo,
                                timer=timer)
+        except Exception as e:
+            msg = "[Exception|indice_conv|subm]"
+            msg += f"feat={features.shape},w={filters.shape},pair={indice_pairs.shape},"
+            msg += f"pairnum={indice_pair_num},act={num_activate_out},algo={algo}"
+            print(msg, file=sys.stderr)
+            spconv_save_debug_data((indice_pairs, indice_pair_num))
+            raise e 

    @staticmethod
    @once_differentiable
@@ -227,7 +289,7 @@ class SubMConvFunction(Function):
    def backward(ctx, grad_output):
        indice_pairs, indice_pair_num, features, filters = ctx.saved_tensors
        timer = ctx.timer
-
+        try:
            input_bp, filters_bp = ops.indice_conv_backward(features,
                                                            filters,
                                                            grad_output,
@@ -237,6 +299,14 @@ class SubMConvFunction(Function):
                                                            True,
                                                            algo=ctx.algo,
                                                            timer=timer)
+        except Exception as e:
+            msg = "[Exception|indice_conv_backward|subm]"
+            msg += f"feat={features.shape},w={filters.shape},pair={indice_pairs.shape},"
+            msg += f"pairnum={indice_pair_num},do={grad_output.shape}"
+            print(msg, file=sys.stderr)
+            spconv_save_debug_data((indice_pairs, indice_pair_num))
+            raise e 
+

        return input_bp, filters_bp, None, None, None, None, None


--- a/spconv/pytorch/utils.py
+++ b/spconv/pytorch/utils.py
@@ -12,7 +12,7 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.

-from typing import List
+from typing import List, Union
 import torch
 from cumm import tensorview as tv

@@ -158,12 +158,16 @@ class PointToVoxel(object):
                    self.num_per_voxel[:num_voxels], pc_voxel_id)


-def gather_features_by_pc_voxel_id(seg_res_features: torch.Tensor, pc_voxel_id: torch.Tensor):
+def gather_features_by_pc_voxel_id(seg_res_features: torch.Tensor, pc_voxel_id: torch.Tensor, invalid_value: Union[int, float] = 0):
    """This function is used to gather segmentation result to match origin pc.
    """
    if seg_res_features.device != pc_voxel_id.device:
        pc_voxel_id = pc_voxel_id.to(seg_res_features.device)
-    res = torch.zeros((pc_voxel_id.shape[0], seg_res_features.shape[1]), dtype=seg_res_features.dtype, device=seg_res_features.device)
+    res_feature_shape = (pc_voxel_id.shape[0], *seg_res_features.shape[1:])
+    if invalid_value == 0:
+        res = torch.zeros(res_feature_shape, dtype=seg_res_features.dtype, device=seg_res_features.device)
+    else:
+        res = torch.full(res_feature_shape, invalid_value, dtype=seg_res_features.dtype, device=seg_res_features.device)
    pc_voxel_id_valid = pc_voxel_id != -1
    pc_voxel_id_valid_ids = torch.nonzero(pc_voxel_id_valid).view(-1)
    seg_res_features_valid = seg_res_features[pc_voxel_id[pc_voxel_id_valid_ids]]

--- a/spconv/utils/__init__.py
+++ b/spconv/utils/__init__.py
@@ -16,6 +16,7 @@ import numpy as np
 from cumm import tensorview as tv
 from contextlib import AbstractContextManager
 from spconv.cppconstants import CPU_ONLY_BUILD
+from spconv.core_cc.csrc.utils.boxops import BoxOps

 from spconv.core_cc.csrc.sparse.all.ops_cpu1d import Point2VoxelCPU as Point2VoxelCPU1d
 from spconv.core_cc.csrc.sparse.all.ops_cpu2d import Point2VoxelCPU as Point2VoxelCPU2d
@@ -47,3 +48,69 @@ class nullcontext(AbstractContextManager):

    def __exit__(self, *excinfo):
        pass
+
+
+def rbbox_iou(box_corners: np.ndarray, qbox_corners: np.ndarray,
+              standup_iou: np.ndarray, standup_thresh: float):
+    if not BoxOps.has_boost():
+        raise NotImplementedError(
+            "this op requires spconv built with boost; download boost, export BOOST_ROOT and rebuild."
+        )
+    N = box_corners.shape[0]
+    K = qbox_corners.shape[0]
+    overlap = np.zeros((N, K), dtype=box_corners.dtype)
+
+    BoxOps.rbbox_iou(tv.from_numpy(box_corners), tv.from_numpy(qbox_corners),
+                     tv.from_numpy(standup_iou), tv.from_numpy(overlap),
+                     standup_thresh, False)
+    return overlap
+
+
+def rbbox_intersection(box_corners: np.ndarray, qbox_corners: np.ndarray,
+                       standup_iou: np.ndarray, standup_thresh: float):
+    if not BoxOps.has_boost():
+        raise NotImplementedError(
+            "this op requires spconv built with boost; download boost, export BOOST_ROOT and rebuild."
+        )
+    N = box_corners.shape[0]
+    K = qbox_corners.shape[0]
+    overlap = np.zeros((N, K), dtype=box_corners.dtype)
+
+    BoxOps.rbbox_iou(tv.from_numpy(box_corners), tv.from_numpy(qbox_corners),
+                     tv.from_numpy(standup_iou), tv.from_numpy(overlap),
+                     standup_thresh, True)
+    return overlap
+
+
+def rbbox_iou_loss(box_corners: np.ndarray, qbox_corners: np.ndarray):
+    if not BoxOps.has_boost():
+        raise NotImplementedError(
+            "this op requires spconv built with boost; download boost, export BOOST_ROOT and rebuild."
+        )
+    N = box_corners.shape[0]
+    overlap = np.zeros((N, ), dtype=box_corners.dtype)
+
+    BoxOps.rbbox_iou_aligned(tv.from_numpy(box_corners),
+                             tv.from_numpy(qbox_corners),
+                             tv.from_numpy(overlap), False)
+    return overlap
+
+
+def non_max_suppression_cpu(boxes: np.ndarray,
+                            order: np.ndarray,
+                            thresh: float,
+                            eps: float = 0.0):
+    return BoxOps.non_max_suppression_cpu(tv.from_numpy(boxes),
+                                          tv.from_numpy(order), thresh, eps)
+
+
+def rotate_non_max_suppression_cpu(boxes: np.ndarray, order: np.ndarray,
+                                   standup_iou: np.ndarray, thresh: float):
+    if not BoxOps.has_boost():
+        raise NotImplementedError(
+            "this op requires spconv built with boost; download boost, export BOOST_ROOT and rebuild."
+        )
+    return BoxOps.rotate_non_max_suppression_cpu(tv.from_numpy(boxes),
+                                                 tv.from_numpy(order),
+                                                 tv.from_numpy(standup_iou),
+                                                 thresh)