v2.1.5: add profile tool and python 3.6 for linux

Commit 82fd7a8b authored Nov 10, 2021 by yan.yan
20 changed files
--- a/.github/workflows/build.yaml
+++ b/.github/workflows/build.yaml
@@ -89,7 +89,7 @@ jobs:
    runs-on: ubuntu-20.04
    strategy:
      matrix:
-        python-version: ['3.7', '3.8', '3.9', '3.10'] # this version is only used for upload.
+        python-version: ['3.6', '3.7', '3.8', '3.9', '3.10'] # this version is only used for upload.
        cuda-version: ['102', '111', '113', '114', '']

    steps:

--- a/.github/workflows/stale.yaml
+++ b/.github/workflows/stale.yaml
@@ -14,5 +14,6 @@ jobs:
    steps:
      - uses: actions/stale@v4
        with:
-          stale-issue-message: 'Close stale issues due to inactivity.'
-          stale-pr-message: 'Close stale PRs due to inactivity.'
+          stale-issue-message: 'Mark stale issues due to inactivity.'
+          stale-pr-message: 'Mark stale PRs due to inactivity.'
+          operations-per-run: 300
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
 # Changelog

+## [2.1.5] - 2021-11-10
+### Added
+- Add CUDA profile tool
+- Add Python 3.6 support
+### Changed
+- Format all code
+### Removed
+- Remove an unnecessary device sync, slightly improving performance.
+
 ## [2.1.0] - 2021-10-31
 ### Addad
 * add implicit gemm algorithm for all kind of convolution with kernel volume <= 32. this algorithm is very fast with float16.

--- a/README.md
+++ b/README.md
@@ -13,16 +13,36 @@
 See the License for the specific language governing permissions and
 limitations under the License.
 -->
-
-[pypi-download]: https://img.shields.io/pypi/dm/spconv-cu114
-[pypi-url]: https://pypi.org/project/spconv-cu114/
-[pypi-image]: https://badge.fury.io/py/spconv-cu114.svg
+[pypi-ver-cpu]: https://img.shields.io/pypi/v/spconv
+[pypi-ver-114]: https://img.shields.io/pypi/v/spconv-cu114
+[pypi-ver-111]: https://img.shields.io/pypi/v/spconv-cu111
+[pypi-ver-113]: https://img.shields.io/pypi/v/spconv-cu113
+[pypi-ver-102]: https://img.shields.io/pypi/v/spconv-cu102
+
+[pypi-url-111]: https://pypi.org/project/spconv-cu111/
+[pypi-download-111]: https://img.shields.io/pypi/dm/spconv-cu111
+[pypi-url-113]: https://pypi.org/project/spconv-cu113/
+[pypi-download-113]: https://img.shields.io/pypi/dm/spconv-cu113
+[pypi-url-102]: https://pypi.org/project/spconv-cu102/
+[pypi-download-102]: https://img.shields.io/pypi/dm/spconv-cu102
+[pypi-url-114]: https://pypi.org/project/spconv-cu114/
+[pypi-download-114]: https://img.shields.io/pypi/dm/spconv-cu114
+[pypi-url-cpu]: https://pypi.org/project/spconv/
+[pypi-download-cpu]: https://img.shields.io/pypi/dm/spconv

 # SpConv: Spatially Sparse Convolution Library
-[![Build Status](https://github.com/traveller59/spconv/workflows/build/badge.svg)](https://github.com/traveller59/spconv/actions?query=workflow%3Abuild) [![PyPI Version][pypi-image]][pypi-url] [![pypi monthly download][pypi-download]][pypi-url]
+[![Build Status](https://github.com/traveller59/spconv/workflows/build/badge.svg)](https://github.com/traveller59/spconv/actions?query=workflow%3Abuild) 
+
+|                | PyPi Version  | Downloads  |
+| -------------- |:---------------------:| ---------------------:| 
+| CPU (Linux Only) | [![PyPI Version][pypi-ver-cpu]][pypi-url-cpu] | [![pypi monthly download][pypi-download-cpu]][pypi-url-cpu] | 
+| CUDA 10.2 | [![PyPI Version][pypi-ver-102]][pypi-url-102] | [![pypi monthly download][pypi-download-102]][pypi-url-102] | 
+| CUDA 11.1 | [![PyPI Version][pypi-ver-111]][pypi-url-111] | [![pypi monthly download][pypi-download-111]][pypi-url-111]| 
+| CUDA 11.3 (Linux Only) | [![PyPI Version][pypi-ver-113]][pypi-url-113] |[![pypi monthly download][pypi-download-113]][pypi-url-113]| 
+| CUDA 11.4 | [![PyPI Version][pypi-ver-114]][pypi-url-114] | [![pypi monthly download][pypi-download-114]][pypi-url-114]| 


-```spconv``` is a project that provide heavily-optimized sparse convolution implementation with tensor core support.
+```spconv``` is a project that provides a heavily optimized sparse convolution implementation with tensor core support. Check [benchmark](docs/BENCHMARK.md) to see how fast spconv 2.x runs.

 [Spconv 1.x code](https://github.com/traveller59/spconv/tree/v1.2.1). We won't provide any support for spconv 1.x since it's deprecated. use spconv 2.x if possible. <!--remove this message in spconv 2.2-->

@@ -99,7 +119,10 @@ The c++ code will be built automatically when you change c++ code in project.

 For NVIDIA Embedded Platforms, you need to specify cuda arch before build: ```export CUMM_CUDA_ARCH_LIST="7.2"``` for xavier.

+After installing ```cumm``` in editable mode and before installing spconv, you need to remove ```cumm``` from the ```requires``` section of pyproject.toml because of a pyproject limitation (the build can't find an editable-installed ```cumm```).
+
 #### Linux
+
 0. uninstall spconv and cumm installed by pip
 1. install build-essential, install CUDA
 2. ```git clone https://github.com/FindDefinition/cumm```, ```cd ./cumm```, ```pip install -e .```
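
The manual pyproject.toml edit described in the README hunk above can be scripted. The following is a minimal sketch, assuming the `requires` list sits on a single line as it does in this commit's pyproject.toml; `drop_requirement` is a hypothetical helper written for illustration and is not part of spconv or cumm.

```python
import re


def drop_requirement(pyproject_text, package):
    """Remove `package` from a single-line `requires = [...]` list.

    Hypothetical helper for illustration; it assumes the list fits on one
    line and that every entry is a double-quoted string.
    """
    def rewrite(match):
        items = re.findall(r'"([^"]*)"', match.group(2))
        kept = [
            item for item in items
            # Compare only the distribution name, ignoring version specifiers.
            if re.split(r"[<>=!~\s]", item, maxsplit=1)[0] != package
        ]
        return match.group(1) + ", ".join('"{}"'.format(i) for i in kept) + "]"

    return re.sub(r"(requires\s*=\s*\[)([^\]]*)\]", rewrite, pyproject_text)


text = '''[build-system]
requires = ["setuptools>=41.0", "wheel", "pccm>=0.2.21", "cumm>=0.2.2"]
build-backend = "setuptools.build_meta"
'''
patched = drop_requirement(text, "cumm")
print(patched)
```

Matching on the distribution name rather than the full string means the helper keeps working when the pinned version changes, as it does in this commit (`cumm>=0.2.1` to `cumm>=0.2.2`).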

--- a/docs/BENCHMARK.md
+++ b/docs/BENCHMARK.md
+<!--
+ Copyright 2021 Yan Yan
+ 
+ Licensed under the Apache License, Version 2.0 (the "License");
+ you may not use this file except in compliance with the License.
+ You may obtain a copy of the License at
+ 
+     http://www.apache.org/licenses/LICENSE-2.0
+ 
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
+-->
+
+## Simple Benchmark
+
+### Network Benchmark without batchnorm (F32/F16) on RTX 3080 Laptop GPU
+
+Network Code: test/benchmark.py
+
+| F32/F16 | Spconv 1.x F32 (1080Ti) | Native| Implicit Gemm | Implicit Gemm Split Mask  |
+| -------------- |:---------------------:|---------------------:|---------------------:| ---------------------:|
+| Forward | 43ms     | 21.7ms/13.7ms    | 23.5ms/11.2ms      | 22ms/12.2ms      |
+| Backward | 80ms    | 41.9ms/25.2ms    | 51.0ms/13.8ms      | 41.1ms/12.2ms      |
+
+### Network Gemm Kernel Benchmark FP16 on RTX 3080 Laptop GPU
+
+Network Code: test/benchmark.py
+
+The network/input/profile code is the same as for the table above.
+
+This table profiles only the **fp16 gemm kernels**, excluding output tensor creation/clearing overhead. It shows the performance upper bound of our algorithm.
+
+| F16 |  Native| Implicit Gemm | Implicit Gemm Split Mask  |
+| -------------- |:---------------------:|---------------------:| ---------------------:|
+| Forward | 8.0ms    | 4.3ms      | 4.0ms      |
+
+We can see that implicit gemm is very fast: the gemm kernels take only 4.3ms of the 11.2ms network forward pass. We can achieve better performance with TensorRT + pure C++.
+
+**NOTE** 
+When you benchmark a network on your laptop, don't forget to close all apps except terminals! Other apps consume GPU resources and make kernels run slower.
+
+
+## Comparison with [MinkowskiEngine](https://github.com/NVIDIA/MinkowskiEngine) and [torchsparse](https://github.com/mit-han-lab/torchsparse)
+
+TODO
\ No newline at end of file
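
The ratios implied by the benchmark tables above can be worked out directly. This is a small sketch that only re-uses the timings printed in the new docs/BENCHMARK.md; the dictionary and function names are chosen for illustration.

```python
# Timings in milliseconds, copied from the BENCHMARK.md tables: (F32, F16).
forward = {
    "Native": (21.7, 13.7),
    "Implicit Gemm": (23.5, 11.2),
    "Implicit Gemm Split Mask": (22.0, 12.2),
}
backward = {
    "Native": (41.9, 25.2),
    "Implicit Gemm": (51.0, 13.8),
    "Implicit Gemm Split Mask": (41.1, 12.2),
}
# FP16 gemm-kernel-only forward time from the second table.
gemm_only_f16_forward = {
    "Native": 8.0,
    "Implicit Gemm": 4.3,
    "Implicit Gemm Split Mask": 4.0,
}


def f16_speedup(timings):
    """Return the F32 / F16 time ratio for each algorithm."""
    return {name: f32 / f16 for name, (f32, f16) in timings.items()}


# Share of the F16 network forward pass spent inside gemm kernels.
gemm_share = {
    name: gemm_only_f16_forward[name] / forward[name][1]
    for name in forward
}

for name in forward:
    print("{}: fwd F16 speedup {:.2f}x, bwd F16 speedup {:.2f}x, "
          "gemm share of fwd {:.0%}".format(
              name,
              f16_speedup(forward)[name],
              f16_speedup(backward)[name],
              gemm_share[name]))
```

For implicit gemm the gemm kernels account for roughly 38% of the F16 forward pass (4.3ms of 11.2ms), so the rest of the time is spent outside the gemm kernels.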
--- a/docs/PERFORMANCE_GUIDE.md
+++ b/docs/PERFORMANCE_GUIDE.md
@@ -25,12 +25,7 @@
 * make sure your channel size is multiple of 8 when using fp16. multiple of 32 is better.
 * spconv 2.x in Windows 10 is 1.5x~2x slower than Linux. use Linux if possible.

-Network Benchmark without batchnorm (F32/F16) in RTX 3080 Laptop GPU
-
-| F32/F16 | Spconv 1.x | Native| Implicit Gemm | Implicit Gemm Split Mask  |
-| -------------- |:---------------------:|---------------------:|---------------------:| ---------------------:|
-| Forward | 43ms     | 29ms/23ms    | 30ms/15ms      | 30ms/19ms      |
-| Backward | 80ms    | 47ms/32ms    | 56ms/15ms      | 45ms/14ms      |
+See [benchmark](BENCHMARK.md) for more performance details of different algorithms.

 ## Algorithm Overview

@@ -57,4 +52,4 @@ In my test, ```Implicit Gemm``` is almost 2x faster than ```Native```.

 TODO

-In my test, ```Implicit Gemm Split Mask``` is slightly faster than ```Implicit Gemm```, but the indice generation is greatly slower, so currently we use ```Implicit Gemm``` by default.
\ No newline at end of file
+In my test, ```Implicit Gemm Split Mask``` is slightly faster than ```Implicit Gemm```, but the indice generation is slower, so currently we use ```Implicit Gemm``` by default.
\ No newline at end of file
--- a/example/mnist_sparse.py
+++ b/example/mnist_sparse.py
@@ -24,6 +24,7 @@ from torch.optim.lr_scheduler import StepLR
 import contextlib
 import torch.cuda.amp

+
 @contextlib.contextmanager
 def identity_ctx():
    yield
@@ -46,7 +47,6 @@ class Net(nn.Module):
        self.dropout1 = nn.Dropout2d(0.25)
        self.dropout2 = nn.Dropout2d(0.5)

-
    def forward(self, x: torch.Tensor):
        # x: [N, 28, 28, 1], must be NHWC tensor
        x_sp = spconv.SparseConvTensor.from_dense(x.reshape(-1, 28, 28, 1))
@@ -116,13 +116,17 @@ def test(args, model, device, test_loader):
            with amp_ctx:

                output = model(data)
-            test_loss += F.nll_loss(output, target, reduction='sum').item()  # sum up batch loss
-            pred = output.argmax(dim=1, keepdim=True)  # get the index of the max log-probability
+            test_loss += F.nll_loss(
+                output, target, reduction='sum').item()  # sum up batch loss
+            pred = output.argmax(
+                dim=1,
+                keepdim=True)  # get the index of the max log-probability
            correct += pred.eq(target.view_as(pred)).sum().item()

    test_loss /= len(test_loader.dataset)

-    print('\nTest set: Average loss: {:.4f}, Accuracy: {}/{} ({:.0f}%)\n'.format(
+    print(
+        '\nTest set: Average loss: {:.4f}, Accuracy: {}/{} ({:.0f}%)\n'.format(
            test_loss, correct, len(test_loader.dataset),
            100. * correct / len(test_loader.dataset)))

@@ -130,26 +134,54 @@ def test(args, model, device, test_loader):
 def main():
    # Training settings
    parser = argparse.ArgumentParser(description='PyTorch MNIST Example')
-    parser.add_argument('--batch-size', type=int, default=64, metavar='N',
+    parser.add_argument('--batch-size',
+                        type=int,
+                        default=64,
+                        metavar='N',
                        help='input batch size for training (default: 64)')
-    parser.add_argument('--test-batch-size', type=int, default=1000, metavar='N',
+    parser.add_argument('--test-batch-size',
+                        type=int,
+                        default=1000,
+                        metavar='N',
                        help='input batch size for testing (default: 1000)')
-    parser.add_argument('--epochs', type=int, default=14, metavar='N',
+    parser.add_argument('--epochs',
+                        type=int,
+                        default=14,
+                        metavar='N',
                        help='number of epochs to train (default: 14)')
-    parser.add_argument('--lr', type=float, default=1.0, metavar='LR',
+    parser.add_argument('--lr',
+                        type=float,
+                        default=1.0,
+                        metavar='LR',
                        help='learning rate (default: 1.0)')
-    parser.add_argument('--gamma', type=float, default=0.7, metavar='M',
+    parser.add_argument('--gamma',
+                        type=float,
+                        default=0.7,
+                        metavar='M',
                        help='Learning rate step gamma (default: 0.7)')
-    parser.add_argument('--no-cuda', action='store_true', default=False,
+    parser.add_argument('--no-cuda',
+                        action='store_true',
+                        default=False,
                        help='disables CUDA training')
-    parser.add_argument('--seed', type=int, default=1, metavar='S',
+    parser.add_argument('--seed',
+                        type=int,
+                        default=1,
+                        metavar='S',
                        help='random seed (default: 1)')
-    parser.add_argument('--log-interval', type=int, default=10, metavar='N',
+    parser.add_argument(
+        '--log-interval',
+        type=int,
+        default=10,
+        metavar='N',
        help='how many batches to wait before logging training status')

-    parser.add_argument('--save-model', action='store_true', default=False,
+    parser.add_argument('--save-model',
+                        action='store_true',
+                        default=False,
                        help='For Saving the current Model')
-    parser.add_argument('--fp16', action='store_true', default=False,
+    parser.add_argument('--fp16',
+                        action='store_true',
+                        default=False,
                        help='For mixed precision training')

    args = parser.parse_args()
@@ -161,20 +193,30 @@ def main():

    kwargs = {'num_workers': 1, 'pin_memory': True} if use_cuda else {}
    train_loader = torch.utils.data.DataLoader(
-        datasets.MNIST('../data', train=True, download=True,
+        datasets.MNIST(
+            '../data',
+            train=True,
+            download=True,
            transform=transforms.Compose([
                transforms.ToTensor(),
                # here we remove norm to get sparse tensor with lots of zeros
                # transforms.Normalize((0.1307,), (0.3081,))
            ])),
-        batch_size=args.batch_size, shuffle=True, **kwargs)
+        batch_size=args.batch_size,
+        shuffle=True,
+        **kwargs)
    test_loader = torch.utils.data.DataLoader(
-        datasets.MNIST('../data', train=False, transform=transforms.Compose([
+        datasets.MNIST(
+            '../data',
+            train=False,
+            transform=transforms.Compose([
                transforms.ToTensor(),
                # here we remove norm to get sparse tensor with lots of zeros
                # transforms.Normalize((0.1307,), (0.3081,))
            ])),
-        batch_size=args.test_batch_size, shuffle=True, **kwargs)
+        batch_size=args.test_batch_size,
+        shuffle=True,
+        **kwargs)

    model = Net().to(device)
    optimizer = optim.Adadelta(model.parameters(), lr=args.lr)

--- a/example/voxel_gen.py
+++ b/example/voxel_gen.py
@@ -19,10 +19,10 @@ from spconv.utils import Point2VoxelCPU3d
 from spconv.pytorch.utils import PointToVoxel
 import torch

+
 def main():
    # voxel gen source code: spconv/csrc/sparse/pointops.py
-    gen = Point2VoxelCPU3d(
-        vsize_xyz=[0.1, 0.1, 0.1], 
+    gen = Point2VoxelCPU3d(vsize_xyz=[0.1, 0.1, 0.1],
                           coors_range_xyz=[-80, -80, -2, 80, 80, 6],
                           num_point_features=3,
                           max_num_voxels=5000,
@@ -39,19 +39,22 @@ def main():
    print("------Raw Voxels-------")
    print(voxels_np[0])
    # run voxel gen and FILL MEAN VALUE to voxel remain
-    voxels_tv, indices_tv, num_p_in_vx_tv = gen.point_to_voxel_empty_mean(pc_tv)
+    voxels_tv, indices_tv, num_p_in_vx_tv = gen.point_to_voxel_empty_mean(
+        pc_tv)
    voxels_np = voxels_tv.numpy_view()
    indices_np = indices_tv.numpy_view()
    num_p_in_vx_np = num_p_in_vx_tv.numpy_view()
    print("------Voxels with mean filled-------")
    print(voxels_np[0])

+
 def main_point_with_features():
    # voxel gen source code: spconv/csrc/sparse/pointops.py
    gen = Point2VoxelCPU3d(
        vsize_xyz=[0.1, 0.1, 0.1],
        coors_range_xyz=[-80, -80, -2, 80, 80, 6],
-        num_point_features=4,  # here num_point_features must equal to pc.shape[1]
+        num_point_features=
+        4,  # here num_point_features must equal to pc.shape[1]
        max_num_voxels=5000,
        max_num_points_per_voxel=5)

@@ -68,17 +71,18 @@ def main_point_with_features():
    print("------Raw Voxels-------")
    print(voxels_np[0])
    # run voxel gen and FILL MEAN VALUE to voxel remain
-    voxels_tv, indices_tv, num_p_in_vx_tv = gen.point_to_voxel_empty_mean(pc_tv)
+    voxels_tv, indices_tv, num_p_in_vx_tv = gen.point_to_voxel_empty_mean(
+        pc_tv)
    voxels_np = voxels_tv.numpy_view()
    indices_np = indices_tv.numpy_view()
    num_p_in_vx_np = num_p_in_vx_tv.numpy_view()
    print("------Voxels with mean filled-------")
    print(voxels_np[0])

+
 def main_pytorch_voxel_gen():
    # voxel gen source code: spconv/csrc/sparse/pointops.py
-    gen = PointToVoxel(
-        vsize_xyz=[0.1, 0.1, 0.1], 
+    gen = PointToVoxel(vsize_xyz=[0.1, 0.1, 0.1],
                       coors_range_xyz=[-80, -80, -2, 80, 80, 6],
                       num_point_features=3,
                       max_num_voxels=5000,
@@ -100,11 +104,11 @@ def main_pytorch_voxel_gen():
    print("------Voxels with mean filled-------")
    print(voxels_np[0])

+
 def main_pytorch_voxel_gen_cuda():
    # voxel gen source code: spconv/csrc/sparse/pointops.py
    device = torch.device("cuda:0")
-    gen = PointToVoxel(
-        vsize_xyz=[0.1, 0.1, 0.1], 
+    gen = PointToVoxel(vsize_xyz=[0.1, 0.1, 0.1],
                       coors_range_xyz=[-80, -80, -2, 80, 80, 6],
                       num_point_features=3,
                       max_num_voxels=5000,

--- a/format_all.sh
+++ b/format_all.sh
-isort -rc --atomic ./spconv && \
-isort -rc --atomic ./test && \
-yapf -i --recursive -vv ./spconv ./test
-find ./src -regex '.*\.\(cpp\|hpp\|cc\|cxx\|cu\|cuh\|h\)' | xargs clang-format -i
-find ./include -regex '.*\.\(cpp\|hpp\|cc\|cxx\|cu\|cuh\|h\)' | xargs clang-format -i
\ No newline at end of file
+yapf -i --recursive -vv ./spconv ./test ./example ./scripts
--- a/pyproject.toml
+++ b/pyproject.toml
 [build-system]
-requires = ["setuptools>=41.0", "wheel", "pccm>=0.2.21", "cumm>=0.2.1"]
+requires = ["setuptools>=41.0", "wheel", "pccm>=0.2.21", "cumm>=0.2.2"]
 build-backend = "setuptools.build_meta"
--- a/scripts/dev_subm.py
+++ b/scripts/dev_subm.py
@@ -29,10 +29,11 @@ from spconv.pytorch import ops
 from spconv.algo import CONV, BestConvAlgoByProfile
 from spconv.pytorch.cppcore import torch_tensor_to_tv

+
 def reduce_mask_count(mask: np.ndarray, width: int):
    mask_length_32 = (div_up(mask.shape[0], width)) * width
    if mask.shape[0] < mask_length_32:
-        mask_pad = np.zeros((mask_length_32,), dtype=mask.dtype)
+        mask_pad = np.zeros((mask_length_32, ), dtype=mask.dtype)
        mask_pad[:mask.shape[0]] = mask
        mask = mask_pad
    mask = mask.reshape(-1, width)
@@ -40,16 +41,18 @@ def reduce_mask_count(mask: np.ndarray, width: int):
    maskr_tv = tv.from_numpy(maskr)
    return SpconvOps.count_bits(maskr_tv).numpy().sum() * width

+
 def reduce_mask_count_x(mask: np.ndarray, width: int):
    mask_length_32 = (div_up(mask.shape[0], width)) * width
    if mask.shape[0] < mask_length_32:
-        mask_pad = np.zeros((mask_length_32,), dtype=mask.dtype)
+        mask_pad = np.zeros((mask_length_32, ), dtype=mask.dtype)
        mask_pad[:mask.shape[0]] = mask
        mask = mask_pad
    mask = mask.reshape(-1, width)
    maskr = np.bitwise_or.reduce(mask, axis=1)
    return maskr

+
 def dev_subm_inds_v2(subm: bool = False, run_conv: bool = True):
    limit_input_n = 16384
    limit_input_n = None
@@ -88,8 +91,9 @@ def dev_subm_inds_v2(subm: bool = False, run_conv: bool = True):
        stride = [1] * ndim
        dilation = [1] * ndim
        out_padding = [0] * ndim
-    out_inds, pair_ref, indice_num_per_loc = ops.get_indice_pairs(indices_th, 1, spatial_shape, ConvAlgo.Native, 
-            ksize, stride, padding, dilation, out_padding, subm)
+    out_inds, pair_ref, indice_num_per_loc = ops.get_indice_pairs(
+        indices_th, 1, spatial_shape, ConvAlgo.Native, ksize, stride, padding,
+        dilation, out_padding, subm)
    indice_num_per_loc_np = indice_num_per_loc.cpu().numpy()
    indice_pairs_np = pair_ref.cpu().numpy()
    algo = ConvAlgo.MaskSplitImplicitGemm
@@ -98,8 +102,9 @@ def dev_subm_inds_v2(subm: bool = False, run_conv: bool = True):
    else:
        num_split = 2
    for i in range(5):
-        res = ops.get_indice_pairs_implicit_gemm(indices_th, 1, spatial_shape, algo, 
-            ksize, stride, padding, dilation, out_padding, subm)
+        res = ops.get_indice_pairs_implicit_gemm(indices_th, 1, spatial_shape,
+                                                 algo, ksize, stride, padding,
+                                                 dilation, out_padding, subm)
    out_inds = res[0]
    num_inds_per_loc = res[1]
    pair_fwd = res[2]
@@ -115,23 +120,38 @@ def dev_subm_inds_v2(subm: bool = False, run_conv: bool = True):
    mask_argsort_fwd_splits = res[6]
    mask_argsort_bwd_splits = res[7]
    masks = res[8]
-    pair_mask_fwd_splits_tv = [ops.torch_tensor_to_tv(t, dtype=tv.uint32) for t in pair_mask_fwd_splits]
-    valid_location_bitcount = [SpconvOps.count_bits(t) for t in pair_mask_fwd_splits_tv]
-    valid_location_count = sum([t.cpu().numpy().sum() for t in valid_location_bitcount])
+    pair_mask_fwd_splits_tv = [
+        ops.torch_tensor_to_tv(t, dtype=tv.uint32)
+        for t in pair_mask_fwd_splits
+    ]
+    valid_location_bitcount = [
+        SpconvOps.count_bits(t) for t in pair_mask_fwd_splits_tv
+    ]
+    valid_location_count = sum(
+        [t.cpu().numpy().sum() for t in valid_location_bitcount])
    reduce_length = 32
-    split_mask_valid_count = sum([reduce_mask_count(t.cpu().numpy(), reduce_length) for t in pair_mask_fwd_splits_tv])
+    split_mask_valid_count = sum([
+        reduce_mask_count(t.cpu().numpy(), reduce_length)
+        for t in pair_mask_fwd_splits_tv
+    ])
    if subm:
-        print("SUBM", valid_location_count, split_mask_valid_count, pair_fwd.numel())
+        print("SUBM", valid_location_count, split_mask_valid_count,
+              pair_fwd.numel())
    else:
-        print("REGULAR", valid_location_count, split_mask_valid_count, pair_fwd.numel())
+        print("REGULAR", valid_location_count, split_mask_valid_count,
+              pair_fwd.numel())
    # return

    if run_conv:
        C = 64
        K = 64
        desps = CONV.desps
-        mask_output_fwd = torch.zeros([2, div_up(out_inds.shape[0], 32)], dtype=torch.int32, device=indices_th.device)
-        mask_output_bwd = torch.zeros([2, div_up(indices.dim(0), 32)], dtype=torch.int32, device=indices_th.device)
+        mask_output_fwd = torch.zeros([2, div_up(out_inds.shape[0], 32)],
+                                      dtype=torch.int32,
+                                      device=indices_th.device)
+        mask_output_bwd = torch.zeros([2, div_up(indices.dim(0), 32)],
+                                      dtype=torch.int32,
+                                      device=indices_th.device)

        for desp in desps:
            if desp.algo != GemmAlgo.Simt.value:
@@ -140,17 +160,22 @@ def dev_subm_inds_v2(subm: bool = False, run_conv: bool = True):
            #     continue
            # if desp.tile_shape !
            if desp.dtype_a == dtypes.int8.tv_dtype:
-                inp = np.random.randint(-1, 1, size=[voxels_np.shape[0], C]).astype(np.int8)
-                weight = np.random.randint(-1, 1, size=[K, *ksize, C]).astype(np.int8)
-                output = np.random.randint(-1, 1, size=[out_inds.shape[0], K]).astype(
-                    dtypes.get_npdtype_from_tvdtype(desp.dtype_output))
+                inp = np.random.randint(-1, 1, size=[voxels_np.shape[0],
+                                                     C]).astype(np.int8)
+                weight = np.random.randint(-1, 1, size=[K, *ksize,
+                                                        C]).astype(np.int8)
+                output = np.random.randint(-1, 1, size=[
+                    out_inds.shape[0], K
+                ]).astype(dtypes.get_npdtype_from_tvdtype(desp.dtype_output))
            else:
-                inp = np.random.uniform(-1, 1, size=[voxels_np.shape[0], C]).astype(
-                    dtypes.get_npdtype_from_tvdtype(desp.dtype_input))
+                inp = np.random.uniform(-1, 1, size=[
+                    voxels_np.shape[0], C
+                ]).astype(dtypes.get_npdtype_from_tvdtype(desp.dtype_input))
                weight = np.random.uniform(-1, 1, size=[K, *ksize, C]).astype(
                    dtypes.get_npdtype_from_tvdtype(desp.dtype_weight))
-                output = np.random.uniform(-1, 1, size=[out_inds.shape[0], K]).astype(
-                    dtypes.get_npdtype_from_tvdtype(desp.dtype_output))
+                output = np.random.uniform(-1, 1, size=[
+                    out_inds.shape[0], K
+                ]).astype(dtypes.get_npdtype_from_tvdtype(desp.dtype_output))
            weight_ref = weight.transpose(1, 2, 3, 0, 4)
            weight_ref = np.ascontiguousarray(weight_ref).reshape(-1, K, C)
            if desp.op_type == ConvOpType.kBackwardInput.value:
@@ -270,7 +295,9 @@ def dev_subm_inds_v2(subm: bool = False, run_conv: bool = True):
                    c_inds = indice_pairs_np[1][filter_offset][:nhot]
                    # print(a_inds_cpu[:10])
                    a = inp[a_inds]
-                    cc = a.astype(np.float32) @ weight_ref[filter_offset].T.astype(np.float32)
+                    cc = a.astype(
+                        np.float32) @ weight_ref[filter_offset].T.astype(
+                            np.float32)
                    output_ref[c_inds] += cc

                output_cpu = output_tv.cpu().numpy().astype(np.float32)
@@ -294,12 +321,18 @@ def dev_subm_inds_v2(subm: bool = False, run_conv: bool = True):
                    # print(a_inds_cpu[:10])
                    a = output[a_inds]
                    # NK @ KC
-                    cc = a.astype(np.float32) @ weight_ref[filter_offset].astype(np.float32)
+                    cc = a.astype(
+                        np.float32) @ weight_ref[filter_offset].astype(
+                            np.float32)
                    dinput_ref[c_inds] += cc
                din_cpu = inp_tv.cpu().numpy()
-                print("ERROR", np.linalg.norm(din_cpu.reshape(-1) - dinput_ref.reshape(-1)))
+                print(
+                    "ERROR",
+                    np.linalg.norm(
+                        din_cpu.reshape(-1) - dinput_ref.reshape(-1)))
            else:
-                dw_ref = np.zeros_like(weight_ref, dtype=np.float32) # KV, K, C
+                dw_ref = np.zeros_like(weight_ref,
+                                       dtype=np.float32)  # KV, K, C
                for filter_offset in range(kv):
                    if subm and filter_offset > kv // 2:
                        nhot = indice_num_per_loc_np[kv - 1 - filter_offset]
@@ -313,13 +346,17 @@ def dev_subm_inds_v2(subm: bool = False, run_conv: bool = True):
                    out_gather = output[o_inds]  # [N, K]
                    inp_gather = inp[i_inds]  # [N, C]
                    # KN @ NC
-                    dw_res = out_gather.astype(np.float32).T @ inp_gather.astype(np.float32)
+                    dw_res = out_gather.astype(
+                        np.float32).T @ inp_gather.astype(np.float32)
                    dw_ref[filter_offset] = dw_res
                # print(indice_pairs_np_test[0])
                dw_ref_kcrs = dw_ref.transpose(1, 0, 2)
                dw_cpu = weight_tv.cpu().numpy().reshape(K, np.prod(ksize), C)

-                print("ERROR", np.linalg.norm(dw_cpu.reshape(-1) - dw_ref_kcrs.reshape(-1)))
+                print(
+                    "ERROR",
+                    np.linalg.norm(
+                        dw_cpu.reshape(-1) - dw_ref_kcrs.reshape(-1)))


 if __name__ == "__main__":
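
The padding and reduce steps in `reduce_mask_count` can be followed on a tiny input. This is a pure-Python sketch, not the script's code: it uses lists instead of NumPy arrays, replaces `SpconvOps.count_bits` with a Python popcount, and assumes the line elided between the two hunks is the same bitwise-OR reduce shown in `reduce_mask_count_x`.

```python
from functools import reduce
from operator import or_
from typing import List


def div_up(a: int, b: int) -> int:
    """Integer division rounded up."""
    return (a + b - 1) // b


def reduce_mask_count(mask: List[int], width: int) -> int:
    """List-based analog of the NumPy helper in scripts/dev_subm.py."""
    # Pad with zeros so the length is a multiple of `width`.
    padded_len = div_up(len(mask), width) * width
    padded = mask + [0] * (padded_len - len(mask))
    # Equivalent of reshape(-1, width): one row per group of `width` masks.
    rows = [padded[i:i + width] for i in range(0, padded_len, width)]
    # OR the masks inside each group, then count the set bits per group.
    reduced = [reduce(or_, row, 0) for row in rows]
    return sum(bin(m).count("1") for m in reduced) * width


masks = [0b001, 0b010, 0b001, 0b100, 0b100]
per_row_bits = sum(bin(m).count("1") for m in masks)
print(per_row_bits, reduce_mask_count(masks, 2))
```

With `width=1` the function returns the plain per-row bit count; with larger widths the result grows whenever rows in the same group have different bits set.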

--- a/scripts/sort_bench.py
+++ b/scripts/sort_bench.py
@@ -20,6 +20,7 @@ import torch

 from spconv.pytorch.cppcore import torch_tensor_to_tv

+
 def main():
    with open("/home/yy/asd.pkl", "rb") as f:
        a_th = pickle.load(f)
@@ -34,5 +35,6 @@ def main():
        a_tv_1 = a_tv.clone()
        SpconvOps.sort_1d_by_key(a_tv_1[0], mask_argsort_tv[0])

+
 if __name__ == "__main__":
    main()
--- a/setup.py
+++ b/setup.py
@@ -38,9 +38,9 @@ if cuda_ver:
    cuda_ver = cuda_ver.replace(".", "") # 10.2 to 102

    RELEASE_NAME += "-cu{}".format(cuda_ver)
-    deps = ["cumm-cu{}".format(cuda_ver)]
+    deps = ["cumm-cu{}>=0.2.2".format(cuda_ver)]
 else:
-    deps = ["cumm"]
+    deps = ["cumm>=0.2.2"]



@@ -48,11 +48,11 @@ DESCRIPTION = 'spatial sparse convolution'
 URL = 'https://github.com/traveller59/spconv'
 EMAIL = 'yanyan.sub@outlook.com'
 AUTHOR = 'Yan Yan'
-REQUIRES_PYTHON = '>=3.7'
+REQUIRES_PYTHON = '>=3.6'
 VERSION = None

 # What packages are required for this module to be executed?
-REQUIRED = ["pccm>=0.2.19", "pybind11>=2.6.0", "fire", "numpy", *deps]
+REQUIRED = ["pccm>=0.2.21", "pybind11>=2.6.0", "fire", "numpy", *deps]

 # What packages are optional?
 EXTRAS = {
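
The dependency naming in the setup.py hunk above can be isolated into one function. This is a minimal sketch; `cumm_dependency` is a hypothetical name, and how `cuda_ver` is obtained is not shown in this diff, so it is simply passed in as a string.

```python
def cumm_dependency(cuda_ver: str, min_version: str = "0.2.2") -> str:
    """Build the cumm requirement string the way the setup.py hunk does.

    Hypothetical helper: an empty `cuda_ver` selects the CPU package.
    """
    if cuda_ver:
        # "10.2" becomes "102", matching the "-cu102" release suffix.
        return "cumm-cu{}>={}".format(cuda_ver.replace(".", ""), min_version)
    return "cumm>={}".format(min_version)


for ver in ["10.2", "11.1", "11.3", "11.4", ""]:
    print(repr(ver), "->", cumm_dependency(ver))
```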

--- a/spconv/__init__.py
+++ b/spconv/__init__.py
--- a/spconv/algo.py
+++ b/spconv/algo.py
@@ -24,9 +24,10 @@ from spconv.constants import NDIM_DONT_CARE
 from typing import Optional
 import time
 from threading import Lock
-import torch 
+import contextlib
 import numpy as np
 from spconv.core import ConvAlgo, AlgoHint
+from spconv.tools import CUDAKernelTimer

 ALL_ALGO_DESPS = GemmMainUnitTest.get_all_algo_desp()
 ALL_CONV_ALGO_DESPS = ConvMainUnitTest.get_all_conv_algo_desp()
@@ -403,7 +404,8 @@ class SimpleGemm:
        alpha: float = 1.0,
        beta: float = 0.0,
        gather_data: tv.Tensor = tv.Tensor(),
-        workspace: tv.Tensor = tv.Tensor()):
+        workspace: tv.Tensor = tv.Tensor(),
+        timer: CUDAKernelTimer = CUDAKernelTimer(False)):
        m, n, k = GemmMainUnitTest.extract_mnk(a.shape, b.shape, trans_a,
                                               trans_b, trans_c,
                                               shuffle_type.value,
@@ -446,6 +448,9 @@ class SimpleGemm:
        #                    stream=stream)
        #     GemmMainUnitTest.stream_synchronize(stream)
        #     gather = time.time() - tt
+        if timer.enable:
+            assert timer._timer is not None
+            params.timer = timer._timer

        GemmMainUnitTest.matmul2(params)
        # GemmMainUnitTest.stream_synchronize(stream)
@@ -678,7 +683,8 @@ class SimpleConv:
                              beta: float = 0.0,
                              stream: int = 0,
                              workspace: tv.Tensor = tv.Tensor(),
-                              verbose: bool = False):
+                              verbose: bool = False,
+                              timer: CUDAKernelTimer = CUDAKernelTimer(False)):
        channel_k = output.dim(1)
        channel_c = inp.dim(1)
        # GemmMainUnitTest.stream_synchronize(stream)
@@ -709,9 +715,11 @@ class SimpleConv:
        params.mask_filter = mask_filter
        params.mask_output = mask_output
        params.reverse_mask = reverse_mask
+        if timer.enable:
+            assert timer._timer is not None
+            params.timer = timer._timer
        # torch.cuda.synchronize()
        # t = time.time()
-
        params.workspace = workspace
        ConvMainUnitTest.implicit_gemm2(params)
        # torch.cuda.synchronize()
@@ -724,6 +732,7 @@ class SimpleConv:
    def stream_synchronize(self, stream: int):
        return GemmMainUnitTest.stream_synchronize(stream)

+
 GEMM = SimpleGemm(ALL_ALGO_DESPS)
 CONV = SimpleConv(ALL_CONV_ALGO_DESPS)


--- a/spconv/build.py
+++ b/spconv/build.py
@@ -19,7 +19,8 @@ from pccm.utils import project_is_editable, project_is_installed
 from ccimport.compat import InWindows
 from .constants import PACKAGE_NAME, PACKAGE_ROOT, DISABLE_JIT

-if project_is_installed(PACKAGE_NAME) and project_is_editable(PACKAGE_NAME) and not DISABLE_JIT:
+if project_is_installed(PACKAGE_NAME) and project_is_editable(
+        PACKAGE_NAME) and not DISABLE_JIT:
    from spconv.core import SHUFFLE_SIMT_PARAMS, SHUFFLE_VOLTA_PARAMS, SHUFFLE_TURING_PARAMS
    from spconv.core import IMPLGEMM_SIMT_PARAMS, IMPLGEMM_VOLTA_PARAMS, IMPLGEMM_TURING_PARAMS

@@ -27,9 +28,11 @@ if project_is_installed(PACKAGE_NAME) and project_is_editable(PACKAGE_NAME) and
    from cumm.conv.main import ConvMainUnitTest

    from spconv.csrc.sparse.all import SpconvOps
-    cu = GemmMainUnitTest(SHUFFLE_SIMT_PARAMS + SHUFFLE_VOLTA_PARAMS + SHUFFLE_TURING_PARAMS)
+    cu = GemmMainUnitTest(SHUFFLE_SIMT_PARAMS + SHUFFLE_VOLTA_PARAMS +
+                          SHUFFLE_TURING_PARAMS)
    cu.namespace = "cumm.gemm.main"
-    convcu = ConvMainUnitTest(IMPLGEMM_SIMT_PARAMS + IMPLGEMM_VOLTA_PARAMS + IMPLGEMM_TURING_PARAMS)
+    convcu = ConvMainUnitTest(IMPLGEMM_SIMT_PARAMS + IMPLGEMM_VOLTA_PARAMS +
+                              IMPLGEMM_TURING_PARAMS)
    convcu.namespace = "cumm.conv.main"
    objects_folder = None
    if InWindows:

--- a/spconv/constants.py
+++ b/spconv/constants.py
@@ -20,8 +20,8 @@ from pccm.utils import project_is_editable, project_is_installed
 PACKAGE_NAME = "spconv"
 PACKAGE_ROOT = Path(__file__).parent.resolve()

-EDITABLE_INSTALLED = project_is_installed(PACKAGE_NAME) and project_is_editable(PACKAGE_NAME)
-
+EDITABLE_INSTALLED = project_is_installed(
+    PACKAGE_NAME) and project_is_editable(PACKAGE_NAME)

 _filter_hwio_env = os.getenv("SPCONV_FILTER_HWIO", "0")
 FILTER_HWIO = _filter_hwio_env == "1"

--- a/spconv/core.py
+++ b/spconv/core.py
--- a/spconv/core_cc/__init__.pyi
+++ b/spconv/core_cc/__init__.pyi
-# Copyright 2021 Yan Yan
-# 
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-# 
-#     http://www.apache.org/licenses/LICENSE-2.0
-# 
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
--- a/spconv/core_cc/csrc/__init__.pyi
+++ b/spconv/core_cc/csrc/__init__.pyi
-# Copyright 2021 Yan Yan
-# 
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-# 
-#     http://www.apache.org/licenses/LICENSE-2.0
-# 
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-