v2.1

Commit eae6a3bd authored Nov 07, 2021 by yan.yan
20 changed files
--- a/.github/workflows/build.yaml
+++ b/.github/workflows/build.yaml
@@ -16,7 +16,7 @@ jobs:
    strategy:
      matrix:
        python-version: ['3.7', '3.8', '3.9', '3.10'] 
-        cuda-version: ['11.1', '11.4']
+        cuda-version: ['10.2', '11.1', '11.3', '11.4']
    steps:
      - uses: actions/checkout@master
      - uses: dorny/paths-filter@v2
@@ -75,7 +75,7 @@ jobs:
    strategy:
      matrix:
        python-version: ['3.7', '3.8', '3.9', '3.10'] # this version is only used for upload.
-        cuda-version: ['111', '114']
+        cuda-version: ['102', '111', '113', '114', '']

    steps:
      - uses: actions/checkout@master
@@ -110,6 +110,17 @@ jobs:
          chmod +x tools/build-wheels.sh
          docker run --rm -e PLAT=$PLAT -e CUMM_CUDA_VERSION=${{ matrix.cuda-version }} -e SPCONV_PYTHON_LIST=${{env.PYTHON_VERSION}} -v `pwd`:/io $DOCKER_IMAGE bash -c "source /etc/bashrc && /io/tools/build-wheels.sh"

+      - name: Build a cpu wheel
+        env:
+          CUDA_VERSION: ${{ matrix.cuda-version }}
+          PYTHON_VERSION: ${{ matrix.python-version }}
+          DOCKER_IMAGE: scrin/manylinux2014-cuda:cu114-devel-1.0.0
+          PLAT: manylinux2014_x86_64
+        if: (github.event_name == 'push' && (startsWith(github.ref, 'refs/tags')) && (env.CUDA_VERSION == '')) || ((steps.changes.outputs.nobuild == 'false') && (env.PYTHON_VERSION == '3.10') && (env.CUDA_VERSION == ''))
+        run: |
+          chmod +x tools/build-wheels.sh
+          docker run --rm -e PLAT=$PLAT -e CUMM_CUDA_VERSION=${{ matrix.cuda-version }} -e SPCONV_PYTHON_LIST=${{env.PYTHON_VERSION}} -v `pwd`:/io $DOCKER_IMAGE bash -c "source /etc/bashrc && /io/tools/build-wheels.sh"
+
      - name: Publish a Python distribution to PyPI
        if: github.event_name == 'push' && startsWith(github.ref, 'refs/tags')
        uses: pypa/gh-action-pypi-publish@master

--- a/CHANGELOG.md
+++ b/CHANGELOG.md
 # Changelog

+## [2.1.0] - 2021-10-31
+### Added
+* add implicit gemm algorithm for all kinds of convolution with kernel volume <= 32. this algorithm is very fast with float16.
+* add pytorch wrapper for voxel generator
+* add CPU support and CPU-only build.
+
+## [2.0.2] - 2021-10-26
+### Fixed
+- Fix a serious bug where SparseSequential did nothing with non-spconv layers
+- Fix a bug of ProxyableClassMeta
+
 ## [2.0.0] - 2021-10-16
 ### Changed
 - Change build system from cmake to pccm.

--- a/README.md
+++ b/README.md
@@ -14,41 +14,29 @@
 limitations under the License.
 -->

-# SpConv: PyTorch Spatially Sparse Convolution Library
+# SpConv: Spatially Sparse Convolution Library

 [![Build Status](https://github.com/traveller59/spconv/workflows/build/badge.svg)](https://github.com/traveller59/spconv/actions?query=workflow%3Abuild)

-[Spconv 1.x code](https://github.com/traveller59/spconv/tree/v1.2.1). We won't provide any support for spconv 1.x since it's deprecated. use spconv 2.x if possible.
+```spconv``` is a project that provides a heavily optimized sparse convolution implementation with tensor core support.

-## Breaking changes in Spconv 2.x
-
-* ```spconv.xxx``` move to ```spconv.pytorch.xxx```, change all ```import spconv``` to ```import spconv.pytorch as spconv``` and ```from spconv.xxx import``` to ```from spconv.pytorch.xxx import```.
-* ```use_hash``` in Sparse Convolution is removed, we only use hash table in 2.x.
-* ```x.features = F.relu(x)``` now raise error. use ```x = x.replace_feature(F.relu(x.features))``` instead.
-* weight layout has been changed to RSKC (native algorithm) or KRSC (implicit gemm), no longer RSCK (spconv 1.x). RS is kernel size, C is input channel, K is output channel.
-* all util ops are removed (pillar scatter/nms/...)
-* VoxelGenerator has been replaced by Point2VoxelGPU[1-4]d/Point2VoxelCPU[1-4]d.
-* spconv 2.x don't support CPU for now
-
-* test spconv 1.x model in spconv 2.x: set environment variable before run program. Linux: ```export SPCONV_FILTER_HWIO="1"```, Windows powershell: ```$Env:SPCONV_FILTER_HWIO = "1"```
-
-## Upcoming release Spconv 2.1.0 (Delay to 11.7.2021, sorry):
+[Spconv 1.x code](https://github.com/traveller59/spconv/tree/v1.2.1). We won't provide any support for spconv 1.x since it's deprecated. use spconv 2.x if possible. <!--remove this message in spconv 2.2-->

-**Status**: CPU build is ready. implicit gemm is ready. working on implicit-gemm-style indice generation for standard conv/pool, and implicit-gemm-style maxpool op.
+## Breaking changes in Spconv 2.x

-* implicit gemm algorithm, greatly faster than native algorithm when using float16 (tested in RTX 3080 Laptop).
-* simple CPU support and CPU-only build
-* add pytorch cpu/cuda voxel generator
-* fix a bug of mixed precision training.
+Spconv 1.x users **NEED TO READ [THIS](docs/SPCONV_2_BREAKING_CHANGEs.md)** before using spconv 2.x.

-## News in Spconv 2.0.0
+## Spconv 2.1 vs Spconv 1.x

-* training/inference speed is increased (+50~80% for float32)
-* support int8/tensor core
+* spconv now can be installed by **pip**. see install section in readme for more details.
+* Microsoft Windows support (only windows 10 has been tested).
+* fp32 (not tf32) training/inference speed is increased (+50~80%)
+* fp16 training/inference speed is greatly increased when your layer supports tensor core (channel size must be a multiple of 8).
+* int8 op is ready, but we still need some time to figure out how to run int8 in pytorch.
 * doesn't depend on pytorch binary. 
 * since spconv 2.x doesn't depend on pytorch binary (never in future), it's impossible to support torch.jit/libtorch inference.

-Spconv 2.1.0 vs 1.x speed:
+Spconv 2.1 vs 1.x speed:

 |                | 1080Ti Spconv 1.x F32 | 1080Ti Spconv 2.0 F32 | 3080M* Spconv 2.1 F16  |
 | -------------- |:---------------------:| ---------------------:| ----------:|
@@ -56,12 +44,26 @@ Spconv 2.1.0 vs 1.x speed:

 \* 3080M (Laptop) ~= 3070 Desktop

+
+<!--
+TODO Spconv vs [MinkowskiEngine](https://github.com/NVIDIA/MinkowskiEngine) vs [torchsparse](https://github.com/mit-han-lab/torchsparse)
+-->
+
+
+## Roadmap for Spconv 2.2-2.3: 
+* TensorFloat-32 (TF32) support for faster fp32 training when you use NVIDIA Geforce RTX 30x0/Tesla A100/Quadro RTX Ax000 (2.2)
+* change implicit gemm weight layout from KRSC to RSKC to make sure we can use native algorithm with implicit gemm weight. (2.2)
+* documents (2.2)
+* Ampere feature support (2.3)
+* pytorch int8 inference, and QAT support (2.3)
+
 ## Usage

 Firstly you need to use ```import spconv.pytorch as spconv``` in spconv 2.x.

-Then see docs/USAGE.md.
+Then see [this](docs/USAGE.md).

+Don't forget to check [performance guide](docs/PERFORMANCE_GUIDE.md).

 ## Install

@@ -73,21 +75,50 @@ You need at least CUDA 10.2 to build and run spconv 2.x. We won't offer any supp

 ### Prebuilt

-We offer python 3.7-3.10 and 11.1/11.4 prebuilt binaries for linux (manylinux) and windows 10/11.
-
-CUDA 10.2 support will be added in version 2.0.2.
+We offer python 3.7-3.10 and cuda 10.2/11.1/11.3/11.4 prebuilt binaries for linux (manylinux) and windows 10/11.

-We will offer prebuilts for CUDA versions supported by latest pytorch release. For example, pytorch 1.9 support cuda 10.2 and 11.1, so we support them too.
+We will provide prebuilts for CUDA versions supported by the latest pytorch release. For example, pytorch 1.10 provides cuda 10.2 and 11.3 prebuilts, so we provide them too.

 For Linux users, you need to install pip >= 20.3 first to install prebuilt.

+CUDA 11.1 will be removed in spconv 2.2 because pytorch 1.10 doesn't provide prebuilts for it.
+
+```pip install spconv``` for CPU only (**Linux Only**). you should only use this for debug usage, the performance isn't optimized due to manylinux limit (no omp support).
+
+```pip install spconv-cu102``` for CUDA 10.2
+
 ```pip install spconv-cu111``` for CUDA 11.1

+```pip install spconv-cu113``` for CUDA 11.3
+
 ```pip install spconv-cu114``` for CUDA 11.4

-**NOTE** It's safe to have different minor cuda version between system and conda (pytorch). for example, you can use spconv-cu114 with anaconda version of pytorch cuda 11.1 in a OS with CUDA 11.2 installed.
+**NOTE** It's safe to have a different **minor** cuda version between system and conda (pytorch) **in Linux**. for example, you can use spconv-cu114 with the anaconda version of pytorch cuda 11.1 in an OS with CUDA 11.2 installed.
+
+
+### Build from source for development (JIT, recommended)
+
+The c++ code will be built automatically when you change c++ code in project.
+
+For NVIDIA Embedded Platforms, you need to specify cuda arch before build: ```export CUMM_CUDA_ARCH_LIST="7.2"``` for xavier.
+
+#### Linux
+0. uninstall spconv and cumm installed by pip
+1. install build-essential, install CUDA
+2. ```git clone https://github.com/FindDefinition/cumm```, ```cd ./cumm```, ```pip install -e .```
+3. ```git clone https://github.com/traveller59/spconv```, ```cd ./spconv```, ```pip install -e .```
+4. in python, ```import spconv``` and wait for the build to finish.
+
+#### Windows
+0. uninstall spconv and cumm installed by pip
+1. install visual studio 2019 or newer. make sure C++ development component is installed. install CUDA
+2. set [powershell script execution policy](https://docs.microsoft.com/en-us/powershell/module/microsoft.powershell.core/about/about_execution_policies?view=powershell-7.1)
+3. start a new powershell, run ```tools/msvc_setup.ps1```
+4. ```git clone https://github.com/FindDefinition/cumm```, ```cd ./cumm```, ```pip install -e .```
+5. ```git clone https://github.com/traveller59/spconv```, ```cd ./spconv```, ```pip install -e .```
+6. in python, ```import spconv``` and wait for the build to finish.

-### Build from source
+### Build wheel from source (not recommended, this is done in CI)

 You need to rebuild ```cumm``` first if you are building against a CUDA version that is not provided in prebuilts.

@@ -95,15 +126,17 @@ You need to rebuild ```cumm``` first if you are build along a CUDA version that

 1. install build-essential, install CUDA
 2. run ```export SPCONV_DISABLE_JIT="1"```
-3. run ```python setup.py bdist_wheel```+```pip install dists/xxx.whl```
+3. run ```pip install pccm cumm wheel```
+4. run ```python setup.py bdist_wheel```+```pip install dists/xxx.whl```

-#### Windows 10/11
+#### Windows

-1. install visual studio 2019 or newer. make sure C++ development package is installed. install CUDA
+1. install visual studio 2019 or newer. make sure C++ development component is installed. install CUDA
 2. set [powershell script execution policy](https://docs.microsoft.com/en-us/powershell/module/microsoft.powershell.core/about/about_execution_policies?view=powershell-7.1)
 3. start a new powershell, run ```tools/msvc_setup.ps1```
 4. run ```$Env:SPCONV_DISABLE_JIT = "1"```
-5. run ```python setup.py bdist_wheel```+```pip install dists/xxx.whl```
+5. run ```pip install pccm cumm wheel```
+6. run ```python setup.py bdist_wheel```+```pip install dists/xxx.whl```

 ## TODO in Spconv 2.x
 - [ ] Ampere (A100 / RTX 3000 series) feature support (work in progress)
@@ -112,7 +145,6 @@ You need to rebuild ```cumm``` first if you are build along a CUDA version that
 - [ ] Build C++ only package
 - [ ] JIT compilation for CUDA kernels
 - [ ] Document (low priority)
-- [ ] CPU support (low priority)

 ## Note


--- a/docs/DEVELOPMENT.md
+++ b/docs/DEVELOPMENT.md
@@ -14,3 +14,12 @@
 limitations under the License.
 -->

+# How to develop spconv 2.x
+
+## First step
+
+spconv 2.x is written in a unique c++ framework ```pccm```. read [pccm guide]() to learn how to use ```pccm```.
+
+It's recommended to uninstall spconv and cumm installed by pip, then install both spconv and cumm in editable mode (```pip install -e .```)
+
+## Architecture
\ No newline at end of file
--- a/docs/PERFORMANCE_GUIDE.md
+++ b/docs/PERFORMANCE_GUIDE.md
@@ -14,3 +14,46 @@
 limitations under the License.
 -->

+# Spconv 2.x Performance Guide
+
+## Short Guide
+
+* If you train without Tensor Core (i.e. FP32 training), set all ```algo``` in convolution/maxpool to ```ConvAlgo.Native``` manually.
+* If your GPU supports Tensor Core, use FP16 (mixed precision training) if possible.
+* If you train with mixed precision (using Tensor Core), you don't need to set the algorithm manually.
+* Currently the fast algorithm only supports kernel volume (product of kernel sizes) <= 32, so don't use a large kernel size.
+* Make sure your channel size is a multiple of 8 when using fp16. A multiple of 32 is better.
+
+Network Benchmark without batchnorm (F32/F16) in RTX 3080 Laptop GPU
+
+| F32/F16 | Spconv 1.x | Native| Implicit Gemm | Implicit Gemm Split Mask  |
+| -------------- |:---------------------:|---------------------:|---------------------:| ---------------------:|
+| Forward | 43ms     | 29ms/23ms    | 30ms/15ms      | 30ms/19ms      |
+| Backward | 80ms    | 47ms/32ms    | 56ms/15ms      | 45ms/14ms      |
+
+## Algorithm Overview
+
+### Native Explicit (deprecated and removed in spconv 2.x)
+
+The native algorithm (explicit, not fused) is the standard gather-gemm-scatter algorithm. Suppose we compute a 3x3 conv: we can split it into 9 1x1 convs, each computed by a matmul, then sum them to get the final result.
+For sparse convolution, we also do split-gemm-sum to calculate the conv, but we need to gather the data first because it's sparse.
+
+### Native
+
+Fused version of above algorithm. 1.5x-2x faster than non-fused version.
+
+### Implicit Gemm
+
+The ```Native``` algorithm does minimal mma (matrix multiply add), but it needs to serialize IO. The pipeline of ```Native``` is gather-gemm-scatter-gather-gemm-scatter-...
+
+```Implicit Gemm``` fuses all calculation into one kernel and performs overlapped gather-mma-scatter to save a lot of time.
+
+![Image Overlapped Gemm](https://raw.githubusercontent.com/NVIDIA/cutlass/master/media/images/software-pipeline.png)
+
+In my test, ```Implicit Gemm``` is almost 2x faster than ```Native```.
+
+### Implicit Gemm Split Mask
+
+TODO
+
+In my test, ```Implicit Gemm Split Mask``` is slightly faster than ```Implicit Gemm```, but its indice generation is much slower, so currently we use ```Implicit Gemm``` by default.
\ No newline at end of file
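
The split-gemm-sum idea in the Algorithm Overview above can be sketched in plain NumPy. This is only an illustration of gather-gemm-scatter for a 2D 3x3 submanifold convolution, not spconv's kernel code; the function name `subm_conv2d_native` and the `(3, 3, C, K)` weight shape are assumptions made for the example.

```python
import numpy as np


def subm_conv2d_native(coords, feats, weight):
    """Illustrative 3x3 submanifold sparse conv via gather-gemm-scatter.

    coords: (N, 2) int array of unique active (y, x) sites.
    feats:  (N, C) features at those sites.
    weight: (3, 3, C, K) kernel; weight[r, s] acts as one 1x1 conv (C x K).
    """
    # Hash table from coordinate to row index, so neighbours can be looked up.
    index_of = {tuple(c): i for i, c in enumerate(coords.tolist())}
    out = np.zeros((coords.shape[0], weight.shape[-1]), dtype=feats.dtype)
    for r in range(3):
        for s in range(3):
            # Build the (input row, output row) pairs for this kernel offset.
            in_idx, out_idx = [], []
            for o, (y, x) in enumerate(coords.tolist()):
                i = index_of.get((y + r - 1, x + s - 1))
                if i is not None:
                    in_idx.append(i)
                    out_idx.append(o)
            if not in_idx:
                continue
            gathered = feats[in_idx]           # gather
            partial = gathered @ weight[r, s]  # gemm (one 1x1 conv)
            np.add.at(out, out_idx, partial)   # scatter-add into the result
    return out
```

Each of the 9 kernel offsets becomes one small matmul over the gathered rows. Running them one after another is the serialized gather-gemm-scatter pipeline that the fused implicit gemm kernel is described as avoiding.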
--- a/docs/SPCONV_2_BREAKING_CHANGEs.md
+++ b/docs/SPCONV_2_BREAKING_CHANGEs.md
+<!--
+ Copyright 2021 Yan Yan
+ 
+ Licensed under the Apache License, Version 2.0 (the "License");
+ you may not use this file except in compliance with the License.
+ You may obtain a copy of the License at
+ 
+     http://www.apache.org/licenses/LICENSE-2.0
+ 
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
+-->
+
+# Breaking changes in Spconv 2.x for spconv 1.x users
+
+* ```spconv.xxx``` moved to ```spconv.pytorch.xxx```, change all ```import spconv``` to ```import spconv.pytorch as spconv``` and ```from spconv.xxx import``` to ```from spconv.pytorch.xxx import```.
+* ```use_hash``` and ```fused_bn``` in Sparse Convolution are removed, we only use hash table in 2.x.
+* ```x.features = F.relu(x.features)``` now raises an error. use ```x = x.replace_feature(F.relu(x.features))``` instead.
+* weight layout has been changed to RSKC (native algorithm) or KRSC (implicit gemm), no longer RSCK (spconv 1.x). RS is kernel size, C is input channel, K is output channel.
+* all util ops are removed (pillar scatter/nms/rbbox_iou...)
+* VoxelGenerator has been replaced by Point2VoxelGPU[1-4]d/Point2VoxelCPU[1-4]d.
+* spconv < 2.1 doesn't support CPU. spconv 2.1+ supports CPU for debug usage.
+
+* to test a spconv 1.x model in spconv 2.x, set an environment variable before running the program. Linux: ```export SPCONV_FILTER_HWIO="1"```, Windows powershell: ```$Env:SPCONV_FILTER_HWIO = "1"```. **WARNING** a spconv 1.x model tested this way doesn't support the implicit gemm algorithm; to use it, you need to train from scratch with spconv 2.x and select ConvAlgo.MaskSplitImplicitGemm.
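
The weight layout change listed above is an axis permutation, which can be sketched with NumPy. This only shows how the RSCK, RSKC and KRSC index orders relate; it is not spconv's own conversion code, and the helper names `rsck_to_rskc` and `rsck_to_krsc` are hypothetical.

```python
import numpy as np


def rsck_to_rskc(w):
    """(R, S, C, K) -> (R, S, K, C): swap the input/output channel axes."""
    return np.ascontiguousarray(np.swapaxes(w, -1, -2))


def rsck_to_krsc(w):
    """(R, S, C, K) -> (K, R, S, C): move output channels to the front."""
    return np.ascontiguousarray(np.moveaxis(w, -1, 0))
```

Both helpers index from the end, so they also apply when the kernel has three spatial axes instead of two.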
--- a/docs/USAGE.md
+++ b/docs/USAGE.md
@@ -52,7 +52,7 @@ class ExampleNet(nn.Module):
    def __init__(self, shape):
        super().__init__()
        self.net = spconv.SparseSequential(
-            spconv.SparseConv3d(32, 64, 3), # just like nn.Conv3d but don't support group and all([d > 1, s > 1])
+            spconv.SparseConv3d(32, 64, 3), # just like nn.Conv3d but doesn't support group
            nn.BatchNorm1d(64), # non-spatial layers can be used directly in SparseSequential.
            nn.ReLU(),
            spconv.SubMConv3d(64, 64, 3, indice_key="subm0"),
@@ -100,6 +100,9 @@ class ExampleNet(nn.Module):
        return self.net(x)
 ```

+### Fast Mixed Precision Training
+
+see example/mnist_sparse. we support ```torch.cuda.amp```.

 ### Utility functions

@@ -107,23 +110,18 @@ class ExampleNet(nn.Module):

 voxel generator in spconv generate indices in **ZYX** order, the params format are **XYZ**.

-voxel generator in spconv takes a ```tv.Tensor``` return a ```tv.Tensor```, this tensor reference to a **permanent** storage in generator.
-
+generated indices don't include batch axis, you need to add it by yourself.

 ```Python
-from spconv.utils import Point2VoxelCPU3d
+from spconv.pytorch.utils import PointToVoxel
 # this generator generate ZYX indices.
-gen = Point2VoxelCPU3d(
+gen = PointToVoxel(
    vsize_xyz=[0.1, 0.1, 0.1], 
    coors_range_xyz=[-80, -80, -2, 80, 80, 6], 
    num_point_features=3, 
    max_num_voxels=5000, 
    max_num_points_per_voxel=5)
 pc = np.random.uniform(-10, 10, size=[1000, 3])
-pc_tv = tv.from_numpy(pc)
-voxels, coords, num_points_per_voxel = gen.generate(pc_tv)
-
-# get numpy
-voxels_np = voxels.numpy_view() # no copy, but become invalid if generator is destroyed.
-voxels_np = voxels.numpy() # will perform copy
+pc_th = torch.from_numpy(pc)
+voxels, coords, num_points_per_voxel = gen(pc_th)
 ```
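
Since the generated indices don't include a batch axis, one way to add it when batching several samples is sketched below with NumPy. It assumes the batch index belongs in the first column, in front of the ZYX indices; `add_batch_axis` is a hypothetical helper, not part of spconv.

```python
import numpy as np


def add_batch_axis(coords_per_sample):
    """Prepend a batch index column to each sample's (N_i, 3) ZYX indices.

    Returns one (sum(N_i), 4) array laid out as [batch, z, y, x].
    """
    padded = [
        # Pad one column on the left, filled with the sample's batch index.
        np.pad(c, ((0, 0), (1, 0)), mode="constant", constant_values=b)
        for b, c in enumerate(coords_per_sample)
    ]
    return np.concatenate(padded, axis=0)
```

The per-sample voxel features would be concatenated along axis 0 in the same order so that rows stay aligned with their indices.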
--- a/example/mnist_sparse.py
+++ b/example/mnist_sparse.py
@@ -21,6 +21,12 @@ import torch.nn.functional as F
 import torch.optim as optim
 from torchvision import datasets, transforms
 from torch.optim.lr_scheduler import StepLR
+import contextlib
+import torch.cuda.amp 
+
+@contextlib.contextmanager
+def identity_ctx():
+    yield 


 class Net(nn.Module):
@@ -58,27 +64,58 @@ class Net(nn.Module):

 def train(args, model, device, train_loader, optimizer, epoch):
    model.train()
+    scaler = torch.cuda.amp.grad_scaler.GradScaler()
+    amp_ctx = identity_ctx()
+    if args.fp16:
+        amp_ctx = torch.cuda.amp.autocast()
    for batch_idx, (data, target) in enumerate(train_loader):
        data, target = data.to(device), target.to(device)
        optimizer.zero_grad()
-        output = model(data)
-        loss = F.nll_loss(output, target)
-        loss.backward()
-        optimizer.step()
+        with amp_ctx:
+            output = model(data)
+            loss = F.nll_loss(output, target)
+            scale = 1.0
+            if args.fp16:
+                assert loss.dtype is torch.float32
+                scaler.scale(loss).backward()
+                # scaler.step() first unscales the gradients of the optimizer's assigned params.
+                # If these gradients do not contain infs or NaNs, optimizer.step() is then called,
+                # otherwise, optimizer.step() is skipped.
+                # scaler.unscale_(optim)
+
+                # Since the gradients of optimizer's assigned params are now unscaled, clips as usual.
+                # You may use the same value for max_norm here as you would without gradient scaling.
+                # torch.nn.utils.clip_grad_norm_(models[0].net.parameters(), max_norm=0.1)
+
+                scaler.step(optimizer)
+                # Updates the scale for next iteration.
+                scaler.update()
+                scale = scaler.get_scale()
+            else:
+                loss.backward()
+                optimizer.step()
+
        if batch_idx % args.log_interval == 0:
            print('Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}'.format(
                epoch, batch_idx * len(data), len(train_loader.dataset),
                100. * batch_idx / len(train_loader), loss.item()))


-def test(model, device, test_loader):
+def test(args, model, device, test_loader):
    model.eval()
    test_loss = 0
    correct = 0
+    amp_ctx = identity_ctx()
+    if args.fp16:
+        amp_ctx = torch.cuda.amp.autocast()
+
    with torch.no_grad():
        for data, target in test_loader:
+
            data, target = data.to(device), target.to(device)
-            output = model(data)
+            with amp_ctx:
+
+                output = model(data)
            test_loss += F.nll_loss(output, target, reduction='sum').item()  # sum up batch loss
            pred = output.argmax(dim=1, keepdim=True)  # get the index of the max log-probability
            correct += pred.eq(target.view_as(pred)).sum().item()
@@ -112,6 +149,9 @@ def main():

    parser.add_argument('--save-model', action='store_true', default=False,
                        help='For Saving the current Model')
+    parser.add_argument('--fp16', action='store_true', default=False,
+                        help='For mixed precision training')
+
    args = parser.parse_args()
    use_cuda = not args.no_cuda and torch.cuda.is_available()

@@ -142,7 +182,7 @@ def main():
    scheduler = StepLR(optimizer, step_size=1, gamma=args.gamma)
    for epoch in range(1, args.epochs + 1):
        train(args, model, device, train_loader, optimizer, epoch)
-        test(model, device, test_loader)
+        test(args, model, device, test_loader)
        scheduler.step()

    if args.save_model:
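
The `identity_ctx` helper added in this diff can also be expressed with the standard library's `contextlib.nullcontext` (Python 3.7+). A minimal sketch, where `pick_ctx` is a hypothetical helper name and the usage line in the comment assumes the `args.fp16` flag from this example:

```python
import contextlib


def pick_ctx(enabled, make_ctx):
    """Return make_ctx() when enabled, otherwise a do-nothing context manager."""
    return make_ctx() if enabled else contextlib.nullcontext()


# Possible use inside the training loop (assumes torch is imported):
#     with pick_ctx(args.fp16, torch.cuda.amp.autocast):
#         output = model(data)
#         loss = F.nll_loss(output, target)
```

Passing the factory rather than an instance creates a fresh context manager for every `with` statement.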

--- a/example/voxel_gen.py
+++ b/example/voxel_gen.py
@@ -16,7 +16,8 @@ import numpy as np

 from cumm import tensorview as tv 
 from spconv.utils import Point2VoxelCPU3d
-
+from spconv.pytorch.utils import PointToVoxel
+import torch 

 def main():
    # voxel gen source code: spconv/csrc/sparse/pointops.py
@@ -45,5 +46,91 @@ def main():
    print("------Voxels with mean filled-------")
    print(voxels_np[0])

+def main_point_with_features():
+    # voxel gen source code: spconv/csrc/sparse/pointops.py
+    gen = Point2VoxelCPU3d(
+        vsize_xyz=[0.1, 0.1, 0.1], 
+        coors_range_xyz=[-80, -80, -2, 80, 80, 6], 
+        num_point_features=4,  # here num_point_features must equal pc.shape[1]
+        max_num_voxels=5000, 
+        max_num_points_per_voxel=5)
+
+    pc = np.random.uniform(-10, 10, size=[1000, 3])
+    other_pc_feature = np.random.uniform(-1, 1, size=[1000, 1])
+    pc_with_feature = np.concatenate([pc, other_pc_feature], axis=1)
+    pc_tv = tv.from_numpy(pc_with_feature)
+    # generate voxels. note that voxels_tv references a persistent buffer in the generator,
+    # so we can't run it in multiple threads.
+    voxels_tv, indices_tv, num_p_in_vx_tv = gen.point_to_voxel(pc_tv)
+    voxels_np = voxels_tv.numpy_view()
+    indices_np = indices_tv.numpy_view()
+    num_p_in_vx_np = num_p_in_vx_tv.numpy_view()
+    print("------Raw Voxels-------")
+    print(voxels_np[0])
+    # run voxel gen and FILL MEAN VALUE to voxel remain
+    voxels_tv, indices_tv, num_p_in_vx_tv = gen.point_to_voxel_empty_mean(pc_tv)
+    voxels_np = voxels_tv.numpy_view()
+    indices_np = indices_tv.numpy_view()
+    num_p_in_vx_np = num_p_in_vx_tv.numpy_view()
+    print("------Voxels with mean filled-------")
+    print(voxels_np[0])
+
+def main_pytorch_voxel_gen():
+    # voxel gen source code: spconv/csrc/sparse/pointops.py
+    gen = PointToVoxel(
+        vsize_xyz=[0.1, 0.1, 0.1], 
+        coors_range_xyz=[-80, -80, -2, 80, 80, 6], 
+        num_point_features=3, 
+        max_num_voxels=5000, 
+        max_num_points_per_voxel=5)
+
+    pc = np.random.uniform(-10, 10, size=[1000, 3])
+    pc_th = torch.from_numpy(pc)
+    voxels_th, indices_th, num_p_in_vx_th = gen(pc_th)
+    voxels_np = voxels_th.numpy()
+    indices_np = indices_th.numpy()
+    num_p_in_vx_np = num_p_in_vx_th.numpy()
+    print("------Raw Voxels-------")
+    print(voxels_np[0])
+    # run voxel gen and FILL MEAN VALUE to voxel remain
+    voxels_tv, indices_tv, num_p_in_vx_tv = gen(pc_th, empty_mean=True)
+    voxels_np = voxels_tv.numpy()
+    indices_np = indices_tv.numpy()
+    num_p_in_vx_np = num_p_in_vx_tv.numpy()
+    print("------Voxels with mean filled-------")
+    print(voxels_np[0])
+
+def main_pytorch_voxel_gen_cuda():
+    # voxel gen source code: spconv/csrc/sparse/pointops.py
+    device = torch.device("cuda:0")
+    gen = PointToVoxel(
+        vsize_xyz=[0.1, 0.1, 0.1], 
+        coors_range_xyz=[-80, -80, -2, 80, 80, 6], 
+        num_point_features=3, 
+        max_num_voxels=5000, 
+        max_num_points_per_voxel=5,
+        device=device)
+
+    pc = np.random.uniform(-10, 10, size=[1000, 3]).astype(np.float32)
+    pc_th = torch.from_numpy(pc).to(device)
+    voxels_th, indices_th, num_p_in_vx_th = gen(pc_th)
+    voxels_np = voxels_th.cpu().numpy()
+    indices_np = indices_th.cpu().numpy()
+    num_p_in_vx_np = num_p_in_vx_th.cpu().numpy()
+    print("------Raw Voxels-------")
+    print(voxels_np[0])
+    # run voxel gen and FILL MEAN VALUE to voxel remain
+    voxels_tv, indices_tv, num_p_in_vx_tv = gen(pc_th, empty_mean=True)
+    voxels_np = voxels_tv.cpu().numpy()
+    indices_np = indices_tv.cpu().numpy()
+    num_p_in_vx_np = num_p_in_vx_tv.cpu().numpy()
+    print("------Voxels with mean filled-------")
+    print(voxels_np[0])
+
+
 if __name__ == "__main__":
    main()
+    main_point_with_features()
+    main_pytorch_voxel_gen()
+    if torch.cuda.is_available():
+        main_pytorch_voxel_gen_cuda()
\ No newline at end of file
--- a/pyproject.toml
+++ b/pyproject.toml
 [build-system]
-requires = ["setuptools>=41.0", "wheel", "pccm>=0.2.15", "cumm>=0.1.9"]
+requires = ["setuptools>=41.0", "wheel", "pccm>=0.2.19", "cumm>=0.2.0"]
 build-backend = "setuptools.build_meta"
--- a/scripts/dev_subm.py
+++ b/scripts/dev_subm.py
+import sys
+from pathlib import Path
+from typing import Dict, List, Tuple
+import pickle
+import sys
+import time
+from pathlib import Path
+from cumm.gemm.algospec.core import GemmAlgo
+
+import numpy as np
+import pccm
+import torch
+import torch.nn.functional as F
+
+from cumm import dtypes
+from cumm import tensorview as tv
+from cumm.constants import PACKAGE_ROOT
+from cumm.conv.bases import NCHW, NHWC, ConvIterAlgo, ConvOpType
+from cumm.conv.main import ConvMainUnitTest, gen_gemm_kernels
+from cumm.conv.params import ConvProblem
+from cumm.gemm import kernel
+import os 
+from spconv.core_cc.csrc.sparse.all import SpconvOps
+from cumm.gemm.codeops import div_up
+from spconv.constants import PACKAGE_ROOT
+from spconv.core import ConvAlgo
+
+from spconv.pytorch import ops 
+from spconv.algo import CONV, BestConvAlgoByProfile
+from spconv.pytorch.cppcore import torch_tensor_to_tv
+
+def reduce_mask_count(mask: np.ndarray, width: int):
+    mask_length_32 = (div_up(mask.shape[0], width)) * width
+    if mask.shape[0] < mask_length_32:
+        mask_pad = np.zeros((mask_length_32,), dtype=mask.dtype)
+        mask_pad[:mask.shape[0]] = mask
+        mask = mask_pad
+    mask = mask.reshape(-1, width)
+    maskr = np.bitwise_or.reduce(mask, axis=1)
+    maskr_tv = tv.from_numpy(maskr)
+    return SpconvOps.count_bits(maskr_tv).numpy().sum() * width
+
+def reduce_mask_count_x(mask: np.ndarray, width: int):
+    mask_length_32 = (div_up(mask.shape[0], width)) * width
+    if mask.shape[0] < mask_length_32:
+        mask_pad = np.zeros((mask_length_32,), dtype=mask.dtype)
+        mask_pad[:mask.shape[0]] = mask
+        mask = mask_pad
+    mask = mask.reshape(-1, width)
+    maskr = np.bitwise_or.reduce(mask, axis=1)
+    return maskr
+
+def dev_subm_inds_v2(subm: bool = False, run_conv: bool = True):
+    limit_input_n = 16384
+    limit_input_n = None
+    np.random.seed(484)
+
+    with (PACKAGE_ROOT.parent / "test/data/test_spconv.pkl").open("rb") as f:
+        voxels_np, indices_np, spatial_shape = pickle.load(f)
+        from spconv.test_utils import generate_sparse_data
+        voxels_np = voxels_np[:limit_input_n]
+        indices_np = indices_np[:limit_input_n]
+
+        spatial_shape = [19, 18, 17]
+        sparse_dict = generate_sparse_data(spatial_shape, [1024], 128)
+
+        voxels_np = np.ascontiguousarray(sparse_dict["features"]).astype(
+            np.float32)
+        indices_np = np.ascontiguousarray(
+            sparse_dict["indices"][:, [3, 0, 1, 2]]).astype(np.int32)
+
+        voxels = tv.from_numpy(voxels_np).cuda()
+        indices = tv.from_numpy(indices_np).cuda()
+        indices_th = torch.from_numpy(indices_np).cuda()
+    print(spatial_shape, indices_np.shape)
+    ndim = 3
+    if subm:
+        ksize = [3, 3, 3]
+        kv = np.prod(ksize)
+        padding = [1] * ndim
+        stride = [1] * ndim
+        dilation = [1] * ndim
+        out_padding = [0] * ndim
+    else:
+        ksize = [2, 2, 2]
+        kv = np.prod(ksize)
+        padding = [0] * ndim
+        stride = [1] * ndim
+        dilation = [1] * ndim
+        out_padding = [0] * ndim
+    out_inds, pair_ref, indice_num_per_loc = ops.get_indice_pairs(indices_th, 1, spatial_shape, ConvAlgo.Native, 
+            ksize, stride, padding, dilation, out_padding, subm)
+    indice_num_per_loc_np = indice_num_per_loc.cpu().numpy()
+    indice_pairs_np = pair_ref.cpu().numpy()
+    algo = ConvAlgo.MaskSplitImplicitGemm
+    if algo == ConvAlgo.MaskImplicitGemm:
+        num_split = 1
+    else:
+        num_split = 2
+    for i in range(5):
+        res = ops.get_indice_pairs_implicit_gemm(indices_th, 1, spatial_shape, algo, 
+            ksize, stride, padding, dilation, out_padding, subm)
+    out_inds = res[0]
+    num_inds_per_loc = res[1]
+    pair_fwd = res[2]
+    pair_fwd_x = pair_fwd.cpu().numpy().reshape(-1)
+    pair_fwd_x[pair_fwd_x == -1] = 0
+    loc_num_np = (pair_fwd_x > 0).reshape(kv, -1).sum(1)
+    print(loc_num_np)
+    print(indice_num_per_loc_np)
+
+    pair_bwd = res[3]
+    pair_mask_fwd_splits = res[4]
+    pair_mask_bwd_splits = res[5]
+    mask_argsort_fwd_splits = res[6]
+    mask_argsort_bwd_splits = res[7]
+    masks = res[8]
+    pair_mask_fwd_splits_tv = [ops.torch_tensor_to_tv(t, dtype=tv.uint32) for t in pair_mask_fwd_splits]
+    valid_location_bitcount = [SpconvOps.count_bits(t) for t in pair_mask_fwd_splits_tv]
+    valid_location_count = sum([t.cpu().numpy().sum() for t in valid_location_bitcount])
+    reduce_length = 32
+    split_mask_valid_count = sum([reduce_mask_count(t.cpu().numpy(), reduce_length) for t in pair_mask_fwd_splits_tv])
+    if subm:
+        print("SUBM", valid_location_count, split_mask_valid_count, pair_fwd.numel())
+    else:
+        print("REGULAR", valid_location_count, split_mask_valid_count, pair_fwd.numel())
+    # return 
+
+    if run_conv:
+        C = 64
+        K = 64
+        desps = CONV.desps
+        mask_output_fwd = torch.zeros([2, div_up(out_inds.shape[0], 32)], dtype=torch.int32, device=indices_th.device)
+        mask_output_bwd = torch.zeros([2, div_up(indices.dim(0), 32)], dtype=torch.int32, device=indices_th.device)
+
+        for desp in desps:
+            if desp.algo != GemmAlgo.Simt.value:
+                continue
+            # if desp.op_type == ConvOpType.kBackwardWeight.value:
+            #     continue
+            # if desp.tile_shape !
+            if desp.dtype_a == dtypes.int8.tv_dtype:
+                inp = np.random.randint(-1, 1, size=[voxels_np.shape[0], C]).astype(np.int8)
+                weight = np.random.randint(-1, 1, size=[K, *ksize, C]).astype(np.int8)
+                output = np.random.randint(-1, 1, size=[out_inds.shape[0], K]).astype(
+                    dtypes.get_npdtype_from_tvdtype(desp.dtype_output))
+            else:
+                inp = np.random.uniform(-1, 1, size=[voxels_np.shape[0], C]).astype(
+                    dtypes.get_npdtype_from_tvdtype(desp.dtype_input))
+                weight = np.random.uniform(-1, 1, size=[K, *ksize, C]).astype(
+                    dtypes.get_npdtype_from_tvdtype(desp.dtype_weight))
+                output = np.random.uniform(-1, 1, size=[out_inds.shape[0], K]).astype(
+                    dtypes.get_npdtype_from_tvdtype(desp.dtype_output))
+            weight_ref = weight.transpose(1, 2, 3, 0, 4)
+            weight_ref = np.ascontiguousarray(weight_ref).reshape(-1, K, C)
+            if desp.op_type == ConvOpType.kBackwardInput.value:
+                inp_tv = tv.zeros(inp.shape, desp.dtype_input, 0)
+            else:
+                inp_tv = tv.from_numpy(inp).cuda()
+            if desp.op_type == ConvOpType.kBackwardWeight.value:
+                weight_tv = tv.zeros(weight.shape, desp.dtype_weight, 0)
+            else:
+                weight_tv = tv.from_numpy(weight).cuda()
+            # _ = tv.zeros([5000, 10], tv.float32, 0)
+            if desp.op_type == ConvOpType.kForward.value:
+                output_tv = tv.zeros(output.shape, desp.dtype_output, 0)
+            else:
+                output_tv = tv.from_numpy(output).cuda()
+            torch.cuda.synchronize()
+            t = time.time()
+            spk = 1
+            if desp.op_type == ConvOpType.kBackwardWeight.value:
+                # TODO support splitk parallel
+                spk = 32
+            if subm:
+                if desp.op_type == ConvOpType.kForward.value:
+                    indice_pairs = pair_fwd
+                elif desp.op_type == ConvOpType.kBackwardInput.value:
+                    indice_pairs = pair_bwd
+                else:
+                    indice_pairs = pair_fwd
+                mask_output = mask_output_fwd
+                # print([bin(x.item()) for x in masks])
+                for j in range(num_split):
+                    beta = 1 if j == 1 else 0
+                    mask_filter = masks[j].item()
+
+                    reverse_mask = False
+                    if desp.op_type == ConvOpType.kBackwardWeight.value:
+                        mask_op = mask_output[j]
+                    else:
+                        mask_op = pair_mask_fwd_splits[j]
+                    if desp.op_type == ConvOpType.kBackwardInput.value:
+                        reverse_mask = True
+                    CONV.run_with_tuned_result(
+                        BestConvAlgoByProfile(desp, spk),
+                        desp.op_type,
+                        inp_tv,
+                        weight_tv,
+                        output_tv,
+                        torch_tensor_to_tv(mask_op, dtype=tv.uint32),
+                        torch_tensor_to_tv(mask_argsort_fwd_splits[j]),
+                        torch_tensor_to_tv(mask_output[j], dtype=tv.uint32),
+                        torch_tensor_to_tv(indice_pairs),
+                        reverse_mask,
+                        mask_filter=mask_filter,
+                        mask_width=32,
+                        beta=beta,
+                        verbose=True,
+                    )
+            else:
+                if desp.op_type == ConvOpType.kForward.value:
+                    indice_pairs = pair_fwd # inp -> out
+                    mask_ops = pair_mask_fwd_splits
+                    mask_argsorts = mask_argsort_fwd_splits
+                    mask_output = mask_output_fwd
+                elif desp.op_type == ConvOpType.kBackwardInput.value:
+                    indice_pairs = pair_bwd # out -> inp
+                    mask_ops = pair_mask_bwd_splits
+                    mask_argsorts = mask_argsort_bwd_splits
+                    mask_output = mask_output_bwd
+
+                    print([bin(x.item()) for x in masks])
+                else:
+                    indice_pairs = pair_fwd # inp -> out
+                    mask_ops = pair_mask_fwd_splits
+                    mask_argsorts = mask_argsort_fwd_splits
+                    mask_output = mask_output_fwd
+
+                for j in range(2):
+                    beta = 1 if j == 1 else 0
+                    mask_filter = masks[j].item()
+                    reverse_mask = False
+                    if desp.op_type == ConvOpType.kBackwardWeight.value:
+                        mask_op = mask_output[j]
+                    else:
+                        mask_op = mask_ops[j]
+
+                    CONV.run_with_tuned_result(
+                        BestConvAlgoByProfile(desp, spk),
+                        desp.op_type,
+                        inp_tv,
+                        weight_tv,
+                        output_tv,
+                        torch_tensor_to_tv(mask_op, dtype=tv.uint32),
+                        torch_tensor_to_tv(mask_argsorts[j]),
+                        torch_tensor_to_tv(mask_output[j], dtype=tv.uint32),
+                        torch_tensor_to_tv(indice_pairs),
+                        reverse_mask,
+                        mask_filter=mask_filter,
+                        mask_width=32,
+                        beta=beta,
+                        verbose=True,
+                    )
+
+            torch.cuda.synchronize()
+            duration = time.time() - t 
+            if desp.op_type == ConvOpType.kForward.value:
+                output_ref = np.zeros_like(output, dtype=np.float32)
+                # ref algorithm
+                for filter_offset in range(kv):
+                    if subm and filter_offset > kv // 2:
+                        nhot = indice_num_per_loc_np[kv - 1 - filter_offset]
+                    elif subm and filter_offset == kv // 2:
+                        nhot = voxels.shape[0]
+                    else:
+                        nhot = indice_num_per_loc_np[filter_offset]
+                    a_inds = indice_pairs_np[0][filter_offset][:nhot]
+                    c_inds = indice_pairs_np[1][filter_offset][:nhot]
+                    # print(a_inds_cpu[:10])
+                    a = inp[a_inds]
+                    cc = a.astype(np.float32) @ weight_ref[filter_offset].T.astype(np.float32)
+                    output_ref[c_inds] += cc
+
+                output_cpu = output_tv.cpu().numpy().astype(np.float32)
+                duration = time.time() - t
+                my = output_cpu.reshape(-1)
+                print("ERROR", np.linalg.norm(output_ref.reshape(-1) - my))
+
+            elif desp.op_type == ConvOpType.kBackwardInput.value:
+                dinput_ref = np.zeros_like(inp, dtype=np.float32)
+                # ref algorithm
+                for filter_offset in range(kv):
+                    if subm and filter_offset > kv // 2:
+                        nhot = indice_num_per_loc_np[kv - 1 - filter_offset]
+                    elif subm and filter_offset == kv // 2:
+                        nhot = voxels.shape[0]
+                    else:
+                        nhot = indice_num_per_loc_np[filter_offset]
+                    a_inds = indice_pairs_np[1][filter_offset][:nhot]
+                    c_inds = indice_pairs_np[0][filter_offset][:nhot]
+
+                    # print(a_inds_cpu[:10])
+                    a = output[a_inds]
+                    # NK @ KC
+                    cc = a.astype(np.float32) @ weight_ref[filter_offset].astype(np.float32)
+                    dinput_ref[c_inds] += cc
+                din_cpu = inp_tv.cpu().numpy()
+                print("ERROR", np.linalg.norm(din_cpu.reshape(-1) - dinput_ref.reshape(-1)))
+            else:
+                dw_ref = np.zeros_like(weight_ref, dtype=np.float32) # KV, K, C
+                for filter_offset in range(kv):
+                    if subm and filter_offset > kv // 2:
+                        nhot = indice_num_per_loc_np[kv - 1 - filter_offset]
+                    elif subm and filter_offset == kv // 2:
+                        nhot = voxels.shape[0]
+                    else:
+                        nhot = indice_num_per_loc_np[filter_offset]
+                    o_inds = indice_pairs_np[1][filter_offset][:nhot]
+                    i_inds = indice_pairs_np[0][filter_offset][:nhot]
+                    # print(a_inds_cpu[:10])
+                    out_gather = output[o_inds] # [N, K]
+                    inp_gather = inp[i_inds] # [N, C]
+                    # KN @ NC
+                    dw_res = out_gather.astype(np.float32).T @ inp_gather.astype(np.float32)
+                    dw_ref[filter_offset] = dw_res
+                # print(indice_pairs_np_test[0])
+                dw_ref_kcrs = dw_ref.transpose(1, 0, 2)
+                dw_cpu = weight_tv.cpu().numpy().reshape(K, np.prod(ksize), C)
+
+                print("ERROR", np.linalg.norm(dw_cpu.reshape(-1) - dw_ref_kcrs.reshape(-1)))
+
+
+if __name__ == "__main__":
+    dev_subm_inds_v2()
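Reviewer note: the forward check marked "ref algorithm" in the script above is a gather, per-offset GEMM, scatter-add loop. Below is a minimal NumPy sketch of that pattern on synthetic data, assuming a 1D submanifold layout with kernel size 3 so that the result can be compared with a direct dense computation. The helper `sparse_conv_forward_ref`, the pair construction, and all shapes are hypothetical and not part of spconv.

```python
import numpy as np


def sparse_conv_forward_ref(inp, weight, pairs_in, pairs_out, num_out):
    """Reference forward pass.

    inp:      [N, C] input features
    weight:   [KV, K, C] one [K, C] matrix per filter offset
    pairs_*:  per-offset index arrays (input rows / output rows)
    """
    out = np.zeros((num_out, weight.shape[1]), dtype=np.float32)
    for k in range(weight.shape[0]):
        a = inp[pairs_in[k]].astype(np.float32)    # gather: [nhot, C]
        cc = a @ weight[k].T.astype(np.float32)    # GEMM: [nhot, C] @ [C, K]
        np.add.at(out, pairs_out[k], cc)           # scatter-add into outputs
    return out


rng = np.random.default_rng(0)
coords = np.array([0, 1, 3, 4, 5])                 # active 1D positions
N, C, K, KV = len(coords), 4, 2, 3
inp = rng.uniform(-1, 1, size=(N, C)).astype(np.float32)
weight = rng.uniform(-1, 1, size=(KV, K, C)).astype(np.float32)

# Build pairs: output at position p reads input at p + k - 1 when that site is active.
pos_to_idx = {int(p): i for i, p in enumerate(coords)}
pairs_in, pairs_out = [], []
for k in range(KV):
    ins, outs = [], []
    for o, p in enumerate(coords):
        q = int(p) + k - 1
        if q in pos_to_idx:
            ins.append(pos_to_idx[q])
            outs.append(o)
    pairs_in.append(np.array(ins, dtype=np.int64))
    pairs_out.append(np.array(outs, dtype=np.int64))

out = sparse_conv_forward_ref(inp, weight, pairs_in, pairs_out, N)

# Dense check: zero-padded dense input, evaluated only at the active sites.
dense_in = np.zeros((coords.max() + 3, C), dtype=np.float32)
dense_in[coords + 1] = inp
expected = np.stack(
    [sum(weight[k] @ dense_in[p + k] for k in range(KV)) for p in coords]
)
print("ERROR", np.linalg.norm(out - expected))
```

`np.add.at` is used instead of `out[idx] += cc` so that repeated output indices would still accumulate; within a single filter offset each output row appears at most once, so both forms agree here.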
--- a/scripts/sort_bench.py
+++ b/scripts/sort_bench.py
+# Copyright 2021 Yan Yan
+# 
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+# 
+#     http://www.apache.org/licenses/LICENSE-2.0
+# 
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import numpy as np 
+from cumm import tensorview as tv 
+from spconv.core_cc.csrc.sparse.all import SpconvOps
+import pickle 
+import torch
+
+from spconv.pytorch.cppcore import torch_tensor_to_tv 
+
+def main():
+    with open("/home/yy/asd.pkl", "rb") as f:
+        a_th = pickle.load(f)
+    mask_argsort = torch.empty((1, a_th.shape[1]),
+                                dtype=torch.int32,
+                                device=a_th.device)
+
+    a = a_th.cpu().numpy()[0]
+    a_tv = torch_tensor_to_tv(a_th)
+    mask_argsort_tv = torch_tensor_to_tv(mask_argsort)
+    for i in range(10):
+        a_tv_1 = a_tv.clone()
+        SpconvOps.sort_1d_by_key(a_tv_1[0], mask_argsort_tv[0])
+
+if __name__ == "__main__":
+    main()
\ No newline at end of file
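Reviewer note: the loop above calls `SpconvOps.sort_1d_by_key` ten times but neither times it nor checks the result. A minimal CPU reference sketch one could compare against, assuming the op sorts the keys in ascending order and writes the matching permutation into the index tensor (that semantics is an assumption, not confirmed by this diff). `sort_by_key_ref` is a hypothetical helper.

```python
import numpy as np


def sort_by_key_ref(keys):
    """Return (sorted keys, permutation) using a stable sort."""
    order = np.argsort(keys, kind="stable").astype(np.int32)
    return keys[order], order


keys = np.array([5, 1, 4, 1, 0], dtype=np.uint32)
sorted_keys, order = sort_by_key_ref(keys)
print(sorted_keys, order)
```

If the GPU sort is not stable, comparing the sorted keys (and `keys[order]`) is safer than comparing the permutation itself.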
--- a/setup.py
+++ b/setup.py
@@ -25,27 +25,34 @@ NAME = 'spconv'
 RELEASE_NAME = NAME
 deps = ["cumm"]
 cuda_ver = os.environ.get("CUMM_CUDA_VERSION", "")
-if not cuda_ver:
-    nvcc_version = subprocess.check_output(["nvcc", "--version"
-                                            ]).decode("utf-8").strip()
-    nvcc_version_str = nvcc_version.split("\n")[3]
-    version_str: str = re.findall(r"release (\d+.\d+)",
-                                    nvcc_version_str)[0]
-    cuda_ver = version_str
-cuda_ver = cuda_ver.replace(".", "") # 10.2 to 102
-
-RELEASE_NAME += "-cu{}".format(cuda_ver)
-deps = ["cumm-cu{}".format(cuda_ver)]
+# is_ci_build = cuda_ver != ""
+# if not cuda_ver:
+#     nvcc_version = subprocess.check_output(["nvcc", "--version"
+#                                             ]).decode("utf-8").strip()
+#     nvcc_version_str = nvcc_version.split("\n")[3]
+#     version_str: str = re.findall(r"release (\d+.\d+)",
+#                                     nvcc_version_str)[0]
+#     cuda_ver = version_str
+
+if cuda_ver:
+    cuda_ver = cuda_ver.replace(".", "") # 10.2 to 102
+
+    RELEASE_NAME += "-cu{}".format(cuda_ver)
+    deps = ["cumm-cu{}".format(cuda_ver)]
+else:
+    deps = ["cumm"]
+
+

 DESCRIPTION = 'spatial sparse convolution'
 URL = 'https://github.com/traveller59/spconv'
 EMAIL = 'yanyan.sub@outlook.com'
 AUTHOR = 'Yan Yan'
-REQUIRES_PYTHON = '>=3.6'
+REQUIRES_PYTHON = '>=3.7'
 VERSION = None

 # What packages are required for this module to be executed?
-REQUIRED = ["pccm>=0.2.14", "pybind11>=2.6.0", "fire", "numpy", *deps]
+REQUIRED = ["pccm>=0.2.19", "pybind11>=2.6.0", "fire", "numpy", *deps]

 # What packages are optional?
 EXTRAS = {
@@ -145,18 +152,27 @@ if disable_jit is not None and disable_jit == "1":
    }
    from cumm.gemm.main import GemmMainUnitTest
    from spconv.core import SHUFFLE_SIMT_PARAMS, SHUFFLE_VOLTA_PARAMS, SHUFFLE_TURING_PARAMS
-
+    from spconv.core import IMPLGEMM_SIMT_PARAMS, IMPLGEMM_VOLTA_PARAMS, IMPLGEMM_TURING_PARAMS
+    from cumm.conv.main import ConvMainUnitTest
+    from cumm.constants import CUMM_CPU_ONLY_BUILD
    from spconv.csrc.sparse.all import SpconvOps
    cu = GemmMainUnitTest(SHUFFLE_SIMT_PARAMS + SHUFFLE_VOLTA_PARAMS + SHUFFLE_TURING_PARAMS)
+    convcu = ConvMainUnitTest(IMPLGEMM_SIMT_PARAMS + IMPLGEMM_VOLTA_PARAMS + IMPLGEMM_TURING_PARAMS)
+    convcu.namespace = "cumm.conv.main"

    cu.namespace = "cumm.gemm.main"
-    cuda_ver_number = int(cuda_ver)
-    if cuda_ver_number < 110:
-        std = "c++14" 
-    else:
-        std = "c++17"
+    std = "c++17"
+    if cuda_ver:
+        cuda_ver_number = int(cuda_ver)
+        if cuda_ver_number < 110:
+            std = "c++14" 
+        else:
+            std = "c++17"
+    cus = [cu, convcu, SpconvOps()]
+    if CUMM_CPU_ONLY_BUILD:
+        cus = [SpconvOps()]
    ext_modules: List[Extension] = [
-        PCCMExtension([cu, SpconvOps()],
+        PCCMExtension(cus,
                      "spconv/core_cc",
                      Path(__file__).resolve().parent / "spconv",
                      objects_folder="objects",
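Reviewer note: after this change an empty `CUMM_CUDA_VERSION` selects the plain `spconv` / `cumm` names and the C++17 default, while a non-empty value adds the `-cuXXX` suffix and may drop to C++14. The sketch below condenses both branches of the diff into one hypothetical helper, `resolve_build`, purely to make the mapping easy to check; it is not code from setup.py.

```python
from typing import List, Tuple


def resolve_build(cuda_ver: str, name: str = "spconv") -> Tuple[str, List[str], str]:
    """Return (release_name, deps, cxx_std) for a CUMM_CUDA_VERSION value."""
    if not cuda_ver:
        # CPU-only build: no suffix, default standard.
        return name, ["cumm"], "c++17"
    ver = cuda_ver.replace(".", "")  # "10.2" -> "102"
    std = "c++14" if int(ver) < 110 else "c++17"
    return f"{name}-cu{ver}", [f"cumm-cu{ver}"], std


for value in ["", "10.2", "114"]:
    print(repr(value), resolve_build(value))
```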

--- a/spconv/__init__.py
+++ b/spconv/__init__.py
@@ -15,4 +15,5 @@
 from . import build as _build

 from .core import ConvAlgo, AlgoHint
-from . import constants
\ No newline at end of file
+from . import constants
+from .__version__ import __version__
\ No newline at end of file
--- a/spconv/algo.py
+++ b/spconv/algo.py
--- a/spconv/build.py
+++ b/spconv/build.py
@@ -16,17 +16,27 @@ from pathlib import Path

 import pccm
 from pccm.utils import project_is_editable, project_is_installed
+from ccimport.compat import InWindows
+from .constants import PACKAGE_NAME, PACKAGE_ROOT, DISABLE_JIT

-from .constants import PACKAGE_NAME, PACKAGE_ROOT
-
-if project_is_installed(PACKAGE_NAME) and project_is_editable(PACKAGE_NAME):
+if project_is_installed(PACKAGE_NAME) and project_is_editable(PACKAGE_NAME) and not DISABLE_JIT:
    from spconv.core import SHUFFLE_SIMT_PARAMS, SHUFFLE_VOLTA_PARAMS, SHUFFLE_TURING_PARAMS
+    from spconv.core import IMPLGEMM_SIMT_PARAMS, IMPLGEMM_VOLTA_PARAMS, IMPLGEMM_TURING_PARAMS
+
    from cumm.gemm.main import GemmMainUnitTest
+    from cumm.conv.main import ConvMainUnitTest
+
    from spconv.csrc.sparse.all import SpconvOps
    cu = GemmMainUnitTest(SHUFFLE_SIMT_PARAMS + SHUFFLE_VOLTA_PARAMS + SHUFFLE_TURING_PARAMS)
    cu.namespace = "cumm.gemm.main"
-    pccm.builder.build_pybind([cu, SpconvOps()],
+    convcu = ConvMainUnitTest(IMPLGEMM_SIMT_PARAMS + IMPLGEMM_VOLTA_PARAMS + IMPLGEMM_TURING_PARAMS)
+    convcu.namespace = "cumm.conv.main"
+    objects_folder = None 
+    if InWindows:
+        # Windows has a command-line length limit, so we use objects_folder to shorten the command.
+        objects_folder = "objects"
+    pccm.builder.build_pybind([cu, convcu, SpconvOps()],
                              PACKAGE_ROOT / "core_cc",
                              namespace_root=PACKAGE_ROOT,
-                              objects_folder="objects",
+                              objects_folder=objects_folder,
                              load_library=False)
--- a/spconv/constants.py
+++ b/spconv/constants.py
@@ -24,4 +24,6 @@ EDITABLE_INSTALLED = project_is_installed(PACKAGE_NAME) and project_is_editable(


 _filter_hwio_env = os.getenv("SPCONV_FILTER_HWIO", "0")
-FILTER_HWIO = _filter_hwio_env == "1"
\ No newline at end of file
+FILTER_HWIO = _filter_hwio_env == "1"
+DISABLE_JIT = os.getenv("SPCONV_DISABLE_JIT", "0") == "1"
+NDIM_DONT_CARE = 3
\ No newline at end of file
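Reviewer note: both flags above are evaluated once at import time and only the exact string "1" enables them, so they need to be set before `spconv` is imported. A small sketch of that behavior, using a hypothetical helper `env_flag` that mirrors the comparison in the diff:

```python
import os


def env_flag(name: str, default: str = "0") -> bool:
    """True only when the variable is exactly "1"."""
    return os.getenv(name, default) == "1"


os.environ["SPCONV_DISABLE_JIT"] = "1"
disable_jit = env_flag("SPCONV_DISABLE_JIT")

# Values such as "true" do not enable a flag under this comparison.
os.environ["SPCONV_FILTER_HWIO"] = "true"
filter_hwio = env_flag("SPCONV_FILTER_HWIO")

print(disable_jit, filter_hwio)
```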
--- a/spconv/core.py
+++ b/spconv/core.py
--- a/spconv/core_cc/csrc/sparse/all/__init__.pyi
+++ b/spconv/core_cc/csrc/sparse/all/__init__.pyi
 from typing import overload, Any, Callable, Dict, List, Optional, Set, Tuple, Type, Union
 from pccm.stubs import EnumValue, EnumClassValue
 from cumm.tensorview import Tensor
+class ThrustCustomAllocatorV2:
+    alloc_func: Callable[[int], int]
 class SpconvOps:
    @staticmethod
    def generate_conv_inds_stage1(indices: Tensor, indice_pairs: Tensor, indice_pairs_uniq: Tensor, indice_num_per_loc: Tensor, batch_size: int, output_dims: List[int], input_dims: List[int], ksize: List[int], stride: List[int], padding: List[int], dilation: List[int], transposed: bool = False, stream_int: int = 0) -> None: 
@@ -53,6 +55,49 @@ class SpconvOps:
        """
        ...
    @staticmethod
+    def generate_conv_inds_mask_stage1(indices: Tensor, indice_pairs_bwd: Tensor, indice_pairs_uniq: Tensor, indice_num_per_loc: Tensor, batch_size: int, output_dims: List[int], input_dims: List[int], ksize: List[int], stride: List[int], padding: List[int], dilation: List[int], transposed: bool = False, stream_int: int = 0) -> None: 
+        """
+        Args:
+            indices: 
+            indice_pairs_bwd: 
+            indice_pairs_uniq: 
+            indice_num_per_loc: 
+            batch_size: 
+            output_dims: 
+            input_dims: 
+            ksize: 
+            stride: 
+            padding: 
+            dilation: 
+            transposed: 
+            stream_int: 
+        """
+        ...
+    @staticmethod
+    def generate_conv_inds_mask_stage2(indices: Tensor, hashdata: Tensor, indice_pairs_fwd: Tensor, indice_pairs_bwd: Tensor, indice_pairs_uniq: Tensor, out_inds: Tensor, mask_fwd: Tensor, mask_bwd: Tensor, num_out_act: int, batch_size: int, output_dims: List[int], input_dims: List[int], ksize: List[int], stride: List[int], padding: List[int], dilation: List[int], transposed: bool = False, stream_int: int = 0) -> int: 
+        """
+        Args:
+            indices: 
+            hashdata: 
+            indice_pairs_fwd: 
+            indice_pairs_bwd: 
+            indice_pairs_uniq: 
+            out_inds: 
+            mask_fwd: 
+            mask_bwd: 
+            num_out_act: 
+            batch_size: 
+            output_dims: 
+            input_dims: 
+            ksize: 
+            stride: 
+            padding: 
+            dilation: 
+            transposed: 
+            stream_int: 
+        """
+        ...
+    @staticmethod
    def generate_subm_conv_inds(indices: Tensor, hashdata: Tensor, indice_pairs: Tensor, out_inds: Tensor, indice_num_per_loc: Tensor, batch_size: int, input_dims: List[int], ksize: List[int], dilation: List[int], indice_pair_mask: Tensor =  Tensor(), backward: bool = False, stream_int: int =  0) -> int: 
        """
        Args:
@@ -71,6 +116,38 @@ class SpconvOps:
        """
        ...
    @staticmethod
+    def generate_conv_inds_cpu(indices: Tensor, indice_pairs: Tensor, out_inds: Tensor, indice_num_per_loc: Tensor, batch_size: int, output_dims: List[int], input_dims: List[int], ksize: List[int], stride: List[int], padding: List[int], dilation: List[int], transposed: bool = False) -> int: 
+        """
+        Args:
+            indices: 
+            indice_pairs: 
+            out_inds: 
+            indice_num_per_loc: 
+            batch_size: 
+            output_dims: 
+            input_dims: 
+            ksize: 
+            stride: 
+            padding: 
+            dilation: 
+            transposed: 
+        """
+        ...
+    @staticmethod
+    def generate_subm_conv_inds_cpu(indices: Tensor, indice_pairs: Tensor, out_inds: Tensor, indice_num_per_loc: Tensor, batch_size: int, input_dims: List[int], ksize: List[int], dilation: List[int]) -> int: 
+        """
+        Args:
+            indices: 
+            indice_pairs: 
+            out_inds: 
+            indice_num_per_loc: 
+            batch_size: 
+            input_dims: 
+            ksize: 
+            dilation: 
+        """
+        ...
+    @staticmethod
    def maxpool_forward(out: Tensor, inp: Tensor, out_inds: Tensor, in_inds: Tensor, stream: int = 0) -> None: 
        """
        Args:
@@ -95,9 +172,157 @@ class SpconvOps:
        """
        ...
    @staticmethod
-    def sort_1d_by_key(data: Tensor) -> Tensor: 
+    def maxpool_implicit_gemm_forward(out: Tensor, inp: Tensor, inds: Tensor, stream: int = 0) -> None: 
+        """
+        Args:
+            out: 
+            inp: 
+            inds: 
+            stream: 
+        """
+        ...
+    @staticmethod
+    def maxpool_implicit_gemm_backward(out: Tensor, inp: Tensor, dout: Tensor, dinp: Tensor, inds: Tensor, stream: int = 0) -> None: 
+        """
+        Args:
+            out: 
+            inp: 
+            dout: 
+            dinp: 
+            inds: 
+            stream: 
+        """
+        ...
+    @staticmethod
+    def maxpool_forward_cpu(out: Tensor, inp: Tensor, out_inds: Tensor, in_inds: Tensor) -> None: 
+        """
+        Args:
+            out: 
+            inp: 
+            out_inds: 
+            in_inds: 
+        """
+        ...
+    @staticmethod
+    def maxpool_backward_cpu(out: Tensor, inp: Tensor, dout: Tensor, dinp: Tensor, out_inds: Tensor, in_inds: Tensor) -> None: 
+        """
+        Args:
+            out: 
+            inp: 
+            dout: 
+            dinp: 
+            out_inds: 
+            in_inds: 
+        """
+        ...
+    @staticmethod
+    def gather_cpu(out: Tensor, inp: Tensor, inds: Tensor) -> None: 
+        """
+        Args:
+            out: 
+            inp: 
+            inds: 
+        """
+        ...
+    @staticmethod
+    def scatter_add_cpu(out: Tensor, inp: Tensor, inds: Tensor) -> None: 
+        """
+        Args:
+            out: 
+            inp: 
+            inds: 
+        """
+        ...
+    @staticmethod
+    def sort_1d_by_key(data: Tensor, indices: Tensor =  Tensor(), stream: int = 0) -> Tensor: 
+        """
+        Args:
+            data: 
+            indices: 
+            stream: 
+        """
+        ...
+    @staticmethod
+    def sort_1d_by_key_allocator(data: Tensor, alloc_func, indices: Tensor =  Tensor(), stream: int = 0) -> Tensor: 
        """
        Args:
            data: 
+            alloc_func: 
+            indices: 
+            stream: 
+        """
+        ...
+    @staticmethod
+    def sort_1d_by_key_split(data: Tensor, mask: Tensor, indices: Tensor =  Tensor(), stream: int = 0, mask_output: bool = False) -> Tensor: 
+        """
+        Args:
+            data: 
+            mask: 
+            indices: 
+            stream: 
+            mask_output: 
+        """
+        ...
+    @staticmethod
+    def sort_1d_by_key_split_allocator(data: Tensor, alloc_func, mask: Tensor, indices: Tensor =  Tensor(), stream: int = 0, mask_output: bool = False) -> Tensor: 
+        """
+        Args:
+            data: 
+            alloc_func: 
+            mask: 
+            indices: 
+            stream: 
+            mask_output: 
+        """
+        ...
+    @staticmethod
+    def count_bits(a: Tensor) -> Tensor: 
+        """
+        Args:
+            a: 
+        """
+        ...
+    @staticmethod
+    def calc_point2voxel_meta_data(vsize_xyz: List[float], coors_range_xyz: List[float]) -> Tuple[List[float], List[int], List[int], List[float]]: 
+        """
+        Args:
+            vsize_xyz: 
+            coors_range_xyz: 
+        """
+        ...
+    @staticmethod
+    def point2voxel_cpu(points: Tensor, voxels: Tensor, indices: Tensor, num_per_voxel: Tensor, densehashdata: Tensor, vsize: List[float], grid_size: List[int], grid_stride: List[int], coors_range: List[float], empty_mean: bool = False, clear_voxels: bool = True) -> Tuple[Tensor, Tensor, Tensor]: 
+        """
+        Args:
+            points: 
+            voxels: 
+            indices: 
+            num_per_voxel: 
+            densehashdata: 
+            vsize: 
+            grid_size: 
+            grid_stride: 
+            coors_range: 
+            empty_mean: 
+            clear_voxels: 
+        """
+        ...
+    @staticmethod
+    def point2voxel_cuda(points: Tensor, voxels: Tensor, indices: Tensor, num_per_voxel: Tensor, hashdata: Tensor, point_indice_data: Tensor, vsize: List[float], grid_size: List[int], grid_stride: List[int], coors_range: List[float], empty_mean: bool = False, clear_voxels: bool = True, stream_int: int = 0) -> Tuple[Tensor, Tensor, Tensor]: 
+        """
+        Args:
+            points: 
+            voxels: 
+            indices: 
+            num_per_voxel: 
+            hashdata: 
+            point_indice_data: 
+            vsize: 
+            grid_size: 
+            grid_stride: 
+            coors_range: 
+            empty_mean: 
+            clear_voxels: 
+            stream_int: 
        """
        ...
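Reviewer note: the dev script in this commit applies `SpconvOps.count_bits` to the uint32 pair masks and sums the result to count valid kernel locations. A CPU sketch for cross-checking, assuming the op returns the per-element population count (an assumption; the stub documents no semantics). `count_bits_ref` is a hypothetical helper.

```python
import numpy as np


def count_bits_ref(a: np.ndarray) -> np.ndarray:
    """Per-element popcount of a uint32 array."""
    a = np.ascontiguousarray(a, dtype=np.uint32)
    as_bytes = a.view(np.uint8).reshape(*a.shape, 4)   # 4 bytes per element
    return np.unpackbits(as_bytes, axis=-1).sum(axis=-1).astype(np.int32)


masks = np.array([0b1011, 0, 0xFFFFFFFF], dtype=np.uint32)
counts = count_bits_ref(masks)
print(counts, counts.sum())
```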
--- a/spconv/core_cc/csrc/sparse/all/ops1d/__init__.pyi
+++ b/spconv/core_cc/csrc/sparse/all/ops1d/__init__.pyi
+from typing import overload, Any, Callable, Dict, List, Optional, Set, Tuple, Type, Union
+from pccm.stubs import EnumValue, EnumClassValue
+from cumm.tensorview import Tensor
+class Point2Voxel:
+    hashdata: Tensor
+    point_indice_data: Tensor
+    voxels: Tensor
+    indices: Tensor
+    num_per_voxel: Tensor
+    @property
+    def grid_size(self) -> List[int]: ...
+    def __init__(self, vsize_xyz: List[float], coors_range_xyz: List[float], num_point_features: int, max_num_voxels: int, max_num_points_per_voxel: int) -> None: 
+        """
+        Args:
+            vsize_xyz: 
+            coors_range_xyz: 
+            num_point_features: 
+            max_num_voxels: 
+            max_num_points_per_voxel: 
+        """
+        ...
+    def point_to_voxel_hash(self, points: Tensor, clear_voxels: bool = True, empty_mean: bool = False, stream_int: int = 0) -> Tuple[Tensor, Tensor, Tensor]: 
+        """
+        Args:
+            points: 
+            clear_voxels: 
+            empty_mean: 
+            stream_int: 
+        """
+        ...
+    @staticmethod
+    def point_to_voxel_hash_static(points: Tensor, voxels: Tensor, indices: Tensor, num_per_voxel: Tensor, hashdata: Tensor, point_indice_data: Tensor, vsize: List[float], grid_size: List[int], grid_stride: List[int], coors_range: List[float], clear_voxels: bool = True, empty_mean: bool = False, stream_int: int = 0) -> Tuple[Tensor, Tensor, Tensor]: 
+        """
+        Args:
+            points: 
+            voxels: 
+            indices: 
+            num_per_voxel: 
+            hashdata: 
+            point_indice_data: 
+            vsize: 
+            grid_size: 
+            grid_stride: 
+            coors_range: 
+            clear_voxels: 
+            empty_mean: 
+            stream_int: 
+        """
+        ...