Merge branch 'feature/ampere'

838aa644 · yan.yan · a31b131f · f3a22f19 · 838aa644 · 838aa644
Commit 838aa644 authored Sep 23, 2022 by yan.yan
20 changed files
--- a/.github/workflows/build.yaml
+++ b/.github/workflows/build.yaml
@@ -15,7 +15,7 @@ jobs:
    runs-on: windows-2019
    strategy:
      matrix:
-        python-version: ['3.7', '3.8', '3.9', '3.10'] 
+        python-version: ['3.7', '3.8', '3.9', '3.10', '3.11'] 
        cuda-version: ['10.2', '11.1', '11.4']
    steps:
      - uses: actions/checkout@master
@@ -115,7 +115,7 @@ jobs:
    runs-on: ubuntu-20.04
    strategy:
      matrix:
-        python-version: ['3.6', '3.7', '3.8', '3.9', '3.10'] # this version is only used for upload.
+        python-version: ['3.7', '3.8', '3.9', '3.10', '3.11'] # this version is only used for upload.
        cuda-version: ['102', '111', '113', '114', '']

    steps:

--- a/CHANGELOG.md
+++ b/CHANGELOG.md
 # Changelog
+## [2.2.0] - 2022-9-24
+### Added 
+- Add Ampere support. faster fp16, faster tf32 and greatly faster int8 kernels in Ampere GPUs.
+- Add pure c++ code generation (libspconv.so) for deploy (or train in another deeplearning framework)
+- Add NVRTC support for all gemm kernels. if your GPU architecture isn't compiled in prebuilt, spconv will use slightly slower (10-20us overhead for every kernel launch) NVRTC kernels. 
+
+### Fixed
+- Fix launch fail in maxpool if too much voxels 
+
+### Changed
+- all weight layout will be KRSC, don't support old spconv 1.x weights anymore.
+- previous gemm ops in ops.py now move to c++ by default (controlled by spconv.constants.SPCONV_CPP_GEMM)
+
+### Removed
+- drop python 3.6 support.
+- pascal and kepler architecture is removed in CUDA 12 prebuilt.
+
 ## [2.1.22] - 2022-6-11
 ### Fixed
 - Fix thrust problem by adding -fvisibility=hidden

--- a/README.md
+++ b/README.md
@@ -17,16 +17,19 @@
 [pypi-ver-114]: https://img.shields.io/pypi/v/spconv-cu114
 [pypi-ver-111]: https://img.shields.io/pypi/v/spconv-cu111
 [pypi-ver-113]: https://img.shields.io/pypi/v/spconv-cu113
+[pypi-ver-120]: https://img.shields.io/pypi/v/spconv-cu120
 [pypi-ver-102]: https://img.shields.io/pypi/v/spconv-cu102

+[pypi-url-102]: https://pypi.org/project/spconv-cu102/
+[pypi-download-102]: https://img.shields.io/pypi/dm/spconv-cu102
 [pypi-url-111]: https://pypi.org/project/spconv-cu111/
 [pypi-download-111]: https://img.shields.io/pypi/dm/spconv-cu111
 [pypi-url-113]: https://pypi.org/project/spconv-cu113/
 [pypi-download-113]: https://img.shields.io/pypi/dm/spconv-cu113
-[pypi-url-102]: https://pypi.org/project/spconv-cu102/
-[pypi-download-102]: https://img.shields.io/pypi/dm/spconv-cu102
 [pypi-url-114]: https://pypi.org/project/spconv-cu114/
 [pypi-download-114]: https://img.shields.io/pypi/dm/spconv-cu114
+[pypi-url-120]: https://pypi.org/project/spconv-cu120/
+[pypi-download-120]: https://img.shields.io/pypi/dm/spconv-cu120
 [pypi-url-cpu]: https://pypi.org/project/spconv/
 [pypi-download-cpu]: https://img.shields.io/pypi/dm/spconv

@@ -37,10 +40,11 @@
 |                | PyPI   | Install  |Downloads  |
 | -------------- |:---------------------:| ---------------------:| ---------------------:| 
 | CPU (Linux Only) | [![PyPI Version][pypi-ver-cpu]][pypi-url-cpu] | ```pip install spconv``` | [![pypi monthly download][pypi-download-cpu]][pypi-url-cpu] | 
-| CUDA 10.2 | [![PyPI Version][pypi-ver-102]][pypi-url-102] | ```pip install spconv-cu102``` | [![pypi monthly download][pypi-download-102]][pypi-url-102] | 
+| CUDA 10.2 | [![PyPI Version][pypi-ver-102]][pypi-url-102] | ```pip install spconv-cu102```| [![pypi monthly download][pypi-download-102]][pypi-url-102]| 
 | CUDA 11.1 | [![PyPI Version][pypi-ver-111]][pypi-url-111] | ```pip install spconv-cu111```| [![pypi monthly download][pypi-download-111]][pypi-url-111]| 
 | CUDA 11.3 (Linux Only) | [![PyPI Version][pypi-ver-113]][pypi-url-113] | ```pip install spconv-cu113```| [![pypi monthly download][pypi-download-113]][pypi-url-113]| 
 | CUDA 11.4 | [![PyPI Version][pypi-ver-114]][pypi-url-114] | ```pip install spconv-cu114```| [![pypi monthly download][pypi-download-114]][pypi-url-114]|
+<!-- | CUDA 12.0 | [![PyPI Version][pypi-ver-120]][pypi-url-120] | ```pip install spconv-cu120```| [![pypi monthly download][pypi-download-120]][pypi-url-120]| -->

 ```spconv``` is a project that provide heavily-optimized sparse convolution implementation with tensor core support. check [benchmark](docs/BENCHMARK.md) to see how fast spconv 2.x runs.

@@ -48,11 +52,9 @@

 Check [spconv 2.x algorithm introduction](docs/spconv2_algo.pdf) to understand sparse convolution algorithm in spconv 2.x!

-**WARNING** spconv < 2.1.18 users need to upgrade your version to 2.1.18, it fix a bug in conv weight init which cause std of inited weight too large, and a bug in PointToVoxel.
-
-## Breaking changes in Spconv 2.x
+## NEWS

-Spconv 1.x users **NEED READ [THIS](docs/SPCONV_2_BREAKING_CHANGEs.md)** before using spconv 2.x.
+* spconv 2.2: ampere feature support (by [EvernightAurora](https://github.com/EvernightAurora)), pure c++ code generation, nvrtc, drop python 3.6

 ## Spconv 2.1 vs Spconv 1.x

@@ -64,11 +66,12 @@ Spconv 1.x users **NEED READ [THIS](docs/SPCONV_2_BREAKING_CHANGEs.md)** before
 * [doesn't depend on pytorch binary](docs/FAQ.md#What-does-no-dependency-on-pytorch-mean), but you may need at least pytorch >= 1.5.0 to run spconv 2.x.
 * since spconv 2.x doesn't depend on pytorch binary (never in future), it's impossible to support torch.jit/libtorch inference.

-## Spconv 2.x Development and Roadmap
-
-Spconv 2.2 development has started. See [this issue](https://github.com/traveller59/spconv/issues/380) for more details.
+## Spconv 2.2 vs Spconv 2.1

-See [dev plan](docs/SPCONV_DEVELOP_PLAN.md). A complete guide of spconv development will be released soon.
+* faster fp16 kernels (~5-30%) in ampere GPUs (tested in RTX 3090)
+* greatly faster int8 kernels (~1.2x~2.7x) in ampere GPUs (tested in RTX 3090)
+* no python 3.6 support
+* nvrtc support: kernel in old GPUs will be compiled in runtime.

 ## Usage

@@ -80,24 +83,20 @@ Don't forget to check [performance guide](docs/PERFORMANCE_GUIDE.md).

 ## Install

-You need to install python >= 3.6 (>=3.7 for windows) first to use spconv 2.x.
+You need to install python >= 3.7 first to use spconv 2.x.

 You need to install CUDA toolkit first before using prebuilt binaries or build from source.

-You need at least CUDA 10.2 to build and run spconv 2.x. We won't offer any support for CUDA < 10.2.
+You need at least CUDA 11.0 to build and run spconv 2.x. We won't offer any support for CUDA < 11.0.

 ### Prebuilt

-We offer python 3.6-3.10 and cuda 10.2/11.1/11.3/11.4 prebuilt binaries for linux (manylinux).
+We offer python 3.7-3.11 and cuda 10.2/11.1/11.3/11.4/12.0 prebuilt binaries for linux (manylinux).

-We offer python 3.7-3.10 and cuda 10.2/11.1/11.4 prebuilt binaries for windows 10/11.
-
-We will provide prebuilts for CUDA versions supported by latest pytorch release. For example, pytorch 1.10 provide cuda 10.2 and 11.3 prebuilts, so we provide them too.
+We offer python 3.7-3.11 and cuda 10.2/11.1/11.4/12.0 prebuilt binaries for windows 10/11.

 For Linux users, you need to install pip >= 20.3 first to install prebuilt.

-CUDA 11.1 will be removed in spconv 2.2 because pytorch 1.10 don't provide prebuilts for it.
-
 ```pip install spconv``` for CPU only (**Linux Only**). you should only use this for debug usage, the performance isn't optimized due to manylinux limit (no omp support).

 ```pip install spconv-cu102``` for CUDA 10.2
@@ -108,9 +107,9 @@ CUDA 11.1 will be removed in spconv 2.2 because pytorch 1.10 don't provide prebu

 ```pip install spconv-cu114``` for CUDA 11.4

-**NOTE** It's safe to have different **minor** cuda version between system and conda (pytorch) in **CUDA >= 11.0** because of [CUDA Minor Version Compatibility](https://docs.nvidia.com/deploy/cuda-compatibility/#minor-version-compatibility). For example, you can use spconv-cu114 with anaconda version of pytorch cuda 11.1 in a OS with CUDA 11.2 installed.
+```pip install spconv-cu120``` for CUDA 12.0

-For CUDA 10, we don't know whether ```spconv-cu102``` works with CUDA 10.0 and 10.1. Users can have a try.
+**NOTE** It's safe to have different **minor** cuda version between system and conda (pytorch) in **CUDA >= 11.0** because of [CUDA Minor Version Compatibility](https://docs.nvidia.com/deploy/cuda-compatibility/#minor-version-compatibility). For example, you can use spconv-cu114 with anaconda version of pytorch cuda 11.1 in a OS with CUDA 11.2 installed.

 **NOTE** In Linux, you can install spconv-cuxxx without install CUDA to system! only suitable NVIDIA driver is required. for CUDA 11, we need driver >= 450.82.

@@ -118,11 +117,12 @@ For CUDA 10, we don't know whether ```spconv-cu102``` works with CUDA 10.0 and 1

 See [this page](https://arnon.dk/matching-sm-architectures-arch-and-gencode-for-various-nvidia-cards/) to check supported GPU names by arch.

+If you use a GPU architecture that isn't compiled in prebuilt, spconv will use NVRTC to compile a slightly slower kernel.
+
 | CUDA version | GPU Arch List  |
 | -------------- |:---------------------:|
-| 10.2 | 50,52,60,61,70,75     | 
 | 11.x       | 52,60,61,70,75,80,86     | 
-| 12.x       | 60,61,70,75,80,86,90     | 
+| 12.x       | 70,75,80,86,90     | 

 ### Build from source for development (JIT, recommend)

@@ -171,9 +171,9 @@ You need to rebuild ```cumm``` first if you are build along a CUDA version that
 5. run ```pip install pccm cumm wheel```
 6. run ```python setup.py bdist_wheel```+```pip install dists/xxx.whl```

-## Know issues
+## Contributers

-* Spconv 2.x F16 runs slow in A100. 
+* [EvernightAurora](https://github.com/EvernightAurora): add ampere feature.

 ## Note


--- a/docs/PURE_CPP_BUILD.md
+++ b/docs/PURE_CPP_BUILD.md
--- a/example/fuse_bn_act.py
+++ b/example/fuse_bn_act.py
+# Copyright 2021 Yan Yan
+# 
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+# 
+#     http://www.apache.org/licenses/LICENSE-2.0
+# 
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""simple fuse
+see https://pytorch.org/tutorials/intermediate/fx_conv_bn_fuser.html
+"""
+
+import time
+from pathlib import Path
+from typing import Any, Dict, Tuple
+
+import numpy as np
+import torch
+from torch import nn
+from cumm import tensorview as tv
+from spconv.core import ConvAlgo
+import torch.fx
+import spconv.pytorch as spconv
+import copy 
+import pickle
+from spconv.pytorch.conv import SparseConvolution
+from spconv.pytorch import functional as Fsp
+
+
+def fuse_bn_weights(conv_w_OKI, conv_b, bn_rm, bn_rv, bn_eps, bn_w, bn_b):
+    NDim = conv_w_OKI.ndim - 2
+    permute = [0, NDim+1] + [i+1 for i in range(NDim)]
+    conv_w_OIK = conv_w_OKI.permute(*permute)
+    # OIDHW
+    if conv_b is None:
+        conv_b = torch.zeros_like(bn_rm)
+    if bn_w is None:
+        bn_w = torch.ones_like(bn_rm)
+    if bn_b is None:
+        bn_b = torch.zeros_like(bn_rm)
+    bn_var_rsqrt = torch.rsqrt(bn_rv + bn_eps)
+
+    conv_w_OIK = conv_w_OIK * (bn_w * bn_var_rsqrt).reshape([-1] + [1] * (len(conv_w_OIK.shape) - 1))
+    conv_b = (conv_b - bn_rm) * bn_var_rsqrt * bn_w + bn_b
+    permute = [0,] + [i+2 for i in range(NDim)] + [1,]
+    conv_w_OKI = conv_w_OIK.permute(*permute).contiguous()
+    return torch.nn.Parameter(conv_w_OKI), torch.nn.Parameter(conv_b)
+
+def fuse_bn(conv, bn):
+    """
+    Given a conv Module `A` and an batch_norm module `B`, returns a conv
+    module `C` such that C(x) == B(A(x)) in inference mode.
+    """
+    assert(not (conv.training or bn.training)), "Fusion only for eval!"
+    fused_conv = copy.deepcopy(conv)
+
+    fused_conv.weight, fused_conv.bias = \
+        fuse_bn_weights(fused_conv.weight, fused_conv.bias,
+                             bn.running_mean, bn.running_var, bn.eps, bn.weight, bn.bias)
+
+    return fused_conv
+
+def fuse_act_net(conv, act):
+    """
+    Given a conv Module `A` and an batch_norm module `B`, returns a conv
+    module `C` such that C(x) == B(A(x)) in inference mode.
+    """
+    assert(not (conv.training)), "Fusion only for eval!"
+    fused_conv = copy.deepcopy(conv)
+    if isinstance(act, torch.nn.ReLU):
+        fused_conv.act_type = tv.gemm.Activation.ReLU
+    if isinstance(act, torch.nn.Sigmoid):
+        fused_conv.act_type = tv.gemm.Activation.Sigmoid
+    elif isinstance(act, torch.nn.LeakyReLU):
+        fused_conv.act_type = tv.gemm.Activation.LeakyReLU
+        fused_conv.act_alpha = act.negative_slope
+    else:
+        raise NotImplementedError
+    return fused_conv
+
+def _parent_name(target : str) -> Tuple[str, str]:
+    """
+    Splits a qualname into parent path and last atom.
+    For example, `foo.bar.baz` -> (`foo.bar`, `baz`)
+    """
+    *parent, name = target.rsplit('.', 1)
+    return parent[0] if parent else '', name
+
+def replace_node_module(node: torch.fx.Node, modules: Dict[str, Any], new_module: torch.nn.Module):
+    assert(isinstance(node.target, str))
+    parent_name, name = _parent_name(node.target)
+    setattr(modules[parent_name], name, new_module)
+
+def fuse(model: torch.fx.GraphModule) -> torch.fx.GraphModule:
+    model = copy.deepcopy(model)
+    # The first step of most FX passes is to symbolically trace our model to
+    # obtain a `GraphModule`. This is a representation of our original model
+    # that is functionally identical to our original model, except that we now
+    # also have a graph representation of our forward pass.
+    fx_model = model
+    modules = dict(fx_model.named_modules())
+    # The primary representation for working with FX are the `Graph` and the
+    # `Node`. Each `GraphModule` has a `Graph` associated with it - this
+    # `Graph` is also what generates `GraphModule.code`.
+    # The `Graph` itself is represented as a list of `Node` objects. Thus, to
+    # iterate through all of the operations in our graph, we iterate over each
+    # `Node` in our `Graph`.
+    for node in fx_model.graph.nodes:
+        # The FX IR contains several types of nodes, which generally represent
+        # call sites to modules, functions, or methods. The type of node is
+        # determined by `Node.op`.
+
+        if node.op != 'call_module': # If our current node isn't calling a Module then we can ignore it.
+            continue
+        # For call sites, `Node.target` represents the module/function/method
+        # that's being called. Here, we check `Node.target` to see if it's a
+        # batch norm module, and then check `Node.args[0].target` to see if the
+        # input `Node` is a convolution.
+        # print(node.target, node.args, node.args[0].args)
+        if isinstance(modules[node.target], torch.nn.BatchNorm1d):
+            if node.args[0].target in modules and isinstance(modules[node.args[0].target], SparseConvolution):
+                if len(node.args[0].users) > 1:  # Output of conv is used by other nodes
+                    continue
+                conv = modules[node.args[0].target]
+                bn = modules[node.target]
+                fused_conv = fuse_bn(conv, bn)
+                assert isinstance(fused_conv, SparseConvolution)
+                replace_node_module(node.args[0], modules, fused_conv)
+                modules_to_fuse = [node.args[0].target, node.target]
+                try:
+                    if isinstance(modules[node.next.target], torch.nn.ReLU):
+                        modules_to_fuse.append(node.next.target)
+                except Exception as e:
+                    pass
+                # As we've folded the batch nor into the conv, we need to replace all uses
+                # of the batch norm with the conv.
+                node.replace_all_uses_with(node.args[0])
+                # Now that all uses of the batch norm have been replaced, we can
+                # safely remove the batch norm.
+                fx_model.graph.erase_node(node)
+                
+    fx_model.graph.lint()
+    # After we've modified our graph, we need to recompile our graph in order
+    # to keep the generated code in sync.
+    fx_model.recompile()
+    return fx_model
+
+
+def fuse_act(model: torch.fx.GraphModule) -> torch.fx.GraphModule:
+    model = copy.deepcopy(model)
+    # The first step of most FX passes is to symbolically trace our model to
+    # obtain a `GraphModule`. This is a representation of our original model
+    # that is functionally identical to our original model, except that we now
+    # also have a graph representation of our forward pass.
+    fx_model = model
+    modules = dict(fx_model.named_modules())
+    # The primary representation for working with FX are the `Graph` and the
+    # `Node`. Each `GraphModule` has a `Graph` associated with it - this
+    # `Graph` is also what generates `GraphModule.code`.
+    # The `Graph` itself is represented as a list of `Node` objects. Thus, to
+    # iterate through all of the operations in our graph, we iterate over each
+    # `Node` in our `Graph`.
+    for node in fx_model.graph.nodes:
+        # The FX IR contains several types of nodes, which generally represent
+        # call sites to modules, functions, or methods. The type of node is
+        # determined by `Node.op`.
+
+        if node.op != 'call_module': # If our current node isn't calling a Module then we can ignore it.
+            continue
+        # For call sites, `Node.target` represents the module/function/method
+        # that's being called. Here, we check `Node.target` to see if it's a
+        # batch norm module, and then check `Node.args[0].target` to see if the
+        # input `Node` is a convolution.
+        # print(node.target, node.args, node.args[0].args)
+        if isinstance(modules[node.target], (torch.nn.ReLU, torch.nn.LeakyReLU, torch.nn.Sigmoid)):
+            if node.args[0].target in modules and isinstance(modules[node.args[0].target], SparseConvolution):
+                if len(node.args[0].users) > 1:  # Output of conv is used by other nodes
+                    continue
+                conv = modules[node.args[0].target]
+                act = modules[node.target]
+                fused_conv = fuse_act_net(conv, act)
+                assert isinstance(fused_conv, SparseConvolution)
+                replace_node_module(node.args[0], modules, fused_conv)
+                # As we've folded the batch nor into the conv, we need to replace all uses
+                # of the batch norm with the conv.
+                node.replace_all_uses_with(node.args[0])
+                # Now that all uses of the batch norm have been replaced, we can
+                # safely remove the batch norm.
+                fx_model.graph.erase_node(node)
+                
+    fx_model.graph.lint()
+    # After we've modified our graph, we need to recompile our graph in order
+    # to keep the generated code in sync.
+    fx_model.recompile()
+    return fx_model
+
+
+class Net(nn.Module):
+    def __init__(self, shape, algo):
+        super().__init__()
+        pool_algo = algo
+        # pool_algo = ConvAlgo.Native
+        self.net = spconv.SparseSequential(
+            spconv.SubMConv3d(3, 64, 3, bias=False, indice_key="c0",
+                              algo=algo),
+            nn.BatchNorm1d(64),
+            nn.ReLU(),
+            spconv.SubMConv3d(64,
+                              64,
+                              3,
+                              bias=False,
+                              indice_key="c0",
+                              algo=algo),
+            nn.BatchNorm1d(64),
+            nn.ReLU(),
+            spconv.SparseConv3d(64, 64, 2, 2, bias=False, indice_key="m0", algo=algo),
+            nn.BatchNorm1d(64),
+            nn.ReLU(),
+            spconv.SubMConv3d(64,
+                              96,
+                              3,
+                              bias=False,
+                              indice_key="c1",
+                              algo=algo),
+            nn.BatchNorm1d(96),
+            nn.ReLU(),
+
+            spconv.SubMConv3d(96,
+                              96,
+                              3,
+                              bias=False,
+                              indice_key="c1",
+                              algo=algo),
+            nn.BatchNorm1d(96),
+            nn.ReLU(),
+            spconv.SparseConv3d(96, 96, 2, 2, bias=False, indice_key="m1", algo=algo),
+            nn.BatchNorm1d(96),
+            nn.ReLU(),
+
+            spconv.SubMConv3d(96,
+                              128,
+                              3,
+                              bias=False,
+                              indice_key="c2",
+                              algo=algo),
+            nn.BatchNorm1d(128),
+            nn.ReLU(),
+            spconv.SubMConv3d(128,
+                              128,
+                              3,
+                              bias=False,
+                              indice_key="c2",
+                              algo=algo),
+            nn.BatchNorm1d(128),
+            nn.ReLU(),
+            spconv.SparseConv3d(128, 128, 2, 2, bias=False, indice_key="m2", algo=algo),
+            nn.BatchNorm1d(128),
+            nn.ReLU(),
+
+            spconv.SubMConv3d(128,
+                              160,
+                              3,
+                              bias=False,
+                              indice_key="c3",
+                              algo=algo),
+            nn.BatchNorm1d(160),
+            nn.ReLU(),
+            spconv.SubMConv3d(160,
+                              160,
+                              3,
+                              bias=False,
+                              indice_key="c3",
+                              algo=algo),
+            nn.BatchNorm1d(160),
+            nn.ReLU(),
+            spconv.SparseConv3d(160, 160, 2, 2, bias=False, indice_key="m3", algo=algo),
+            nn.BatchNorm1d(160),
+            nn.ReLU(),
+
+            spconv.SubMConv3d(160,
+                              192,
+                              3,
+                              bias=False,
+                              indice_key="c4",
+                              algo=algo),
+            nn.BatchNorm1d(192),
+            nn.ReLU(),
+
+            spconv.SubMConv3d(192,
+                              192,
+                              3,
+                              bias=False,
+                              indice_key="c4",
+                              algo=algo),
+            nn.BatchNorm1d(192),
+            nn.ReLU(),
+            spconv.SparseConv3d(192, 192, 2, 2, bias=False, indice_key="m4", algo=algo),
+            nn.BatchNorm1d(192),
+            nn.ReLU(),
+
+            spconv.SubMConv3d(192,
+                              224,
+                              3,
+                              bias=False,
+                              indice_key="c5",
+                              algo=algo),
+            nn.BatchNorm1d(224),
+            nn.ReLU(),
+
+            spconv.SubMConv3d(224,
+                              224,
+                              3,
+                              bias=False,
+                              indice_key="c5",
+                              algo=algo),
+            nn.BatchNorm1d(224),
+            nn.ReLU(),
+            spconv.SparseConv3d(224, 224, 2, 2, bias=False, indice_key="m5", algo=algo),
+            nn.BatchNorm1d(224),
+            nn.ReLU(),
+
+            spconv.SubMConv3d(224,
+                              256,
+                              3,
+                              bias=False,
+                              indice_key="c6",
+                              algo=algo),
+            nn.BatchNorm1d(256),
+            nn.ReLU(),
+
+            spconv.SubMConv3d(256,
+                              256,
+                              3,
+                              bias=False,
+                              indice_key="c6",
+                              algo=algo),
+            nn.BatchNorm1d(256),
+            nn.ReLU(),
+
+            spconv.SparseInverseConv3d(256,
+                                       128,
+                                       2,
+                                       indice_key="m5",
+                                       bias=False,
+                                       algo=algo),
+            nn.BatchNorm1d(128),
+            nn.ReLU(),
+            spconv.SparseInverseConv3d(128,
+                                       64,
+                                       2,
+                                       indice_key="m4",
+                                       bias=False,
+                                       algo=algo),
+            nn.BatchNorm1d(64),
+            nn.ReLU(),
+
+        )
+        max_batch_size = 1
+        # grid (dense map) is used for indice generation. use pre-allocated grid can run faster.
+        # self.grid = None
+        self.shape = shape
+        for n in self.net.modules():
+            if isinstance(n, nn.BatchNorm1d):
+                n.bias.data.uniform_(-0.1, 0.1)
+
+    def forward(self, features, coors, batch_size, vx_num=None):
+        x = spconv.SparseConvTensor(features, coors, self.shape, batch_size, voxel_num=vx_num)
+        return self.net(x)
+
+
+
+class MyTracer(torch.fx.Tracer):
+    def is_leaf_module(self, m: torch.nn.Module, module_qualified_name):
+        is_custom_leaf_module = isinstance(m, SparseConvolution)
+        return super().is_leaf_module(m, module_qualified_name) or is_custom_leaf_module
+
+def main():
+    # run this file with SPCONV_FX_TRACE_MODE=1
+    torch.manual_seed(50051)
+    torch.backends.cuda.matmul.allow_tf32 = False
+    torch.backends.cudnn.allow_tf32 = False
+    with open(Path(__file__).parent / "data" / "test_spconv.pkl", "rb") as f:
+        (voxels, coors, spatial_shape) = pickle.load(f)
+    np.random.seed(50051)
+    device = torch.device("cuda:0")
+    device_cpu = torch.device("cpu:0")
+    dtype = torch.float32
+    net = Net(spatial_shape, ConvAlgo.MaskImplicitGemm).cuda().eval().to(dtype)
+    tracer = MyTracer()
+    graph_trace = tracer.trace(net)
+    net_fused = torch.fx.GraphModule(tracer.root, graph_trace)
+    net_fused = fuse(net_fused)
+    net_fused = fuse_act(net_fused)
+    print(net_fused)
+    voxels_th = torch.from_numpy(voxels).to(device_cpu).to(dtype)
+    coors_th = torch.from_numpy(coors).to(device_cpu).int()
+    voxels_th_cuda = torch.from_numpy(voxels).to(device).to(dtype)
+    coors_th_cuda = torch.from_numpy(coors).to(device).int()
+
+    out_ref = net(voxels_th_cuda, coors_th_cuda, 1)
+    print("-------------fused------------")
+    out_fused = net_fused(voxels_th_cuda, coors_th_cuda, 1)
+    res = Fsp.sparse_add_hash_based(out_ref, out_fused.minus())
+    print(torch.linalg.norm(res.features))
+
+if __name__ == "__main__":
+    main()
\ No newline at end of file
--- a/example/pc_compress.py
+++ b/example/pc_compress.py
+# Copyright 2022 Yan Yan
+# 
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+# 
+#     http://www.apache.org/licenses/LICENSE-2.0
+# 
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+from spconv.core_cc.csrc.utils.pcc import PointCloudCompress
+from pathlib import Path 
+import numpy as np 
+from cumm import tensorview as tv 
+
+def main():
+    data = np.load(Path(__file__).parent.parent / "test" / "data" / "benchmark-pc.npz")
+    pc = np.ascontiguousarray(data["pc"]).astype(np.float32)
+    pc_encoded, order = PointCloudCompress.encode_with_order(tv.from_numpy(pc), tv.Tensor(), 0.01, 0.01, 0.01, PointCloudCompress.EncodeType.XYZ_8, True)
+    print(pc.nbytes, pc_encoded.bytesize())
+    pc_recover = PointCloudCompress.decode(pc_encoded)
+    pc_reorder = pc[order.numpy_view()]
+    error = pc_recover.numpy_view()[:, :3] - pc_reorder[:, :3]
+    print("ERROR", np.abs(error).max())
+
+if __name__ == "__main__":
+    main()
\ No newline at end of file
--- a/setup.py
+++ b/setup.py
@@ -38,9 +38,9 @@ if cuda_ver:
    cuda_ver = cuda_ver.replace(".", "") # 10.2 to 102

    RELEASE_NAME += "-cu{}".format(cuda_ver)
-    deps = ["cumm-cu{}>=0.2.8".format(cuda_ver)]
+    deps = ["cumm-cu{}>=0.3.0".format(cuda_ver)]
 else:
-    deps = ["cumm>=0.2.8"]
+    deps = ["cumm>=0.3.0"]



@@ -156,7 +156,7 @@ if disable_jit is not None and disable_jit == "1":
    from cumm.conv.main import ConvMainUnitTest
    from cumm.constants import CUMM_CPU_ONLY_BUILD
    from spconv.csrc.sparse.all import SpconvOps
-    from spconv.csrc.utils import BoxOps
+    from spconv.csrc.utils import BoxOps, PointCloudCompress
    from spconv.csrc.hash.core import HashTable
    from cumm.common import CompileInfo
    from spconv.csrc.sparse.alloc import ExternalAllocator
@@ -192,7 +192,7 @@ if disable_jit is not None and disable_jit == "1":
        convops.namespace = "csrc.sparse.convops.spops"
    cus = [gemmtuner, convtuner,
        convops, SpconvOps(), BoxOps(), HashTable(), CompileInfo(), 
-        ExternalAllocator(),
+        ExternalAllocator(), PointCloudCompress(),
        ExternalSpconvMatmul(), InferenceOps()]
    if not CUMM_CPU_ONLY_BUILD:
        cus.extend([cu, convcu])

--- a/spconv/build.py
+++ b/spconv/build.py
@@ -31,7 +31,7 @@ if project_is_installed(PACKAGE_NAME) and project_is_editable(

    from spconv.csrc.sparse.all import SpconvOps
    from spconv.csrc.sparse.alloc import ExternalAllocator
-    from spconv.csrc.utils import BoxOps
+    from spconv.csrc.utils import BoxOps, PointCloudCompress
    from spconv.csrc.hash.core import HashTable
    from spconv.csrc.sparse.convops import GemmTunerSimple, ExternalSpconvMatmul
    from spconv.csrc.sparse.convops import ConvTunerSimple, ConvGemmOps
@@ -65,6 +65,7 @@ if project_is_installed(PACKAGE_NAME) and project_is_editable(
        ExternalSpconvMatmul(),
        SimpleExternalSpconvMatmul(), # for debug, won't be included in release
        InferenceOps(),
+        PointCloudCompress(),
    ]
    pccm.builder.build_pybind(cus,
                              PACKAGE_ROOT / "core_cc",

--- a/spconv/core_cc/csrc/utils/pcc.pyi
+++ b/spconv/core_cc/csrc/utils/pcc.pyi
+from typing import overload, Any, Callable, Dict, List, Optional, Set, Tuple, Type, Union
+from pccm.stubs import EnumValue, EnumClassValue
+from cumm.tensorview import Tensor
+class PointCloudCompress:
+    @staticmethod
+    def encode_with_order(points: Tensor, intensity: Tensor, ex: float, ey: float, ez: float, type, with_order: bool = False) -> Tuple[Tensor, Tensor]: 
+        """
+        Args:
+            points: 
+            intensity: 
+            ex: 
+            ey: 
+            ez: 
+            type: 
+            with_order: 
+        """
+        ...
+    @staticmethod
+    def encode_xyzi(points: Tensor, intensity: Tensor, ex: float, ey: float, ez: float) -> Tensor: 
+        """
+        Args:
+            points: 
+            intensity: 
+            ex: 
+            ey: 
+            ez: 
+        """
+        ...
+    @staticmethod
+    def encode_xyz(points: Tensor, ex: float, ey: float, ez: float) -> Tensor: 
+        """
+        Args:
+            points: 
+            ex: 
+            ey: 
+            ez: 
+        """
+        ...
+    @staticmethod
+    def decode(data: Tensor) -> Tensor: 
+        """
+        Args:
+            data: 
+        """
+        ...
+    class EncodeType:
+        XYZ_8 = EnumClassValue(0) # type: EnumClassValue
+        XYZI_8 = EnumClassValue(1) # type: EnumClassValue
+        @staticmethod
+        def __members__() -> Dict[str, EnumClassValue]: ...
--- a/spconv/cppconstants.py
+++ b/spconv/cppconstants.py
@@ -27,4 +27,4 @@ from spconv.core_cc.csrc.utils.boxops import BoxOps
 from spconv.core_cc.cumm.common import CompileInfo
 HAS_BOOST = BoxOps.has_boost()

-COMPILED_CUDA_ARCHS = set(CompileInfo.get_compiled_cuda_arch())
+COMPILED_CUDA_ARCHS = set(CompileInfo.get_compiled_gemm_cuda_arch())
--- a/spconv/csrc/sparse/all.py
+++ b/spconv/csrc/sparse/all.py
@@ -1384,7 +1384,8 @@ class SpconvOps(pccm.Class):
        code.arg("voxels, indices, num_per_voxel, hashdata, point_indice_data, pc_voxel_id",
                 "tv::Tensor")
        code.arg("vsize", f"std::vector<float>")
-        code.arg("grid_size, grid_stride", f"std::vector<int>")
+        code.arg("grid_size", f"std::vector<int>")
+        code.arg("grid_stride", f"std::vector<int64_t>")
        code.arg("coors_range", f"std::vector<float>")

        code.arg("empty_mean", "bool", "false")
@@ -1404,7 +1405,9 @@ class SpconvOps(pccm.Class):
            code.raw(f"""
            if (ndim == {ndim}){{
                std::array<float, {ndim}> vsize_;
-                std::array<int, {ndim}> grid_size_, grid_stride_;
+                std::array<int, {ndim}> grid_size_;
+                std::array<int64_t, {ndim}> grid_stride_;
+
                std::array<float, {ndim * 2}> coors_range_;
                for (int i = 0; i < {ndim}; ++i){{
                    vsize_[i] = vsize[i];

--- a/spconv/csrc/sparse/convops.py
+++ b/spconv/csrc/sparse/convops.py
@@ -550,7 +550,7 @@ class GemmTunerSimple(pccm.ParameterizedClass):
        }}
        auto avail_algos = get_available_algo_str_from_arch(arch);
        std::vector<tv::gemm::GemmAlgoDesp> finally_algos;
-        auto is_arch_compiled = CompileInfo::arch_is_compiled(arch);
+        auto is_arch_compiled = CompileInfo::arch_is_compiled_gemm(arch);
        for (auto algo : avail_algos){{
            static_key_t static_key = std::make_tuple(trans_a, trans_b, trans_c, int(a.dtype()),
                int(b.dtype()), int(c.dtype()), shuffle_type, algo);
@@ -996,7 +996,7 @@ class ConvTunerSimple(pccm.ParameterizedClass):
        use_f32_as_accum = false;

        std::vector<tv::gemm::ConvAlgoDesp> finally_algos;
-        auto is_arch_compiled = CompileInfo::arch_is_compiled(arch);
+        auto is_arch_compiled = CompileInfo::arch_is_compiled_gemm(arch);
        for (auto algo : avail_algos){{
            static_key_t static_key = std::make_tuple(
                layout_i, layout_w, layout_o,
@@ -2048,7 +2048,12 @@ class ConvGemmOps(pccm.ParameterizedClass):
        
        for (int j = 0; j < num_split; ++j){{
            float beta = j == 0 ? 0 : 1;
-
+            if (!bias.empty()){{
+                beta = 1;
+            }}
+            if (j > 0){{
+                bias = tv::Tensor();
+            }}
            conv_tuner.run_with_tuned_result(
                tune_res,
                kForwardInt,

--- a/spconv/csrc/sparse/inference.py
+++ b/spconv/csrc/sparse/inference.py
@@ -13,29 +13,37 @@
 # limitations under the License.

 import pccm
-from cumm.common import TensorView, GemmDTypes, TensorViewKernel, ThrustLib, GemmBasic
+from cumm.common import TensorView, GemmDTypes, TensorViewKernel, TensorViewNVRTC, ThrustLib, GemmBasic
 from spconv.csrc.sparse.cpu_core import OMPLib
+from ..utils.launch import LaunchUtils
 from cumm.constants import CUMM_CPU_ONLY_BUILD

 class InferenceOpsKernel(pccm.ParameterizedClass):
    def __init__(self):
        super().__init__()
-        self.add_dependency(TensorViewKernel, GemmBasic)
+        self.add_dependency(TensorViewKernel, TensorViewNVRTC, GemmBasic, LaunchUtils)

    @pccm.cuda.cuda_global_function
    def bias_add_inplace_kernel(self):
        code = pccm.FunctionCode()
        code.targ("T")
+        code.nontype_targ("OneDim", "bool", "false")

        code.arg("out_features", f"T*")
        code.arg("bias", f"const T*")
        code.arg("size", "int")
        code.arg("num_features", "int")
+        code.arg("num_blocks_x", "int")
+        code.arg("num_blocks_y", "int")

        code.raw(f"""
-        for (int i : tv::KernelLoopY<int>(size)) {{
+        int block_idx_x = OneDim ? blockIdx.x % num_blocks_x : blockIdx.x;
+        int block_idx_y = OneDim ? blockIdx.x / num_blocks_x : blockIdx.y;
+
+
+        for (int i : tv::KernelLoopY<int>(size, block_idx_y, OneDim ? num_blocks_y : gridDim.y)) {{
            auto out_ptr = out_features + i * num_features;
-            for (int j : tv::KernelLoopX<int>(num_features)) {{
+            for (int j : tv::KernelLoopX<int>(num_features, block_idx_x, OneDim ? num_blocks_x : gridDim.x)) {{
                out_ptr[j] = bias[j] + out_ptr[j];
            }}
        }}
@@ -46,6 +54,7 @@ class InferenceOpsKernel(pccm.ParameterizedClass):
    def bias_add_act_inplace_kernel(self):
        code = pccm.FunctionCode()
        code.targ("T")
+        code.nontype_targ("OneDim", "bool", "false")

        code.arg("out_features", f"T*")
        code.arg("bias", f"const T*")
@@ -54,20 +63,35 @@ class InferenceOpsKernel(pccm.ParameterizedClass):
        code.arg("beta", f"T")
        code.arg("size", "int")
        code.arg("num_features", "int")
+        code.arg("num_blocks_x", "int")
+        code.arg("num_blocks_y", "int")

        code.raw(f"""
-        for (int i : tv::KernelLoopY<int>(size)) {{
+        int block_idx_x = OneDim ? blockIdx.x % num_blocks_x : blockIdx.x;
+        int block_idx_y = OneDim ? blockIdx.x / num_blocks_x : blockIdx.y;
+
+        namespace op = tv::arrayops;
+        using nv_scalar_t = tv::equivalent_data_type_t<T>;
+        for (int i : tv::KernelLoopY<int>(size, block_idx_y, OneDim ? num_blocks_y : gridDim.y)) {{
            auto out_ptr = out_features + i * num_features;
-            for (int j : tv::KernelLoopX<int>(num_features)) {{
+            for (int j : tv::KernelLoopX<int>(num_features, block_idx_x, OneDim ? num_blocks_x : gridDim.x)) {{
                T o = out_ptr[j] + bias[j];
+                auto* o_nv = reinterpret_cast<nv_scalar_t*>(&o);
                switch (act_type){{
                    case tv::gemm::Activation::kNone:
                        break;
                    case tv::gemm::Activation::kReLU:{{
                        o = o >= T(0) ? o : T(0);
+                        break;
                    }}
                    case tv::gemm::Activation::kLeakyReLU:{{
                        o = o >= T(0) ? o : o * alpha;
+                        break;
+                    }}
+                    case tv::gemm::Activation::kSigmoid:{{
+                        auto e = op::MathScalarOp<nv_scalar_t>::exp(-*o_nv);
+                        o = T(1) / (T(1) + *reinterpret_cast<T*>( &e ));
+                        break;
                    }}
                    default: ;
                }}
@@ -89,16 +113,27 @@ class InferenceOpsKernel(pccm.ParameterizedClass):
        code.arg("size", "int")

        code.raw(f"""
+        namespace op = tv::arrayops;
+        using nv_scalar_t = tv::equivalent_data_type_t<T>;
        for (int i : tv::KernelLoopX<int>(size)) {{
            T o = out_features[i];
+            auto* o_nv = reinterpret_cast<nv_scalar_t*>(&o);
+
            switch (act_type){{
                case tv::gemm::Activation::kNone:
                    break;
                case tv::gemm::Activation::kReLU:{{
                    out_features[i] = o >= T(0) ? o : T(0);
+                    break;
                }}
                case tv::gemm::Activation::kLeakyReLU:{{
                    out_features[i] = o >= T(0) ? o : o * alpha;
+                    break;
+                }}
+                case tv::gemm::Activation::kSigmoid:{{
+                    auto e = op::MathScalarOp<nv_scalar_t>::exp(-*o_nv);
+                    out_features[i] = T(1) / (T(1) + *reinterpret_cast<T*>( &e ));
+                    break;
                }}
                default: ;
            }}
@@ -110,9 +145,10 @@ class InferenceOpsKernel(pccm.ParameterizedClass):
 class InferenceOps(pccm.Class):
    def __init__(self):
        super().__init__()
-        self.add_dependency(TensorView)
+        self.add_dependency(TensorView, LaunchUtils)
        self.kernel = InferenceOpsKernel()
        self.add_include("tensorview/gemm/core/constants.h")
+        self.add_static_const("kMaxGridYZDim", "int", "65535")

    if CUMM_CPU_ONLY_BUILD:
        _DECORATOR = pccm.static_function
@@ -139,34 +175,30 @@ class InferenceOps(pccm.Class):
        auto nhot = out.dim(0);
        auto cudastream = reinterpret_cast<cudaStream_t>(stream);
        TV_ASSERT_RT_ERR(bias.dim(0) == out.dim(1), "error");
-        tv::dispatch<float, double, tv::half_t, tv::bfloat16_t>(out.dtype(), [&](auto I){{
+        tv::dispatch<float, tv::half_t, tv::bfloat16_t>(out.dtype(), [&](auto I){{
            using T = TV_DECLTYPE(I);
-            constexpr int MaxThreads = 512;
-            tv::cuda::Launch launcher(1);
-            bool found = tv::dispatch_int_noexcept<512, 256, 128, 64, 32, 16>(out.dim(1), [](int my, int expect){{return my >= expect;}}, [&](auto V){{
-                // if out.dim(1) > value in list above, run this function.
-                // if a value is found, other value won't be executed.
-                int NumFeatures = TV_DECLTYPE(V)::value;
-                int Num0 = MaxThreads / NumFeatures;
-                dim3 blocks(tv::div_up(out.dim(1), int64_t(NumFeatures)), tv::div_up(nhot, int64_t(Num0)));
-                dim3 threads(NumFeatures, Num0);
-                launcher = tv::cuda::Launch(blocks, threads, cudastream);
-            }});
-            if (!found){{
-                int NumFeatures = 16;
-                int Num0 = MaxThreads / NumFeatures;
-                dim3 blocks(tv::div_up(out.dim(1), int64_t(NumFeatures)), tv::div_up(nhot, int64_t(Num0)));
-                dim3 threads(NumFeatures, Num0);
-                launcher = tv::cuda::Launch(blocks, threads, cudastream);
+            auto launchdims = LaunchUtils::get_blocks_threads_of_2d_tensor(nhot, out.dim(1));
+            int num_blocks_X = std::get<0>(launchdims);
+            int num_blocks_Y = std::get<1>(launchdims);
+            dim3 blocks;
+            dim3 threads(std::get<2>(launchdims), std::get<3>(launchdims));
+            if (num_blocks_Y > kMaxGridYZDim){{
+                blocks = dim3(num_blocks_X * num_blocks_Y);
+            }}else{{
+                blocks = dim3(num_blocks_X, num_blocks_Y);
            }}
+            tv::cuda::Launch launcher = tv::cuda::Launch(blocks, threads, cudastream);
+            tv::dispatch_int<0, 1>(int(num_blocks_Y > kMaxGridYZDim), [&](auto I2){{
+                constexpr bool OneDim = TV_DECLTYPE(I2)::value == 1;
                if (act_type == tv::gemm::Activation::kNone){{
-                launcher(ker::bias_add_inplace_kernel<T>, out.data_ptr<T>(), bias.data_ptr<const T>(),
-                    nhot, out.dim(1));
+                    launcher(ker::bias_add_inplace_kernel<T, OneDim>, out.data_ptr<T>(), bias.data_ptr<const T>(),
+                        nhot, out.dim(1), num_blocks_X, num_blocks_Y);
                }}else{{
-                launcher(ker::bias_add_act_inplace_kernel<T>, out.data_ptr<T>(), bias.data_ptr<const T>(),
-                    act_type, T(alpha), T(beta), nhot, out.dim(1));
+                    launcher(ker::bias_add_act_inplace_kernel<T, OneDim>, out.data_ptr<T>(), bias.data_ptr<const T>(),
+                        act_type, T(alpha), T(beta), nhot, out.dim(1), num_blocks_X, num_blocks_Y);
                }}
-
+            }});
+            TV_CHECK_CUDA_ERR_V2("bias add act failed!!!");
        }});
        """)
        return code
@@ -204,10 +236,11 @@ class InferenceOps(pccm.Class):
        auto nhot = out.size();
        auto cudastream = reinterpret_cast<cudaStream_t>(stream);
        tv::cuda::Launch launcher = tv::cuda::Launch(nhot, cudastream);
-        tv::dispatch<float, double, tv::half_t, tv::bfloat16_t>(out.dtype(), [&](auto I){{
+        tv::dispatch<float, tv::half_t, tv::bfloat16_t>(out.dtype(), [&](auto I){{
            using T = TV_DECLTYPE(I);
            launcher(ker::activation_inplace_kernel<T>, out.data_ptr<T>(), act_type, T(alpha), T(beta),
                nhot);
+            TV_CHECK_CUDA_ERR_V2("bias add act failed!!!");
        }});
        """)
        return code
--- a/spconv/csrc/sparse/maxpool.py
+++ b/spconv/csrc/sparse/maxpool.py
@@ -26,19 +26,22 @@ import numpy as np
 from cumm.gemm import (thread_map)
 from spconv.csrc.sparse.cpu_core import OMPLib
 from cumm.constants import CUMM_CPU_ONLY_BUILD
-
+from ..utils.launch import LaunchUtils

 class IndiceMaxPool(pccm.Class):
    # TODO optimize this function
    def __init__(self):
        super().__init__()
        self.add_include("limits")
-        self.add_dependency(TensorViewKernel, TensorView, GemmBasic)
+        self.add_dependency(TensorViewKernel, TensorView, GemmBasic, LaunchUtils)
+        self.add_static_const("kMaxGridYZDim", "int", "65535")
+

    @pccm.cuda.cuda_global_function
    def forward_kernel(self):
        code = pccm.FunctionCode()
        code.targ("T")
+        code.nontype_targ("OneDim", "bool", "false")

        code.arg("out_features", f"T*")
        code.arg("in_features", f"const T*")
@@ -46,14 +49,19 @@ class IndiceMaxPool(pccm.Class):
        code.arg("in_indices", "const int*")
        code.arg("size", "int")
        code.arg("num_features", "int")
+        code.arg("num_blocks_x", "int")
+        code.arg("num_blocks_y", "int")

        code.raw(f"""
-        for (int i : tv::KernelLoopY<int>(size)) {{
+        int block_idx_x = OneDim ? blockIdx.x % num_blocks_x : blockIdx.x;
+        int block_idx_y = OneDim ? blockIdx.x / num_blocks_x : blockIdx.y;
+
+        for (int i : tv::KernelLoopY<int>(size, block_idx_y, OneDim ? num_blocks_y : gridDim.y)) {{
            int in_idx = in_indices[i];
            int out_idx = out_indices[i];
            auto in_ptr = in_features + in_idx * num_features;
            auto out_ptr = out_features + out_idx * num_features;
-            for (int j : tv::KernelLoopX<int>(num_features)) {{
+            for (int j : tv::KernelLoopX<int>(num_features, block_idx_x, OneDim ? num_blocks_x : gridDim.x)) {{
                auto in = in_ptr[j];
                auto out = out_ptr[j];
                if (in > out){{
@@ -68,6 +76,7 @@ class IndiceMaxPool(pccm.Class):
    def forward_implicit_gemm_kernel(self):
        code = pccm.FunctionCode()
        code.targ("T")
+        code.nontype_targ("OneDim", "bool", "false")

        code.arg("out_features", f"T*")
        code.arg("in_features", f"const T*")
@@ -76,11 +85,16 @@ class IndiceMaxPool(pccm.Class):
        code.arg("RS", "int")
        code.arg("num_indices", "int")
        code.arg("lowest", "T")
+        code.arg("num_blocks_x", "int")
+        code.arg("num_blocks_y", "int")

        code.raw(f"""
-        for (int i : tv::KernelLoopY<int>(num_indices)) {{
+        int block_idx_x = OneDim ? blockIdx.x % num_blocks_x : blockIdx.x;
+        int block_idx_y = OneDim ? blockIdx.x / num_blocks_x : blockIdx.y;
+
+        for (int i : tv::KernelLoopY<int>(num_indices, block_idx_y, OneDim ? num_blocks_y : gridDim.y)) {{
            auto out_ptr = out_features + i * num_features;
-            for (int j : tv::KernelLoopX<int>(num_features)) {{
+            for (int j : tv::KernelLoopX<int>(num_features, block_idx_x, OneDim ? num_blocks_x : gridDim.x)) {{
                auto indices_ptr = indices + i;
                int in_idx = indices_ptr[0];
                T in, in_temp;
@@ -106,6 +120,8 @@ class IndiceMaxPool(pccm.Class):
    def backward_kernel(self):
        code = pccm.FunctionCode()
        code.targ("T")
+        code.nontype_targ("OneDim", "bool", "false")
+
        code.arg("out_features", f"const T*")
        code.arg("in_features", f"const T*")
        code.arg("dout_features", f"const T*")
@@ -115,15 +131,20 @@ class IndiceMaxPool(pccm.Class):
        code.arg("size", "int")
        code.arg("num_features", "int")

+        code.arg("num_blocks_x", "int")
+        code.arg("num_blocks_y", "int")
+
        code.raw(f"""
-        for (int i : tv::KernelLoopY<int>(size)) {{
+        int block_idx_x = OneDim ? blockIdx.x % num_blocks_x : blockIdx.x;
+        int block_idx_y = OneDim ? blockIdx.x / num_blocks_x : blockIdx.y;
+        for (int i : tv::KernelLoopY<int>(size, block_idx_y, OneDim ? num_blocks_y : gridDim.y)) {{
            int in_idx_offset = in_indices[i] * num_features;
            int out_idx_offset = out_indices[i] * num_features;
            auto in_ptr = in_features + in_idx_offset;
            auto out_ptr = out_features + out_idx_offset;
            auto din_ptr = din_features + in_idx_offset;
            auto dout_ptr = dout_features + out_idx_offset;
-            for (int j : tv::KernelLoopX<int>(num_features)) {{
+            for (int j : tv::KernelLoopX<int>(num_features, block_idx_x, OneDim ? num_blocks_x : gridDim.x)) {{
                auto in = in_ptr[j];
                auto out = out_ptr[j];
                if (in == out){{
@@ -138,6 +159,7 @@ class IndiceMaxPool(pccm.Class):
    def backward_implicit_gemm_kernel(self):
        code = pccm.FunctionCode()
        code.targ("T")
+        code.nontype_targ("OneDim", "bool", "false")

        code.arg("out_features", f"const T*")
        code.arg("in_features", f"const T*")
@@ -148,13 +170,18 @@ class IndiceMaxPool(pccm.Class):
        code.arg("RS", "int")
        code.arg("num_indices", "int")

+        code.arg("num_blocks_x", "int")
+        code.arg("num_blocks_y", "int")
+
        code.raw(f"""
+        int block_idx_x = OneDim ? blockIdx.x % num_blocks_x : blockIdx.x;
+        int block_idx_y = OneDim ? blockIdx.x / num_blocks_x : blockIdx.y;

-        for (int i : tv::KernelLoopY<int>(num_indices)) {{
+        for (int i : tv::KernelLoopY<int>(num_indices, block_idx_y, OneDim ? num_blocks_y : gridDim.y)) {{
            auto in_ptr = in_features + i * num_features;
            auto din_ptr = din_features + i * num_features;
            
-            for (int j : tv::KernelLoopX<int>(num_features)) {{
+            for (int j : tv::KernelLoopX<int>(num_features, block_idx_x, OneDim ? num_blocks_x : gridDim.x)) {{
                auto indices_ptr = indices_bwd + i;
                int out_idx = indices_ptr[0];
                T in = in_ptr[j];
@@ -184,6 +211,7 @@ class IndiceMaxPool(pccm.Class):
    def forward_avgpool_implicit_gemm_kernel(self):
        code = pccm.FunctionCode()
        code.targ("T")
+        code.nontype_targ("OneDim", "bool", "false")

        code.arg("out_features", f"T*")
        code.arg("in_features", f"const T*")
@@ -193,8 +221,13 @@ class IndiceMaxPool(pccm.Class):
        code.arg("RS", "int")
        code.arg("num_indices", "int")

+        code.arg("num_blocks_x", "int")
+        code.arg("num_blocks_y", "int")
+
        code.raw(f"""
-        for (int i : tv::KernelLoopY<int>(num_indices)) {{
+        int block_idx_x = OneDim ? blockIdx.x % num_blocks_x : blockIdx.x;
+        int block_idx_y = OneDim ? blockIdx.x / num_blocks_x : blockIdx.y;
+        for (int i : tv::KernelLoopY<int>(num_indices, block_idx_y, OneDim ? num_blocks_y : gridDim.y)) {{
            auto out_ptr = out_features + i * num_features;
            auto indices_ptr = indices + i;
            int in_idx = 0;
@@ -207,7 +240,7 @@ class IndiceMaxPool(pccm.Class):
            if (count_out != nullptr){{
                count_out[i] = count;
            }}
-            for (int j : tv::KernelLoopX<int>(num_features)) {{
+            for (int j : tv::KernelLoopX<int>(num_features, block_idx_x, OneDim ? num_blocks_x : gridDim.x)) {{
                indices_ptr = indices + i;
                int in_idx;
                T in, in_temp;
@@ -229,6 +262,8 @@ class IndiceMaxPool(pccm.Class):
    def backward_avgpool_implicit_gemm_kernel(self):
        code = pccm.FunctionCode()
        code.targ("T")
+        code.nontype_targ("OneDim", "bool", "false")
+
        code.arg("dout_features", f"const T*")
        code.arg("din_features", f"T*")
        code.arg("indices_bwd", "const int*")
@@ -237,11 +272,16 @@ class IndiceMaxPool(pccm.Class):
        code.arg("RS", "int")
        code.arg("num_indices", "int")

+        code.arg("num_blocks_x", "int")
+        code.arg("num_blocks_y", "int")
+
        code.raw(f"""
+        int block_idx_x = OneDim ? blockIdx.x % num_blocks_x : blockIdx.x;
+        int block_idx_y = OneDim ? blockIdx.x / num_blocks_x : blockIdx.y;

-        for (int i : tv::KernelLoopY<int>(num_indices)) {{
+        for (int i : tv::KernelLoopY<int>(num_indices, block_idx_y, OneDim ? num_blocks_y : gridDim.y)) {{
            auto din_ptr = din_features + i * num_features;
-            for (int j : tv::KernelLoopX<int>(num_features)) {{
+            for (int j : tv::KernelLoopX<int>(num_features, block_idx_x, OneDim ? num_blocks_x : gridDim.x)) {{
                auto indices_ptr = indices_bwd + i;
                int out_idx = 0;
                T sum_val = T(0);
@@ -273,27 +313,24 @@ class IndiceMaxPool(pccm.Class):
        auto cudastream = reinterpret_cast<cudaStream_t>(stream);
        tv::dispatch<float, double, tv::half_t, tv::bfloat16_t>(out.dtype(), [&](auto I){{
            using T = TV_DECLTYPE(I);
-            constexpr int MaxThreads = 512;
-            tv::cuda::Launch launcher(1);
-            bool found = tv::dispatch_int_noexcept<512, 256, 128, 64, 32, 16>(out.dim(1), [](int my, int expect){{return my >= expect;}}, [&](auto V){{
-                // if out.dim(1) > value in list above, run this function.
-                // if a value is found, other value won't be executed.
-                int NumFeatures = TV_DECLTYPE(V)::value;
-                int Num0 = MaxThreads / NumFeatures;
-                dim3 blocks(tv::div_up(out.dim(1), int64_t(NumFeatures)), tv::div_up(nhot, int64_t(Num0)));
-                dim3 threads(NumFeatures, Num0);
-                launcher = tv::cuda::Launch(blocks, threads, cudastream);
+            auto launchdims = LaunchUtils::get_blocks_threads_of_2d_tensor(nhot, out.dim(1));
+            int num_blocks_X = std::get<0>(launchdims);
+            int num_blocks_Y = std::get<1>(launchdims);
+            dim3 blocks;
+            dim3 threads(std::get<2>(launchdims), std::get<3>(launchdims));
+            if (num_blocks_Y > kMaxGridYZDim){{
+                blocks = dim3(num_blocks_X * num_blocks_Y);
+            }}else{{
+                blocks = dim3(num_blocks_X, num_blocks_Y);
+            }}
+            tv::cuda::Launch launcher = tv::cuda::Launch(blocks, threads, cudastream);
+            tv::dispatch_int<0, 1>(int(num_blocks_Y > kMaxGridYZDim), [&](auto I2){{
+                constexpr bool OneDim = TV_DECLTYPE(I2)::value == 1;
+                launcher(forward_kernel<T, OneDim>, out.data_ptr<T>(), in.data_ptr<const T>(),
+                    out_inds.data_ptr<const int>(), in_inds.data_ptr<const int>(), nhot, out.dim(1),
+                    num_blocks_X, num_blocks_Y);
            }});
-            if (!found){{
-                int NumFeatures = 16;
-                int Num0 = MaxThreads / NumFeatures;
-                dim3 blocks(tv::div_up(out.dim(1), int64_t(NumFeatures)), tv::div_up(nhot, int64_t(Num0)));
-                dim3 threads(NumFeatures, Num0);
-                launcher = tv::cuda::Launch(blocks, threads, cudastream);
-            }}
-            launcher(forward_kernel<T>, out.data_ptr<T>(), in.data_ptr<const T>(),
-                out_inds.data_ptr<const int>(), in_inds.data_ptr<const int>(), nhot, out.dim(1));
-
+            TV_CHECK_CUDA_ERR_V2("max pool fwd failed!!!");
        }});
        """)
        return code
@@ -315,28 +352,28 @@ class IndiceMaxPool(pccm.Class):
        auto cudastream = reinterpret_cast<cudaStream_t>(stream);
        tv::dispatch<float, double, tv::half_t, tv::bfloat16_t>(out.dtype(), [&](auto I){{
            using T = TV_DECLTYPE(I);
-            constexpr int MaxThreads = 512;
-            tv::cuda::Launch launcher(1);
-            bool found = tv::dispatch_int_noexcept<512, 256, 128, 64, 32, 16>(out.dim(1), [](int my, int expect){{return my >= expect;}}, [&](auto V){{
-                // if out.dim(1) > value in list above, run this function.
-                // if a value is found, other value won't be executed.
-                int NumFeatures = TV_DECLTYPE(V)::value;
-                int Num0 = MaxThreads / NumFeatures;
-                dim3 blocks(tv::div_up(out.dim(1), int64_t(NumFeatures)), tv::div_up(nhot, int64_t(Num0)));
-                dim3 threads(NumFeatures, Num0);
-                launcher = tv::cuda::Launch(blocks, threads, cudastream);
-            }});
-            if (!found){{
-                int NumFeatures = 16;
-                int Num0 = MaxThreads / NumFeatures;
-                dim3 blocks(tv::div_up(out.dim(1), int64_t(NumFeatures)), tv::div_up(nhot, int64_t(Num0)));
-                dim3 threads(NumFeatures, Num0);
-                launcher = tv::cuda::Launch(blocks, threads, cudastream);
-            }}
+            auto launchdims = LaunchUtils::get_blocks_threads_of_2d_tensor(nhot, out.dim(1));
+            int num_blocks_X = std::get<0>(launchdims);
+            int num_blocks_Y = std::get<1>(launchdims);
+            dim3 blocks;
+            dim3 threads(std::get<2>(launchdims), std::get<3>(launchdims));
+            if (num_blocks_Y > kMaxGridYZDim){{
+                blocks = dim3(num_blocks_X * num_blocks_Y);
+            }}else{{
+                blocks = dim3(num_blocks_X, num_blocks_Y);
+            }}
+            tv::cuda::Launch launcher = tv::cuda::Launch(blocks, threads, cudastream);
            T lowest = std::numeric_limits<T>::lowest();
            lowest = T(0);
-            launcher(forward_implicit_gemm_kernel<T>, out.data_ptr<T>(), in.data_ptr<const T>(),
-                inds.data_ptr<const int>(), out.dim(1), inds.dim(0), inds.dim(1), lowest);
+            tv::dispatch_int<0, 1>(int(num_blocks_Y > kMaxGridYZDim), [&](auto I2){{
+                constexpr bool OneDim = TV_DECLTYPE(I2)::value == 1;
+                launcher(forward_implicit_gemm_kernel<T, OneDim>, out.data_ptr<T>(), in.data_ptr<const T>(),
+                    inds.data_ptr<const int>(), out.dim(1), inds.dim(0), inds.dim(1), lowest, 
+                    num_blocks_X, num_blocks_Y);
+            }});
+
+            TV_CHECK_CUDA_ERR_V2("max pool fwd failed!!!");
+
        }});
        """)
        return code
@@ -358,27 +395,26 @@ class IndiceMaxPool(pccm.Class):
        auto cudastream = reinterpret_cast<cudaStream_t>(stream);
        tv::dispatch<float, double, tv::half_t, tv::bfloat16_t>(out.dtype(), [&](auto I){{
            using T = TV_DECLTYPE(I);
-            constexpr int MaxThreads = 512;
-            tv::cuda::Launch launcher(1);
-            bool found = tv::dispatch_int_noexcept<512, 256, 128, 64, 32, 16>(out.dim(1), [](int my, int expect){{return my >= expect;}}, [&](auto V){{
-                // if out.dim(1) > value in list above, run this function.
-                // if a value is found, other value won't be executed.
-                int NumFeatures = TV_DECLTYPE(V)::value;
-                int Num0 = MaxThreads / NumFeatures;
-                dim3 blocks(tv::div_up(out.dim(1), int64_t(NumFeatures)), tv::div_up(nhot, int64_t(Num0)));
-                dim3 threads(NumFeatures, Num0);
-                launcher = tv::cuda::Launch(blocks, threads, cudastream);
-            }});
-            if (!found){{
-                int NumFeatures = 16;
-                int Num0 = MaxThreads / NumFeatures;
-                dim3 blocks(tv::div_up(out.dim(1), int64_t(NumFeatures)), tv::div_up(nhot, int64_t(Num0)));
-                dim3 threads(NumFeatures, Num0);
-                launcher = tv::cuda::Launch(blocks, threads, cudastream);
-            }}
-            launcher(backward_kernel<T>, out.data_ptr<const T>(), in.data_ptr<const T>(),
+            auto launchdims = LaunchUtils::get_blocks_threads_of_2d_tensor(nhot, out.dim(1));
+            int num_blocks_X = std::get<0>(launchdims);
+            int num_blocks_Y = std::get<1>(launchdims);
+            dim3 blocks;
+            dim3 threads(std::get<2>(launchdims), std::get<3>(launchdims));
+            if (num_blocks_Y > kMaxGridYZDim){{
+                blocks = dim3(num_blocks_X * num_blocks_Y);
+            }}else{{
+                blocks = dim3(num_blocks_X, num_blocks_Y);
+            }}
+            tv::cuda::Launch launcher = tv::cuda::Launch(blocks, threads, cudastream);
+            tv::dispatch_int<0, 1>(int(num_blocks_Y > kMaxGridYZDim), [&](auto I2){{
+                constexpr bool OneDim = TV_DECLTYPE(I2)::value == 1;
+                launcher(backward_kernel<T, OneDim>, out.data_ptr<const T>(), in.data_ptr<const T>(),
                    dout.data_ptr<const T>(), din.data_ptr<T>(),
-                out_inds.data_ptr<const int>(), in_inds.data_ptr<const int>(), nhot, out.dim(1));
+                    out_inds.data_ptr<const int>(), in_inds.data_ptr<const int>(), nhot, out.dim(1),
+                    num_blocks_X, num_blocks_Y);
+
+            }});
+            TV_CHECK_CUDA_ERR_V2("max pool backward failed!!!");
        }});
        """)
        return code
@@ -402,27 +438,25 @@ class IndiceMaxPool(pccm.Class):
        auto cudastream = reinterpret_cast<cudaStream_t>(stream);
        tv::dispatch<float, double, tv::half_t, tv::bfloat16_t>(out.dtype(), [&](auto I){{
            using T = TV_DECLTYPE(I);
-            constexpr int MaxThreads = 512;
-            tv::cuda::Launch launcher(1);
-            bool found = tv::dispatch_int_noexcept<512, 256, 128, 64, 32, 16>(out.dim(1), [](int my, int expect){{return my >= expect;}}, [&](auto V){{
-                // if out.dim(1) > value in list above, run this function.
-                // if a value is found, other value won't be executed.
-                int NumFeatures = TV_DECLTYPE(V)::value;
-                int Num0 = MaxThreads / NumFeatures;
-                dim3 blocks(tv::div_up(out.dim(1), int64_t(NumFeatures)), tv::div_up(nhot, int64_t(Num0)));
-                dim3 threads(NumFeatures, Num0);
-                launcher = tv::cuda::Launch(blocks, threads, cudastream);
-            }});
-            if (!found){{
-                int NumFeatures = 16;
-                int Num0 = MaxThreads / NumFeatures;
-                dim3 blocks(tv::div_up(out.dim(1), int64_t(NumFeatures)), tv::div_up(nhot, int64_t(Num0)));
-                dim3 threads(NumFeatures, Num0);
-                launcher = tv::cuda::Launch(blocks, threads, cudastream);
-            }}
-            launcher(backward_implicit_gemm_kernel<T>, out.data_ptr<const T>(), in.data_ptr<const T>(),
+            auto launchdims = LaunchUtils::get_blocks_threads_of_2d_tensor(nhot, out.dim(1));
+            int num_blocks_X = std::get<0>(launchdims);
+            int num_blocks_Y = std::get<1>(launchdims);
+            dim3 blocks;
+            dim3 threads(std::get<2>(launchdims), std::get<3>(launchdims));
+            if (num_blocks_Y > kMaxGridYZDim){{
+                blocks = dim3(num_blocks_X * num_blocks_Y);
+            }}else{{
+                blocks = dim3(num_blocks_X, num_blocks_Y);
+            }}
+            tv::cuda::Launch launcher = tv::cuda::Launch(blocks, threads, cudastream);
+            tv::dispatch_int<0, 1>(int(num_blocks_Y > kMaxGridYZDim), [&](auto I2){{
+                constexpr bool OneDim = TV_DECLTYPE(I2)::value == 1;
+                launcher(backward_implicit_gemm_kernel<T, OneDim>, out.data_ptr<const T>(), in.data_ptr<const T>(),
                    dout.data_ptr<const T>(), din.data_ptr<T>(),
-                inds.data_ptr<const int>(), out.dim(1), inds.dim(0), inds.dim(1));
+                    inds.data_ptr<const int>(), out.dim(1), inds.dim(0), inds.dim(1),
+                    num_blocks_X, num_blocks_Y);
+            }});
+            TV_CHECK_CUDA_ERR_V2("max pool fwd failed!!!");
        }});
        """)
        return code
@@ -446,26 +480,26 @@ class IndiceMaxPool(pccm.Class):
        auto cudastream = reinterpret_cast<cudaStream_t>(stream);
        tv::dispatch<float, double, tv::half_t, tv::bfloat16_t>(out.dtype(), [&](auto I){{
            using T = TV_DECLTYPE(I);
-            constexpr int MaxThreads = 512;
-            tv::cuda::Launch launcher(1);
-            bool found = tv::dispatch_int_noexcept<512, 256, 128, 64, 32, 16>(out.dim(1), [](int my, int expect){{return my >= expect;}}, [&](auto V){{
-                // if out.dim(1) > value in list above, run this function.
-                // if a value is found, other value won't be executed.
-                int NumFeatures = TV_DECLTYPE(V)::value;
-                int Num0 = MaxThreads / NumFeatures;
-                dim3 blocks(tv::div_up(out.dim(1), int64_t(NumFeatures)), tv::div_up(nhot, int64_t(Num0)));
-                dim3 threads(NumFeatures, Num0);
-                launcher = tv::cuda::Launch(blocks, threads, cudastream);
+            auto launchdims = LaunchUtils::get_blocks_threads_of_2d_tensor(nhot, out.dim(1));
+            int num_blocks_X = std::get<0>(launchdims);
+            int num_blocks_Y = std::get<1>(launchdims);
+            dim3 blocks;
+            dim3 threads(std::get<2>(launchdims), std::get<3>(launchdims));
+            if (num_blocks_Y > kMaxGridYZDim){{
+                blocks = dim3(num_blocks_X * num_blocks_Y);
+            }}else{{
+                blocks = dim3(num_blocks_X, num_blocks_Y);
+            }}
+            tv::cuda::Launch launcher = tv::cuda::Launch(blocks, threads, cudastream);
+            tv::dispatch_int<0, 1>(int(num_blocks_Y > kMaxGridYZDim), [&](auto I2){{
+                constexpr bool OneDim = TV_DECLTYPE(I2)::value == 1;
+                launcher(forward_avgpool_implicit_gemm_kernel<T, OneDim>, out.data_ptr<T>(), in.data_ptr<const T>(),
+                    inds.data_ptr<const int>(), count_out.data_ptr<int>(), out.dim(1), inds.dim(0), inds.dim(1),
+                    num_blocks_X, num_blocks_Y);
+
            }});
-            if (!found){{
-                int NumFeatures = 16;
-                int Num0 = MaxThreads / NumFeatures;
-                dim3 blocks(tv::div_up(out.dim(1), int64_t(NumFeatures)), tv::div_up(nhot, int64_t(Num0)));
-                dim3 threads(NumFeatures, Num0);
-                launcher = tv::cuda::Launch(blocks, threads, cudastream);
-            }}
-            launcher(forward_avgpool_implicit_gemm_kernel<T>, out.data_ptr<T>(), in.data_ptr<const T>(),
-                inds.data_ptr<const int>(), count_out.data_ptr<int>(), out.dim(1), inds.dim(0), inds.dim(1));
+            TV_CHECK_CUDA_ERR_V2("avg pool fwd failed!!!");
+
        }});
        """)
        return code
@@ -484,32 +518,31 @@ class IndiceMaxPool(pccm.Class):
        TV_ASSERT_RT_ERR(!count_out.empty(), "count out must not empty")
        tv::check_shape(inds, {{-1, nhot}});
        tv::check_shape(din, {{-1, dout.dim(1)}});
-        int num_act_out = dout.dim(1);
        auto cudastream = reinterpret_cast<cudaStream_t>(stream);
        tv::dispatch<float, double, tv::half_t, tv::bfloat16_t>(dout.dtype(), [&](auto I){{
            using T = TV_DECLTYPE(I);
-            constexpr int MaxThreads = 512;
-            tv::cuda::Launch launcher(1);
-            bool found = tv::dispatch_int_noexcept<512, 256, 128, 64, 32, 16>(dout.dim(1), [](int my, int expect){{return my >= expect;}}, [&](auto V){{
-                // if out.dim(1) > value in list above, run this function.
-                // if a value is found, other value won't be executed.
-                int NumFeatures = TV_DECLTYPE(V)::value;
-                int Num0 = MaxThreads / NumFeatures;
-                dim3 blocks(tv::div_up(dout.dim(1), int64_t(NumFeatures)), tv::div_up(nhot, int64_t(Num0)));
-                dim3 threads(NumFeatures, Num0);
-                launcher = tv::cuda::Launch(blocks, threads, cudastream);
-            }});
-            if (!found){{
-                int NumFeatures = 16;
-                int Num0 = MaxThreads / NumFeatures;
-                dim3 blocks(tv::div_up(dout.dim(1), int64_t(NumFeatures)), tv::div_up(nhot, int64_t(Num0)));
-                dim3 threads(NumFeatures, Num0);
-                launcher = tv::cuda::Launch(blocks, threads, cudastream);
-            }}
-            launcher(backward_avgpool_implicit_gemm_kernel<T>, 
+            auto launchdims = LaunchUtils::get_blocks_threads_of_2d_tensor(nhot, dout.dim(1));
+            int num_blocks_X = std::get<0>(launchdims);
+            int num_blocks_Y = std::get<1>(launchdims);
+            dim3 blocks;
+            dim3 threads(std::get<2>(launchdims), std::get<3>(launchdims));
+            if (num_blocks_Y > kMaxGridYZDim){{
+                blocks = dim3(num_blocks_X * num_blocks_Y);
+            }}else{{
+                blocks = dim3(num_blocks_X, num_blocks_Y);
+            }}
+            tv::cuda::Launch launcher = tv::cuda::Launch(blocks, threads, cudastream);
+            tv::dispatch_int<0, 1>(int(num_blocks_Y > kMaxGridYZDim), [&](auto I2){{
+                constexpr bool OneDim = TV_DECLTYPE(I2)::value == 1;
+
+                launcher(backward_avgpool_implicit_gemm_kernel<T, OneDim>, 
                    dout.data_ptr<const T>(), din.data_ptr<T>(),
                    inds.data_ptr<const int>(), count_out.data_ptr<const int>(),
-                dout.dim(1), inds.dim(0), inds.dim(1));
+                    dout.dim(1), inds.dim(0), inds.dim(1),
+                    num_blocks_X, num_blocks_Y);
+            }});
+            TV_CHECK_CUDA_ERR_V2("avg pool bwd failed!!!");
+
        }});
        """)
        return code

--- a/spconv/csrc/sparse/pointops.py
+++ b/spconv/csrc/sparse/pointops.py
@@ -32,9 +32,11 @@ class Point2VoxelCommon(pccm.ParameterizedClass):
        self.ndim = ndim
        self.zyx = zyx
        ret_str = f"std::array<int, {self.ndim}>"
+        ret64_str = f"std::array<int64_t, {self.ndim}>"
+
        retf_str = f"std::array<float, {self.ndim}>"
        retf2_str = f"std::array<float, {self.ndim * 2}>"
-        self.calc_meta_ret = f"std::tuple<{retf_str}, {ret_str}, {ret_str}, {retf2_str}>"
+        self.calc_meta_ret = f"std::tuple<{retf_str}, {ret_str}, {ret64_str}, {retf2_str}>"

    @pccm.static_function
    def calc_meta_data(self):
@@ -43,8 +45,8 @@ class Point2VoxelCommon(pccm.ParameterizedClass):
        code.arg("coors_range_xyz", f"std::array<float, {self.ndim * 2}>")
        code.raw(f"""
        std::array<float, {self.ndim}> vsize;
-        std::array<int, {self.ndim}> grid_size, grid_stride;
-
+        std::array<int, {self.ndim}> grid_size;
+        std::array<int64_t, {self.ndim}> grid_stride;
        std::array<float, {self.ndim * 2}> coors_range;

        """)
@@ -79,9 +81,10 @@ class Point2VoxelCommon(pccm.ParameterizedClass):
        ret_str = f"std::array<int, {self.ndim}>"
        retf_str = f"std::array<float, {self.ndim}>"
        retf2_str = f"std::array<float, {self.ndim * 2}>"
+        ret64_str = f"std::array<int64_t, {self.ndim}>"

        return code.ret(
-            f"std::tuple<{retf_str}, {ret_str}, {ret_str}, {retf2_str}>")
+            f"std::tuple<{retf_str}, {ret_str}, {ret64_str}, {retf2_str}>")

    @pccm.static_function
    def array2tvarray(self):
@@ -143,7 +146,7 @@ class Point2VoxelKernel(pccm.ParameterizedClass, pccm.pybind.PybindClassMixin):
        code.arg("vsize", f"tv::array<float, {self.ndim}>")
        code.arg("coors_range", f"tv::array<float, {self.ndim * 2}>")
        code.arg("grid_bound", f"tv::array<int, {self.ndim}>")
-        code.arg("grid_stride", f"tv::array<int, {self.ndim}>")
+        code.arg("grid_stride", f"tv::array<int64_t, {self.ndim}>")

        code.arg("num_points", f"int")
        point_xyz = f"{self.ndim - 1} - j"
@@ -163,7 +166,7 @@ class Point2VoxelKernel(pccm.ParameterizedClass, pccm.pybind.PybindClassMixin):
                if ((c < 0 || c >= grid_bound[j])) {{
                    failed = true;
                }}
-                prod += grid_stride[j] * c;
+                prod += grid_stride[j] * int64_t(c);
            }}
            if (!failed){{
                points_indice_data[i] = prod;
@@ -218,7 +221,7 @@ class Point2VoxelKernel(pccm.ParameterizedClass, pccm.pybind.PybindClassMixin):
        code.arg("vsize", f"tv::array<float, {self.ndim}>")
        code.arg("coors_range", f"tv::array<float, {self.ndim * 2}>")
        code.arg("grid_bound", f"tv::array<int, {self.ndim}>")
-        code.arg("grid_stride", f"tv::array<int, {self.ndim}>")
+        code.arg("grid_stride", f"tv::array<int64_t, {self.ndim}>")

        code.arg("num_points", f"int")
        # TODO add backward?
@@ -323,7 +326,7 @@ class Point2Voxel(pccm.ParameterizedClass, pccm.pybind.PybindClassMixin):
        self.add_member("vsize", f"tv::array<float, {self.ndim}>")
        self.add_member("coors_range", f"tv::array<float, {self.ndim * 2}>")
        self.add_member("grid_size", f"tv::array<int, {self.ndim}>")
-        self.add_member("grid_stride", f"tv::array<int, {self.ndim}>")
+        self.add_member("grid_stride", f"tv::array<int64_t, {self.ndim}>")

    @pccm.pybind.mark_prop_getter(prop_name="grid_size")
    @pccm.member_function
@@ -414,7 +417,9 @@ class Point2Voxel(pccm.ParameterizedClass, pccm.pybind.PybindClassMixin):
        code.arg("voxels, indices, num_per_voxel, hashdata, point_indice_data, points_voxel_id",
                 "tv::Tensor")
        code.arg("vsize", f"std::array<float, {self.ndim}>")
-        code.arg("grid_size, grid_stride", f"std::array<int, {self.ndim}>")
+        code.arg("grid_size", f"std::array<int, {self.ndim}>")
+        code.arg("grid_stride", f"std::array<int64_t, {self.ndim}>")
+
        code.arg("coors_range", f"std::array<float, {self.ndim * 2}>")
        code.arg("clear_voxels", "bool", "true")
        code.arg("empty_mean", "bool", "false")

--- a/spconv/csrc/utils/__init__.py
+++ b/spconv/csrc/utils/__init__.py
@@ -13,3 +13,4 @@
 # limitations under the License.

 from .boxops import BoxOps
+from .pcc import PointCloudCompress
\ No newline at end of file
--- a/spconv/csrc/utils/launch.py
+++ b/spconv/csrc/utils/launch.py
+import pccm
+from cumm.common import TensorView
+
+class LaunchUtils(pccm.Class):
+    def __init__(self):
+        super().__init__()
+        self.add_include("limits")
+        self.add_dependency(TensorView)
+        self.add_static_const("kMaxGridYZDim", "int", "65535")
+
+    @pccm.static_function
+    def get_blocks_threads_of_2d_tensor(self):
+        code = pccm.code()
+        code.arg("nhot", "int64_t")
+        code.arg("num_features", "int64_t")
+
+        code.raw(f"""
+        constexpr int MaxThreads = 512;
+        int num_blocks_X = 0;
+        int num_blocks_Y = 0;
+        int threads_X = 0;
+        int threads_Y = 0;
+
+        dim3 threads;
+        bool found = tv::dispatch_int_noexcept<512, 256, 128, 64, 32, 16>(int(num_features), [](int my, int expect){{return my >= expect;}}, [&](auto V){{
+            // if num_features > value in list above, run this function.
+            // if a value is found, other value won't be executed.
+            int NumFeatures = TV_DECLTYPE(V)::value;
+            int Num0 = MaxThreads / NumFeatures;
+            num_blocks_X = tv::div_up(num_features, int64_t(NumFeatures));
+            num_blocks_Y = tv::div_up(nhot, int64_t(Num0));
+            threads_X = NumFeatures;
+            threads_Y = Num0;
+        }});
+        if (!found){{
+            int NumFeatures = 16;
+            int Num0 = MaxThreads / NumFeatures;
+            num_blocks_X = tv::div_up(num_features, int64_t(NumFeatures));
+            num_blocks_Y = tv::div_up(nhot, int64_t(Num0));
+            threads_X = NumFeatures;
+            threads_Y = Num0;
+        }}
+        return std::make_tuple(num_blocks_X, num_blocks_Y, threads_X, threads_Y);
+        """)
+        code.ret("std::tuple<int, int, int, int>")
+        return code 
+
--- a/spconv/csrc/utils/pcc.py
+++ b/spconv/csrc/utils/pcc.py
+# Copyright 2022 Yan Yan
+# 
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+# 
+#     http://www.apache.org/licenses/LICENSE-2.0
+# 
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import pccm 
+
+from cumm.common import TensorView, TensorViewNVRTC
+
+class PointCloudCompress(pccm.Class):
+    def __init__(self):
+        super().__init__()
+        self.add_dependency(TensorView, TensorViewNVRTC)
+        self.add_include("unordered_map")
+        self.add_include("tensorview/hash/hash_functions.h")
+        self.add_enum_class("EncodeType", [
+            ("XYZ_8", 0),
+            ("XYZI_8", 1),
+        ])
+
+    @pccm.pybind.mark 
+    @pccm.static_function
+    def encode_with_order(self):
+        code = pccm.code()
+        # TODO add checksum
+        code.arg("points", "tv::Tensor")
+        code.arg("intensity", "tv::Tensor")
+
+        code.arg("ex", "float")
+        code.arg("ey", "float")
+        code.arg("ez", "float")
+        code.arg("type", "EncodeType")
+
+        code.arg("with_order", "bool", "false")
+        code.raw(f"""
+        namespace op = tv::arrayops;
+        float vx = 256.0f * ex;
+        float vy = 256.0f * ey;
+        float vz = 256.0f * ez;
+        std::vector<std::tuple<uint64_t, tv::array<float, 3>>> offsets;
+        std::unordered_map<uint64_t, std::vector<int64_t>> order;
+        auto N = points.dim(0);
+        tv::array<float, 3> vsize{{vx, vy, vz}};
+        tv::array<float, 3> errors{{ex, ey, ez}};
+
+        tv::Tensor order_ten;
+        int64_t* order_ten_ptr = nullptr;
+
+        if (with_order){{
+            order_ten = tv::empty({{N}}, tv::int64);
+            order_ten_ptr = order_ten.data_ptr<int64_t>();
+        }}
+
+        using hash_t = tv::hash::SpatialHash<uint64_t>;
+        auto point_stride = points.stride(0);
+        int64_t final_size = sizeof(int64_t) * 5 + sizeof(float) * 3;
+        tv::Tensor res;
+        tv::ssprint(1);
+        tv::dispatch<float, double>(points.dtype(), [&](auto IP){{
+            using TPoint = TV_DECLTYPE(IP);
+
+            auto points_data = points.data_ptr<TPoint>();
+            tv::dispatch<float, double, uint8_t>(intensity.dtype(), [&](auto II){{
+                using TInten = TV_DECLTYPE(II); 
+                auto intensity_data = intensity.data_ptr<TInten>();
+                tv::dispatch_int<static_cast<int>(EncodeType::XYZI_8), static_cast<int>(EncodeType::XYZ_8)>(static_cast<int>(type), [&](auto I){{
+                    constexpr int kTypeInt = TV_DECLTYPE(I)::value;
+                    constexpr int kEncodeDim = kTypeInt == static_cast<int>(EncodeType::XYZI_8) ? 4 : 3;
+                    std::unordered_map<uint64_t, std::vector<tv::array<uint8_t, kEncodeDim>>> hash;
+                    int inten_stride = 0;
+                    if (kEncodeDim > 3){{
+                        TV_ASSERT_RT_ERR(!intensity.empty(), "inten must not empty");
+                        inten_stride = intensity.stride(0);
+                    }}
+
+                    for (size_t i = 0; i < N; ++i){{
+                        tv::array<float, 3> point = op::read_ptr<3>(points_data).template cast<float>();
+                        auto pos_unit_voxel = point / vsize;
+                        auto pos_int = op::apply(floorf, pos_unit_voxel).cast<int32_t>();
+                        auto pos_enc = (point / errors - pos_int.cast<float>() * float(256)).cast<uint8_t>();
+                        tv::array<uint8_t, kEncodeDim> enc;
+                        tv::if_constexpr<(kEncodeDim > 3)>([&](auto _){{
+                            TInten inten = intensity_data[0];
+                            enc = _(tv::array<uint8_t, kEncodeDim>{{pos_enc[0], pos_enc[1], pos_enc[2], uint8_t(inten)}});
+                            intensity_data += inten_stride;
+                        }}, [&](auto _){{
+                            enc = _(tv::array<uint8_t, kEncodeDim>{{pos_enc[0], pos_enc[1], pos_enc[2]}});
+                        }});
+                        auto pos_uint = pos_int + hash_t::direct_hash_offset();
+                        uint64_t scalar = hash_t::encode(pos_int[0], pos_int[1], pos_int[2]);
+                        auto iter = hash.find(scalar);
+                        if (iter == hash.end()){{
+                            auto pos_offset = pos_int.cast<float>() * vsize;
+                            std::vector<tv::array<uint8_t, kEncodeDim>> vec{{enc}};
+                            offsets.push_back({{scalar, pos_offset}});
+                            hash.insert({{scalar, vec}});
+                            final_size += sizeof(float) * 3 + sizeof(int) + sizeof(uint8_t) * kEncodeDim;
+                            if (with_order){{
+                                std::vector<int64_t> order_cluster{{int64_t(i)}};
+                                order.insert({{scalar, order_cluster}});
+                            }}
+                        }}else{{
+                            // iter.value().push_back(enc);
+                            iter->second.push_back(enc);
+                            final_size += sizeof(uint8_t) * kEncodeDim;
+                            if (with_order){{
+                                order.at(scalar).push_back(i);
+                            }}
+                        }}
+                        points_data += point_stride;
+
+                    }}
+                    res = tv::empty({{final_size}}, tv::uint8, -1);
+                    auto res_ptr = res.raw_data();
+                    int64_t* res_ptr_header = reinterpret_cast<int64_t*>(res_ptr);
+                    res_ptr_header[0] = int64_t(final_size); 
+                    res_ptr_header[1] = static_cast<int>(type);
+                    res_ptr_header[2] = int64_t(N);
+                    res_ptr_header[3] = int64_t(offsets.size());
+                    res_ptr_header[4] = 0;
+
+                    // TODO add checksum in header
+                    res_ptr += sizeof(int64_t) * 5;
+                    float* error_header = reinterpret_cast<float*>(res_ptr);
+                    error_header[0] = errors[0];
+                    error_header[1] = errors[1];
+                    error_header[2] = errors[2];
+                    res_ptr += sizeof(float) * 3;
+                    for (auto& p : offsets){{
+                        auto& offset = std::get<1>(p);
+                        auto& encodes = hash.at(std::get<0>(p));
+                        int cluster_size = encodes.size();
+                        reinterpret_cast<int*>(res_ptr)[0] = cluster_size;
+                        reinterpret_cast<float*>(res_ptr)[1] = offset[0];
+                        reinterpret_cast<float*>(res_ptr)[2] = offset[1];
+                        reinterpret_cast<float*>(res_ptr)[3] = offset[2];
+                        res_ptr += sizeof(int) + sizeof(float) * 3;
+                    }}
+                    for (auto& p : offsets){{
+                        auto& offset = std::get<1>(p);
+                        auto& encodes = hash.at(std::get<0>(p));
+                        int cluster_size = encodes.size();
+                        auto enc_ptr = reinterpret_cast<tv::array<uint8_t, kEncodeDim>*>(res_ptr);
+                        for (int i = 0; i < cluster_size; ++i){{
+                            enc_ptr[i] = encodes[i];
+                        }}
+                        if (with_order){{
+                            auto& orders = order.at(std::get<0>(p));
+                            for (int i = 0; i < cluster_size; ++i){{
+                                order_ten_ptr[i] = orders[i];
+                            }}
+                            order_ten_ptr += cluster_size;
+                        }}
+                        res_ptr += cluster_size * sizeof(tv::array<uint8_t, kEncodeDim>);
+                    }}
+                    TV_ASSERT_RT_ERR(res_ptr - res.raw_data() == final_size, "error");
+                }});
+            }});
+        }});
+        return std::make_tuple(res, order_ten);
+        """)
+        return code.ret("std::tuple<tv::Tensor, tv::Tensor>")
+
+    @pccm.pybind.mark 
+    @pccm.static_function
+    def encode_xyzi(self):
+        code = pccm.code()
+        code.arg("points", "tv::Tensor")
+        code.arg("intensity", "tv::Tensor")
+
+        code.arg("ex", "float")
+        code.arg("ey", "float")
+        code.arg("ez", "float")
+        code.raw(f"""
+        auto res = encode_with_order(points, intensity, ex, ey, ez, EncodeType::XYZI_8, false);
+        return std::get<0>(res);
+        """)
+        return code.ret("tv::Tensor")
+
+    @pccm.pybind.mark 
+    @pccm.static_function
+    def encode_xyz(self):
+        code = pccm.code()
+        code.arg("points", "tv::Tensor")
+        code.arg("ex", "float")
+        code.arg("ey", "float")
+        code.arg("ez", "float")
+        code.raw(f"""
+        auto res = encode_with_order(points, tv::Tensor(), ex, ey, ez, EncodeType::XYZ_8, false);
+        return std::get<0>(res);
+        """)
+        return code.ret("tv::Tensor")
+
+    @pccm.pybind.mark 
+    @pccm.static_function
+    def decode(self):
+        code = pccm.code()
+        code.arg("data", "tv::Tensor")
+        code.raw(f"""
+        namespace op = tv::arrayops;
+        const uint8_t* data_ptr = data.data_ptr<const uint8_t>();
+
+        auto res_ptr = data.raw_data();
+        int64_t* res_ptr_header = reinterpret_cast<int64_t*>(res_ptr);
+        int64_t final_size = res_ptr_header[0];
+        int type = res_ptr_header[1];
+        int64_t N = res_ptr_header[2];
+        int64_t voxel_num = res_ptr_header[3];
+
+        TV_ASSERT_RT_ERR(final_size == data.raw_size(), "size mismatch");
+        res_ptr += sizeof(int64_t) * 5;
+        float* error_header = reinterpret_cast<float*>(res_ptr);
+        tv::array<float, 3> error;
+        error[0] = error_header[0];
+        error[1] = error_header[1];
+        error[2] = error_header[2];
+        res_ptr += sizeof(float) * 3;
+        tv::Tensor points;
+        tv::dispatch_int<static_cast<int>(EncodeType::XYZI_8), static_cast<int>(EncodeType::XYZ_8)>(static_cast<int>(type), [&](auto I){{
+            constexpr int kTypeInt = TV_DECLTYPE(I)::value;
+            constexpr int kEncodeDim = kTypeInt == static_cast<int>(EncodeType::XYZI_8) ? 4 : 3;
+            points = tv::empty({{N, kEncodeDim}}, tv::float32);
+            auto points_ptr = points.data_ptr<float>();
+
+            auto enc_ptr = reinterpret_cast<tv::array<uint8_t, kEncodeDim>*>(res_ptr + voxel_num * (sizeof(int) * 1 + sizeof(float) * 3));
+            for (int i = 0; i < voxel_num; ++i){{
+                int cluster_size = reinterpret_cast<int*>(res_ptr)[0];
+                tv::array<float, 3> offset;
+                offset[0] = reinterpret_cast<float*>(res_ptr)[1];
+                offset[1] = reinterpret_cast<float*>(res_ptr)[2];
+                offset[2] = reinterpret_cast<float*>(res_ptr)[3];
+                auto point_cur_ptr = points_ptr;
+                for (int j = 0; j < cluster_size; ++j){{
+                    auto& enc = enc_ptr[j];
+                    auto point = op::slice<0, 3>(enc).template cast<float>() * error + offset;
+                    point_cur_ptr[0] = point[0];
+                    point_cur_ptr[1] = point[1];
+                    point_cur_ptr[2] = point[2];
+                    if (kEncodeDim > 3){{
+                        point_cur_ptr[3] = enc[3];
+                    }}
+                    point_cur_ptr += kEncodeDim;
+                }}
+                res_ptr += sizeof(int) + sizeof(float) * 3;
+                enc_ptr += cluster_size;
+                points_ptr += cluster_size * kEncodeDim;
+            }}
+        }});
+        return points;
+        """)
+        return code.ret("tv::Tensor")
+
--- a/spconv/pytorch/conv.py
+++ b/spconv/pytorch/conv.py
@@ -270,6 +270,7 @@ class SparseConvolution(SparseModule):
        if self.training:
            msg =  "act don't support backward, only used in inference"
            assert self.act_type == tv.gemm.Activation.None_, msg
+        
        if not self.subm:
            if self.transposed:
                out_spatial_shape = ops.get_deconv_output_size(
@@ -428,6 +429,7 @@ class SparseConvolution(SparseModule):
                            indice_pair_num,
                            outids.shape[0],
                            algo,
+                            input._timer,
                            bias_for_infer,
                            self.act_alpha,
                            self.act_beta,
@@ -551,7 +553,6 @@ class SparseConvolution(SparseModule):
                    self.act_alpha,
                    self.act_beta,
                    self.act_type)
-        
        if bias_for_training is not None:
            out_features += bias_for_training
        if input.benchmark:
@@ -571,6 +572,9 @@ class SparseConvolution(SparseModule):
        out_tensor.indices = outids
        out_tensor.indice_dict = indice_dict
        out_tensor.spatial_shape = out_spatial_shape
+        # print(outids.shape, spatial_shape, self.kernel_size, self.stride, self.padding,
+        #             self.dilation, self.output_padding, out_spatial_shape)
+
        return out_tensor

    def _check_subm_reuse_valid(self, inp: SparseConvTensor,

--- a/spconv/pytorch/ops.py
+++ b/spconv/pytorch/ops.py
@@ -1561,6 +1561,8 @@ def implicit_gemm(features: torch.Tensor,
    with timer.record("implicit_gemm", stream):
        for j in range(num_split):
            beta = 0 if j == 0 else 1
+            if bias is not None:
+                beta = 1 
            CONV.run_with_tuned_result(
                tune_res,
                ConvOpType.kForward,