Commit 838aa644 authored by yan.yan's avatar yan.yan
Browse files

Merge branch 'feature/ampere'

parents a31b131f f3a22f19
......@@ -15,7 +15,7 @@ jobs:
runs-on: windows-2019
strategy:
matrix:
python-version: ['3.7', '3.8', '3.9', '3.10']
python-version: ['3.7', '3.8', '3.9', '3.10', '3.11']
cuda-version: ['10.2', '11.1', '11.4']
steps:
- uses: actions/checkout@master
......@@ -115,7 +115,7 @@ jobs:
runs-on: ubuntu-20.04
strategy:
matrix:
python-version: ['3.6', '3.7', '3.8', '3.9', '3.10'] # this version is only used for upload.
python-version: ['3.7', '3.8', '3.9', '3.10', '3.11'] # this version is only used for upload.
cuda-version: ['102', '111', '113', '114', '']
steps:
......
# Changelog
## [2.2.0] - 2022-9-24
### Added
- Add Ampere support. faster fp16, faster tf32 and greatly faster int8 kernels in Ampere GPUs.
- Add pure c++ code generation (libspconv.so) for deploy (or train in another deeplearning framework)
- Add NVRTC support for all gemm kernels. if your GPU architecture isn't compiled in prebuilt, spconv will use slightly slower (10-20us overhead for every kernel launch) NVRTC kernels.
### Fixed
- Fix launch fail in maxpool if too much voxels
### Changed
- all weight layout will be KRSC, don't support old spconv 1.x weights anymore.
- previous gemm ops in ops.py now move to c++ by default (controlled by spconv.constants.SPCONV_CPP_GEMM)
### Removed
- drop python 3.6 support.
- pascal and kepler architecture is removed in CUDA 12 prebuilt.
## [2.1.22] - 2022-6-11
### Fixed
- Fix thrust problem by adding -fvisibility=hidden
......
......@@ -17,16 +17,19 @@
[pypi-ver-114]: https://img.shields.io/pypi/v/spconv-cu114
[pypi-ver-111]: https://img.shields.io/pypi/v/spconv-cu111
[pypi-ver-113]: https://img.shields.io/pypi/v/spconv-cu113
[pypi-ver-120]: https://img.shields.io/pypi/v/spconv-cu120
[pypi-ver-102]: https://img.shields.io/pypi/v/spconv-cu102
[pypi-url-102]: https://pypi.org/project/spconv-cu102/
[pypi-download-102]: https://img.shields.io/pypi/dm/spconv-cu102
[pypi-url-111]: https://pypi.org/project/spconv-cu111/
[pypi-download-111]: https://img.shields.io/pypi/dm/spconv-cu111
[pypi-url-113]: https://pypi.org/project/spconv-cu113/
[pypi-download-113]: https://img.shields.io/pypi/dm/spconv-cu113
[pypi-url-102]: https://pypi.org/project/spconv-cu102/
[pypi-download-102]: https://img.shields.io/pypi/dm/spconv-cu102
[pypi-url-114]: https://pypi.org/project/spconv-cu114/
[pypi-download-114]: https://img.shields.io/pypi/dm/spconv-cu114
[pypi-url-120]: https://pypi.org/project/spconv-cu120/
[pypi-download-120]: https://img.shields.io/pypi/dm/spconv-cu120
[pypi-url-cpu]: https://pypi.org/project/spconv/
[pypi-download-cpu]: https://img.shields.io/pypi/dm/spconv
......@@ -37,10 +40,11 @@
| | PyPI | Install |Downloads |
| -------------- |:---------------------:| ---------------------:| ---------------------:|
| CPU (Linux Only) | [![PyPI Version][pypi-ver-cpu]][pypi-url-cpu] | ```pip install spconv``` | [![pypi monthly download][pypi-download-cpu]][pypi-url-cpu] |
| CUDA 10.2 | [![PyPI Version][pypi-ver-102]][pypi-url-102] | ```pip install spconv-cu102``` | [![pypi monthly download][pypi-download-102]][pypi-url-102] |
| CUDA 10.2 | [![PyPI Version][pypi-ver-102]][pypi-url-102] | ```pip install spconv-cu102```| [![pypi monthly download][pypi-download-102]][pypi-url-102]|
| CUDA 11.1 | [![PyPI Version][pypi-ver-111]][pypi-url-111] | ```pip install spconv-cu111```| [![pypi monthly download][pypi-download-111]][pypi-url-111]|
| CUDA 11.3 (Linux Only) | [![PyPI Version][pypi-ver-113]][pypi-url-113] | ```pip install spconv-cu113```| [![pypi monthly download][pypi-download-113]][pypi-url-113]|
| CUDA 11.4 | [![PyPI Version][pypi-ver-114]][pypi-url-114] | ```pip install spconv-cu114```| [![pypi monthly download][pypi-download-114]][pypi-url-114]|
<!-- | CUDA 12.0 | [![PyPI Version][pypi-ver-120]][pypi-url-120] | ```pip install spconv-cu120```| [![pypi monthly download][pypi-download-120]][pypi-url-120]| -->
```spconv``` is a project that provide heavily-optimized sparse convolution implementation with tensor core support. check [benchmark](docs/BENCHMARK.md) to see how fast spconv 2.x runs.
......@@ -48,11 +52,9 @@
Check [spconv 2.x algorithm introduction](docs/spconv2_algo.pdf) to understand sparse convolution algorithm in spconv 2.x!
**WARNING** spconv < 2.1.18 users need to upgrade your version to 2.1.18, it fix a bug in conv weight init which cause std of inited weight too large, and a bug in PointToVoxel.
## Breaking changes in Spconv 2.x
## NEWS
Spconv 1.x users **NEED READ [THIS](docs/SPCONV_2_BREAKING_CHANGEs.md)** before using spconv 2.x.
* spconv 2.2: ampere feature support (by [EvernightAurora](https://github.com/EvernightAurora)), pure c++ code generation, nvrtc, drop python 3.6
## Spconv 2.1 vs Spconv 1.x
......@@ -64,11 +66,12 @@ Spconv 1.x users **NEED READ [THIS](docs/SPCONV_2_BREAKING_CHANGEs.md)** before
* [doesn't depend on pytorch binary](docs/FAQ.md#What-does-no-dependency-on-pytorch-mean), but you may need at least pytorch >= 1.5.0 to run spconv 2.x.
* since spconv 2.x doesn't depend on pytorch binary (never in future), it's impossible to support torch.jit/libtorch inference.
## Spconv 2.x Development and Roadmap
Spconv 2.2 development has started. See [this issue](https://github.com/traveller59/spconv/issues/380) for more details.
## Spconv 2.2 vs Spconv 2.1
See [dev plan](docs/SPCONV_DEVELOP_PLAN.md). A complete guide of spconv development will be released soon.
* faster fp16 kernels (~5-30%) in ampere GPUs (tested in RTX 3090)
* greatly faster int8 kernels (~1.2x~2.7x) in ampere GPUs (tested in RTX 3090)
* no python 3.6 support
* nvrtc support: kernel in old GPUs will be compiled in runtime.
## Usage
......@@ -80,24 +83,20 @@ Don't forget to check [performance guide](docs/PERFORMANCE_GUIDE.md).
## Install
You need to install python >= 3.6 (>=3.7 for windows) first to use spconv 2.x.
You need to install python >= 3.7 first to use spconv 2.x.
You need to install CUDA toolkit first before using prebuilt binaries or build from source.
You need at least CUDA 10.2 to build and run spconv 2.x. We won't offer any support for CUDA < 10.2.
You need at least CUDA 11.0 to build and run spconv 2.x. We won't offer any support for CUDA < 11.0.
### Prebuilt
We offer python 3.6-3.10 and cuda 10.2/11.1/11.3/11.4 prebuilt binaries for linux (manylinux).
We offer python 3.7-3.11 and cuda 10.2/11.1/11.3/11.4/12.0 prebuilt binaries for linux (manylinux).
We offer python 3.7-3.10 and cuda 10.2/11.1/11.4 prebuilt binaries for windows 10/11.
We will provide prebuilts for CUDA versions supported by latest pytorch release. For example, pytorch 1.10 provide cuda 10.2 and 11.3 prebuilts, so we provide them too.
We offer python 3.7-3.11 and cuda 10.2/11.1/11.4/12.0 prebuilt binaries for windows 10/11.
For Linux users, you need to install pip >= 20.3 first to install prebuilt.
CUDA 11.1 will be removed in spconv 2.2 because pytorch 1.10 don't provide prebuilts for it.
```pip install spconv``` for CPU only (**Linux Only**). you should only use this for debug usage, the performance isn't optimized due to manylinux limit (no omp support).
```pip install spconv-cu102``` for CUDA 10.2
......@@ -108,9 +107,9 @@ CUDA 11.1 will be removed in spconv 2.2 because pytorch 1.10 don't provide prebu
```pip install spconv-cu114``` for CUDA 11.4
**NOTE** It's safe to have different **minor** cuda version between system and conda (pytorch) in **CUDA >= 11.0** because of [CUDA Minor Version Compatibility](https://docs.nvidia.com/deploy/cuda-compatibility/#minor-version-compatibility). For example, you can use spconv-cu114 with anaconda version of pytorch cuda 11.1 in a OS with CUDA 11.2 installed.
```pip install spconv-cu120``` for CUDA 12.0
For CUDA 10, we don't know whether ```spconv-cu102``` works with CUDA 10.0 and 10.1. Users can have a try.
**NOTE** It's safe to have different **minor** cuda version between system and conda (pytorch) in **CUDA >= 11.0** because of [CUDA Minor Version Compatibility](https://docs.nvidia.com/deploy/cuda-compatibility/#minor-version-compatibility). For example, you can use spconv-cu114 with anaconda version of pytorch cuda 11.1 in a OS with CUDA 11.2 installed.
**NOTE** In Linux, you can install spconv-cuxxx without install CUDA to system! only suitable NVIDIA driver is required. for CUDA 11, we need driver >= 450.82.
......@@ -118,11 +117,12 @@ For CUDA 10, we don't know whether ```spconv-cu102``` works with CUDA 10.0 and 1
See [this page](https://arnon.dk/matching-sm-architectures-arch-and-gencode-for-various-nvidia-cards/) to check supported GPU names by arch.
If you use a GPU architecture that isn't compiled in prebuilt, spconv will use NVRTC to compile a slightly slower kernel.
| CUDA version | GPU Arch List |
| -------------- |:---------------------:|
| 10.2 | 50,52,60,61,70,75 |
| 11.x | 52,60,61,70,75,80,86 |
| 12.x | 60,61,70,75,80,86,90 |
| 12.x | 70,75,80,86,90 |
### Build from source for development (JIT, recommend)
......@@ -171,9 +171,9 @@ You need to rebuild ```cumm``` first if you are build along a CUDA version that
5. run ```pip install pccm cumm wheel```
6. run ```python setup.py bdist_wheel```+```pip install dists/xxx.whl```
## Know issues
## Contributers
* Spconv 2.x F16 runs slow in A100.
* [EvernightAurora](https://github.com/EvernightAurora): add ampere feature.
## Note
......
# Copyright 2021 Yan Yan
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""simple fuse
see https://pytorch.org/tutorials/intermediate/fx_conv_bn_fuser.html
"""
import time
from pathlib import Path
from typing import Any, Dict, Tuple
import numpy as np
import torch
from torch import nn
from cumm import tensorview as tv
from spconv.core import ConvAlgo
import torch.fx
import spconv.pytorch as spconv
import copy
import pickle
from spconv.pytorch.conv import SparseConvolution
from spconv.pytorch import functional as Fsp
def fuse_bn_weights(conv_w_OKI, conv_b, bn_rm, bn_rv, bn_eps, bn_w, bn_b):
NDim = conv_w_OKI.ndim - 2
permute = [0, NDim+1] + [i+1 for i in range(NDim)]
conv_w_OIK = conv_w_OKI.permute(*permute)
# OIDHW
if conv_b is None:
conv_b = torch.zeros_like(bn_rm)
if bn_w is None:
bn_w = torch.ones_like(bn_rm)
if bn_b is None:
bn_b = torch.zeros_like(bn_rm)
bn_var_rsqrt = torch.rsqrt(bn_rv + bn_eps)
conv_w_OIK = conv_w_OIK * (bn_w * bn_var_rsqrt).reshape([-1] + [1] * (len(conv_w_OIK.shape) - 1))
conv_b = (conv_b - bn_rm) * bn_var_rsqrt * bn_w + bn_b
permute = [0,] + [i+2 for i in range(NDim)] + [1,]
conv_w_OKI = conv_w_OIK.permute(*permute).contiguous()
return torch.nn.Parameter(conv_w_OKI), torch.nn.Parameter(conv_b)
def fuse_bn(conv, bn):
"""
Given a conv Module `A` and an batch_norm module `B`, returns a conv
module `C` such that C(x) == B(A(x)) in inference mode.
"""
assert(not (conv.training or bn.training)), "Fusion only for eval!"
fused_conv = copy.deepcopy(conv)
fused_conv.weight, fused_conv.bias = \
fuse_bn_weights(fused_conv.weight, fused_conv.bias,
bn.running_mean, bn.running_var, bn.eps, bn.weight, bn.bias)
return fused_conv
def fuse_act_net(conv, act):
"""
Given a conv Module `A` and an batch_norm module `B`, returns a conv
module `C` such that C(x) == B(A(x)) in inference mode.
"""
assert(not (conv.training)), "Fusion only for eval!"
fused_conv = copy.deepcopy(conv)
if isinstance(act, torch.nn.ReLU):
fused_conv.act_type = tv.gemm.Activation.ReLU
if isinstance(act, torch.nn.Sigmoid):
fused_conv.act_type = tv.gemm.Activation.Sigmoid
elif isinstance(act, torch.nn.LeakyReLU):
fused_conv.act_type = tv.gemm.Activation.LeakyReLU
fused_conv.act_alpha = act.negative_slope
else:
raise NotImplementedError
return fused_conv
def _parent_name(target : str) -> Tuple[str, str]:
"""
Splits a qualname into parent path and last atom.
For example, `foo.bar.baz` -> (`foo.bar`, `baz`)
"""
*parent, name = target.rsplit('.', 1)
return parent[0] if parent else '', name
def replace_node_module(node: torch.fx.Node, modules: Dict[str, Any], new_module: torch.nn.Module):
assert(isinstance(node.target, str))
parent_name, name = _parent_name(node.target)
setattr(modules[parent_name], name, new_module)
def fuse(model: torch.fx.GraphModule) -> torch.fx.GraphModule:
model = copy.deepcopy(model)
# The first step of most FX passes is to symbolically trace our model to
# obtain a `GraphModule`. This is a representation of our original model
# that is functionally identical to our original model, except that we now
# also have a graph representation of our forward pass.
fx_model = model
modules = dict(fx_model.named_modules())
# The primary representation for working with FX are the `Graph` and the
# `Node`. Each `GraphModule` has a `Graph` associated with it - this
# `Graph` is also what generates `GraphModule.code`.
# The `Graph` itself is represented as a list of `Node` objects. Thus, to
# iterate through all of the operations in our graph, we iterate over each
# `Node` in our `Graph`.
for node in fx_model.graph.nodes:
# The FX IR contains several types of nodes, which generally represent
# call sites to modules, functions, or methods. The type of node is
# determined by `Node.op`.
if node.op != 'call_module': # If our current node isn't calling a Module then we can ignore it.
continue
# For call sites, `Node.target` represents the module/function/method
# that's being called. Here, we check `Node.target` to see if it's a
# batch norm module, and then check `Node.args[0].target` to see if the
# input `Node` is a convolution.
# print(node.target, node.args, node.args[0].args)
if isinstance(modules[node.target], torch.nn.BatchNorm1d):
if node.args[0].target in modules and isinstance(modules[node.args[0].target], SparseConvolution):
if len(node.args[0].users) > 1: # Output of conv is used by other nodes
continue
conv = modules[node.args[0].target]
bn = modules[node.target]
fused_conv = fuse_bn(conv, bn)
assert isinstance(fused_conv, SparseConvolution)
replace_node_module(node.args[0], modules, fused_conv)
modules_to_fuse = [node.args[0].target, node.target]
try:
if isinstance(modules[node.next.target], torch.nn.ReLU):
modules_to_fuse.append(node.next.target)
except Exception as e:
pass
# As we've folded the batch nor into the conv, we need to replace all uses
# of the batch norm with the conv.
node.replace_all_uses_with(node.args[0])
# Now that all uses of the batch norm have been replaced, we can
# safely remove the batch norm.
fx_model.graph.erase_node(node)
fx_model.graph.lint()
# After we've modified our graph, we need to recompile our graph in order
# to keep the generated code in sync.
fx_model.recompile()
return fx_model
def fuse_act(model: torch.fx.GraphModule) -> torch.fx.GraphModule:
model = copy.deepcopy(model)
# The first step of most FX passes is to symbolically trace our model to
# obtain a `GraphModule`. This is a representation of our original model
# that is functionally identical to our original model, except that we now
# also have a graph representation of our forward pass.
fx_model = model
modules = dict(fx_model.named_modules())
# The primary representation for working with FX are the `Graph` and the
# `Node`. Each `GraphModule` has a `Graph` associated with it - this
# `Graph` is also what generates `GraphModule.code`.
# The `Graph` itself is represented as a list of `Node` objects. Thus, to
# iterate through all of the operations in our graph, we iterate over each
# `Node` in our `Graph`.
for node in fx_model.graph.nodes:
# The FX IR contains several types of nodes, which generally represent
# call sites to modules, functions, or methods. The type of node is
# determined by `Node.op`.
if node.op != 'call_module': # If our current node isn't calling a Module then we can ignore it.
continue
# For call sites, `Node.target` represents the module/function/method
# that's being called. Here, we check `Node.target` to see if it's a
# batch norm module, and then check `Node.args[0].target` to see if the
# input `Node` is a convolution.
# print(node.target, node.args, node.args[0].args)
if isinstance(modules[node.target], (torch.nn.ReLU, torch.nn.LeakyReLU, torch.nn.Sigmoid)):
if node.args[0].target in modules and isinstance(modules[node.args[0].target], SparseConvolution):
if len(node.args[0].users) > 1: # Output of conv is used by other nodes
continue
conv = modules[node.args[0].target]
act = modules[node.target]
fused_conv = fuse_act_net(conv, act)
assert isinstance(fused_conv, SparseConvolution)
replace_node_module(node.args[0], modules, fused_conv)
# As we've folded the batch nor into the conv, we need to replace all uses
# of the batch norm with the conv.
node.replace_all_uses_with(node.args[0])
# Now that all uses of the batch norm have been replaced, we can
# safely remove the batch norm.
fx_model.graph.erase_node(node)
fx_model.graph.lint()
# After we've modified our graph, we need to recompile our graph in order
# to keep the generated code in sync.
fx_model.recompile()
return fx_model
class Net(nn.Module):
def __init__(self, shape, algo):
super().__init__()
pool_algo = algo
# pool_algo = ConvAlgo.Native
self.net = spconv.SparseSequential(
spconv.SubMConv3d(3, 64, 3, bias=False, indice_key="c0",
algo=algo),
nn.BatchNorm1d(64),
nn.ReLU(),
spconv.SubMConv3d(64,
64,
3,
bias=False,
indice_key="c0",
algo=algo),
nn.BatchNorm1d(64),
nn.ReLU(),
spconv.SparseConv3d(64, 64, 2, 2, bias=False, indice_key="m0", algo=algo),
nn.BatchNorm1d(64),
nn.ReLU(),
spconv.SubMConv3d(64,
96,
3,
bias=False,
indice_key="c1",
algo=algo),
nn.BatchNorm1d(96),
nn.ReLU(),
spconv.SubMConv3d(96,
96,
3,
bias=False,
indice_key="c1",
algo=algo),
nn.BatchNorm1d(96),
nn.ReLU(),
spconv.SparseConv3d(96, 96, 2, 2, bias=False, indice_key="m1", algo=algo),
nn.BatchNorm1d(96),
nn.ReLU(),
spconv.SubMConv3d(96,
128,
3,
bias=False,
indice_key="c2",
algo=algo),
nn.BatchNorm1d(128),
nn.ReLU(),
spconv.SubMConv3d(128,
128,
3,
bias=False,
indice_key="c2",
algo=algo),
nn.BatchNorm1d(128),
nn.ReLU(),
spconv.SparseConv3d(128, 128, 2, 2, bias=False, indice_key="m2", algo=algo),
nn.BatchNorm1d(128),
nn.ReLU(),
spconv.SubMConv3d(128,
160,
3,
bias=False,
indice_key="c3",
algo=algo),
nn.BatchNorm1d(160),
nn.ReLU(),
spconv.SubMConv3d(160,
160,
3,
bias=False,
indice_key="c3",
algo=algo),
nn.BatchNorm1d(160),
nn.ReLU(),
spconv.SparseConv3d(160, 160, 2, 2, bias=False, indice_key="m3", algo=algo),
nn.BatchNorm1d(160),
nn.ReLU(),
spconv.SubMConv3d(160,
192,
3,
bias=False,
indice_key="c4",
algo=algo),
nn.BatchNorm1d(192),
nn.ReLU(),
spconv.SubMConv3d(192,
192,
3,
bias=False,
indice_key="c4",
algo=algo),
nn.BatchNorm1d(192),
nn.ReLU(),
spconv.SparseConv3d(192, 192, 2, 2, bias=False, indice_key="m4", algo=algo),
nn.BatchNorm1d(192),
nn.ReLU(),
spconv.SubMConv3d(192,
224,
3,
bias=False,
indice_key="c5",
algo=algo),
nn.BatchNorm1d(224),
nn.ReLU(),
spconv.SubMConv3d(224,
224,
3,
bias=False,
indice_key="c5",
algo=algo),
nn.BatchNorm1d(224),
nn.ReLU(),
spconv.SparseConv3d(224, 224, 2, 2, bias=False, indice_key="m5", algo=algo),
nn.BatchNorm1d(224),
nn.ReLU(),
spconv.SubMConv3d(224,
256,
3,
bias=False,
indice_key="c6",
algo=algo),
nn.BatchNorm1d(256),
nn.ReLU(),
spconv.SubMConv3d(256,
256,
3,
bias=False,
indice_key="c6",
algo=algo),
nn.BatchNorm1d(256),
nn.ReLU(),
spconv.SparseInverseConv3d(256,
128,
2,
indice_key="m5",
bias=False,
algo=algo),
nn.BatchNorm1d(128),
nn.ReLU(),
spconv.SparseInverseConv3d(128,
64,
2,
indice_key="m4",
bias=False,
algo=algo),
nn.BatchNorm1d(64),
nn.ReLU(),
)
max_batch_size = 1
# grid (dense map) is used for indice generation. use pre-allocated grid can run faster.
# self.grid = None
self.shape = shape
for n in self.net.modules():
if isinstance(n, nn.BatchNorm1d):
n.bias.data.uniform_(-0.1, 0.1)
def forward(self, features, coors, batch_size, vx_num=None):
x = spconv.SparseConvTensor(features, coors, self.shape, batch_size, voxel_num=vx_num)
return self.net(x)
class MyTracer(torch.fx.Tracer):
def is_leaf_module(self, m: torch.nn.Module, module_qualified_name):
is_custom_leaf_module = isinstance(m, SparseConvolution)
return super().is_leaf_module(m, module_qualified_name) or is_custom_leaf_module
def main():
# run this file with SPCONV_FX_TRACE_MODE=1
torch.manual_seed(50051)
torch.backends.cuda.matmul.allow_tf32 = False
torch.backends.cudnn.allow_tf32 = False
with open(Path(__file__).parent / "data" / "test_spconv.pkl", "rb") as f:
(voxels, coors, spatial_shape) = pickle.load(f)
np.random.seed(50051)
device = torch.device("cuda:0")
device_cpu = torch.device("cpu:0")
dtype = torch.float32
net = Net(spatial_shape, ConvAlgo.MaskImplicitGemm).cuda().eval().to(dtype)
tracer = MyTracer()
graph_trace = tracer.trace(net)
net_fused = torch.fx.GraphModule(tracer.root, graph_trace)
net_fused = fuse(net_fused)
net_fused = fuse_act(net_fused)
print(net_fused)
voxels_th = torch.from_numpy(voxels).to(device_cpu).to(dtype)
coors_th = torch.from_numpy(coors).to(device_cpu).int()
voxels_th_cuda = torch.from_numpy(voxels).to(device).to(dtype)
coors_th_cuda = torch.from_numpy(coors).to(device).int()
out_ref = net(voxels_th_cuda, coors_th_cuda, 1)
print("-------------fused------------")
out_fused = net_fused(voxels_th_cuda, coors_th_cuda, 1)
res = Fsp.sparse_add_hash_based(out_ref, out_fused.minus())
print(torch.linalg.norm(res.features))
if __name__ == "__main__":
main()
\ No newline at end of file
# Copyright 2022 Yan Yan
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from spconv.core_cc.csrc.utils.pcc import PointCloudCompress
from pathlib import Path
import numpy as np
from cumm import tensorview as tv
def main():
data = np.load(Path(__file__).parent.parent / "test" / "data" / "benchmark-pc.npz")
pc = np.ascontiguousarray(data["pc"]).astype(np.float32)
pc_encoded, order = PointCloudCompress.encode_with_order(tv.from_numpy(pc), tv.Tensor(), 0.01, 0.01, 0.01, PointCloudCompress.EncodeType.XYZ_8, True)
print(pc.nbytes, pc_encoded.bytesize())
pc_recover = PointCloudCompress.decode(pc_encoded)
pc_reorder = pc[order.numpy_view()]
error = pc_recover.numpy_view()[:, :3] - pc_reorder[:, :3]
print("ERROR", np.abs(error).max())
if __name__ == "__main__":
main()
\ No newline at end of file
......@@ -38,9 +38,9 @@ if cuda_ver:
cuda_ver = cuda_ver.replace(".", "") # 10.2 to 102
RELEASE_NAME += "-cu{}".format(cuda_ver)
deps = ["cumm-cu{}>=0.2.8".format(cuda_ver)]
deps = ["cumm-cu{}>=0.3.0".format(cuda_ver)]
else:
deps = ["cumm>=0.2.8"]
deps = ["cumm>=0.3.0"]
......@@ -156,7 +156,7 @@ if disable_jit is not None and disable_jit == "1":
from cumm.conv.main import ConvMainUnitTest
from cumm.constants import CUMM_CPU_ONLY_BUILD
from spconv.csrc.sparse.all import SpconvOps
from spconv.csrc.utils import BoxOps
from spconv.csrc.utils import BoxOps, PointCloudCompress
from spconv.csrc.hash.core import HashTable
from cumm.common import CompileInfo
from spconv.csrc.sparse.alloc import ExternalAllocator
......@@ -192,7 +192,7 @@ if disable_jit is not None and disable_jit == "1":
convops.namespace = "csrc.sparse.convops.spops"
cus = [gemmtuner, convtuner,
convops, SpconvOps(), BoxOps(), HashTable(), CompileInfo(),
ExternalAllocator(),
ExternalAllocator(), PointCloudCompress(),
ExternalSpconvMatmul(), InferenceOps()]
if not CUMM_CPU_ONLY_BUILD:
cus.extend([cu, convcu])
......
......@@ -31,7 +31,7 @@ if project_is_installed(PACKAGE_NAME) and project_is_editable(
from spconv.csrc.sparse.all import SpconvOps
from spconv.csrc.sparse.alloc import ExternalAllocator
from spconv.csrc.utils import BoxOps
from spconv.csrc.utils import BoxOps, PointCloudCompress
from spconv.csrc.hash.core import HashTable
from spconv.csrc.sparse.convops import GemmTunerSimple, ExternalSpconvMatmul
from spconv.csrc.sparse.convops import ConvTunerSimple, ConvGemmOps
......@@ -65,6 +65,7 @@ if project_is_installed(PACKAGE_NAME) and project_is_editable(
ExternalSpconvMatmul(),
SimpleExternalSpconvMatmul(), # for debug, won't be included in release
InferenceOps(),
PointCloudCompress(),
]
pccm.builder.build_pybind(cus,
PACKAGE_ROOT / "core_cc",
......
from typing import overload, Any, Callable, Dict, List, Optional, Set, Tuple, Type, Union
from pccm.stubs import EnumValue, EnumClassValue
from cumm.tensorview import Tensor
class PointCloudCompress:
@staticmethod
def encode_with_order(points: Tensor, intensity: Tensor, ex: float, ey: float, ez: float, type, with_order: bool = False) -> Tuple[Tensor, Tensor]:
"""
Args:
points:
intensity:
ex:
ey:
ez:
type:
with_order:
"""
...
@staticmethod
def encode_xyzi(points: Tensor, intensity: Tensor, ex: float, ey: float, ez: float) -> Tensor:
"""
Args:
points:
intensity:
ex:
ey:
ez:
"""
...
@staticmethod
def encode_xyz(points: Tensor, ex: float, ey: float, ez: float) -> Tensor:
"""
Args:
points:
ex:
ey:
ez:
"""
...
@staticmethod
def decode(data: Tensor) -> Tensor:
"""
Args:
data:
"""
...
class EncodeType:
XYZ_8 = EnumClassValue(0) # type: EnumClassValue
XYZI_8 = EnumClassValue(1) # type: EnumClassValue
@staticmethod
def __members__() -> Dict[str, EnumClassValue]: ...
......@@ -27,4 +27,4 @@ from spconv.core_cc.csrc.utils.boxops import BoxOps
from spconv.core_cc.cumm.common import CompileInfo
HAS_BOOST = BoxOps.has_boost()
COMPILED_CUDA_ARCHS = set(CompileInfo.get_compiled_cuda_arch())
COMPILED_CUDA_ARCHS = set(CompileInfo.get_compiled_gemm_cuda_arch())
......@@ -1384,7 +1384,8 @@ class SpconvOps(pccm.Class):
code.arg("voxels, indices, num_per_voxel, hashdata, point_indice_data, pc_voxel_id",
"tv::Tensor")
code.arg("vsize", f"std::vector<float>")
code.arg("grid_size, grid_stride", f"std::vector<int>")
code.arg("grid_size", f"std::vector<int>")
code.arg("grid_stride", f"std::vector<int64_t>")
code.arg("coors_range", f"std::vector<float>")
code.arg("empty_mean", "bool", "false")
......@@ -1404,7 +1405,9 @@ class SpconvOps(pccm.Class):
code.raw(f"""
if (ndim == {ndim}){{
std::array<float, {ndim}> vsize_;
std::array<int, {ndim}> grid_size_, grid_stride_;
std::array<int, {ndim}> grid_size_;
std::array<int64_t, {ndim}> grid_stride_;
std::array<float, {ndim * 2}> coors_range_;
for (int i = 0; i < {ndim}; ++i){{
vsize_[i] = vsize[i];
......
......@@ -550,7 +550,7 @@ class GemmTunerSimple(pccm.ParameterizedClass):
}}
auto avail_algos = get_available_algo_str_from_arch(arch);
std::vector<tv::gemm::GemmAlgoDesp> finally_algos;
auto is_arch_compiled = CompileInfo::arch_is_compiled(arch);
auto is_arch_compiled = CompileInfo::arch_is_compiled_gemm(arch);
for (auto algo : avail_algos){{
static_key_t static_key = std::make_tuple(trans_a, trans_b, trans_c, int(a.dtype()),
int(b.dtype()), int(c.dtype()), shuffle_type, algo);
......@@ -996,7 +996,7 @@ class ConvTunerSimple(pccm.ParameterizedClass):
use_f32_as_accum = false;
std::vector<tv::gemm::ConvAlgoDesp> finally_algos;
auto is_arch_compiled = CompileInfo::arch_is_compiled(arch);
auto is_arch_compiled = CompileInfo::arch_is_compiled_gemm(arch);
for (auto algo : avail_algos){{
static_key_t static_key = std::make_tuple(
layout_i, layout_w, layout_o,
......@@ -2048,7 +2048,12 @@ class ConvGemmOps(pccm.ParameterizedClass):
for (int j = 0; j < num_split; ++j){{
float beta = j == 0 ? 0 : 1;
if (!bias.empty()){{
beta = 1;
}}
if (j > 0){{
bias = tv::Tensor();
}}
conv_tuner.run_with_tuned_result(
tune_res,
kForwardInt,
......
......@@ -13,29 +13,37 @@
# limitations under the License.
import pccm
from cumm.common import TensorView, GemmDTypes, TensorViewKernel, ThrustLib, GemmBasic
from cumm.common import TensorView, GemmDTypes, TensorViewKernel, TensorViewNVRTC, ThrustLib, GemmBasic
from spconv.csrc.sparse.cpu_core import OMPLib
from ..utils.launch import LaunchUtils
from cumm.constants import CUMM_CPU_ONLY_BUILD
class InferenceOpsKernel(pccm.ParameterizedClass):
def __init__(self):
super().__init__()
self.add_dependency(TensorViewKernel, GemmBasic)
self.add_dependency(TensorViewKernel, TensorViewNVRTC, GemmBasic, LaunchUtils)
@pccm.cuda.cuda_global_function
def bias_add_inplace_kernel(self):
code = pccm.FunctionCode()
code.targ("T")
code.nontype_targ("OneDim", "bool", "false")
code.arg("out_features", f"T*")
code.arg("bias", f"const T*")
code.arg("size", "int")
code.arg("num_features", "int")
code.arg("num_blocks_x", "int")
code.arg("num_blocks_y", "int")
code.raw(f"""
for (int i : tv::KernelLoopY<int>(size)) {{
int block_idx_x = OneDim ? blockIdx.x % num_blocks_x : blockIdx.x;
int block_idx_y = OneDim ? blockIdx.x / num_blocks_x : blockIdx.y;
for (int i : tv::KernelLoopY<int>(size, block_idx_y, OneDim ? num_blocks_y : gridDim.y)) {{
auto out_ptr = out_features + i * num_features;
for (int j : tv::KernelLoopX<int>(num_features)) {{
for (int j : tv::KernelLoopX<int>(num_features, block_idx_x, OneDim ? num_blocks_x : gridDim.x)) {{
out_ptr[j] = bias[j] + out_ptr[j];
}}
}}
......@@ -46,6 +54,7 @@ class InferenceOpsKernel(pccm.ParameterizedClass):
def bias_add_act_inplace_kernel(self):
code = pccm.FunctionCode()
code.targ("T")
code.nontype_targ("OneDim", "bool", "false")
code.arg("out_features", f"T*")
code.arg("bias", f"const T*")
......@@ -54,20 +63,35 @@ class InferenceOpsKernel(pccm.ParameterizedClass):
code.arg("beta", f"T")
code.arg("size", "int")
code.arg("num_features", "int")
code.arg("num_blocks_x", "int")
code.arg("num_blocks_y", "int")
code.raw(f"""
for (int i : tv::KernelLoopY<int>(size)) {{
int block_idx_x = OneDim ? blockIdx.x % num_blocks_x : blockIdx.x;
int block_idx_y = OneDim ? blockIdx.x / num_blocks_x : blockIdx.y;
namespace op = tv::arrayops;
using nv_scalar_t = tv::equivalent_data_type_t<T>;
for (int i : tv::KernelLoopY<int>(size, block_idx_y, OneDim ? num_blocks_y : gridDim.y)) {{
auto out_ptr = out_features + i * num_features;
for (int j : tv::KernelLoopX<int>(num_features)) {{
for (int j : tv::KernelLoopX<int>(num_features, block_idx_x, OneDim ? num_blocks_x : gridDim.x)) {{
T o = out_ptr[j] + bias[j];
auto* o_nv = reinterpret_cast<nv_scalar_t*>(&o);
switch (act_type){{
case tv::gemm::Activation::kNone:
break;
case tv::gemm::Activation::kReLU:{{
o = o >= T(0) ? o : T(0);
break;
}}
case tv::gemm::Activation::kLeakyReLU:{{
o = o >= T(0) ? o : o * alpha;
break;
}}
case tv::gemm::Activation::kSigmoid:{{
auto e = op::MathScalarOp<nv_scalar_t>::exp(-*o_nv);
o = T(1) / (T(1) + *reinterpret_cast<T*>( &e ));
break;
}}
default: ;
}}
......@@ -89,16 +113,27 @@ class InferenceOpsKernel(pccm.ParameterizedClass):
code.arg("size", "int")
code.raw(f"""
namespace op = tv::arrayops;
using nv_scalar_t = tv::equivalent_data_type_t<T>;
for (int i : tv::KernelLoopX<int>(size)) {{
T o = out_features[i];
auto* o_nv = reinterpret_cast<nv_scalar_t*>(&o);
switch (act_type){{
case tv::gemm::Activation::kNone:
break;
case tv::gemm::Activation::kReLU:{{
out_features[i] = o >= T(0) ? o : T(0);
break;
}}
case tv::gemm::Activation::kLeakyReLU:{{
out_features[i] = o >= T(0) ? o : o * alpha;
break;
}}
case tv::gemm::Activation::kSigmoid:{{
auto e = op::MathScalarOp<nv_scalar_t>::exp(-*o_nv);
out_features[i] = T(1) / (T(1) + *reinterpret_cast<T*>( &e ));
break;
}}
default: ;
}}
......@@ -110,9 +145,10 @@ class InferenceOpsKernel(pccm.ParameterizedClass):
class InferenceOps(pccm.Class):
def __init__(self):
super().__init__()
self.add_dependency(TensorView)
self.add_dependency(TensorView, LaunchUtils)
self.kernel = InferenceOpsKernel()
self.add_include("tensorview/gemm/core/constants.h")
self.add_static_const("kMaxGridYZDim", "int", "65535")
if CUMM_CPU_ONLY_BUILD:
_DECORATOR = pccm.static_function
......@@ -139,34 +175,30 @@ class InferenceOps(pccm.Class):
auto nhot = out.dim(0);
auto cudastream = reinterpret_cast<cudaStream_t>(stream);
TV_ASSERT_RT_ERR(bias.dim(0) == out.dim(1), "error");
tv::dispatch<float, double, tv::half_t, tv::bfloat16_t>(out.dtype(), [&](auto I){{
tv::dispatch<float, tv::half_t, tv::bfloat16_t>(out.dtype(), [&](auto I){{
using T = TV_DECLTYPE(I);
constexpr int MaxThreads = 512;
tv::cuda::Launch launcher(1);
bool found = tv::dispatch_int_noexcept<512, 256, 128, 64, 32, 16>(out.dim(1), [](int my, int expect){{return my >= expect;}}, [&](auto V){{
// if out.dim(1) > value in list above, run this function.
// if a value is found, other value won't be executed.
int NumFeatures = TV_DECLTYPE(V)::value;
int Num0 = MaxThreads / NumFeatures;
dim3 blocks(tv::div_up(out.dim(1), int64_t(NumFeatures)), tv::div_up(nhot, int64_t(Num0)));
dim3 threads(NumFeatures, Num0);
launcher = tv::cuda::Launch(blocks, threads, cudastream);
}});
if (!found){{
int NumFeatures = 16;
int Num0 = MaxThreads / NumFeatures;
dim3 blocks(tv::div_up(out.dim(1), int64_t(NumFeatures)), tv::div_up(nhot, int64_t(Num0)));
dim3 threads(NumFeatures, Num0);
launcher = tv::cuda::Launch(blocks, threads, cudastream);
auto launchdims = LaunchUtils::get_blocks_threads_of_2d_tensor(nhot, out.dim(1));
int num_blocks_X = std::get<0>(launchdims);
int num_blocks_Y = std::get<1>(launchdims);
dim3 blocks;
dim3 threads(std::get<2>(launchdims), std::get<3>(launchdims));
if (num_blocks_Y > kMaxGridYZDim){{
blocks = dim3(num_blocks_X * num_blocks_Y);
}}else{{
blocks = dim3(num_blocks_X, num_blocks_Y);
}}
tv::cuda::Launch launcher = tv::cuda::Launch(blocks, threads, cudastream);
tv::dispatch_int<0, 1>(int(num_blocks_Y > kMaxGridYZDim), [&](auto I2){{
constexpr bool OneDim = TV_DECLTYPE(I2)::value == 1;
if (act_type == tv::gemm::Activation::kNone){{
launcher(ker::bias_add_inplace_kernel<T>, out.data_ptr<T>(), bias.data_ptr<const T>(),
nhot, out.dim(1));
launcher(ker::bias_add_inplace_kernel<T, OneDim>, out.data_ptr<T>(), bias.data_ptr<const T>(),
nhot, out.dim(1), num_blocks_X, num_blocks_Y);
}}else{{
launcher(ker::bias_add_act_inplace_kernel<T>, out.data_ptr<T>(), bias.data_ptr<const T>(),
act_type, T(alpha), T(beta), nhot, out.dim(1));
launcher(ker::bias_add_act_inplace_kernel<T, OneDim>, out.data_ptr<T>(), bias.data_ptr<const T>(),
act_type, T(alpha), T(beta), nhot, out.dim(1), num_blocks_X, num_blocks_Y);
}}
}});
TV_CHECK_CUDA_ERR_V2("bias add act failed!!!");
}});
""")
return code
......@@ -204,10 +236,11 @@ class InferenceOps(pccm.Class):
auto nhot = out.size();
auto cudastream = reinterpret_cast<cudaStream_t>(stream);
tv::cuda::Launch launcher = tv::cuda::Launch(nhot, cudastream);
tv::dispatch<float, double, tv::half_t, tv::bfloat16_t>(out.dtype(), [&](auto I){{
tv::dispatch<float, tv::half_t, tv::bfloat16_t>(out.dtype(), [&](auto I){{
using T = TV_DECLTYPE(I);
launcher(ker::activation_inplace_kernel<T>, out.data_ptr<T>(), act_type, T(alpha), T(beta),
nhot);
TV_CHECK_CUDA_ERR_V2("bias add act failed!!!");
}});
""")
return code
......@@ -26,19 +26,22 @@ import numpy as np
from cumm.gemm import (thread_map)
from spconv.csrc.sparse.cpu_core import OMPLib
from cumm.constants import CUMM_CPU_ONLY_BUILD
from ..utils.launch import LaunchUtils
class IndiceMaxPool(pccm.Class):
# TODO optimize this function
def __init__(self):
super().__init__()
self.add_include("limits")
self.add_dependency(TensorViewKernel, TensorView, GemmBasic)
self.add_dependency(TensorViewKernel, TensorView, GemmBasic, LaunchUtils)
self.add_static_const("kMaxGridYZDim", "int", "65535")
@pccm.cuda.cuda_global_function
def forward_kernel(self):
code = pccm.FunctionCode()
code.targ("T")
code.nontype_targ("OneDim", "bool", "false")
code.arg("out_features", f"T*")
code.arg("in_features", f"const T*")
......@@ -46,14 +49,19 @@ class IndiceMaxPool(pccm.Class):
code.arg("in_indices", "const int*")
code.arg("size", "int")
code.arg("num_features", "int")
code.arg("num_blocks_x", "int")
code.arg("num_blocks_y", "int")
code.raw(f"""
for (int i : tv::KernelLoopY<int>(size)) {{
int block_idx_x = OneDim ? blockIdx.x % num_blocks_x : blockIdx.x;
int block_idx_y = OneDim ? blockIdx.x / num_blocks_x : blockIdx.y;
for (int i : tv::KernelLoopY<int>(size, block_idx_y, OneDim ? num_blocks_y : gridDim.y)) {{
int in_idx = in_indices[i];
int out_idx = out_indices[i];
auto in_ptr = in_features + in_idx * num_features;
auto out_ptr = out_features + out_idx * num_features;
for (int j : tv::KernelLoopX<int>(num_features)) {{
for (int j : tv::KernelLoopX<int>(num_features, block_idx_x, OneDim ? num_blocks_x : gridDim.x)) {{
auto in = in_ptr[j];
auto out = out_ptr[j];
if (in > out){{
......@@ -68,6 +76,7 @@ class IndiceMaxPool(pccm.Class):
def forward_implicit_gemm_kernel(self):
code = pccm.FunctionCode()
code.targ("T")
code.nontype_targ("OneDim", "bool", "false")
code.arg("out_features", f"T*")
code.arg("in_features", f"const T*")
......@@ -76,11 +85,16 @@ class IndiceMaxPool(pccm.Class):
code.arg("RS", "int")
code.arg("num_indices", "int")
code.arg("lowest", "T")
code.arg("num_blocks_x", "int")
code.arg("num_blocks_y", "int")
code.raw(f"""
for (int i : tv::KernelLoopY<int>(num_indices)) {{
int block_idx_x = OneDim ? blockIdx.x % num_blocks_x : blockIdx.x;
int block_idx_y = OneDim ? blockIdx.x / num_blocks_x : blockIdx.y;
for (int i : tv::KernelLoopY<int>(num_indices, block_idx_y, OneDim ? num_blocks_y : gridDim.y)) {{
auto out_ptr = out_features + i * num_features;
for (int j : tv::KernelLoopX<int>(num_features)) {{
for (int j : tv::KernelLoopX<int>(num_features, block_idx_x, OneDim ? num_blocks_x : gridDim.x)) {{
auto indices_ptr = indices + i;
int in_idx = indices_ptr[0];
T in, in_temp;
......@@ -106,6 +120,8 @@ class IndiceMaxPool(pccm.Class):
def backward_kernel(self):
code = pccm.FunctionCode()
code.targ("T")
code.nontype_targ("OneDim", "bool", "false")
code.arg("out_features", f"const T*")
code.arg("in_features", f"const T*")
code.arg("dout_features", f"const T*")
......@@ -115,15 +131,20 @@ class IndiceMaxPool(pccm.Class):
code.arg("size", "int")
code.arg("num_features", "int")
code.arg("num_blocks_x", "int")
code.arg("num_blocks_y", "int")
code.raw(f"""
for (int i : tv::KernelLoopY<int>(size)) {{
int block_idx_x = OneDim ? blockIdx.x % num_blocks_x : blockIdx.x;
int block_idx_y = OneDim ? blockIdx.x / num_blocks_x : blockIdx.y;
for (int i : tv::KernelLoopY<int>(size, block_idx_y, OneDim ? num_blocks_y : gridDim.y)) {{
int in_idx_offset = in_indices[i] * num_features;
int out_idx_offset = out_indices[i] * num_features;
auto in_ptr = in_features + in_idx_offset;
auto out_ptr = out_features + out_idx_offset;
auto din_ptr = din_features + in_idx_offset;
auto dout_ptr = dout_features + out_idx_offset;
for (int j : tv::KernelLoopX<int>(num_features)) {{
for (int j : tv::KernelLoopX<int>(num_features, block_idx_x, OneDim ? num_blocks_x : gridDim.x)) {{
auto in = in_ptr[j];
auto out = out_ptr[j];
if (in == out){{
......@@ -138,6 +159,7 @@ class IndiceMaxPool(pccm.Class):
def backward_implicit_gemm_kernel(self):
code = pccm.FunctionCode()
code.targ("T")
code.nontype_targ("OneDim", "bool", "false")
code.arg("out_features", f"const T*")
code.arg("in_features", f"const T*")
......@@ -148,13 +170,18 @@ class IndiceMaxPool(pccm.Class):
code.arg("RS", "int")
code.arg("num_indices", "int")
code.arg("num_blocks_x", "int")
code.arg("num_blocks_y", "int")
code.raw(f"""
int block_idx_x = OneDim ? blockIdx.x % num_blocks_x : blockIdx.x;
int block_idx_y = OneDim ? blockIdx.x / num_blocks_x : blockIdx.y;
for (int i : tv::KernelLoopY<int>(num_indices)) {{
for (int i : tv::KernelLoopY<int>(num_indices, block_idx_y, OneDim ? num_blocks_y : gridDim.y)) {{
auto in_ptr = in_features + i * num_features;
auto din_ptr = din_features + i * num_features;
for (int j : tv::KernelLoopX<int>(num_features)) {{
for (int j : tv::KernelLoopX<int>(num_features, block_idx_x, OneDim ? num_blocks_x : gridDim.x)) {{
auto indices_ptr = indices_bwd + i;
int out_idx = indices_ptr[0];
T in = in_ptr[j];
......@@ -184,6 +211,7 @@ class IndiceMaxPool(pccm.Class):
def forward_avgpool_implicit_gemm_kernel(self):
code = pccm.FunctionCode()
code.targ("T")
code.nontype_targ("OneDim", "bool", "false")
code.arg("out_features", f"T*")
code.arg("in_features", f"const T*")
......@@ -193,8 +221,13 @@ class IndiceMaxPool(pccm.Class):
code.arg("RS", "int")
code.arg("num_indices", "int")
code.arg("num_blocks_x", "int")
code.arg("num_blocks_y", "int")
code.raw(f"""
for (int i : tv::KernelLoopY<int>(num_indices)) {{
int block_idx_x = OneDim ? blockIdx.x % num_blocks_x : blockIdx.x;
int block_idx_y = OneDim ? blockIdx.x / num_blocks_x : blockIdx.y;
for (int i : tv::KernelLoopY<int>(num_indices, block_idx_y, OneDim ? num_blocks_y : gridDim.y)) {{
auto out_ptr = out_features + i * num_features;
auto indices_ptr = indices + i;
int in_idx = 0;
......@@ -207,7 +240,7 @@ class IndiceMaxPool(pccm.Class):
if (count_out != nullptr){{
count_out[i] = count;
}}
for (int j : tv::KernelLoopX<int>(num_features)) {{
for (int j : tv::KernelLoopX<int>(num_features, block_idx_x, OneDim ? num_blocks_x : gridDim.x)) {{
indices_ptr = indices + i;
int in_idx;
T in, in_temp;
......@@ -229,6 +262,8 @@ class IndiceMaxPool(pccm.Class):
def backward_avgpool_implicit_gemm_kernel(self):
code = pccm.FunctionCode()
code.targ("T")
code.nontype_targ("OneDim", "bool", "false")
code.arg("dout_features", f"const T*")
code.arg("din_features", f"T*")
code.arg("indices_bwd", "const int*")
......@@ -237,11 +272,16 @@ class IndiceMaxPool(pccm.Class):
code.arg("RS", "int")
code.arg("num_indices", "int")
code.arg("num_blocks_x", "int")
code.arg("num_blocks_y", "int")
code.raw(f"""
int block_idx_x = OneDim ? blockIdx.x % num_blocks_x : blockIdx.x;
int block_idx_y = OneDim ? blockIdx.x / num_blocks_x : blockIdx.y;
for (int i : tv::KernelLoopY<int>(num_indices)) {{
for (int i : tv::KernelLoopY<int>(num_indices, block_idx_y, OneDim ? num_blocks_y : gridDim.y)) {{
auto din_ptr = din_features + i * num_features;
for (int j : tv::KernelLoopX<int>(num_features)) {{
for (int j : tv::KernelLoopX<int>(num_features, block_idx_x, OneDim ? num_blocks_x : gridDim.x)) {{
auto indices_ptr = indices_bwd + i;
int out_idx = 0;
T sum_val = T(0);
......@@ -273,27 +313,24 @@ class IndiceMaxPool(pccm.Class):
auto cudastream = reinterpret_cast<cudaStream_t>(stream);
tv::dispatch<float, double, tv::half_t, tv::bfloat16_t>(out.dtype(), [&](auto I){{
using T = TV_DECLTYPE(I);
constexpr int MaxThreads = 512;
tv::cuda::Launch launcher(1);
bool found = tv::dispatch_int_noexcept<512, 256, 128, 64, 32, 16>(out.dim(1), [](int my, int expect){{return my >= expect;}}, [&](auto V){{
// if out.dim(1) > value in list above, run this function.
// if a value is found, other value won't be executed.
int NumFeatures = TV_DECLTYPE(V)::value;
int Num0 = MaxThreads / NumFeatures;
dim3 blocks(tv::div_up(out.dim(1), int64_t(NumFeatures)), tv::div_up(nhot, int64_t(Num0)));
dim3 threads(NumFeatures, Num0);
launcher = tv::cuda::Launch(blocks, threads, cudastream);
auto launchdims = LaunchUtils::get_blocks_threads_of_2d_tensor(nhot, out.dim(1));
int num_blocks_X = std::get<0>(launchdims);
int num_blocks_Y = std::get<1>(launchdims);
dim3 blocks;
dim3 threads(std::get<2>(launchdims), std::get<3>(launchdims));
if (num_blocks_Y > kMaxGridYZDim){{
blocks = dim3(num_blocks_X * num_blocks_Y);
}}else{{
blocks = dim3(num_blocks_X, num_blocks_Y);
}}
tv::cuda::Launch launcher = tv::cuda::Launch(blocks, threads, cudastream);
tv::dispatch_int<0, 1>(int(num_blocks_Y > kMaxGridYZDim), [&](auto I2){{
constexpr bool OneDim = TV_DECLTYPE(I2)::value == 1;
launcher(forward_kernel<T, OneDim>, out.data_ptr<T>(), in.data_ptr<const T>(),
out_inds.data_ptr<const int>(), in_inds.data_ptr<const int>(), nhot, out.dim(1),
num_blocks_X, num_blocks_Y);
}});
if (!found){{
int NumFeatures = 16;
int Num0 = MaxThreads / NumFeatures;
dim3 blocks(tv::div_up(out.dim(1), int64_t(NumFeatures)), tv::div_up(nhot, int64_t(Num0)));
dim3 threads(NumFeatures, Num0);
launcher = tv::cuda::Launch(blocks, threads, cudastream);
}}
launcher(forward_kernel<T>, out.data_ptr<T>(), in.data_ptr<const T>(),
out_inds.data_ptr<const int>(), in_inds.data_ptr<const int>(), nhot, out.dim(1));
TV_CHECK_CUDA_ERR_V2("max pool fwd failed!!!");
}});
""")
return code
......@@ -315,28 +352,28 @@ class IndiceMaxPool(pccm.Class):
auto cudastream = reinterpret_cast<cudaStream_t>(stream);
tv::dispatch<float, double, tv::half_t, tv::bfloat16_t>(out.dtype(), [&](auto I){{
using T = TV_DECLTYPE(I);
constexpr int MaxThreads = 512;
tv::cuda::Launch launcher(1);
bool found = tv::dispatch_int_noexcept<512, 256, 128, 64, 32, 16>(out.dim(1), [](int my, int expect){{return my >= expect;}}, [&](auto V){{
// if out.dim(1) > value in list above, run this function.
// if a value is found, other value won't be executed.
int NumFeatures = TV_DECLTYPE(V)::value;
int Num0 = MaxThreads / NumFeatures;
dim3 blocks(tv::div_up(out.dim(1), int64_t(NumFeatures)), tv::div_up(nhot, int64_t(Num0)));
dim3 threads(NumFeatures, Num0);
launcher = tv::cuda::Launch(blocks, threads, cudastream);
}});
if (!found){{
int NumFeatures = 16;
int Num0 = MaxThreads / NumFeatures;
dim3 blocks(tv::div_up(out.dim(1), int64_t(NumFeatures)), tv::div_up(nhot, int64_t(Num0)));
dim3 threads(NumFeatures, Num0);
launcher = tv::cuda::Launch(blocks, threads, cudastream);
}}
auto launchdims = LaunchUtils::get_blocks_threads_of_2d_tensor(nhot, out.dim(1));
int num_blocks_X = std::get<0>(launchdims);
int num_blocks_Y = std::get<1>(launchdims);
dim3 blocks;
dim3 threads(std::get<2>(launchdims), std::get<3>(launchdims));
if (num_blocks_Y > kMaxGridYZDim){{
blocks = dim3(num_blocks_X * num_blocks_Y);
}}else{{
blocks = dim3(num_blocks_X, num_blocks_Y);
}}
tv::cuda::Launch launcher = tv::cuda::Launch(blocks, threads, cudastream);
T lowest = std::numeric_limits<T>::lowest();
lowest = T(0);
launcher(forward_implicit_gemm_kernel<T>, out.data_ptr<T>(), in.data_ptr<const T>(),
inds.data_ptr<const int>(), out.dim(1), inds.dim(0), inds.dim(1), lowest);
tv::dispatch_int<0, 1>(int(num_blocks_Y > kMaxGridYZDim), [&](auto I2){{
constexpr bool OneDim = TV_DECLTYPE(I2)::value == 1;
launcher(forward_implicit_gemm_kernel<T, OneDim>, out.data_ptr<T>(), in.data_ptr<const T>(),
inds.data_ptr<const int>(), out.dim(1), inds.dim(0), inds.dim(1), lowest,
num_blocks_X, num_blocks_Y);
}});
TV_CHECK_CUDA_ERR_V2("max pool fwd failed!!!");
}});
""")
return code
......@@ -358,27 +395,26 @@ class IndiceMaxPool(pccm.Class):
auto cudastream = reinterpret_cast<cudaStream_t>(stream);
tv::dispatch<float, double, tv::half_t, tv::bfloat16_t>(out.dtype(), [&](auto I){{
using T = TV_DECLTYPE(I);
constexpr int MaxThreads = 512;
tv::cuda::Launch launcher(1);
bool found = tv::dispatch_int_noexcept<512, 256, 128, 64, 32, 16>(out.dim(1), [](int my, int expect){{return my >= expect;}}, [&](auto V){{
// if out.dim(1) > value in list above, run this function.
// if a value is found, other value won't be executed.
int NumFeatures = TV_DECLTYPE(V)::value;
int Num0 = MaxThreads / NumFeatures;
dim3 blocks(tv::div_up(out.dim(1), int64_t(NumFeatures)), tv::div_up(nhot, int64_t(Num0)));
dim3 threads(NumFeatures, Num0);
launcher = tv::cuda::Launch(blocks, threads, cudastream);
}});
if (!found){{
int NumFeatures = 16;
int Num0 = MaxThreads / NumFeatures;
dim3 blocks(tv::div_up(out.dim(1), int64_t(NumFeatures)), tv::div_up(nhot, int64_t(Num0)));
dim3 threads(NumFeatures, Num0);
launcher = tv::cuda::Launch(blocks, threads, cudastream);
}}
launcher(backward_kernel<T>, out.data_ptr<const T>(), in.data_ptr<const T>(),
auto launchdims = LaunchUtils::get_blocks_threads_of_2d_tensor(nhot, out.dim(1));
int num_blocks_X = std::get<0>(launchdims);
int num_blocks_Y = std::get<1>(launchdims);
dim3 blocks;
dim3 threads(std::get<2>(launchdims), std::get<3>(launchdims));
if (num_blocks_Y > kMaxGridYZDim){{
blocks = dim3(num_blocks_X * num_blocks_Y);
}}else{{
blocks = dim3(num_blocks_X, num_blocks_Y);
}}
tv::cuda::Launch launcher = tv::cuda::Launch(blocks, threads, cudastream);
tv::dispatch_int<0, 1>(int(num_blocks_Y > kMaxGridYZDim), [&](auto I2){{
constexpr bool OneDim = TV_DECLTYPE(I2)::value == 1;
launcher(backward_kernel<T, OneDim>, out.data_ptr<const T>(), in.data_ptr<const T>(),
dout.data_ptr<const T>(), din.data_ptr<T>(),
out_inds.data_ptr<const int>(), in_inds.data_ptr<const int>(), nhot, out.dim(1));
out_inds.data_ptr<const int>(), in_inds.data_ptr<const int>(), nhot, out.dim(1),
num_blocks_X, num_blocks_Y);
}});
TV_CHECK_CUDA_ERR_V2("max pool backward failed!!!");
}});
""")
return code
......@@ -402,27 +438,25 @@ class IndiceMaxPool(pccm.Class):
auto cudastream = reinterpret_cast<cudaStream_t>(stream);
tv::dispatch<float, double, tv::half_t, tv::bfloat16_t>(out.dtype(), [&](auto I){{
using T = TV_DECLTYPE(I);
constexpr int MaxThreads = 512;
tv::cuda::Launch launcher(1);
bool found = tv::dispatch_int_noexcept<512, 256, 128, 64, 32, 16>(out.dim(1), [](int my, int expect){{return my >= expect;}}, [&](auto V){{
// if out.dim(1) > value in list above, run this function.
// if a value is found, other value won't be executed.
int NumFeatures = TV_DECLTYPE(V)::value;
int Num0 = MaxThreads / NumFeatures;
dim3 blocks(tv::div_up(out.dim(1), int64_t(NumFeatures)), tv::div_up(nhot, int64_t(Num0)));
dim3 threads(NumFeatures, Num0);
launcher = tv::cuda::Launch(blocks, threads, cudastream);
}});
if (!found){{
int NumFeatures = 16;
int Num0 = MaxThreads / NumFeatures;
dim3 blocks(tv::div_up(out.dim(1), int64_t(NumFeatures)), tv::div_up(nhot, int64_t(Num0)));
dim3 threads(NumFeatures, Num0);
launcher = tv::cuda::Launch(blocks, threads, cudastream);
}}
launcher(backward_implicit_gemm_kernel<T>, out.data_ptr<const T>(), in.data_ptr<const T>(),
auto launchdims = LaunchUtils::get_blocks_threads_of_2d_tensor(nhot, out.dim(1));
int num_blocks_X = std::get<0>(launchdims);
int num_blocks_Y = std::get<1>(launchdims);
dim3 blocks;
dim3 threads(std::get<2>(launchdims), std::get<3>(launchdims));
if (num_blocks_Y > kMaxGridYZDim){{
blocks = dim3(num_blocks_X * num_blocks_Y);
}}else{{
blocks = dim3(num_blocks_X, num_blocks_Y);
}}
tv::cuda::Launch launcher = tv::cuda::Launch(blocks, threads, cudastream);
tv::dispatch_int<0, 1>(int(num_blocks_Y > kMaxGridYZDim), [&](auto I2){{
constexpr bool OneDim = TV_DECLTYPE(I2)::value == 1;
launcher(backward_implicit_gemm_kernel<T, OneDim>, out.data_ptr<const T>(), in.data_ptr<const T>(),
dout.data_ptr<const T>(), din.data_ptr<T>(),
inds.data_ptr<const int>(), out.dim(1), inds.dim(0), inds.dim(1));
inds.data_ptr<const int>(), out.dim(1), inds.dim(0), inds.dim(1),
num_blocks_X, num_blocks_Y);
}});
TV_CHECK_CUDA_ERR_V2("max pool fwd failed!!!");
}});
""")
return code
......@@ -446,26 +480,26 @@ class IndiceMaxPool(pccm.Class):
auto cudastream = reinterpret_cast<cudaStream_t>(stream);
tv::dispatch<float, double, tv::half_t, tv::bfloat16_t>(out.dtype(), [&](auto I){{
using T = TV_DECLTYPE(I);
constexpr int MaxThreads = 512;
tv::cuda::Launch launcher(1);
bool found = tv::dispatch_int_noexcept<512, 256, 128, 64, 32, 16>(out.dim(1), [](int my, int expect){{return my >= expect;}}, [&](auto V){{
// if out.dim(1) > value in list above, run this function.
// if a value is found, other value won't be executed.
int NumFeatures = TV_DECLTYPE(V)::value;
int Num0 = MaxThreads / NumFeatures;
dim3 blocks(tv::div_up(out.dim(1), int64_t(NumFeatures)), tv::div_up(nhot, int64_t(Num0)));
dim3 threads(NumFeatures, Num0);
launcher = tv::cuda::Launch(blocks, threads, cudastream);
auto launchdims = LaunchUtils::get_blocks_threads_of_2d_tensor(nhot, out.dim(1));
int num_blocks_X = std::get<0>(launchdims);
int num_blocks_Y = std::get<1>(launchdims);
dim3 blocks;
dim3 threads(std::get<2>(launchdims), std::get<3>(launchdims));
if (num_blocks_Y > kMaxGridYZDim){{
blocks = dim3(num_blocks_X * num_blocks_Y);
}}else{{
blocks = dim3(num_blocks_X, num_blocks_Y);
}}
tv::cuda::Launch launcher = tv::cuda::Launch(blocks, threads, cudastream);
tv::dispatch_int<0, 1>(int(num_blocks_Y > kMaxGridYZDim), [&](auto I2){{
constexpr bool OneDim = TV_DECLTYPE(I2)::value == 1;
launcher(forward_avgpool_implicit_gemm_kernel<T, OneDim>, out.data_ptr<T>(), in.data_ptr<const T>(),
inds.data_ptr<const int>(), count_out.data_ptr<int>(), out.dim(1), inds.dim(0), inds.dim(1),
num_blocks_X, num_blocks_Y);
}});
if (!found){{
int NumFeatures = 16;
int Num0 = MaxThreads / NumFeatures;
dim3 blocks(tv::div_up(out.dim(1), int64_t(NumFeatures)), tv::div_up(nhot, int64_t(Num0)));
dim3 threads(NumFeatures, Num0);
launcher = tv::cuda::Launch(blocks, threads, cudastream);
}}
launcher(forward_avgpool_implicit_gemm_kernel<T>, out.data_ptr<T>(), in.data_ptr<const T>(),
inds.data_ptr<const int>(), count_out.data_ptr<int>(), out.dim(1), inds.dim(0), inds.dim(1));
TV_CHECK_CUDA_ERR_V2("avg pool fwd failed!!!");
}});
""")
return code
......@@ -484,32 +518,31 @@ class IndiceMaxPool(pccm.Class):
TV_ASSERT_RT_ERR(!count_out.empty(), "count out must not empty")
tv::check_shape(inds, {{-1, nhot}});
tv::check_shape(din, {{-1, dout.dim(1)}});
int num_act_out = dout.dim(1);
auto cudastream = reinterpret_cast<cudaStream_t>(stream);
tv::dispatch<float, double, tv::half_t, tv::bfloat16_t>(dout.dtype(), [&](auto I){{
using T = TV_DECLTYPE(I);
constexpr int MaxThreads = 512;
tv::cuda::Launch launcher(1);
bool found = tv::dispatch_int_noexcept<512, 256, 128, 64, 32, 16>(dout.dim(1), [](int my, int expect){{return my >= expect;}}, [&](auto V){{
// if out.dim(1) > value in list above, run this function.
// if a value is found, other value won't be executed.
int NumFeatures = TV_DECLTYPE(V)::value;
int Num0 = MaxThreads / NumFeatures;
dim3 blocks(tv::div_up(dout.dim(1), int64_t(NumFeatures)), tv::div_up(nhot, int64_t(Num0)));
dim3 threads(NumFeatures, Num0);
launcher = tv::cuda::Launch(blocks, threads, cudastream);
}});
if (!found){{
int NumFeatures = 16;
int Num0 = MaxThreads / NumFeatures;
dim3 blocks(tv::div_up(dout.dim(1), int64_t(NumFeatures)), tv::div_up(nhot, int64_t(Num0)));
dim3 threads(NumFeatures, Num0);
launcher = tv::cuda::Launch(blocks, threads, cudastream);
}}
launcher(backward_avgpool_implicit_gemm_kernel<T>,
auto launchdims = LaunchUtils::get_blocks_threads_of_2d_tensor(nhot, dout.dim(1));
int num_blocks_X = std::get<0>(launchdims);
int num_blocks_Y = std::get<1>(launchdims);
dim3 blocks;
dim3 threads(std::get<2>(launchdims), std::get<3>(launchdims));
if (num_blocks_Y > kMaxGridYZDim){{
blocks = dim3(num_blocks_X * num_blocks_Y);
}}else{{
blocks = dim3(num_blocks_X, num_blocks_Y);
}}
tv::cuda::Launch launcher = tv::cuda::Launch(blocks, threads, cudastream);
tv::dispatch_int<0, 1>(int(num_blocks_Y > kMaxGridYZDim), [&](auto I2){{
constexpr bool OneDim = TV_DECLTYPE(I2)::value == 1;
launcher(backward_avgpool_implicit_gemm_kernel<T, OneDim>,
dout.data_ptr<const T>(), din.data_ptr<T>(),
inds.data_ptr<const int>(), count_out.data_ptr<const int>(),
dout.dim(1), inds.dim(0), inds.dim(1));
dout.dim(1), inds.dim(0), inds.dim(1),
num_blocks_X, num_blocks_Y);
}});
TV_CHECK_CUDA_ERR_V2("avg pool bwd failed!!!");
}});
""")
return code
......
......@@ -32,9 +32,11 @@ class Point2VoxelCommon(pccm.ParameterizedClass):
self.ndim = ndim
self.zyx = zyx
ret_str = f"std::array<int, {self.ndim}>"
ret64_str = f"std::array<int64_t, {self.ndim}>"
retf_str = f"std::array<float, {self.ndim}>"
retf2_str = f"std::array<float, {self.ndim * 2}>"
self.calc_meta_ret = f"std::tuple<{retf_str}, {ret_str}, {ret_str}, {retf2_str}>"
self.calc_meta_ret = f"std::tuple<{retf_str}, {ret_str}, {ret64_str}, {retf2_str}>"
@pccm.static_function
def calc_meta_data(self):
......@@ -43,8 +45,8 @@ class Point2VoxelCommon(pccm.ParameterizedClass):
code.arg("coors_range_xyz", f"std::array<float, {self.ndim * 2}>")
code.raw(f"""
std::array<float, {self.ndim}> vsize;
std::array<int, {self.ndim}> grid_size, grid_stride;
std::array<int, {self.ndim}> grid_size;
std::array<int64_t, {self.ndim}> grid_stride;
std::array<float, {self.ndim * 2}> coors_range;
""")
......@@ -79,9 +81,10 @@ class Point2VoxelCommon(pccm.ParameterizedClass):
ret_str = f"std::array<int, {self.ndim}>"
retf_str = f"std::array<float, {self.ndim}>"
retf2_str = f"std::array<float, {self.ndim * 2}>"
ret64_str = f"std::array<int64_t, {self.ndim}>"
return code.ret(
f"std::tuple<{retf_str}, {ret_str}, {ret_str}, {retf2_str}>")
f"std::tuple<{retf_str}, {ret_str}, {ret64_str}, {retf2_str}>")
@pccm.static_function
def array2tvarray(self):
......@@ -143,7 +146,7 @@ class Point2VoxelKernel(pccm.ParameterizedClass, pccm.pybind.PybindClassMixin):
code.arg("vsize", f"tv::array<float, {self.ndim}>")
code.arg("coors_range", f"tv::array<float, {self.ndim * 2}>")
code.arg("grid_bound", f"tv::array<int, {self.ndim}>")
code.arg("grid_stride", f"tv::array<int, {self.ndim}>")
code.arg("grid_stride", f"tv::array<int64_t, {self.ndim}>")
code.arg("num_points", f"int")
point_xyz = f"{self.ndim - 1} - j"
......@@ -163,7 +166,7 @@ class Point2VoxelKernel(pccm.ParameterizedClass, pccm.pybind.PybindClassMixin):
if ((c < 0 || c >= grid_bound[j])) {{
failed = true;
}}
prod += grid_stride[j] * c;
prod += grid_stride[j] * int64_t(c);
}}
if (!failed){{
points_indice_data[i] = prod;
......@@ -218,7 +221,7 @@ class Point2VoxelKernel(pccm.ParameterizedClass, pccm.pybind.PybindClassMixin):
code.arg("vsize", f"tv::array<float, {self.ndim}>")
code.arg("coors_range", f"tv::array<float, {self.ndim * 2}>")
code.arg("grid_bound", f"tv::array<int, {self.ndim}>")
code.arg("grid_stride", f"tv::array<int, {self.ndim}>")
code.arg("grid_stride", f"tv::array<int64_t, {self.ndim}>")
code.arg("num_points", f"int")
# TODO add backward?
......@@ -323,7 +326,7 @@ class Point2Voxel(pccm.ParameterizedClass, pccm.pybind.PybindClassMixin):
self.add_member("vsize", f"tv::array<float, {self.ndim}>")
self.add_member("coors_range", f"tv::array<float, {self.ndim * 2}>")
self.add_member("grid_size", f"tv::array<int, {self.ndim}>")
self.add_member("grid_stride", f"tv::array<int, {self.ndim}>")
self.add_member("grid_stride", f"tv::array<int64_t, {self.ndim}>")
@pccm.pybind.mark_prop_getter(prop_name="grid_size")
@pccm.member_function
......@@ -414,7 +417,9 @@ class Point2Voxel(pccm.ParameterizedClass, pccm.pybind.PybindClassMixin):
code.arg("voxels, indices, num_per_voxel, hashdata, point_indice_data, points_voxel_id",
"tv::Tensor")
code.arg("vsize", f"std::array<float, {self.ndim}>")
code.arg("grid_size, grid_stride", f"std::array<int, {self.ndim}>")
code.arg("grid_size", f"std::array<int, {self.ndim}>")
code.arg("grid_stride", f"std::array<int64_t, {self.ndim}>")
code.arg("coors_range", f"std::array<float, {self.ndim * 2}>")
code.arg("clear_voxels", "bool", "true")
code.arg("empty_mean", "bool", "false")
......
......@@ -13,3 +13,4 @@
# limitations under the License.
from .boxops import BoxOps
from .pcc import PointCloudCompress
\ No newline at end of file
import pccm
from cumm.common import TensorView
class LaunchUtils(pccm.Class):
def __init__(self):
super().__init__()
self.add_include("limits")
self.add_dependency(TensorView)
self.add_static_const("kMaxGridYZDim", "int", "65535")
@pccm.static_function
def get_blocks_threads_of_2d_tensor(self):
code = pccm.code()
code.arg("nhot", "int64_t")
code.arg("num_features", "int64_t")
code.raw(f"""
constexpr int MaxThreads = 512;
int num_blocks_X = 0;
int num_blocks_Y = 0;
int threads_X = 0;
int threads_Y = 0;
dim3 threads;
bool found = tv::dispatch_int_noexcept<512, 256, 128, 64, 32, 16>(int(num_features), [](int my, int expect){{return my >= expect;}}, [&](auto V){{
// if num_features > value in list above, run this function.
// if a value is found, other value won't be executed.
int NumFeatures = TV_DECLTYPE(V)::value;
int Num0 = MaxThreads / NumFeatures;
num_blocks_X = tv::div_up(num_features, int64_t(NumFeatures));
num_blocks_Y = tv::div_up(nhot, int64_t(Num0));
threads_X = NumFeatures;
threads_Y = Num0;
}});
if (!found){{
int NumFeatures = 16;
int Num0 = MaxThreads / NumFeatures;
num_blocks_X = tv::div_up(num_features, int64_t(NumFeatures));
num_blocks_Y = tv::div_up(nhot, int64_t(Num0));
threads_X = NumFeatures;
threads_Y = Num0;
}}
return std::make_tuple(num_blocks_X, num_blocks_Y, threads_X, threads_Y);
""")
code.ret("std::tuple<int, int, int, int>")
return code
# Copyright 2022 Yan Yan
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import pccm
from cumm.common import TensorView, TensorViewNVRTC
class PointCloudCompress(pccm.Class):
def __init__(self):
super().__init__()
self.add_dependency(TensorView, TensorViewNVRTC)
self.add_include("unordered_map")
self.add_include("tensorview/hash/hash_functions.h")
self.add_enum_class("EncodeType", [
("XYZ_8", 0),
("XYZI_8", 1),
])
@pccm.pybind.mark
@pccm.static_function
def encode_with_order(self):
code = pccm.code()
# TODO add checksum
code.arg("points", "tv::Tensor")
code.arg("intensity", "tv::Tensor")
code.arg("ex", "float")
code.arg("ey", "float")
code.arg("ez", "float")
code.arg("type", "EncodeType")
code.arg("with_order", "bool", "false")
code.raw(f"""
namespace op = tv::arrayops;
float vx = 256.0f * ex;
float vy = 256.0f * ey;
float vz = 256.0f * ez;
std::vector<std::tuple<uint64_t, tv::array<float, 3>>> offsets;
std::unordered_map<uint64_t, std::vector<int64_t>> order;
auto N = points.dim(0);
tv::array<float, 3> vsize{{vx, vy, vz}};
tv::array<float, 3> errors{{ex, ey, ez}};
tv::Tensor order_ten;
int64_t* order_ten_ptr = nullptr;
if (with_order){{
order_ten = tv::empty({{N}}, tv::int64);
order_ten_ptr = order_ten.data_ptr<int64_t>();
}}
using hash_t = tv::hash::SpatialHash<uint64_t>;
auto point_stride = points.stride(0);
int64_t final_size = sizeof(int64_t) * 5 + sizeof(float) * 3;
tv::Tensor res;
tv::ssprint(1);
tv::dispatch<float, double>(points.dtype(), [&](auto IP){{
using TPoint = TV_DECLTYPE(IP);
auto points_data = points.data_ptr<TPoint>();
tv::dispatch<float, double, uint8_t>(intensity.dtype(), [&](auto II){{
using TInten = TV_DECLTYPE(II);
auto intensity_data = intensity.data_ptr<TInten>();
tv::dispatch_int<static_cast<int>(EncodeType::XYZI_8), static_cast<int>(EncodeType::XYZ_8)>(static_cast<int>(type), [&](auto I){{
constexpr int kTypeInt = TV_DECLTYPE(I)::value;
constexpr int kEncodeDim = kTypeInt == static_cast<int>(EncodeType::XYZI_8) ? 4 : 3;
std::unordered_map<uint64_t, std::vector<tv::array<uint8_t, kEncodeDim>>> hash;
int inten_stride = 0;
if (kEncodeDim > 3){{
TV_ASSERT_RT_ERR(!intensity.empty(), "inten must not empty");
inten_stride = intensity.stride(0);
}}
for (size_t i = 0; i < N; ++i){{
tv::array<float, 3> point = op::read_ptr<3>(points_data).template cast<float>();
auto pos_unit_voxel = point / vsize;
auto pos_int = op::apply(floorf, pos_unit_voxel).cast<int32_t>();
auto pos_enc = (point / errors - pos_int.cast<float>() * float(256)).cast<uint8_t>();
tv::array<uint8_t, kEncodeDim> enc;
tv::if_constexpr<(kEncodeDim > 3)>([&](auto _){{
TInten inten = intensity_data[0];
enc = _(tv::array<uint8_t, kEncodeDim>{{pos_enc[0], pos_enc[1], pos_enc[2], uint8_t(inten)}});
intensity_data += inten_stride;
}}, [&](auto _){{
enc = _(tv::array<uint8_t, kEncodeDim>{{pos_enc[0], pos_enc[1], pos_enc[2]}});
}});
auto pos_uint = pos_int + hash_t::direct_hash_offset();
uint64_t scalar = hash_t::encode(pos_int[0], pos_int[1], pos_int[2]);
auto iter = hash.find(scalar);
if (iter == hash.end()){{
auto pos_offset = pos_int.cast<float>() * vsize;
std::vector<tv::array<uint8_t, kEncodeDim>> vec{{enc}};
offsets.push_back({{scalar, pos_offset}});
hash.insert({{scalar, vec}});
final_size += sizeof(float) * 3 + sizeof(int) + sizeof(uint8_t) * kEncodeDim;
if (with_order){{
std::vector<int64_t> order_cluster{{int64_t(i)}};
order.insert({{scalar, order_cluster}});
}}
}}else{{
// iter.value().push_back(enc);
iter->second.push_back(enc);
final_size += sizeof(uint8_t) * kEncodeDim;
if (with_order){{
order.at(scalar).push_back(i);
}}
}}
points_data += point_stride;
}}
res = tv::empty({{final_size}}, tv::uint8, -1);
auto res_ptr = res.raw_data();
int64_t* res_ptr_header = reinterpret_cast<int64_t*>(res_ptr);
res_ptr_header[0] = int64_t(final_size);
res_ptr_header[1] = static_cast<int>(type);
res_ptr_header[2] = int64_t(N);
res_ptr_header[3] = int64_t(offsets.size());
res_ptr_header[4] = 0;
// TODO add checksum in header
res_ptr += sizeof(int64_t) * 5;
float* error_header = reinterpret_cast<float*>(res_ptr);
error_header[0] = errors[0];
error_header[1] = errors[1];
error_header[2] = errors[2];
res_ptr += sizeof(float) * 3;
for (auto& p : offsets){{
auto& offset = std::get<1>(p);
auto& encodes = hash.at(std::get<0>(p));
int cluster_size = encodes.size();
reinterpret_cast<int*>(res_ptr)[0] = cluster_size;
reinterpret_cast<float*>(res_ptr)[1] = offset[0];
reinterpret_cast<float*>(res_ptr)[2] = offset[1];
reinterpret_cast<float*>(res_ptr)[3] = offset[2];
res_ptr += sizeof(int) + sizeof(float) * 3;
}}
for (auto& p : offsets){{
auto& offset = std::get<1>(p);
auto& encodes = hash.at(std::get<0>(p));
int cluster_size = encodes.size();
auto enc_ptr = reinterpret_cast<tv::array<uint8_t, kEncodeDim>*>(res_ptr);
for (int i = 0; i < cluster_size; ++i){{
enc_ptr[i] = encodes[i];
}}
if (with_order){{
auto& orders = order.at(std::get<0>(p));
for (int i = 0; i < cluster_size; ++i){{
order_ten_ptr[i] = orders[i];
}}
order_ten_ptr += cluster_size;
}}
res_ptr += cluster_size * sizeof(tv::array<uint8_t, kEncodeDim>);
}}
TV_ASSERT_RT_ERR(res_ptr - res.raw_data() == final_size, "error");
}});
}});
}});
return std::make_tuple(res, order_ten);
""")
return code.ret("std::tuple<tv::Tensor, tv::Tensor>")
@pccm.pybind.mark
@pccm.static_function
def encode_xyzi(self):
code = pccm.code()
code.arg("points", "tv::Tensor")
code.arg("intensity", "tv::Tensor")
code.arg("ex", "float")
code.arg("ey", "float")
code.arg("ez", "float")
code.raw(f"""
auto res = encode_with_order(points, intensity, ex, ey, ez, EncodeType::XYZI_8, false);
return std::get<0>(res);
""")
return code.ret("tv::Tensor")
@pccm.pybind.mark
@pccm.static_function
def encode_xyz(self):
code = pccm.code()
code.arg("points", "tv::Tensor")
code.arg("ex", "float")
code.arg("ey", "float")
code.arg("ez", "float")
code.raw(f"""
auto res = encode_with_order(points, tv::Tensor(), ex, ey, ez, EncodeType::XYZ_8, false);
return std::get<0>(res);
""")
return code.ret("tv::Tensor")
@pccm.pybind.mark
@pccm.static_function
def decode(self):
code = pccm.code()
code.arg("data", "tv::Tensor")
code.raw(f"""
namespace op = tv::arrayops;
const uint8_t* data_ptr = data.data_ptr<const uint8_t>();
auto res_ptr = data.raw_data();
int64_t* res_ptr_header = reinterpret_cast<int64_t*>(res_ptr);
int64_t final_size = res_ptr_header[0];
int type = res_ptr_header[1];
int64_t N = res_ptr_header[2];
int64_t voxel_num = res_ptr_header[3];
TV_ASSERT_RT_ERR(final_size == data.raw_size(), "size mismatch");
res_ptr += sizeof(int64_t) * 5;
float* error_header = reinterpret_cast<float*>(res_ptr);
tv::array<float, 3> error;
error[0] = error_header[0];
error[1] = error_header[1];
error[2] = error_header[2];
res_ptr += sizeof(float) * 3;
tv::Tensor points;
tv::dispatch_int<static_cast<int>(EncodeType::XYZI_8), static_cast<int>(EncodeType::XYZ_8)>(static_cast<int>(type), [&](auto I){{
constexpr int kTypeInt = TV_DECLTYPE(I)::value;
constexpr int kEncodeDim = kTypeInt == static_cast<int>(EncodeType::XYZI_8) ? 4 : 3;
points = tv::empty({{N, kEncodeDim}}, tv::float32);
auto points_ptr = points.data_ptr<float>();
auto enc_ptr = reinterpret_cast<tv::array<uint8_t, kEncodeDim>*>(res_ptr + voxel_num * (sizeof(int) * 1 + sizeof(float) * 3));
for (int i = 0; i < voxel_num; ++i){{
int cluster_size = reinterpret_cast<int*>(res_ptr)[0];
tv::array<float, 3> offset;
offset[0] = reinterpret_cast<float*>(res_ptr)[1];
offset[1] = reinterpret_cast<float*>(res_ptr)[2];
offset[2] = reinterpret_cast<float*>(res_ptr)[3];
auto point_cur_ptr = points_ptr;
for (int j = 0; j < cluster_size; ++j){{
auto& enc = enc_ptr[j];
auto point = op::slice<0, 3>(enc).template cast<float>() * error + offset;
point_cur_ptr[0] = point[0];
point_cur_ptr[1] = point[1];
point_cur_ptr[2] = point[2];
if (kEncodeDim > 3){{
point_cur_ptr[3] = enc[3];
}}
point_cur_ptr += kEncodeDim;
}}
res_ptr += sizeof(int) + sizeof(float) * 3;
enc_ptr += cluster_size;
points_ptr += cluster_size * kEncodeDim;
}}
}});
return points;
""")
return code.ret("tv::Tensor")
......@@ -270,6 +270,7 @@ class SparseConvolution(SparseModule):
if self.training:
msg = "act don't support backward, only used in inference"
assert self.act_type == tv.gemm.Activation.None_, msg
if not self.subm:
if self.transposed:
out_spatial_shape = ops.get_deconv_output_size(
......@@ -428,6 +429,7 @@ class SparseConvolution(SparseModule):
indice_pair_num,
outids.shape[0],
algo,
input._timer,
bias_for_infer,
self.act_alpha,
self.act_beta,
......@@ -551,7 +553,6 @@ class SparseConvolution(SparseModule):
self.act_alpha,
self.act_beta,
self.act_type)
if bias_for_training is not None:
out_features += bias_for_training
if input.benchmark:
......@@ -571,6 +572,9 @@ class SparseConvolution(SparseModule):
out_tensor.indices = outids
out_tensor.indice_dict = indice_dict
out_tensor.spatial_shape = out_spatial_shape
# print(outids.shape, spatial_shape, self.kernel_size, self.stride, self.padding,
# self.dilation, self.output_padding, out_spatial_shape)
return out_tensor
def _check_subm_reuse_valid(self, inp: SparseConvTensor,
......
......@@ -1561,6 +1561,8 @@ def implicit_gemm(features: torch.Tensor,
with timer.record("implicit_gemm", stream):
for j in range(num_split):
beta = 0 if j == 0 else 1
if bias is not None:
beta = 1
CONV.run_with_tuned_result(
tune_res,
ConvOpType.kForward,
......
Markdown is supported
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment