Commit 1635e9ef authored by yan.yan's avatar yan.yan
Browse files

prepare spconv 2.2

parent f8c25027
# Changelog # Changelog
## [2.2.0] - 2022-9-24
### Added
- Add Ampere support. faster fp16, faster tf32 and greatly faster int8 kernels in Ampere GPUs.
- Add pure c++ code generation (libspconv.so) for deploy (or train in another deeplearning framework)
- Add NVRTC support for all gemm kernels. if your GPU architecture isn't compiled in prebuilt, spconv will use slightly slower (10-20us overhead for every kernel launch) NVRTC kernels.
### Fixed
- Fix launch fail in maxpool if too much voxels
### Changed
- all weight layout will be KRSC, don't support old spconv 1.x weights anymore.
- previous gemm ops in ops.py now move to c++ by default (controlled by spconv.constants.SPCONV_CPP_GEMM)
### Removed
- drop python 3.6 support.
- drop CUDA 10.2 support.
- pascal and kepler architecture is removed in CUDA 12 prebuilt.
## [2.1.22] - 2022-6-11 ## [2.1.22] - 2022-6-11
### Fixed ### Fixed
- Fix thrust problem by adding -fvisibility=hidden - Fix thrust problem by adding -fvisibility=hidden
......
...@@ -17,16 +17,15 @@ ...@@ -17,16 +17,15 @@
[pypi-ver-114]: https://img.shields.io/pypi/v/spconv-cu114 [pypi-ver-114]: https://img.shields.io/pypi/v/spconv-cu114
[pypi-ver-111]: https://img.shields.io/pypi/v/spconv-cu111 [pypi-ver-111]: https://img.shields.io/pypi/v/spconv-cu111
[pypi-ver-113]: https://img.shields.io/pypi/v/spconv-cu113 [pypi-ver-113]: https://img.shields.io/pypi/v/spconv-cu113
[pypi-ver-102]: https://img.shields.io/pypi/v/spconv-cu102
[pypi-url-111]: https://pypi.org/project/spconv-cu111/ [pypi-url-111]: https://pypi.org/project/spconv-cu111/
[pypi-download-111]: https://img.shields.io/pypi/dm/spconv-cu111 [pypi-download-111]: https://img.shields.io/pypi/dm/spconv-cu111
[pypi-url-113]: https://pypi.org/project/spconv-cu113/ [pypi-url-113]: https://pypi.org/project/spconv-cu113/
[pypi-download-113]: https://img.shields.io/pypi/dm/spconv-cu113 [pypi-download-113]: https://img.shields.io/pypi/dm/spconv-cu113
[pypi-url-102]: https://pypi.org/project/spconv-cu102/
[pypi-download-102]: https://img.shields.io/pypi/dm/spconv-cu102
[pypi-url-114]: https://pypi.org/project/spconv-cu114/ [pypi-url-114]: https://pypi.org/project/spconv-cu114/
[pypi-download-114]: https://img.shields.io/pypi/dm/spconv-cu114 [pypi-download-114]: https://img.shields.io/pypi/dm/spconv-cu114
[pypi-url-120]: https://pypi.org/project/spconv-cu120/
[pypi-download-120]: https://img.shields.io/pypi/dm/spconv-cu120
[pypi-url-cpu]: https://pypi.org/project/spconv/ [pypi-url-cpu]: https://pypi.org/project/spconv/
[pypi-download-cpu]: https://img.shields.io/pypi/dm/spconv [pypi-download-cpu]: https://img.shields.io/pypi/dm/spconv
...@@ -37,10 +36,10 @@ ...@@ -37,10 +36,10 @@
| | PyPI | Install |Downloads | | | PyPI | Install |Downloads |
| -------------- |:---------------------:| ---------------------:| ---------------------:| | -------------- |:---------------------:| ---------------------:| ---------------------:|
| CPU (Linux Only) | [![PyPI Version][pypi-ver-cpu]][pypi-url-cpu] | ```pip install spconv``` | [![pypi monthly download][pypi-download-cpu]][pypi-url-cpu] | | CPU (Linux Only) | [![PyPI Version][pypi-ver-cpu]][pypi-url-cpu] | ```pip install spconv``` | [![pypi monthly download][pypi-download-cpu]][pypi-url-cpu] |
| CUDA 10.2 | [![PyPI Version][pypi-ver-102]][pypi-url-102] | ```pip install spconv-cu102``` | [![pypi monthly download][pypi-download-102]][pypi-url-102] |
| CUDA 11.1 | [![PyPI Version][pypi-ver-111]][pypi-url-111] | ```pip install spconv-cu111```| [![pypi monthly download][pypi-download-111]][pypi-url-111]| | CUDA 11.1 | [![PyPI Version][pypi-ver-111]][pypi-url-111] | ```pip install spconv-cu111```| [![pypi monthly download][pypi-download-111]][pypi-url-111]|
| CUDA 11.3 (Linux Only) | [![PyPI Version][pypi-ver-113]][pypi-url-113] | ```pip install spconv-cu113```| [![pypi monthly download][pypi-download-113]][pypi-url-113]| | CUDA 11.3 (Linux Only) | [![PyPI Version][pypi-ver-113]][pypi-url-113] | ```pip install spconv-cu113```| [![pypi monthly download][pypi-download-113]][pypi-url-113]|
| CUDA 11.4 | [![PyPI Version][pypi-ver-114]][pypi-url-114] | ```pip install spconv-cu114```| [![pypi monthly download][pypi-download-114]][pypi-url-114]| | CUDA 11.4 | [![PyPI Version][pypi-ver-114]][pypi-url-114] | ```pip install spconv-cu114```| [![pypi monthly download][pypi-download-114]][pypi-url-114]|
| CUDA 12.0 | [![PyPI Version][pypi-ver-120]][pypi-url-120] | ```pip install spconv-cu120```| [![pypi monthly download][pypi-download-120]][pypi-url-120]|
```spconv``` is a project that provide heavily-optimized sparse convolution implementation with tensor core support. check [benchmark](docs/BENCHMARK.md) to see how fast spconv 2.x runs. ```spconv``` is a project that provide heavily-optimized sparse convolution implementation with tensor core support. check [benchmark](docs/BENCHMARK.md) to see how fast spconv 2.x runs.
...@@ -50,9 +49,9 @@ Check [spconv 2.x algorithm introduction](docs/spconv2_algo.pdf) to understand s ...@@ -50,9 +49,9 @@ Check [spconv 2.x algorithm introduction](docs/spconv2_algo.pdf) to understand s
**WARNING** spconv < 2.1.18 users need to upgrade your version to 2.1.18, it fix a bug in conv weight init which cause std of inited weight too large, and a bug in PointToVoxel. **WARNING** spconv < 2.1.18 users need to upgrade your version to 2.1.18, it fix a bug in conv weight init which cause std of inited weight too large, and a bug in PointToVoxel.
## Breaking changes in Spconv 2.x ## NEWS
Spconv 1.x users **NEED READ [THIS](docs/SPCONV_2_BREAKING_CHANGEs.md)** before using spconv 2.x. * spconv 2.2: ampere feature support (by @[EvernightAurora](https://github.com/EvernightAurora)), pure c++ code generation, nvrtc, drop cuda 10.2, drop python 3.6
## Spconv 2.1 vs Spconv 1.x ## Spconv 2.1 vs Spconv 1.x
...@@ -64,6 +63,13 @@ Spconv 1.x users **NEED READ [THIS](docs/SPCONV_2_BREAKING_CHANGEs.md)** before ...@@ -64,6 +63,13 @@ Spconv 1.x users **NEED READ [THIS](docs/SPCONV_2_BREAKING_CHANGEs.md)** before
* [doesn't depend on pytorch binary](docs/FAQ.md#What-does-no-dependency-on-pytorch-mean), but you may need at least pytorch >= 1.5.0 to run spconv 2.x. * [doesn't depend on pytorch binary](docs/FAQ.md#What-does-no-dependency-on-pytorch-mean), but you may need at least pytorch >= 1.5.0 to run spconv 2.x.
* since spconv 2.x doesn't depend on pytorch binary (never in future), it's impossible to support torch.jit/libtorch inference. * since spconv 2.x doesn't depend on pytorch binary (never in future), it's impossible to support torch.jit/libtorch inference.
## Spconv 2.2 vs Spconv 2.1
* faster fp16 kernels (~10-30%) in ampere GPUs (tested in RTX 3090)
* greatly faster int8 kernels (~1.2x~2.7x) in ampere GPUs (tested in RTX 3090)
* no python 3.6 support
* no CUDA 10.2 support
## Spconv 2.x Development and Roadmap ## Spconv 2.x Development and Roadmap
Spconv 2.2 development has started. See [this issue](https://github.com/traveller59/spconv/issues/380) for more details. Spconv 2.2 development has started. See [this issue](https://github.com/traveller59/spconv/issues/380) for more details.
...@@ -80,17 +86,17 @@ Don't forget to check [performance guide](docs/PERFORMANCE_GUIDE.md). ...@@ -80,17 +86,17 @@ Don't forget to check [performance guide](docs/PERFORMANCE_GUIDE.md).
## Install ## Install
You need to install python >= 3.6 (>=3.7 for windows) first to use spconv 2.x. You need to install python >= 3.7 first to use spconv 2.x.
You need to install CUDA toolkit first before using prebuilt binaries or build from source. You need to install CUDA toolkit first before using prebuilt binaries or build from source.
You need at least CUDA 10.2 to build and run spconv 2.x. We won't offer any support for CUDA < 10.2. You need at least CUDA 11.0 to build and run spconv 2.x. We won't offer any support for CUDA < 11.0.
### Prebuilt ### Prebuilt
We offer python 3.6-3.10 and cuda 10.2/11.1/11.3/11.4 prebuilt binaries for linux (manylinux). We offer python 3.7-3.10 and cuda 11.1/11.3/11.4/12.0 prebuilt binaries for linux (manylinux).
We offer python 3.7-3.10 and cuda 10.2/11.1/11.4 prebuilt binaries for windows 10/11. We offer python 3.7-3.10 and cuda 11.1/11.4/12.0 prebuilt binaries for windows 10/11.
We will provide prebuilts for CUDA versions supported by latest pytorch release. For example, pytorch 1.10 provide cuda 10.2 and 11.3 prebuilts, so we provide them too. We will provide prebuilts for CUDA versions supported by latest pytorch release. For example, pytorch 1.10 provide cuda 10.2 and 11.3 prebuilts, so we provide them too.
...@@ -100,17 +106,15 @@ CUDA 11.1 will be removed in spconv 2.2 because pytorch 1.10 don't provide prebu ...@@ -100,17 +106,15 @@ CUDA 11.1 will be removed in spconv 2.2 because pytorch 1.10 don't provide prebu
```pip install spconv``` for CPU only (**Linux Only**). you should only use this for debug usage, the performance isn't optimized due to manylinux limit (no omp support). ```pip install spconv``` for CPU only (**Linux Only**). you should only use this for debug usage, the performance isn't optimized due to manylinux limit (no omp support).
```pip install spconv-cu102``` for CUDA 10.2
```pip install spconv-cu111``` for CUDA 11.1 ```pip install spconv-cu111``` for CUDA 11.1
```pip install spconv-cu113``` for CUDA 11.3 (**Linux Only**) ```pip install spconv-cu113``` for CUDA 11.3 (**Linux Only**)
```pip install spconv-cu114``` for CUDA 11.4 ```pip install spconv-cu114``` for CUDA 11.4
**NOTE** It's safe to have different **minor** cuda version between system and conda (pytorch) in **CUDA >= 11.0** because of [CUDA Minor Version Compatibility](https://docs.nvidia.com/deploy/cuda-compatibility/#minor-version-compatibility). For example, you can use spconv-cu114 with anaconda version of pytorch cuda 11.1 in a OS with CUDA 11.2 installed. ```pip install spconv-cu120``` for CUDA 12.0
For CUDA 10, we don't know whether ```spconv-cu102``` works with CUDA 10.0 and 10.1. Users can have a try. **NOTE** It's safe to have different **minor** cuda version between system and conda (pytorch) in **CUDA >= 11.0** because of [CUDA Minor Version Compatibility](https://docs.nvidia.com/deploy/cuda-compatibility/#minor-version-compatibility). For example, you can use spconv-cu114 with anaconda version of pytorch cuda 11.1 in a OS with CUDA 11.2 installed.
**NOTE** In Linux, you can install spconv-cuxxx without install CUDA to system! only suitable NVIDIA driver is required. for CUDA 11, we need driver >= 450.82. **NOTE** In Linux, you can install spconv-cuxxx without install CUDA to system! only suitable NVIDIA driver is required. for CUDA 11, we need driver >= 450.82.
...@@ -118,11 +122,12 @@ For CUDA 10, we don't know whether ```spconv-cu102``` works with CUDA 10.0 and 1 ...@@ -118,11 +122,12 @@ For CUDA 10, we don't know whether ```spconv-cu102``` works with CUDA 10.0 and 1
See [this page](https://arnon.dk/matching-sm-architectures-arch-and-gencode-for-various-nvidia-cards/) to check supported GPU names by arch. See [this page](https://arnon.dk/matching-sm-architectures-arch-and-gencode-for-various-nvidia-cards/) to check supported GPU names by arch.
If you use a GPU architecture that isn't compiled in prebuilt, spconv will use NVRTC to compile a slightly slower kernel.
| CUDA version | GPU Arch List | | CUDA version | GPU Arch List |
| -------------- |:---------------------:| | -------------- |:---------------------:|
| 10.2 | 50,52,60,61,70,75 |
| 11.x | 52,60,61,70,75,80,86 | | 11.x | 52,60,61,70,75,80,86 |
| 12.x | 60,61,70,75,80,86,90 | | 12.x | 70,75,80,86,90 |
### Build from source for development (JIT, recommend) ### Build from source for development (JIT, recommend)
...@@ -171,10 +176,6 @@ You need to rebuild ```cumm``` first if you are build along a CUDA version that ...@@ -171,10 +176,6 @@ You need to rebuild ```cumm``` first if you are build along a CUDA version that
5. run ```pip install pccm cumm wheel``` 5. run ```pip install pccm cumm wheel```
6. run ```python setup.py bdist_wheel```+```pip install dists/xxx.whl``` 6. run ```python setup.py bdist_wheel```+```pip install dists/xxx.whl```
## Know issues
* Spconv 2.x F16 runs slow in A100.
## Note ## Note
The work is done when the author is an employee at [Tusimple](https://www.tusimple.com/). The work is done when the author is an employee at [Tusimple](https://www.tusimple.com/).
......
# Copyright 2021 Yan Yan
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""simple fuse
see https://pytorch.org/tutorials/intermediate/fx_conv_bn_fuser.html
"""
import time
from pathlib import Path
from typing import Any, Dict, Tuple
import numpy as np
import torch
from torch import nn
from cumm import tensorview as tv
from spconv.core import ConvAlgo
import torch.fx
import spconv.pytorch as spconv
import copy
import pickle
from spconv.pytorch.conv import SparseConvolution
from spconv.pytorch import functional as Fsp
def fuse_bn_weights(conv_w_OKI, conv_b, bn_rm, bn_rv, bn_eps, bn_w, bn_b):
NDim = conv_w_OKI.ndim - 2
permute = [0, NDim+1] + [i+1 for i in range(NDim)]
conv_w_OIK = conv_w_OKI.permute(*permute)
# OIDHW
if conv_b is None:
conv_b = torch.zeros_like(bn_rm)
if bn_w is None:
bn_w = torch.ones_like(bn_rm)
if bn_b is None:
bn_b = torch.zeros_like(bn_rm)
bn_var_rsqrt = torch.rsqrt(bn_rv + bn_eps)
conv_w_OIK = conv_w_OIK * (bn_w * bn_var_rsqrt).reshape([-1] + [1] * (len(conv_w_OIK.shape) - 1))
conv_b = (conv_b - bn_rm) * bn_var_rsqrt * bn_w + bn_b
permute = [0,] + [i+2 for i in range(NDim)] + [1,]
conv_w_OKI = conv_w_OIK.permute(*permute).contiguous()
return torch.nn.Parameter(conv_w_OKI), torch.nn.Parameter(conv_b)
def fuse_bn(conv, bn):
"""
Given a conv Module `A` and an batch_norm module `B`, returns a conv
module `C` such that C(x) == B(A(x)) in inference mode.
"""
assert(not (conv.training or bn.training)), "Fusion only for eval!"
fused_conv = copy.deepcopy(conv)
fused_conv.weight, fused_conv.bias = \
fuse_bn_weights(fused_conv.weight, fused_conv.bias,
bn.running_mean, bn.running_var, bn.eps, bn.weight, bn.bias)
return fused_conv
def fuse_act_net(conv, act):
"""
Given a conv Module `A` and an batch_norm module `B`, returns a conv
module `C` such that C(x) == B(A(x)) in inference mode.
"""
assert(not (conv.training)), "Fusion only for eval!"
fused_conv = copy.deepcopy(conv)
if isinstance(act, torch.nn.ReLU):
fused_conv.act_type = tv.gemm.Activation.ReLU
if isinstance(act, torch.nn.Sigmoid):
fused_conv.act_type = tv.gemm.Activation.Sigmoid
elif isinstance(act, torch.nn.LeakyReLU):
fused_conv.act_type = tv.gemm.Activation.LeakyReLU
fused_conv.act_alpha = act.negative_slope
else:
raise NotImplementedError
return fused_conv
def _parent_name(target : str) -> Tuple[str, str]:
"""
Splits a qualname into parent path and last atom.
For example, `foo.bar.baz` -> (`foo.bar`, `baz`)
"""
*parent, name = target.rsplit('.', 1)
return parent[0] if parent else '', name
def replace_node_module(node: torch.fx.Node, modules: Dict[str, Any], new_module: torch.nn.Module):
assert(isinstance(node.target, str))
parent_name, name = _parent_name(node.target)
setattr(modules[parent_name], name, new_module)
def fuse(model: torch.fx.GraphModule) -> torch.fx.GraphModule:
model = copy.deepcopy(model)
# The first step of most FX passes is to symbolically trace our model to
# obtain a `GraphModule`. This is a representation of our original model
# that is functionally identical to our original model, except that we now
# also have a graph representation of our forward pass.
fx_model = model
modules = dict(fx_model.named_modules())
# The primary representation for working with FX are the `Graph` and the
# `Node`. Each `GraphModule` has a `Graph` associated with it - this
# `Graph` is also what generates `GraphModule.code`.
# The `Graph` itself is represented as a list of `Node` objects. Thus, to
# iterate through all of the operations in our graph, we iterate over each
# `Node` in our `Graph`.
for node in fx_model.graph.nodes:
# The FX IR contains several types of nodes, which generally represent
# call sites to modules, functions, or methods. The type of node is
# determined by `Node.op`.
if node.op != 'call_module': # If our current node isn't calling a Module then we can ignore it.
continue
# For call sites, `Node.target` represents the module/function/method
# that's being called. Here, we check `Node.target` to see if it's a
# batch norm module, and then check `Node.args[0].target` to see if the
# input `Node` is a convolution.
# print(node.target, node.args, node.args[0].args)
if isinstance(modules[node.target], torch.nn.BatchNorm1d):
if node.args[0].target in modules and isinstance(modules[node.args[0].target], SparseConvolution):
if len(node.args[0].users) > 1: # Output of conv is used by other nodes
continue
conv = modules[node.args[0].target]
bn = modules[node.target]
fused_conv = fuse_bn(conv, bn)
assert isinstance(fused_conv, SparseConvolution)
replace_node_module(node.args[0], modules, fused_conv)
modules_to_fuse = [node.args[0].target, node.target]
try:
if isinstance(modules[node.next.target], torch.nn.ReLU):
modules_to_fuse.append(node.next.target)
except Exception as e:
pass
# As we've folded the batch nor into the conv, we need to replace all uses
# of the batch norm with the conv.
node.replace_all_uses_with(node.args[0])
# Now that all uses of the batch norm have been replaced, we can
# safely remove the batch norm.
fx_model.graph.erase_node(node)
fx_model.graph.lint()
# After we've modified our graph, we need to recompile our graph in order
# to keep the generated code in sync.
fx_model.recompile()
return fx_model
def fuse_act(model: torch.fx.GraphModule) -> torch.fx.GraphModule:
model = copy.deepcopy(model)
# The first step of most FX passes is to symbolically trace our model to
# obtain a `GraphModule`. This is a representation of our original model
# that is functionally identical to our original model, except that we now
# also have a graph representation of our forward pass.
fx_model = model
modules = dict(fx_model.named_modules())
# The primary representation for working with FX are the `Graph` and the
# `Node`. Each `GraphModule` has a `Graph` associated with it - this
# `Graph` is also what generates `GraphModule.code`.
# The `Graph` itself is represented as a list of `Node` objects. Thus, to
# iterate through all of the operations in our graph, we iterate over each
# `Node` in our `Graph`.
for node in fx_model.graph.nodes:
# The FX IR contains several types of nodes, which generally represent
# call sites to modules, functions, or methods. The type of node is
# determined by `Node.op`.
if node.op != 'call_module': # If our current node isn't calling a Module then we can ignore it.
continue
# For call sites, `Node.target` represents the module/function/method
# that's being called. Here, we check `Node.target` to see if it's a
# batch norm module, and then check `Node.args[0].target` to see if the
# input `Node` is a convolution.
# print(node.target, node.args, node.args[0].args)
if isinstance(modules[node.target], (torch.nn.ReLU, torch.nn.LeakyReLU, torch.nn.Sigmoid)):
if node.args[0].target in modules and isinstance(modules[node.args[0].target], SparseConvolution):
if len(node.args[0].users) > 1: # Output of conv is used by other nodes
continue
conv = modules[node.args[0].target]
act = modules[node.target]
fused_conv = fuse_act_net(conv, act)
assert isinstance(fused_conv, SparseConvolution)
replace_node_module(node.args[0], modules, fused_conv)
# As we've folded the batch nor into the conv, we need to replace all uses
# of the batch norm with the conv.
node.replace_all_uses_with(node.args[0])
# Now that all uses of the batch norm have been replaced, we can
# safely remove the batch norm.
fx_model.graph.erase_node(node)
fx_model.graph.lint()
# After we've modified our graph, we need to recompile our graph in order
# to keep the generated code in sync.
fx_model.recompile()
return fx_model
class Net(nn.Module):
def __init__(self, shape, algo):
super().__init__()
pool_algo = algo
# pool_algo = ConvAlgo.Native
self.net = spconv.SparseSequential(
spconv.SubMConv3d(3, 64, 3, bias=False, indice_key="c0",
algo=algo),
nn.BatchNorm1d(64),
nn.ReLU(),
spconv.SubMConv3d(64,
64,
3,
bias=False,
indice_key="c0",
algo=algo),
nn.BatchNorm1d(64),
nn.ReLU(),
spconv.SparseConv3d(64, 64, 2, 2, bias=False, indice_key="m0", algo=algo),
nn.BatchNorm1d(64),
nn.ReLU(),
spconv.SubMConv3d(64,
96,
3,
bias=False,
indice_key="c1",
algo=algo),
nn.BatchNorm1d(96),
nn.ReLU(),
spconv.SubMConv3d(96,
96,
3,
bias=False,
indice_key="c1",
algo=algo),
nn.BatchNorm1d(96),
nn.ReLU(),
spconv.SparseConv3d(96, 96, 2, 2, bias=False, indice_key="m1", algo=algo),
nn.BatchNorm1d(96),
nn.ReLU(),
spconv.SubMConv3d(96,
128,
3,
bias=False,
indice_key="c2",
algo=algo),
nn.BatchNorm1d(128),
nn.ReLU(),
spconv.SubMConv3d(128,
128,
3,
bias=False,
indice_key="c2",
algo=algo),
nn.BatchNorm1d(128),
nn.ReLU(),
spconv.SparseConv3d(128, 128, 2, 2, bias=False, indice_key="m2", algo=algo),
nn.BatchNorm1d(128),
nn.ReLU(),
spconv.SubMConv3d(128,
160,
3,
bias=False,
indice_key="c3",
algo=algo),
nn.BatchNorm1d(160),
nn.ReLU(),
spconv.SubMConv3d(160,
160,
3,
bias=False,
indice_key="c3",
algo=algo),
nn.BatchNorm1d(160),
nn.ReLU(),
spconv.SparseConv3d(160, 160, 2, 2, bias=False, indice_key="m3", algo=algo),
nn.BatchNorm1d(160),
nn.ReLU(),
spconv.SubMConv3d(160,
192,
3,
bias=False,
indice_key="c4",
algo=algo),
nn.BatchNorm1d(192),
nn.ReLU(),
spconv.SubMConv3d(192,
192,
3,
bias=False,
indice_key="c4",
algo=algo),
nn.BatchNorm1d(192),
nn.ReLU(),
spconv.SparseConv3d(192, 192, 2, 2, bias=False, indice_key="m4", algo=algo),
nn.BatchNorm1d(192),
nn.ReLU(),
spconv.SubMConv3d(192,
224,
3,
bias=False,
indice_key="c5",
algo=algo),
nn.BatchNorm1d(224),
nn.ReLU(),
spconv.SubMConv3d(224,
224,
3,
bias=False,
indice_key="c5",
algo=algo),
nn.BatchNorm1d(224),
nn.ReLU(),
spconv.SparseConv3d(224, 224, 2, 2, bias=False, indice_key="m5", algo=algo),
nn.BatchNorm1d(224),
nn.ReLU(),
spconv.SubMConv3d(224,
256,
3,
bias=False,
indice_key="c6",
algo=algo),
nn.BatchNorm1d(256),
nn.ReLU(),
spconv.SubMConv3d(256,
256,
3,
bias=False,
indice_key="c6",
algo=algo),
nn.BatchNorm1d(256),
nn.ReLU(),
spconv.SparseInverseConv3d(256,
128,
2,
indice_key="m5",
bias=False,
algo=algo),
nn.BatchNorm1d(128),
nn.ReLU(),
spconv.SparseInverseConv3d(128,
64,
2,
indice_key="m4",
bias=False,
algo=algo),
nn.BatchNorm1d(64),
nn.ReLU(),
)
max_batch_size = 1
# grid (dense map) is used for indice generation. use pre-allocated grid can run faster.
# self.grid = None
self.shape = shape
for n in self.net.modules():
if isinstance(n, nn.BatchNorm1d):
n.bias.data.uniform_(-0.1, 0.1)
def forward(self, features, coors, batch_size, vx_num=None):
x = spconv.SparseConvTensor(features, coors, self.shape, batch_size, voxel_num=vx_num)
return self.net(x)
class MyTracer(torch.fx.Tracer):
def is_leaf_module(self, m: torch.nn.Module, module_qualified_name):
is_custom_leaf_module = isinstance(m, SparseConvolution)
return super().is_leaf_module(m, module_qualified_name) or is_custom_leaf_module
def main():
# run this file with SPCONV_FX_TRACE_MODE=1
torch.manual_seed(50051)
torch.backends.cuda.matmul.allow_tf32 = False
torch.backends.cudnn.allow_tf32 = False
with open(Path(__file__).parent / "data" / "test_spconv.pkl", "rb") as f:
(voxels, coors, spatial_shape) = pickle.load(f)
np.random.seed(50051)
device = torch.device("cuda:0")
device_cpu = torch.device("cpu:0")
dtype = torch.float32
net = Net(spatial_shape, ConvAlgo.MaskImplicitGemm).cuda().eval().to(dtype)
tracer = MyTracer()
graph_trace = tracer.trace(net)
net_fused = torch.fx.GraphModule(tracer.root, graph_trace)
net_fused = fuse(net_fused)
net_fused = fuse_act(net_fused)
print(net_fused)
voxels_th = torch.from_numpy(voxels).to(device_cpu).to(dtype)
coors_th = torch.from_numpy(coors).to(device_cpu).int()
voxels_th_cuda = torch.from_numpy(voxels).to(device).to(dtype)
coors_th_cuda = torch.from_numpy(coors).to(device).int()
out_ref = net(voxels_th_cuda, coors_th_cuda, 1)
print("-------------fused------------")
out_fused = net_fused(voxels_th_cuda, coors_th_cuda, 1)
res = Fsp.sparse_add_hash_based(out_ref, out_fused.minus())
print(torch.linalg.norm(res.features))
if __name__ == "__main__":
main()
\ No newline at end of file
# Copyright 2022 Yan Yan
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from spconv.core_cc.csrc.utils.pcc import PointCloudCompress
from pathlib import Path
import numpy as np
from cumm import tensorview as tv
def main():
data = np.load(Path(__file__).parent.parent / "test" / "data" / "benchmark-pc.npz")
pc = np.ascontiguousarray(data["pc"]).astype(np.float32)
pc_encoded, order = PointCloudCompress.encode_with_order(tv.from_numpy(pc), tv.Tensor(), 0.01, 0.01, 0.01, PointCloudCompress.EncodeType.XYZ_8, True)
print(pc.nbytes, pc_encoded.bytesize())
pc_recover = PointCloudCompress.decode(pc_encoded)
pc_reorder = pc[order.numpy_view()]
error = pc_recover.numpy_view()[:, :3] - pc_reorder[:, :3]
print("ERROR", np.abs(error).max())
if __name__ == "__main__":
main()
\ No newline at end of file
...@@ -38,9 +38,9 @@ if cuda_ver: ...@@ -38,9 +38,9 @@ if cuda_ver:
cuda_ver = cuda_ver.replace(".", "") # 10.2 to 102 cuda_ver = cuda_ver.replace(".", "") # 10.2 to 102
RELEASE_NAME += "-cu{}".format(cuda_ver) RELEASE_NAME += "-cu{}".format(cuda_ver)
deps = ["cumm-cu{}>=0.2.8".format(cuda_ver)] deps = ["cumm-cu{}>=0.3.0".format(cuda_ver)]
else: else:
deps = ["cumm>=0.2.8"] deps = ["cumm>=0.3.0"]
...@@ -156,7 +156,7 @@ if disable_jit is not None and disable_jit == "1": ...@@ -156,7 +156,7 @@ if disable_jit is not None and disable_jit == "1":
from cumm.conv.main import ConvMainUnitTest from cumm.conv.main import ConvMainUnitTest
from cumm.constants import CUMM_CPU_ONLY_BUILD from cumm.constants import CUMM_CPU_ONLY_BUILD
from spconv.csrc.sparse.all import SpconvOps from spconv.csrc.sparse.all import SpconvOps
from spconv.csrc.utils import BoxOps from spconv.csrc.utils import BoxOps, PointCloudCompress
from spconv.csrc.hash.core import HashTable from spconv.csrc.hash.core import HashTable
from cumm.common import CompileInfo from cumm.common import CompileInfo
from spconv.csrc.sparse.alloc import ExternalAllocator from spconv.csrc.sparse.alloc import ExternalAllocator
...@@ -192,7 +192,7 @@ if disable_jit is not None and disable_jit == "1": ...@@ -192,7 +192,7 @@ if disable_jit is not None and disable_jit == "1":
convops.namespace = "csrc.sparse.convops.spops" convops.namespace = "csrc.sparse.convops.spops"
cus = [gemmtuner, convtuner, cus = [gemmtuner, convtuner,
convops, SpconvOps(), BoxOps(), HashTable(), CompileInfo(), convops, SpconvOps(), BoxOps(), HashTable(), CompileInfo(),
ExternalAllocator(), ExternalAllocator(), PointCloudCompress(),
ExternalSpconvMatmul(), InferenceOps()] ExternalSpconvMatmul(), InferenceOps()]
if not CUMM_CPU_ONLY_BUILD: if not CUMM_CPU_ONLY_BUILD:
cus.extend([cu, convcu]) cus.extend([cu, convcu])
......
...@@ -31,7 +31,7 @@ if project_is_installed(PACKAGE_NAME) and project_is_editable( ...@@ -31,7 +31,7 @@ if project_is_installed(PACKAGE_NAME) and project_is_editable(
from spconv.csrc.sparse.all import SpconvOps from spconv.csrc.sparse.all import SpconvOps
from spconv.csrc.sparse.alloc import ExternalAllocator from spconv.csrc.sparse.alloc import ExternalAllocator
from spconv.csrc.utils import BoxOps from spconv.csrc.utils import BoxOps, PointCloudCompress
from spconv.csrc.hash.core import HashTable from spconv.csrc.hash.core import HashTable
from spconv.csrc.sparse.convops import GemmTunerSimple, ExternalSpconvMatmul from spconv.csrc.sparse.convops import GemmTunerSimple, ExternalSpconvMatmul
from spconv.csrc.sparse.convops import ConvTunerSimple, ConvGemmOps from spconv.csrc.sparse.convops import ConvTunerSimple, ConvGemmOps
...@@ -65,6 +65,7 @@ if project_is_installed(PACKAGE_NAME) and project_is_editable( ...@@ -65,6 +65,7 @@ if project_is_installed(PACKAGE_NAME) and project_is_editable(
ExternalSpconvMatmul(), ExternalSpconvMatmul(),
SimpleExternalSpconvMatmul(), # for debug, won't be included in release SimpleExternalSpconvMatmul(), # for debug, won't be included in release
InferenceOps(), InferenceOps(),
PointCloudCompress(),
] ]
pccm.builder.build_pybind(cus, pccm.builder.build_pybind(cus,
PACKAGE_ROOT / "core_cc", PACKAGE_ROOT / "core_cc",
......
from typing import overload, Any, Callable, Dict, List, Optional, Set, Tuple, Type, Union
from pccm.stubs import EnumValue, EnumClassValue
from cumm.tensorview import Tensor
class PointCloudCompress:
@staticmethod
def encode_with_order(points: Tensor, intensity: Tensor, ex: float, ey: float, ez: float, type, with_order: bool = False) -> Tuple[Tensor, Tensor]:
"""
Args:
points:
intensity:
ex:
ey:
ez:
type:
with_order:
"""
...
@staticmethod
def encode_xyzi(points: Tensor, intensity: Tensor, ex: float, ey: float, ez: float) -> Tensor:
"""
Args:
points:
intensity:
ex:
ey:
ez:
"""
...
@staticmethod
def encode_xyz(points: Tensor, ex: float, ey: float, ez: float) -> Tensor:
"""
Args:
points:
ex:
ey:
ez:
"""
...
@staticmethod
def decode(data: Tensor) -> Tensor:
"""
Args:
data:
"""
...
class EncodeType:
XYZ_8 = EnumClassValue(0) # type: EnumClassValue
XYZI_8 = EnumClassValue(1) # type: EnumClassValue
@staticmethod
def __members__() -> Dict[str, EnumClassValue]: ...
...@@ -4,7 +4,16 @@ class CompileInfo: ...@@ -4,7 +4,16 @@ class CompileInfo:
@staticmethod @staticmethod
def get_compiled_cuda_arch() -> List[Tuple[int, int]]: ... def get_compiled_cuda_arch() -> List[Tuple[int, int]]: ...
@staticmethod @staticmethod
def arch_is_compiled(arch: Tuple[int, int]) -> bool: def get_compiled_gemm_cuda_arch() -> List[Tuple[int, int]]: ...
@staticmethod
def arch_is_compiled_gemm(arch: Tuple[int, int]) -> bool:
"""
Args:
arch:
"""
...
@staticmethod
def arch_is_compiled_gemm(arch: Tuple[int, int]) -> bool:
""" """
Args: Args:
arch: arch:
......
...@@ -27,4 +27,4 @@ from spconv.core_cc.csrc.utils.boxops import BoxOps ...@@ -27,4 +27,4 @@ from spconv.core_cc.csrc.utils.boxops import BoxOps
from spconv.core_cc.cumm.common import CompileInfo from spconv.core_cc.cumm.common import CompileInfo
HAS_BOOST = BoxOps.has_boost() HAS_BOOST = BoxOps.has_boost()
COMPILED_CUDA_ARCHS = set(CompileInfo.get_compiled_cuda_arch()) COMPILED_CUDA_ARCHS = set(CompileInfo.get_compiled_gemm_cuda_arch())
...@@ -1384,7 +1384,8 @@ class SpconvOps(pccm.Class): ...@@ -1384,7 +1384,8 @@ class SpconvOps(pccm.Class):
code.arg("voxels, indices, num_per_voxel, hashdata, point_indice_data, pc_voxel_id", code.arg("voxels, indices, num_per_voxel, hashdata, point_indice_data, pc_voxel_id",
"tv::Tensor") "tv::Tensor")
code.arg("vsize", f"std::vector<float>") code.arg("vsize", f"std::vector<float>")
code.arg("grid_size, grid_stride", f"std::vector<int>") code.arg("grid_size", f"std::vector<int>")
code.arg("grid_stride", f"std::vector<int64_t>")
code.arg("coors_range", f"std::vector<float>") code.arg("coors_range", f"std::vector<float>")
code.arg("empty_mean", "bool", "false") code.arg("empty_mean", "bool", "false")
...@@ -1404,7 +1405,9 @@ class SpconvOps(pccm.Class): ...@@ -1404,7 +1405,9 @@ class SpconvOps(pccm.Class):
code.raw(f""" code.raw(f"""
if (ndim == {ndim}){{ if (ndim == {ndim}){{
std::array<float, {ndim}> vsize_; std::array<float, {ndim}> vsize_;
std::array<int, {ndim}> grid_size_, grid_stride_; std::array<int, {ndim}> grid_size_;
std::array<int64_t, {ndim}> grid_stride_;
std::array<float, {ndim * 2}> coors_range_; std::array<float, {ndim * 2}> coors_range_;
for (int i = 0; i < {ndim}; ++i){{ for (int i = 0; i < {ndim}; ++i){{
vsize_[i] = vsize[i]; vsize_[i] = vsize[i];
......
...@@ -550,7 +550,7 @@ class GemmTunerSimple(pccm.ParameterizedClass): ...@@ -550,7 +550,7 @@ class GemmTunerSimple(pccm.ParameterizedClass):
}} }}
auto avail_algos = get_available_algo_str_from_arch(arch); auto avail_algos = get_available_algo_str_from_arch(arch);
std::vector<tv::gemm::GemmAlgoDesp> finally_algos; std::vector<tv::gemm::GemmAlgoDesp> finally_algos;
auto is_arch_compiled = CompileInfo::arch_is_compiled(arch); auto is_arch_compiled = CompileInfo::arch_is_compiled_gemm(arch);
for (auto algo : avail_algos){{ for (auto algo : avail_algos){{
static_key_t static_key = std::make_tuple(trans_a, trans_b, trans_c, int(a.dtype()), static_key_t static_key = std::make_tuple(trans_a, trans_b, trans_c, int(a.dtype()),
int(b.dtype()), int(c.dtype()), shuffle_type, algo); int(b.dtype()), int(c.dtype()), shuffle_type, algo);
...@@ -996,7 +996,7 @@ class ConvTunerSimple(pccm.ParameterizedClass): ...@@ -996,7 +996,7 @@ class ConvTunerSimple(pccm.ParameterizedClass):
use_f32_as_accum = false; use_f32_as_accum = false;
std::vector<tv::gemm::ConvAlgoDesp> finally_algos; std::vector<tv::gemm::ConvAlgoDesp> finally_algos;
auto is_arch_compiled = CompileInfo::arch_is_compiled(arch); auto is_arch_compiled = CompileInfo::arch_is_compiled_gemm(arch);
for (auto algo : avail_algos){{ for (auto algo : avail_algos){{
static_key_t static_key = std::make_tuple( static_key_t static_key = std::make_tuple(
layout_i, layout_w, layout_o, layout_i, layout_w, layout_o,
...@@ -2048,7 +2048,12 @@ class ConvGemmOps(pccm.ParameterizedClass): ...@@ -2048,7 +2048,12 @@ class ConvGemmOps(pccm.ParameterizedClass):
for (int j = 0; j < num_split; ++j){{ for (int j = 0; j < num_split; ++j){{
float beta = j == 0 ? 0 : 1; float beta = j == 0 ? 0 : 1;
if (!bias.empty()){{
beta = 1;
}}
if (j > 0){{
bias = tv::Tensor();
}}
conv_tuner.run_with_tuned_result( conv_tuner.run_with_tuned_result(
tune_res, tune_res,
kForwardInt, kForwardInt,
......
...@@ -13,29 +13,37 @@ ...@@ -13,29 +13,37 @@
# limitations under the License. # limitations under the License.
import pccm import pccm
from cumm.common import TensorView, GemmDTypes, TensorViewKernel, ThrustLib, GemmBasic from cumm.common import TensorView, GemmDTypes, TensorViewKernel, TensorViewNVRTC, ThrustLib, GemmBasic
from spconv.csrc.sparse.cpu_core import OMPLib from spconv.csrc.sparse.cpu_core import OMPLib
from ..utils.launch import LaunchUtils
from cumm.constants import CUMM_CPU_ONLY_BUILD from cumm.constants import CUMM_CPU_ONLY_BUILD
class InferenceOpsKernel(pccm.ParameterizedClass): class InferenceOpsKernel(pccm.ParameterizedClass):
def __init__(self): def __init__(self):
super().__init__() super().__init__()
self.add_dependency(TensorViewKernel, GemmBasic) self.add_dependency(TensorViewKernel, TensorViewNVRTC, GemmBasic, LaunchUtils)
@pccm.cuda.cuda_global_function @pccm.cuda.cuda_global_function
def bias_add_inplace_kernel(self): def bias_add_inplace_kernel(self):
code = pccm.FunctionCode() code = pccm.FunctionCode()
code.targ("T") code.targ("T")
code.nontype_targ("OneDim", "bool", "false")
code.arg("out_features", f"T*") code.arg("out_features", f"T*")
code.arg("bias", f"const T*") code.arg("bias", f"const T*")
code.arg("size", "int") code.arg("size", "int")
code.arg("num_features", "int") code.arg("num_features", "int")
code.arg("num_blocks_x", "int")
code.arg("num_blocks_y", "int")
code.raw(f""" code.raw(f"""
for (int i : tv::KernelLoopY<int>(size)) {{ int block_idx_x = OneDim ? blockIdx.x % num_blocks_x : blockIdx.x;
int block_idx_y = OneDim ? blockIdx.x / num_blocks_x : blockIdx.y;
for (int i : tv::KernelLoopY<int>(size, block_idx_y, OneDim ? num_blocks_y : gridDim.y)) {{
auto out_ptr = out_features + i * num_features; auto out_ptr = out_features + i * num_features;
for (int j : tv::KernelLoopX<int>(num_features)) {{ for (int j : tv::KernelLoopX<int>(num_features, block_idx_x, OneDim ? num_blocks_x : gridDim.x)) {{
out_ptr[j] = bias[j] + out_ptr[j]; out_ptr[j] = bias[j] + out_ptr[j];
}} }}
}} }}
...@@ -46,6 +54,7 @@ class InferenceOpsKernel(pccm.ParameterizedClass): ...@@ -46,6 +54,7 @@ class InferenceOpsKernel(pccm.ParameterizedClass):
def bias_add_act_inplace_kernel(self): def bias_add_act_inplace_kernel(self):
code = pccm.FunctionCode() code = pccm.FunctionCode()
code.targ("T") code.targ("T")
code.nontype_targ("OneDim", "bool", "false")
code.arg("out_features", f"T*") code.arg("out_features", f"T*")
code.arg("bias", f"const T*") code.arg("bias", f"const T*")
...@@ -54,20 +63,35 @@ class InferenceOpsKernel(pccm.ParameterizedClass): ...@@ -54,20 +63,35 @@ class InferenceOpsKernel(pccm.ParameterizedClass):
code.arg("beta", f"T") code.arg("beta", f"T")
code.arg("size", "int") code.arg("size", "int")
code.arg("num_features", "int") code.arg("num_features", "int")
code.arg("num_blocks_x", "int")
code.arg("num_blocks_y", "int")
code.raw(f""" code.raw(f"""
for (int i : tv::KernelLoopY<int>(size)) {{ int block_idx_x = OneDim ? blockIdx.x % num_blocks_x : blockIdx.x;
int block_idx_y = OneDim ? blockIdx.x / num_blocks_x : blockIdx.y;
namespace op = tv::arrayops;
using nv_scalar_t = tv::equivalent_data_type_t<T>;
for (int i : tv::KernelLoopY<int>(size, block_idx_y, OneDim ? num_blocks_y : gridDim.y)) {{
auto out_ptr = out_features + i * num_features; auto out_ptr = out_features + i * num_features;
for (int j : tv::KernelLoopX<int>(num_features)) {{ for (int j : tv::KernelLoopX<int>(num_features, block_idx_x, OneDim ? num_blocks_x : gridDim.x)) {{
T o = out_ptr[j] + bias[j]; T o = out_ptr[j] + bias[j];
auto* o_nv = reinterpret_cast<nv_scalar_t*>(&o);
switch (act_type){{ switch (act_type){{
case tv::gemm::Activation::kNone: case tv::gemm::Activation::kNone:
break; break;
case tv::gemm::Activation::kReLU:{{ case tv::gemm::Activation::kReLU:{{
o = o >= T(0) ? o : T(0); o = o >= T(0) ? o : T(0);
break;
}} }}
case tv::gemm::Activation::kLeakyReLU:{{ case tv::gemm::Activation::kLeakyReLU:{{
o = o >= T(0) ? o : o * alpha; o = o >= T(0) ? o : o * alpha;
break;
}}
case tv::gemm::Activation::kSigmoid:{{
auto e = op::MathScalarOp<nv_scalar_t>::exp(-*o_nv);
o = T(1) / (T(1) + *reinterpret_cast<T*>( &e ));
break;
}} }}
default: ; default: ;
}} }}
...@@ -89,16 +113,27 @@ class InferenceOpsKernel(pccm.ParameterizedClass): ...@@ -89,16 +113,27 @@ class InferenceOpsKernel(pccm.ParameterizedClass):
code.arg("size", "int") code.arg("size", "int")
code.raw(f""" code.raw(f"""
namespace op = tv::arrayops;
using nv_scalar_t = tv::equivalent_data_type_t<T>;
for (int i : tv::KernelLoopX<int>(size)) {{ for (int i : tv::KernelLoopX<int>(size)) {{
T o = out_features[i]; T o = out_features[i];
auto* o_nv = reinterpret_cast<nv_scalar_t*>(&o);
switch (act_type){{ switch (act_type){{
case tv::gemm::Activation::kNone: case tv::gemm::Activation::kNone:
break; break;
case tv::gemm::Activation::kReLU:{{ case tv::gemm::Activation::kReLU:{{
out_features[i] = o >= T(0) ? o : T(0); out_features[i] = o >= T(0) ? o : T(0);
break;
}} }}
case tv::gemm::Activation::kLeakyReLU:{{ case tv::gemm::Activation::kLeakyReLU:{{
out_features[i] = o >= T(0) ? o : o * alpha; out_features[i] = o >= T(0) ? o : o * alpha;
break;
}}
case tv::gemm::Activation::kSigmoid:{{
auto e = op::MathScalarOp<nv_scalar_t>::exp(-*o_nv);
out_features[i] = T(1) / (T(1) + *reinterpret_cast<T*>( &e ));
break;
}} }}
default: ; default: ;
}} }}
...@@ -110,9 +145,10 @@ class InferenceOpsKernel(pccm.ParameterizedClass): ...@@ -110,9 +145,10 @@ class InferenceOpsKernel(pccm.ParameterizedClass):
class InferenceOps(pccm.Class): class InferenceOps(pccm.Class):
def __init__(self): def __init__(self):
super().__init__() super().__init__()
self.add_dependency(TensorView) self.add_dependency(TensorView, LaunchUtils)
self.kernel = InferenceOpsKernel() self.kernel = InferenceOpsKernel()
self.add_include("tensorview/gemm/core/constants.h") self.add_include("tensorview/gemm/core/constants.h")
self.add_static_const("kMaxGridYZDim", "int", "65535")
if CUMM_CPU_ONLY_BUILD: if CUMM_CPU_ONLY_BUILD:
_DECORATOR = pccm.static_function _DECORATOR = pccm.static_function
...@@ -139,34 +175,30 @@ class InferenceOps(pccm.Class): ...@@ -139,34 +175,30 @@ class InferenceOps(pccm.Class):
auto nhot = out.dim(0); auto nhot = out.dim(0);
auto cudastream = reinterpret_cast<cudaStream_t>(stream); auto cudastream = reinterpret_cast<cudaStream_t>(stream);
TV_ASSERT_RT_ERR(bias.dim(0) == out.dim(1), "error"); TV_ASSERT_RT_ERR(bias.dim(0) == out.dim(1), "error");
tv::dispatch<float, double, tv::half_t, tv::bfloat16_t>(out.dtype(), [&](auto I){{ tv::dispatch<float, tv::half_t, tv::bfloat16_t>(out.dtype(), [&](auto I){{
using T = TV_DECLTYPE(I); using T = TV_DECLTYPE(I);
constexpr int MaxThreads = 512; auto launchdims = LaunchUtils::get_blocks_threads_of_2d_tensor(nhot, out.dim(1));
tv::cuda::Launch launcher(1); int num_blocks_X = std::get<0>(launchdims);
bool found = tv::dispatch_int_noexcept<512, 256, 128, 64, 32, 16>(out.dim(1), [](int my, int expect){{return my >= expect;}}, [&](auto V){{ int num_blocks_Y = std::get<1>(launchdims);
// if out.dim(1) > value in list above, run this function. dim3 blocks;
// if a value is found, other value won't be executed. dim3 threads(std::get<2>(launchdims), std::get<3>(launchdims));
int NumFeatures = TV_DECLTYPE(V)::value; if (num_blocks_Y > kMaxGridYZDim){{
int Num0 = MaxThreads / NumFeatures; blocks = dim3(num_blocks_X * num_blocks_Y);
dim3 blocks(tv::div_up(out.dim(1), int64_t(NumFeatures)), tv::div_up(nhot, int64_t(Num0)));
dim3 threads(NumFeatures, Num0);
launcher = tv::cuda::Launch(blocks, threads, cudastream);
}});
if (!found){{
int NumFeatures = 16;
int Num0 = MaxThreads / NumFeatures;
dim3 blocks(tv::div_up(out.dim(1), int64_t(NumFeatures)), tv::div_up(nhot, int64_t(Num0)));
dim3 threads(NumFeatures, Num0);
launcher = tv::cuda::Launch(blocks, threads, cudastream);
}}
if (act_type == tv::gemm::Activation::kNone){{
launcher(ker::bias_add_inplace_kernel<T>, out.data_ptr<T>(), bias.data_ptr<const T>(),
nhot, out.dim(1));
}}else{{ }}else{{
launcher(ker::bias_add_act_inplace_kernel<T>, out.data_ptr<T>(), bias.data_ptr<const T>(), blocks = dim3(num_blocks_X, num_blocks_Y);
act_type, T(alpha), T(beta), nhot, out.dim(1));
}} }}
tv::cuda::Launch launcher = tv::cuda::Launch(blocks, threads, cudastream);
tv::dispatch_int<0, 1>(int(num_blocks_Y > kMaxGridYZDim), [&](auto I2){{
constexpr bool OneDim = TV_DECLTYPE(I2)::value == 1;
if (act_type == tv::gemm::Activation::kNone){{
launcher(ker::bias_add_inplace_kernel<T, OneDim>, out.data_ptr<T>(), bias.data_ptr<const T>(),
nhot, out.dim(1), num_blocks_X, num_blocks_Y);
}}else{{
launcher(ker::bias_add_act_inplace_kernel<T, OneDim>, out.data_ptr<T>(), bias.data_ptr<const T>(),
act_type, T(alpha), T(beta), nhot, out.dim(1), num_blocks_X, num_blocks_Y);
}}
}});
TV_CHECK_CUDA_ERR_V2("bias add act failed!!!");
}}); }});
""") """)
return code return code
...@@ -204,10 +236,11 @@ class InferenceOps(pccm.Class): ...@@ -204,10 +236,11 @@ class InferenceOps(pccm.Class):
auto nhot = out.size(); auto nhot = out.size();
auto cudastream = reinterpret_cast<cudaStream_t>(stream); auto cudastream = reinterpret_cast<cudaStream_t>(stream);
tv::cuda::Launch launcher = tv::cuda::Launch(nhot, cudastream); tv::cuda::Launch launcher = tv::cuda::Launch(nhot, cudastream);
tv::dispatch<float, double, tv::half_t, tv::bfloat16_t>(out.dtype(), [&](auto I){{ tv::dispatch<float, tv::half_t, tv::bfloat16_t>(out.dtype(), [&](auto I){{
using T = TV_DECLTYPE(I); using T = TV_DECLTYPE(I);
launcher(ker::activation_inplace_kernel<T>, out.data_ptr<T>(), act_type, T(alpha), T(beta), launcher(ker::activation_inplace_kernel<T>, out.data_ptr<T>(), act_type, T(alpha), T(beta),
nhot); nhot);
TV_CHECK_CUDA_ERR_V2("bias add act failed!!!");
}}); }});
""") """)
return code return code
This diff is collapsed.
...@@ -32,9 +32,11 @@ class Point2VoxelCommon(pccm.ParameterizedClass): ...@@ -32,9 +32,11 @@ class Point2VoxelCommon(pccm.ParameterizedClass):
self.ndim = ndim self.ndim = ndim
self.zyx = zyx self.zyx = zyx
ret_str = f"std::array<int, {self.ndim}>" ret_str = f"std::array<int, {self.ndim}>"
ret64_str = f"std::array<int64_t, {self.ndim}>"
retf_str = f"std::array<float, {self.ndim}>" retf_str = f"std::array<float, {self.ndim}>"
retf2_str = f"std::array<float, {self.ndim * 2}>" retf2_str = f"std::array<float, {self.ndim * 2}>"
self.calc_meta_ret = f"std::tuple<{retf_str}, {ret_str}, {ret_str}, {retf2_str}>" self.calc_meta_ret = f"std::tuple<{retf_str}, {ret_str}, {ret64_str}, {retf2_str}>"
@pccm.static_function @pccm.static_function
def calc_meta_data(self): def calc_meta_data(self):
...@@ -43,8 +45,8 @@ class Point2VoxelCommon(pccm.ParameterizedClass): ...@@ -43,8 +45,8 @@ class Point2VoxelCommon(pccm.ParameterizedClass):
code.arg("coors_range_xyz", f"std::array<float, {self.ndim * 2}>") code.arg("coors_range_xyz", f"std::array<float, {self.ndim * 2}>")
code.raw(f""" code.raw(f"""
std::array<float, {self.ndim}> vsize; std::array<float, {self.ndim}> vsize;
std::array<int, {self.ndim}> grid_size, grid_stride; std::array<int, {self.ndim}> grid_size;
std::array<int64_t, {self.ndim}> grid_stride;
std::array<float, {self.ndim * 2}> coors_range; std::array<float, {self.ndim * 2}> coors_range;
""") """)
...@@ -79,9 +81,10 @@ class Point2VoxelCommon(pccm.ParameterizedClass): ...@@ -79,9 +81,10 @@ class Point2VoxelCommon(pccm.ParameterizedClass):
ret_str = f"std::array<int, {self.ndim}>" ret_str = f"std::array<int, {self.ndim}>"
retf_str = f"std::array<float, {self.ndim}>" retf_str = f"std::array<float, {self.ndim}>"
retf2_str = f"std::array<float, {self.ndim * 2}>" retf2_str = f"std::array<float, {self.ndim * 2}>"
ret64_str = f"std::array<int64_t, {self.ndim}>"
return code.ret( return code.ret(
f"std::tuple<{retf_str}, {ret_str}, {ret_str}, {retf2_str}>") f"std::tuple<{retf_str}, {ret_str}, {ret64_str}, {retf2_str}>")
@pccm.static_function @pccm.static_function
def array2tvarray(self): def array2tvarray(self):
...@@ -143,7 +146,7 @@ class Point2VoxelKernel(pccm.ParameterizedClass, pccm.pybind.PybindClassMixin): ...@@ -143,7 +146,7 @@ class Point2VoxelKernel(pccm.ParameterizedClass, pccm.pybind.PybindClassMixin):
code.arg("vsize", f"tv::array<float, {self.ndim}>") code.arg("vsize", f"tv::array<float, {self.ndim}>")
code.arg("coors_range", f"tv::array<float, {self.ndim * 2}>") code.arg("coors_range", f"tv::array<float, {self.ndim * 2}>")
code.arg("grid_bound", f"tv::array<int, {self.ndim}>") code.arg("grid_bound", f"tv::array<int, {self.ndim}>")
code.arg("grid_stride", f"tv::array<int, {self.ndim}>") code.arg("grid_stride", f"tv::array<int64_t, {self.ndim}>")
code.arg("num_points", f"int") code.arg("num_points", f"int")
point_xyz = f"{self.ndim - 1} - j" point_xyz = f"{self.ndim - 1} - j"
...@@ -163,7 +166,7 @@ class Point2VoxelKernel(pccm.ParameterizedClass, pccm.pybind.PybindClassMixin): ...@@ -163,7 +166,7 @@ class Point2VoxelKernel(pccm.ParameterizedClass, pccm.pybind.PybindClassMixin):
if ((c < 0 || c >= grid_bound[j])) {{ if ((c < 0 || c >= grid_bound[j])) {{
failed = true; failed = true;
}} }}
prod += grid_stride[j] * c; prod += grid_stride[j] * int64_t(c);
}} }}
if (!failed){{ if (!failed){{
points_indice_data[i] = prod; points_indice_data[i] = prod;
...@@ -218,7 +221,7 @@ class Point2VoxelKernel(pccm.ParameterizedClass, pccm.pybind.PybindClassMixin): ...@@ -218,7 +221,7 @@ class Point2VoxelKernel(pccm.ParameterizedClass, pccm.pybind.PybindClassMixin):
code.arg("vsize", f"tv::array<float, {self.ndim}>") code.arg("vsize", f"tv::array<float, {self.ndim}>")
code.arg("coors_range", f"tv::array<float, {self.ndim * 2}>") code.arg("coors_range", f"tv::array<float, {self.ndim * 2}>")
code.arg("grid_bound", f"tv::array<int, {self.ndim}>") code.arg("grid_bound", f"tv::array<int, {self.ndim}>")
code.arg("grid_stride", f"tv::array<int, {self.ndim}>") code.arg("grid_stride", f"tv::array<int64_t, {self.ndim}>")
code.arg("num_points", f"int") code.arg("num_points", f"int")
# TODO add backward? # TODO add backward?
...@@ -323,7 +326,7 @@ class Point2Voxel(pccm.ParameterizedClass, pccm.pybind.PybindClassMixin): ...@@ -323,7 +326,7 @@ class Point2Voxel(pccm.ParameterizedClass, pccm.pybind.PybindClassMixin):
self.add_member("vsize", f"tv::array<float, {self.ndim}>") self.add_member("vsize", f"tv::array<float, {self.ndim}>")
self.add_member("coors_range", f"tv::array<float, {self.ndim * 2}>") self.add_member("coors_range", f"tv::array<float, {self.ndim * 2}>")
self.add_member("grid_size", f"tv::array<int, {self.ndim}>") self.add_member("grid_size", f"tv::array<int, {self.ndim}>")
self.add_member("grid_stride", f"tv::array<int, {self.ndim}>") self.add_member("grid_stride", f"tv::array<int64_t, {self.ndim}>")
@pccm.pybind.mark_prop_getter(prop_name="grid_size") @pccm.pybind.mark_prop_getter(prop_name="grid_size")
@pccm.member_function @pccm.member_function
...@@ -414,7 +417,9 @@ class Point2Voxel(pccm.ParameterizedClass, pccm.pybind.PybindClassMixin): ...@@ -414,7 +417,9 @@ class Point2Voxel(pccm.ParameterizedClass, pccm.pybind.PybindClassMixin):
code.arg("voxels, indices, num_per_voxel, hashdata, point_indice_data, points_voxel_id", code.arg("voxels, indices, num_per_voxel, hashdata, point_indice_data, points_voxel_id",
"tv::Tensor") "tv::Tensor")
code.arg("vsize", f"std::array<float, {self.ndim}>") code.arg("vsize", f"std::array<float, {self.ndim}>")
code.arg("grid_size, grid_stride", f"std::array<int, {self.ndim}>") code.arg("grid_size", f"std::array<int, {self.ndim}>")
code.arg("grid_stride", f"std::array<int64_t, {self.ndim}>")
code.arg("coors_range", f"std::array<float, {self.ndim * 2}>") code.arg("coors_range", f"std::array<float, {self.ndim * 2}>")
code.arg("clear_voxels", "bool", "true") code.arg("clear_voxels", "bool", "true")
code.arg("empty_mean", "bool", "false") code.arg("empty_mean", "bool", "false")
......
...@@ -12,4 +12,5 @@ ...@@ -12,4 +12,5 @@
# See the License for the specific language governing permissions and # See the License for the specific language governing permissions and
# limitations under the License. # limitations under the License.
from .boxops import BoxOps from .boxops import BoxOps
\ No newline at end of file from .pcc import PointCloudCompress
\ No newline at end of file
import pccm
from cumm.common import TensorView
class LaunchUtils(pccm.Class):
def __init__(self):
super().__init__()
self.add_include("limits")
self.add_dependency(TensorView)
self.add_static_const("kMaxGridYZDim", "int", "65535")
@pccm.static_function
def get_blocks_threads_of_2d_tensor(self):
code = pccm.code()
code.arg("nhot", "int64_t")
code.arg("num_features", "int64_t")
code.raw(f"""
constexpr int MaxThreads = 512;
int num_blocks_X = 0;
int num_blocks_Y = 0;
int threads_X = 0;
int threads_Y = 0;
dim3 threads;
bool found = tv::dispatch_int_noexcept<512, 256, 128, 64, 32, 16>(int(num_features), [](int my, int expect){{return my >= expect;}}, [&](auto V){{
// if num_features > value in list above, run this function.
// if a value is found, other value won't be executed.
int NumFeatures = TV_DECLTYPE(V)::value;
int Num0 = MaxThreads / NumFeatures;
num_blocks_X = tv::div_up(num_features, int64_t(NumFeatures));
num_blocks_Y = tv::div_up(nhot, int64_t(Num0));
threads_X = NumFeatures;
threads_Y = Num0;
}});
if (!found){{
int NumFeatures = 16;
int Num0 = MaxThreads / NumFeatures;
num_blocks_X = tv::div_up(num_features, int64_t(NumFeatures));
num_blocks_Y = tv::div_up(nhot, int64_t(Num0));
threads_X = NumFeatures;
threads_Y = Num0;
}}
return std::make_tuple(num_blocks_X, num_blocks_Y, threads_X, threads_Y);
""")
code.ret("std::tuple<int, int, int, int>")
return code
# Copyright 2022 Yan Yan
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import pccm
from cumm.common import TensorView, TensorViewNVRTC
class PointCloudCompress(pccm.Class):
def __init__(self):
super().__init__()
self.add_dependency(TensorView, TensorViewNVRTC)
self.add_include("unordered_map")
self.add_include("tensorview/hash/hash_functions.h")
self.add_enum_class("EncodeType", [
("XYZ_8", 0),
("XYZI_8", 1),
])
@pccm.pybind.mark
@pccm.static_function
def encode_with_order(self):
code = pccm.code()
# TODO add checksum
code.arg("points", "tv::Tensor")
code.arg("intensity", "tv::Tensor")
code.arg("ex", "float")
code.arg("ey", "float")
code.arg("ez", "float")
code.arg("type", "EncodeType")
code.arg("with_order", "bool", "false")
code.raw(f"""
namespace op = tv::arrayops;
float vx = 256.0f * ex;
float vy = 256.0f * ey;
float vz = 256.0f * ez;
std::vector<std::tuple<uint64_t, tv::array<float, 3>>> offsets;
std::unordered_map<uint64_t, std::vector<int64_t>> order;
auto N = points.dim(0);
tv::array<float, 3> vsize{{vx, vy, vz}};
tv::array<float, 3> errors{{ex, ey, ez}};
tv::Tensor order_ten;
int64_t* order_ten_ptr = nullptr;
if (with_order){{
order_ten = tv::empty({{N}}, tv::int64);
order_ten_ptr = order_ten.data_ptr<int64_t>();
}}
using hash_t = tv::hash::SpatialHash<uint64_t>;
auto point_stride = points.stride(0);
int64_t final_size = sizeof(int64_t) * 5 + sizeof(float) * 3;
tv::Tensor res;
tv::ssprint(1);
tv::dispatch<float, double>(points.dtype(), [&](auto IP){{
using TPoint = TV_DECLTYPE(IP);
auto points_data = points.data_ptr<TPoint>();
tv::dispatch<float, double, uint8_t>(intensity.dtype(), [&](auto II){{
using TInten = TV_DECLTYPE(II);
auto intensity_data = intensity.data_ptr<TInten>();
tv::dispatch_int<static_cast<int>(EncodeType::XYZI_8), static_cast<int>(EncodeType::XYZ_8)>(static_cast<int>(type), [&](auto I){{
constexpr int kTypeInt = TV_DECLTYPE(I)::value;
constexpr int kEncodeDim = kTypeInt == static_cast<int>(EncodeType::XYZI_8) ? 4 : 3;
std::unordered_map<uint64_t, std::vector<tv::array<uint8_t, kEncodeDim>>> hash;
int inten_stride = 0;
if (kEncodeDim > 3){{
TV_ASSERT_RT_ERR(!intensity.empty(), "inten must not empty");
inten_stride = intensity.stride(0);
}}
for (size_t i = 0; i < N; ++i){{
tv::array<float, 3> point = op::read_ptr<3>(points_data).template cast<float>();
auto pos_unit_voxel = point / vsize;
auto pos_int = op::apply(floorf, pos_unit_voxel).cast<int32_t>();
auto pos_enc = (point / errors - pos_int.cast<float>() * float(256)).cast<uint8_t>();
tv::array<uint8_t, kEncodeDim> enc;
tv::if_constexpr<(kEncodeDim > 3)>([&](auto _){{
TInten inten = intensity_data[0];
enc = _(tv::array<uint8_t, kEncodeDim>{{pos_enc[0], pos_enc[1], pos_enc[2], uint8_t(inten)}});
intensity_data += inten_stride;
}}, [&](auto _){{
enc = _(tv::array<uint8_t, kEncodeDim>{{pos_enc[0], pos_enc[1], pos_enc[2]}});
}});
auto pos_uint = pos_int + hash_t::direct_hash_offset();
uint64_t scalar = hash_t::encode(pos_int[0], pos_int[1], pos_int[2]);
auto iter = hash.find(scalar);
if (iter == hash.end()){{
auto pos_offset = pos_int.cast<float>() * vsize;
std::vector<tv::array<uint8_t, kEncodeDim>> vec{{enc}};
offsets.push_back({{scalar, pos_offset}});
hash.insert({{scalar, vec}});
final_size += sizeof(float) * 3 + sizeof(int) + sizeof(uint8_t) * kEncodeDim;
if (with_order){{
std::vector<int64_t> order_cluster{{int64_t(i)}};
order.insert({{scalar, order_cluster}});
}}
}}else{{
// iter.value().push_back(enc);
iter->second.push_back(enc);
final_size += sizeof(uint8_t) * kEncodeDim;
if (with_order){{
order.at(scalar).push_back(i);
}}
}}
points_data += point_stride;
}}
res = tv::empty({{final_size}}, tv::uint8, -1);
auto res_ptr = res.raw_data();
int64_t* res_ptr_header = reinterpret_cast<int64_t*>(res_ptr);
res_ptr_header[0] = int64_t(final_size);
res_ptr_header[1] = static_cast<int>(type);
res_ptr_header[2] = int64_t(N);
res_ptr_header[3] = int64_t(offsets.size());
res_ptr_header[4] = 0;
// TODO add checksum in header
res_ptr += sizeof(int64_t) * 5;
float* error_header = reinterpret_cast<float*>(res_ptr);
error_header[0] = errors[0];
error_header[1] = errors[1];
error_header[2] = errors[2];
res_ptr += sizeof(float) * 3;
for (auto& p : offsets){{
auto& offset = std::get<1>(p);
auto& encodes = hash.at(std::get<0>(p));
int cluster_size = encodes.size();
reinterpret_cast<int*>(res_ptr)[0] = cluster_size;
reinterpret_cast<float*>(res_ptr)[1] = offset[0];
reinterpret_cast<float*>(res_ptr)[2] = offset[1];
reinterpret_cast<float*>(res_ptr)[3] = offset[2];
res_ptr += sizeof(int) + sizeof(float) * 3;
}}
for (auto& p : offsets){{
auto& offset = std::get<1>(p);
auto& encodes = hash.at(std::get<0>(p));
int cluster_size = encodes.size();
auto enc_ptr = reinterpret_cast<tv::array<uint8_t, kEncodeDim>*>(res_ptr);
for (int i = 0; i < cluster_size; ++i){{
enc_ptr[i] = encodes[i];
}}
if (with_order){{
auto& orders = order.at(std::get<0>(p));
for (int i = 0; i < cluster_size; ++i){{
order_ten_ptr[i] = orders[i];
}}
order_ten_ptr += cluster_size;
}}
res_ptr += cluster_size * sizeof(tv::array<uint8_t, kEncodeDim>);
}}
TV_ASSERT_RT_ERR(res_ptr - res.raw_data() == final_size, "error");
}});
}});
}});
return std::make_tuple(res, order_ten);
""")
return code.ret("std::tuple<tv::Tensor, tv::Tensor>")
@pccm.pybind.mark
@pccm.static_function
def encode_xyzi(self):
code = pccm.code()
code.arg("points", "tv::Tensor")
code.arg("intensity", "tv::Tensor")
code.arg("ex", "float")
code.arg("ey", "float")
code.arg("ez", "float")
code.raw(f"""
auto res = encode_with_order(points, intensity, ex, ey, ez, EncodeType::XYZI_8, false);
return std::get<0>(res);
""")
return code.ret("tv::Tensor")
@pccm.pybind.mark
@pccm.static_function
def encode_xyz(self):
code = pccm.code()
code.arg("points", "tv::Tensor")
code.arg("ex", "float")
code.arg("ey", "float")
code.arg("ez", "float")
code.raw(f"""
auto res = encode_with_order(points, tv::Tensor(), ex, ey, ez, EncodeType::XYZ_8, false);
return std::get<0>(res);
""")
return code.ret("tv::Tensor")
@pccm.pybind.mark
@pccm.static_function
def decode(self):
code = pccm.code()
code.arg("data", "tv::Tensor")
code.raw(f"""
namespace op = tv::arrayops;
const uint8_t* data_ptr = data.data_ptr<const uint8_t>();
auto res_ptr = data.raw_data();
int64_t* res_ptr_header = reinterpret_cast<int64_t*>(res_ptr);
int64_t final_size = res_ptr_header[0];
int type = res_ptr_header[1];
int64_t N = res_ptr_header[2];
int64_t voxel_num = res_ptr_header[3];
TV_ASSERT_RT_ERR(final_size == data.raw_size(), "size mismatch");
res_ptr += sizeof(int64_t) * 5;
float* error_header = reinterpret_cast<float*>(res_ptr);
tv::array<float, 3> error;
error[0] = error_header[0];
error[1] = error_header[1];
error[2] = error_header[2];
res_ptr += sizeof(float) * 3;
tv::Tensor points;
tv::dispatch_int<static_cast<int>(EncodeType::XYZI_8), static_cast<int>(EncodeType::XYZ_8)>(static_cast<int>(type), [&](auto I){{
constexpr int kTypeInt = TV_DECLTYPE(I)::value;
constexpr int kEncodeDim = kTypeInt == static_cast<int>(EncodeType::XYZI_8) ? 4 : 3;
points = tv::empty({{N, kEncodeDim}}, tv::float32);
auto points_ptr = points.data_ptr<float>();
auto enc_ptr = reinterpret_cast<tv::array<uint8_t, kEncodeDim>*>(res_ptr + voxel_num * (sizeof(int) * 1 + sizeof(float) * 3));
for (int i = 0; i < voxel_num; ++i){{
int cluster_size = reinterpret_cast<int*>(res_ptr)[0];
tv::array<float, 3> offset;
offset[0] = reinterpret_cast<float*>(res_ptr)[1];
offset[1] = reinterpret_cast<float*>(res_ptr)[2];
offset[2] = reinterpret_cast<float*>(res_ptr)[3];
auto point_cur_ptr = points_ptr;
for (int j = 0; j < cluster_size; ++j){{
auto& enc = enc_ptr[j];
auto point = op::slice<0, 3>(enc).template cast<float>() * error + offset;
point_cur_ptr[0] = point[0];
point_cur_ptr[1] = point[1];
point_cur_ptr[2] = point[2];
if (kEncodeDim > 3){{
point_cur_ptr[3] = enc[3];
}}
point_cur_ptr += kEncodeDim;
}}
res_ptr += sizeof(int) + sizeof(float) * 3;
enc_ptr += cluster_size;
points_ptr += cluster_size * kEncodeDim;
}}
}});
return points;
""")
return code.ret("tv::Tensor")
...@@ -270,6 +270,7 @@ class SparseConvolution(SparseModule): ...@@ -270,6 +270,7 @@ class SparseConvolution(SparseModule):
if self.training: if self.training:
msg = "act don't support backward, only used in inference" msg = "act don't support backward, only used in inference"
assert self.act_type == tv.gemm.Activation.None_, msg assert self.act_type == tv.gemm.Activation.None_, msg
if not self.subm: if not self.subm:
if self.transposed: if self.transposed:
out_spatial_shape = ops.get_deconv_output_size( out_spatial_shape = ops.get_deconv_output_size(
...@@ -428,6 +429,7 @@ class SparseConvolution(SparseModule): ...@@ -428,6 +429,7 @@ class SparseConvolution(SparseModule):
indice_pair_num, indice_pair_num,
outids.shape[0], outids.shape[0],
algo, algo,
input._timer,
bias_for_infer, bias_for_infer,
self.act_alpha, self.act_alpha,
self.act_beta, self.act_beta,
...@@ -551,7 +553,6 @@ class SparseConvolution(SparseModule): ...@@ -551,7 +553,6 @@ class SparseConvolution(SparseModule):
self.act_alpha, self.act_alpha,
self.act_beta, self.act_beta,
self.act_type) self.act_type)
if bias_for_training is not None: if bias_for_training is not None:
out_features += bias_for_training out_features += bias_for_training
if input.benchmark: if input.benchmark:
...@@ -571,6 +572,9 @@ class SparseConvolution(SparseModule): ...@@ -571,6 +572,9 @@ class SparseConvolution(SparseModule):
out_tensor.indices = outids out_tensor.indices = outids
out_tensor.indice_dict = indice_dict out_tensor.indice_dict = indice_dict
out_tensor.spatial_shape = out_spatial_shape out_tensor.spatial_shape = out_spatial_shape
# print(outids.shape, spatial_shape, self.kernel_size, self.stride, self.padding,
# self.dilation, self.output_padding, out_spatial_shape)
return out_tensor return out_tensor
def _check_subm_reuse_valid(self, inp: SparseConvTensor, def _check_subm_reuse_valid(self, inp: SparseConvTensor,
......
...@@ -1561,6 +1561,8 @@ def implicit_gemm(features: torch.Tensor, ...@@ -1561,6 +1561,8 @@ def implicit_gemm(features: torch.Tensor,
with timer.record("implicit_gemm", stream): with timer.record("implicit_gemm", stream):
for j in range(num_split): for j in range(num_split):
beta = 0 if j == 0 else 1 beta = 0 if j == 0 else 1
if bias is not None:
beta = 1
CONV.run_with_tuned_result( CONV.run_with_tuned_result(
tune_res, tune_res,
ConvOpType.kForward, ConvOpType.kForward,
......
Markdown is supported
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment