Commit 82fd7a8b authored by yan.yan

v2.1.5: add profile tool and python 3.6 for linux

parent f31eee3a
...@@ -89,7 +89,7 @@ jobs: ...@@ -89,7 +89,7 @@ jobs:
runs-on: ubuntu-20.04 runs-on: ubuntu-20.04
strategy: strategy:
matrix: matrix:
python-version: ['3.7', '3.8', '3.9', '3.10'] # this version is only used for upload. python-version: ['3.6', '3.7', '3.8', '3.9', '3.10'] # this version is only used for upload.
cuda-version: ['102', '111', '113', '114', ''] cuda-version: ['102', '111', '113', '114', '']
steps: steps:
......
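The matrix change above adds one Python version to the upload jobs. A minimal sketch of how the job count grows, assuming the workflow runs the plain cross product of both lists (no `include`/`exclude` rules are visible in this hunk) and that the empty `cuda-version` entry stands for the CPU-only build:

```python
from itertools import product

# Values copied from the workflow matrix in the hunk above.
python_versions = ['3.6', '3.7', '3.8', '3.9', '3.10']
cuda_versions = ['102', '111', '113', '114', '']  # '' is assumed to mean CPU-only

# Every (python, cuda) pair becomes one job in a plain cross-product matrix.
jobs = list(product(python_versions, cuda_versions))

# Jobs that exist only because '3.6' was added in this commit.
new_jobs = [job for job in jobs if job[0] == '3.6']

print(len(jobs), "jobs in total,", len(new_jobs), "of them new")
```

Note that the version strings are quoted in the workflow; an unquoted `3.10` would be parsed by YAML as the number `3.1`.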
...@@ -14,5 +14,6 @@ jobs: ...@@ -14,5 +14,6 @@ jobs:
steps: steps:
- uses: actions/stale@v4 - uses: actions/stale@v4
with: with:
stale-issue-message: 'Close stale issues due to inactivity.' stale-issue-message: 'Mark stale issues due to inactivity.'
stale-pr-message: 'Close stale PRs due to inactivity.' stale-pr-message: 'Mark stale PRs due to inactivity.'
operations-per-run: 300
# Changelog # Changelog
## [2.1.5] - 2021-11-10
### Added
- Add CUDA profile tool
- Add Python 3.6 support
### Changed
- Format all code
### Removed
- Remove an unnecessary device sync, which slightly improves performance.
## [2.1.0] - 2021-10-31 ## [2.1.0] - 2021-10-31
### Addad ### Addad
* add implicit gemm algorithm for all kind of convolution with kernel volume <= 32. this algorithm is very fast with float16. * add implicit gemm algorithm for all kind of convolution with kernel volume <= 32. this algorithm is very fast with float16.
......
...@@ -13,16 +13,36 @@ ...@@ -13,16 +13,36 @@
See the License for the specific language governing permissions and See the License for the specific language governing permissions and
limitations under the License. limitations under the License.
--> -->
[pypi-ver-cpu]: https://img.shields.io/pypi/v/spconv
[pypi-download]: https://img.shields.io/pypi/dm/spconv-cu114 [pypi-ver-114]: https://img.shields.io/pypi/v/spconv-cu114
[pypi-url]: https://pypi.org/project/spconv-cu114/ [pypi-ver-111]: https://img.shields.io/pypi/v/spconv-cu111
[pypi-image]: https://badge.fury.io/py/spconv-cu114.svg [pypi-ver-113]: https://img.shields.io/pypi/v/spconv-cu113
[pypi-ver-102]: https://img.shields.io/pypi/v/spconv-cu102
[pypi-url-111]: https://pypi.org/project/spconv-cu111/
[pypi-download-111]: https://img.shields.io/pypi/dm/spconv-cu111
[pypi-url-113]: https://pypi.org/project/spconv-cu113/
[pypi-download-113]: https://img.shields.io/pypi/dm/spconv-cu113
[pypi-url-102]: https://pypi.org/project/spconv-cu102/
[pypi-download-102]: https://img.shields.io/pypi/dm/spconv-cu102
[pypi-url-114]: https://pypi.org/project/spconv-cu114/
[pypi-download-114]: https://img.shields.io/pypi/dm/spconv-cu114
[pypi-url-cpu]: https://pypi.org/project/spconv/
[pypi-download-cpu]: https://img.shields.io/pypi/dm/spconv
# SpConv: Spatially Sparse Convolution Library # SpConv: Spatially Sparse Convolution Library
[![Build Status](https://github.com/traveller59/spconv/workflows/build/badge.svg)](https://github.com/traveller59/spconv/actions?query=workflow%3Abuild) [![PyPI Version][pypi-image]][pypi-url] [![pypi monthly download][pypi-download]][pypi-url] [![Build Status](https://github.com/traveller59/spconv/workflows/build/badge.svg)](https://github.com/traveller59/spconv/actions?query=workflow%3Abuild)
| | PyPi Version | Downloads |
| -------------- |:---------------------:| ---------------------:|
| CPU (Linux Only) | [![PyPI Version][pypi-ver-cpu]][pypi-url-cpu] | [![pypi monthly download][pypi-download-cpu]][pypi-url-cpu] |
| CUDA 10.2 | [![PyPI Version][pypi-ver-102]][pypi-url-102] | [![pypi monthly download][pypi-download-102]][pypi-url-102] |
| CUDA 11.1 | [![PyPI Version][pypi-ver-111]][pypi-url-111] | [![pypi monthly download][pypi-download-111]][pypi-url-111]|
| CUDA 11.3 (Linux Only) | [![PyPI Version][pypi-ver-113]][pypi-url-113] |[![pypi monthly download][pypi-download-113]][pypi-url-113]|
| CUDA 11.4 | [![PyPI Version][pypi-ver-114]][pypi-url-114] | [![pypi monthly download][pypi-download-114]][pypi-url-114]|
```spconv``` is a project that provide heavily-optimized sparse convolution implementation with tensor core support. ```spconv``` is a project that provide heavily-optimized sparse convolution implementation with tensor core support. check [benchmark](docs/BENCHMARK.md) to see how fast spconv 2.x runs.
[Spconv 1.x code](https://github.com/traveller59/spconv/tree/v1.2.1). We won't provide any support for spconv 1.x since it's deprecated. use spconv 2.x if possible. <!--remove this message in spconv 2.2--> [Spconv 1.x code](https://github.com/traveller59/spconv/tree/v1.2.1). We won't provide any support for spconv 1.x since it's deprecated. use spconv 2.x if possible. <!--remove this message in spconv 2.2-->
...@@ -99,7 +119,10 @@ The c++ code will be built automatically when you change c++ code in project. ...@@ -99,7 +119,10 @@ The c++ code will be built automatically when you change c++ code in project.
For NVIDIA Embedded Platforms, you need to specify cuda arch before build: ```export CUMM_CUDA_ARCH_LIST="7.2"``` for xavier. For NVIDIA Embedded Platforms, you need to specify cuda arch before build: ```export CUMM_CUDA_ARCH_LIST="7.2"``` for xavier.
After installing ```cumm``` in editable mode and before installing spconv, remove ```cumm``` from the ```requires``` section of pyproject.toml. This works around a pyproject limitation: the build can't find an editable-installed ```cumm```.
#### Linux #### Linux
0. uninstall spconv and cumm installed by pip 0. uninstall spconv and cumm installed by pip
1. install build-essential, install CUDA 1. install build-essential, install CUDA
2. ```git clone https://github.com/FindDefinition/cumm```, ```cd ./cumm```, ```pip install -e .``` 2. ```git clone https://github.com/FindDefinition/cumm```, ```cd ./cumm```, ```pip install -e .```
......
<!--
Copyright 2021 Yan Yan
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
## Simple Benchmark
### Network Benchmark without batchnorm (F32/F16) in RTX 3080 Laptop GPU
Network Code: test/benchmark.py
| F32/F16 | Spconv 1.x F32 (1080Ti) | Native| Implicit Gemm | Implicit Gemm Split Mask |
| -------------- |:---------------------:|---------------------:|---------------------:| ---------------------:|
| Forward | 43ms | 21.7ms/13.7ms | 23.5ms/11.2ms | 22ms/12.2ms |
| Backward | 80ms | 41.9ms/25.2ms | 51.0ms/13.8ms | 41.1ms/12.2ms |
### Network Gemm Kernel Benchmark FP16 in RTX 3080 Laptop GPU
Network Code: test/benchmark.py
The network, input, and profiling code are the same as for the table above.
This table profiles only the **fp16 gemm kernels**, excluding output tensor creation/clearing overhead. It shows the performance upper bound of our algorithm.
| F16 | Native| Implicit Gemm | Implicit Gemm Split Mask |
| -------------- |:---------------------:|---------------------:| ---------------------:|
| Forward | 8.0ms | 4.3ms | 4.0ms |
We can see that implicit gemm is very fast: the gemm kernels account for only 4.3ms of the 11.2ms network forward pass. We can achieve better performance with TensorRT + pure C++.
**NOTE**
When benchmarking a network on your laptop, close all apps except terminals. Other apps consume GPU resources and make kernels run slower.
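The gemm share of the forward pass can be checked with a few lines of arithmetic. This is only a sketch that re-reads the two F16 forward rows of the tables above; the dictionary names are illustrative and not part of test/benchmark.py:

```python
# F16 forward timings in milliseconds, copied from the two tables above.
network_forward_f16 = {
    "Native": 13.7,
    "Implicit Gemm": 11.2,
    "Implicit Gemm Split Mask": 12.2,
}
gemm_only_f16 = {
    "Native": 8.0,
    "Implicit Gemm": 4.3,
    "Implicit Gemm Split Mask": 4.0,
}

# Fraction of the whole forward pass that is spent inside the gemm kernels.
gemm_share = {
    algo: gemm_only_f16[algo] / total
    for algo, total in network_forward_f16.items()
}

for algo, share in gemm_share.items():
    print("{}: gemm kernels take {:.0%} of the forward pass".format(algo, share))
```

The remaining time is everything outside the gemm kernels, which the tables do not break down further.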
## Comparison with [MinkowskiEngine](https://github.com/NVIDIA/MinkowskiEngine) and [torchsparse](https://github.com/mit-han-lab/torchsparse)
TODO
\ No newline at end of file
...@@ -25,12 +25,7 @@ ...@@ -25,12 +25,7 @@
* make sure your channel size is multiple of 8 when using fp16. multiple of 32 is better. * make sure your channel size is multiple of 8 when using fp16. multiple of 32 is better.
* spconv 2.x in Windows 10 is 1.5x~2x slower than Linux. use Linux if possible. * spconv 2.x in Windows 10 is 1.5x~2x slower than Linux. use Linux if possible.
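The channel-size advice above can be applied when choosing layer widths. A minimal sketch, assuming you are free to pad channel counts upward; `round_up_channels` is a hypothetical helper, not a spconv function:

```python
def round_up_channels(channels: int, multiple: int = 8) -> int:
    """Return the smallest multiple of `multiple` that is >= `channels`."""
    if channels <= 0 or multiple <= 0:
        raise ValueError("channels and multiple must be positive")
    return ((channels + multiple - 1) // multiple) * multiple


# 30 channels do not satisfy the fp16 requirement; 32 does.
print(round_up_channels(30))        # multiple of 8
print(round_up_channels(33, 32))    # multiple of 32, the preferred setting
```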
Network Benchmark without batchnorm (F32/F16) in RTX 3080 Laptop GPU See [benchmark](BENCHMARK.md) for more performance details of different algorithms.
| F32/F16 | Spconv 1.x | Native| Implicit Gemm | Implicit Gemm Split Mask |
| -------------- |:---------------------:|---------------------:|---------------------:| ---------------------:|
| Forward | 43ms | 29ms/23ms | 30ms/15ms | 30ms/19ms |
| Backward | 80ms | 47ms/32ms | 56ms/15ms | 45ms/14ms |
## Algorithm Overview ## Algorithm Overview
...@@ -57,4 +52,4 @@ In my test, ```Implicit Gemm``` is almost 2x faster than ```Native```. ...@@ -57,4 +52,4 @@ In my test, ```Implicit Gemm``` is almost 2x faster than ```Native```.
TODO TODO
In my test, ```Implicit Gemm Split Mask``` is slightly faster than ```Implicit Gemm```, but the indice generation is greatly slower, so currently we use ```Implicit Gemm``` by default. In my test, ```Implicit Gemm Split Mask``` is slightly faster than ```Implicit Gemm```, but the indice generation is slower, so currently we use ```Implicit Gemm``` by default.
\ No newline at end of file \ No newline at end of file
...@@ -24,6 +24,7 @@ from torch.optim.lr_scheduler import StepLR ...@@ -24,6 +24,7 @@ from torch.optim.lr_scheduler import StepLR
import contextlib import contextlib
import torch.cuda.amp import torch.cuda.amp
@contextlib.contextmanager @contextlib.contextmanager
def identity_ctx(): def identity_ctx():
yield yield
...@@ -46,7 +47,6 @@ class Net(nn.Module): ...@@ -46,7 +47,6 @@ class Net(nn.Module):
self.dropout1 = nn.Dropout2d(0.25) self.dropout1 = nn.Dropout2d(0.25)
self.dropout2 = nn.Dropout2d(0.5) self.dropout2 = nn.Dropout2d(0.5)
def forward(self, x: torch.Tensor): def forward(self, x: torch.Tensor):
# x: [N, 28, 28, 1], must be NHWC tensor # x: [N, 28, 28, 1], must be NHWC tensor
x_sp = spconv.SparseConvTensor.from_dense(x.reshape(-1, 28, 28, 1)) x_sp = spconv.SparseConvTensor.from_dense(x.reshape(-1, 28, 28, 1))
...@@ -116,13 +116,17 @@ def test(args, model, device, test_loader): ...@@ -116,13 +116,17 @@ def test(args, model, device, test_loader):
with amp_ctx: with amp_ctx:
output = model(data) output = model(data)
test_loss += F.nll_loss(output, target, reduction='sum').item() # sum up batch loss test_loss += F.nll_loss(
pred = output.argmax(dim=1, keepdim=True) # get the index of the max log-probability output, target, reduction='sum').item() # sum up batch loss
pred = output.argmax(
dim=1,
keepdim=True) # get the index of the max log-probability
correct += pred.eq(target.view_as(pred)).sum().item() correct += pred.eq(target.view_as(pred)).sum().item()
test_loss /= len(test_loader.dataset) test_loss /= len(test_loader.dataset)
print('\nTest set: Average loss: {:.4f}, Accuracy: {}/{} ({:.0f}%)\n'.format( print(
'\nTest set: Average loss: {:.4f}, Accuracy: {}/{} ({:.0f}%)\n'.format(
test_loss, correct, len(test_loader.dataset), test_loss, correct, len(test_loader.dataset),
100. * correct / len(test_loader.dataset))) 100. * correct / len(test_loader.dataset)))
...@@ -130,26 +134,54 @@ def test(args, model, device, test_loader): ...@@ -130,26 +134,54 @@ def test(args, model, device, test_loader):
def main(): def main():
# Training settings # Training settings
parser = argparse.ArgumentParser(description='PyTorch MNIST Example') parser = argparse.ArgumentParser(description='PyTorch MNIST Example')
parser.add_argument('--batch-size', type=int, default=64, metavar='N', parser.add_argument('--batch-size',
type=int,
default=64,
metavar='N',
help='input batch size for training (default: 64)') help='input batch size for training (default: 64)')
parser.add_argument('--test-batch-size', type=int, default=1000, metavar='N', parser.add_argument('--test-batch-size',
type=int,
default=1000,
metavar='N',
help='input batch size for testing (default: 1000)') help='input batch size for testing (default: 1000)')
parser.add_argument('--epochs', type=int, default=14, metavar='N', parser.add_argument('--epochs',
type=int,
default=14,
metavar='N',
help='number of epochs to train (default: 14)') help='number of epochs to train (default: 14)')
parser.add_argument('--lr', type=float, default=1.0, metavar='LR', parser.add_argument('--lr',
type=float,
default=1.0,
metavar='LR',
help='learning rate (default: 1.0)') help='learning rate (default: 1.0)')
parser.add_argument('--gamma', type=float, default=0.7, metavar='M', parser.add_argument('--gamma',
type=float,
default=0.7,
metavar='M',
help='Learning rate step gamma (default: 0.7)') help='Learning rate step gamma (default: 0.7)')
parser.add_argument('--no-cuda', action='store_true', default=False, parser.add_argument('--no-cuda',
action='store_true',
default=False,
help='disables CUDA training') help='disables CUDA training')
parser.add_argument('--seed', type=int, default=1, metavar='S', parser.add_argument('--seed',
type=int,
default=1,
metavar='S',
help='random seed (default: 1)') help='random seed (default: 1)')
parser.add_argument('--log-interval', type=int, default=10, metavar='N', parser.add_argument(
'--log-interval',
type=int,
default=10,
metavar='N',
help='how many batches to wait before logging training status') help='how many batches to wait before logging training status')
parser.add_argument('--save-model', action='store_true', default=False, parser.add_argument('--save-model',
action='store_true',
default=False,
help='For Saving the current Model') help='For Saving the current Model')
parser.add_argument('--fp16', action='store_true', default=False, parser.add_argument('--fp16',
action='store_true',
default=False,
help='For mixed precision training') help='For mixed precision training')
args = parser.parse_args() args = parser.parse_args()
...@@ -161,20 +193,30 @@ def main(): ...@@ -161,20 +193,30 @@ def main():
kwargs = {'num_workers': 1, 'pin_memory': True} if use_cuda else {} kwargs = {'num_workers': 1, 'pin_memory': True} if use_cuda else {}
train_loader = torch.utils.data.DataLoader( train_loader = torch.utils.data.DataLoader(
datasets.MNIST('../data', train=True, download=True, datasets.MNIST(
'../data',
train=True,
download=True,
transform=transforms.Compose([ transform=transforms.Compose([
transforms.ToTensor(), transforms.ToTensor(),
# here we remove norm to get sparse tensor with lots of zeros # here we remove norm to get sparse tensor with lots of zeros
# transforms.Normalize((0.1307,), (0.3081,)) # transforms.Normalize((0.1307,), (0.3081,))
])), ])),
batch_size=args.batch_size, shuffle=True, **kwargs) batch_size=args.batch_size,
shuffle=True,
**kwargs)
test_loader = torch.utils.data.DataLoader( test_loader = torch.utils.data.DataLoader(
datasets.MNIST('../data', train=False, transform=transforms.Compose([ datasets.MNIST(
'../data',
train=False,
transform=transforms.Compose([
transforms.ToTensor(), transforms.ToTensor(),
# here we remove norm to get sparse tensor with lots of zeros # here we remove norm to get sparse tensor with lots of zeros
# transforms.Normalize((0.1307,), (0.3081,)) # transforms.Normalize((0.1307,), (0.3081,))
])), ])),
batch_size=args.test_batch_size, shuffle=True, **kwargs) batch_size=args.test_batch_size,
shuffle=True,
**kwargs)
model = Net().to(device) model = Net().to(device)
optimizer = optim.Adadelta(model.parameters(), lr=args.lr) optimizer = optim.Adadelta(model.parameters(), lr=args.lr)
......
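The example defines `identity_ctx` so that `test` can always write `with amp_ctx:` whether or not `--fp16` is set. The code that builds `amp_ctx` is elided in this diff, so the following is only a sketch of the pattern: `pick_amp_ctx` and `fake_autocast` are hypothetical names, and in the real script the factory would presumably be `torch.cuda.amp.autocast`:

```python
import contextlib

events = []


@contextlib.contextmanager
def identity_ctx():
    # Does nothing; lets callers use `with` unconditionally.
    yield


@contextlib.contextmanager
def fake_autocast():
    # Stand-in for an autocast context (assumption), so the sketch runs without torch.
    events.append("enter")
    try:
        yield
    finally:
        events.append("exit")


def pick_amp_ctx(use_fp16, autocast_factory):
    # Hypothetical helper: choose the real context only when fp16 is requested.
    return autocast_factory() if use_fp16 else identity_ctx()


with pick_amp_ctx(True, fake_autocast):
    events.append("forward")
print(events)
```

On Python 3.7 and later, `contextlib.nullcontext()` offers the same no-op behaviour; a hand-written `identity_ctx` also works on Python 3.6, which this commit adds support for.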
...@@ -19,10 +19,10 @@ from spconv.utils import Point2VoxelCPU3d ...@@ -19,10 +19,10 @@ from spconv.utils import Point2VoxelCPU3d
from spconv.pytorch.utils import PointToVoxel from spconv.pytorch.utils import PointToVoxel
import torch import torch
def main(): def main():
# voxel gen source code: spconv/csrc/sparse/pointops.py # voxel gen source code: spconv/csrc/sparse/pointops.py
gen = Point2VoxelCPU3d( gen = Point2VoxelCPU3d(vsize_xyz=[0.1, 0.1, 0.1],
vsize_xyz=[0.1, 0.1, 0.1],
coors_range_xyz=[-80, -80, -2, 80, 80, 6], coors_range_xyz=[-80, -80, -2, 80, 80, 6],
num_point_features=3, num_point_features=3,
max_num_voxels=5000, max_num_voxels=5000,
...@@ -39,19 +39,22 @@ def main(): ...@@ -39,19 +39,22 @@ def main():
print("------Raw Voxels-------") print("------Raw Voxels-------")
print(voxels_np[0]) print(voxels_np[0])
# run voxel gen and FILL MEAN VALUE to voxel remain # run voxel gen and FILL MEAN VALUE to voxel remain
voxels_tv, indices_tv, num_p_in_vx_tv = gen.point_to_voxel_empty_mean(pc_tv) voxels_tv, indices_tv, num_p_in_vx_tv = gen.point_to_voxel_empty_mean(
pc_tv)
voxels_np = voxels_tv.numpy_view() voxels_np = voxels_tv.numpy_view()
indices_np = indices_tv.numpy_view() indices_np = indices_tv.numpy_view()
num_p_in_vx_np = num_p_in_vx_tv.numpy_view() num_p_in_vx_np = num_p_in_vx_tv.numpy_view()
print("------Voxels with mean filled-------") print("------Voxels with mean filled-------")
print(voxels_np[0]) print(voxels_np[0])
def main_point_with_features(): def main_point_with_features():
# voxel gen source code: spconv/csrc/sparse/pointops.py # voxel gen source code: spconv/csrc/sparse/pointops.py
gen = Point2VoxelCPU3d( gen = Point2VoxelCPU3d(
vsize_xyz=[0.1, 0.1, 0.1], vsize_xyz=[0.1, 0.1, 0.1],
coors_range_xyz=[-80, -80, -2, 80, 80, 6], coors_range_xyz=[-80, -80, -2, 80, 80, 6],
num_point_features=4, # here num_point_features must equal to pc.shape[1] num_point_features=
4, # here num_point_features must equal to pc.shape[1]
max_num_voxels=5000, max_num_voxels=5000,
max_num_points_per_voxel=5) max_num_points_per_voxel=5)
...@@ -68,17 +71,18 @@ def main_point_with_features(): ...@@ -68,17 +71,18 @@ def main_point_with_features():
print("------Raw Voxels-------") print("------Raw Voxels-------")
print(voxels_np[0]) print(voxels_np[0])
# run voxel gen and FILL MEAN VALUE to voxel remain # run voxel gen and FILL MEAN VALUE to voxel remain
voxels_tv, indices_tv, num_p_in_vx_tv = gen.point_to_voxel_empty_mean(pc_tv) voxels_tv, indices_tv, num_p_in_vx_tv = gen.point_to_voxel_empty_mean(
pc_tv)
voxels_np = voxels_tv.numpy_view() voxels_np = voxels_tv.numpy_view()
indices_np = indices_tv.numpy_view() indices_np = indices_tv.numpy_view()
num_p_in_vx_np = num_p_in_vx_tv.numpy_view() num_p_in_vx_np = num_p_in_vx_tv.numpy_view()
print("------Voxels with mean filled-------") print("------Voxels with mean filled-------")
print(voxels_np[0]) print(voxels_np[0])
def main_pytorch_voxel_gen(): def main_pytorch_voxel_gen():
# voxel gen source code: spconv/csrc/sparse/pointops.py # voxel gen source code: spconv/csrc/sparse/pointops.py
gen = PointToVoxel( gen = PointToVoxel(vsize_xyz=[0.1, 0.1, 0.1],
vsize_xyz=[0.1, 0.1, 0.1],
coors_range_xyz=[-80, -80, -2, 80, 80, 6], coors_range_xyz=[-80, -80, -2, 80, 80, 6],
num_point_features=3, num_point_features=3,
max_num_voxels=5000, max_num_voxels=5000,
...@@ -100,11 +104,11 @@ def main_pytorch_voxel_gen(): ...@@ -100,11 +104,11 @@ def main_pytorch_voxel_gen():
print("------Voxels with mean filled-------") print("------Voxels with mean filled-------")
print(voxels_np[0]) print(voxels_np[0])
def main_pytorch_voxel_gen_cuda(): def main_pytorch_voxel_gen_cuda():
# voxel gen source code: spconv/csrc/sparse/pointops.py # voxel gen source code: spconv/csrc/sparse/pointops.py
device = torch.device("cuda:0") device = torch.device("cuda:0")
gen = PointToVoxel( gen = PointToVoxel(vsize_xyz=[0.1, 0.1, 0.1],
vsize_xyz=[0.1, 0.1, 0.1],
coors_range_xyz=[-80, -80, -2, 80, 80, 6], coors_range_xyz=[-80, -80, -2, 80, 80, 6],
num_point_features=3, num_point_features=3,
max_num_voxels=5000, max_num_voxels=5000,
......
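The `vsize_xyz` and `coors_range_xyz` arguments used throughout this example define a regular voxel grid. A rough sketch of the arithmetic they imply, written in plain Python; this illustrates the idea only and is not spconv's implementation (in particular, the xyz index order here is an assumption made for readability):

```python
import math


def grid_shape(vsize_xyz, coors_range_xyz):
    """Number of voxels along x, y, z for the given range and voxel size."""
    lo, hi = coors_range_xyz[:3], coors_range_xyz[3:]
    return tuple(round((h - l) / v) for l, h, v in zip(lo, hi, vsize_xyz))


def voxel_index(point_xyz, vsize_xyz, coors_range_xyz):
    """Integer voxel coordinate of a point, or None if it lies outside the range."""
    lo = coors_range_xyz[:3]
    shape = grid_shape(vsize_xyz, coors_range_xyz)
    idx = tuple(
        math.floor((p - l) / v) for p, l, v in zip(point_xyz, lo, vsize_xyz))
    if any(i < 0 or i >= n for i, n in zip(idx, shape)):
        return None
    return idx


vsize = [0.1, 0.1, 0.1]
coors_range = [-80, -80, -2, 80, 80, 6]
print(grid_shape(vsize, coors_range))
print(voxel_index((0.05, 0.05, -1.95), vsize, coors_range))
```

With these settings the grid holds far more cells than `max_num_voxels=5000`, so only a small fraction of cells can ever be occupied in the generator's output.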
isort -rc --atomic ./spconv && \ yapf -i --recursive -vv ./spconv ./test ./example ./scripts
isort -rc --atomic ./test && \
yapf -i --recursive -vv ./spconv ./test
find ./src -regex '.*\.\(cpp\|hpp\|cc\|cxx\|cu\|cuh\|h\)' | xargs clang-format -i
find ./include -regex '.*\.\(cpp\|hpp\|cc\|cxx\|cu\|cuh\|h\)' | xargs clang-format -i
\ No newline at end of file
[build-system] [build-system]
requires = ["setuptools>=41.0", "wheel", "pccm>=0.2.21", "cumm>=0.2.1"] requires = ["setuptools>=41.0", "wheel", "pccm>=0.2.21", "cumm>=0.2.2"]
build-backend = "setuptools.build_meta" build-backend = "setuptools.build_meta"
...@@ -29,10 +29,11 @@ from spconv.pytorch import ops ...@@ -29,10 +29,11 @@ from spconv.pytorch import ops
from spconv.algo import CONV, BestConvAlgoByProfile from spconv.algo import CONV, BestConvAlgoByProfile
from spconv.pytorch.cppcore import torch_tensor_to_tv from spconv.pytorch.cppcore import torch_tensor_to_tv
def reduce_mask_count(mask: np.ndarray, width: int): def reduce_mask_count(mask: np.ndarray, width: int):
mask_length_32 = (div_up(mask.shape[0], width)) * width mask_length_32 = (div_up(mask.shape[0], width)) * width
if mask.shape[0] < mask_length_32: if mask.shape[0] < mask_length_32:
mask_pad = np.zeros((mask_length_32,), dtype=mask.dtype) mask_pad = np.zeros((mask_length_32, ), dtype=mask.dtype)
mask_pad[:mask.shape[0]] = mask mask_pad[:mask.shape[0]] = mask
mask = mask_pad mask = mask_pad
mask = mask.reshape(-1, width) mask = mask.reshape(-1, width)
...@@ -40,16 +41,18 @@ def reduce_mask_count(mask: np.ndarray, width: int): ...@@ -40,16 +41,18 @@ def reduce_mask_count(mask: np.ndarray, width: int):
maskr_tv = tv.from_numpy(maskr) maskr_tv = tv.from_numpy(maskr)
return SpconvOps.count_bits(maskr_tv).numpy().sum() * width return SpconvOps.count_bits(maskr_tv).numpy().sum() * width
def reduce_mask_count_x(mask: np.ndarray, width: int): def reduce_mask_count_x(mask: np.ndarray, width: int):
mask_length_32 = (div_up(mask.shape[0], width)) * width mask_length_32 = (div_up(mask.shape[0], width)) * width
if mask.shape[0] < mask_length_32: if mask.shape[0] < mask_length_32:
mask_pad = np.zeros((mask_length_32,), dtype=mask.dtype) mask_pad = np.zeros((mask_length_32, ), dtype=mask.dtype)
mask_pad[:mask.shape[0]] = mask mask_pad[:mask.shape[0]] = mask
mask = mask_pad mask = mask_pad
mask = mask.reshape(-1, width) mask = mask.reshape(-1, width)
maskr = np.bitwise_or.reduce(mask, axis=1) maskr = np.bitwise_or.reduce(mask, axis=1)
return maskr return maskr
def dev_subm_inds_v2(subm: bool = False, run_conv: bool = True): def dev_subm_inds_v2(subm: bool = False, run_conv: bool = True):
limit_input_n = 16384 limit_input_n = 16384
limit_input_n = None limit_input_n = None
...@@ -88,8 +91,9 @@ def dev_subm_inds_v2(subm: bool = False, run_conv: bool = True): ...@@ -88,8 +91,9 @@ def dev_subm_inds_v2(subm: bool = False, run_conv: bool = True):
stride = [1] * ndim stride = [1] * ndim
dilation = [1] * ndim dilation = [1] * ndim
out_padding = [0] * ndim out_padding = [0] * ndim
out_inds, pair_ref, indice_num_per_loc = ops.get_indice_pairs(indices_th, 1, spatial_shape, ConvAlgo.Native, out_inds, pair_ref, indice_num_per_loc = ops.get_indice_pairs(
ksize, stride, padding, dilation, out_padding, subm) indices_th, 1, spatial_shape, ConvAlgo.Native, ksize, stride, padding,
dilation, out_padding, subm)
indice_num_per_loc_np = indice_num_per_loc.cpu().numpy() indice_num_per_loc_np = indice_num_per_loc.cpu().numpy()
indice_pairs_np = pair_ref.cpu().numpy() indice_pairs_np = pair_ref.cpu().numpy()
algo = ConvAlgo.MaskSplitImplicitGemm algo = ConvAlgo.MaskSplitImplicitGemm
...@@ -98,8 +102,9 @@ def dev_subm_inds_v2(subm: bool = False, run_conv: bool = True): ...@@ -98,8 +102,9 @@ def dev_subm_inds_v2(subm: bool = False, run_conv: bool = True):
else: else:
num_split = 2 num_split = 2
for i in range(5): for i in range(5):
res = ops.get_indice_pairs_implicit_gemm(indices_th, 1, spatial_shape, algo, res = ops.get_indice_pairs_implicit_gemm(indices_th, 1, spatial_shape,
ksize, stride, padding, dilation, out_padding, subm) algo, ksize, stride, padding,
dilation, out_padding, subm)
out_inds = res[0] out_inds = res[0]
num_inds_per_loc = res[1] num_inds_per_loc = res[1]
pair_fwd = res[2] pair_fwd = res[2]
...@@ -115,23 +120,38 @@ def dev_subm_inds_v2(subm: bool = False, run_conv: bool = True): ...@@ -115,23 +120,38 @@ def dev_subm_inds_v2(subm: bool = False, run_conv: bool = True):
mask_argsort_fwd_splits = res[6] mask_argsort_fwd_splits = res[6]
mask_argsort_bwd_splits = res[7] mask_argsort_bwd_splits = res[7]
masks = res[8] masks = res[8]
pair_mask_fwd_splits_tv = [ops.torch_tensor_to_tv(t, dtype=tv.uint32) for t in pair_mask_fwd_splits] pair_mask_fwd_splits_tv = [
valid_location_bitcount = [SpconvOps.count_bits(t) for t in pair_mask_fwd_splits_tv] ops.torch_tensor_to_tv(t, dtype=tv.uint32)
valid_location_count = sum([t.cpu().numpy().sum() for t in valid_location_bitcount]) for t in pair_mask_fwd_splits
]
valid_location_bitcount = [
SpconvOps.count_bits(t) for t in pair_mask_fwd_splits_tv
]
valid_location_count = sum(
[t.cpu().numpy().sum() for t in valid_location_bitcount])
reduce_length = 32 reduce_length = 32
split_mask_valid_count = sum([reduce_mask_count(t.cpu().numpy(), reduce_length) for t in pair_mask_fwd_splits_tv]) split_mask_valid_count = sum([
reduce_mask_count(t.cpu().numpy(), reduce_length)
for t in pair_mask_fwd_splits_tv
])
if subm: if subm:
print("SUBM", valid_location_count, split_mask_valid_count, pair_fwd.numel()) print("SUBM", valid_location_count, split_mask_valid_count,
pair_fwd.numel())
else: else:
print("REGULAR", valid_location_count, split_mask_valid_count, pair_fwd.numel()) print("REGULAR", valid_location_count, split_mask_valid_count,
pair_fwd.numel())
# return # return
if run_conv: if run_conv:
C = 64 C = 64
K = 64 K = 64
desps = CONV.desps desps = CONV.desps
mask_output_fwd = torch.zeros([2, div_up(out_inds.shape[0], 32)], dtype=torch.int32, device=indices_th.device) mask_output_fwd = torch.zeros([2, div_up(out_inds.shape[0], 32)],
mask_output_bwd = torch.zeros([2, div_up(indices.dim(0), 32)], dtype=torch.int32, device=indices_th.device) dtype=torch.int32,
device=indices_th.device)
mask_output_bwd = torch.zeros([2, div_up(indices.dim(0), 32)],
dtype=torch.int32,
device=indices_th.device)
for desp in desps: for desp in desps:
if desp.algo != GemmAlgo.Simt.value: if desp.algo != GemmAlgo.Simt.value:
...@@ -140,17 +160,22 @@ def dev_subm_inds_v2(subm: bool = False, run_conv: bool = True): ...@@ -140,17 +160,22 @@ def dev_subm_inds_v2(subm: bool = False, run_conv: bool = True):
# continue # continue
# if desp.tile_shape ! # if desp.tile_shape !
if desp.dtype_a == dtypes.int8.tv_dtype: if desp.dtype_a == dtypes.int8.tv_dtype:
inp = np.random.randint(-1, 1, size=[voxels_np.shape[0], C]).astype(np.int8) inp = np.random.randint(-1, 1, size=[voxels_np.shape[0],
weight = np.random.randint(-1, 1, size=[K, *ksize, C]).astype(np.int8) C]).astype(np.int8)
output = np.random.randint(-1, 1, size=[out_inds.shape[0], K]).astype( weight = np.random.randint(-1, 1, size=[K, *ksize,
dtypes.get_npdtype_from_tvdtype(desp.dtype_output)) C]).astype(np.int8)
output = np.random.randint(-1, 1, size=[
out_inds.shape[0], K
]).astype(dtypes.get_npdtype_from_tvdtype(desp.dtype_output))
else: else:
inp = np.random.uniform(-1, 1, size=[voxels_np.shape[0], C]).astype( inp = np.random.uniform(-1, 1, size=[
dtypes.get_npdtype_from_tvdtype(desp.dtype_input)) voxels_np.shape[0], C
]).astype(dtypes.get_npdtype_from_tvdtype(desp.dtype_input))
weight = np.random.uniform(-1, 1, size=[K, *ksize, C]).astype( weight = np.random.uniform(-1, 1, size=[K, *ksize, C]).astype(
dtypes.get_npdtype_from_tvdtype(desp.dtype_weight)) dtypes.get_npdtype_from_tvdtype(desp.dtype_weight))
output = np.random.uniform(-1, 1, size=[out_inds.shape[0], K]).astype( output = np.random.uniform(-1, 1, size=[
dtypes.get_npdtype_from_tvdtype(desp.dtype_output)) out_inds.shape[0], K
]).astype(dtypes.get_npdtype_from_tvdtype(desp.dtype_output))
weight_ref = weight.transpose(1, 2, 3, 0, 4) weight_ref = weight.transpose(1, 2, 3, 0, 4)
weight_ref = np.ascontiguousarray(weight_ref).reshape(-1, K, C) weight_ref = np.ascontiguousarray(weight_ref).reshape(-1, K, C)
if desp.op_type == ConvOpType.kBackwardInput.value: if desp.op_type == ConvOpType.kBackwardInput.value:
...@@ -270,7 +295,9 @@ def dev_subm_inds_v2(subm: bool = False, run_conv: bool = True): ...@@ -270,7 +295,9 @@ def dev_subm_inds_v2(subm: bool = False, run_conv: bool = True):
c_inds = indice_pairs_np[1][filter_offset][:nhot] c_inds = indice_pairs_np[1][filter_offset][:nhot]
# print(a_inds_cpu[:10]) # print(a_inds_cpu[:10])
a = inp[a_inds] a = inp[a_inds]
cc = a.astype(np.float32) @ weight_ref[filter_offset].T.astype(np.float32) cc = a.astype(
np.float32) @ weight_ref[filter_offset].T.astype(
np.float32)
output_ref[c_inds] += cc output_ref[c_inds] += cc
output_cpu = output_tv.cpu().numpy().astype(np.float32) output_cpu = output_tv.cpu().numpy().astype(np.float32)
...@@ -294,12 +321,18 @@ def dev_subm_inds_v2(subm: bool = False, run_conv: bool = True): ...@@ -294,12 +321,18 @@ def dev_subm_inds_v2(subm: bool = False, run_conv: bool = True):
# print(a_inds_cpu[:10]) # print(a_inds_cpu[:10])
a = output[a_inds] a = output[a_inds]
# NK @ KC # NK @ KC
cc = a.astype(np.float32) @ weight_ref[filter_offset].astype(np.float32) cc = a.astype(
np.float32) @ weight_ref[filter_offset].astype(
np.float32)
dinput_ref[c_inds] += cc dinput_ref[c_inds] += cc
din_cpu = inp_tv.cpu().numpy() din_cpu = inp_tv.cpu().numpy()
print("ERROR", np.linalg.norm(din_cpu.reshape(-1) - dinput_ref.reshape(-1))) print(
"ERROR",
np.linalg.norm(
din_cpu.reshape(-1) - dinput_ref.reshape(-1)))
else: else:
dw_ref = np.zeros_like(weight_ref, dtype=np.float32) # KV, K, C dw_ref = np.zeros_like(weight_ref,
dtype=np.float32) # KV, K, C
for filter_offset in range(kv): for filter_offset in range(kv):
if subm and filter_offset > kv // 2: if subm and filter_offset > kv // 2:
nhot = indice_num_per_loc_np[kv - 1 - filter_offset] nhot = indice_num_per_loc_np[kv - 1 - filter_offset]
...@@ -313,13 +346,17 @@ def dev_subm_inds_v2(subm: bool = False, run_conv: bool = True): ...@@ -313,13 +346,17 @@ def dev_subm_inds_v2(subm: bool = False, run_conv: bool = True):
out_gather = output[o_inds] # [N, K] out_gather = output[o_inds] # [N, K]
inp_gather = inp[i_inds] # [N, C] inp_gather = inp[i_inds] # [N, C]
# KN @ NC # KN @ NC
dw_res = out_gather.astype(np.float32).T @ inp_gather.astype(np.float32) dw_res = out_gather.astype(
np.float32).T @ inp_gather.astype(np.float32)
dw_ref[filter_offset] = dw_res dw_ref[filter_offset] = dw_res
# print(indice_pairs_np_test[0]) # print(indice_pairs_np_test[0])
dw_ref_kcrs = dw_ref.transpose(1, 0, 2) dw_ref_kcrs = dw_ref.transpose(1, 0, 2)
dw_cpu = weight_tv.cpu().numpy().reshape(K, np.prod(ksize), C) dw_cpu = weight_tv.cpu().numpy().reshape(K, np.prod(ksize), C)
print("ERROR", np.linalg.norm(dw_cpu.reshape(-1) - dw_ref_kcrs.reshape(-1))) print(
"ERROR",
np.linalg.norm(
dw_cpu.reshape(-1) - dw_ref_kcrs.reshape(-1)))
if __name__ == "__main__": if __name__ == "__main__":
......
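`reduce_mask_count_x` in the script above pads the mask to a multiple of `width` and then ORs each group of `width` entries together. The same steps can be sketched without NumPy; `div_up` is not shown in this diff, so the version below is an assumed ceiling division, and `reduce_mask_or` is an illustrative name:

```python
from functools import reduce
from operator import or_


def div_up(a: int, b: int) -> int:
    # Assumed behaviour of the script's div_up: ceiling division.
    return (a + b - 1) // b


def reduce_mask_or(mask, width):
    """Pad `mask` with zeros to a multiple of `width`, then OR each row of `width` entries."""
    padded_len = div_up(len(mask), width) * width
    padded = list(mask) + [0] * (padded_len - len(mask))
    return [
        reduce(or_, padded[start:start + width], 0)
        for start in range(0, padded_len, width)
    ]


# Five entries with width 4: one full row and one zero-padded row.
print(reduce_mask_or([0b0001, 0b0010, 0b0100, 0b1000, 0b0001], 4))
```

Zero padding is safe here because OR with zero leaves the other bits unchanged.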
...@@ -20,6 +20,7 @@ import torch ...@@ -20,6 +20,7 @@ import torch
from spconv.pytorch.cppcore import torch_tensor_to_tv from spconv.pytorch.cppcore import torch_tensor_to_tv
def main(): def main():
with open("/home/yy/asd.pkl", "rb") as f: with open("/home/yy/asd.pkl", "rb") as f:
a_th = pickle.load(f) a_th = pickle.load(f)
...@@ -34,5 +35,6 @@ def main(): ...@@ -34,5 +35,6 @@ def main():
a_tv_1 = a_tv.clone() a_tv_1 = a_tv.clone()
SpconvOps.sort_1d_by_key(a_tv_1[0], mask_argsort_tv[0]) SpconvOps.sort_1d_by_key(a_tv_1[0], mask_argsort_tv[0])
if __name__ == "__main__": if __name__ == "__main__":
main() main()
@@ -38,9 +38,9 @@ if cuda_ver:
     cuda_ver = cuda_ver.replace(".", "")  # 10.2 to 102
     RELEASE_NAME += "-cu{}".format(cuda_ver)
-    deps = ["cumm-cu{}".format(cuda_ver)]
+    deps = ["cumm-cu{}>=0.2.2".format(cuda_ver)]
 else:
-    deps = ["cumm"]
+    deps = ["cumm>=0.2.2"]
@@ -48,11 +48,11 @@ DESCRIPTION = 'spatial sparse convolution'
 URL = 'https://github.com/traveller59/spconv'
 EMAIL = 'yanyan.sub@outlook.com'
 AUTHOR = 'Yan Yan'
-REQUIRES_PYTHON = '>=3.7'
+REQUIRES_PYTHON = '>=3.6'
 VERSION = None
 # What packages are required for this module to be executed?
-REQUIRED = ["pccm>=0.2.19", "pybind11>=2.6.0", "fire", "numpy", *deps]
+REQUIRED = ["pccm>=0.2.21", "pybind11>=2.6.0", "fire", "numpy", *deps]
 # What packages are optional?
 EXTRAS = {
......
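The `setup.py` hunks above append a CUDA suffix to the release name and pin a minimum `cumm` version for both the CUDA and CPU builds. The same naming logic can be sketched as a pure function. `build_release` is a hypothetical helper written for this note, and the base name `"spconv"` is assumed from `PACKAGE_NAME` elsewhere in this commit.

```python
from typing import List, Tuple


def build_release(cuda_ver: str, base: str = "spconv",
                  cumm_min: str = "0.2.2") -> Tuple[str, List[str]]:
    """Return (release name, cumm dependency list) for a CUDA version string.

    An empty cuda_ver selects the CPU-only package names.
    """
    name = base
    if cuda_ver:
        cuda_ver = cuda_ver.replace(".", "")  # 10.2 to 102
        name += "-cu{}".format(cuda_ver)
        deps = ["cumm-cu{}>={}".format(cuda_ver, cumm_min)]
    else:
        deps = ["cumm>={}".format(cumm_min)]
    return name, deps


cuda_release = build_release("11.4")
cpu_release = build_release("")
```

Keeping the version pin inside the format string, as the commit does, means the CUDA-specific and CPU packages always require the same minimum `cumm` release.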
@@ -24,9 +24,10 @@ from spconv.constants import NDIM_DONT_CARE
 from typing import Optional
 import time
 from threading import Lock
-import torch
+import contextlib
 import numpy as np
 from spconv.core import ConvAlgo, AlgoHint
+from spconv.tools import CUDAKernelTimer
 ALL_ALGO_DESPS = GemmMainUnitTest.get_all_algo_desp()
 ALL_CONV_ALGO_DESPS = ConvMainUnitTest.get_all_conv_algo_desp()
@@ -403,7 +404,8 @@ class SimpleGemm:
 alpha: float = 1.0,
 beta: float = 0.0,
 gather_data: tv.Tensor = tv.Tensor(),
-workspace: tv.Tensor = tv.Tensor()):
+workspace: tv.Tensor = tv.Tensor(),
+timer: CUDAKernelTimer = CUDAKernelTimer(False)):
 m, n, k = GemmMainUnitTest.extract_mnk(a.shape, b.shape, trans_a,
                                        trans_b, trans_c,
                                        shuffle_type.value,
@@ -446,6 +448,9 @@ class SimpleGemm:
 # stream=stream)
 # GemmMainUnitTest.stream_synchronize(stream)
 # gather = time.time() - tt
+if timer.enable:
+    assert timer._timer is not None
+    params.timer = timer._timer
 GemmMainUnitTest.matmul2(params)
 # GemmMainUnitTest.stream_synchronize(stream)
@@ -678,7 +683,8 @@ class SimpleConv:
 beta: float = 0.0,
 stream: int = 0,
 workspace: tv.Tensor = tv.Tensor(),
-verbose: bool = False):
+verbose: bool = False,
+timer: CUDAKernelTimer = CUDAKernelTimer(False)):
 channel_k = output.dim(1)
 channel_c = inp.dim(1)
 # GemmMainUnitTest.stream_synchronize(stream)
@@ -709,9 +715,11 @@ class SimpleConv:
 params.mask_filter = mask_filter
 params.mask_output = mask_output
 params.reverse_mask = reverse_mask
+if timer.enable:
+    assert timer._timer is not None
+    params.timer = timer._timer
 # torch.cuda.synchronize()
 # t = time.time()
+params.workspace = workspace
 ConvMainUnitTest.implicit_gemm2(params)
 # torch.cuda.synchronize()
@@ -724,6 +732,7 @@ class SimpleConv:
 def stream_synchronize(self, stream: int):
     return GemmMainUnitTest.stream_synchronize(stream)
 GEMM = SimpleGemm(ALL_ALGO_DESPS)
 CONV = SimpleConv(ALL_CONV_ALGO_DESPS)
......
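Both `SimpleGemm` and `SimpleConv` gain an optional `timer` argument that defaults to a disabled timer and is forwarded to the kernel parameters only when enabled. The pattern can be sketched in isolation as below. `FakeKernelTimer`, `Params`, and `run_kernel` are hypothetical stand-ins; this diff shows only the `enable` and `_timer` attributes of `CUDAKernelTimer`, so nothing else about its interface is assumed.

```python
from typing import Optional


class FakeKernelTimer:
    """Hypothetical stand-in exposing only the two attributes used in the diff."""

    def __init__(self, enable: bool = True):
        self.enable = enable
        # A real timer would wrap a native object; a plain object() suffices here.
        self._timer: Optional[object] = object() if enable else None


class Params:
    """Hypothetical parameter bag; the real params object comes from cumm."""

    def __init__(self) -> None:
        self.timer: Optional[object] = None


def run_kernel(timer: FakeKernelTimer = FakeKernelTimer(False)) -> Params:
    params = Params()
    # Attach the native timer only when profiling was requested.
    if timer.enable:
        assert timer._timer is not None
        params.timer = timer._timer
    # A real implementation would launch the kernel with `params` here.
    return params


disabled_params = run_kernel()
enabled_timer = FakeKernelTimer(True)
enabled_params = run_kernel(enabled_timer)
```

The default argument is evaluated once, when the function is defined, so every call without a timer shares the same disabled instance. That is harmless as long as the disabled timer is never mutated.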
@@ -19,7 +19,8 @@ from pccm.utils import project_is_editable, project_is_installed
 from ccimport.compat import InWindows
 from .constants import PACKAGE_NAME, PACKAGE_ROOT, DISABLE_JIT
-if project_is_installed(PACKAGE_NAME) and project_is_editable(PACKAGE_NAME) and not DISABLE_JIT:
+if project_is_installed(PACKAGE_NAME) and project_is_editable(
+        PACKAGE_NAME) and not DISABLE_JIT:
     from spconv.core import SHUFFLE_SIMT_PARAMS, SHUFFLE_VOLTA_PARAMS, SHUFFLE_TURING_PARAMS
     from spconv.core import IMPLGEMM_SIMT_PARAMS, IMPLGEMM_VOLTA_PARAMS, IMPLGEMM_TURING_PARAMS
@@ -27,9 +28,11 @@ if project_is_installed(PACKAGE_NAME) and project_is_editable(PACKAGE_NAME) and
     from cumm.conv.main import ConvMainUnitTest
     from spconv.csrc.sparse.all import SpconvOps
-    cu = GemmMainUnitTest(SHUFFLE_SIMT_PARAMS + SHUFFLE_VOLTA_PARAMS + SHUFFLE_TURING_PARAMS)
+    cu = GemmMainUnitTest(SHUFFLE_SIMT_PARAMS + SHUFFLE_VOLTA_PARAMS +
+                          SHUFFLE_TURING_PARAMS)
     cu.namespace = "cumm.gemm.main"
-    convcu = ConvMainUnitTest(IMPLGEMM_SIMT_PARAMS + IMPLGEMM_VOLTA_PARAMS + IMPLGEMM_TURING_PARAMS)
+    convcu = ConvMainUnitTest(IMPLGEMM_SIMT_PARAMS + IMPLGEMM_VOLTA_PARAMS +
+                              IMPLGEMM_TURING_PARAMS)
     convcu.namespace = "cumm.conv.main"
     objects_folder = None
     if InWindows:
......
@@ -20,8 +20,8 @@ from pccm.utils import project_is_editable, project_is_installed
 PACKAGE_NAME = "spconv"
 PACKAGE_ROOT = Path(__file__).parent.resolve()
-EDITABLE_INSTALLED = project_is_installed(PACKAGE_NAME) and project_is_editable(PACKAGE_NAME)
+EDITABLE_INSTALLED = project_is_installed(
+    PACKAGE_NAME) and project_is_editable(PACKAGE_NAME)
 _filter_hwio_env = os.getenv("SPCONV_FILTER_HWIO", "0")
 FILTER_HWIO = _filter_hwio_env == "1"
......
This diff is collapsed.
# Copyright 2021 Yan Yan
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.