Commit 82fd7a8b authored by yan.yan

v2.1.5: add profile tool and python 3.6 for linux

parent f31eee3a
......@@ -89,7 +89,7 @@ jobs:
runs-on: ubuntu-20.04
strategy:
matrix:
python-version: ['3.7', '3.8', '3.9', '3.10'] # this version is only used for upload.
python-version: ['3.6', '3.7', '3.8', '3.9', '3.10'] # this version is only used for upload.
cuda-version: ['102', '111', '113', '114', '']
steps:
......
......@@ -14,5 +14,6 @@ jobs:
steps:
- uses: actions/stale@v4
with:
stale-issue-message: 'Close stale issues due to inactivity.'
stale-pr-message: 'Close stale PRs due to inactivity.'
stale-issue-message: 'Mark stale issues due to inactivity.'
stale-pr-message: 'Mark stale PRs due to inactivity.'
operations-per-run: 300
# Changelog
## [2.1.5] - 2021-11-10
### Added
- Add CUDA profile tool
- Add Python 3.6 support
### Changed
- Format all code
### Removed
- Remove an unnecessary device sync, slightly improving performance.
## [2.1.0] - 2021-10-31
### Added
* Add implicit gemm algorithm for all kinds of convolution with kernel volume <= 32. This algorithm is very fast with float16.
......
......@@ -13,16 +13,36 @@
See the License for the specific language governing permissions and
limitations under the License.
-->
[pypi-download]: https://img.shields.io/pypi/dm/spconv-cu114
[pypi-url]: https://pypi.org/project/spconv-cu114/
[pypi-image]: https://badge.fury.io/py/spconv-cu114.svg
[pypi-ver-cpu]: https://img.shields.io/pypi/v/spconv
[pypi-ver-114]: https://img.shields.io/pypi/v/spconv-cu114
[pypi-ver-111]: https://img.shields.io/pypi/v/spconv-cu111
[pypi-ver-113]: https://img.shields.io/pypi/v/spconv-cu113
[pypi-ver-102]: https://img.shields.io/pypi/v/spconv-cu102
[pypi-url-111]: https://pypi.org/project/spconv-cu111/
[pypi-download-111]: https://img.shields.io/pypi/dm/spconv-cu111
[pypi-url-113]: https://pypi.org/project/spconv-cu113/
[pypi-download-113]: https://img.shields.io/pypi/dm/spconv-cu113
[pypi-url-102]: https://pypi.org/project/spconv-cu102/
[pypi-download-102]: https://img.shields.io/pypi/dm/spconv-cu102
[pypi-url-114]: https://pypi.org/project/spconv-cu114/
[pypi-download-114]: https://img.shields.io/pypi/dm/spconv-cu114
[pypi-url-cpu]: https://pypi.org/project/spconv/
[pypi-download-cpu]: https://img.shields.io/pypi/dm/spconv
# SpConv: Spatially Sparse Convolution Library
[![Build Status](https://github.com/traveller59/spconv/workflows/build/badge.svg)](https://github.com/traveller59/spconv/actions?query=workflow%3Abuild) [![PyPI Version][pypi-image]][pypi-url] [![pypi monthly download][pypi-download]][pypi-url]
[![Build Status](https://github.com/traveller59/spconv/workflows/build/badge.svg)](https://github.com/traveller59/spconv/actions?query=workflow%3Abuild)
| | PyPI Version | Downloads |
| -------------- |:---------------------:| ---------------------:|
| CPU (Linux Only) | [![PyPI Version][pypi-ver-cpu]][pypi-url-cpu] | [![pypi monthly download][pypi-download-cpu]][pypi-url-cpu] |
| CUDA 10.2 | [![PyPI Version][pypi-ver-102]][pypi-url-102] | [![pypi monthly download][pypi-download-102]][pypi-url-102] |
| CUDA 11.1 | [![PyPI Version][pypi-ver-111]][pypi-url-111] | [![pypi monthly download][pypi-download-111]][pypi-url-111]|
| CUDA 11.3 (Linux Only) | [![PyPI Version][pypi-ver-113]][pypi-url-113] |[![pypi monthly download][pypi-download-113]][pypi-url-113]|
| CUDA 11.4 | [![PyPI Version][pypi-ver-114]][pypi-url-114] | [![pypi monthly download][pypi-download-114]][pypi-url-114]|
```spconv``` is a project that provides a heavily optimized sparse convolution implementation with tensor core support.
```spconv``` is a project that provides a heavily optimized sparse convolution implementation with tensor core support. Check [benchmark](docs/BENCHMARK.md) to see how fast spconv 2.x runs.
[Spconv 1.x code](https://github.com/traveller59/spconv/tree/v1.2.1). We won't provide any support for spconv 1.x since it's deprecated. Use spconv 2.x if possible. <!--remove this message in spconv 2.2-->
......@@ -99,7 +119,10 @@ The c++ code will be built automatically when you change c++ code in project.
For NVIDIA Embedded Platforms, you need to specify the CUDA arch before building: ```export CUMM_CUDA_ARCH_LIST="7.2"``` for Xavier.
You need to remove ```cumm``` from the ```requires``` section in pyproject.toml after installing ```cumm``` in editable mode and before installing spconv, due to a pyproject limitation (it can't find an editable-installed ```cumm```).
#### Linux
0. uninstall spconv and cumm installed by pip
1. install build-essential, install CUDA
2. ```git clone https://github.com/FindDefinition/cumm```, ```cd ./cumm```, ```pip install -e .```
......
<!--
Copyright 2021 Yan Yan
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
## Simple Benchmark
### Network Benchmark without batchnorm (F32/F16) in RTX 3080 Laptop GPU
Network Code: test/benchmark.py
| F32/F16 | Spconv 1.x F32 (1080Ti) | Native| Implicit Gemm | Implicit Gemm Split Mask |
| -------------- |:---------------------:|---------------------:|---------------------:| ---------------------:|
| Forward | 43ms | 21.7ms/13.7ms | 23.5ms/11.2ms | 22ms/12.2ms |
| Backward | 80ms | 41.9ms/25.2ms | 51.0ms/13.8ms | 41.1ms/12.2ms |
### Network Gemm Kernel Benchmark FP16 in RTX 3080 Laptop GPU
Network Code: test/benchmark.py
The network, input, and profiling code are the same as for the table above.
This table profiles only the **fp16 gemm kernels**, without the overhead of creating and clearing output tensors. It shows the performance upper bound of our algorithm.
| F16 | Native| Implicit Gemm | Implicit Gemm Split Mask |
| -------------- |:---------------------:|---------------------:| ---------------------:|
| Forward | 8.0ms | 4.3ms | 4.0ms |
We can see that implicit gemm is very fast: the gemm kernels take only 4.3ms of the 11.2ms network forward pass. We can achieve better performance with TensorRT + pure C++.
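As a rough check of the figures quoted above, the following sketch computes what fraction of the FP16 forward pass is spent inside the gemm kernels for each algorithm. The timings are copied from the two tables; the dictionary and function names are illustrative assumptions and are not part of spconv.

```python
# Timings in milliseconds, copied from the two benchmark tables above
# (FP16 forward pass on the RTX 3080 Laptop GPU).
# All names below are illustrative; they are not part of spconv.
FORWARD_TOTAL_MS = {
    "Native": 13.7,
    "Implicit Gemm": 11.2,
    "Implicit Gemm Split Mask": 12.2,
}
FORWARD_GEMM_MS = {
    "Native": 8.0,
    "Implicit Gemm": 4.3,
    "Implicit Gemm Split Mask": 4.0,
}


def gemm_share(algo: str) -> float:
    """Fraction of the forward pass spent in gemm kernels."""
    return FORWARD_GEMM_MS[algo] / FORWARD_TOTAL_MS[algo]


if __name__ == "__main__":
    for name in FORWARD_TOTAL_MS:
        print("{}: {:.1%} of forward time in gemm kernels".format(
            name, gemm_share(name)))
```

For ```Implicit Gemm``` this gives roughly 38%, so the remainder of the forward time is overhead outside the gemm kernels, such as creating and clearing output tensors.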
**NOTE**
When benchmarking a network on your laptop, close all apps except terminals. Other apps consume GPU resources and make kernels run slower.
## Comparison with [MinkowskiEngine](https://github.com/NVIDIA/MinkowskiEngine) and [torchsparse](https://github.com/mit-han-lab/torchsparse)
TODO
\ No newline at end of file
......@@ -25,12 +25,7 @@
* Make sure your channel size is a multiple of 8 when using fp16. A multiple of 32 is better.
* spconv 2.x on Windows 10 is 1.5x~2x slower than on Linux. Use Linux if possible.
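A minimal sketch of the channel-size rule above: round a desired channel count up to the next multiple of 8 (or 32) before defining the layers of an fp16 network. The helper name ```round_up_channels``` is hypothetical and is not provided by spconv.

```python
def round_up_channels(channels: int, multiple: int = 8) -> int:
    """Return the smallest multiple of `multiple` that is >= `channels`.

    Hypothetical helper (not part of spconv). Use multiple=8 to meet the
    fp16 requirement, or multiple=32 for the preferred alignment.
    """
    if channels <= 0 or multiple <= 0:
        raise ValueError("channels and multiple must be positive")
    # Integer ceiling division, then scale back up to a channel count.
    return ((channels + multiple - 1) // multiple) * multiple


if __name__ == "__main__":
    for wanted in (3, 30, 64, 100):
        print(wanted, "->", round_up_channels(wanted),
              "or", round_up_channels(wanted, 32))
```

The rounded value would then be used as the ```in_channels``` / ```out_channels``` of the convolution layers; padding the first layer's input features is left to the caller.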
Network Benchmark without batchnorm (F32/F16) in RTX 3080 Laptop GPU
| F32/F16 | Spconv 1.x | Native| Implicit Gemm | Implicit Gemm Split Mask |
| -------------- |:---------------------:|---------------------:|---------------------:| ---------------------:|
| Forward | 43ms | 29ms/23ms | 30ms/15ms | 30ms/19ms |
| Backward | 80ms | 47ms/32ms | 56ms/15ms | 45ms/14ms |
See [benchmark](BENCHMARK.md) for more performance details of different algorithms.
## Algorithm Overview
......@@ -57,4 +52,4 @@ In my test, ```Implicit Gemm``` is almost 2x faster than ```Native```.
TODO
In my test, ```Implicit Gemm Split Mask``` is slightly faster than ```Implicit Gemm```, but the indice generation is greatly slower, so currently we use ```Implicit Gemm``` by default.
\ No newline at end of file
In my test, ```Implicit Gemm Split Mask``` is slightly faster than ```Implicit Gemm```, but the indice generation is slower, so currently we use ```Implicit Gemm``` by default.
\ No newline at end of file
# Copyright 2021 Yan Yan
#
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#
# http://www.apache.org/licenses/LICENSE-2.0
#
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
......@@ -22,11 +22,12 @@ import torch.optim as optim
from torchvision import datasets, transforms
from torch.optim.lr_scheduler import StepLR
import contextlib
import torch.cuda.amp
import torch.cuda.amp
@contextlib.contextmanager
def identity_ctx():
yield
yield
class Net(nn.Module):
......@@ -39,14 +40,13 @@ class Net(nn.Module):
spconv.SubMConv2d(32, 64, 3, 1),
nn.ReLU(),
spconv.SparseMaxPool2d(2, 2),
spconv.ToDense(),
spconv.ToDense(),
)
self.fc1 = nn.Linear(14 * 14 * 64, 128)
self.fc2 = nn.Linear(128, 10)
self.dropout1 = nn.Dropout2d(0.25)
self.dropout2 = nn.Dropout2d(0.5)
def forward(self, x: torch.Tensor):
# x: [N, 28, 28, 1], must be NHWC tensor
x_sp = spconv.SparseConvTensor.from_dense(x.reshape(-1, 28, 28, 1))
......@@ -116,40 +116,72 @@ def test(args, model, device, test_loader):
with amp_ctx:
output = model(data)
test_loss += F.nll_loss(output, target, reduction='sum').item() # sum up batch loss
pred = output.argmax(dim=1, keepdim=True) # get the index of the max log-probability
test_loss += F.nll_loss(
output, target, reduction='sum').item() # sum up batch loss
pred = output.argmax(
dim=1,
keepdim=True) # get the index of the max log-probability
correct += pred.eq(target.view_as(pred)).sum().item()
test_loss /= len(test_loader.dataset)
print('\nTest set: Average loss: {:.4f}, Accuracy: {}/{} ({:.0f}%)\n'.format(
test_loss, correct, len(test_loader.dataset),
100. * correct / len(test_loader.dataset)))
print(
'\nTest set: Average loss: {:.4f}, Accuracy: {}/{} ({:.0f}%)\n'.format(
test_loss, correct, len(test_loader.dataset),
100. * correct / len(test_loader.dataset)))
def main():
# Training settings
parser = argparse.ArgumentParser(description='PyTorch MNIST Example')
parser.add_argument('--batch-size', type=int, default=64, metavar='N',
parser.add_argument('--batch-size',
type=int,
default=64,
metavar='N',
help='input batch size for training (default: 64)')
parser.add_argument('--test-batch-size', type=int, default=1000, metavar='N',
parser.add_argument('--test-batch-size',
type=int,
default=1000,
metavar='N',
help='input batch size for testing (default: 1000)')
parser.add_argument('--epochs', type=int, default=14, metavar='N',
parser.add_argument('--epochs',
type=int,
default=14,
metavar='N',
help='number of epochs to train (default: 14)')
parser.add_argument('--lr', type=float, default=1.0, metavar='LR',
parser.add_argument('--lr',
type=float,
default=1.0,
metavar='LR',
help='learning rate (default: 1.0)')
parser.add_argument('--gamma', type=float, default=0.7, metavar='M',
parser.add_argument('--gamma',
type=float,
default=0.7,
metavar='M',
help='Learning rate step gamma (default: 0.7)')
parser.add_argument('--no-cuda', action='store_true', default=False,
parser.add_argument('--no-cuda',
action='store_true',
default=False,
help='disables CUDA training')
parser.add_argument('--seed', type=int, default=1, metavar='S',
parser.add_argument('--seed',
type=int,
default=1,
metavar='S',
help='random seed (default: 1)')
parser.add_argument('--log-interval', type=int, default=10, metavar='N',
help='how many batches to wait before logging training status')
parser.add_argument('--save-model', action='store_true', default=False,
parser.add_argument(
'--log-interval',
type=int,
default=10,
metavar='N',
help='how many batches to wait before logging training status')
parser.add_argument('--save-model',
action='store_true',
default=False,
help='For Saving the current Model')
parser.add_argument('--fp16', action='store_true', default=False,
parser.add_argument('--fp16',
action='store_true',
default=False,
help='For mixed precision training')
args = parser.parse_args()
......@@ -161,20 +193,30 @@ def main():
kwargs = {'num_workers': 1, 'pin_memory': True} if use_cuda else {}
train_loader = torch.utils.data.DataLoader(
datasets.MNIST('../data', train=True, download=True,
transform=transforms.Compose([
transforms.ToTensor(),
# here we remove norm to get sparse tensor with lots of zeros
# transforms.Normalize((0.1307,), (0.3081,))
])),
batch_size=args.batch_size, shuffle=True, **kwargs)
datasets.MNIST(
'../data',
train=True,
download=True,
transform=transforms.Compose([
transforms.ToTensor(),
# here we remove norm to get sparse tensor with lots of zeros
# transforms.Normalize((0.1307,), (0.3081,))
])),
batch_size=args.batch_size,
shuffle=True,
**kwargs)
test_loader = torch.utils.data.DataLoader(
datasets.MNIST('../data', train=False, transform=transforms.Compose([
transforms.ToTensor(),
# here we remove norm to get sparse tensor with lots of zeros
# transforms.Normalize((0.1307,), (0.3081,))
])),
batch_size=args.test_batch_size, shuffle=True, **kwargs)
datasets.MNIST(
'../data',
train=False,
transform=transforms.Compose([
transforms.ToTensor(),
# here we remove norm to get sparse tensor with lots of zeros
# transforms.Normalize((0.1307,), (0.3081,))
])),
batch_size=args.test_batch_size,
shuffle=True,
**kwargs)
model = Net().to(device)
optimizer = optim.Adadelta(model.parameters(), lr=args.lr)
......
# Copyright 2021 Yan Yan
#
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#
# http://www.apache.org/licenses/LICENSE-2.0
#
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import numpy as np
import numpy as np
from cumm import tensorview as tv
from cumm import tensorview as tv
from spconv.utils import Point2VoxelCPU3d
from spconv.pytorch.utils import PointToVoxel
import torch
import torch
def main():
# voxel gen source code: spconv/csrc/sparse/pointops.py
gen = Point2VoxelCPU3d(
vsize_xyz=[0.1, 0.1, 0.1],
coors_range_xyz=[-80, -80, -2, 80, 80, 6],
num_point_features=3,
max_num_voxels=5000,
max_num_points_per_voxel=5)
gen = Point2VoxelCPU3d(vsize_xyz=[0.1, 0.1, 0.1],
coors_range_xyz=[-80, -80, -2, 80, 80, 6],
num_point_features=3,
max_num_voxels=5000,
max_num_points_per_voxel=5)
pc = np.random.uniform(-10, 10, size=[1000, 3])
pc_tv = tv.from_numpy(pc)
......@@ -39,20 +39,23 @@ def main():
print("------Raw Voxels-------")
print(voxels_np[0])
# run voxel gen and fill the remaining (empty) point slots of each voxel with the mean value
voxels_tv, indices_tv, num_p_in_vx_tv = gen.point_to_voxel_empty_mean(pc_tv)
voxels_tv, indices_tv, num_p_in_vx_tv = gen.point_to_voxel_empty_mean(
pc_tv)
voxels_np = voxels_tv.numpy_view()
indices_np = indices_tv.numpy_view()
num_p_in_vx_np = num_p_in_vx_tv.numpy_view()
print("------Voxels with mean filled-------")
print(voxels_np[0])
def main_point_with_features():
# voxel gen source code: spconv/csrc/sparse/pointops.py
gen = Point2VoxelCPU3d(
vsize_xyz=[0.1, 0.1, 0.1],
coors_range_xyz=[-80, -80, -2, 80, 80, 6],
num_point_features=4, # here num_point_features must equal to pc.shape[1]
max_num_voxels=5000,
vsize_xyz=[0.1, 0.1, 0.1],
coors_range_xyz=[-80, -80, -2, 80, 80, 6],
num_point_features=
4, # here num_point_features must equal to pc.shape[1]
max_num_voxels=5000,
max_num_points_per_voxel=5)
pc = np.random.uniform(-10, 10, size=[1000, 3])
......@@ -68,21 +71,22 @@ def main_point_with_features():
print("------Raw Voxels-------")
print(voxels_np[0])
# run voxel gen and fill the remaining (empty) point slots of each voxel with the mean value
voxels_tv, indices_tv, num_p_in_vx_tv = gen.point_to_voxel_empty_mean(pc_tv)
voxels_tv, indices_tv, num_p_in_vx_tv = gen.point_to_voxel_empty_mean(
pc_tv)
voxels_np = voxels_tv.numpy_view()
indices_np = indices_tv.numpy_view()
num_p_in_vx_np = num_p_in_vx_tv.numpy_view()
print("------Voxels with mean filled-------")
print(voxels_np[0])
def main_pytorch_voxel_gen():
# voxel gen source code: spconv/csrc/sparse/pointops.py
gen = PointToVoxel(
vsize_xyz=[0.1, 0.1, 0.1],
coors_range_xyz=[-80, -80, -2, 80, 80, 6],
num_point_features=3,
max_num_voxels=5000,
max_num_points_per_voxel=5)
gen = PointToVoxel(vsize_xyz=[0.1, 0.1, 0.1],
coors_range_xyz=[-80, -80, -2, 80, 80, 6],
num_point_features=3,
max_num_voxels=5000,
max_num_points_per_voxel=5)
pc = np.random.uniform(-10, 10, size=[1000, 3])
pc_th = torch.from_numpy(pc)
......@@ -100,16 +104,16 @@ def main_pytorch_voxel_gen():
print("------Voxels with mean filled-------")
print(voxels_np[0])
def main_pytorch_voxel_gen_cuda():
# voxel gen source code: spconv/csrc/sparse/pointops.py
device = torch.device("cuda:0")
gen = PointToVoxel(
vsize_xyz=[0.1, 0.1, 0.1],
coors_range_xyz=[-80, -80, -2, 80, 80, 6],
num_point_features=3,
max_num_voxels=5000,
max_num_points_per_voxel=5,
device=device)
gen = PointToVoxel(vsize_xyz=[0.1, 0.1, 0.1],
coors_range_xyz=[-80, -80, -2, 80, 80, 6],
num_point_features=3,
max_num_voxels=5000,
max_num_points_per_voxel=5,
device=device)
pc = np.random.uniform(-10, 10, size=[1000, 3]).astype(np.float32)
pc_th = torch.from_numpy(pc).to(device)
......@@ -133,4 +137,4 @@ if __name__ == "__main__":
main_point_with_features()
main_pytorch_voxel_gen()
if torch.cuda.is_available():
main_pytorch_voxel_gen_cuda()
\ No newline at end of file
main_pytorch_voxel_gen_cuda()
isort -rc --atomic ./spconv && \
isort -rc --atomic ./test && \
yapf -i --recursive -vv ./spconv ./test
find ./src -regex '.*\.\(cpp\|hpp\|cc\|cxx\|cu\|cuh\|h\)' | xargs clang-format -i
find ./include -regex '.*\.\(cpp\|hpp\|cc\|cxx\|cu\|cuh\|h\)' | xargs clang-format -i
\ No newline at end of file
yapf -i --recursive -vv ./spconv ./test ./example ./scripts
[build-system]
requires = ["setuptools>=41.0", "wheel", "pccm>=0.2.21", "cumm>=0.2.1"]
requires = ["setuptools>=41.0", "wheel", "pccm>=0.2.21", "cumm>=0.2.2"]
build-backend = "setuptools.build_meta"
......@@ -19,20 +19,21 @@ from cumm.conv.bases import NCHW, NHWC, ConvIterAlgo, ConvOpType
from cumm.conv.main import ConvMainUnitTest, gen_gemm_kernels
from cumm.conv.params import ConvProblem
from cumm.gemm import kernel
import os
import os
from spconv.core_cc.csrc.sparse.all import SpconvOps
from cumm.gemm.codeops import div_up
from spconv.constants import PACKAGE_ROOT
from spconv.core import ConvAlgo
from spconv.pytorch import ops
from spconv.pytorch import ops
from spconv.algo import CONV, BestConvAlgoByProfile
from spconv.pytorch.cppcore import torch_tensor_to_tv
def reduce_mask_count(mask: np.ndarray, width: int):
mask_length_32 = (div_up(mask.shape[0], width)) * width
if mask.shape[0] < mask_length_32:
mask_pad = np.zeros((mask_length_32,), dtype=mask.dtype)
mask_pad = np.zeros((mask_length_32, ), dtype=mask.dtype)
mask_pad[:mask.shape[0]] = mask
mask = mask_pad
mask = mask.reshape(-1, width)
......@@ -40,16 +41,18 @@ def reduce_mask_count(mask: np.ndarray, width: int):
maskr_tv = tv.from_numpy(maskr)
return SpconvOps.count_bits(maskr_tv).numpy().sum() * width
def reduce_mask_count_x(mask: np.ndarray, width: int):
mask_length_32 = (div_up(mask.shape[0], width)) * width
if mask.shape[0] < mask_length_32:
mask_pad = np.zeros((mask_length_32,), dtype=mask.dtype)
mask_pad = np.zeros((mask_length_32, ), dtype=mask.dtype)
mask_pad[:mask.shape[0]] = mask
mask = mask_pad
mask = mask.reshape(-1, width)
maskr = np.bitwise_or.reduce(mask, axis=1)
return maskr
def dev_subm_inds_v2(subm: bool = False, run_conv: bool = True):
limit_input_n = 16384
limit_input_n = None
......@@ -88,8 +91,9 @@ def dev_subm_inds_v2(subm: bool = False, run_conv: bool = True):
stride = [1] * ndim
dilation = [1] * ndim
out_padding = [0] * ndim
out_inds, pair_ref, indice_num_per_loc = ops.get_indice_pairs(indices_th, 1, spatial_shape, ConvAlgo.Native,
ksize, stride, padding, dilation, out_padding, subm)
out_inds, pair_ref, indice_num_per_loc = ops.get_indice_pairs(
indices_th, 1, spatial_shape, ConvAlgo.Native, ksize, stride, padding,
dilation, out_padding, subm)
indice_num_per_loc_np = indice_num_per_loc.cpu().numpy()
indice_pairs_np = pair_ref.cpu().numpy()
algo = ConvAlgo.MaskSplitImplicitGemm
......@@ -98,8 +102,9 @@ def dev_subm_inds_v2(subm: bool = False, run_conv: bool = True):
else:
num_split = 2
for i in range(5):
res = ops.get_indice_pairs_implicit_gemm(indices_th, 1, spatial_shape, algo,
ksize, stride, padding, dilation, out_padding, subm)
res = ops.get_indice_pairs_implicit_gemm(indices_th, 1, spatial_shape,
algo, ksize, stride, padding,
dilation, out_padding, subm)
out_inds = res[0]
num_inds_per_loc = res[1]
pair_fwd = res[2]
......@@ -115,23 +120,38 @@ def dev_subm_inds_v2(subm: bool = False, run_conv: bool = True):
mask_argsort_fwd_splits = res[6]
mask_argsort_bwd_splits = res[7]
masks = res[8]
pair_mask_fwd_splits_tv = [ops.torch_tensor_to_tv(t, dtype=tv.uint32) for t in pair_mask_fwd_splits]
valid_location_bitcount = [SpconvOps.count_bits(t) for t in pair_mask_fwd_splits_tv]
valid_location_count = sum([t.cpu().numpy().sum() for t in valid_location_bitcount])
pair_mask_fwd_splits_tv = [
ops.torch_tensor_to_tv(t, dtype=tv.uint32)
for t in pair_mask_fwd_splits
]
valid_location_bitcount = [
SpconvOps.count_bits(t) for t in pair_mask_fwd_splits_tv
]
valid_location_count = sum(
[t.cpu().numpy().sum() for t in valid_location_bitcount])
reduce_length = 32
split_mask_valid_count = sum([reduce_mask_count(t.cpu().numpy(), reduce_length) for t in pair_mask_fwd_splits_tv])
split_mask_valid_count = sum([
reduce_mask_count(t.cpu().numpy(), reduce_length)
for t in pair_mask_fwd_splits_tv
])
if subm:
print("SUBM", valid_location_count, split_mask_valid_count, pair_fwd.numel())
print("SUBM", valid_location_count, split_mask_valid_count,
pair_fwd.numel())
else:
print("REGULAR", valid_location_count, split_mask_valid_count, pair_fwd.numel())
# return
print("REGULAR", valid_location_count, split_mask_valid_count,
pair_fwd.numel())
# return
if run_conv:
C = 64
K = 64
desps = CONV.desps
mask_output_fwd = torch.zeros([2, div_up(out_inds.shape[0], 32)], dtype=torch.int32, device=indices_th.device)
mask_output_bwd = torch.zeros([2, div_up(indices.dim(0), 32)], dtype=torch.int32, device=indices_th.device)
mask_output_fwd = torch.zeros([2, div_up(out_inds.shape[0], 32)],
dtype=torch.int32,
device=indices_th.device)
mask_output_bwd = torch.zeros([2, div_up(indices.dim(0), 32)],
dtype=torch.int32,
device=indices_th.device)
for desp in desps:
if desp.algo != GemmAlgo.Simt.value:
......@@ -140,17 +160,22 @@ def dev_subm_inds_v2(subm: bool = False, run_conv: bool = True):
# continue
# if desp.tile_shape !
if desp.dtype_a == dtypes.int8.tv_dtype:
inp = np.random.randint(-1, 1, size=[voxels_np.shape[0], C]).astype(np.int8)
weight = np.random.randint(-1, 1, size=[K, *ksize, C]).astype(np.int8)
output = np.random.randint(-1, 1, size=[out_inds.shape[0], K]).astype(
dtypes.get_npdtype_from_tvdtype(desp.dtype_output))
inp = np.random.randint(-1, 1, size=[voxels_np.shape[0],
C]).astype(np.int8)
weight = np.random.randint(-1, 1, size=[K, *ksize,
C]).astype(np.int8)
output = np.random.randint(-1, 1, size=[
out_inds.shape[0], K
]).astype(dtypes.get_npdtype_from_tvdtype(desp.dtype_output))
else:
inp = np.random.uniform(-1, 1, size=[voxels_np.shape[0], C]).astype(
dtypes.get_npdtype_from_tvdtype(desp.dtype_input))
inp = np.random.uniform(-1, 1, size=[
voxels_np.shape[0], C
]).astype(dtypes.get_npdtype_from_tvdtype(desp.dtype_input))
weight = np.random.uniform(-1, 1, size=[K, *ksize, C]).astype(
dtypes.get_npdtype_from_tvdtype(desp.dtype_weight))
output = np.random.uniform(-1, 1, size=[out_inds.shape[0], K]).astype(
dtypes.get_npdtype_from_tvdtype(desp.dtype_output))
output = np.random.uniform(-1, 1, size=[
out_inds.shape[0], K
]).astype(dtypes.get_npdtype_from_tvdtype(desp.dtype_output))
weight_ref = weight.transpose(1, 2, 3, 0, 4)
weight_ref = np.ascontiguousarray(weight_ref).reshape(-1, K, C)
if desp.op_type == ConvOpType.kBackwardInput.value:
......@@ -211,19 +236,19 @@ def dev_subm_inds_v2(subm: bool = False, run_conv: bool = True):
)
else:
if desp.op_type == ConvOpType.kForward.value:
indice_pairs = pair_fwd # inp -> out
indice_pairs = pair_fwd # inp -> out
mask_ops = pair_mask_fwd_splits
mask_argsorts = mask_argsort_fwd_splits
mask_output = mask_output_fwd
elif desp.op_type == ConvOpType.kBackwardInput.value:
indice_pairs = pair_bwd # out -> inp
indice_pairs = pair_bwd # out -> inp
mask_ops = pair_mask_bwd_splits
mask_argsorts = mask_argsort_bwd_splits
mask_output = mask_output_bwd
print([bin(x.item()) for x in masks])
else:
indice_pairs = pair_fwd # inp -> out
indice_pairs = pair_fwd # inp -> out
mask_ops = pair_mask_fwd_splits
mask_argsorts = mask_argsort_fwd_splits
mask_output = mask_output_fwd
......@@ -255,7 +280,7 @@ def dev_subm_inds_v2(subm: bool = False, run_conv: bool = True):
)
torch.cuda.synchronize()
duration = time.time() - t
duration = time.time() - t
if desp.op_type == ConvOpType.kForward.value:
output_ref = np.zeros_like(output, dtype=np.float32)
# ref algorithm
......@@ -270,7 +295,9 @@ def dev_subm_inds_v2(subm: bool = False, run_conv: bool = True):
c_inds = indice_pairs_np[1][filter_offset][:nhot]
# print(a_inds_cpu[:10])
a = inp[a_inds]
cc = a.astype(np.float32) @ weight_ref[filter_offset].T.astype(np.float32)
cc = a.astype(
np.float32) @ weight_ref[filter_offset].T.astype(
np.float32)
output_ref[c_inds] += cc
output_cpu = output_tv.cpu().numpy().astype(np.float32)
......@@ -294,12 +321,18 @@ def dev_subm_inds_v2(subm: bool = False, run_conv: bool = True):
# print(a_inds_cpu[:10])
a = output[a_inds]
# NK @ KC
cc = a.astype(np.float32) @ weight_ref[filter_offset].astype(np.float32)
cc = a.astype(
np.float32) @ weight_ref[filter_offset].astype(
np.float32)
dinput_ref[c_inds] += cc
din_cpu = inp_tv.cpu().numpy()
print("ERROR", np.linalg.norm(din_cpu.reshape(-1) - dinput_ref.reshape(-1)))
print(
"ERROR",
np.linalg.norm(
din_cpu.reshape(-1) - dinput_ref.reshape(-1)))
else:
dw_ref = np.zeros_like(weight_ref, dtype=np.float32) # KV, K, C
dw_ref = np.zeros_like(weight_ref,
dtype=np.float32) # KV, K, C
for filter_offset in range(kv):
if subm and filter_offset > kv // 2:
nhot = indice_num_per_loc_np[kv - 1 - filter_offset]
......@@ -310,16 +343,20 @@ def dev_subm_inds_v2(subm: bool = False, run_conv: bool = True):
o_inds = indice_pairs_np[1][filter_offset][:nhot]
i_inds = indice_pairs_np[0][filter_offset][:nhot]
# print(a_inds_cpu[:10])
out_gather = output[o_inds] # [N, K]
inp_gather = inp[i_inds] # [N, C]
out_gather = output[o_inds] # [N, K]
inp_gather = inp[i_inds] # [N, C]
# KN @ NC
dw_res = out_gather.astype(np.float32).T @ inp_gather.astype(np.float32)
dw_res = out_gather.astype(
np.float32).T @ inp_gather.astype(np.float32)
dw_ref[filter_offset] = dw_res
# print(indice_pairs_np_test[0])
dw_ref_kcrs = dw_ref.transpose(1, 0, 2)
dw_cpu = weight_tv.cpu().numpy().reshape(K, np.prod(ksize), C)
print("ERROR", np.linalg.norm(dw_cpu.reshape(-1) - dw_ref_kcrs.reshape(-1)))
print(
"ERROR",
np.linalg.norm(
dw_cpu.reshape(-1) - dw_ref_kcrs.reshape(-1)))
if __name__ == "__main__":
......
# Copyright 2021 Yan Yan
#
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#
# http://www.apache.org/licenses/LICENSE-2.0
#
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import numpy as np
from cumm import tensorview as tv
import numpy as np
from cumm import tensorview as tv
from spconv.core_cc.csrc.sparse.all import SpconvOps
import pickle
import pickle
import torch
from spconv.pytorch.cppcore import torch_tensor_to_tv
from spconv.pytorch.cppcore import torch_tensor_to_tv
def main():
with open("/home/yy/asd.pkl", "rb") as f:
a_th = pickle.load(f)
mask_argsort = torch.empty((1, a_th.shape[1]),
dtype=torch.int32,
device=a_th.device)
dtype=torch.int32,
device=a_th.device)
a = a_th.cpu().numpy()[0]
a_tv = torch_tensor_to_tv(a_th)
......@@ -34,5 +35,6 @@ def main():
a_tv_1 = a_tv.clone()
SpconvOps.sort_1d_by_key(a_tv_1[0], mask_argsort_tv[0])
if __name__ == "__main__":
main()
\ No newline at end of file
main()
......@@ -38,9 +38,9 @@ if cuda_ver:
cuda_ver = cuda_ver.replace(".", "") # 10.2 to 102
RELEASE_NAME += "-cu{}".format(cuda_ver)
deps = ["cumm-cu{}".format(cuda_ver)]
deps = ["cumm-cu{}>=0.2.2".format(cuda_ver)]
else:
deps = ["cumm"]
deps = ["cumm>=0.2.2"]
......@@ -48,11 +48,11 @@ DESCRIPTION = 'spatial sparse convolution'
URL = 'https://github.com/traveller59/spconv'
EMAIL = 'yanyan.sub@outlook.com'
AUTHOR = 'Yan Yan'
REQUIRES_PYTHON = '>=3.7'
REQUIRES_PYTHON = '>=3.6'
VERSION = None
# What packages are required for this module to be executed?
REQUIRED = ["pccm>=0.2.19", "pybind11>=2.6.0", "fire", "numpy", *deps]
REQUIRED = ["pccm>=0.2.21", "pybind11>=2.6.0", "fire", "numpy", *deps]
# What packages are optional?
EXTRAS = {
......
# Copyright 2021 Yan Yan
#
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#
# http://www.apache.org/licenses/LICENSE-2.0
#
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
......@@ -16,4 +16,4 @@ from . import build as _build
from .core import ConvAlgo, AlgoHint
from . import constants
from .__version__ import __version__
\ No newline at end of file
from .__version__ import __version__
......@@ -24,9 +24,10 @@ from spconv.constants import NDIM_DONT_CARE
from typing import Optional
import time
from threading import Lock
import torch
import contextlib
import numpy as np
from spconv.core import ConvAlgo, AlgoHint
from spconv.tools import CUDAKernelTimer
ALL_ALGO_DESPS = GemmMainUnitTest.get_all_algo_desp()
ALL_CONV_ALGO_DESPS = ConvMainUnitTest.get_all_conv_algo_desp()
......@@ -403,7 +404,8 @@ class SimpleGemm:
alpha: float = 1.0,
beta: float = 0.0,
gather_data: tv.Tensor = tv.Tensor(),
workspace: tv.Tensor = tv.Tensor()):
workspace: tv.Tensor = tv.Tensor(),
timer: CUDAKernelTimer = CUDAKernelTimer(False)):
m, n, k = GemmMainUnitTest.extract_mnk(a.shape, b.shape, trans_a,
trans_b, trans_c,
shuffle_type.value,
......@@ -446,6 +448,9 @@ class SimpleGemm:
# stream=stream)
# GemmMainUnitTest.stream_synchronize(stream)
# gather = time.time() - tt
if timer.enable:
assert timer._timer is not None
params.timer = timer._timer
GemmMainUnitTest.matmul2(params)
# GemmMainUnitTest.stream_synchronize(stream)
......@@ -678,7 +683,8 @@ class SimpleConv:
beta: float = 0.0,
stream: int = 0,
workspace: tv.Tensor = tv.Tensor(),
verbose: bool = False):
verbose: bool = False,
timer: CUDAKernelTimer = CUDAKernelTimer(False)):
channel_k = output.dim(1)
channel_c = inp.dim(1)
# GemmMainUnitTest.stream_synchronize(stream)
......@@ -709,9 +715,11 @@ class SimpleConv:
params.mask_filter = mask_filter
params.mask_output = mask_output
params.reverse_mask = reverse_mask
if timer.enable:
assert timer._timer is not None
params.timer = timer._timer
# torch.cuda.synchronize()
# t = time.time()
params.workspace = workspace
ConvMainUnitTest.implicit_gemm2(params)
# torch.cuda.synchronize()
......@@ -724,6 +732,7 @@ class SimpleConv:
def stream_synchronize(self, stream: int):
return GemmMainUnitTest.stream_synchronize(stream)
GEMM = SimpleGemm(ALL_ALGO_DESPS)
CONV = SimpleConv(ALL_CONV_ALGO_DESPS)
......
# Copyright 2021 Yan Yan
#
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#
# http://www.apache.org/licenses/LICENSE-2.0
#
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
......@@ -19,7 +19,8 @@ from pccm.utils import project_is_editable, project_is_installed
from ccimport.compat import InWindows
from .constants import PACKAGE_NAME, PACKAGE_ROOT, DISABLE_JIT
if project_is_installed(PACKAGE_NAME) and project_is_editable(PACKAGE_NAME) and not DISABLE_JIT:
if project_is_installed(PACKAGE_NAME) and project_is_editable(
PACKAGE_NAME) and not DISABLE_JIT:
from spconv.core import SHUFFLE_SIMT_PARAMS, SHUFFLE_VOLTA_PARAMS, SHUFFLE_TURING_PARAMS
from spconv.core import IMPLGEMM_SIMT_PARAMS, IMPLGEMM_VOLTA_PARAMS, IMPLGEMM_TURING_PARAMS
......@@ -27,11 +28,13 @@ if project_is_installed(PACKAGE_NAME) and project_is_editable(PACKAGE_NAME) and
from cumm.conv.main import ConvMainUnitTest
from spconv.csrc.sparse.all import SpconvOps
cu = GemmMainUnitTest(SHUFFLE_SIMT_PARAMS + SHUFFLE_VOLTA_PARAMS + SHUFFLE_TURING_PARAMS)
cu = GemmMainUnitTest(SHUFFLE_SIMT_PARAMS + SHUFFLE_VOLTA_PARAMS +
SHUFFLE_TURING_PARAMS)
cu.namespace = "cumm.gemm.main"
convcu = ConvMainUnitTest(IMPLGEMM_SIMT_PARAMS + IMPLGEMM_VOLTA_PARAMS + IMPLGEMM_TURING_PARAMS)
convcu = ConvMainUnitTest(IMPLGEMM_SIMT_PARAMS + IMPLGEMM_VOLTA_PARAMS +
IMPLGEMM_TURING_PARAMS)
convcu.namespace = "cumm.conv.main"
objects_folder = None
objects_folder = None
if InWindows:
# windows have command line limit, so we use objects_folder to reduce command size.
objects_folder = "objects"
......
# Copyright 2021 Yan Yan
#
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#
# http://www.apache.org/licenses/LICENSE-2.0
#
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
......@@ -20,10 +20,10 @@ from pccm.utils import project_is_editable, project_is_installed
PACKAGE_NAME = "spconv"
PACKAGE_ROOT = Path(__file__).parent.resolve()
EDITABLE_INSTALLED = project_is_installed(PACKAGE_NAME) and project_is_editable(PACKAGE_NAME)
EDITABLE_INSTALLED = project_is_installed(
PACKAGE_NAME) and project_is_editable(PACKAGE_NAME)
_filter_hwio_env = os.getenv("SPCONV_FILTER_HWIO", "0")
FILTER_HWIO = _filter_hwio_env == "1"
DISABLE_JIT = os.getenv("SPCONV_DISABLE_JIT", "0") == "1"
NDIM_DONT_CARE = 3
\ No newline at end of file
NDIM_DONT_CARE = 3
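The `FILTER_HWIO` and `DISABLE_JIT` constants above are read from environment variables by comparing the value against the string `"1"`. A small sketch of that convention follows, assuming a hypothetical helper `env_flag` that is not part of spconv.

```python
import os


def env_flag(name: str, default: str = "0") -> bool:
    """Return True only when the environment variable is exactly "1".

    Hypothetical helper mirroring `os.getenv(NAME, "0") == "1"` from the diff.
    """
    return os.getenv(name, default) == "1"


# Example with a made-up variable name, so real settings are not touched.
os.environ["EXAMPLE_DISABLE_JIT"] = "1"
disable_jit = env_flag("EXAMPLE_DISABLE_JIT")
```

Under this convention, values such as `"true"` or `"yes"` do not enable the flag; only `"1"` does.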
# Copyright 2021 Yan Yan
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# Copyright 2021 Yan Yan
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.