Commit eae6a3bd authored by yan.yan

v2.1

parent fa995a4f
...@@ -16,7 +16,7 @@ jobs: ...@@ -16,7 +16,7 @@ jobs:
strategy: strategy:
matrix: matrix:
python-version: ['3.7', '3.8', '3.9', '3.10'] python-version: ['3.7', '3.8', '3.9', '3.10']
cuda-version: ['11.1', '11.4'] cuda-version: ['10.2', '11.1', '11.3', '11.4']
steps: steps:
- uses: actions/checkout@master - uses: actions/checkout@master
- uses: dorny/paths-filter@v2 - uses: dorny/paths-filter@v2
...@@ -75,7 +75,7 @@ jobs: ...@@ -75,7 +75,7 @@ jobs:
strategy: strategy:
matrix: matrix:
python-version: ['3.7', '3.8', '3.9', '3.10'] # this version is only used for upload. python-version: ['3.7', '3.8', '3.9', '3.10'] # this version is only used for upload.
cuda-version: ['111', '114'] cuda-version: ['102', '111', '113', '114', '']
steps: steps:
- uses: actions/checkout@master - uses: actions/checkout@master
...@@ -110,6 +110,17 @@ jobs: ...@@ -110,6 +110,17 @@ jobs:
chmod +x tools/build-wheels.sh chmod +x tools/build-wheels.sh
docker run --rm -e PLAT=$PLAT -e CUMM_CUDA_VERSION=${{ matrix.cuda-version }} -e SPCONV_PYTHON_LIST=${{env.PYTHON_VERSION}} -v `pwd`:/io $DOCKER_IMAGE bash -c "source /etc/bashrc && /io/tools/build-wheels.sh" docker run --rm -e PLAT=$PLAT -e CUMM_CUDA_VERSION=${{ matrix.cuda-version }} -e SPCONV_PYTHON_LIST=${{env.PYTHON_VERSION}} -v `pwd`:/io $DOCKER_IMAGE bash -c "source /etc/bashrc && /io/tools/build-wheels.sh"
- name: Build a cpu wheel
env:
CUDA_VERSION: ${{ matrix.cuda-version }}
PYTHON_VERSION: ${{ matrix.python-version }}
DOCKER_IMAGE: scrin/manylinux2014-cuda:cu114-devel-1.0.0
PLAT: manylinux2014_x86_64
if: (github.event_name == 'push' && (startsWith(github.ref, 'refs/tags')) && (env.CUDA_VERSION == '')) || ((steps.changes.outputs.nobuild == 'false') && (env.PYTHON_VERSION == '3.10') && (env.CUDA_VERSION == ''))
run: |
chmod +x tools/build-wheels.sh
docker run --rm -e PLAT=$PLAT -e CUMM_CUDA_VERSION=${{ matrix.cuda-version }} -e SPCONV_PYTHON_LIST=${{env.PYTHON_VERSION}} -v `pwd`:/io $DOCKER_IMAGE bash -c "source /etc/bashrc && /io/tools/build-wheels.sh"
- name: Publish a Python distribution to PyPI - name: Publish a Python distribution to PyPI
if: github.event_name == 'push' && startsWith(github.ref, 'refs/tags') if: github.event_name == 'push' && startsWith(github.ref, 'refs/tags')
uses: pypa/gh-action-pypi-publish@master uses: pypa/gh-action-pypi-publish@master
......
# Changelog # Changelog
## [2.1.0] - 2021-10-31
### Added
* Add an implicit gemm algorithm for all kinds of convolution with kernel volume <= 32. This algorithm is very fast with float16.
* Add a PyTorch wrapper for the voxel generator.
* Add CPU support and a CPU-only build.
## [2.0.2] - 2021-10-26
### Fixed
- Fix a serious bug where SparseSequential did nothing with non-spconv layers
- Fix a bug of ProxyableClassMeta
## [2.0.0] - 2021-10-16 ## [2.0.0] - 2021-10-16
### Changed ### Changed
- Change build system from cmake to pccm. - Change build system from cmake to pccm.
......
...@@ -14,41 +14,29 @@ ...@@ -14,41 +14,29 @@
limitations under the License. limitations under the License.
--> -->
# SpConv: PyTorch Spatially Sparse Convolution Library # SpConv: Spatially Sparse Convolution Library
[![Build Status](https://github.com/traveller59/spconv/workflows/build/badge.svg)](https://github.com/traveller59/spconv/actions?query=workflow%3Abuild) [![Build Status](https://github.com/traveller59/spconv/workflows/build/badge.svg)](https://github.com/traveller59/spconv/actions?query=workflow%3Abuild)
[Spconv 1.x code](https://github.com/traveller59/spconv/tree/v1.2.1). We won't provide any support for spconv 1.x since it's deprecated. use spconv 2.x if possible. ```spconv``` is a project that provide heavily-optimized sparse convolution implementation with tensor core support.
## Breaking changes in Spconv 2.x [Spconv 1.x code](https://github.com/traveller59/spconv/tree/v1.2.1). We won't provide any support for spconv 1.x since it's deprecated. use spconv 2.x if possible. <!--remove this message in spconv 2.2-->
* ```spconv.xxx``` move to ```spconv.pytorch.xxx```, change all ```import spconv``` to ```import spconv.pytorch as spconv``` and ```from spconv.xxx import``` to ```from spconv.pytorch.xxx import```.
* ```use_hash``` in Sparse Convolution is removed, we only use hash table in 2.x.
* ```x.features = F.relu(x)``` now raise error. use ```x = x.replace_feature(F.relu(x.features))``` instead.
* weight layout has been changed to RSKC (native algorithm) or KRSC (implicit gemm), no longer RSCK (spconv 1.x). RS is kernel size, C is input channel, K is output channel.
* all util ops are removed (pillar scatter/nms/...)
* VoxelGenerator has been replaced by Point2VoxelGPU[1-4]d/Point2VoxelCPU[1-4]d.
* spconv 2.x don't support CPU for now
* test spconv 1.x model in spconv 2.x: set environment variable before run program. Linux: ```export SPCONV_FILTER_HWIO="1"```, Windows powershell: ```$Env:SPCONV_FILTER_HWIO = "1"```
## Upcoming release Spconv 2.1.0 (Delay to 11.7.2021, sorry):
**Status**: CPU build is ready. implicit gemm is ready. working on implicit-gemm-style indice generation for standard conv/pool, and implicit-gemm-style maxpool op. ## Breaking changes in Spconv 2.x
* implicit gemm algorithm, greatly faster than native algorithm when using float16 (tested in RTX 3080 Laptop). Spconv 1.x users **NEED READ [THIS](docs/SPCONV_2_BREAKING_CHANGEs.md)** before using spconv 2.x.
* simple CPU support and CPU-only build
* add pytorch cpu/cuda voxel generator
* fix a bug of mixed precision training.
## News in Spconv 2.0.0 ## Spconv 2.1 vs Spconv 1.x
* training/inference speed is increased (+50~80% for float32) * spconv now can be installed by **pip**. see install section in readme for more details.
* support int8/tensor core * Microsoft Windows support (only windows 10 has been tested).
* fp32 (not tf32) training/inference speed is increased (+50~80%)
* fp16 training/inference speed is greatly increased when your layer support tensor core (channel size must be multiple of 8).
* int8 op is ready, but we still need some time to figure out how to run int8 in pytorch.
* doesn't depend on pytorch binary. * doesn't depend on pytorch binary.
* since spconv 2.x doesn't depend on pytorch binary (never in future), it's impossible to support torch.jit/libtorch inference. * since spconv 2.x doesn't depend on pytorch binary (never in future), it's impossible to support torch.jit/libtorch inference.
Spconv 2.1.0 vs 1.x speed: Spconv 2.1 vs 1.x speed:
| | 1080Ti Spconv 1.x F32 | 1080Ti Spconv 2.0 F32 | 3080M* Spconv 2.1 F16 | | | 1080Ti Spconv 1.x F32 | 1080Ti Spconv 2.0 F32 | 3080M* Spconv 2.1 F16 |
| -------------- |:---------------------:| ---------------------:| ----------:| | -------------- |:---------------------:| ---------------------:| ----------:|
...@@ -56,12 +44,26 @@ Spconv 2.1.0 vs 1.x speed: ...@@ -56,12 +44,26 @@ Spconv 2.1.0 vs 1.x speed:
\* 3080M (Laptop) ~= 3070 Desktop \* 3080M (Laptop) ~= 3070 Desktop
<!--
TODO Spconv vs [MinkowskiEngine](https://github.com/NVIDIA/MinkowskiEngine) vs [torchsparse](https://github.com/mit-han-lab/torchsparse)
-->
## Roadmap for Spconv 2.2-2.3:
* TensorFormat32 support for faster fp32 training when you use NVIDIA Geforce RTX 30x0/Tesla A100/Quadro RTX Ax000 (2.2)
* change implicit gemm weight layout from KRSC to RSKC to make sure we can use native algorithm with implicit gemm weight. (2.2)
* documents (2.2)
* Ampere feature support (2.3)
* pytorch int8 inference, and QAT support (2.3)
## Usage ## Usage
Firstly you need to use ```import spconv.pytorch as spconv``` in spconv 2.x. Firstly you need to use ```import spconv.pytorch as spconv``` in spconv 2.x.
Then see docs/USAGE.md. Then see [this](docs/USAGE.md).
Don't forget to check [performance guide](docs/PERFORMANCE_GUIDE.md).
## Install ## Install
...@@ -73,21 +75,50 @@ You need at least CUDA 10.2 to build and run spconv 2.x. We won't offer any supp ...@@ -73,21 +75,50 @@ You need at least CUDA 10.2 to build and run spconv 2.x. We won't offer any supp
### Prebuilt ### Prebuilt
We offer python 3.7-3.10 and 11.1/11.4 prebuilt binaries for linux (manylinux) and windows 10/11. We offer python 3.7-3.10 and cuda 10.2/11.1/11.3/11.4 prebuilt binaries for linux (manylinux) and windows 10/11.
CUDA 10.2 support will be added in version 2.0.2.
We will offer prebuilts for CUDA versions supported by latest pytorch release. For example, pytorch 1.9 support cuda 10.2 and 11.1, so we support them too. We will provide prebuilts for CUDA versions supported by latest pytorch release. For example, pytorch 1.10 provide cuda 10.2 and 11.3 prebuilts, so we provide them too.
For Linux users, you need to install pip >= 20.3 first to install prebuilt. For Linux users, you need to install pip >= 20.3 first to install prebuilt.
CUDA 11.1 will be removed in spconv 2.2 because pytorch 1.10 doesn't provide prebuilts for it.
```pip install spconv``` for CPU only (**Linux Only**). Use this only for debugging: performance isn't optimized because of manylinux limits (no OpenMP support).
```pip install spconv-cu102``` for CUDA 10.2
```pip install spconv-cu111``` for CUDA 11.1 ```pip install spconv-cu111``` for CUDA 11.1
```pip install spconv-cu113``` for CUDA 11.3
```pip install spconv-cu114``` for CUDA 11.4 ```pip install spconv-cu114``` for CUDA 11.4
**NOTE** It's safe to have different minor cuda version between system and conda (pytorch). for example, you can use spconv-cu114 with anaconda version of pytorch cuda 11.1 in a OS with CUDA 11.2 installed. **NOTE** It's safe to have different **minor** cuda version between system and conda (pytorch) **in Linux**. for example, you can use spconv-cu114 with anaconda version of pytorch cuda 11.1 in a OS with CUDA 11.2 installed.
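The prebuilt package names listed above follow one pattern: `spconv-cu` plus the CUDA version with the dot removed, and plain `spconv` for the CPU-only build. A minimal sketch of that mapping follows. The helper name `spconv_package_name` and the constant `PREBUILT_CUDA_VERSIONS` are hypothetical and not part of spconv.

```python
from typing import Optional

# CUDA versions with prebuilt wheels, as listed above for spconv 2.1.
PREBUILT_CUDA_VERSIONS = ("10.2", "11.1", "11.3", "11.4")


def spconv_package_name(cuda_version: Optional[str]) -> str:
    """Return the pip package name for a CUDA version such as "11.3".

    Hypothetical helper for illustration; it is not part of spconv.
    Pass None for the CPU-only package.
    """
    if cuda_version is None:
        return "spconv"
    if cuda_version not in PREBUILT_CUDA_VERSIONS:
        raise ValueError(f"no prebuilt wheel listed for CUDA {cuda_version}")
    return "spconv-cu" + cuda_version.replace(".", "")


print(spconv_package_name("11.3"))
print(spconv_package_name(None))
```

Rejecting unlisted versions, rather than guessing a name, matches the note that other CUDA versions require building from source.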
### Build from source for development (JIT, recommended)
The C++ code is rebuilt automatically when you change C++ code in the project.
For NVIDIA embedded platforms, you need to specify the CUDA arch before building: ```export CUMM_CUDA_ARCH_LIST="7.2"``` for Xavier.
#### Linux
0. uninstall spconv and cumm installed by pip
1. install build-essential, install CUDA
2. ```git clone https://github.com/FindDefinition/cumm```, ```cd ./cumm```, ```pip install -e .```
3. ```git clone https://github.com/traveller59/spconv```, ```cd ./spconv```, ```pip install -e .```
4. in python, ```import spconv``` and wait for the build to finish.
#### Windows
0. uninstall spconv and cumm installed by pip
1. install visual studio 2019 or newer. make sure C++ development component is installed. install CUDA
2. set [powershell script execution policy](https://docs.microsoft.com/en-us/powershell/module/microsoft.powershell.core/about/about_execution_policies?view=powershell-7.1)
3. start a new powershell, run ```tools/msvc_setup.ps1```
4. ```git clone https://github.com/FindDefinition/cumm```, ```cd ./cumm```, ```pip install -e .```
5. ```git clone https://github.com/traveller59/spconv```, ```cd ./spconv```, ```pip install -e .```
6. in python, ```import spconv``` and wait for the build to finish.
### Build from source ### Build wheel from source (not recommend, this is done in CI.)
You need to rebuild ```cumm``` first if you are build along a CUDA version that not provided in prebuilts. You need to rebuild ```cumm``` first if you are build along a CUDA version that not provided in prebuilts.
...@@ -95,15 +126,17 @@ You need to rebuild ```cumm``` first if you are build along a CUDA version that ...@@ -95,15 +126,17 @@ You need to rebuild ```cumm``` first if you are build along a CUDA version that
1. install build-essential, install CUDA 1. install build-essential, install CUDA
2. run ```export SPCONV_DISABLE_JIT="1"``` 2. run ```export SPCONV_DISABLE_JIT="1"```
3. run ```python setup.py bdist_wheel```+```pip install dists/xxx.whl``` 3. run ```pip install pccm cumm wheel```
4. run ```python setup.py bdist_wheel```+```pip install dists/xxx.whl```
#### Windows 10/11 #### Windows
1. install visual studio 2019 or newer. make sure C++ development package is installed. install CUDA 1. install visual studio 2019 or newer. make sure C++ development component is installed. install CUDA
2. set [powershell script execution policy](https://docs.microsoft.com/en-us/powershell/module/microsoft.powershell.core/about/about_execution_policies?view=powershell-7.1) 2. set [powershell script execution policy](https://docs.microsoft.com/en-us/powershell/module/microsoft.powershell.core/about/about_execution_policies?view=powershell-7.1)
3. start a new powershell, run ```tools/msvc_setup.ps1``` 3. start a new powershell, run ```tools/msvc_setup.ps1```
4. run ```$Env:SPCONV_DISABLE_JIT = "1"``` 4. run ```$Env:SPCONV_DISABLE_JIT = "1"```
5. run ```python setup.py bdist_wheel```+```pip install dists/xxx.whl``` 5. run ```pip install pccm cumm wheel```
6. run ```python setup.py bdist_wheel```+```pip install dists/xxx.whl```
## TODO in Spconv 2.x ## TODO in Spconv 2.x
- [ ] Ampere (A100 / RTX 3000 series) feature support (work in progress) - [ ] Ampere (A100 / RTX 3000 series) feature support (work in progress)
...@@ -112,7 +145,6 @@ You need to rebuild ```cumm``` first if you are build along a CUDA version that ...@@ -112,7 +145,6 @@ You need to rebuild ```cumm``` first if you are build along a CUDA version that
- [ ] Build C++ only package - [ ] Build C++ only package
- [ ] JIT compilation for CUDA kernels - [ ] JIT compilation for CUDA kernels
- [ ] Document (low priority) - [ ] Document (low priority)
- [ ] CPU support (low priority)
## Note ## Note
......
...@@ -14,3 +14,12 @@ ...@@ -14,3 +14,12 @@
limitations under the License. limitations under the License.
--> -->
# How to develop spconv 2.x
## First step
spconv 2.x is written in a unique C++ framework, ```pccm```. Read the [pccm guide]() to learn how to use ```pccm```.
It's recommended to uninstall any pip-installed spconv and cumm, then install both in editable mode (```pip install -e .```).
## Architecture
\ No newline at end of file
...@@ -14,3 +14,46 @@ ...@@ -14,3 +14,46 @@
limitations under the License. limitations under the License.
--> -->
# Spconv 2.x Performance Guide
## Short Guide
* If you train without Tensor Cores (i.e. FP32 training), set every ```algo``` in convolution/maxpool layers to ```ConvAlgo.Native``` manually.
* If your GPU supports Tensor Cores, use FP16 (mixed precision training) if possible.
* If you train with mixed precision (using Tensor Cores), you don't need to set the algorithm manually.
* The fast algorithm currently supports only kernel volume (product of the kernel sizes) <= 32, so avoid large kernel sizes.
* Make sure your channel size is a multiple of 8 when using FP16. A multiple of 32 is better.
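The last two rules of thumb in the list above are easy to check before training. The sketch below is a hypothetical helper (`check_fast_path` is not an spconv function). It assumes the channel rule applies to both input and output channels, which the guide does not state explicitly.

```python
def check_fast_path(ksize, in_channels, out_channels, max_kernel_volume=32):
    """Check a layer config against the rules of thumb in the list above.

    Hypothetical helper for illustration; spconv does not provide it.
    """
    kernel_volume = 1
    for k in ksize:
        kernel_volume *= k
    # Assumption: the "multiple of 8" rule covers input and output channels.
    channels = (in_channels, out_channels)
    return {
        "kernel_volume": kernel_volume,
        "kernel_volume_ok": kernel_volume <= max_kernel_volume,
        "fp16_channels_ok": all(c % 8 == 0 for c in channels),
        "channels_multiple_of_32": all(c % 32 == 0 for c in channels),
    }


print(check_fast_path([3, 3, 3], 64, 64))
print(check_fast_path([5, 5, 5], 20, 64))
```

A 3x3x3 kernel has volume 27 and fits under the limit, while 5x5x5 has volume 125 and does not.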
Network Benchmark without batchnorm (F32/F16) in RTX 3080 Laptop GPU
| F32/F16 | Spconv 1.x | Native| Implicit Gemm | Implicit Gemm Split Mask |
| -------------- |:---------------------:|---------------------:|---------------------:| ---------------------:|
| Forward | 43ms | 29ms/23ms | 30ms/15ms | 30ms/19ms |
| Backward | 80ms | 47ms/32ms | 56ms/15ms | 45ms/14ms |
## Algorithm Overview
### Native Explicit (deprecated and removed in spconv 2.x)
The native algorithm (explicit, not fused) is the standard gather-gemm-scatter algorithm. Suppose we compute a 3x3 conv: we can split it into nine 1x1 convs, each computed by a matmul, then sum them to get the final result.
For sparse convolution, we also use split-gemm-sum to calculate the conv, but we need to gather the data first because it's sparse.
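The gather-gemm-scatter idea can be sketched in NumPy on a tiny 1D problem. This is an illustration under simplifying assumptions (1D, stride 1, no padding, no bias, Python-side index pairs), not spconv's implementation, and every name in it is made up for the example. Each kernel offset becomes one matmul over the gathered active rows, and the partial results are scatter-added into the output.

```python
import numpy as np

rng = np.random.default_rng(0)
length, C, K, ksize = 12, 3, 2, 3
active = [1, 2, 7, 10]                         # coordinates of active (non-zero) input sites
feats = rng.standard_normal((len(active), C))  # one feature row per active site
weight = rng.standard_normal((ksize, C, K))    # one (C, K) matrix per kernel offset

out_len = length - ksize + 1                   # "valid" conv, stride 1, no padding


def pairs_for_offset(r):
    """(input row, output coordinate) pairs where output o reads input o + r."""
    return [(n, i - r) for n, i in enumerate(active) if 0 <= i - r < out_len]


# An output site is active if at least one active input falls in its window.
out_coords = sorted({o for r in range(ksize) for _, o in pairs_for_offset(r)})
out_row = {o: n for n, o in enumerate(out_coords)}
out_feats = np.zeros((len(out_coords), K))

for r in range(ksize):
    pairs = pairs_for_offset(r)
    if not pairs:
        continue
    in_rows = [n for n, _ in pairs]
    out_rows = [out_row[o] for _, o in pairs]
    gathered = feats[in_rows]                  # gather
    partial = gathered @ weight[r]             # gemm: a 1x1 conv for this offset
    np.add.at(out_feats, out_rows, partial)    # scatter-add into the output rows

# Dense reference: sum over offsets of shifted-input matmuls.
dense = np.zeros((length, C))
dense[active] = feats
ref = sum(dense[r:r + out_len] @ weight[r] for r in range(ksize))
max_err = np.abs(ref[out_coords] - out_feats).max()
print(out_coords, max_err)
```

The dense reference at the end is the same split-gemm-sum applied to a zero-filled array. The sparse version only touches rows that actually hold data.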
### Native
Fused version of the above algorithm. 1.5x-2x faster than the non-fused version.
### Implicit Gemm
The ```Native``` algorithm performs the minimal number of mma (matrix multiply-add) operations, but it has to serialize IO. The pipeline of ```Native``` is gather-gemm-scatter-gather-gemm-scatter-...
```Implicit Gemm``` fuses all calculation into one kernel and performs overlapped gather-mma-scatter, which saves a lot of time.
![Image Overlapped Gemm](https://raw.githubusercontent.com/NVIDIA/cutlass/master/media/images/software-pipeline.png)
In my test, ```Implicit Gemm``` is almost 2x faster than ```Native```.
### Implicit Gemm Split Mask
TODO
In my test, ```Implicit Gemm Split Mask``` is slightly faster than ```Implicit Gemm```, but indice generation is much slower, so we currently use ```Implicit Gemm``` by default.
\ No newline at end of file
<!--
Copyright 2021 Yan Yan
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
# Breaking changes in Spconv 2.x for spconv 1.x users
* ```spconv.xxx``` has moved to ```spconv.pytorch.xxx```. Change all ```import spconv``` to ```import spconv.pytorch as spconv``` and ```from spconv.xxx import``` to ```from spconv.pytorch.xxx import```.
* ```use_hash``` and ```fused_bn``` in Sparse Convolution are removed; 2.x only uses a hash table.
* ```x.features = F.relu(x.features)``` now raises an error. Use ```x = x.replace_feature(F.relu(x.features))``` instead.
* The weight layout has changed to RSKC (native algorithm) or KRSC (implicit gemm); it is no longer RSCK (spconv 1.x). RS is the kernel size, C is the input channel, K is the output channel.
* All util ops are removed (pillar scatter/nms/rbbox_iou...).
* VoxelGenerator has been replaced by Point2VoxelGPU[1-4]d/Point2VoxelCPU[1-4]d.
* spconv < 2.1 doesn't support CPU. spconv 2.1+ supports CPU for debugging.
* To test a spconv 1.x model in spconv 2.x, set an environment variable before running the program. Linux: ```export SPCONV_FILTER_HWIO="1"```, Windows powershell: ```$Env:SPCONV_FILTER_HWIO = "1"```. **WARNING**: testing a spconv 1.x model doesn't support the implicit gemm algorithm; to use it you need to train from scratch with spconv 2.x and select ConvAlgo.MaskSplitImplicitGemm.
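To make the layout names in the list above concrete, here is a small NumPy sketch for a 3D kernel, where "RS" stands for the three kernel dimensions. It only illustrates what the axis orders mean; it is not a checkpoint conversion recipe, and the array names are made up for the example. For loading 1.x models, the supported route is the `SPCONV_FILTER_HWIO` variable described above.

```python
import numpy as np

kd, kh, kw, C, K = 3, 3, 3, 4, 8  # kernel dims, input channels, output channels

# RSCK (spconv 1.x): kernel dims first, then input channel, then output channel.
w_rsck = np.arange(kd * kh * kw * C * K, dtype=np.float32).reshape(kd, kh, kw, C, K)

# RSKC (2.x native algorithm): swap the two channel axes.
w_rskc = np.ascontiguousarray(w_rsck.transpose(0, 1, 2, 4, 3))

# KRSC (2.x implicit gemm): output channel first, input channel last.
w_krsc = np.ascontiguousarray(w_rsck.transpose(4, 0, 1, 2, 3))

print(w_rsck.shape, w_rskc.shape, w_krsc.shape)

# The same weight element, addressed in each layout.
same = w_rsck[1, 2, 0, 3, 5] == w_rskc[1, 2, 0, 5, 3] == w_krsc[5, 1, 2, 0, 3]
print(same)
```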
...@@ -52,7 +52,7 @@ class ExampleNet(nn.Module): ...@@ -52,7 +52,7 @@ class ExampleNet(nn.Module):
def __init__(self, shape): def __init__(self, shape):
super().__init__() super().__init__()
self.net = spconv.SparseSequential( self.net = spconv.SparseSequential(
spconv.SparseConv3d(32, 64, 3), # just like nn.Conv3d but don't support group and all([d > 1, s > 1]) spconv.SparseConv3d(32, 64, 3), # just like nn.Conv3d but don't support group
nn.BatchNorm1d(64), # non-spatial layers can be used directly in SparseSequential. nn.BatchNorm1d(64), # non-spatial layers can be used directly in SparseSequential.
nn.ReLU(), nn.ReLU(),
spconv.SubMConv3d(64, 64, 3, indice_key="subm0"), spconv.SubMConv3d(64, 64, 3, indice_key="subm0"),
...@@ -100,6 +100,9 @@ class ExampleNet(nn.Module): ...@@ -100,6 +100,9 @@ class ExampleNet(nn.Module):
return self.net(x) return self.net(x)
``` ```
### Fast Mixed Precision Training
See example/mnist_sparse. We support ```torch.cuda.amp```.
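A training script that toggles mixed precision with a flag usually wraps the forward pass in either an autocast context or a no-op context. The sketch below covers only the no-op side and uses no torch. It shows one pitfall worth knowing: an object created by a `contextlib.contextmanager` function can be entered only once, while `contextlib.nullcontext()` can be reused on every iteration. The names `identity_ctx` and `run_steps` are illustrative.

```python
import contextlib


@contextlib.contextmanager
def identity_ctx():
    # Generator-based no-op context manager.
    yield


def run_steps(ctx, num_steps):
    """Enter the same context object once per step; return completed steps."""
    done = 0
    for _ in range(num_steps):
        with ctx:
            done += 1
    return done


# nullcontext() objects are reusable, so one instance works for every step.
steps_ok = run_steps(contextlib.nullcontext(), 3)

# A single identity_ctx() object fails when entered a second time.
try:
    run_steps(identity_ctx(), 3)
    reuse_failed = False
except (RuntimeError, AttributeError):
    reuse_failed = True

print(steps_ok, reuse_failed)
```

If a generator-based context is preferred, creating a fresh one inside the loop (`with identity_ctx():`) avoids the problem as well.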
### Utility functions ### Utility functions
...@@ -107,23 +110,18 @@ class ExampleNet(nn.Module): ...@@ -107,23 +110,18 @@ class ExampleNet(nn.Module):
voxel generator in spconv generate indices in **ZYX** order, the params format are **XYZ**. voxel generator in spconv generate indices in **ZYX** order, the params format are **XYZ**.
voxel generator in spconv takes a ```tv.Tensor``` return a ```tv.Tensor```, this tensor reference to a **permanent** storage in generator. generated indices don't include batch axis, you need to add it by yourself.
```Python ```Python
from spconv.utils import Point2VoxelCPU3d from spconv.pytorch.utils import PointToVoxel
# this generator generate ZYX indices. # this generator generate ZYX indices.
gen = Point2VoxelCPU3d( gen = PointToVoxel(
vsize_xyz=[0.1, 0.1, 0.1], vsize_xyz=[0.1, 0.1, 0.1],
coors_range_xyz=[-80, -80, -2, 80, 80, 6], coors_range_xyz=[-80, -80, -2, 80, 80, 6],
num_point_features=3, num_point_features=3,
max_num_voxels=5000, max_num_voxels=5000,
max_num_points_per_voxel=5) max_num_points_per_voxel=5)
pc = np.random.uniform(-10, 10, size=[1000, 3]) pc = np.random.uniform(-10, 10, size=[1000, 3])
pc_tv = tv.from_numpy(pc) pc_th = torch.from_numpy(pc)
voxels, coords, num_points_per_voxel = gen.generate(pc_tv) voxels, coords, num_points_per_voxel = gen(pc_th)
# get numpy
voxels_np = voxels.numpy_view() # no copy, but become invalid if generator is destroyed.
voxels_np = voxels.numpy() # will perform copy
``` ```
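The ZYX/XYZ convention mentioned above can be sketched in plain NumPy. This is an illustration of the index arithmetic only, not the generator's implementation. It assumes voxel indices are the floor of the offset from the range minimum divided by the voxel size, and that a batch index is prepended as the first column; both are assumptions for this sketch, and `voxel_index_zyx` is a made-up name.

```python
import numpy as np

# Parameters are given in XYZ order, as in the example above.
vsize_xyz = np.array([0.1, 0.1, 0.1])
coors_range_xyz = np.array([-80, -80, -2, 80, 80, 6], dtype=np.float64)


def voxel_index_zyx(points_xyz):
    """Map XYZ points to integer voxel indices returned in ZYX order."""
    offset = points_xyz - coors_range_xyz[:3]
    idx_xyz = np.floor(offset / vsize_xyz).astype(np.int32)
    return idx_xyz[:, ::-1]  # reverse the last axis: XYZ -> ZYX


pts = np.array([[0.05, -79.95, 5.95]])
zyx = voxel_index_zyx(pts)

# Prepend a batch column (assumed layout: [batch, z, y, x]).
batch_idx = 0
batch_col = np.full((zyx.shape[0], 1), batch_idx, dtype=zyx.dtype)
with_batch = np.concatenate([batch_col, zyx], axis=1)
print(zyx, with_batch)
```

When batching several point clouds, the same concatenation is repeated per sample with its own `batch_idx`, and the results are stacked along the first axis.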
...@@ -21,6 +21,12 @@ import torch.nn.functional as F ...@@ -21,6 +21,12 @@ import torch.nn.functional as F
import torch.optim as optim import torch.optim as optim
from torchvision import datasets, transforms from torchvision import datasets, transforms
from torch.optim.lr_scheduler import StepLR from torch.optim.lr_scheduler import StepLR
import contextlib
import torch.cuda.amp
@contextlib.contextmanager
def identity_ctx():
yield
class Net(nn.Module): class Net(nn.Module):
...@@ -58,26 +64,57 @@ class Net(nn.Module): ...@@ -58,26 +64,57 @@ class Net(nn.Module):
def train(args, model, device, train_loader, optimizer, epoch): def train(args, model, device, train_loader, optimizer, epoch):
model.train() model.train()
scaler = torch.cuda.amp.grad_scaler.GradScaler()
amp_ctx = contextlib.nullcontext()  # reusable no-op; an identity_ctx() object can be entered only once
if args.fp16:
amp_ctx = torch.cuda.amp.autocast()
for batch_idx, (data, target) in enumerate(train_loader): for batch_idx, (data, target) in enumerate(train_loader):
data, target = data.to(device), target.to(device) data, target = data.to(device), target.to(device)
optimizer.zero_grad() optimizer.zero_grad()
with amp_ctx:
output = model(data) output = model(data)
loss = F.nll_loss(output, target) loss = F.nll_loss(output, target)
scale = 1.0
if args.fp16:
assert loss.dtype is torch.float32
scaler.scale(loss).backward()
# scaler.step() first unscales the gradients of the optimizer's assigned params.
# If these gradients do not contain infs or NaNs, optimizer.step() is then called,
# otherwise, optimizer.step() is skipped.
# scaler.unscale_(optim)
# Since the gradients of optimizer's assigned params are now unscaled, clips as usual.
# You may use the same value for max_norm here as you would without gradient scaling.
# torch.nn.utils.clip_grad_norm_(models[0].net.parameters(), max_norm=0.1)
scaler.step(optimizer)
# Updates the scale for next iteration.
scaler.update()
scale = scaler.get_scale()
else:
loss.backward() loss.backward()
optimizer.step() optimizer.step()
if batch_idx % args.log_interval == 0: if batch_idx % args.log_interval == 0:
print('Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}'.format( print('Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}'.format(
epoch, batch_idx * len(data), len(train_loader.dataset), epoch, batch_idx * len(data), len(train_loader.dataset),
100. * batch_idx / len(train_loader), loss.item())) 100. * batch_idx / len(train_loader), loss.item()))
def test(model, device, test_loader): def test(args, model, device, test_loader):
model.eval() model.eval()
test_loss = 0 test_loss = 0
correct = 0 correct = 0
amp_ctx = contextlib.nullcontext()  # reusable no-op; an identity_ctx() object can be entered only once
if args.fp16:
amp_ctx = torch.cuda.amp.autocast()
with torch.no_grad(): with torch.no_grad():
for data, target in test_loader: for data, target in test_loader:
data, target = data.to(device), target.to(device) data, target = data.to(device), target.to(device)
with amp_ctx:
output = model(data) output = model(data)
test_loss += F.nll_loss(output, target, reduction='sum').item() # sum up batch loss test_loss += F.nll_loss(output, target, reduction='sum').item() # sum up batch loss
pred = output.argmax(dim=1, keepdim=True) # get the index of the max log-probability pred = output.argmax(dim=1, keepdim=True) # get the index of the max log-probability
...@@ -112,6 +149,9 @@ def main(): ...@@ -112,6 +149,9 @@ def main():
parser.add_argument('--save-model', action='store_true', default=False, parser.add_argument('--save-model', action='store_true', default=False,
help='For Saving the current Model') help='For Saving the current Model')
parser.add_argument('--fp16', action='store_true', default=False,
help='For mixed precision training')
args = parser.parse_args() args = parser.parse_args()
use_cuda = not args.no_cuda and torch.cuda.is_available() use_cuda = not args.no_cuda and torch.cuda.is_available()
...@@ -142,7 +182,7 @@ def main(): ...@@ -142,7 +182,7 @@ def main():
scheduler = StepLR(optimizer, step_size=1, gamma=args.gamma) scheduler = StepLR(optimizer, step_size=1, gamma=args.gamma)
for epoch in range(1, args.epochs + 1): for epoch in range(1, args.epochs + 1):
train(args, model, device, train_loader, optimizer, epoch) train(args, model, device, train_loader, optimizer, epoch)
test(model, device, test_loader) test(args, model, device, test_loader)
scheduler.step() scheduler.step()
if args.save_model: if args.save_model:
......
...@@ -16,7 +16,8 @@ import numpy as np ...@@ -16,7 +16,8 @@ import numpy as np
from cumm import tensorview as tv from cumm import tensorview as tv
from spconv.utils import Point2VoxelCPU3d from spconv.utils import Point2VoxelCPU3d
from spconv.pytorch.utils import PointToVoxel
import torch
def main(): def main():
# voxel gen source code: spconv/csrc/sparse/pointops.py # voxel gen source code: spconv/csrc/sparse/pointops.py
...@@ -45,5 +46,91 @@ def main(): ...@@ -45,5 +46,91 @@ def main():
print("------Voxels with mean filled-------") print("------Voxels with mean filled-------")
print(voxels_np[0]) print(voxels_np[0])
def main_point_with_features():
# voxel gen source code: spconv/csrc/sparse/pointops.py
gen = Point2VoxelCPU3d(
vsize_xyz=[0.1, 0.1, 0.1],
coors_range_xyz=[-80, -80, -2, 80, 80, 6],
num_point_features=4, # here num_point_features must equal to pc.shape[1]
max_num_voxels=5000,
max_num_points_per_voxel=5)
pc = np.random.uniform(-10, 10, size=[1000, 3])
other_pc_feature = np.random.uniform(-1, 1, size=[1000, 1])
pc_with_feature = np.concatenate([pc, other_pc_feature], axis=1)
pc_tv = tv.from_numpy(pc_with_feature)
# generate voxels. note that voxels_tv references a persistent buffer in the generator,
# so we can't run it from multiple threads.
voxels_tv, indices_tv, num_p_in_vx_tv = gen.point_to_voxel(pc_tv)
voxels_np = voxels_tv.numpy_view()
indices_np = indices_tv.numpy_view()
num_p_in_vx_np = num_p_in_vx_tv.numpy_view()
print("------Raw Voxels-------")
print(voxels_np[0])
# run voxel gen and FILL MEAN VALUE to voxel remain
voxels_tv, indices_tv, num_p_in_vx_tv = gen.point_to_voxel_empty_mean(pc_tv)
voxels_np = voxels_tv.numpy_view()
indices_np = indices_tv.numpy_view()
num_p_in_vx_np = num_p_in_vx_tv.numpy_view()
print("------Voxels with mean filled-------")
print(voxels_np[0])
def main_pytorch_voxel_gen():
# voxel gen source code: spconv/csrc/sparse/pointops.py
gen = PointToVoxel(
vsize_xyz=[0.1, 0.1, 0.1],
coors_range_xyz=[-80, -80, -2, 80, 80, 6],
num_point_features=3,
max_num_voxels=5000,
max_num_points_per_voxel=5)
pc = np.random.uniform(-10, 10, size=[1000, 3])
pc_th = torch.from_numpy(pc)
voxels_th, indices_th, num_p_in_vx_th = gen(pc_th)
voxels_np = voxels_th.numpy()
indices_np = indices_th.numpy()
num_p_in_vx_np = num_p_in_vx_th.numpy()
print("------Raw Voxels-------")
print(voxels_np[0])
# run voxel gen and FILL MEAN VALUE to voxel remain
voxels_tv, indices_tv, num_p_in_vx_tv = gen(pc_th, empty_mean=True)
voxels_np = voxels_tv.numpy()
indices_np = indices_tv.numpy()
num_p_in_vx_np = num_p_in_vx_tv.numpy()
print("------Voxels with mean filled-------")
print(voxels_np[0])
def main_pytorch_voxel_gen_cuda():
# voxel gen source code: spconv/csrc/sparse/pointops.py
device = torch.device("cuda:0")
gen = PointToVoxel(
vsize_xyz=[0.1, 0.1, 0.1],
coors_range_xyz=[-80, -80, -2, 80, 80, 6],
num_point_features=3,
max_num_voxels=5000,
max_num_points_per_voxel=5,
device=device)
pc = np.random.uniform(-10, 10, size=[1000, 3]).astype(np.float32)
pc_th = torch.from_numpy(pc).to(device)
voxels_th, indices_th, num_p_in_vx_th = gen(pc_th)
voxels_np = voxels_th.cpu().numpy()
indices_np = indices_th.cpu().numpy()
num_p_in_vx_np = num_p_in_vx_th.cpu().numpy()
print("------Raw Voxels-------")
print(voxels_np[0])
# run voxel gen and FILL MEAN VALUE to voxel remain
voxels_tv, indices_tv, num_p_in_vx_tv = gen(pc_th, empty_mean=True)
voxels_np = voxels_tv.cpu().numpy()
indices_np = indices_tv.cpu().numpy()
num_p_in_vx_np = num_p_in_vx_tv.cpu().numpy()
print("------Voxels with mean filled-------")
print(voxels_np[0])
if __name__ == "__main__": if __name__ == "__main__":
main() main()
main_point_with_features()
main_pytorch_voxel_gen()
if torch.cuda.is_available():
main_pytorch_voxel_gen_cuda()
\ No newline at end of file
[build-system] [build-system]
requires = ["setuptools>=41.0", "wheel", "pccm>=0.2.15", "cumm>=0.1.9"] requires = ["setuptools>=41.0", "wheel", "pccm>=0.2.19", "cumm>=0.2.0"]
build-backend = "setuptools.build_meta" build-backend = "setuptools.build_meta"
import pickle
import sys
import time
from pathlib import Path
from typing import Dict, List, Tuple
from cumm.gemm.algospec.core import GemmAlgo
import numpy as np
import pccm
import torch
import torch.nn.functional as F
from cumm import dtypes
from cumm import tensorview as tv
from cumm.constants import PACKAGE_ROOT
from cumm.conv.bases import NCHW, NHWC, ConvIterAlgo, ConvOpType
from cumm.conv.main import ConvMainUnitTest, gen_gemm_kernels
from cumm.conv.params import ConvProblem
from cumm.gemm import kernel
import os
from spconv.core_cc.csrc.sparse.all import SpconvOps
from cumm.gemm.codeops import div_up
from spconv.constants import PACKAGE_ROOT
from spconv.core import ConvAlgo
from spconv.pytorch import ops
from spconv.algo import CONV, BestConvAlgoByProfile
from spconv.pytorch.cppcore import torch_tensor_to_tv
def reduce_mask_count(mask: np.ndarray, width: int):
mask_length_32 = (div_up(mask.shape[0], width)) * width
if mask.shape[0] < mask_length_32:
mask_pad = np.zeros((mask_length_32,), dtype=mask.dtype)
mask_pad[:mask.shape[0]] = mask
mask = mask_pad
mask = mask.reshape(-1, width)
maskr = np.bitwise_or.reduce(mask, axis=1)
maskr_tv = tv.from_numpy(maskr)
return SpconvOps.count_bits(maskr_tv).numpy().sum() * width
def reduce_mask_count_x(mask: np.ndarray, width: int):
mask_length_32 = (div_up(mask.shape[0], width)) * width
if mask.shape[0] < mask_length_32:
mask_pad = np.zeros((mask_length_32,), dtype=mask.dtype)
mask_pad[:mask.shape[0]] = mask
mask = mask_pad
mask = mask.reshape(-1, width)
maskr = np.bitwise_or.reduce(mask, axis=1)
return maskr
def dev_subm_inds_v2(subm: bool = False, run_conv: bool = True):
limit_input_n = 16384
limit_input_n = None
np.random.seed(484)
with (PACKAGE_ROOT.parent / "test/data/test_spconv.pkl").open("rb") as f:
voxels_np, indices_np, spatial_shape = pickle.load(f)
from spconv.test_utils import generate_sparse_data
voxels_np = voxels_np[:limit_input_n]
indices_np = indices_np[:limit_input_n]
spatial_shape = [19, 18, 17]
sparse_dict = generate_sparse_data(spatial_shape, [1024], 128)
voxels_np = np.ascontiguousarray(sparse_dict["features"]).astype(
np.float32)
indices_np = np.ascontiguousarray(
sparse_dict["indices"][:, [3, 0, 1, 2]]).astype(np.int32)
voxels = tv.from_numpy(voxels_np).cuda()
indices = tv.from_numpy(indices_np).cuda()
indices_th = torch.from_numpy(indices_np).cuda()
print(spatial_shape, indices_np.shape)
ndim = 3
if subm:
ksize = [3, 3, 3]
kv = np.prod(ksize)
padding = [1] * ndim
stride = [1] * ndim
dilation = [1] * ndim
out_padding = [0] * ndim
else:
ksize = [2, 2, 2]
kv = np.prod(ksize)
padding = [0] * ndim
stride = [1] * ndim
dilation = [1] * ndim
out_padding = [0] * ndim
out_inds, pair_ref, indice_num_per_loc = ops.get_indice_pairs(indices_th, 1, spatial_shape, ConvAlgo.Native,
ksize, stride, padding, dilation, out_padding, subm)
indice_num_per_loc_np = indice_num_per_loc.cpu().numpy()
indice_pairs_np = pair_ref.cpu().numpy()
algo = ConvAlgo.MaskSplitImplicitGemm
if algo == ConvAlgo.MaskImplicitGemm:
num_split = 1
else:
num_split = 2
for i in range(5):
res = ops.get_indice_pairs_implicit_gemm(indices_th, 1, spatial_shape, algo,
ksize, stride, padding, dilation, out_padding, subm)
out_inds = res[0]
num_inds_per_loc = res[1]
pair_fwd = res[2]
pair_fwd_x = pair_fwd.cpu().numpy().reshape(-1)
pair_fwd_x[pair_fwd_x == -1] = 0
loc_num_np = (pair_fwd_x > 0).reshape(kv, -1).sum(1)
print(loc_num_np)
print(indice_num_per_loc_np)
pair_bwd = res[3]
pair_mask_fwd_splits = res[4]
pair_mask_bwd_splits = res[5]
mask_argsort_fwd_splits = res[6]
mask_argsort_bwd_splits = res[7]
masks = res[8]
pair_mask_fwd_splits_tv = [ops.torch_tensor_to_tv(t, dtype=tv.uint32) for t in pair_mask_fwd_splits]
valid_location_bitcount = [SpconvOps.count_bits(t) for t in pair_mask_fwd_splits_tv]
valid_location_count = sum([t.cpu().numpy().sum() for t in valid_location_bitcount])
reduce_length = 32
split_mask_valid_count = sum([reduce_mask_count(t.cpu().numpy(), reduce_length) for t in pair_mask_fwd_splits_tv])
if subm:
print("SUBM", valid_location_count, split_mask_valid_count, pair_fwd.numel())
else:
print("REGULAR", valid_location_count, split_mask_valid_count, pair_fwd.numel())
# return
if run_conv:
C = 64
K = 64
desps = CONV.desps
mask_output_fwd = torch.zeros([2, div_up(out_inds.shape[0], 32)], dtype=torch.int32, device=indices_th.device)
mask_output_bwd = torch.zeros([2, div_up(indices.dim(0), 32)], dtype=torch.int32, device=indices_th.device)
for desp in desps:
if desp.algo != GemmAlgo.Simt.value:
continue
# if desp.op_type == ConvOpType.kBackwardWeight.value:
# continue
# if desp.tile_shape !
if desp.dtype_a == dtypes.int8.tv_dtype:
inp = np.random.randint(-1, 1, size=[voxels_np.shape[0], C]).astype(np.int8)
weight = np.random.randint(-1, 1, size=[K, *ksize, C]).astype(np.int8)
output = np.random.randint(-1, 1, size=[out_inds.shape[0], K]).astype(
dtypes.get_npdtype_from_tvdtype(desp.dtype_output))
else:
inp = np.random.uniform(-1, 1, size=[voxels_np.shape[0], C]).astype(
dtypes.get_npdtype_from_tvdtype(desp.dtype_input))
weight = np.random.uniform(-1, 1, size=[K, *ksize, C]).astype(
dtypes.get_npdtype_from_tvdtype(desp.dtype_weight))
output = np.random.uniform(-1, 1, size=[out_inds.shape[0], K]).astype(
dtypes.get_npdtype_from_tvdtype(desp.dtype_output))
weight_ref = weight.transpose(1, 2, 3, 0, 4)
weight_ref = np.ascontiguousarray(weight_ref).reshape(-1, K, C)
if desp.op_type == ConvOpType.kBackwardInput.value:
inp_tv = tv.zeros(inp.shape, desp.dtype_input, 0)
else:
inp_tv = tv.from_numpy(inp).cuda()
if desp.op_type == ConvOpType.kBackwardWeight.value:
weight_tv = tv.zeros(weight.shape, desp.dtype_weight, 0)
else:
weight_tv = tv.from_numpy(weight).cuda()
# _ = tv.zeros([5000, 10], tv.float32, 0)
if desp.op_type == ConvOpType.kForward.value:
output_tv = tv.zeros(output.shape, desp.dtype_output, 0)
else:
output_tv = tv.from_numpy(output).cuda()
torch.cuda.synchronize()
t = time.time()
spk = 1
if desp.op_type == ConvOpType.kBackwardWeight.value:
# TODO support splitk parallel
spk = 32
if subm:
if desp.op_type == ConvOpType.kForward.value:
indice_pairs = pair_fwd
elif desp.op_type == ConvOpType.kBackwardInput.value:
indice_pairs = pair_bwd
else:
indice_pairs = pair_fwd
mask_output = mask_output_fwd
# print([bin(x.item()) for x in masks])
for j in range(num_split):
beta = 1 if j == 1 else 0
mask_filter = 0xffffffff
mask_filter = masks[j].item()
reverse_mask = False
if desp.op_type == ConvOpType.kBackwardWeight.value:
mask_op = mask_output[j]
else:
mask_op = pair_mask_fwd_splits[j]
if desp.op_type == ConvOpType.kBackwardInput.value:
reverse_mask = True
CONV.run_with_tuned_result(
BestConvAlgoByProfile(desp, spk),
desp.op_type,
inp_tv,
weight_tv,
output_tv,
torch_tensor_to_tv(mask_op, dtype=tv.uint32),
torch_tensor_to_tv(mask_argsort_fwd_splits[j]),
torch_tensor_to_tv(mask_output[j], dtype=tv.uint32),
torch_tensor_to_tv(indice_pairs),
reverse_mask,
mask_filter=mask_filter,
mask_width=32,
beta=beta,
verbose=True,
)
else:
if desp.op_type == ConvOpType.kForward.value:
indice_pairs = pair_fwd # inp -> out
mask_ops = pair_mask_fwd_splits
mask_argsorts = mask_argsort_fwd_splits
mask_output = mask_output_fwd
elif desp.op_type == ConvOpType.kBackwardInput.value:
indice_pairs = pair_bwd # out -> inp
mask_ops = pair_mask_bwd_splits
mask_argsorts = mask_argsort_bwd_splits
mask_output = mask_output_bwd
print([bin(x.item()) for x in masks])
else:
indice_pairs = pair_fwd # inp -> out
mask_ops = pair_mask_fwd_splits
mask_argsorts = mask_argsort_fwd_splits
mask_output = mask_output_fwd
for j in range(2):
beta = 1 if j == 1 else 0
mask_filter = masks[j].item()
reverse_mask = False
if desp.op_type == ConvOpType.kBackwardWeight.value:
mask_op = mask_output[j]
else:
mask_op = mask_ops[j]
CONV.run_with_tuned_result(
BestConvAlgoByProfile(desp, spk),
desp.op_type,
inp_tv,
weight_tv,
output_tv,
torch_tensor_to_tv(mask_op, dtype=tv.uint32),
torch_tensor_to_tv(mask_argsorts[j]),
torch_tensor_to_tv(mask_output[j], dtype=tv.uint32),
torch_tensor_to_tv(indice_pairs),
reverse_mask,
mask_filter=mask_filter,
mask_width=32,
beta=beta,
verbose=True,
)
torch.cuda.synchronize()
duration = time.time() - t
if desp.op_type == ConvOpType.kForward.value:
output_ref = np.zeros_like(output, dtype=np.float32)
# ref algorithm
for filter_offset in range(kv):
if subm and filter_offset > kv // 2:
nhot = indice_num_per_loc_np[kv - 1 - filter_offset]
elif subm and filter_offset == kv // 2:
nhot = voxels.shape[0]
else:
nhot = indice_num_per_loc_np[filter_offset]
a_inds = indice_pairs_np[0][filter_offset][:nhot]
c_inds = indice_pairs_np[1][filter_offset][:nhot]
# print(a_inds_cpu[:10])
a = inp[a_inds]
cc = a.astype(np.float32) @ weight_ref[filter_offset].T.astype(np.float32)
output_ref[c_inds] += cc
output_cpu = output_tv.cpu().numpy().astype(np.float32)
duration = time.time() - t
my = output_cpu.reshape(-1)
print("ERROR", np.linalg.norm(output_ref.reshape(-1) - my))
elif desp.op_type == ConvOpType.kBackwardInput.value:
dinput_ref = np.zeros_like(inp, dtype=np.float32)
# ref algorithm
for filter_offset in range(kv):
if subm and filter_offset > kv // 2:
nhot = indice_num_per_loc_np[kv - 1 - filter_offset]
elif subm and filter_offset == kv // 2:
nhot = voxels.shape[0]
else:
nhot = indice_num_per_loc_np[filter_offset]
a_inds = indice_pairs_np[1][filter_offset][:nhot]
c_inds = indice_pairs_np[0][filter_offset][:nhot]
# print(a_inds_cpu[:10])
a = output[a_inds]
# NK @ KC
cc = a.astype(np.float32) @ weight_ref[filter_offset].astype(np.float32)
dinput_ref[c_inds] += cc
din_cpu = inp_tv.cpu().numpy()
print("ERROR", np.linalg.norm(din_cpu.reshape(-1) - dinput_ref.reshape(-1)))
else:
dw_ref = np.zeros_like(weight_ref, dtype=np.float32) # KV, K, C
for filter_offset in range(kv):
if subm and filter_offset > kv // 2:
nhot = indice_num_per_loc_np[kv - 1 - filter_offset]
elif subm and filter_offset == kv // 2:
nhot = voxels.shape[0]
else:
nhot = indice_num_per_loc_np[filter_offset]
o_inds = indice_pairs_np[1][filter_offset][:nhot]
i_inds = indice_pairs_np[0][filter_offset][:nhot]
# print(a_inds_cpu[:10])
out_gather = output[o_inds] # [N, K]
inp_gather = inp[i_inds] # [N, C]
# KN @ NC
dw_res = out_gather.astype(np.float32).T @ inp_gather.astype(np.float32)
dw_ref[filter_offset] = dw_res
# print(indice_pairs_np_test[0])
dw_ref_kcrs = dw_ref.transpose(1, 0, 2)
dw_cpu = weight_tv.cpu().numpy().reshape(K, np.prod(ksize), C)
print("ERROR", np.linalg.norm(dw_cpu.reshape(-1) - dw_ref_kcrs.reshape(-1)))
if __name__ == "__main__":
    dev_subm_inds_v2()
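The forward reference check in `dev_subm_inds_v2` follows a gather, matrix-multiply, scatter-add pattern per kernel offset. Below is a minimal NumPy sketch of that pattern for the non-submanifold case only. The function name and the toy data are hypothetical, and the array layouts (`weight_ref` as `[KV, K, C]`, `indice_pairs` as `[2, KV, N]`) are assumptions read off the script above.

```python
import numpy as np


def sparse_conv_forward_ref(inp, weight_ref, indice_pairs, indice_num_per_loc, num_out):
    """Reference forward pass: inp [N, C], weight_ref [KV, K, C] -> output [num_out, K]."""
    kv, k, _ = weight_ref.shape
    out = np.zeros((num_out, k), dtype=np.float32)
    for offset in range(kv):
        nhot = int(indice_num_per_loc[offset])      # valid pairs for this kernel offset
        in_inds = indice_pairs[0, offset, :nhot]    # rows to gather from the input
        out_inds = indice_pairs[1, offset, :nhot]   # rows to accumulate into
        contrib = inp[in_inds].astype(np.float32) @ weight_ref[offset].T.astype(np.float32)
        np.add.at(out, out_inds, contrib)           # scatter-add, safe with repeated indices
    return out


# Toy problem: 2 input points, C = K = 1, two kernel offsets.
inp = np.array([[1.0], [10.0]], dtype=np.float32)
weight_ref = np.array([[[2.0]], [[3.0]]], dtype=np.float32)
indice_pairs = np.array([[[0, 1], [0, -1]],
                         [[0, 1], [1, -1]]], dtype=np.int32)
indice_num_per_loc = np.array([2, 1], dtype=np.int32)
out = sparse_conv_forward_ref(inp, weight_ref, indice_pairs, indice_num_per_loc, num_out=2)
print(out)
```

The script uses `output_ref[c_inds] += cc`; `np.add.at` is swapped in here so that repeated output indices would still accumulate. The submanifold branch, which mirrors the pair count for offsets past the kernel center, is left out of this sketch.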
# Copyright 2021 Yan Yan
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import numpy as np
from cumm import tensorview as tv
from spconv.core_cc.csrc.sparse.all import SpconvOps
import pickle
import torch
from spconv.pytorch.cppcore import torch_tensor_to_tv
def main():
    with open("/home/yy/asd.pkl", "rb") as f:
        a_th = pickle.load(f)
    mask_argsort = torch.empty((1, a_th.shape[1]),
                               dtype=torch.int32,
                               device=a_th.device)
    a = a_th.cpu().numpy()[0]
    a_tv = torch_tensor_to_tv(a_th)
    mask_argsort_tv = torch_tensor_to_tv(mask_argsort)
    for i in range(10):
        a_tv_1 = a_tv.clone()
        SpconvOps.sort_1d_by_key(a_tv_1[0], mask_argsort_tv[0])


if __name__ == "__main__":
    main()
\ No newline at end of file
@@ -25,27 +25,34 @@ NAME = 'spconv'
 RELEASE_NAME = NAME
 deps = ["cumm"]
 cuda_ver = os.environ.get("CUMM_CUDA_VERSION", "")
-if not cuda_ver:
-    nvcc_version = subprocess.check_output(["nvcc", "--version"
-                                            ]).decode("utf-8").strip()
-    nvcc_version_str = nvcc_version.split("\n")[3]
-    version_str: str = re.findall(r"release (\d+.\d+)",
-                                  nvcc_version_str)[0]
-    cuda_ver = version_str
-cuda_ver = cuda_ver.replace(".", "") # 10.2 to 102
-RELEASE_NAME += "-cu{}".format(cuda_ver)
-deps = ["cumm-cu{}".format(cuda_ver)]
+# is_ci_build = cuda_ver != ""
+# if not cuda_ver:
+# nvcc_version = subprocess.check_output(["nvcc", "--version"
+# ]).decode("utf-8").strip()
+# nvcc_version_str = nvcc_version.split("\n")[3]
+# version_str: str = re.findall(r"release (\d+.\d+)",
+# nvcc_version_str)[0]
+# cuda_ver = version_str
+if cuda_ver:
+    cuda_ver = cuda_ver.replace(".", "") # 10.2 to 102
+    RELEASE_NAME += "-cu{}".format(cuda_ver)
+    deps = ["cumm-cu{}".format(cuda_ver)]
+else:
+    deps = ["cumm"]
 DESCRIPTION = 'spatial sparse convolution'
 URL = 'https://github.com/traveller59/spconv'
 EMAIL = 'yanyan.sub@outlook.com'
 AUTHOR = 'Yan Yan'
-REQUIRES_PYTHON = '>=3.6'
+REQUIRES_PYTHON = '>=3.7'
 VERSION = None
 # What packages are required for this module to be executed?
-REQUIRED = ["pccm>=0.2.14", "pybind11>=2.6.0", "fire", "numpy", *deps]
+REQUIRED = ["pccm>=0.2.19", "pybind11>=2.6.0", "fire", "numpy", *deps]
 # What packages are optional?
 EXTRAS = {
@@ -145,18 +152,27 @@ if disable_jit is not None and disable_jit == "1":
     }
     from cumm.gemm.main import GemmMainUnitTest
     from spconv.core import SHUFFLE_SIMT_PARAMS, SHUFFLE_VOLTA_PARAMS, SHUFFLE_TURING_PARAMS
+    from spconv.core import IMPLGEMM_SIMT_PARAMS, IMPLGEMM_VOLTA_PARAMS, IMPLGEMM_TURING_PARAMS
+    from cumm.conv.main import ConvMainUnitTest
+    from cumm.constants import CUMM_CPU_ONLY_BUILD
     from spconv.csrc.sparse.all import SpconvOps
     cu = GemmMainUnitTest(SHUFFLE_SIMT_PARAMS + SHUFFLE_VOLTA_PARAMS + SHUFFLE_TURING_PARAMS)
+    convcu = ConvMainUnitTest(IMPLGEMM_SIMT_PARAMS + IMPLGEMM_VOLTA_PARAMS + IMPLGEMM_TURING_PARAMS)
+    convcu.namespace = "cumm.conv.main"
     cu.namespace = "cumm.gemm.main"
-    cuda_ver_number = int(cuda_ver)
-    if cuda_ver_number < 110:
-        std = "c++14"
-    else:
-        std = "c++17"
+    std = "c++17"
+    if cuda_ver:
+        cuda_ver_number = int(cuda_ver)
+        if cuda_ver_number < 110:
+            std = "c++14"
+        else:
+            std = "c++17"
+    cus = [cu, convcu, SpconvOps()]
+    if CUMM_CPU_ONLY_BUILD:
+        cus = [SpconvOps()]
     ext_modules: List[Extension] = [
-        PCCMExtension([cu, SpconvOps()],
+        PCCMExtension(cus,
                       "spconv/core_cc",
                       Path(__file__).resolve().parent / "spconv",
                       objects_folder="objects",
......
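Taken together, the two packaging hunks above derive the release name, the `cumm` dependency, and the C++ standard from `CUMM_CUDA_VERSION`, with an empty value now meaning a CPU-only package. The sketch below restates that logic as one pure function so it can be checked in isolation. `resolve_build_config` is a hypothetical helper, not part of the commit.

```python
from typing import List, Tuple


def resolve_build_config(cuda_ver: str) -> Tuple[str, List[str], str]:
    """Return (release name, cumm dependency list, C++ standard) for a CUDA version string."""
    release_name = "spconv"
    deps = ["cumm"]          # CPU-only default when no CUDA version is given
    std = "c++17"
    if cuda_ver:
        cuda_ver = cuda_ver.replace(".", "")  # "10.2" -> "102"
        release_name += "-cu{}".format(cuda_ver)
        deps = ["cumm-cu{}".format(cuda_ver)]
        if int(cuda_ver) < 110:
            std = "c++14"    # older CUDA toolkits fall back to C++14
    return release_name, deps, std


for ver in ["10.2", "114", ""]:
    print(repr(ver), resolve_build_config(ver))
```

An empty string corresponds to the new `''` entry in the CI `cuda-version` matrix.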
@@ -16,3 +16,4 @@ from . import build as _build
 from .core import ConvAlgo, AlgoHint
 from . import constants
+from .__version__ import __version__
\ No newline at end of file
This diff is collapsed.
@@ -16,17 +16,27 @@ from pathlib import Path
 import pccm
 from pccm.utils import project_is_editable, project_is_installed
+from ccimport.compat import InWindows
-from .constants import PACKAGE_NAME, PACKAGE_ROOT
-if project_is_installed(PACKAGE_NAME) and project_is_editable(PACKAGE_NAME):
+from .constants import PACKAGE_NAME, PACKAGE_ROOT, DISABLE_JIT
+if project_is_installed(PACKAGE_NAME) and project_is_editable(PACKAGE_NAME) and not DISABLE_JIT:
     from spconv.core import SHUFFLE_SIMT_PARAMS, SHUFFLE_VOLTA_PARAMS, SHUFFLE_TURING_PARAMS
+    from spconv.core import IMPLGEMM_SIMT_PARAMS, IMPLGEMM_VOLTA_PARAMS, IMPLGEMM_TURING_PARAMS
     from cumm.gemm.main import GemmMainUnitTest
+    from cumm.conv.main import ConvMainUnitTest
     from spconv.csrc.sparse.all import SpconvOps
     cu = GemmMainUnitTest(SHUFFLE_SIMT_PARAMS + SHUFFLE_VOLTA_PARAMS + SHUFFLE_TURING_PARAMS)
     cu.namespace = "cumm.gemm.main"
-    pccm.builder.build_pybind([cu, SpconvOps()],
+    convcu = ConvMainUnitTest(IMPLGEMM_SIMT_PARAMS + IMPLGEMM_VOLTA_PARAMS + IMPLGEMM_TURING_PARAMS)
+    convcu.namespace = "cumm.conv.main"
+    objects_folder = None
+    if InWindows:
+        # windows have command line limit, so we use objects_folder to reduce command size.
+        objects_folder = "objects"
+    pccm.builder.build_pybind([cu, convcu, SpconvOps()],
                               PACKAGE_ROOT / "core_cc",
                               namespace_root=PACKAGE_ROOT,
-                              objects_folder="objects",
+                              objects_folder=objects_folder,
                               load_library=False)
@@ -25,3 +25,5 @@ EDITABLE_INSTALLED = project_is_installed(PACKAGE_NAME) and project_is_editable(
 _filter_hwio_env = os.getenv("SPCONV_FILTER_HWIO", "0")
 FILTER_HWIO = _filter_hwio_env == "1"
+DISABLE_JIT = os.getenv("SPCONV_DISABLE_JIT", "0") == "1"
+NDIM_DONT_CARE = 3
\ No newline at end of file
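Both flags in this hunk use the same convention: the environment variable must be exactly `"1"` to enable the feature, and anything else, including `"true"`, leaves it off. A small sketch of that convention follows. `env_flag` is a hypothetical helper introduced only for illustration.

```python
import os


def env_flag(name: str, default: str = "0") -> bool:
    """True only when the variable is set to the exact string "1"."""
    return os.getenv(name, default) == "1"


os.environ["SPCONV_DISABLE_JIT"] = "1"
jit_disabled = env_flag("SPCONV_DISABLE_JIT")

os.environ["SPCONV_DISABLE_JIT"] = "true"   # not recognised as enabled
jit_disabled_by_true = env_flag("SPCONV_DISABLE_JIT")

del os.environ["SPCONV_DISABLE_JIT"]
jit_disabled_when_unset = env_flag("SPCONV_DISABLE_JIT")
print(jit_disabled, jit_disabled_by_true, jit_disabled_when_unset)
```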
This diff is collapsed.
 from typing import overload, Any, Callable, Dict, List, Optional, Set, Tuple, Type, Union
 from pccm.stubs import EnumValue, EnumClassValue
 from cumm.tensorview import Tensor
+class ThrustCustomAllocatorV2:
+    alloc_func: Callable[int, int]
 class SpconvOps:
     @staticmethod
     def generate_conv_inds_stage1(indices: Tensor, indice_pairs: Tensor, indice_pairs_uniq: Tensor, indice_num_per_loc: Tensor, batch_size: int, output_dims: List[int], input_dims: List[int], ksize: List[int], stride: List[int], padding: List[int], dilation: List[int], transposed: bool = False, stream_int: int = 0) -> None:
@@ -53,6 +55,49 @@ class SpconvOps:
         """
         ...
     @staticmethod
+    def generate_conv_inds_mask_stage1(indices: Tensor, indice_pairs_bwd: Tensor, indice_pairs_uniq: Tensor, indice_num_per_loc: Tensor, batch_size: int, output_dims: List[int], input_dims: List[int], ksize: List[int], stride: List[int], padding: List[int], dilation: List[int], transposed: bool = False, stream_int: int = 0) -> None:
+        """
+        Args:
+            indices:
+            indice_pairs_bwd:
+            indice_pairs_uniq:
+            indice_num_per_loc:
+            batch_size:
+            output_dims:
+            input_dims:
+            ksize:
+            stride:
+            padding:
+            dilation:
+            transposed:
+            stream_int:
+        """
+        ...
+    @staticmethod
+    def generate_conv_inds_mask_stage2(indices: Tensor, hashdata: Tensor, indice_pairs_fwd: Tensor, indice_pairs_bwd: Tensor, indice_pairs_uniq: Tensor, out_inds: Tensor, mask_fwd: Tensor, mask_bwd: Tensor, num_out_act: int, batch_size: int, output_dims: List[int], input_dims: List[int], ksize: List[int], stride: List[int], padding: List[int], dilation: List[int], transposed: bool = False, stream_int: int = 0) -> int:
+        """
+        Args:
+            indices:
+            hashdata:
+            indice_pairs_fwd:
+            indice_pairs_bwd:
+            indice_pairs_uniq:
+            out_inds:
+            mask_fwd:
+            mask_bwd:
+            num_out_act:
+            batch_size:
+            output_dims:
+            input_dims:
+            ksize:
+            stride:
+            padding:
+            dilation:
+            transposed:
+            stream_int:
+        """
+        ...
+    @staticmethod
     def generate_subm_conv_inds(indices: Tensor, hashdata: Tensor, indice_pairs: Tensor, out_inds: Tensor, indice_num_per_loc: Tensor, batch_size: int, input_dims: List[int], ksize: List[int], dilation: List[int], indice_pair_mask: Tensor = Tensor(), backward: bool = False, stream_int: int = 0) -> int:
         """
         Args:
@@ -71,6 +116,38 @@ class SpconvOps:
         """
         ...
     @staticmethod
+    def generate_conv_inds_cpu(indices: Tensor, indice_pairs: Tensor, out_inds: Tensor, indice_num_per_loc: Tensor, batch_size: int, output_dims: List[int], input_dims: List[int], ksize: List[int], stride: List[int], padding: List[int], dilation: List[int], transposed: bool = False) -> int:
+        """
+        Args:
+            indices:
+            indice_pairs:
+            out_inds:
+            indice_num_per_loc:
+            batch_size:
+            output_dims:
+            input_dims:
+            ksize:
+            stride:
+            padding:
+            dilation:
+            transposed:
+        """
+        ...
+    @staticmethod
+    def generate_subm_conv_inds_cpu(indices: Tensor, indice_pairs: Tensor, out_inds: Tensor, indice_num_per_loc: Tensor, batch_size: int, input_dims: List[int], ksize: List[int], dilation: List[int]) -> int:
+        """
+        Args:
+            indices:
+            indice_pairs:
+            out_inds:
+            indice_num_per_loc:
+            batch_size:
+            input_dims:
+            ksize:
+            dilation:
+        """
+        ...
+    @staticmethod
     def maxpool_forward(out: Tensor, inp: Tensor, out_inds: Tensor, in_inds: Tensor, stream: int = 0) -> None:
         """
         Args:
@@ -95,9 +172,157 @@ class SpconvOps:
         """
         ...
     @staticmethod
-    def sort_1d_by_key(data: Tensor) -> Tensor:
+    def maxpool_implicit_gemm_forward(out: Tensor, inp: Tensor, inds: Tensor, stream: int = 0) -> None:
+        """
+        Args:
+            out:
+            inp:
+            inds:
+            stream:
+        """
+        ...
+    @staticmethod
+    def maxpool_implicit_gemm_backward(out: Tensor, inp: Tensor, dout: Tensor, dinp: Tensor, inds: Tensor, stream: int = 0) -> None:
+        """
+        Args:
+            out:
+            inp:
+            dout:
+            dinp:
+            inds:
+            stream:
+        """
+        ...
+    @staticmethod
+    def maxpool_forward_cpu(out: Tensor, inp: Tensor, out_inds: Tensor, in_inds: Tensor) -> None:
+        """
+        Args:
+            out:
+            inp:
+            out_inds:
+            in_inds:
+        """
+        ...
+    @staticmethod
+    def maxpool_backward_cpu(out: Tensor, inp: Tensor, dout: Tensor, dinp: Tensor, out_inds: Tensor, in_inds: Tensor) -> None:
+        """
+        Args:
+            out:
+            inp:
+            dout:
+            dinp:
+            out_inds:
+            in_inds:
+        """
+        ...
+    @staticmethod
+    def gather_cpu(out: Tensor, inp: Tensor, inds: Tensor) -> None:
+        """
+        Args:
+            out:
+            inp:
+            inds:
+        """
+        ...
+    @staticmethod
+    def scatter_add_cpu(out: Tensor, inp: Tensor, inds: Tensor) -> None:
+        """
+        Args:
+            out:
+            inp:
+            inds:
+        """
+        ...
+    @staticmethod
+    def sort_1d_by_key(data: Tensor, indices: Tensor = Tensor(), stream: int = 0) -> Tensor:
+        """
+        Args:
+            data:
+            indices:
+            stream:
+        """
+        ...
+    @staticmethod
+    def sort_1d_by_key_allocator(data: Tensor, alloc_func, indices: Tensor = Tensor(), stream: int = 0) -> Tensor:
         """
         Args:
             data:
+            alloc_func:
+            indices:
+            stream:
+        """
+        ...
+    @staticmethod
+    def sort_1d_by_key_split(data: Tensor, mask: Tensor, indices: Tensor = Tensor(), stream: int = 0, mask_output: bool = False) -> Tensor:
+        """
+        Args:
+            data:
+            mask:
+            indices:
+            stream:
+            mask_output:
+        """
+        ...
+    @staticmethod
+    def sort_1d_by_key_split_allocator(data: Tensor, alloc_func, mask: Tensor, indices: Tensor = Tensor(), stream: int = 0, mask_output: bool = False) -> Tensor:
+        """
+        Args:
+            data:
+            alloc_func:
+            mask:
+            indices:
+            stream:
+            mask_output:
+        """
+        ...
+    @staticmethod
+    def count_bits(a: Tensor) -> Tensor:
+        """
+        Args:
+            a:
+        """
+        ...
+    @staticmethod
+    def calc_point2voxel_meta_data(vsize_xyz: List[float], coors_range_xyz: List[float]) -> Tuple[List[float], List[int], List[int], List[float]]:
+        """
+        Args:
+            vsize_xyz:
+            coors_range_xyz:
+        """
+        ...
+    @staticmethod
+    def point2voxel_cpu(points: Tensor, voxels: Tensor, indices: Tensor, num_per_voxel: Tensor, densehashdata: Tensor, vsize: List[float], grid_size: List[int], grid_stride: List[int], coors_range: List[float], empty_mean: bool = False, clear_voxels: bool = True) -> Tuple[Tensor, Tensor, Tensor]:
+        """
+        Args:
+            points:
+            voxels:
+            indices:
+            num_per_voxel:
+            densehashdata:
+            vsize:
+            grid_size:
+            grid_stride:
+            coors_range:
+            empty_mean:
+            clear_voxels:
+        """
+        ...
+    @staticmethod
+    def point2voxel_cuda(points: Tensor, voxels: Tensor, indices: Tensor, num_per_voxel: Tensor, hashdata: Tensor, point_indice_data: Tensor, vsize: List[float], grid_size: List[int], grid_stride: List[int], coors_range: List[float], empty_mean: bool = False, clear_voxels: bool = True, stream_int: int = 0) -> Tuple[Tensor, Tensor, Tensor]:
+        """
+        Args:
+            points:
+            voxels:
+            indices:
+            num_per_voxel:
+            hashdata:
+            point_indice_data:
+            vsize:
+            grid_size:
+            grid_stride:
+            coors_range:
+            empty_mean:
+            clear_voxels:
+            stream_int:
         """
         ...
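The stub for `count_bits` carries no description. The development script sums its output to count valid kernel locations in 32-bit masks, which suggests a per-element population count. The NumPy sketch below works under that assumption. `count_bits_ref` is a hypothetical CPU stand-in, and the output dtype of the real op is not confirmed by this diff.

```python
import numpy as np


def count_bits_ref(a: np.ndarray) -> np.ndarray:
    """Per-element popcount of a 1-D array of 32-bit masks (assumed semantics)."""
    a = np.ascontiguousarray(a, dtype=np.uint32)
    as_bytes = a.view(np.uint8).reshape(a.shape + (4,))   # 4 bytes per 32-bit word
    bits = np.unpackbits(as_bytes, axis=-1)               # 32 bits per word
    return bits.sum(axis=-1).astype(np.int32)


masks = np.array([0, 1, 0b1011, 0xFFFFFFFF], dtype=np.uint32)
counts = count_bits_ref(masks)
print(counts, counts.sum())
```

Byte order does not matter here because only the number of set bits is counted.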
from typing import overload, Any, Callable, Dict, List, Optional, Set, Tuple, Type, Union
from pccm.stubs import EnumValue, EnumClassValue
from cumm.tensorview import Tensor
class Point2Voxel:
    hashdata: Tensor
    point_indice_data: Tensor
    voxels: Tensor
    indices: Tensor
    num_per_voxel: Tensor
    @property
    def grid_size(self) -> List[int]: ...
    def __init__(self, vsize_xyz: List[float], coors_range_xyz: List[float], num_point_features: int, max_num_voxels: int, max_num_points_per_voxel: int) -> None:
        """
        Args:
            vsize_xyz:
            coors_range_xyz:
            num_point_features:
            max_num_voxels:
            max_num_points_per_voxel:
        """
        ...
    def point_to_voxel_hash(self, points: Tensor, clear_voxels: bool = True, empty_mean: bool = False, stream_int: int = 0) -> Tuple[Tensor, Tensor, Tensor]:
        """
        Args:
            points:
            clear_voxels:
            empty_mean:
            stream_int:
        """
        ...
    @staticmethod
    def point_to_voxel_hash_static(points: Tensor, voxels: Tensor, indices: Tensor, num_per_voxel: Tensor, hashdata: Tensor, point_indice_data: Tensor, vsize: List[float], grid_size: List[int], grid_stride: List[int], coors_range: List[float], clear_voxels: bool = True, empty_mean: bool = False, stream_int: int = 0) -> Tuple[Tensor, Tensor, Tensor]:
        """
        Args:
            points:
            voxels:
            indices:
            num_per_voxel:
            hashdata:
            point_indice_data:
            vsize:
            grid_size:
            grid_stride:
            coors_range:
            clear_voxels:
            empty_mean:
            stream_int:
        """
        ...