Commit 596a3cc0 authored by yan.yan's avatar yan.yan
Browse files

update docs

parent b63c08aa
......@@ -45,7 +45,7 @@
| -------------- |:---------------------:| ---------------------:| ---------------------:|
| CPU (Linux Only) | [![PyPI Version][pypi-ver-cpu]][pypi-url-cpu] | ```pip install spconv``` | [![pypi monthly download][pypi-download-cpu]][pypi-url-cpu] |
| CUDA 10.2 | [![PyPI Version][pypi-ver-102]][pypi-url-102] | ```pip install spconv-cu102```| [![pypi monthly download][pypi-download-102]][pypi-url-102]|
| CUDA 11.3 (Linux Only) | [![PyPI Version][pypi-ver-113]][pypi-url-113] | ```pip install spconv-cu113```| [![pypi monthly download][pypi-download-113]][pypi-url-113]|
| CUDA 11.3 | [![PyPI Version][pypi-ver-113]][pypi-url-113] | ```pip install spconv-cu113```| [![pypi monthly download][pypi-download-113]][pypi-url-113]|
| CUDA 11.4 | [![PyPI Version][pypi-ver-114]][pypi-url-114] | ```pip install spconv-cu114```| [![pypi monthly download][pypi-download-114]][pypi-url-114]|
| CUDA 11.7 | [![PyPI Version][pypi-ver-117]][pypi-url-117] | ```pip install spconv-cu117```| [![pypi monthly download][pypi-download-117]][pypi-url-117]|
<!-- | CUDA 12.0 | [![PyPI Version][pypi-ver-120]][pypi-url-120] | ```pip install spconv-cu120```| [![pypi monthly download][pypi-download-120]][pypi-url-120]| -->
......
......@@ -16,47 +16,30 @@
## Simple Benchmark
### Network Benchmark without batchnorm (F32/F16) in RTX 3080 Laptop GPU 150W
### Network Benchmark without batchnorm (TF32/F16) in Different GPUs
Network Code: test/benchmark.py
Basic: ```python -m spconv.benchmark bench_basic f16``` and ```python -m spconv.benchmark bench_basic tf32```
| F32/F16 | Spconv 1.x F32 (1080Ti) | Native| Implicit Gemm | Implicit Gemm Split Mask |
| GPUs | F16-Forward | F16-Backward | TF32-Forward | TF32-Backward |
| -------------- |:---------------------:|---------------------:|---------------------:| ---------------------:|
| Forward | 43ms | 21.7ms/13.7ms | 23.5ms/11.2ms | 22ms/12.2ms |
| Backward | 80ms | 41.9ms/25.2ms | 51.0ms/13.8ms | 41.1ms/12.2ms |
| T4 | 18.74 | 25.51 | N/A | N/A |
| RTX 3080 Laptop (150W) | 8.2 | 11.51 | 15.04 | 26.90 |
| A100 | 13.02 | 12.43 | 12.35 | 14.93 |
| RTX3090 | 11.84 | 11.84 | 13.23 | 15.79 |
| RTX A6000 | 11.11 | 8.97 | 12.30 | 12.79 |
| F16 Forward | Native| Implicit Gemm | Implicit Gemm Split Mask |
| -------------- |:---------------------:|---------------------:| ---------------------:|
| RTX 3080 Laptop 150W@1755MHz | 13.7ms | 11.2ms | 12.2ms |
| RTX A6000 | 19.1ms | 11.7ms | 14.0ms |
| TESLA V100 | 17.9ms | 11.4ms | 13.4ms |
| A100 | 23.8ms | 12.4ms | 15.1ms |
Large: ```python -m spconv.benchmark bench_large f16``` and ```python -m spconv.benchmark bench_large tf32```
| F16 Backward | Native| Implicit Gemm | Implicit Gemm Split Mask |
| -------------- |:---------------------:|---------------------:| ---------------------:|
| RTX 3080 Laptop 150W@1755MHz | 25.2ms | 13.8ms | 12.2ms |
| RTX A6000 | 28.1ms | 9.2ms | 8.9ms |
| TESLA V100 | 33.9ms | 12.2ms | 12.9ms |
| A100 | 37.6ms | 12.2ms | 13.9ms |
### Network Gemm Kernel Benchmark FP16 in RTX 3080 Laptop GPU
Network Code: test/benchmark.py
The network/input/profile code is same as above table.
This table only profile **fp16 gemm kernels** without output tensor create/clear overhead. this table show the performance upper bound of our algorithm.
| F16 | Native| Implicit Gemm | Implicit Gemm Split Mask |
| -------------- |:---------------------:|---------------------:| ---------------------:|
| Forward | 8.0ms | 4.3ms | 4.0ms |
| GPUs | F16-Forward | F16-Backward | TF32-Forward | TF32-Backward |
| -------------- |:---------------------:|---------------------:|---------------------:| ---------------------:|
| T4 | 128.7 | 203.3 | N/A | N/A |
| RTX 3080 Laptop (150W) | 43.15 | 74.57 | 84.65 | 165.19 |
| A100 | 19.85 | 31.24 | 29.58 | 55.63 |
| RTX3090 | 27.83 | 40.45 | 44.51 | 73.17 |
| RTX A6000 | 28.62 | 39.86 | 45.43 | 74.11 |
We can see that the implicit gemm is very fast, gemm only use 4.3ms/11.2ms in network forward. we can achieve better performance in TensorRT + Pure C++.
**NOTE**
When you want to benchmark network in your laptop, don't forget to close all apps except terminals! Other apps will consume GPU resource and make kernels run slower.
## Comparsion with [MinkowskiEngine](https://github.com/NVIDIA/MinkowskiEngine) and [torchsparse](https://github.com/mit-han-lab/torchsparse)
TODO
\ No newline at end of file
Markdown is supported
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment