update docs

Commit 596a3cc0 authored Sep 25, 2022 by yan.yan

Showing with 17 additions and 34 deletions

README.md +1 -1

docs/BENCHMARK.md +16 -33

--- a/README.md
+++ b/README.md
@@ -45,7 +45,7 @@
 | -------------- |:---------------------:| ---------------------:| ---------------------:| 
 | CPU (Linux Only) | [![PyPI Version][pypi-ver-cpu]][pypi-url-cpu] | ```pip install spconv``` | [![pypi monthly download][pypi-download-cpu]][pypi-url-cpu] | 
 | CUDA 10.2 | [![PyPI Version][pypi-ver-102]][pypi-url-102] | ```pip install spconv-cu102```| [![pypi monthly download][pypi-download-102]][pypi-url-102]| 
-| CUDA 11.3 (Linux Only) | [![PyPI Version][pypi-ver-113]][pypi-url-113] | ```pip install spconv-cu113```| [![pypi monthly download][pypi-download-113]][pypi-url-113]| 
+| CUDA 11.3 | [![PyPI Version][pypi-ver-113]][pypi-url-113] | ```pip install spconv-cu113```| [![pypi monthly download][pypi-download-113]][pypi-url-113]| 
 | CUDA 11.4 | [![PyPI Version][pypi-ver-114]][pypi-url-114] | ```pip install spconv-cu114```| [![pypi monthly download][pypi-download-114]][pypi-url-114]|
 | CUDA 11.7 | [![PyPI Version][pypi-ver-117]][pypi-url-117] | ```pip install spconv-cu117```| [![pypi monthly download][pypi-download-117]][pypi-url-117]| 
 <!-- | CUDA 12.0 | [![PyPI Version][pypi-ver-120]][pypi-url-120] | ```pip install spconv-cu120```| [![pypi monthly download][pypi-download-120]][pypi-url-120]| -->

--- a/docs/BENCHMARK.md
+++ b/docs/BENCHMARK.md
@@ -16,47 +16,30 @@

 ## Simple Benchmark

-### Network Benchmark without batchnorm (F32/F16) in RTX 3080 Laptop GPU 150W
+### Network Benchmark without batchnorm (TF32/F16) in Different GPUs

-Network Code: test/benchmark.py
+Basic: ```python -m spconv.benchmark bench_basic f16``` and ```python -m spconv.benchmark bench_basic tf32```

-| F32/F16 | Spconv 1.x F32 (1080Ti) | Native| Implicit Gemm | Implicit Gemm Split Mask  |
+| GPUs | F16-Forward | F16-Backward | TF32-Forward  | TF32-Backward |
 | -------------- |:---------------------:|---------------------:|---------------------:| ---------------------:|
-| Forward | 43ms     | 21.7ms/13.7ms    | 23.5ms/11.2ms      | 22ms/12.2ms      |
-| Backward | 80ms    | 41.9ms/25.2ms    | 51.0ms/13.8ms      | 41.1ms/12.2ms      |
+| T4 | 18.74     | 25.51    | N/A      | N/A      |
+| RTX 3080 Laptop (150W) | 8.2    | 11.51    | 15.04      | 26.90      |
+| A100 | 13.02    | 12.43    | 12.35      | 14.93      |
+| RTX3090 | 11.84    | 11.84    | 13.23      | 15.79      |
+| RTX A6000 | 11.11    | 8.97    | 12.30      | 12.79      |

-| F16 Forward | Native| Implicit Gemm | Implicit Gemm Split Mask  |
-| -------------- |:---------------------:|---------------------:| ---------------------:|
-| RTX 3080 Laptop 150W@1755MHz | 13.7ms     | 11.2ms    | 12.2ms      |
-| RTX A6000 | 19.1ms    |  11.7ms   | 14.0ms      |
-| TESLA V100 | 17.9ms    |  11.4ms   | 13.4ms      |
-| A100 | 23.8ms    |  12.4ms   | 15.1ms      |
+Large: ```python -m spconv.benchmark bench_large f16``` and ```python -m spconv.benchmark bench_large tf32```

-| F16 Backward | Native| Implicit Gemm | Implicit Gemm Split Mask  |
-| -------------- |:---------------------:|---------------------:| ---------------------:|
-| RTX 3080 Laptop 150W@1755MHz | 25.2ms     | 13.8ms    | 12.2ms      |
-| RTX A6000       | 28.1ms     | 9.2ms     | 8.9ms      |
-| TESLA V100 | 33.9ms    |  12.2ms   | 12.9ms      |
-| A100 | 37.6ms    |  12.2ms   | 13.9ms      |
-
-### Network Gemm Kernel Benchmark FP16 in RTX 3080 Laptop GPU
-
-Network Code: test/benchmark.py
-
-The network/input/profile code is same as above table.
-
-This table only profile **fp16 gemm kernels** without output tensor create/clear overhead. this table show the performance upper bound of our algorithm.
-
-| F16 |  Native| Implicit Gemm | Implicit Gemm Split Mask  |
-| -------------- |:---------------------:|---------------------:| ---------------------:|
-| Forward | 8.0ms    | 4.3ms      | 4.0ms      |
+| GPUs | F16-Forward | F16-Backward | TF32-Forward  | TF32-Backward |
+| -------------- |:---------------------:|---------------------:|---------------------:| ---------------------:|
+| T4 | 128.7     | 203.3    | N/A      | N/A      |
+| RTX 3080 Laptop (150W) | 43.15    | 74.57    | 84.65      | 165.19      |
+| A100 | 19.85    | 31.24    | 29.58      | 55.63      |
+| RTX3090 | 27.83    | 40.45    | 44.51      | 73.17      |
+| RTX A6000 | 28.62    | 39.86    | 45.43      | 74.11      |

-We can see that the implicit gemm is very fast, gemm only use 4.3ms/11.2ms in network forward. we can achieve better performance in TensorRT + Pure C++.

 **NOTE** 
 When you want to benchmark network in your laptop, don't forget to close all apps except terminals! Other apps will consume GPU resource and make kernels run slower.


-## Comparsion with [MinkowskiEngine](https://github.com/NVIDIA/MinkowskiEngine) and [torchsparse](https://github.com/mit-han-lab/torchsparse)
-
-TODO
\ No newline at end of file
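
Reading the new "Large" table: the added rows carry no units and no derived comparison, so the sketch below shows one way to compare TF32 against F16 per GPU. The numbers are copied from the table added in this commit; the dictionary name `LARGE`, the helper `tf32_over_f16`, and the choice to sum forward and backward time are illustrative assumptions, not part of spconv. The T4 row is left out because its TF32 columns are N/A.

```python
# Sketch only: values copied from the "Large" table added in this commit.
# The table states no units; a ratio does not depend on them.
# Column order: F16-Forward, F16-Backward, TF32-Forward, TF32-Backward.
LARGE = {
    "RTX 3080 Laptop (150W)": (43.15, 74.57, 84.65, 165.19),
    "A100": (19.85, 31.24, 29.58, 55.63),
    "RTX3090": (27.83, 40.45, 44.51, 73.17),
    "RTX A6000": (28.62, 39.86, 45.43, 74.11),
}


def tf32_over_f16(row):
    """Return (TF32 forward + backward) / (F16 forward + backward) for one row."""
    f16_fwd, f16_bwd, tf32_fwd, tf32_bwd = row
    return (tf32_fwd + tf32_bwd) / (f16_fwd + f16_bwd)


# Hypothetical summary: how many times longer a TF32 step takes than an F16 step.
ratios = {gpu: tf32_over_f16(row) for gpu, row in LARGE.items()}

if __name__ == "__main__":
    for gpu, ratio in sorted(ratios.items(), key=lambda item: item[1]):
        print(f"{gpu}: TF32 takes {ratio:.2f}x the F16 time")
```

On these numbers every listed GPU spends more time in TF32 than in F16, with the largest gap on the RTX 3080 Laptop row.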