## Simple Benchmark

### Network Benchmark without batchnorm (F32/F16) in RTX 3080 Laptop GPU 150W

Network Code: test/benchmark.py

Each cell in the last three columns is the F32 time followed by the F16 time.

| F32/F16        | Spconv 1.x F32 (1080Ti) | Native | Implicit Gemm | Implicit Gemm Split Mask |
| -------------- |:---------------------:|---------------------:|---------------------:| ---------------------:|
| Forward        | 43ms | 21.7ms/13.7ms | 23.5ms/11.2ms | 22ms/12.2ms |
| Backward       | 80ms | 41.9ms/25.2ms | 51.0ms/13.8ms | 41.1ms/12.2ms |

| F16 Forward    | Native | Implicit Gemm | Implicit Gemm Split Mask |
| -------------- |:---------------------:|---------------------:| ---------------------:|
| RTX 3080 Laptop 150W | 13.7ms | 11.2ms | 12.2ms |
| RTX A6000      | 19.1ms | 11.7ms | 14.0ms |
| TESLA V100     | 17.9ms | 11.4ms | 13.4ms |

| F16 Backward   | Native | Implicit Gemm | Implicit Gemm Split Mask |
| -------------- |:---------------------:|---------------------:| ---------------------:|
| RTX 3080 Laptop 150W | 25.2ms | 13.8ms | 12.2ms |
| RTX A6000      | 28.1ms | 9.2ms | 8.9ms |
| TESLA V100     | 33.9ms | 12.2ms | 12.9ms |

### Network Gemm Kernel Benchmark FP16 in RTX 3080 Laptop GPU

Network Code: test/benchmark.py

The network, input, and profiling code are the same as for the tables above. This table profiles only the **fp16 gemm kernels**, excluding the overhead of creating and clearing output tensors, so it shows the performance upper bound of our algorithm.

| F16            | Native | Implicit Gemm | Implicit Gemm Split Mask |
| -------------- |:---------------------:|---------------------:| ---------------------:|
| Forward        | 8.0ms | 4.3ms | 4.0ms |

Implicit gemm is very fast: the gemm kernels take only 4.3ms of the 11.2ms network forward pass. We can achieve better performance with TensorRT + pure C++.

**NOTE** When you benchmark a network on your laptop, don't forget to close all apps except terminals! Other apps consume GPU resources and make kernels run slower.

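The gemm share of the forward pass quoted above can be checked with simple arithmetic. The sketch below is only an illustration: the timings are copied from the F16 tables for the RTX 3080 Laptop GPU, and the helper names `gemm_share` and `non_gemm_ms` are hypothetical, not part of spconv. It assumes the gemm-only timing and the full forward timing come from comparable runs, as the text states.

```python
# F16 timings in milliseconds, copied from the tables above
# (RTX 3080 Laptop GPU). Nothing here is measured by this script.
forward_ms = {
    "Native": 13.7,
    "Implicit Gemm": 11.2,
    "Implicit Gemm Split Mask": 12.2,
}
gemm_only_forward_ms = {
    "Native": 8.0,
    "Implicit Gemm": 4.3,
    "Implicit Gemm Split Mask": 4.0,
}


def gemm_share(algo: str) -> float:
    """Fraction of the full forward time spent in gemm kernels (hypothetical helper)."""
    return gemm_only_forward_ms[algo] / forward_ms[algo]


def non_gemm_ms(algo: str) -> float:
    """Forward time not spent in gemm kernels, in ms (hypothetical helper).

    This remainder includes output tensor creation/clearing; the source
    does not break it down further.
    """
    return forward_ms[algo] - gemm_only_forward_ms[algo]


if __name__ == "__main__":
    for algo in forward_ms:
        print(
            f"{algo}: gemm share {gemm_share(algo):.0%}, "
            f"non-gemm time {non_gemm_ms(algo):.1f} ms"
        )
```

For Implicit Gemm this gives a gemm share of roughly 38%, meaning most of the forward time is spent outside the gemm kernels in this setup.
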
## Comparison with [MinkowskiEngine](https://github.com/NVIDIA/MinkowskiEngine) and [torchsparse](https://github.com/mit-han-lab/torchsparse)

TODO