## Simple Benchmark

### Network Benchmark without batchnorm (F32/F16) on RTX 3080 Laptop GPU

Network Code: test/benchmark.py

Entries with two values are F32/F16 timings.

| F32/F16        | Spconv 1.x F32 (1080Ti) | Native               | Implicit Gemm        | Implicit Gemm Split Mask |
| -------------- |:---------------------:|---------------------:|---------------------:| ---------------------:|
| Forward        | 43ms                  | 21.7ms/13.7ms        | 23.5ms/11.2ms        | 22ms/12.2ms           |
| Backward       | 80ms                  | 41.9ms/25.2ms        | 51.0ms/13.8ms        | 41.1ms/12.2ms         |

### Network Gemm Kernel Benchmark FP16 on RTX 3080 Laptop GPU

Network Code: test/benchmark.py

The network, input, and profiling code are the same as for the table above. This table profiles only the **fp16 gemm kernels**, without the overhead of creating and clearing output tensors. It shows the performance upper bound of our algorithm.

| F16            | Native                | Implicit Gemm        | Implicit Gemm Split Mask |
| -------------- |:---------------------:|---------------------:| ---------------------:|
| Forward        | 8.0ms                 | 4.3ms                | 4.0ms                 |

We can see that implicit gemm is very fast: the gemm kernels account for only 4.3ms of the 11.2ms network forward pass. We can achieve better performance with TensorRT + pure C++.

**NOTE** When you benchmark a network on your laptop, don't forget to close all apps except terminals! Other apps consume GPU resources and make kernels run slower.

## Comparison with [MinkowskiEngine](https://github.com/NVIDIA/MinkowskiEngine) and [torchsparse](https://github.com/mit-han-lab/torchsparse)

TODO