<!--
 Copyright 2021 Yan Yan
 
 Licensed under the Apache License, Version 2.0 (the "License");
 you may not use this file except in compliance with the License.
 You may obtain a copy of the License at
 
     http://www.apache.org/licenses/LICENSE-2.0
 
 Unless required by applicable law or agreed to in writing, software
 distributed under the License is distributed on an "AS IS" BASIS,
 WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 See the License for the specific language governing permissions and
 limitations under the License.
-->

# Spconv 2.x Performance Guide

## Short Guide

* If you train without Tensor Cores (i.e. FP32 training), manually set every ```algo``` in convolution/maxpool layers to ```ConvAlgo.Native```. The default algorithm is ```ConvAlgo.MaskImplicitGemm```, which is **SLOWER** than ```ConvAlgo.Native``` when using float32. This will be fixed in spconv 2.2.
* If your GPU supports Tensor Cores, use FP16 (mixed precision training) if possible.
* If you train with mixed precision (i.e. with Tensor Cores), you don't need to set the algorithm manually.
* Currently the fast algorithms only support a kernel volume (product of the kernel size in each dimension) <= 32, so don't use large kernel sizes.
* Make sure your channel size is a multiple of 8 when using FP16. A multiple of 32 is better.
* spconv 2.x on Windows 10 is 1.5x~2x slower than on Linux. Use Linux if possible.
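
The kernel-volume and channel-size rules above can be checked before building a network. The following is a minimal sketch in plain Python; `kernel_volume` and `check_layer` are hypothetical helper names, not part of spconv, and the thresholds (32 for kernel volume, 8 and 32 for channels) are taken from the list above.

```python
from math import prod

# Limit stated in this guide for the fast algorithms.
FAST_KERNEL_VOLUME_LIMIT = 32


def kernel_volume(kernel_size, ndim=3):
    """Return the product of the kernel dimensions.

    An int kernel size is expanded to `ndim` equal dimensions.
    """
    if isinstance(kernel_size, int):
        kernel_size = (kernel_size,) * ndim
    return prod(kernel_size)


def check_layer(in_channels, out_channels, kernel_size, ndim=3, fp16=True):
    """Return a list of warnings for one layer configuration (hypothetical helper)."""
    warnings = []
    volume = kernel_volume(kernel_size, ndim)
    if volume > FAST_KERNEL_VOLUME_LIMIT:
        warnings.append(
            f"kernel volume {volume} > {FAST_KERNEL_VOLUME_LIMIT}: "
            "fast algorithm unavailable"
        )
    if fp16:
        for name, c in (("in_channels", in_channels), ("out_channels", out_channels)):
            if c % 8 != 0:
                warnings.append(f"{name}={c} is not a multiple of 8")
            elif c % 32 != 0:
                warnings.append(f"{name}={c} is a multiple of 8 but not of 32")
    return warnings
```

For example, a 3x3x3 kernel has volume 27 and passes the check, while a 5x5x5 kernel has volume 125 and does not.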

See [benchmark](BENCHMARK.md) for more performance details of different algorithms.

## Algorithm Overview

### Native Explicit (deprecated and removed in spconv 2.x)

The native algorithm (explicit, not fused) is the standard gather-gemm-scatter algorithm. Suppose we compute a 3x3 convolution: we can split it into nine 1x1 convolutions, each of which can be computed by a matmul, then sum them to get the final result.
For sparse convolution, we also use split-gemm-sum to compute the convolution, but because the data is sparse we must first gather the input features involved in each 1x1 convolution.
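
The gather-gemm-scatter idea can be sketched in plain Python. This is an illustration only, not spconv's implementation: it assumes a 2D submanifold convolution (outputs at the same sites as inputs), stores matrices as nested lists, and uses the hypothetical names `matmul` and `sparse_conv2d_native`.

```python
from itertools import product


def matmul(a, b):
    """Multiply an (n x k) matrix by a (k x m) matrix, both as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def sparse_conv2d_native(coords, feats, weights, out_channels):
    """3x3 sparse convolution via gather-gemm-scatter (illustrative sketch).

    coords:  list of (y, x) active sites; outputs are produced at the same sites.
    feats:   list of per-site feature vectors, one per entry in coords.
    weights: dict mapping a kernel offset (dy, dx) to an
             (in_channels x out_channels) matrix.
    """
    index_of = {c: i for i, c in enumerate(coords)}
    out = [[0.0] * out_channels for _ in coords]
    # 9 kernel offsets = 9 separate 1x1 convolutions.
    for dy, dx in product((-1, 0, 1), repeat=2):
        # Find (input index, output index) pairs connected by this offset.
        pairs = [
            (index_of[(y + dy, x + dx)], o)
            for o, (y, x) in enumerate(coords)
            if (y + dy, x + dx) in index_of
        ]
        if not pairs:
            continue
        gathered = [feats[i] for i, _ in pairs]        # gather
        partial = matmul(gathered, weights[(dy, dx)])  # gemm
        for (_, o), row in zip(pairs, partial):        # scatter-add (the "sum" step)
            for c, v in enumerate(row):
                out[o][c] += v
    return out


coords = [(0, 0), (0, 1), (2, 2)]
feats = [[1.0], [2.0], [5.0]]
ones = {off: [[1.0]] for off in product((-1, 0, 1), repeat=2)}
print(sparse_conv2d_native(coords, feats, ones, out_channels=1))
# prints [[3.0], [3.0], [5.0]]
```

With all-ones weights each output is the sum of the active features in its 3x3 neighborhood: the two adjacent sites see each other, while the isolated site at (2, 2) only sees itself.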

### Native

Fused version of the above algorithm. 1.5x-2x faster than the non-fused version.

### Implicit Gemm

The ```Native``` algorithm performs the minimal number of MMA (matrix multiply-add) operations, but it needs to serialize IO. The pipeline of ```Native``` is gather-gemm-scatter-gather-gemm-scatter-...

```Implicit Gemm``` fuses all computation into one kernel and performs overlapped gather-mma-scatter, which saves a lot of time.

![Image Overlapped Gemm](https://raw.githubusercontent.com/NVIDIA/cutlass/master/media/images/software-pipeline.png)

In my test, ```Implicit Gemm``` is almost 2x faster than ```Native```.

### Implicit Gemm Split Mask

TODO

In my test, ```Implicit Gemm Split Mask``` is slightly faster than ```Implicit Gemm```, but generating its indices is slower, so currently we use ```Implicit Gemm``` by default.