# TurboMind Benchmark on A100

All results below were obtained on A100-80G (x8) with CUDA 11.8.

The benchmarked lmdeploy version is `v0.2.0`.
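
To reproduce the results, it helps to confirm the installed version matches. A quick sanity check (assuming the package exposes `__version__`, as recent lmdeploy releases do):

```python
# Verify the installed lmdeploy version before benchmarking
import lmdeploy

assert lmdeploy.__version__ == "0.2.0", lmdeploy.__version__
```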

## Request Throughput Benchmark

- `batch`: the max batch size during inference
- `tp`: the number of GPU cards for tensor parallelism
- `num_prompts`: the number of prompts, i.e. the number of requests
- `RPS`: **R**equests **P**er **S**econd
- `FTL`: **F**irst **T**oken **L**atency (see the sketch after this list for how these metrics are derived)
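
A minimal sketch of how the metrics above can be derived from per-request timing records. The `Record` fields and the `summarize` helper are hypothetical stand-ins for what a profiler collects, not lmdeploy's actual benchmark code:

```python
from dataclasses import dataclass
import statistics


@dataclass
class Record:
    start: float         # wall-clock time the request was issued (s)
    first_token: float   # wall-clock time the first token arrived (s)
    end: float           # wall-clock time the last token arrived (s)
    output_tokens: int   # number of generated tokens
    total_tokens: int    # prompt tokens + generated tokens


def summarize(records: list[Record]) -> dict:
    # Total wall-clock span of the whole benchmark run
    elapsed = max(r.end for r in records) - min(r.start for r in records)
    # First-token latency per request
    ftl = [r.first_token - r.start for r in records]
    return {
        "RPS": len(records) / elapsed,  # requests served per second
        "FTL(ave)(s)": statistics.mean(ftl),
        "FTL(min)(s)": min(ftl),
        "FTL(max)(s)": max(ftl),
        "throughput(out tok/s)": sum(r.output_tokens for r in records) / elapsed,
        "throughput(total tok/s)": sum(r.total_tokens for r in records) / elapsed,
    }
```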

### FP16

| model        | batch | tp  | num_prompts | RPS    | FTL(ave)(s) | FTL(min)(s) | FTL(max)(s) | 50%(s) | 75%(s) | 95%(s) | 99%(s) | throughput(out tok/s) | throughput(total tok/s) |
| ------------ | ----- | --- | ---------- | ------ | ----------- | ----------- | ----------- | ------ | ------ | ------ | ------ | --------------------- | ----------------------- |
| llama2-7b    | 256   | 1   | 3000       | 14.556 | 0.526       | 0.092       | 4.652       | 0.066  | 0.101  | 0.155  | 0.220  | 3387.419              | 6981.159                |
| llama2-13b   | 128   | 1   | 3000       | 7.950  | 0.352       | 0.075       | 4.193       | 0.051  | 0.067  | 0.138  | 0.202  | 1850.145              | 3812.978                |
| internlm-20b | 128   | 2   | 3000       | 10.291 | 0.287       | 0.073       | 3.845       | 0.053  | 0.072  | 0.113  | 0.161  | 2053.266              | 4345.057                |
| llama2-70b   | 256   | 4   | 3000       | 7.231  | 1.075       | 0.139       | 14.524      | 0.102  | 0.153  | 0.292  | 0.482  | 1682.738              | 3467.969                |

## Static Inference Benchmark

- `batch`: the max batch size during inference
- `tp`: the number of GPU cards for tensor parallelism
- `prompt_tokens`: the number of input tokens
- `output_tokens`: the number of generated tokens
- `throughput`: the number of generated tokens per second
- `FTL`: **F**irst **T**oken **L**atency (see the sketch after this list for how these numbers can be computed)
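
A similar sketch for the static case, assuming the 50%/75%/95%/99% columns are percentiles of per-token latency (the gap between consecutive generated tokens); `static_metrics` and its inputs are hypothetical, not lmdeploy's profiler API:

```python
import statistics


def static_metrics(token_times: list[list[float]], batch: int, output_tokens: int) -> dict:
    # token_times[i] holds the arrival timestamps (s) of every token in sequence i
    elapsed = max(t[-1] for t in token_times) - min(t[0] for t in token_times)
    # Per-token latency: gap between consecutive tokens within each sequence
    gaps = [b - a for seq in token_times for a, b in zip(seq, seq[1:])]
    q = statistics.quantiles(gaps, n=100)  # q[k-1] approximates the k-th percentile
    return {
        "throughput(out tok/s)": batch * output_tokens / elapsed,
        "50%(s)": q[49],
        "75%(s)": q[74],
        "95%(s)": q[94],
        "99%(s)": q[98],
    }
```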

### FP16

| batch | tp  | prompt_tokens | output_tokens | throughput(out tok/s) | mem(GB) | FTL(ave)(s) | FTL(min)(s) | FTL(max)(s) | 50%(s) | 75%(s) | 95%(s) | 99%(s) |
| ----- | --- | ------------- | ------------- | --------------------- | ------- | ----------- | ----------- | ----------- | ------ | ------ | ------ | ------ |
| 1     | 1   | 1             | 128           | 100.02                | 76.55   | 0.011       | 0.01        | 0.011       | 0.009  | 0.009  | 0.01   | 0.011  |
| 1     | 1   | 128           | 128           | 102.21                | 76.59   | 0.022       | 0.022       | 0.022       | 0.01   | 0.01   | 0.01   | 0.01   |
| 1     | 1   | 128           | 2048          | 98.92                 | 76.59   | 0.022       | 0.022       | 0.022       | 0.01   | 0.01   | 0.01   | 0.01   |
| 1     | 1   | 2048          | 128           | 86.1                  | 76.77   | 0.139       | 0.139       | 0.14        | 0.01   | 0.01   | 0.01   | 0.011  |
| 1     | 1   | 2048          | 2048          | 93.78                 | 76.77   | 0.14        | 0.139       | 0.141       | 0.011  | 0.011  | 0.011  | 0.011  |
| 16    | 1   | 1             | 128           | 1504.72               | 76.59   | 0.021       | 0.011       | 0.031       | 0.01   | 0.011  | 0.011  | 0.013  |
| 16    | 1   | 128           | 128           | 1272.47               | 76.77   | 0.129       | 0.023       | 0.149       | 0.011  | 0.011  | 0.012  | 0.014  |
| 16    | 1   | 128           | 2048          | 1010.62               | 76.77   | 0.13        | 0.023       | 0.144       | 0.015  | 0.018  | 0.02   | 0.021  |
| 16    | 1   | 2048          | 128           | 348.87                | 78.3    | 2.897       | 0.143       | 3.576       | 0.02   | 0.021  | 0.022  | 0.025  |
| 16    | 1   | 2048          | 2048          | 601.63                | 78.3    | 2.678       | 0.142       | 3.084       | 0.025  | 0.028  | 0.03   | 0.031  |
| 32    | 1   | 1             | 128           | 2136.73               | 76.62   | 0.079       | 0.014       | 0.725       | 0.011  | 0.012  | 0.013  | 0.021  |
| 32    | 1   | 128           | 128           | 2125.47               | 76.99   | 0.214       | 0.022       | 0.359       | 0.012  | 0.013  | 0.014  | 0.035  |
| 32    | 1   | 128           | 2048          | 1462.12               | 76.99   | 0.2         | 0.026       | 0.269       | 0.021  | 0.026  | 0.031  | 0.033  |
| 32    | 1   | 2048          | 128           | 450.43                | 78.3    | 4.288       | 0.143       | 5.267       | 0.031  | 0.032  | 0.034  | 0.161  |
| 32    | 1   | 2048          | 2048          | 733.34                | 78.34   | 4.118       | 0.19        | 5.429       | 0.04   | 0.045  | 0.05   | 0.053  |
| 64    | 1   | 1             | 128           | 4154.81               | 76.71   | 0.042       | 0.013       | 0.21        | 0.012  | 0.018  | 0.028  | 0.041  |
| 64    | 1   | 128           | 128           | 3024.07               | 77.43   | 0.44        | 0.026       | 1.061       | 0.014  | 0.018  | 0.026  | 0.158  |
| 64    | 1   | 128           | 2048          | 1852.06               | 77.96   | 0.535       | 0.027       | 1.231       | 0.03   | 0.041  | 0.048  | 0.053  |
| 64    | 1   | 2048          | 128           | 493.46                | 78.4    | 6.59        | 0.142       | 16.235      | 0.046  | 0.049  | 0.055  | 0.767  |
| 64    | 1   | 2048          | 2048          | 755.65                | 78.4    | 39.105      | 0.142       | 116.285     | 0.047  | 0.049  | 0.051  | 0.207  |