# TurboMind Benchmark on A100

All results below were obtained on A100-80G (x8) with CUDA 11.8. The tested lmdeploy version is `v0.2.0`.

## Request Throughput Benchmark

- `batch`: the max batch size during inference
- `tp`: the number of GPU cards used for tensor parallelism
- `num_prompts`: the number of prompts, i.e. the number of requests
- `RPS`: **R**equests **P**er **S**econd
- `FTL`: **F**irst **T**oken **L**atency

### FP16

| model        | batch | tp  | num_prompts | RPS    | FTL(ave)(s) | FTL(min)(s) | FTL(max)(s) | 50%(s) | 75%(s) | 95%(s) | 99%(s) | throughput(out tok/s) | throughput(total tok/s) |
| ------------ | ----- | --- | ----------- | ------ | ----------- | ----------- | ----------- | ------ | ------ | ------ | ------ | --------------------- | ----------------------- |
| llama2-7b    | 256   | 1   | 3000        | 14.556 | 0.526       | 0.092       | 4.652       | 0.066  | 0.101  | 0.155  | 0.220  | 3387.419              | 6981.159                |
| llama2-13b   | 128   | 1   | 3000        | 7.950  | 0.352       | 0.075       | 4.193       | 0.051  | 0.067  | 0.138  | 0.202  | 1850.145              | 3812.978                |
| internlm-20b | 128   | 2   | 3000        | 10.291 | 0.287       | 0.073       | 3.845       | 0.053  | 0.072  | 0.113  | 0.161  | 2053.266              | 4345.057                |
| llama2-70b   | 256   | 4   | 3000        | 7.231  | 1.075       | 0.139       | 14.524      | 0.102  | 0.153  | 0.292  | 0.482  | 1682.738              | 3467.969                |
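For reference, the sketch below shows how the headline columns in this table are typically derived from per-request measurements: RPS divides the number of completed requests by the wall-clock time of the run, and the token throughputs divide the generated (or prompt plus generated) tokens by the same elapsed time. The `RequestRecord` structure and the `summarize` helper are hypothetical illustrations, not lmdeploy's actual benchmark code.

```python
from dataclasses import dataclass


@dataclass
class RequestRecord:
    """Hypothetical per-request measurements collected by a benchmark client."""

    prompt_tokens: int           # number of input tokens
    output_tokens: int           # number of generated tokens
    first_token_latency: float   # seconds from submission to the first generated token
    finish_time: float           # seconds from benchmark start to request completion


def summarize(records: list[RequestRecord]) -> dict:
    """Aggregate completed requests into the metrics reported in the table above."""
    elapsed = max(r.finish_time for r in records)       # wall-clock duration of the run
    out_tokens = sum(r.output_tokens for r in records)
    total_tokens = out_tokens + sum(r.prompt_tokens for r in records)
    ftls = [r.first_token_latency for r in records]
    return {
        "RPS": len(records) / elapsed,                  # requests per second
        "FTL(ave)(s)": sum(ftls) / len(ftls),
        "FTL(min)(s)": min(ftls),
        "FTL(max)(s)": max(ftls),
        "throughput(out tok/s)": out_tokens / elapsed,      # generated tokens per second
        "throughput(total tok/s)": total_tokens / elapsed,  # prompt + generated tokens per second
    }
```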
## Static Inference Benchmark

- `batch`: the max batch size during inference
- `tp`: the number of GPU cards used for tensor parallelism
- `prompt_tokens`: the number of input tokens
- `output_tokens`: the number of generated tokens
- `throughput`: the number of generated tokens per second
- `FTL`: **F**irst **T**oken **L**atency

### FP16

| batch | tp  | prompt_tokens | output_tokens | throughput(out tok/s) | mem(GB) | FTL(ave)(s) | FTL(min)(s) | FTL(max)(s) | 50%(s) | 75%(s) | 95%(s) | 99%(s) |
| ----- | --- | ------------- | ------------- | --------------------- | ------- | ----------- | ----------- | ----------- | ------ | ------ | ------ | ------ |
| 1     | 1   | 1             | 128           | 100.02                | 76.55   | 0.011       | 0.01        | 0.011       | 0.009  | 0.009  | 0.01   | 0.011  |
| 1     | 1   | 128           | 128           | 102.21                | 76.59   | 0.022       | 0.022       | 0.022       | 0.01   | 0.01   | 0.01   | 0.01   |
| 1     | 1   | 128           | 2048          | 98.92                 | 76.59   | 0.022       | 0.022       | 0.022       | 0.01   | 0.01   | 0.01   | 0.01   |
| 1     | 1   | 2048          | 128           | 86.1                  | 76.77   | 0.139       | 0.139       | 0.14        | 0.01   | 0.01   | 0.01   | 0.011  |
| 1     | 1   | 2048          | 2048          | 93.78                 | 76.77   | 0.14        | 0.139       | 0.141       | 0.011  | 0.011  | 0.011  | 0.011  |
| 16    | 1   | 1             | 128           | 1504.72               | 76.59   | 0.021       | 0.011       | 0.031       | 0.01   | 0.011  | 0.011  | 0.013  |
| 16    | 1   | 128           | 128           | 1272.47               | 76.77   | 0.129       | 0.023       | 0.149       | 0.011  | 0.011  | 0.012  | 0.014  |
| 16    | 1   | 128           | 2048          | 1010.62               | 76.77   | 0.13        | 0.023       | 0.144       | 0.015  | 0.018  | 0.02   | 0.021  |
| 16    | 1   | 2048          | 128           | 348.87                | 78.3    | 2.897       | 0.143       | 3.576       | 0.02   | 0.021  | 0.022  | 0.025  |
| 16    | 1   | 2048          | 2048          | 601.63                | 78.3    | 2.678       | 0.142       | 3.084       | 0.025  | 0.028  | 0.03   | 0.031  |
| 32    | 1   | 1             | 128           | 2136.73               | 76.62   | 0.079       | 0.014       | 0.725       | 0.011  | 0.012  | 0.013  | 0.021  |
| 32    | 1   | 128           | 128           | 2125.47               | 76.99   | 0.214       | 0.022       | 0.359       | 0.012  | 0.013  | 0.014  | 0.035  |
| 32    | 1   | 128           | 2048          | 1462.12               | 76.99   | 0.2         | 0.026       | 0.269       | 0.021  | 0.026  | 0.031  | 0.033  |
| 32    | 1   | 2048          | 128           | 450.43                | 78.3    | 4.288       | 0.143       | 5.267       | 0.031  | 0.032  | 0.034  | 0.161  |
| 32    | 1   | 2048          | 2048          | 733.34                | 78.34   | 4.118       | 0.19        | 5.429       | 0.04   | 0.045  | 0.05   | 0.053  |
| 64    | 1   | 1             | 128           | 4154.81               | 76.71   | 0.042       | 0.013       | 0.21        | 0.012  | 0.018  | 0.028  | 0.041  |
| 64    | 1   | 128           | 128           | 3024.07               | 77.43   | 0.44        | 0.026       | 1.061       | 0.014  | 0.018  | 0.026  | 0.158  |
| 64    | 1   | 128           | 2048          | 1852.06               | 77.96   | 0.535       | 0.027       | 1.231       | 0.03   | 0.041  | 0.048  | 0.053  |
| 64    | 1   | 2048          | 128           | 493.46                | 78.4    | 6.59        | 0.142       | 16.235      | 0.046  | 0.049  | 0.055  | 0.767  |
| 64    | 1   | 2048          | 2048          | 755.65                | 78.4    | 39.105      | 0.142       | 116.285     | 0.047  | 0.049  | 0.051  | 0.207  |
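The sketch below illustrates, under the assumption that the 50%/75%/95%/99% columns report percentiles of the per-token decode latency, how such statistics could be derived from the arrival timestamps of generated tokens for a single request. The `decode_stats` helper and its inputs are hypothetical and are not taken from lmdeploy's profiling scripts.

```python
import numpy as np


def decode_stats(request_start: float, token_timestamps: list[float]) -> dict:
    """Derive latency/throughput statistics for one generation.

    request_start: wall-clock time (s) at which the prompt was submitted (hypothetical input).
    token_timestamps: wall-clock times (s) at which each generated token arrived.
    """
    ftl = token_timestamps[0] - request_start        # first-token latency
    gaps = np.diff(token_timestamps)                 # per-token decode latencies
    total = token_timestamps[-1] - request_start     # end-to-end generation time
    return {
        "FTL(s)": ftl,
        "throughput(out tok/s)": len(token_timestamps) / total,
        "50%(s)": float(np.percentile(gaps, 50)),
        "75%(s)": float(np.percentile(gaps, 75)),
        "95%(s)": float(np.percentile(gaps, 95)),
        "99%(s)": float(np.percentile(gaps, 99)),
    }
```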