Commit d7117b95 authored by zhouxiang

Sync the 0.2.6 code

parent 5f83e392
# TurboMind Benchmark on A100

All the following results are tested on A100-80G(x8), CUDA 11.8.
The tested lmdeploy version is `v0.2.0`.
The commands provided below facilitate benchmarking both [static inference performance](#static-inference-benchmark) and [request throughput](#request-throughput-benchmark) on an A100-80G(x8) for models of various sizes.
## Request Throughput Benchmark
- `batch`: the max batch size during inference
- `tp`: the number of GPU cards for tensor parallelism
- `num_prompts`: the number of prompts, i.e. the number of requests
- `RPS`: **R**equests **P**er **S**econd
- `FTL`: **F**irst **T**oken **L**atency
### FP16
```shell
bash benchmark/benchmark_7b.sh <the/path/of/llama2-7b/model>
bash benchmark/benchmark_13b.sh <the/path/of/llama2-13b/model>
bash benchmark/benchmark_20b.sh <the/path/of/internlm-20b/model>
bash benchmark/benchmark_70b.sh <the/path/of/llama2-70b/model>
```
| model        | batch | tp  | num_prompts | RPS    | FTL(ave)(s) | FTL(min)(s) | FTL(max)(s) | 50%(s) | 75%(s) | 95%(s) | 99%(s) | throughput(out tok/s) | throughput(total tok/s) |
| ------------ | ----- | --- | ---------- | ------ | ----------- | ----------- | ----------- | ------ | ------ | ------ | ------ | --------------------- | ----------------------- |
| llama2-7b | 256 | 1 | 3000 | 14.556 | 0.526 | 0.092 | 4.652 | 0.066 | 0.101 | 0.155 | 0.220 | 3387.419 | 6981.159 |
| llama2-13b | 128 | 1 | 3000 | 7.950 | 0.352 | 0.075 | 4.193 | 0.051 | 0.067 | 0.138 | 0.202 | 1850.145 | 3812.978 |
| internlm-20b | 128 | 2 | 3000 | 10.291 | 0.287 | 0.073 | 3.845 | 0.053 | 0.072 | 0.113 | 0.161 | 2053.266 | 4345.057 |
| llama2-70b | 256 | 4 | 3000 | 7.231 | 1.075 | 0.139 | 14.524 | 0.102 | 0.153 | 0.292 | 0.482 | 1682.738 | 3467.969 |
## Static Inference Benchmark
- `batch`: the max batch size during inference
- `tp`: the number of GPU cards for tensor parallelism
- `prompt_tokens`: the number of input tokens
- `output_tokens`: the number of generated tokens
- `throughput`: the number of generated tokens per second
- `FTL`: **F**irst **T**oken **L**atency
### llama2-7b
| batch | tp | prompt_tokens | output_tokens | throughput(out tok/s) | mem(GB) | FTL(ave)(s) | FTL(min)(s) | FTL(max)(s) | 50%(s) | 75%(s) | 95%(s) | 99%(s) |
| ----- | --- | ------------- | ------------- | --------------------- | ------- | ----------- | ----------- | ----------- | ------ | ------ | ------ | ------ |
| 64 | 1 | 128 | 2048 | 1852.06 | 77.96 | 0.535 | 0.027 | 1.231 | 0.03 | 0.041 | 0.048 | 0.053 |
| 64 | 1 | 2048 | 128 | 493.46 | 78.4 | 6.59 | 0.142 | 16.235 | 0.046 | 0.049 | 0.055 | 0.767 |
| 64 | 1 | 2048 | 2048 | 755.65 | 78.4 | 39.105 | 0.142 | 116.285 | 0.047 | 0.049 | 0.051 | 0.207 |
### llama2-13b
| batch | tp | prompt_tokens | output_tokens | throughput(out tok/s) | mem(GB) | FTL(ave)(s) | FTL(min)(s) | FTL(max)(s) | 50%(s) | 75%(s) | 95%(s) | 99%(s) |
| ----- | --- | ------------- | ------------- | --------------------- | ------- | ----------- | ----------- | ----------- | ------ | ------ | ------ | ------ |
| 1 | 1 | 1 | 128 | 57.49 | 74.84 | 0.018 | 0.018 | 0.019 | 0.017 | 0.017 | 0.017 | 0.017 |
| 1 | 1 | 128 | 128 | 56.58 | 74.84 | 0.04 | 0.039 | 0.04 | 0.017 | 0.017 | 0.017 | 0.018 |
| 1 | 1 | 128 | 2048 | 55.29 | 74.84 | 0.04 | 0.04 | 0.04 | 0.018 | 0.018 | 0.018 | 0.019 |
| 1 | 1 | 2048 | 128 | 48.99 | 75.09 | 0.242 | 0.242 | 0.243 | 0.019 | 0.019 | 0.019 | 0.019 |
| 1 | 1 | 2048 | 2048 | 52.12 | 75.09 | 0.243 | 0.24 | 0.244 | 0.019 | 0.019 | 0.019 | 0.02 |
| 16 | 1 | 1 | 128 | 869.45 | 74.87 | 0.036 | 0.019 | 0.053 | 0.018 | 0.019 | 0.019 | 0.02 |
| 16 | 1 | 128 | 128 | 757.3 | 75.09 | 0.252 | 0.041 | 0.272 | 0.019 | 0.02 | 0.02 | 0.021 |
| 16 | 1 | 128 | 2048 | 605.88 | 75.09 | 0.253 | 0.041 | 0.275 | 0.026 | 0.03 | 0.033 | 0.034 |
| 16 | 1 | 2048 | 128 | 257.92 | 76.96 | 3.442 | 0.245 | 3.668 | 0.033 | 0.034 | 0.035 | 0.035 |
| 16 | 1 | 2048 | 2048 | 366.67 | 76.99 | 3.122 | 0.249 | 3.671 | 0.04 | 0.044 | 0.047 | 0.047 |
| 32 | 1 | 1 | 128 | 1667.5 | 74.9 | 0.034 | 0.021 | 0.057 | 0.019 | 0.02 | 0.021 | 0.023 |
| 32 | 1 | 128 | 128 | 1301.27 | 75.37 | 0.461 | 0.04 | 0.497 | 0.021 | 0.022 | 0.023 | 0.025 |
| 32 | 1 | 128 | 2048 | 860.14 | 75.84 | 0.833 | 0.041 | 1.151 | 0.034 | 0.042 | 0.047 | 0.048 |
| 32 | 1 | 2048 | 128 | 291.54 | 77.02 | 5.315 | 0.245 | 13.483 | 0.046 | 0.047 | 0.049 | 0.51 |
| 32 | 1 | 2048 | 2048 | 389.64 | 77.02 | 38.725 | 0.245 | 108.104 | 0.047 | 0.047 | 0.049 | 0.05 |
| 64 | 1 | 1 | 128 | 3049.16 | 74.96 | 0.044 | 0.025 | 0.073 | 0.02 | 0.022 | 0.026 | 0.029 |
| 64 | 1 | 128 | 128 | 2033.22 | 75.87 | 0.703 | 0.046 | 0.951 | 0.024 | 0.026 | 0.029 | 0.032 |
| 64 | 1 | 128 | 2048 | 998.86 | 76.9 | 7.805 | 0.042 | 60.1 | 0.045 | 0.047 | 0.05 | 0.063 |
| 64 | 1 | 2048 | 128 | 286.32 | 76.99 | 19.69 | 0.245 | 32.394 | 0.047 | 0.048 | 0.05 | 0.27 |
| 64 | 1 | 2048 | 2048 | 387.86 | 77.09 | 190.453 | 0.245 | 307.331 | 0.047 | 0.048 | 0.049 | 0.05 |
### internlm-20b
| batch | tp | prompt_tokens | output_tokens | throughput(out tok/s) | mem(GB) | FTL(ave)(s) | FTL(min)(s) | FTL(max)(s) | 50%(s) | 75%(s) | 95%(s) | 99%(s) |
| ----- | --- | ------------- | ------------- | --------------------- | ------- | ----------- | ----------- | ----------- | ------ | ------ | ------ | ------ |
| 1 | 2 | 1 | 128 | 61.14 | 73.55 | 0.018 | 0.017 | 0.019 | 0.016 | 0.016 | 0.016 | 0.018 |
| 1 | 2 | 128 | 128 | 60.03 | 73.55 | 0.042 | 0.041 | 0.043 | 0.016 | 0.016 | 0.016 | 0.017 |
| 1 | 2 | 128 | 2048 | 58.26 | 73.55 | 0.042 | 0.042 | 0.043 | 0.017 | 0.017 | 0.018 | 0.018 |
| 1 | 2 | 2048 | 128 | 51.93 | 73.68 | 0.217 | 0.216 | 0.217 | 0.018 | 0.018 | 0.018 | 0.018 |
| 1 | 2 | 2048 | 2048 | 56.36 | 73.68 | 0.217 | 0.217 | 0.217 | 0.018 | 0.018 | 0.018 | 0.018 |
| 16 | 2 | 1 | 128 | 903.01 | 73.65 | 0.034 | 0.018 | 0.051 | 0.017 | 0.018 | 0.019 | 0.02 |
| 16 | 2 | 128 | 128 | 794.13 | 73.74 | 0.227 | 0.043 | 0.248 | 0.018 | 0.019 | 0.02 | 0.021 |
| 16 | 2 | 128 | 2048 | 669.87 | 73.74 | 0.227 | 0.043 | 0.25 | 0.024 | 0.027 | 0.029 | 0.03 |
| 16 | 2 | 2048 | 128 | 288.60 | 75.60 | 3.09 | 0.247 | 4.485 | 0.029 | 0.03 | 0.031 | 0.032 |
| 16 | 2 | 2048 | 2048 | 441.46 | 75.61 | 3.172 | 0.219 | 4.442 | 0.035 | 0.037 | 0.04 | 0.041 |
| 32 | 2 | 1 | 128 | 1673.64 | 73.71 | 0.037 | 0.02 | 0.066 | 0.019 | 0.02 | 0.021 | 0.023 |
| 32 | 2 | 128 | 128 | 1347.57 | 73.90 | 0.351 | 0.043 | 0.436 | 0.02 | 0.021 | 0.023 | 0.025 |
| 32 | 2 | 128 | 2048 | 1025.62 | 73.90 | 0.391 | 0.042 | 0.441 | 0.031 | 0.037 | 0.041 | 0.043 |
| 32 | 2 | 2048 | 128 | 352.45 | 75.74 | 6.062 | 0.218 | 6.3 | 0.042 | 0.043 | 0.045 | 0.046 |
| 32 | 2 | 2048 | 2048 | 514.60 | 75.77 | 10.36 | 0.222 | 70.328 | 0.049 | 0.05 | 0.051 | 0.053 |
| 64 | 2 | 1 | 128 | 2954.34 | 73.82 | 0.05 | 0.029 | 0.074 | 0.021 | 0.023 | 0.026 | 0.03 |
| 64 | 2 | 128 | 128 | 2122.92 | 74.24 | 0.591 | 0.047 | 0.808 | 0.024 | 0.026 | 0.029 | 0.032 |
| 64 | 2 | 128 | 2048 | 1276.61 | 75.18 | 2.529 | 0.049 | 41.212 | 0.042 | 0.048 | 0.052 | 0.055 |
| 64 | 2 | 2048 | 128 | 350.82 | 75.88 | 12.382 | 0.219 | 20.986 | 0.05 | 0.051 | 0.054 | 0.249 |
| 64 | 2 | 2048 | 2048 | 512.37 | 76.26 | 111.149 | 0.221 | 211.531 | 0.05 | 0.051 | 0.052 | 0.055 |
### llama2-70b
| batch | tp | prompt_tokens | output_tokens | throughput(out tok/s) | mem(GB) | FTL(ave)(s) | FTL(min)(s) | FTL(max)(s) | 50%(s) | 75%(s) | 95%(s) | 99%(s) |
| ----- | --- | ------------- | ------------- | --------------------- | ------- | ----------- | ----------- | ----------- | ------ | ------ | ------ | ------ |
| 1 | 4 | 1 | 128 | 33.94 | 73.72 | 0.031 | 0.03 | 0.031 | 0.029 | 0.029 | 0.029 | 0.03 |
| 1 | 4 | 128 | 128 | 33.63 | 73.72 | 0.074 | 0.073 | 0.074 | 0.029 | 0.029 | 0.029 | 0.03 |
| 1 | 4 | 128 | 2048 | 32.38 | 73.72 | 0.074 | 0.074 | 0.075 | 0.031 | 0.031 | 0.031 | 0.031 |
| 1 | 4 | 2048 | 128 | 28.32 | 73.78 | 0.402 | 0.401 | 0.403 | 0.031 | 0.031 | 0.031 | 0.051 |
| 1 | 4 | 2048 | 2048 | 31.9 | 73.78 | 0.405 | 0.402 | 0.407 | 0.031 | 0.031 | 0.031 | 0.031 |
| 16 | 4 | 1 | 128 | 468.52 | 73.72 | 0.071 | 0.034 | 0.939 | 0.03 | 0.031 | 0.032 | 0.251 |
| 16 | 4 | 128 | 128 | 439.77 | 73.81 | 0.437 | 0.08 | 0.687 | 0.03 | 0.031 | 0.032 | 0.207 |
| 16 | 4 | 128 | 2048 | 482.99 | 73.81 | 0.403 | 0.079 | 0.44 | 0.033 | 0.033 | 0.035 | 0.036 |
| 16 | 4 | 2048 | 128 | 189.34 | 73.98 | 5.776 | 0.437 | 7.612 | 0.035 | 0.036 | 0.036 | 0.037 |
| 16 | 4 | 2048 | 2048 | 399.42 | 73.98 | 5.773 | 0.411 | 6.844 | 0.036 | 0.037 | 0.038 | 0.041 |
| 32 | 4 | 1 | 128 | 906.03 | 73.75 | 0.098 | 0.043 | 0.253 | 0.032 | 0.033 | 0.035 | 0.178 |
| 32 | 4 | 128 | 128 | 746.36 | 73.91 | 0.749 | 0.078 | 1.026 | 0.032 | 0.033 | 0.035 | 0.438 |
| 32 | 4 | 128 | 2048 | 853.56 | 73.91 | 0.732 | 0.076 | 1.129 | 0.036 | 0.038 | 0.041 | 0.158 |
| 32 | 4 | 2048 | 128 | 232.6 | 73.99 | 11.834 | 0.408 | 13.321 | 0.04 | 0.041 | 0.043 | 0.248 |
| 32 | 4 | 2048 | 2048 | 636.23 | 73.99 | 11.711 | 0.409 | 12.689 | 0.043 | 0.045 | 0.048 | 0.179 |
| 64 | 4 | 1 | 128 | 1425.79 | 73.81 | 0.213 | 0.046 | 1.264 | 0.037 | 0.039 | 0.044 | 0.329 |
| 64 | 4 | 128 | 128 | 1159.84 | 73.96 | 1.292 | 0.107 | 2.676 | 0.037 | 0.04 | 0.045 | 0.378 |
| 64 | 4 | 128 | 2048 | 1391.8 | 73.95 | 1.173 | 0.135 | 1.623 | 0.043 | 0.047 | 0.052 | 0.251 |
| 64 | 4 | 2048 | 128 | 270.47 | 74.02 | 17.402 | 0.452 | 24.164 | 0.05 | 0.052 | 0.057 | 0.345 |
| 64 | 4 | 2048 | 2048 | 930.46 | 74.01 | 21.29 | 0.423 | 24.498 | 0.055 | 0.059 | 0.065 | 0.299 |
## Request Throughput Benchmark
FTL: **F**irst **T**oken **L**atency
| model        | batch | tp  | num_prompts | RPS    | RPM     | FTL(ave)(s) | FTL(min)(s) | FTL(max)(s) | throughput(out tok/s) | throughput(total tok/s) |
| ------------ | ----- | --- | ----------- | ------ | ------- | ----------- | ----------- | ----------- | --------------------- | ----------------------- |
| llama2-7b | 64 | 1 | 3000 | 10.275 | 616.477 | 0.092 | 0.036 | 1.145 | 2562.435 | 5283.547 |
| | 128 | 1 | 3000 | 12.611 | 756.677 | 0.205 | 0.056 | 2.241 | 3210.281 | 6619.357 |
| llama2-13b | 64 | 1 | 3000 | 6.337 | 380.244 | 0.159 | 0.051 | 2.048 | 1474.786 | 3039.398 |
| | 128 | 1 | 3000 | 7.588 | 455.273 | 0.412 | 0.085 | 4.445 | 1765.788 | 3639.128 |
| internlm-20b | 64 | 2 | 3000 | 7.842 | 470.516 | 0.166 | 0.059 | 2.461 | 1564.696 | 3311.16 |
| | 128 | 2 | 3000 | 9.776 | 586.568 | 0.34 | 0.079 | 5.808 | 1950.627 | 4127.855 |
| llama2-70b | 64 | 4 | 3000 | 4.285 | 257.08 | 0.301 | 0.083 | 4.689 | 1000.376 | 2062.7 |
| | 128 | 4 | 3000 | 5.833 | 349.996 | 0.633 | 0.107 | 8.431 | 1361.939 | 2808.216 |
| | 256 | 4 | 3000 | 6.568 | 394.108 | 1.49 | 0.171 | 19.52 | 1533.592 | 3162.15 |
# Profile API Server

The way to profile `api_server` performance is similar to the method for [profiling throughput](./profile_throughput.md). The difference is that `api_server` should be launched successfully before testing.

The profiling script is `profile_restful_api.py`. Before running it, please install the lmdeploy precompiled package, and download the script and the test dataset:
```shell
pip install lmdeploy
git clone --depth=1 https://github.com/InternLM/lmdeploy
cd lmdeploy/benchmark
wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
```
A specific model needs to be specified for the performance test. We recommend converting the model into the turbomind format via `lmdeploy convert` before testing, because it is then convenient to adjust the inference engine's parameters for better performance, such as the batch size (`max_batch_size`) and the K/V cache size (`cache_max_entry_count`). For detailed explanations of these parameters, please refer to [here](../turbomind_config.md).
In the following sections, we assume the model is in turbomind format.
## Metrics
LMDeploy records performance metrics such as first token latency, token throughput (tokens/s) and request throughput (RPM).
Total time includes prefill time.
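To make these metrics concrete, the following minimal Python sketch (not LMDeploy's implementation; the timing fields are assumed for illustration) shows how first token latency, token throughput and RPM can be derived from per-request timings, with the total time including prefill:

```python
# Hypothetical per-request records, for illustration only.
requests = [
    {"start": 0.0, "first_token": 0.09, "end": 2.5, "completion_tokens": 180},
    {"start": 0.1, "first_token": 0.21, "end": 3.0, "completion_tokens": 230},
]

total_time = max(r["end"] for r in requests) - min(r["start"] for r in requests)  # includes prefill time
first_token_latencies = [r["first_token"] - r["start"] for r in requests]

ftl_ave = sum(first_token_latencies) / len(first_token_latencies)              # first token latency (s)
token_throughput = sum(r["completion_tokens"] for r in requests) / total_time  # generated tokens/s
rpm = len(requests) / total_time * 60                                          # requests per minute

print(f"FTL(ave): {ftl_ave:.3f}s, out tok/s: {token_throughput:.1f}, RPM: {rpm:.1f}")
```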
## Profile

In this section, we take [internlm/internlm-7b](https://huggingface.co/internlm/internlm-7b) as an example to show the benchmark procedure.

### Launch api_server

```shell
lmdeploy serve api_server internlm/internlm-7b
```

If you would like to change the server's port or other parameters, such as the inference engine, max batch size and so on, please run `lmdeploy serve api_server -h` or read [this](../serving/api_server.md) guide for the detailed explanation.

Alternatively, if the model has been converted into the turbomind format, e.g. `lmdeploy convert internlm internlm/internlm-7b --dst-path ./internlm-7b`, you can launch the server with `lmdeploy serve api_server ./internlm-7b --server-port 23333` and pass `./internlm-7b/triton_models/tokenizer` as the tokenizer path when profiling.

The argument `--instance-num` reflects the number of inference instances. When more than `--instance-num` requests arrive at the `api_server` at the same time, the exceeding requests will wait in the inference queue.

### Profile

Open another terminal and run the following command in the `lmdeploy/benchmark` directory:

```shell
python3 profile_restful_api.py http://0.0.0.0:23333 internlm/internlm-7b ./ShareGPT_V3_unfiltered_cleaned_split.json
```

## Methods

The general usage is:

```shell
python3 profile_restful_api.py <server_addr> <tokenizer_path> <dataset> <optional arguments>
```
The required parameters are:
- `server_addr`
The address of api_server with format `http://{server_ip}:{server_port}`
- `tokenizer_path`
The path of the tokenizer model, which is used to encode the dataset to get the token size of prompts and responses
- `dataset`
The path of the downloaded dataset
Optional arguments are listed as below:
- `--concurrency`
It represents the number of request threads with default value 64. Requests of concurrent threads will be batched by the inference engine. Its value should not exceed the number of inference instances in the api_server.
Otherwise, the excess requests will wait in the inference queue.
- `--num-prompts`
The number of sampled prompts from dataset to process. The default is 2000.
- `--top_p` and `--temperature`
They are used to sample the generated token_id.
- `--stream_output`
Indicator for streaming output. The default is `False`.
- `--csv`
The path of a csv file to save the result with default value `../profile_api_server.csv`
- `--seed`
It is the seed used in sampling prompts from dataset with default value 0.
For detailed argument specification of `profile_restful_api.py`, such as request concurrency, sampling parameters and so on, please run the help command `python3 profile_restful_api.py -h`.
# Profile Token Latency and Throughput

We profile the latency and throughput of generated tokens with a fixed batch size and fixed numbers of input/output tokens.

The profiling script is `profile_generation.py`. Before running it, please install the lmdeploy precompiled package and download the profiling script:
```shell
pip install lmdeploy
git clone --depth=1 https://github.com/InternLM/lmdeploy
```
A specific model needs to be specified for the performance test. We recommend converting the model into the turbomind format via `lmdeploy convert` before testing, because it is then convenient to adjust the inference engine's parameters for better performance, such as the batch size (`max_batch_size`) and the K/V cache size (`cache_max_entry_count`). For detailed explanations of these parameters, please refer to [here](../turbomind_config.md).
In the following sections, we assume the model is in turbomind format.
## Metrics
LMDeploy records test results such as first token latency, token throughput (tokens/s), the percentile data of each token's latency (P50, P75, P95, P99), GPU memory usage, etc.
Total time includes prefill time.
During the test, no other programs should be running on the node's GPUs; otherwise, the GPU memory statistics would be inaccurate.
## Profile

In this section, we take [internlm/internlm-7b](https://huggingface.co/internlm/internlm-7b) as an example to show how to profile the inference engines of LMDeploy.

### Profile turbomind engine

```shell
cd lmdeploy/benchmark
python3 profile_generation.py internlm/internlm-7b
```

If the model has been converted into the turbomind format, e.g. `lmdeploy convert internlm internlm/internlm-7b --dst-path ./internlm-7b`, pass the workspace path instead: `python3 profile_generation.py ./internlm-7b`.

### Profile pytorch engine

```shell
cd lmdeploy/benchmark
python3 profile_generation.py internlm/internlm-7b --backend pytorch
```

## Command details

The general usage is:

```shell
python3 profile_generation.py <model_path> <optional arguments>
```

`model_path` is either a model repo on huggingface.co or the path on localhost where the model (e.g. in turbomind format) is located.
Optional arguments are listed as below:
- `--concurrency`
It represents the number of request threads. Requests of concurrent threads will be batched by the inference engine. It is a list with default value `[1, 16, 32, 64]`, which implies that the performance under 4 different levels of concurrency is tested. The level of concurrency should not exceed `max_batch_size` in [turbomind config](../turbomind_config.md#turbomind-20-config). Otherwise, there will be `concurrency - max_batch_size` threads randomly waiting almost at any time during the test.
- `--prompt-tokens` and `--completion-tokens`
Input token and output token numbers. They are lists of the same length. The elements in the list correspond one-to-one, that is,
the pair `(prompt_tokens[i], completion_tokens[i])` is a test case. In the default list `[1, 128, 128, 2048, 2048]` and `[128, 128, 2048, 128, 2048]`, the test cases are `(1, 128)`, `(128, 128)`, `(128, 2048)`, `(2048, 128)` and `(2048, 2048)`
- `--tp`
The number of GPUs used when the inference is in tensor parallel mode. It must be a power of 2. The default is 1.
- `--top_k`, `--top_p` and `--temperature`
They are used to sample the generated token_id.
- `--csv`
A csv file path used to store test results. The default is `./profile_generation.csv`
- `--log-level`
The log level. The default is 'ERROR'.
- `--test-round`
The number of test rounds is set to 10 by default. This means that each case will undergo 10 rounds of testing, and the average result will be calculated.
We refer to a tuple of `(#concurrency, #prompt_token, #completion_token)` as a test case. Therefore, the total number of test cases (`#test_cases`) executed by the script is `len(concurrency) * len(prompt-tokens)`, and the total test rounds are `#test_cases * #test_round`. Users can flexibly adjust test parameters according to their actual situation.
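For instance, with the default lists above, the pairing of input/output lengths and the resulting number of test cases can be illustrated in plain Python:

```python
# Default values taken from the argument descriptions above.
concurrency = [1, 16, 32, 64]
prompt_tokens = [1, 128, 128, 2048, 2048]
completion_tokens = [128, 128, 2048, 128, 2048]

io_pairs = list(zip(prompt_tokens, completion_tokens))
# [(1, 128), (128, 128), (128, 2048), (2048, 128), (2048, 2048)]

test_cases = [(c, p, o) for c in concurrency for (p, o) in io_pairs]
print(len(test_cases))  # 4 * 5 = 20 cases; with the default 10 test rounds, 200 rounds in total
```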
For detailed argument specification of `profile_generation.py`, such as batch size, input and output token numbers and so on, please run the help command `python3 profile_generation.py -h`.
# Profile Request Throughput

In real applications, the length of the user's input prompt and the number of generated tokens are dynamic. Static inference performance is therefore insufficient to reflect the inference engine's ability to handle such dynamic characteristics.

Therefore, it is necessary to use real dialogue data to evaluate the dynamic inference capabilities of the inference engine. This article introduces how to test the dynamic inference performance of LMDeploy on localhost.

The profiling script is `profile_throughput.py`. Before running it, please install the lmdeploy precompiled package, and download the profiling script and the test dataset:
```shell
pip install lmdeploy
git clone --depth=1 https://github.com/InternLM/lmdeploy
cd lmdeploy/benchmark
wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
```
A specific model needs to be specified for the performance test. We recommend converting the model into the turbomind format via `lmdeploy convert` before testing, because it is then convenient to adjust the inference engine's parameters for better performance, such as the batch size (`max_batch_size`) and the K/V cache size (`cache_max_entry_count`). For detailed explanations of these parameters, please refer to [here](../turbomind_config.md).
In the following sections, we assume the model is in turbomind format.
## Metrics
LMDeploy records performance metrics such as first token latency, token throughput (tokens/s) and request throughput (RPM).
Total time includes prefill time.
## Profile

In this section, we take [internlm/internlm-7b](https://huggingface.co/internlm/internlm-7b) as an example to show how to profile the inference engines of LMDeploy.

### Profile turbomind engine

```shell
cd lmdeploy/benchmark
python3 profile_throughput.py ./ShareGPT_V3_unfiltered_cleaned_split.json internlm/internlm-7b
```

If the model has been converted into the turbomind format, e.g. `lmdeploy convert internlm internlm/internlm-7b --dst-path ./internlm-7b`, pass the workspace path instead: `python3 profile_throughput.py ./ShareGPT_V3_unfiltered_cleaned_split.json ./internlm-7b`.

### Profile pytorch engine

```shell
cd lmdeploy/benchmark
python3 profile_throughput.py ./ShareGPT_V3_unfiltered_cleaned_split.json internlm/internlm-7b --backend pytorch
```

## Command details

The general usage is:

```shell
python3 profile_throughput.py <dataset> <model_path> <optional arguments>
```
The required parameters are:
- `dataset`
The path of the downloaded dataset
- `model_path`
A model repo on huggingface.co, or the path on localhost where the model (e.g. in turbomind format) is located.
Optional arguments are listed as below:
- `--concurrency`
It represents the number of request threads with default value 64. Requests of concurrent threads will be batched by the inference engine. Its value should not exceed `max_batch_size` in `config.ini`. Otherwise, the excess requests will wait in the inference queue.
- `--num-prompts`
The number of sampled prompts from dataset to process. The default is 2000.
- `--tp`
The number of GPUs used when the inference is in tensor parallel mode. It must be a power of 2. The default is 1.
- `--top_k`, `--top_p` and `--temperature`
They are used to sample the generated token_id.
- `--stream_output`
Indicator for streaming output. The default is `True`.
- `--csv`
The path of a csv file to save the result with default value `./profile_throughput.csv`
- `--log-level`
The log level. The default is `ERROR`.
- `--seed`
It is the seed used in sampling prompts from dataset with default value 0.
For detailed argument specification of `profile_throughput.py`, such as request concurrency, sampling parameters, k/v cache memory percentage and so on, please run the help command `python3 profile_throughput.py -h`.
# Profile Triton Inference Server

Triton Inference Server (TIS) is another serving method supported by LMDeploy besides `api_server`. Its performance testing methods and metrics are similar to those of [api_server](./profile_api_server.md).

The profiling script is `profile_serving.py`. Before running it, please install the lmdeploy precompiled package, and download the profiling script and the test dataset:
```shell
pip install 'lmdeploy[serve]'
git clone --depth=1 https://github.com/InternLM/lmdeploy
cd lmdeploy/benchmark
wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
```
A specific model needs to be specified for the performance test. We recommend converting the model into the turbomind format via `lmdeploy convert` before testing, because it is then convenient to adjust the inference engine's parameters for better performance, such as the batch size (`max_batch_size`) and the K/V cache size (`cache_max_entry_count`). For detailed explanations of these parameters, please refer to [here](../turbomind_config.md).
In the following sections, we assume the model is in turbomind format.
## Metrics
LMDeploy records performance metrics such as first token latency, token throughput (tokens/s) and request throughput (RPM).
Total time includes prefill time.
## Profile

In this section, we take [internlm/internlm-7b](https://huggingface.co/internlm/internlm-7b) as an example to show the benchmark procedure.

### Launch triton inference server

Before launching the server, the LLM model must be converted to the turbomind format in advance:

```shell
lmdeploy convert internlm internlm/internlm-7b --dst-path ./internlm-7b
```

Then, the triton inference server can be launched by:

```shell
bash ./internlm-7b/service_docker_up.sh
```

### Profile

Open another terminal and run the following command in the `lmdeploy/benchmark` directory:

```shell
python3 profile_serving.py 0.0.0.0:33337 ./internlm-7b/triton_models/tokenizer ./ShareGPT_V3_unfiltered_cleaned_split.json
```

## Command details

The general usage is:

```shell
python3 profile_serving.py <server_addr> <tokenizer_path> <dataset> <optional arguments>
```

The required parameters are:
- `server_addr`
The address of the triton inference server, in the format `{server_ip}:{server_port}`
- `tokenizer_path`
The path of the tokenizer model, which is used to encode the dataset to get the token size of prompts and responses
- `dataset`
The path of the downloaded dataset
Optional arguments are listed as below:
- `--concurrency`
It represents the number of request threads with default value 32. Requests of concurrent threads will be batched by the inference engine.
It is recommended that `concurrency` does not exceed the `max_batch_size` in `config.ini`, nor should it exceed the number of inference instances in `triton_models`.
Otherwise, the excess requests will wait in the inference queue.
The configuration item for the number of inference instances is `instance_group`, which is located in the file `{model_path}/triton_models/interactive/config.pbtxt`, and the default is 48.
- `--num-prompts`
The number of sampled prompts from the dataset to process. The default is 1000. 2000 is suggested when `concurrency >= 64`.
- `--top_k`, `--top_p` and `--temperature`
They are used to sample the generated token_id.
- `--stream_output`
Indicator for streaming output. The default is `True`.
- `--csv`
The path of a csv file to save the result, with default value `../profile_tis.csv`
- `--seed`
It is the seed used in sampling prompts from the dataset, with default value 0.
For detailed argument specification of `profile_serving.py`, such as request concurrency, sampling parameters and so on, please run the help command `python3 profile_serving.py -h`.
The docker image is `openmmlab/lmdeploy-builder:cuda11.8`. Make sure that docker is installed before using this image.
In the root directory of the lmdeploy source code, please run the following command:
```shell
# the home folder of lmdeploy source code
cd lmdeploy
bash builder/manywheel/build_all_wheel.sh
```
Then, follow the steps below to set up the compilation environment:
- build and install lmdeploy libraries:
```shell
# install ninja
apt install ninja-build
# the home folder of lmdeploy
cd lmdeploy
mkdir build && cd build
sh ../generate.sh
ninja -j$(nproc) && ninja install
```
extensions = [
'sphinx.ext.napoleon',
'sphinx.ext.viewcode',
'sphinx.ext.autosectionlabel',
'sphinx_tabs.tabs',
'sphinx_markdown_tables',
'myst_parser',
'sphinx_copybutton',
......
It's probably due to a low-version CUDA toolkit. The LMDeploy runtime requires a minimum CUDA version of 11.2.
## Turbomind Inference
### RuntimeError: \[TM\]\[ERROR\] CUDA runtime error: out of memory /workspace/lmdeploy/src/turbomind/utils/allocator.h
This is usually due to a disproportionately large memory ratio for the k/v cache, which is dictated by `TurbomindEngineConfig.cache_max_entry_count`.
The implications of this parameter have slight variations in different versions of lmdeploy. For specifics, please refer to the [detailed notes](https://github.com/InternLM/lmdeploy/blob/52419bd5b6fb419a5e3aaf3c3b4dea874b17e094/lmdeploy/messages.py#L107) in the source code.
If you encounter this issue while using the pipeline interface, please reduce `cache_max_entry_count` in `TurbomindEngineConfig` as follows:
```python
from lmdeploy import pipeline, TurbomindEngineConfig
backend_config = TurbomindEngineConfig(cache_max_entry_count=0.2)
pipe = pipeline('internlm/internlm2-chat-7b',
backend_config=backend_config)
response = pipe(['Hi, pls intro yourself', 'Shanghai is'])
print(response)
```
If OOM occurs when you run CLI tools, please pass `--cache-max-entry-count` to decrease the k/v cache memory ratio. For example:
```shell
# chat command
lmdeploy chat turbomind internlm/internlm2-chat-7b --cache-max-entry-count 0.2
# server command
lmdeploy serve api_server internlm/internlm2-chat-7b --cache-max-entry-count 0.2
```
## Serve
## Quantization
### RuntimeError: \[enforce fail at inline_container.cc:337\] . unexpected pos 4566829760 vs 4566829656
Please check your disk space. This error is due to insufficient disk space when saving weights, which might be encountered when quantizing the 70B model.
### ModuleNotFoundError: No module named 'flash_attn'
Quantizing `qwen` requires the installation of `flash-attn`. But based on feedback from community users, `flash-attn` can be challenging to install. Therefore, we have removed it from the lmdeploy dependencies and now recommend that users install it manually as needed.
Welcome to LMDeploy's tutorials!
====================================

You can switch between Chinese and English documents in the lower-left corner of the layout.

.. _get_started:
.. toctree::
   :maxdepth: 2
   :caption: Get Started

   get_started.md

.. _build:
.. toctree::
   :maxdepth: 1
   :caption: Build

   build.md

.. _benchmark:
.. toctree::
   :maxdepth: 1
   :caption: Benchmark

   benchmark/profile_generation.md
   benchmark/profile_throughput.md
   benchmark/profile_api_server.md
   benchmark/profile_triton_server.md
   benchmark/evaluate_with_opencompass.md

.. _supported_models:
.. toctree::
   :maxdepth: 1
   :caption: Supported Models

   supported_models/supported_models.md

.. _inference:
.. toctree::
   :maxdepth: 1
   :caption: Inference

   inference/pipeline.md
   inference/vl_pipeline.md

.. _serving:
.. toctree::
   :maxdepth: 1
   :caption: Serving

   serving/api_server.md
   serving/api_server_vl.md
   serving/gradio.md
   serving/proxy_server.md

.. _quantization:
.. toctree::
   :maxdepth: 1
   :caption: Quantization

   quantization/w4a16.md
   quantization/kv_int8.md
   quantization/w8a8.md

.. toctree::
   :maxdepth: 1
   :caption: Advanced Guide

   inference/turbomind.md
   inference/pytorch.md
   advance/pytorch_new_model.md
   advance/long_context.md
   advance/chat_template.md
   advance/debug_turbomind.md
   serving/qos.md

.. toctree::
   :maxdepth: 1
   :caption: API Reference

   api/pipeline.rst
Indices and tables
==================
......
# KV Cache Quantization and Test Results
For the LLaMa-7B fp16 model with a maximum length of 2048, the server requires approximately 1030MB of GPU memory to store kv_cache for each concurrent session created. This means that even an A100 80G can only serve a limited number of users.
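As a rough sanity check of that figure, the per-session KV cache size can be estimated from the commonly published LLaMa-7B shapes (the dimensions below are assumptions used only for this back-of-the-envelope calculation):

```python
# LLaMa-7B shapes assumed for illustration: 32 layers, 32 heads, head_dim 128, fp16 values.
num_layers, num_heads, head_dim = 32, 32, 128
seq_len, bytes_per_value = 2048, 2

# K and V are both cached for every layer, token and head.
kv_bytes = 2 * num_layers * seq_len * num_heads * head_dim * bytes_per_value
print(kv_bytes / 2**20)  # ~1024 MiB per concurrent session, in line with the ~1030MB above
```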
To reduce runtime GPU memory usage, we have implemented PTQ quantization for kv cache, using the following formula:
```bash
zp = (min+max) / 2
scale = (max-min) / 255
quant: q = round( (f-zp) / scale)
dequant: f = q * scale + zp
```
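For illustration, a minimal numpy sketch of this per-tensor formula (not the actual TurboMind kernel) looks like this:

```python
import numpy as np

f = np.random.randn(64, 128).astype(np.float32)   # stand-in for a kv cache tensor
fmin, fmax = f.min(), f.max()

zp = (fmin + fmax) / 2                             # zero point
scale = (fmax - fmin) / 255                        # step size over the int8 range
q = np.clip(np.round((f - zp) / scale), -128, 127).astype(np.int8)   # quant
f_hat = q.astype(np.float32) * scale + zp                            # dequant

print(np.abs(f - f_hat).max())  # reconstruction error is roughly bounded by scale / 2
```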
## How to Enable KV Cache INT8
### **Step One**
Convert the Hugging Face model format to the TurboMind inference format to create a workspace directory.
```bash
lmdeploy convert internlm-chat-7b /path/to/internlm-chat-7b
```
If you already have a workspace directory, skip this step.
### **Step Two**
Get the quantization parameters by these two steps:
```bash
# get minmax
# --calib_dataset: the calibration dataset, supporting c4, ptb, wikitext2 and pileval
# --calib_samples: the number of samples in the calibration set; reduce it if the memory is not enough
# --calib_seqlen: the length of a single text; reduce it if the memory is not enough
# --work_dir: the directory for saving the quantized statistical parameters and the quantized weights in Pytorch format
lmdeploy lite calibrate \
  --model $HF_MODEL \
  --calib_dataset 'c4' \
  --calib_samples 128 \
  --calib_seqlen 2048 \
  --work_dir $WORK_DIR

# get quant parameters
# --work_dir: the directory of the last output
# --turbomind_dir: the directory to save the quantization parameters
# --kv_sym: symmetric or asymmetric quantization, default is False
# --num_tp: the number of GPUs used for tensor parallelism, keep it consistent with deploy.py
lmdeploy lite kv_qparams \
  --work_dir $WORK_DIR \
  --turbomind_dir workspace/triton_models/weights/ \
  --kv_sym False \
  --num_tp 1
```
`kv_qparams` will generate fp32 scaling factors in the `weights` directory. The file format is a binary produced by `numpy.tofile`.
You can also first set `turbomind_dir` to a private directory, then copy the scaling factors into `workspace/triton_models/weights/`.
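To inspect the generated scaling factors, they can be read back with `numpy.fromfile`. The file name below is only a hypothetical example; check the actual names written to the weights directory:

```python
import numpy as np

# hypothetical file name for illustration; the real names depend on the model and tp setting
scale_file = "workspace/triton_models/weights/layers.0.past_kv_scale.0.weight"
scales = np.fromfile(scale_file, dtype=np.float32)  # fp32 values written by numpy.tofile
print(scales)
```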
### **Step Three**
Modify `workspace/triton_models/weights/config.ini`:
- Set `quant_policy` to 4, which enables kv_cache int8
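If you prefer to script the change, a small sketch using Python's `configparser` could look like the following; editing the file by hand works just as well:

```python
import configparser

path = "workspace/triton_models/weights/config.ini"
config = configparser.ConfigParser()
config.read(path)
config["llama"]["quant_policy"] = "4"   # 4 enables kv_cache int8
with open(path, "w") as f:
    config.write(f)
```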
### **Step Four**
Test the chat performance.
```bash
lmdeploy chat turbomind ./workspace
```
## GPU Memory Test
The test object is the [internlm-chat-7b](https://huggingface.co/internlm/internlm-chat-7b) model.
Testing method:
1. Use `deploy.py` to convert the model, modify the maximum concurrency in the `workspace` configuration; adjust the number of requests in `llama_config.ini`.
2. Compile and run `bin/llama_triton_example` to obtain the GPU memory situation of the fp16 version under different batch_size.
3. Enable quantization, re-run `bin/llama_triton_example` to obtain the GPU memory situation of the int8 version under different batch_size.
Below shows the comparison of GPU memory between the two versions:
| batch_size | fp16 memory(MiB) | int8 memory(MiB) | diff(MiB) |
| :--------: | :--------------: | :--------------: | :-------: |
| 8 | 22337 | 18241 | -4096 |
| 16 | 30593 | 22369 | -8224 |
| 32 | 47073 | 30625 | -16448 |
| 48 | 63553 | 38881 | -24672 |
Compared to directly quantizing Weight (such as [GPTQ-for-LLaMa](https://github.com/qwopqwop200/GPTQ-for-LLaMa/)), we have done a comparative estimation of memory growth in the 7B model for both methods, with some data from [llama.cpp](https://github.com/ggerganov/llama.cpp).
![](../../resources/batch_memory.png)
As can be seen, the fp16 version requires 1030MB of GPU memory for each concurrency, so quantizing kv_cache can significantly reduce the rate of increase of runtime memory.
## Accuracy Test
The test object is the [internlm-chat-7b](https://huggingface.co/internlm/internlm-chat-7b) instruction model.
Below are the results of PTQ quantization with the `kCacheKVInt8` method, using only 128 randomly selected samples from the c4 dataset. The accuracy was tested using [opencompass](https://github.com/InternLM/opencompass) before and after quantization.
| task | dataset | metric | int8 | fp16 | diff |
| :-----------: | :-------------: | :-----------: | :---: | :---: | :---: |
| Language | winogrande | accuracy | 60.77 | 61.48 | -0.71 |
| Knowledge | nq | score | 2.69 | 2.60 | +0.09 |
| Reasoning | gsm8k | accuracy | 33.28 | 34.72 | -1.44 |
| Reasoning | bbh | naive_average | 20.12 | 20.51 | -0.39 |
| Understanding | openbookqa_fact | accuracy | 82.40 | 82.20 | +0.20 |
| Understanding | eprstmt-dev | accuracy | 90.62 | 88.75 | +1.87 |
| Safety | crows_pairs | accuracy | 32.56 | 31.43 | +1.13 |
Note that both `kCacheKVInt8` and `WeightInt4` methods can be enabled at the same time.
# Load huggingface model directly
Starting from v0.1.0, Turbomind adds the ability to pre-process the model parameters on-the-fly while loading them from huggingface style models.
## Supported model type
Currently, Turbomind supports loading three types of models:
1. A lmdeploy-quantized model hosted on huggingface.co, such as [llama2-70b-4bit](https://huggingface.co/lmdeploy/llama2-chat-70b-4bit), [internlm-chat-20b-4bit](https://huggingface.co/internlm/internlm-chat-20b-4bit), etc.
2. Other LM models on huggingface.co like Qwen/Qwen-7B-Chat
3. A model converted by `lmdeploy convert` (the legacy format)
## Usage
### 1) A lmdeploy-quantized model
For models quantized by `lmdeploy.lite`, such as [llama2-70b-4bit](https://huggingface.co/lmdeploy/llama2-chat-70b-4bit), [internlm-chat-20b-4bit](https://huggingface.co/internlm/internlm-chat-20b-4bit), etc.:
```
repo_id=internlm/internlm-chat-20b-4bit
model_name=internlm-chat-20b
# or
# repo_id=/path/to/downloaded_model
# Inference by TurboMind
lmdeploy chat turbomind $repo_id --model-name $model_name
# Serving with gradio
lmdeploy serve gradio $repo_id --model-name $model_name
# Serving with Restful API
lmdeploy serve api_server $repo_id --model-name $model_name --instance_num 32 --tp 1
```
### 2) Other LM models
For other LM models, such as Qwen/Qwen-7B-Chat or baichuan-inc/Baichuan2-7B-Chat, the models supported by LMDeploy can be viewed through `lmdeploy list`.
```
repo_id=Qwen/Qwen-7B-Chat
model_name=qwen-7b
# or
# repo_id=/path/to/Qwen-7B-Chat/local_path
# Inference by TurboMind
lmdeploy chat turbomind $repo_id --model-name $model_name
# Serving with gradio
lmdeploy serve gradio $repo_id --model-name $model_name
# Serving with Restful API
lmdeploy serve api_server $repo_id --model-name $model_name --instance_num 32 --tp 1
```
### 3) A model converted by `lmdeploy convert`
The usage is similar to the previous cases:
```
# Convert a model
lmdeploy convert /path/to/model ./workspace --model-name MODEL_NAME
# Inference by TurboMind
lmdeploy chat turbomind ./workspace
# Serving with gradio
lmdeploy serve gradio ./workspace
# Serving with Restful API
lmdeploy serve api_server ./workspace --instance_num 32 --tp 1
```
# Pytorch
## Chat in command line
LMDeploy supports chatting with PyTorch models through the submodule `lmdeploy.pytorch.chat`.
This submodule allows users to chat with a language model through the command line, and optionally accelerate the model using backends like deepspeed.
**Example 1**: Chat with default setting
```shell
lmdeploy chat torch $PATH_TO_HF_MODEL
```
**Example 2**: Disable sampling and chat history
```shell
lmdeploy chat torch \
$PATH_TO_LLAMA_MODEL_IN_HF_FORMAT \
--temperature 0 --max-history 0
```
**Example 3**: Accelerate with deepspeed inference
```shell
lmdeploy chat torch \
$PATH_TO_LLAMA_MODEL_IN_HF_FORMAT \
--accel deepspeed
```
Note: to use deepspeed, you need to install deepspeed first. If you hope to accelerate InternLM, you need a customized version <https://github.com/wangruohui/DeepSpeed/tree/support_internlm_0.10.0>.
**Example 4**: Tensor parallel the model on 2 GPUs
```shell
deepspeed --module --num_gpus 2 lmdeploy.pytorch.chat \
$PATH_TO_LLAMA_MODEL_IN_HF_FORMAT \
    --accel deepspeed
```
This module also allows the following control commands to change generation behaviors during chat.
- `exit`: terminate and exit chat
- `config set key=value`: change generation config `key` to `value`, e.g. `config set temperature=0` disables sampling for the following chats
- `clear`: clear chat history
### Simple diagram of components
```mermaid
graph LR;
subgraph model specific adapter
p((user_input))-->tokenize-->id((input_ids))-->decorate
tmpl_ids((template_ids))-->decorate;
end
subgraph generate
model[CausalLM_model.generate]-->gen_result(("gen_result"))
gen_result-->hid
gen_result-->attn((attention))
end
subgraph streamer
model-->s[streamer]--value-->decode_single--token-->output
end
subgraph session_manager
prepend_history-->fullid((complete_ids));
trim-->prepend_history
end
decorate-->prepend_history
hid((history_ids))-->trim;
attn-->trim;
fullid-->model
tokenizer((tokenizer))-->decode_single
tokenizer-->tokenize
p-->genconfig(GenConfig)-->model
```
# Restful API
### Launch Service
The user can open the http url printed by the following command in a browser.
- **Please check the http url for the detailed api usage!!!**
- **Please check the http url for the detailed api usage!!!**
- **Please check the http url for the detailed api usage!!!**
```shell
lmdeploy serve api_server ./workspace --server_name 0.0.0.0 --server_port ${server_port} --instance_num 64 --tp 1
```
We provide some RESTful APIs. Three of them are in OpenAI format.
- /v1/chat/completions
- /v1/models
- /v1/completions
However, we recommend users try
our own api `/v1/chat/interactive` which provides more arguments for users to modify. The performance is comparatively better.
**Note**: if you want to launch multiple requests, please set different `session_id` values for the `/v1/chat/completions` and `/v1/chat/interactive` apis. Otherwise, random values will be assigned.
### python
We have integrated the client-side functionalities of these services into the `APIClient` class. Below are some examples demonstrating how to invoke the `api_server` service on the client side.
If you want to use the `/v1/chat/completions` endpoint, you can try the following code:
```python
from lmdeploy.serve.openai.api_client import APIClient
api_client = APIClient('http://{server_ip}:{server_port}')
model_name = api_client.available_models[0]
messages = [{"role": "user", "content": "Say this is a test!"}]
for item in api_client.chat_completions_v1(model=model_name, messages=messages):
print(item)
```
If you want to use the `/v1/completions` endpoint, you can try the following code:
```python
from lmdeploy.serve.openai.api_client import APIClient
api_client = APIClient('http://{server_ip}:{server_port}')
model_name = api_client.available_models[0]
for item in api_client.completions_v1(model=model_name, prompt='hi'):
print(item)
```
Lmdeploy supports maintaining session histories on the server for `/v1/chat/interactive` api. We disable the
feature by default.
- On interactive mode, the chat history is kept on the server. In a multiple rounds of conversation, you should set
`interactive_mode = True` and the same `session_id` (can't be -1, it's the default number) to `/v1/chat/interactive` for requests.
- On normal mode, no chat history is kept on the server.
The interactive mode can be controlled by the `interactive_mode` boolean parameter. The following is an example of normal mode. If you want to experience the interactive mode, simply pass in `interactive_mode=True`.
```python
from lmdeploy.serve.openai.api_client import APIClient
api_client = APIClient('http://{server_ip}:{server_port}')
for item in api_client.chat_interactive_v1(prompt='hi'):
print(item)
```
### Java/Golang/Rust
May use [openapi-generator-cli](https://github.com/OpenAPITools/openapi-generator-cli) to convert `http://{server_ip}:{server_port}/openapi.json` to java/rust/golang client.
Here is an example:
```shell
$ docker run -it --rm -v ${PWD}:/local openapitools/openapi-generator-cli generate -i /local/openapi.json -g rust -o /local/rust
$ ls rust/*
rust/Cargo.toml rust/git_push.sh rust/README.md
rust/docs:
ChatCompletionRequest.md EmbeddingsRequest.md HttpValidationError.md LocationInner.md Prompt.md
DefaultApi.md GenerateRequest.md Input.md Messages.md ValidationError.md
rust/src:
apis lib.rs models
```
### cURL
cURL is a tool for observing the output of the api.
List Models:
```bash
curl http://{server_ip}:{server_port}/v1/models
```
Interactive Chat:
```bash
curl http://{server_ip}:{server_port}/v1/chat/interactive \
-H "Content-Type: application/json" \
-d '{
"prompt": "Hello! How are you?",
"session_id": 1,
"interactive_mode": true
}'
```
Chat Completions:
```bash
curl http://{server_ip}:{server_port}/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "internlm-chat-7b",
"messages": [{"role": "user", "content": "Hello! How are you?"}]
}'
```
Text Completions:
```shell
curl http://{server_ip}:{server_port}/v1/completions \
-H 'Content-Type: application/json' \
-d '{
"model": "llama",
"prompt": "two steps to build a house:"
}'
```
### CLI client
There is a client script for restful api server.
```shell
# restful_api_url is what printed in api_server.py, e.g. http://localhost:23333
lmdeploy serve api_client api_server_url
```
### webui
You can also test restful-api through webui.
```shell
# api_server_url is what printed in api_server.py, e.g. http://localhost:23333
# server_ip and server_port here are for gradio ui
# example: lmdeploy serve gradio http://localhost:23333 --server_name localhost --server_port 6006
lmdeploy serve gradio api_server_url --server_name ${gradio_ui_ip} --server_port ${gradio_ui_port}
```
### FAQ
1. When user got `"finish_reason":"length"`, it means the session is too long to be continued. The session length can be
modified by passing `--session_len` to api_server.
2. When OOM appears at the server side, please reduce the `instance_num` when launching the service.
3. When the request with the same `session_id` to `/v1/chat/interactive` gets an empty return value and a negative `tokens`, please consider setting `interactive_mode=false` to restart the session.
4. The `/v1/chat/interactive` api disables engaging in multiple rounds of conversation by default. The input argument `prompt` consists of either single strings or entire chat histories.
# Serving a model
## Serving [LLaMA-2](https://github.com/facebookresearch/llama)
You can download [llama-2 models from huggingface](https://huggingface.co/meta-llama) and serve them like below:
<details open>
<summary><b>7B</b></summary>
```shell
lmdeploy convert llama2 /path/to/llama-2-7b-chat-hf
bash workspace/service_docker_up.sh
```
</details>
<details open>
<summary><b>13B</b></summary>
```shell
lmdeploy convert llama2 /path/to/llama-2-13b-chat-hf --tp 2
bash workspace/service_docker_up.sh
```
</details>
<details open>
<summary><b>70B</b></summary>
```shell
lmdeploy convert llama2 /path/to/llama-2-70b-chat-hf --tp 8
bash workspace/service_docker_up.sh
```
</details>
## Serving [LLaMA](https://github.com/facebookresearch/llama)
Weights for the LLaMA models can be obtained by filling out [this form](https://docs.google.com/forms/d/e/1FAIpQLSfqNECQnMkycAp2jP4Z9TFX0cGR4uf7b_fBxjY_OjhJILlKGA/viewform)
<details open>
<summary><b>7B</b></summary>
```shell
lmdeploy convert llama /path/to/llama-7b llama \
--tokenizer_path /path/to/tokenizer/model
bash workspace/service_docker_up.sh
```
</details>
<details open>
<summary><b>13B</b></summary>
```shell
lmdeploy convert llama /path/to/llama-13b llama \
--tokenizer_path /path/to/tokenizer/model --tp 2
bash workspace/service_docker_up.sh
```
</details>
<details open>
<summary><b>30B</b></summary>
```shell
lmdeploy convert llama /path/to/llama-30b llama \
--tokenizer_path /path/to/tokenizer/model --tp 4
bash workspace/service_docker_up.sh
```
</details>
<details open>
<summary><b>65B</b></summary>
```shell
lmdeploy convert llama /path/to/llama-65b llama \
--tokenizer_path /path/to/tokenizer/model --tp 8
bash workspace/service_docker_up.sh
```
</details>
### Serving [Vicuna](https://lmsys.org/blog/2023-03-30-vicuna/)
<details open>
<summary><b>7B</b></summary>
```shell
python3 -m pip install fschat
python3 -m fastchat.model.apply_delta \
--base-model-path /path/to/llama-7b \
--target-model-path /path/to/vicuna-7b \
--delta-path lmsys/vicuna-7b-delta-v1.1
lmdeploy convert vicuna /path/to/vicuna-7b
bash workspace/service_docker_up.sh
```
</details>
<details open>
<summary><b>13B</b></summary>
```shell
python3 -m pip install fschat
python3 -m fastchat.model.apply_delta \
--base-model-path /path/to/llama-13b \
--target-model-path /path/to/vicuna-13b \
--delta-path lmsys/vicuna-13b-delta-v1.1
lmdeploy convert vicuna /path/to/vicuna-13b
bash workspace/service_docker_up.sh
```
</details>
### Chat
```
lmdeploy chat turbomind ./workspace --cap chat --meta-instruct "Provide answers in Python"
```
The `--meta-instruct` instruction can be changed to other coding languages as long as codellama supports them.
### Python specialist
Launch inference server by:
```shell
# --tp: the number of GPUs used in tensor parallelism
lmdeploy serve api_server ./workspace --server-name ${server_ip} --server-port ${server_port} --tp 1
```
Then, you can communicate with it by command line,
or through webui after launching gradio,
```shell
# api_server_url is what printed in api_server.py, e.g. http://localhost:23333
# server_ip and server_port here are for gradio ui
# example: lmdeploy serve gradio http://localhost:23333 --server-name localhost --server-port 6006
lmdeploy serve gradio api_server_url --server-name ${gradio_ui_ip} --server-port ${gradio_ui_port}
```
Regarding the detailed information of the RESTful API, you can refer to the [guide](../serving/api_server.md).
# Architecture of TurboMind
TurboMind is an inference engine that supports high throughput inference for conversational LLMs. It's based on NVIDIA's [FasterTransformer](https://github.com/NVIDIA/FasterTransformer). Major features of TurboMind include an efficient LLaMa implementation, the persistent batch inference model and an extendable KV cache manager.
## High level overview of TurboMind
```
+--------------------+
| API |
+--------------------+
| ^
request | | stream callback
v |
+--------------------+ fetch +-------------------+
| Persistent Batch | <-------> | KV Cache Manager |
+--------------------+ update +-------------------+
^
|
v
+------------------------+
| LLaMA implementation |
+------------------------+
| FT kernels & utilities |
+------------------------+
```
## Persistent Batch
You may recognize this feature as "continuous batching" in other repos. But during the concurrent development of the feature, we modeled the inference of a conversational LLM as a persistently running batch whose lifetime spans the entire serving process, hence the name "persistent batch". To put it simply (a toy sketch follows the list below):
- The persistent batch has N pre-configured batch slots.
- Requests join the batch when there are free slots available. A batch slot is released and can be reused once the generation of the requested tokens is finished.
- __On cache-hits (see below), history tokens don't need to be decoded in every round of a conversation; generation of response tokens will start instantly.__
- The batch grows or shrinks automatically to minimize unnecessary computations.
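A toy Python sketch of the slot-based behavior described above (purely illustrative; not TurboMind's actual data structures or API):

```python
class PersistentBatch:
    """Toy model of persistent batching: N slots live for the whole serving process."""

    def __init__(self, num_slots):
        self.slots = [None] * num_slots   # None means a free slot
        self.queue = []                   # requests waiting for a free slot

    def submit(self, request):
        for i, slot in enumerate(self.slots):
            if slot is None:
                self.slots[i] = request   # join the running batch immediately
                return
        self.queue.append(request)        # otherwise wait in the queue

    def step(self):
        """One decoding step over all occupied slots; finished requests free their slot."""
        for i, request in enumerate(self.slots):
            if request is None:
                continue
            request["generated"] += 1
            if request["generated"] >= request["max_new_tokens"]:
                self.slots[i] = self.queue.pop(0) if self.queue else None


batch = PersistentBatch(num_slots=2)
batch.submit({"generated": 0, "max_new_tokens": 1})
batch.submit({"generated": 0, "max_new_tokens": 3})
batch.submit({"generated": 0, "max_new_tokens": 2})  # waits until a slot is released
for _ in range(5):
    batch.step()
```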
## KV Cache Manager
The [KV cache manager](https://github.com/InternLM/lmdeploy/blob/main/src/turbomind/models/llama/SequenceManager.h) of TurboMind is a memory-pool-like object that also implements an LRU policy, so it can be viewed as a form of __cache of KV caches__. It works in the following way (a simplified sketch follows the list):
- All device memory required for KV cache is allocated by the manager. A fixed number of slots is pre-configured to match the memory size of the system. Each slot corresponds to the memory required by the KV cache of a single sequence. The allocation chunk-size can be configured to implement a pre-allocate/on-demand style allocation policy (or something in-between).
- When space for the KV cache of a new sequence is requested but no free slots left in the pool, the least recently used sequence is evicted from the cache and its device memory is directly reused by the new sequence. However, this is not the end of the story.
- Fetching a sequence that currently resides in one of the slots resembles a _cache-hit_; the history KV cache is returned directly and no context decoding is needed.
- Victim (evicted) sequences are not erased entirely but converted to their most compact form, i.e. token IDs. When the same sequence id is fetched later (_cache-miss_), the token IDs will be decoded by the FMHA-backed context decoder and converted back to KV cache.
- The eviction and conversion are handled automatically inside TurboMind and are thus transparent to users. __From the user's perspective, systems that use TurboMind have access to infinite device memory.__
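A simplified Python sketch of such an LRU "cache of KV caches" (illustrative only; the real manager works on device memory blocks rather than Python objects):

```python
from collections import OrderedDict


class KVCacheManager:
    """Toy LRU cache of per-sequence KV caches; evicted sequences keep only their token IDs."""

    def __init__(self, num_slots):
        self.num_slots = num_slots
        self.active = OrderedDict()   # seq_id -> kv cache, most recently used last
        self.evicted = {}             # seq_id -> token ids (the compact form)

    def fetch(self, seq_id, new_token_ids):
        if seq_id in self.active:                   # cache-hit: reuse the history KV cache
            self.active.move_to_end(seq_id)
        else:
            if len(self.active) >= self.num_slots:  # no free slot: evict the LRU sequence
                victim, victim_kv = self.active.popitem(last=False)
                self.evicted[victim] = victim_kv["token_ids"]
            history = self.evicted.pop(seq_id, [])  # cache-miss: history must be re-decoded
            self.active[seq_id] = {"token_ids": list(history)}
        self.active[seq_id]["token_ids"] += list(new_token_ids)
        return self.active[seq_id]


mgr = KVCacheManager(num_slots=2)
mgr.fetch("s1", [1, 2, 3])
mgr.fetch("s2", [4, 5])
mgr.fetch("s3", [6])   # evicts s1, keeping only its token ids
mgr.fetch("s1", [7])   # cache-miss: s1 is rebuilt from its token ids
```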
## LLaMa implementation
Our implementation of the LLaMa family models is modified from the Gpt-NeoX model in FasterTransformer. In addition to basic refactoring and modifications to support the LLaMa family, we made some improvements to enable high performance inference of conversational models, most importantly:
- To support fast context decoding in multi-round conversations, we replaced the attention implementation in the context decoder with a [cutlass](https://github.com/NVIDIA/cutlass)-based FMHA implementation that supports mismatched Q/K lengths.
- We introduced indirect buffer pointers in both context FMHA and generation FMHA to support the discontinuity in KV cache within the batch.
- To support concurrent inference with persistent batch, a new synchronization mechanism was designed to orchestrate the worker threads running in tensor parallel mode.
- To maximize the throughput, we implement INT8 KV cache support to increase the max batch size. It's effective because in real-world serving scenarios, KV cache costs more memory and consumes more memory bandwidth than weights or other activations.
- We resolved an NCCL hang issue when running multiple model instances in TP mode within a single process, NCCL APIs are now guarded by host-side synchronization barriers.
## API
TurboMind supports a Python API that enables streaming output and tensor parallel mode.
The ability to use [tritonserver](https://github.com/triton-inference-server/server) for serving is also inherited from FasterTransformer. However, to support submitting concurrent requests into our persistent batch model, we no longer use sequence batching or dynamic batching as FasterTransformer does. The bookkeeping of request and sequence states is managed by TurboMind instead.
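For reference, a minimal way to drive TurboMind from Python in recent lmdeploy releases is the high-level `pipeline` interface; the exact arguments may differ between versions, and `./workspace` below is assumed to be a converted TurboMind model folder.
```python
from lmdeploy import pipeline

# './workspace' is the default output folder of `lmdeploy convert`;
# a HuggingFace model id can usually be passed instead.
pipe = pipeline('./workspace')
print(pipe(['Hi, please introduce yourself']))
```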
## Difference between FasterTransformer and TurboMind
Apart from the features described above, there are still many minor differences that we don't cover in this document. Notably, many capabilities of FT were dropped in TurboMind because of the difference in objectives (e.g. prefix prompt, beam search, context embedding, sparse GEMM, GPT/T5/other model families, etc.).
## FAQ
### Supporting Huggingface models
For historical reasons, TurboMind's weight layout is based on [the original LLaMa implementation](https://github.com/facebookresearch/llama) (they differ only by a transpose). The implementation in huggingface transformers uses a [different layout](https://github.com/huggingface/transformers/blob/45025d92f815675e483f32812caa28cce3a960e7/src/transformers/models/llama/convert_llama_weights_to_hf.py#L123C76-L123C76) for `W_q` and `W_k`, which is handled in [deploy.py](https://github.com/InternLM/lmdeploy/blob/ff4648a1d09e5aec74cf70efef35bfaeeac552e0/lmdeploy/serve/turbomind/deploy.py#L398).
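For reference, the layout shuffle applied by the huggingface conversion script looks roughly like the sketch below (adapted from the linked `convert_llama_weights_to_hf.py`; the 4096/32 defaults correspond to the 7B model, and the exact code may differ between transformers versions). `deploy.py` has to undo this permutation, plus the transpose mentioned above, to recover the layout that TurboMind expects.
```python
import torch

def hf_permute(w: torch.Tensor, n_heads: int = 32,
               dim1: int = 4096, dim2: int = 4096) -> torch.Tensor:
    """Re-interleave the rotary halves of W_q / W_k the way HF transformers stores them."""
    return w.view(n_heads, dim1 // n_heads // 2, 2, dim2).transpose(1, 2).reshape(dim1, dim2)

w_q = torch.randn(4096, 4096)   # a dummy 7B-sized projection weight
w_q_hf = hf_permute(w_q)
```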
# TurboMind Config
TurboMind is one of the inference engines of LMDeploy. To use it for model inference, you need to convert the input model into a TurboMind model. Besides the model weight files, the TurboMind model folder also contains several other files, the most important of which is the configuration file `triton_models/weights/config.ini`, as it is closely related to inference performance.
If you are using LMDeploy version 0.0.x, please refer to the [turbomind 1.0 config](#turbomind-10-config) section to learn the relevant content in the configuration. Otherwise, please read [turbomind 2.0 config](#turbomind-20-config) to familiarize yourself with the configuration details.
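Since `config.ini` is a plain INI file, it can be inspected or patched with Python's standard `configparser`. A minimal sketch, assuming the default `./workspace` output folder of `lmdeploy convert`:
```python
import configparser

cfg_path = './workspace/triton_models/weights/config.ini'
cfg = configparser.ConfigParser()
cfg.read(cfg_path)

print(dict(cfg['llama']))               # show the current key/value pairs

cfg['llama']['max_batch_size'] = '128'  # example: raise the max batch size
with open(cfg_path, 'w') as f:
    cfg.write(f)
```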
## TurboMind 2.0 config
Take the `llama-2-7b-chat` model as an example. In TurboMind 2.0, its config.ini content is as follows:
```toml
[llama]
model_name = llama2
tensor_para_size = 1
head_num = 32
kv_head_num = 32
vocab_size = 32000
num_layer = 32
inter_size = 11008
norm_eps = 1e-06
attn_bias = 0
start_id = 1
end_id = 2
session_len = 4104
weight_type = fp16
rotary_embedding = 128
rope_theta = 10000.0
size_per_head = 128
group_size = 0
max_batch_size = 64
max_context_token_num = 1
step_length = 1
cache_max_entry_count = 0.5
cache_block_seq_len = 128
cache_chunk_size = 1
use_context_fmha = 1
quant_policy = 0
max_position_embeddings = 2048
rope_scaling_factor = 0.0
use_logn_attn = 0
```
These parameters are composed of model attributes and inference parameters. Model attributes include the number of layers, the number of heads, dimensions, etc., and they are **not modifiable**.
```toml
model_name = llama2
head_num = 32
kv_head_num = 32
vocab_size = 32000
num_layer = 32
inter_size = 11008
norm_eps = 1e-06
attn_bias = 0
start_id = 1
end_id = 2
rotary_embedding = 128
rope_theta = 10000.0
size_per_head = 128
```
Compared to TurboMind 1.0, the model-attribute part of the config remains the same, while the inference parameters have changed.
In the following sections, we will focus on the inference parameters.
### data type
`weight_type` and `group_size` are the relevant parameters, **which cannot be modified**.
`weight_type` represents the data type of the weights. Currently, `fp16` and `int4` are supported, where `int4` denotes 4-bit weights. When `weight_type` is `int4`, `group_size` is the group size used when quantizing weights with `awq`. The LMDeploy prebuilt package includes kernels with `group_size = 128`.
### batch size
The maximum batch size is still set through `max_batch_size`, but its default value has been changed from 32 to 64, and `max_batch_size` is no longer tied to `cache_max_entry_count`.
### k/v cache size
k/v cache memory is determined by `cache_block_seq_len` and `cache_max_entry_count`.
TurboMind 2.0 has implemented Paged Attention, managing the k/v cache in blocks.
`cache_block_seq_len` represents the number of tokens in a k/v block, with a default value of 128. TurboMind calculates the memory size of a k/v block according to the following formula:
```
cache_block_seq_len * num_layer * kv_head_num * size_per_head * 2 * sizeof(kv_data_type)
```
For the llama2-7b model, when storing k/v as the `half` type, the memory of a k/v block is: `128 * 32 * 32 * 128 * 2 * sizeof(half) = 64MB`
The meaning of `cache_max_entry_count` varies depending on its value:
- When it's a decimal in the range (0, 1), `cache_max_entry_count` represents the percentage of GPU memory used by k/v blocks. For example, if TurboMind launches on an A100-80G GPU with `cache_max_entry_count` set to `0.5`, the total memory used by the k/v blocks is `80 * 0.5 = 40 GB`.
- When it's an integer > 0, it represents the total number of k/v blocks.
The `cache_chunk_size` indicates the size of the k/v cache chunk to be allocated each time new k/v cache blocks are needed. Different values represent different meanings:
- When it is an integer > 0, `cache_chunk_size` number of k/v cache blocks are allocated.
- When the value is -1, `cache_max_entry_count` number of k/v cache blocks are allocated.
- When the value is 0, `sqrt(cache_max_entry_count)` number of k/v cache blocks are allocated.
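Putting the formula and the two knobs together, a quick back-of-the-envelope calculation using the llama2-7b attributes from the config above and an 80 GB card:
```python
def kv_block_bytes(cache_block_seq_len=128, num_layer=32, kv_head_num=32,
                   size_per_head=128, kv_bytes=2):
    # k and v (factor 2) for every layer, for one block of cache_block_seq_len tokens
    return cache_block_seq_len * num_layer * kv_head_num * size_per_head * 2 * kv_bytes

block = kv_block_bytes()
print(block / 2**20, 'MB per k/v block')                 # 64.0 MB for llama2-7b in fp16

cache_max_entry_count = 0.5                              # fraction of GPU memory given to k/v blocks
budget = 80 * 2**30 * cache_max_entry_count
print(int(budget // block), 'blocks fit in the budget')  # 640 blocks
```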
### kv int8 switch
To enable 8-bit k/v inference, set `quant_policy = 4`. Please refer to [kv int8](./kv_int8.md) for a guide.
### long context switch
Setting `rope_scaling_factor = 1.0` enables the Dynamic NTK option of RoPE, which allows the model to handle long input and output texts.
Regarding the principle of Dynamic NTK, please refer to:
1. https://www.reddit.com/r/LocalLLaMA/comments/14mrgpr/dynamically_scaled_rope_further_increases
2. https://kexue.fm/archives/9675
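For intuition, Dynamic NTK enlarges the RoPE base as the sequence grows beyond the trained context length. The sketch below follows the formulation used in huggingface transformers' dynamic-NTK rotary embedding and is shown only for reference; TurboMind's kernel takes `rope_theta`, `max_position_embeddings` and `rope_scaling_factor` from the config, and its in-kernel math may differ in details.
```python
def dynamic_ntk_base(seq_len, base=10000.0, max_position_embeddings=2048,
                     scaling_factor=1.0, dim=128):
    # The base is only rescaled once the sequence exceeds the trained context length.
    if seq_len <= max_position_embeddings:
        return base
    scale = scaling_factor * seq_len / max_position_embeddings - (scaling_factor - 1)
    return base * scale ** (dim / (dim - 2))

print(dynamic_ntk_base(8192))   # a larger base stretches RoPE to cover the longer context
```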
You can also turn on [LogN attention scaling](https://kexue.fm/archives/8823) by setting `use_logn_attn = 1`.
## TurboMind 1.0 config
Taking the `llama-2-7b-chat` model as an example, in TurboMind 1.0, its `config.ini` content is as follows:
```toml
[llama]
model_name = llama2
tensor_para_size = 1
head_num = 32
kv_head_num = 32
vocab_size = 32000
num_layer = 32
inter_size = 11008
norm_eps = 1e-06
attn_bias = 0
start_id = 1
end_id = 2
session_len = 4104
weight_type = fp16
rotary_embedding = 128
rope_theta = 10000.0
size_per_head = 128
group_size = 0
max_batch_size = 32
max_context_token_num = 4
step_length = 1
cache_max_entry_count = 48
cache_chunk_size = 1
use_context_fmha = 1
quant_policy = 0
max_position_embeddings = 2048
use_dynamic_ntk = 0
use_logn_attn = 0
```
These parameters are composed of model attributes and inference parameters. Model attributes include the number of layers, the number of heads, dimensions, etc., and they are **not modifiable**.
```toml
model_name = llama2
head_num = 32
kv_head_num = 32
vocab_size = 32000
num_layer = 32
inter_size = 11008
norm_eps = 1e-06
attn_bias = 0
start_id = 1
end_id = 2
rotary_embedding = 128
rope_theta = 10000.0
size_per_head = 128
```
In the following sections, we will focus on introducing the inference parameters.
### data type
`weight_type` and `group_size` are the relevant parameters, **which cannot be modified**.
`weight_type` represents the data type of the weights. Currently, `fp16` and `int4` are supported, where `int4` denotes 4-bit weights. When `weight_type` is `int4`, `group_size` is the group size used when quantizing weights with `awq`. The LMDeploy prebuilt package includes kernels with `group_size = 128`.
### batch size
`max_batch_size` determines the maximum batch size during inference. In general, the larger the batch size, the higher the throughput. But make sure that `max_batch_size <= cache_max_entry_count`.
### k/v cache size
TurboMind allocates k/v cache memory based on `session_len`, `cache_chunk_size`, and `cache_max_entry_count`.
- `session_len` denotes the maximum length of a sequence, i.e., the size of the context window.
- `cache_chunk_size` indicates the number of k/v cache sequences to allocate each time new sequences are added.
- `cache_max_entry_count` signifies the maximum number of k/v sequences that can be cached.
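Since each cached sequence reserves k/v space for up to `session_len` tokens, its footprint can be estimated the same way as the block formula in the 2.0 section. A sketch assuming fp16 k/v and the llama2-7b attributes above:
```python
def kv_seq_bytes(session_len=4104, num_layer=32, kv_head_num=32,
                 size_per_head=128, kv_bytes=2):
    # k and v (factor 2) for every layer, for a full session_len-token sequence
    return session_len * num_layer * kv_head_num * size_per_head * 2 * kv_bytes

print(kv_seq_bytes() / 2**30, 'GB per cached sequence')   # ~2 GB for llama2-7b in fp16
# cache_max_entry_count caps how many such sequences may be cached at once,
# and memory is claimed in cache_chunk_size-sized increments as sequences arrive.
```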
### kv int8 switch
To enable 8-bit k/v inference, set `quant_policy = 4` and `use_context_fmha = 0`. Please refer to [kv int8](./kv_int8.md) for a guide.
### long context switch
Setting `use_dynamic_ntk = 1` enables the Dynamic NTK option of RoPE, which allows the model to handle long input and output texts.
Regarding the principle of Dynamic NTK, please refer to:
1. https://www.reddit.com/r/LocalLLaMA/comments/14mrgpr/dynamically_scaled_rope_further_increases
2. https://kexue.fm/archives/9675
You can also turn on [LogN attention scaling](https://kexue.fm/archives/8823) by setting `use_logn_attn = 1`.
# W4A16 LLM Model Deployment
LMDeploy supports inference of LLMs with 4-bit weights. The minimum requirement is an NVIDIA GPU with compute capability sm80 or higher, such as the A10, A100, or GeForce 30/40 series.
Before proceeding with the inference, please ensure that lmdeploy is installed.
```shell
pip install lmdeploy[all]
```
## 4-bit LLM model Inference
You can download the pre-quantized 4-bit weight models from LMDeploy's [model zoo](https://huggingface.co/lmdeploy) and conduct inference using the following command.
Alternatively, you can quantize 16-bit weights to 4-bit weights following the ["4-bit Weight Quantization"](#4-bit-weight-quantization) section, and then perform inference as per the below instructions.
Take the 4-bit Llama-2-chat-7B model from the model zoo as an example:
```shell
git-lfs install
git clone https://huggingface.co/lmdeploy/llama2-chat-7b-w4
```
As demonstrated in the commands below, first convert the model's layout with `lmdeploy convert`, and then you can interact with the AI assistant in the terminal:
```shell
## Convert the model's layout and store it in the default path, ./workspace.
lmdeploy convert \
--model-name llama2 \
--model-path ./llama2-chat-7b-w4 \
--model-format awq \
--group-size 128
## inference
lmdeploy chat turbomind ./workspace
```
## Serve with gradio
If you wish to interact with the model via a web UI, please launch the gradio server as shown below:
```shell
lmdeploy serve gradio ./workspace --server_name {ip_addr} --server_port {port}
```
Subsequently, you can open the website `http://{ip_addr}:{port}` in your browser and interact with the model.
## Inference Performance
We benchmarked the Llama-2-7B-chat and Llama-2-13B-chat models with 4-bit quantization on an NVIDIA GeForce RTX 4090 using [profile_generation.py](https://github.com/InternLM/lmdeploy/blob/main/benchmark/profile_generation.py). We measure the token generation throughput (tokens/s) with a single prompt token and 512 generated tokens. All results are measured with single-batch inference.
| model | llm-awq | mlc-llm | turbomind |
| ---------------- | ------- | ------- | --------- |
| Llama-2-7B-chat | 112.9 | 159.4 | 206.4 |
| Llama-2-13B-chat | N/A | 90.7 | 115.8 |
Memory (GB) comparison between the 4-bit and 16-bit models with context sizes of 2048 and 4096, respectively:
| model | 16bit(2048) | 4bit(2048) | 16bit(4096) | 4bit(4096) |
| ---------------- | ----------- | ---------- | ----------- | ---------- |
| Llama-2-7B-chat | 15.1 | 6.3 | 16.2 | 7.5 |
| Llama-2-13B-chat | OOM | 10.3 | OOM | 12.0 |
Before reproducing the memory benchmark, install `nvidia-ml-py`, which the benchmark script needs for collecting GPU memory statistics:
```shell
pip install nvidia-ml-py
```
Then run the profiling script against the converted 4-bit model:
```shell
python benchmark/profile_generation.py \
--model-path ./workspace \
--concurrency 1 8 --prompt-tokens 1 512 --completion-tokens 2048 512
```
## 4-bit Weight Quantization
The quantization includes two steps:
- generate the quantization parameters
- quantize the model according to the parameters
### Step 1: Generate Quantization Parameter
```shell
# --calib_dataset: calibration dataset; c4, ptb, wikitext2 and pileval are supported
# --calib_samples: number of samples in the calibration set; reduce it if memory is insufficient
# --calib_seqlen:  length of a single piece of text; reduce it if memory is insufficient
# --work_dir:      folder storing the Pytorch-format quantization statistics and post-quantization weights
lmdeploy lite calibrate \
  --model $HF_MODEL \
  --calib_dataset 'c4' \
  --calib_samples 128 \
  --calib_seqlen 2048 \
  --work_dir $WORK_DIR
```
### Step2: Quantize Weights
LMDeploy employs AWQ algorithm for model weight quantization.
```shell
# --w_bits:       bit width for weight quantization
# --w_group_size: group size for weight quantization statistics
# --work_dir:     directory holding the quantization parameters from Step 1
lmdeploy lite auto_awq \
  --model $HF_MODEL \
  --w_bits 4 \
  --w_group_size 128 \
  --work_dir $WORK_DIR
```
After the quantization is complete, the quantized model is saved to `$WORK_DIR`. Then you can proceed with model inference according to the instructions in the ["4-Bit Weight Model Inference"](#4-bit-llm-model-inference) section.
.header-logo {
  background-image: url("../image/lmdeploy-logo.svg");
  background-size: 257px 60px;
  height: 60px;
  width: 257px;
}
@media screen and (min-width: 1100px) {
.header-logo {
top: -15px;
}
}
pre {
white-space: pre;
}
@media screen and (min-width: 2000px) {
.pytorch-content-left {
width: 1200px;
margin-left: 30px;
}
article.pytorch-article {
max-width: 1200px;
}
.pytorch-breadcrumbs-wrapper {
width: 1200px;
}
.pytorch-right-menu.scrolling-fixed {
position: fixed;
top: 45px;
left: 1580px;
}
}
article.pytorch-article section code {
padding: .2em .4em;
background-color: #f3f4f7;
border-radius: 5px;
}
/* Disable the change in tables */
article.pytorch-article section table code {
padding: unset;
background-color: unset;
border-radius: unset;
}
table.autosummary td {
width: 50%
}
img.align-center {
display: block;
margin-left: auto;
margin-right: auto;
}
article.pytorch-article p.rubric {
font-weight: bold;
}