Commit d7117b95 authored by zhouxiang

Sync the 0.2.6 code

parent 5f83e392
# TurboMind Benchmark on A100

All the following results are tested on A100-80G(x8), CUDA 11.8.
The tested lmdeploy version is `v0.2.0`.
The commands provided below facilitate benchmarking both [static inference performance](#static-inference-benchmark) and [request throughput](#request-throughput-benchmark) on an A100-80G(x8) for models of various sizes.
## Request Throughput Benchmark
- `batch`: the max batch size during inference
- `tp`: the number of GPU cards for tensor parallelism
- `num_prompts`: the number of prompts, i.e. the number of requests
- `RPS`: **R**equests **P**er **S**econd
- `FTL`: **F**irst **T**oken **L**atency
### FP16
```shell
bash benchmark/benchmark_7b.sh <the/path/of/llama2-7b/model>
bash benchmark/benchmark_13b.sh <the/path/of/llama2-13b/model>
bash benchmark/benchmark_20b.sh <the/path/of/internlm-20b/model>
bash benchmark/benchmark_70b.sh <the/path/of/llama2-70b/model>
```
| model        | batch | tp  | num_prompts | RPS    | FTL(ave)(s) | FTL(min)(s) | FTL(max)(s) | 50%(s) | 75%(s) | 95%(s) | 99%(s) | throughput(out tok/s) | throughput(total tok/s) |
| ------------ | ----- | --- | ---------- | ------ | ----------- | ----------- | ----------- | ------ | ------ | ------ | ------ | --------------------- | ----------------------- |
| llama2-7b | 256 | 1 | 3000 | 14.556 | 0.526 | 0.092 | 4.652 | 0.066 | 0.101 | 0.155 | 0.220 | 3387.419 | 6981.159 |
| llama2-13b | 128 | 1 | 3000 | 7.950 | 0.352 | 0.075 | 4.193 | 0.051 | 0.067 | 0.138 | 0.202 | 1850.145 | 3812.978 |
| internlm-20b | 128 | 2 | 3000 | 10.291 | 0.287 | 0.073 | 3.845 | 0.053 | 0.072 | 0.113 | 0.161 | 2053.266 | 4345.057 |
| llama2-70b | 256 | 4 | 3000 | 7.231 | 1.075 | 0.139 | 14.524 | 0.102 | 0.153 | 0.292 | 0.482 | 1682.738 | 3467.969 |
## Static Inference Benchmark
- `batch`: the max batch size during inference
- `tp`: the number of GPU cards for tensor parallelism
- `prompt_tokens`: the number of input tokens
- `output_tokens`: the number of generated tokens
- `throughput`: the number of generated tokens per second
- `FTL`: **F**irst **T**oken **L**atency
### llama2-7b
| batch | tp | prompt_tokens | output_tokens | throughput(out tok/s) | mem(GB) | FTL(ave)(s) | FTL(min)(s) | FTL(max)(s) | 50%(s) | 75%(s) | 95%(s) | 99%(s) |
| ----- | --- | ------------- | ------------- | --------------------- | ------- | ----------- | ----------- | ----------- | ------ | ------ | ------ | ------ |
| 64 | 1 | 128 | 2048 | 1852.06 | 77.96 | 0.535 | 0.027 | 1.231 | 0.03 | 0.041 | 0.048 | 0.053 |
| 64 | 1 | 2048 | 128 | 493.46 | 78.4 | 6.59 | 0.142 | 16.235 | 0.046 | 0.049 | 0.055 | 0.767 |
| 64 | 1 | 2048 | 2048 | 755.65 | 78.4 | 39.105 | 0.142 | 116.285 | 0.047 | 0.049 | 0.051 | 0.207 |
### llama2-13b
| batch | tp | prompt_tokens | output_tokens | throughput(out tok/s) | mem(GB) | FTL(ave)(s) | FTL(min)(s) | FTL(max)(s) | 50%(s) | 75%(s) | 95%(s) | 99%(s) |
| ----- | --- | ------------- | ------------- | --------------------- | ------- | ----------- | ----------- | ----------- | ------ | ------ | ------ | ------ |
| 1 | 1 | 1 | 128 | 57.49 | 74.84 | 0.018 | 0.018 | 0.019 | 0.017 | 0.017 | 0.017 | 0.017 |
| 1 | 1 | 128 | 128 | 56.58 | 74.84 | 0.04 | 0.039 | 0.04 | 0.017 | 0.017 | 0.017 | 0.018 |
| 1 | 1 | 128 | 2048 | 55.29 | 74.84 | 0.04 | 0.04 | 0.04 | 0.018 | 0.018 | 0.018 | 0.019 |
| 1 | 1 | 2048 | 128 | 48.99 | 75.09 | 0.242 | 0.242 | 0.243 | 0.019 | 0.019 | 0.019 | 0.019 |
| 1 | 1 | 2048 | 2048 | 52.12 | 75.09 | 0.243 | 0.24 | 0.244 | 0.019 | 0.019 | 0.019 | 0.02 |
| 16 | 1 | 1 | 128 | 869.45 | 74.87 | 0.036 | 0.019 | 0.053 | 0.018 | 0.019 | 0.019 | 0.02 |
| 16 | 1 | 128 | 128 | 757.3 | 75.09 | 0.252 | 0.041 | 0.272 | 0.019 | 0.02 | 0.02 | 0.021 |
| 16 | 1 | 128 | 2048 | 605.88 | 75.09 | 0.253 | 0.041 | 0.275 | 0.026 | 0.03 | 0.033 | 0.034 |
| 16 | 1 | 2048 | 128 | 257.92 | 76.96 | 3.442 | 0.245 | 3.668 | 0.033 | 0.034 | 0.035 | 0.035 |
| 16 | 1 | 2048 | 2048 | 366.67 | 76.99 | 3.122 | 0.249 | 3.671 | 0.04 | 0.044 | 0.047 | 0.047 |
| 32 | 1 | 1 | 128 | 1667.5 | 74.9 | 0.034 | 0.021 | 0.057 | 0.019 | 0.02 | 0.021 | 0.023 |
| 32 | 1 | 128 | 128 | 1301.27 | 75.37 | 0.461 | 0.04 | 0.497 | 0.021 | 0.022 | 0.023 | 0.025 |
| 32 | 1 | 128 | 2048 | 860.14 | 75.84 | 0.833 | 0.041 | 1.151 | 0.034 | 0.042 | 0.047 | 0.048 |
| 32 | 1 | 2048 | 128 | 291.54 | 77.02 | 5.315 | 0.245 | 13.483 | 0.046 | 0.047 | 0.049 | 0.51 |
| 32 | 1 | 2048 | 2048 | 389.64 | 77.02 | 38.725 | 0.245 | 108.104 | 0.047 | 0.047 | 0.049 | 0.05 |
| 64 | 1 | 1 | 128 | 3049.16 | 74.96 | 0.044 | 0.025 | 0.073 | 0.02 | 0.022 | 0.026 | 0.029 |
| 64 | 1 | 128 | 128 | 2033.22 | 75.87 | 0.703 | 0.046 | 0.951 | 0.024 | 0.026 | 0.029 | 0.032 |
| 64 | 1 | 128 | 2048 | 998.86 | 76.9 | 7.805 | 0.042 | 60.1 | 0.045 | 0.047 | 0.05 | 0.063 |
| 64 | 1 | 2048 | 128 | 286.32 | 76.99 | 19.69 | 0.245 | 32.394 | 0.047 | 0.048 | 0.05 | 0.27 |
| 64 | 1 | 2048 | 2048 | 387.86 | 77.09 | 190.453 | 0.245 | 307.331 | 0.047 | 0.048 | 0.049 | 0.05 |
### internlm-20b
| batch | tp | prompt_tokens | output_tokens | throughput(out tok/s) | mem(GB) | FTL(ave)(s) | FTL(min)(s) | FTL(max)(s) | 50%(s) | 75%(s) | 95%(s) | 99%(s) |
| ----- | --- | ------------- | ------------- | --------------------- | ------- | ----------- | ----------- | ----------- | ------ | ------ | ------ | ------ |
| 1 | 2 | 1 | 128 | 61.14 | 73.55 | 0.018 | 0.017 | 0.019 | 0.016 | 0.016 | 0.016 | 0.018 |
| 1 | 2 | 128 | 128 | 60.03 | 73.55 | 0.042 | 0.041 | 0.043 | 0.016 | 0.016 | 0.016 | 0.017 |
| 1 | 2 | 128 | 2048 | 58.26 | 73.55 | 0.042 | 0.042 | 0.043 | 0.017 | 0.017 | 0.018 | 0.018 |
| 1 | 2 | 2048 | 128 | 51.93 | 73.68 | 0.217 | 0.216 | 0.217 | 0.018 | 0.018 | 0.018 | 0.018 |
| 1 | 2 | 2048 | 2048 | 56.36 | 73.68 | 0.217 | 0.217 | 0.217 | 0.018 | 0.018 | 0.018 | 0.018 |
| 16 | 2 | 1 | 128 | 903.01 | 73.65 | 0.034 | 0.018 | 0.051 | 0.017 | 0.018 | 0.019 | 0.02 |
| 16 | 2 | 128 | 128 | 794.13 | 73.74 | 0.227 | 0.043 | 0.248 | 0.018 | 0.019 | 0.02 | 0.021 |
| 16 | 2 | 128 | 2048 | 669.87 | 73.74 | 0.227 | 0.043 | 0.25 | 0.024 | 0.027 | 0.029 | 0.03 |
| 16 | 2 | 2048 | 128 | 288.60 | 75.60 | 3.09 | 0.247 | 4.485 | 0.029 | 0.03 | 0.031 | 0.032 |
| 16 | 2 | 2048 | 2048 | 441.46 | 75.61 | 3.172 | 0.219 | 4.442 | 0.035 | 0.037 | 0.04 | 0.041 |
| 32 | 2 | 1 | 128 | 1673.64 | 73.71 | 0.037 | 0.02 | 0.066 | 0.019 | 0.02 | 0.021 | 0.023 |
| 32 | 2 | 128 | 128 | 1347.57 | 73.90 | 0.351 | 0.043 | 0.436 | 0.02 | 0.021 | 0.023 | 0.025 |
| 32 | 2 | 128 | 2048 | 1025.62 | 73.90 | 0.391 | 0.042 | 0.441 | 0.031 | 0.037 | 0.041 | 0.043 |
| 32 | 2 | 2048 | 128 | 352.45 | 75.74 | 6.062 | 0.218 | 6.3 | 0.042 | 0.043 | 0.045 | 0.046 |
| 32 | 2 | 2048 | 2048 | 514.60 | 75.77 | 10.36 | 0.222 | 70.328 | 0.049 | 0.05 | 0.051 | 0.053 |
| 64 | 2 | 1 | 128 | 2954.34 | 73.82 | 0.05 | 0.029 | 0.074 | 0.021 | 0.023 | 0.026 | 0.03 |
| 64 | 2 | 128 | 128 | 2122.92 | 74.24 | 0.591 | 0.047 | 0.808 | 0.024 | 0.026 | 0.029 | 0.032 |
| 64 | 2 | 128 | 2048 | 1276.61 | 75.18 | 2.529 | 0.049 | 41.212 | 0.042 | 0.048 | 0.052 | 0.055 |
| 64 | 2 | 2048 | 128 | 350.82 | 75.88 | 12.382 | 0.219 | 20.986 | 0.05 | 0.051 | 0.054 | 0.249 |
| 64 | 2 | 2048 | 2048 | 512.37 | 76.26 | 111.149 | 0.221 | 211.531 | 0.05 | 0.051 | 0.052 | 0.055 |
### llama2-70b
| batch | tp | prompt_tokens | output_tokens | throughput(out tok/s) | mem(GB) | FTL(ave)(s) | FTL(min)(s) | FTL(max)(s) | 50%(s) | 75%(s) | 95%(s) | 99%(s) |
| ----- | --- | ------------- | ------------- | --------------------- | ------- | ----------- | ----------- | ----------- | ------ | ------ | ------ | ------ |
| 1 | 4 | 1 | 128 | 33.94 | 73.72 | 0.031 | 0.03 | 0.031 | 0.029 | 0.029 | 0.029 | 0.03 |
| 1 | 4 | 128 | 128 | 33.63 | 73.72 | 0.074 | 0.073 | 0.074 | 0.029 | 0.029 | 0.029 | 0.03 |
| 1 | 4 | 128 | 2048 | 32.38 | 73.72 | 0.074 | 0.074 | 0.075 | 0.031 | 0.031 | 0.031 | 0.031 |
| 1 | 4 | 2048 | 128 | 28.32 | 73.78 | 0.402 | 0.401 | 0.403 | 0.031 | 0.031 | 0.031 | 0.051 |
| 1 | 4 | 2048 | 2048 | 31.9 | 73.78 | 0.405 | 0.402 | 0.407 | 0.031 | 0.031 | 0.031 | 0.031 |
| 16 | 4 | 1 | 128 | 468.52 | 73.72 | 0.071 | 0.034 | 0.939 | 0.03 | 0.031 | 0.032 | 0.251 |
| 16 | 4 | 128 | 128 | 439.77 | 73.81 | 0.437 | 0.08 | 0.687 | 0.03 | 0.031 | 0.032 | 0.207 |
| 16 | 4 | 128 | 2048 | 482.99 | 73.81 | 0.403 | 0.079 | 0.44 | 0.033 | 0.033 | 0.035 | 0.036 |
| 16 | 4 | 2048 | 128 | 189.34 | 73.98 | 5.776 | 0.437 | 7.612 | 0.035 | 0.036 | 0.036 | 0.037 |
| 16 | 4 | 2048 | 2048 | 399.42 | 73.98 | 5.773 | 0.411 | 6.844 | 0.036 | 0.037 | 0.038 | 0.041 |
| 32 | 4 | 1 | 128 | 906.03 | 73.75 | 0.098 | 0.043 | 0.253 | 0.032 | 0.033 | 0.035 | 0.178 |
| 32 | 4 | 128 | 128 | 746.36 | 73.91 | 0.749 | 0.078 | 1.026 | 0.032 | 0.033 | 0.035 | 0.438 |
| 32 | 4 | 128 | 2048 | 853.56 | 73.91 | 0.732 | 0.076 | 1.129 | 0.036 | 0.038 | 0.041 | 0.158 |
| 32 | 4 | 2048 | 128 | 232.6 | 73.99 | 11.834 | 0.408 | 13.321 | 0.04 | 0.041 | 0.043 | 0.248 |
| 32 | 4 | 2048 | 2048 | 636.23 | 73.99 | 11.711 | 0.409 | 12.689 | 0.043 | 0.045 | 0.048 | 0.179 |
| 64 | 4 | 1 | 128 | 1425.79 | 73.81 | 0.213 | 0.046 | 1.264 | 0.037 | 0.039 | 0.044 | 0.329 |
| 64 | 4 | 128 | 128 | 1159.84 | 73.96 | 1.292 | 0.107 | 2.676 | 0.037 | 0.04 | 0.045 | 0.378 |
| 64 | 4 | 128 | 2048 | 1391.8 | 73.95 | 1.173 | 0.135 | 1.623 | 0.043 | 0.047 | 0.052 | 0.251 |
| 64 | 4 | 2048 | 128 | 270.47 | 74.02 | 17.402 | 0.452 | 24.164 | 0.05 | 0.052 | 0.057 | 0.345 |
| 64 | 4 | 2048 | 2048 | 930.46 | 74.01 | 21.29 | 0.423 | 24.498 | 0.055 | 0.059 | 0.065 | 0.299 |
## Request Throughput Benchmark
FTL: **F**irst **T**oken **L**atency
| model        | batch | tp  | num_prompts | RPS    | RPM     | FTL(ave)(s) | FTL(min)(s) | FTL(max)(s) | throughput(out tok/s) | throughput(total tok/s) |
| ------------ | ----- | --- | ----------- | ------ | ------- | ----------- | ----------- | ----------- | --------------------- | ----------------------- |
| llama2-7b | 64 | 1 | 3000 | 10.275 | 616.477 | 0.092 | 0.036 | 1.145 | 2562.435 | 5283.547 |
| | 128 | 1 | 3000 | 12.611 | 756.677 | 0.205 | 0.056 | 2.241 | 3210.281 | 6619.357 |
| llama2-13b | 64 | 1 | 3000 | 6.337 | 380.244 | 0.159 | 0.051 | 2.048 | 1474.786 | 3039.398 |
| | 128 | 1 | 3000 | 7.588 | 455.273 | 0.412 | 0.085 | 4.445 | 1765.788 | 3639.128 |
| internlm-20b | 64 | 2 | 3000 | 7.842 | 470.516 | 0.166 | 0.059 | 2.461 | 1564.696 | 3311.16 |
| | 128 | 2 | 3000 | 9.776 | 586.568 | 0.34 | 0.079 | 5.808 | 1950.627 | 4127.855 |
| llama2-70b | 64 | 4 | 3000 | 4.285 | 257.08 | 0.301 | 0.083 | 4.689 | 1000.376 | 2062.7 |
| | 128 | 4 | 3000 | 5.833 | 349.996 | 0.633 | 0.107 | 8.431 | 1361.939 | 2808.216 |
| | 256 | 4 | 3000 | 6.568 | 394.108 | 1.49 | 0.171 | 19.52 | 1533.592 | 3162.15 |
# Profile API Server

The way to profile `api_server` performance is similar to the method for [profiling throughput](./profile_throughput.md). The difference is that `api_server` should be launched successfully before testing.

The profiling script is `profile_restful_api.py`. Before running it, please install the lmdeploy precompiled package, and download the script and the test dataset:
```shell
pip install lmdeploy
git clone --depth=1 https://github.com/InternLM/lmdeploy
cd lmdeploy/benchmark
wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
```
A specific model needs to be specified for the performance test. We recommend converting the model into the turbomind format via `lmdeploy convert` before testing, because it is then convenient to adjust the inference engine's parameters for better performance, such as the batch size (`max_batch_size`) and the K/V cache size (`cache_max_entry_count`). For detailed explanations of these parameters, please refer to [here](../turbomind_config.md).
In the following sections, we assume the model is in turbomind format.
## Metrics
LMDeploy records performance metrics such as first token latency, token throughput (tokens/s) and request throughput (RPM).
Total time includes prefill time.
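To make these metrics concrete, the following minimal Python sketch (not LMDeploy's implementation; the timing fields are assumed for illustration) shows how first token latency, token throughput and RPM can be derived from per-request timings, with the total time including prefill:

```python
# Hypothetical per-request records, for illustration only.
requests = [
    {"start": 0.0, "first_token": 0.09, "end": 2.5, "completion_tokens": 180},
    {"start": 0.1, "first_token": 0.21, "end": 3.0, "completion_tokens": 230},
]

total_time = max(r["end"] for r in requests) - min(r["start"] for r in requests)  # includes prefill time
first_token_latencies = [r["first_token"] - r["start"] for r in requests]

ftl_ave = sum(first_token_latencies) / len(first_token_latencies)              # first token latency (s)
token_throughput = sum(r["completion_tokens"] for r in requests) / total_time  # generated tokens/s
rpm = len(requests) / total_time * 60                                          # requests per minute

print(f"FTL(ave): {ftl_ave:.3f}s, out tok/s: {token_throughput:.1f}, RPM: {rpm:.1f}")
```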
## Profile

In this section, we take [internlm/internlm-7b](https://huggingface.co/internlm/internlm-7b) as an example to show the benchmark procedure.

### Launch api_server

```shell
lmdeploy serve api_server internlm/internlm-7b
```

If you would like to change the server's port or other parameters, such as the inference engine, max batch size and so on, please run `lmdeploy serve api_server -h` or read [this](../serving/api_server.md) guide for the detailed explanation.

Alternatively, if the model has been converted into the turbomind format, e.g. `lmdeploy convert internlm internlm/internlm-7b --dst-path ./internlm-7b`, you can launch the server with `lmdeploy serve api_server ./internlm-7b --server-port 23333` and pass `./internlm-7b/triton_models/tokenizer` as the tokenizer path when profiling.

The argument `--instance-num` reflects the number of inference instances. When more than `--instance-num` requests arrive at the `api_server` at the same time, the exceeding requests will wait in the inference queue.

### Profile

Open another terminal and run the following command in the `lmdeploy/benchmark` directory:

```shell
python3 profile_restful_api.py http://0.0.0.0:23333 internlm/internlm-7b ./ShareGPT_V3_unfiltered_cleaned_split.json
```

## Methods

The general usage is:

```shell
python3 profile_restful_api.py <server_addr> <tokenizer_path> <dataset> <optional arguments>
```
The required parameters are:
- `server_addr`
The address of api_server with format `http://{server_ip}:{server_port}`
- `tokenizer_path`
The path of the tokenizer model, which is used to encode the dataset to get the token size of prompts and responses
- `dataset`
The path of the downloaded dataset
Optional arguments are listed as below:
- `--concurrency`
It represents the number of request threads with default value 64. Requests of concurrent threads will be batched by the inference engine. Its value should not exceed the number of inference instances in the api_server.
Otherwise, the excess requests will wait in the inference queue.
- `--num-prompts`
The number of sampled prompts from dataset to process. The default is 2000.
- `--top_p` and `--temperature`
They are used to sample the generated token_id.
- `--stream_output`
Indicator for streaming output. The default is `False`.
- `--csv`
The path of a csv file to save the result with default value `../profile_api_server.csv`
- `--seed`
It is the seed used in sampling prompts from dataset with default value 0.
For detailed argument specification of `profile_restful_api.py`, such as request concurrency, sampling parameters and so on, please run the help command `python3 profile_restful_api.py -h`.
# Profile Token Latency and Throughput

We profile the latency and throughput of generated tokens with a fixed batch size and fixed numbers of input/output tokens.

The profiling script is `profile_generation.py`. Before running it, please install the lmdeploy precompiled package and download the profiling script:
```shell
pip install lmdeploy
git clone --depth=1 https://github.com/InternLM/lmdeploy
```
A specific model needs to be specified for the performance test. We recommend converting the model into the turbomind format via `lmdeploy convert` before testing, because it is then convenient to adjust the inference engine's parameters for better performance, such as the batch size (`max_batch_size`) and the K/V cache size (`cache_max_entry_count`). For detailed explanations of these parameters, please refer to [here](../turbomind_config.md).
In the following sections, we assume the model is in turbomind format.
## Metrics
LMDeploy records test results such as first token latency, token throughput (tokens/s), the percentile data of each token's latency (P50, P75, P95, P99), GPU memory usage, etc.
Total time includes prefill time.
During the test, no other programs should be running on the node's GPUs; otherwise, the GPU memory statistics would be inaccurate.
## Profile

In this section, we take [internlm/internlm-7b](https://huggingface.co/internlm/internlm-7b) as an example to show how to profile the inference engines of LMDeploy.

### Profile turbomind engine

```shell
cd lmdeploy/benchmark
python3 profile_generation.py internlm/internlm-7b
```

If the model has been converted into the turbomind format, e.g. `lmdeploy convert internlm internlm/internlm-7b --dst-path ./internlm-7b`, pass the workspace path instead: `python3 profile_generation.py ./internlm-7b`.

### Profile pytorch engine

```shell
cd lmdeploy/benchmark
python3 profile_generation.py internlm/internlm-7b --backend pytorch
```

## Command details

The general usage is:

```shell
python3 profile_generation.py <model_path> <optional arguments>
```

`model_path` is either a model repo on huggingface.co or the path on localhost where the model (e.g. in turbomind format) is located.
Optional arguments are listed as below:
- `--concurrency`
It represents the number of request threads. Requests of concurrent threads will be batched by the inference engine. It is a list with default value `[1, 16, 32, 64]`, which implies that the performance under 4 different levels of concurrency is tested. The level of concurrency should not exceed `max_batch_size` in [turbomind config](../turbomind_config.md#turbomind-20-config). Otherwise, there will be `concurrency - max_batch_size` threads randomly waiting almost at any time during the test.
- `--prompt-tokens` and `--completion-tokens`
Input token and output token numbers. They are lists of the same length. The elements in the list correspond one-to-one, that is,
the pair `(prompt_tokens[i], completion_tokens[i])` is a test case. In the default list `[1, 128, 128, 2048, 2048]` and `[128, 128, 2048, 128, 2048]`, the test cases are `(1, 128)`, `(128, 128)`, `(128, 2048)`, `(2048, 128)` and `(2048, 2048)`
- `--tp`
The number of GPUs used when the inference is in tensor parallel mode. It must be a power of 2. The default is 1.
- `--top_k`, `--top_p` and `--temperature`
They are used to sample the generated token_id.
- `--csv`
A csv file path used to store test results. The default is `./profile_generation.csv`
- `--log-level`
The log level. The default is 'ERROR'.
- `--test-round`
The number of test rounds is set to 10 by default. This means that each case will undergo 10 rounds of testing, and the average result will be calculated.
We refer to a tuple of `(#concurrency, #prompt_token, #completion_token)` as a test case. Therefore, the total number of test cases (`#test_cases`) executed by the script is `len(concurrency) * len(prompt-tokens)`, and the total test rounds are `#test_cases * #test_round`. Users can flexibly adjust test parameters according to their actual situation.
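For instance, with the default lists above, the pairing of input/output lengths and the resulting number of test cases can be illustrated in plain Python:

```python
# Default values taken from the argument descriptions above.
concurrency = [1, 16, 32, 64]
prompt_tokens = [1, 128, 128, 2048, 2048]
completion_tokens = [128, 128, 2048, 128, 2048]

io_pairs = list(zip(prompt_tokens, completion_tokens))
# [(1, 128), (128, 128), (128, 2048), (2048, 128), (2048, 2048)]

test_cases = [(c, p, o) for c in concurrency for (p, o) in io_pairs]
print(len(test_cases))  # 4 * 5 = 20 cases; with the default 10 test rounds, 200 rounds in total
```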
For detailed argument specification of `profile_generation.py`, such as batch size, input and output token numbers and so on, please run the help command `python3 profile_generation.py -h`.
# Profile Request Throughput

In real applications, the length of the user's input prompt and the number of generated tokens are dynamic. Static inference performance is therefore insufficient to reflect the inference engine's ability to handle such dynamic characteristics.

Therefore, it is necessary to use real dialogue data to evaluate the dynamic inference capabilities of the inference engine. This article introduces how to test the dynamic inference performance of LMDeploy on localhost.

The profiling script is `profile_throughput.py`. Before running it, please install the lmdeploy precompiled package, and download the profiling script and the test dataset:
```shell
pip install lmdeploy
git clone --depth=1 https://github.com/InternLM/lmdeploy
cd lmdeploy/benchmark
wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
```
A specific model needs to be specified for the performance test. We recommend converting the model into the turbomind format via `lmdeploy convert` before testing, because it is then convenient to adjust the inference engine's parameters for better performance, such as the batch size (`max_batch_size`) and the K/V cache size (`cache_max_entry_count`). For detailed explanations of these parameters, please refer to [here](../turbomind_config.md).
In the following sections, we assume the model is in turbomind format.
## Metrics
LMDeploy records performance metrics such as first token latency, token throughput (tokens/s) and request throughput (RPM).
Total time includes prefill time.
## Profile

In this section, we take [internlm/internlm-7b](https://huggingface.co/internlm/internlm-7b) as an example to show how to profile the inference engines of LMDeploy.

### Profile turbomind engine

```shell
cd lmdeploy/benchmark
python3 profile_throughput.py ./ShareGPT_V3_unfiltered_cleaned_split.json internlm/internlm-7b
```

If the model has been converted into the turbomind format, e.g. `lmdeploy convert internlm internlm/internlm-7b --dst-path ./internlm-7b`, pass the workspace path instead: `python3 profile_throughput.py ./ShareGPT_V3_unfiltered_cleaned_split.json ./internlm-7b`.

### Profile pytorch engine

```shell
cd lmdeploy/benchmark
python3 profile_throughput.py ./ShareGPT_V3_unfiltered_cleaned_split.json internlm/internlm-7b --backend pytorch
```

## Command details

The general usage is:

```shell
python3 profile_throughput.py <dataset> <model_path> <optional arguments>
```
The required parameters are:
- `dataset`
The path of the downloaded dataset
- `model_path`
A model repo on huggingface.co, or the path on localhost where the model (e.g. in turbomind format) is located.
Optional arguments are listed as below:
- `--concurrency`
It represents the number of request threads with default value 64. Requests of concurrent threads will be batched by the inference engine. Its value should not exceed `max_batch_size` in `config.ini`. Otherwise, the excess requests will wait in the inference queue.
- `--num-prompts`
The number of sampled prompts from dataset to process. The default is 2000.
- `--tp`
The number of GPUs used when the inference is in tensor parallel mode. It must be a power of 2. The default is 1.
- `--top_k`, `--top_p` and `--temperature`
They are used to sample the generated token_id.
- `--stream_output`
Indicator for streaming output. The default is `True`.
- `--csv`
The path of a csv file to save the result with default value `./profile_throughput.csv`
- `--log-level`
The log level. The default is `ERROR`.
- `--seed`
It is the seed used in sampling prompts from dataset with default value 0.
For detailed argument specification of `profile_throughput.py`, such as request concurrency, sampling parameters, k/v cache memory percentage and so on, please run the help command `python3 profile_throughput.py -h`.
# Profile Triton Inference Server

Triton Inference Server (TIS) is another serving method supported by LMDeploy besides `api_server`. Its performance testing methods and metrics are similar to those of [api_server](./profile_api_server.md).

The profiling script is `profile_serving.py`. Before running it, please install the lmdeploy precompiled package, and download the profiling script and the test dataset:
```shell
pip install 'lmdeploy[serve]'
git clone --depth=1 https://github.com/InternLM/lmdeploy
cd lmdeploy/benchmark
wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
```
A specific model needs to be specified for the performance test. We recommend converting the model into the turbomind format via `lmdeploy convert` before testing, because it is then convenient to adjust the inference engine's parameters for better performance, such as the batch size (`max_batch_size`) and the K/V cache size (`cache_max_entry_count`). For detailed explanations of these parameters, please refer to [here](../turbomind_config.md).
In the following sections, we assume the model is in turbomind format.
## Metrics
LMDeploy records performance metrics such as first token latency, token throughput (tokens/s) and request throughput (RPM).
Total time includes prefill time.
## Profile

In this section, we take [internlm/internlm-7b](https://huggingface.co/internlm/internlm-7b) as an example to show the benchmark procedure.

### Launch triton inference server

Before launching the server, the LLM model must be converted to the turbomind format in advance:

```shell
lmdeploy convert internlm internlm/internlm-7b --dst-path ./internlm-7b
```

Then, the triton inference server can be launched by:

```shell
bash ./internlm-7b/service_docker_up.sh
```

### Profile

Open another terminal and run the following command in the `lmdeploy/benchmark` directory:

```shell
python3 profile_serving.py 0.0.0.0:33337 ./internlm-7b/triton_models/tokenizer ./ShareGPT_V3_unfiltered_cleaned_split.json
```

## Command details

The general usage is:

```shell
python3 profile_serving.py <server_addr> <tokenizer_path> <dataset> <optional arguments>
```

The required parameters are:
- `server_addr`
The address of the triton inference server, in the format `{server_ip}:{server_port}`
- `tokenizer_path`
The path of the tokenizer model, which is used to encode the dataset to get the token size of prompts and responses
- `dataset`
The path of the downloaded dataset
Optional arguments are listed as below:
- `--concurrency`
It represents the number of request threads with default value 32. Requests of concurrent threads will be batched by the inference engine.
It is recommended that `concurrency` does not exceed the `max_batch_size` in `config.ini`, nor should it exceed the number of inference instances in `triton_models`.
Otherwise, the excess requests will wait in the inference queue.
The configuration item for the number of inference instances is `instance_group`, which is located in the file `{model_path}/triton_models/interactive/config.pbtxt`, and the default is 48.
- `--num-prompts`
The number of sampled prompts from the dataset to process. The default is 1000. 2000 is suggested when `concurrency >= 64`.
- `--top_k`, `--top_p` and `--temperature`
They are used to sample the generated token_id.
- `--stream_output`
Indicator for streaming output. The default is `True`.
- `--csv`
The path of a csv file to save the result, with default value `../profile_tis.csv`
- `--seed`
It is the seed used in sampling prompts from the dataset, with default value 0.
For detailed argument specification of `profile_serving.py`, such as request concurrency, sampling parameters and so on, please run the help command `python3 profile_serving.py -h`.
The docker image is `openmmlab/lmdeploy-builder:cuda11.8`. Make sure that docker is installed before using this image.
In the root directory of the lmdeploy source code, please run the following command:
```shell
# the home folder of lmdeploy source code
cd lmdeploy
bash builder/manywheel/build_all_wheel.sh
```
Then, follow the steps below to set up the compilation environment:
- build and install lmdeploy libraries:
```shell
# install ninja
apt install ninja-build
# the home folder of lmdeploy
cd lmdeploy
mkdir build && cd build
sh ../generate.sh
ninja -j$(nproc) && ninja install
```
extensions = [
'sphinx.ext.napoleon',
'sphinx.ext.viewcode',
'sphinx.ext.autosectionlabel',
'sphinx_tabs.tabs',
'sphinx_markdown_tables',
'myst_parser',
'sphinx_copybutton',
......
It's probably due to a low-version CUDA toolkit. The LMDeploy runtime requires a minimum CUDA version of 11.2.
## Turbomind Inference
### RuntimeError: \[TM\]\[ERROR\] CUDA runtime error: out of memory /workspace/lmdeploy/src/turbomind/utils/allocator.h
This is usually due to a disproportionately large memory ratio for the k/v cache, which is dictated by `TurbomindEngineConfig.cache_max_entry_count`.
The implications of this parameter have slight variations in different versions of lmdeploy. For specifics, please refer to the [detailed notes](https://github.com/InternLM/lmdeploy/blob/52419bd5b6fb419a5e3aaf3c3b4dea874b17e094/lmdeploy/messages.py#L107) in the source code.
If you encounter this issue while using the pipeline interface, please reduce `cache_max_entry_count` in `TurbomindEngineConfig` as follows:
```python
from lmdeploy import pipeline, TurbomindEngineConfig
backend_config = TurbomindEngineConfig(cache_max_entry_count=0.2)
pipe = pipeline('internlm/internlm2-chat-7b',
backend_config=backend_config)
response = pipe(['Hi, pls intro yourself', 'Shanghai is'])
print(response)
```
If OOM occurs when you run CLI tools, please pass `--cache-max-entry-count` to decrease the k/v cache memory ratio. For example:
```shell
# chat command
lmdeploy chat turbomind internlm/internlm2-chat-7b --cache-max-entry-count 0.2
# server command
lmdeploy serve api_server internlm/internlm2-chat-7b --cache-max-entry-count 0.2
```
## Serve
## Quantization
### RuntimeError: \[enforce fail at inline_container.cc:337\] . unexpected pos 4566829760 vs 4566829656
Please check your disk space. This error is due to insufficient disk space when saving weights, which might be encountered when quantizing the 70B model.
### ModuleNotFoundError: No module named 'flash_attn'
Quantizing `qwen` requires the installation of `flash-attn`. But based on feedback from community users, `flash-attn` can be challenging to install. Therefore, we have removed it from the lmdeploy dependencies and now recommend that users install it manually as needed.
Welcome to LMDeploy's tutorials!
====================================

You can switch between Chinese and English documents in the lower-left corner of the layout.

.. _get_started:
.. toctree::
   :maxdepth: 2
   :caption: Get Started

   get_started.md

.. _build:
.. toctree::
   :maxdepth: 1
   :caption: Build

   build.md

.. _benchmark:
.. toctree::
   :maxdepth: 1
   :caption: Benchmark

   benchmark/profile_generation.md
   benchmark/profile_throughput.md
   benchmark/profile_api_server.md
   benchmark/profile_triton_server.md
   benchmark/evaluate_with_opencompass.md

.. _supported_models:
.. toctree::
   :maxdepth: 1
   :caption: Supported Models

   supported_models/supported_models.md

.. _inference:
.. toctree::
   :maxdepth: 1
   :caption: Inference

   inference/pipeline.md
   inference/vl_pipeline.md

.. _serving:
.. toctree::
   :maxdepth: 1
   :caption: Serving

   serving/api_server.md
   serving/api_server_vl.md
   serving/gradio.md
   serving/proxy_server.md

.. _quantization:
.. toctree::
   :maxdepth: 1
   :caption: Quantization

   quantization/w4a16.md
   quantization/kv_int8.md
   quantization/w8a8.md

.. toctree::
   :maxdepth: 1
   :caption: Advanced Guide

   inference/turbomind.md
   inference/pytorch.md
   advance/pytorch_new_model.md
   advance/long_context.md
   advance/chat_template.md
   advance/debug_turbomind.md
   serving/qos.md

.. toctree::
   :maxdepth: 1
   :caption: API Reference

   api/pipeline.rst
Indices and tables
==================
......
# KV Cache Quantization and Test Results
For the LLaMa-7B fp16 model with a maximum length of 2048, the server requires approximately 1030MB of GPU memory to store kv_cache for each concurrent session created. This means that even an A100 80G can only serve a limited number of users.
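As a rough sanity check of that figure, the per-session KV cache size can be estimated from the commonly published LLaMa-7B shapes (the dimensions below are assumptions used only for this back-of-the-envelope calculation):

```python
# LLaMa-7B shapes assumed for illustration: 32 layers, 32 heads, head_dim 128, fp16 values.
num_layers, num_heads, head_dim = 32, 32, 128
seq_len, bytes_per_value = 2048, 2

# K and V are both cached for every layer, token and head.
kv_bytes = 2 * num_layers * seq_len * num_heads * head_dim * bytes_per_value
print(kv_bytes / 2**20)  # ~1024 MiB per concurrent session, in line with the ~1030MB above
```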
To reduce runtime GPU memory usage, we have implemented PTQ quantization for kv cache, using the following formula:
```bash
zp = (min+max) / 2
scale = (max-min) / 255
quant: q = round( (f-zp) / scale)
dequant: f = q * scale + zp
```
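For illustration, a minimal numpy sketch of this per-tensor formula (not the actual TurboMind kernel) looks like this:

```python
import numpy as np

f = np.random.randn(64, 128).astype(np.float32)   # stand-in for a kv cache tensor
fmin, fmax = f.min(), f.max()

zp = (fmin + fmax) / 2                             # zero point
scale = (fmax - fmin) / 255                        # step size over the int8 range
q = np.clip(np.round((f - zp) / scale), -128, 127).astype(np.int8)   # quant
f_hat = q.astype(np.float32) * scale + zp                            # dequant

print(np.abs(f - f_hat).max())  # reconstruction error is roughly bounded by scale / 2
```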
## How to Enable KV Cache INT8
### **Step One**
Convert the Hugging Face model format to the TurboMind inference format to create a workspace directory.
```bash
lmdeploy convert internlm-chat-7b /path/to/internlm-chat-7b
```
If you already have a workspace directory, skip this step.
### **Step Two**
Get the quantization parameters by these two steps:
```bash
# get minmax
# --calib_dataset: the calibration dataset, supporting c4, ptb, wikitext2 and pileval
# --calib_samples: the number of samples in the calibration set; reduce it if the memory is not enough
# --calib_seqlen: the length of a single text; reduce it if the memory is not enough
# --work_dir: the directory for saving the quantized statistical parameters and the quantized weights in Pytorch format
lmdeploy lite calibrate \
  --model $HF_MODEL \
  --calib_dataset 'c4' \
  --calib_samples 128 \
  --calib_seqlen 2048 \
  --work_dir $WORK_DIR

# get quant parameters
# --work_dir: the directory of the last output
# --turbomind_dir: the directory to save the quantization parameters
# --kv_sym: symmetric or asymmetric quantization, default is False
# --num_tp: the number of GPUs used for tensor parallelism, keep it consistent with deploy.py
lmdeploy lite kv_qparams \
  --work_dir $WORK_DIR \
  --turbomind_dir workspace/triton_models/weights/ \
  --kv_sym False \
  --num_tp 1
```
`kv_qparams` will generate fp32 scaling factors in the `weights` directory. The file format is a binary produced by `numpy.tofile`.
You can also first set `turbomind_dir` to a private directory, then copy the scaling factors into `workspace/triton_models/weights/`.
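To inspect the generated scaling factors, they can be read back with `numpy.fromfile`. The file name below is only a hypothetical example; check the actual names written to the weights directory:

```python
import numpy as np

# hypothetical file name for illustration; the real names depend on the model and tp setting
scale_file = "workspace/triton_models/weights/layers.0.past_kv_scale.0.weight"
scales = np.fromfile(scale_file, dtype=np.float32)  # fp32 values written by numpy.tofile
print(scales)
```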
### **Step Three**
Modify `workspace/triton_models/weights/config.ini`:
- Set `quant_policy` to 4, which enables kv_cache int8
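If you prefer to script the change, a small sketch using Python's `configparser` could look like the following; editing the file by hand works just as well:

```python
import configparser

path = "workspace/triton_models/weights/config.ini"
config = configparser.ConfigParser()
config.read(path)
config["llama"]["quant_policy"] = "4"   # 4 enables kv_cache int8
with open(path, "w") as f:
    config.write(f)
```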
### **Step Four**
Test the chat performance.
```bash
lmdeploy chat turbomind ./workspace
```
## GPU Memory Test
The test object is the [internlm-chat-7b](https://huggingface.co/internlm/internlm-chat-7b) model.
Testing method:
1. Use `deploy.py` to convert the model, modify the maximum concurrency in the `workspace` configuration; adjust the number of requests in `llama_config.ini`.
2. Compile and run `bin/llama_triton_example` to obtain the GPU memory situation of the fp16 version under different batch_size.
3. Enable quantization, re-run `bin/llama_triton_example` to obtain the GPU memory situation of the int8 version under different batch_size.
Below shows the comparison of GPU memory between the two versions:
| batch_size | fp16 memory(MiB) | int8 memory(MiB) | diff(MiB) |
| :--------: | :--------------: | :--------------: | :-------: |
| 8 | 22337 | 18241 | -4096 |
| 16 | 30593 | 22369 | -8224 |
| 32 | 47073 | 30625 | -16448 |
| 48 | 63553 | 38881 | -24672 |
Compared to directly quantizing Weight (such as [GPTQ-for-LLaMa](https://github.com/qwopqwop200/GPTQ-for-LLaMa/)), we have done a comparative estimation of memory growth in the 7B model for both methods, with some data from [llama.cpp](https://github.com/ggerganov/llama.cpp).
![](../../resources/batch_memory.png)
As can be seen, the fp16 version requires 1030MB of GPU memory for each concurrency, so quantizing kv_cache can significantly reduce the rate of increase of runtime memory.
## Accuracy Test
The test object is the [internlm-chat-7b](https://huggingface.co/internlm/internlm-chat-7b) instruction model.
Below are the results of PTQ quantization with the `kCacheKVInt8` method, using only 128 randomly selected samples from the c4 dataset. The accuracy was tested using [opencompass](https://github.com/InternLM/opencompass) before and after quantization.
| task | dataset | metric | int8 | fp16 | diff |
| :-----------: | :-------------: | :-----------: | :---: | :---: | :---: |
| Language | winogrande | accuracy | 60.77 | 61.48 | -0.71 |
| Knowledge | nq | score | 2.69 | 2.60 | +0.09 |
| Reasoning | gsm8k | accuracy | 33.28 | 34.72 | -1.44 |
| Reasoning | bbh | naive_average | 20.12 | 20.51 | -0.39 |
| Understanding | openbookqa_fact | accuracy | 82.40 | 82.20 | +0.20 |
| Understanding | eprstmt-dev | accuracy | 90.62 | 88.75 | +1.87 |
| Safety | crows_pairs | accuracy | 32.56 | 31.43 | +1.13 |
Note that both `kCacheKVInt8` and `WeightInt4` methods can be enabled at the same time.
# Load huggingface model directly
Starting from v0.1.0, Turbomind adds the ability to pre-process the model parameters on-the-fly while loading them from huggingface style models.
## Supported model type
Currently, Turbomind supports loading three types of models:
1. A lmdeploy-quantized model hosted on huggingface.co, such as [llama2-70b-4bit](https://huggingface.co/lmdeploy/llama2-chat-70b-4bit), [internlm-chat-20b-4bit](https://huggingface.co/internlm/internlm-chat-20b-4bit), etc.
2. Other LM models on huggingface.co like Qwen/Qwen-7B-Chat
3. A model converted by `lmdeploy convert` (the legacy format)
## Usage
### 1) A lmdeploy-quantized model
For models quantized by `lmdeploy.lite`, such as [llama2-70b-4bit](https://huggingface.co/lmdeploy/llama2-chat-70b-4bit), [internlm-chat-20b-4bit](https://huggingface.co/internlm/internlm-chat-20b-4bit), etc.:
```
repo_id=internlm/internlm-chat-20b-4bit
model_name=internlm-chat-20b
# or
# repo_id=/path/to/downloaded_model
# Inference by TurboMind
lmdeploy chat turbomind $repo_id --model-name $model_name
# Serving with gradio
lmdeploy serve gradio $repo_id --model-name $model_name
# Serving with Restful API
lmdeploy serve api_server $repo_id --model-name $model_name --instance_num 32 --tp 1
```
### 2) Other LM models
For other LM models, such as Qwen/Qwen-7B-Chat or baichuan-inc/Baichuan2-7B-Chat, the models supported by LMDeploy can be viewed through `lmdeploy list`.
```
repo_id=Qwen/Qwen-7B-Chat
model_name=qwen-7b
# or
# repo_id=/path/to/Qwen-7B-Chat/local_path
# Inference by TurboMind
lmdeploy chat turbomind $repo_id --model-name $model_name
# Serving with gradio
lmdeploy serve gradio $repo_id --model-name $model_name
# Serving with Restful API
lmdeploy serve api_server $repo_id --model-name $model_name --instance_num 32 --tp 1
```
### 3) A model converted by `lmdeploy convert`
The usage is similar to the previous cases:
```
# Convert a model
lmdeploy convert /path/to/model ./workspace --model-name MODEL_NAME
# Inference by TurboMind
lmdeploy chat turbomind ./workspace
# Serving with gradio
lmdeploy serve gradio ./workspace
# Serving with Restful API
lmdeploy serve api_server ./workspace --instance_num 32 --tp 1
```
# Pytorch
## Chat in command line
LMDeploy supports chatting with PyTorch models through the submodule `lmdeploy.pytorch.chat`.
This submodule allows users to chat with a language model through the command line, and optionally accelerate the model using backends like deepspeed.
**Example 1**: Chat with default setting
```shell
lmdeploy chat torch $PATH_TO_HF_MODEL
```
**Example 2**: Disable sampling and chat history
```shell
lmdeploy chat torch \
$PATH_TO_LLAMA_MODEL_IN_HF_FORMAT \
--temperature 0 --max-history 0
```
**Example 3**: Accelerate with deepspeed inference
```shell
lmdeploy chat torch \
$PATH_TO_LLAMA_MODEL_IN_HF_FORMAT \
--accel deepspeed
```
Note: to use deepspeed, you need to install deepspeed first. If you hope to accelerate InternLM, you need a customized version <https://github.com/wangruohui/DeepSpeed/tree/support_internlm_0.10.0>.
**Example 4**: Tensor parallel the model on 2 GPUs
```shell
deepspeed --module --num_gpus 2 lmdeploy.pytorch.chat \
$PATH_TO_LLAMA_MODEL_IN_HF_FORMAT \
    --accel deepspeed
```
This module also allows the following control commands to change generation behaviors during chat.
- `exit`: terminate and exit chat
- `config set key=value`: change generation config `key` to `value`, e.g. `config set temperature=0` disables sampling for the following chats
- `clear`: clear chat history
### Simple diagram of components
```mermaid
graph LR;
subgraph model specific adapter
p((user_input))-->tokenize-->id((input_ids))-->decorate
tmpl_ids((template_ids))-->decorate;
end
subgraph generate
model[CausalLM_model.generate]-->gen_result(("gen_result"))
gen_result-->hid
gen_result-->attn((attention))
end
subgraph streamer
model-->s[streamer]--value-->decode_single--token-->output
end
subgraph session_manager
prepend_history-->fullid((complete_ids));
trim-->prepend_history
end
decorate-->prepend_history
hid((history_ids))-->trim;
attn-->trim;
fullid-->model
tokenizer((tokenizer))-->decode_single
tokenizer-->tokenize
p-->genconfig(GenConfig)-->model
```
# Restful API
### Launch Service
The user can open the http url printed by the following command in a browser.
- **Please check the http url for the detailed api usage!!!**
- **Please check the http url for the detailed api usage!!!**
- **Please check the http url for the detailed api usage!!!**
```shell
lmdeploy serve api_server ./workspace --server_name 0.0.0.0 --server_port ${server_port} --instance_num 64 --tp 1
```
We provide some RESTful APIs. Three of them are in OpenAI format.
- /v1/chat/completions
- /v1/models
- /v1/completions
However, we recommend users try
our own api `/v1/chat/interactive` which provides more arguments for users to modify. The performance is comparatively better.
**Note**: if you want to launch multiple requests, please set different `session_id` values for the `/v1/chat/completions` and `/v1/chat/interactive` apis. Otherwise, random values will be assigned.
### python
We have integrated the client-side functionalities of these services into the `APIClient` class. Below are some examples demonstrating how to invoke the `api_server` service on the client side.
If you want to use the `/v1/chat/completions` endpoint, you can try the following code:
```python
from lmdeploy.serve.openai.api_client import APIClient
api_client = APIClient('http://{server_ip}:{server_port}')
model_name = api_client.available_models[0]
messages = [{"role": "user", "content": "Say this is a test!"}]
for item in api_client.chat_completions_v1(model=model_name, messages=messages):
print(item)
```
If you want to use the `/v1/completions` endpoint, you can try the following code:
```python
from lmdeploy.serve.openai.api_client import APIClient
api_client = APIClient('http://{server_ip}:{server_port}')
model_name = api_client.available_models[0]
for item in api_client.completions_v1(model=model_name, prompt='hi'):
print(item)
```
Lmdeploy supports maintaining session histories on the server for `/v1/chat/interactive` api. We disable the
feature by default.
- On interactive mode, the chat history is kept on the server. In a multiple rounds of conversation, you should set
`interactive_mode = True` and the same `session_id` (can't be -1, it's the default number) to `/v1/chat/interactive` for requests.
- On normal mode, no chat history is kept on the server.
The interactive mode can be controlled by the `interactive_mode` boolean parameter. The following is an example of normal mode. If you want to experience the interactive mode, simply pass in `interactive_mode=True`.
```python
from lmdeploy.serve.openai.api_client import APIClient
api_client = APIClient('http://{server_ip}:{server_port}')
for item in api_client.chat_interactive_v1(prompt='hi'):
print(item)
```
### Java/Golang/Rust
May use [openapi-generator-cli](https://github.com/OpenAPITools/openapi-generator-cli) to convert `http://{server_ip}:{server_port}/openapi.json` to java/rust/golang client.
Here is an example:
```shell
$ docker run -it --rm -v ${PWD}:/local openapitools/openapi-generator-cli generate -i /local/openapi.json -g rust -o /local/rust
$ ls rust/*
rust/Cargo.toml rust/git_push.sh rust/README.md
rust/docs:
ChatCompletionRequest.md EmbeddingsRequest.md HttpValidationError.md LocationInner.md Prompt.md
DefaultApi.md GenerateRequest.md Input.md Messages.md ValidationError.md
rust/src:
apis lib.rs models
```
### cURL
cURL is a tool for observing the output of the api.
List Models:
```bash
curl http://{server_ip}:{server_port}/v1/models
```
Interactive Chat:
```bash
curl http://{server_ip}:{server_port}/v1/chat/interactive \
-H "Content-Type: application/json" \
-d '{
"prompt": "Hello! How are you?",
"session_id": 1,
"interactive_mode": true
}'
```
Chat Completions:
```bash
curl http://{server_ip}:{server_port}/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "internlm-chat-7b",
"messages": [{"role": "user", "content": "Hello! How are you?"}]
}'
```
Text Completions:
```shell
curl http://{server_ip}:{server_port}/v1/completions \
-H 'Content-Type: application/json' \
-d '{
"model": "llama",
"prompt": "two steps to build a house:"
}'
```
### CLI client
There is a client script for restful api server.
```shell
# restful_api_url is what printed in api_server.py, e.g. http://localhost:23333
lmdeploy serve api_client api_server_url
```
### webui
You can also test restful-api through webui.
```shell
# api_server_url is what printed in api_server.py, e.g. http://localhost:23333
# server_ip and server_port here are for gradio ui
# example: lmdeploy serve gradio http://localhost:23333 --server_name localhost --server_port 6006
lmdeploy serve gradio api_server_url --server_name ${gradio_ui_ip} --server_port ${gradio_ui_port}
```
### FAQ
1. When user got `"finish_reason":"length"`, it means the session is too long to be continued. The session length can be
modified by passing `--session_len` to api_server.
2. When OOM appears at the server side, please reduce the `instance_num` when launching the service.
3. When the request with the same `session_id` to `/v1/chat/interactive` gets an empty return value and a negative `tokens`, please consider setting `interactive_mode=false` to restart the session.
4. The `/v1/chat/interactive` api disables engaging in multiple rounds of conversation by default. The input argument `prompt` consists of either single strings or entire chat histories.
# Serving a model
## Serving [LLaMA-2](https://github.com/facebookresearch/llama)
You can download [llama-2 models from huggingface](https://huggingface.co/meta-llama) and serve them like below:
<details open>
<summary><b>7B</b></summary>
```shell
lmdeploy convert llama2 /path/to/llama-2-7b-chat-hf
bash workspace/service_docker_up.sh
```
</details>
<details open>
<summary><b>13B</b></summary>
```shell
lmdeploy convert llama2 /path/to/llama-2-13b-chat-hf --tp 2
bash workspace/service_docker_up.sh
```
</details>
<details open>
<summary><b>70B</b></summary>
```shell
lmdeploy convert llama2 /path/to/llama-2-70b-chat-hf --tp 8
bash workspace/service_docker_up.sh
```
</details>
## Serving [LLaMA](https://github.com/facebookresearch/llama)
Weights for the LLaMA models can be obtained by filling out [this form](https://docs.google.com/forms/d/e/1FAIpQLSfqNECQnMkycAp2jP4Z9TFX0cGR4uf7b_fBxjY_OjhJILlKGA/viewform)
<details open>
<summary><b>7B</b></summary>
```shell
lmdeploy convert llama /path/to/llama-7b llama \
--tokenizer_path /path/to/tokenizer/model
bash workspace/service_docker_up.sh
```
</details>
<details open>
<summary><b>13B</b></summary>
```shell
lmdeploy convert llama /path/to/llama-13b llama \
--tokenizer_path /path/to/tokenizer/model --tp 2
bash workspace/service_docker_up.sh
```
</details>
<details open>
<summary><b>30B</b></summary>
```shell
lmdeploy convert llama /path/to/llama-30b llama \
--tokenizer_path /path/to/tokenizer/model --tp 4
bash workspace/service_docker_up.sh
```
</details>
<details open>
<summary><b>65B</b></summary>
```shell
lmdeploy convert llama /path/to/llama-65b llama \
--tokenizer_path /path/to/tokenizer/model --tp 8
bash workspace/service_docker_up.sh
```
</details>
### Serving [Vicuna](https://lmsys.org/blog/2023-03-30-vicuna/)
<details open>
<summary><b>7B</b></summary>
```shell
python3 -m pip install fschat
python3 -m fastchat.model.apply_delta \
--base-model-path /path/to/llama-7b \
--target-model-path /path/to/vicuna-7b \
--delta-path lmsys/vicuna-7b-delta-v1.1
lmdeploy convert vicuna /path/to/vicuna-7b
bash workspace/service_docker_up.sh
```
</details>
<details open>
<summary><b>13B</b></summary>
```shell
python3 -m pip install fschat
python3 -m fastchat.model.apply_delta \
--base-model-path /path/to/llama-13b \
--target-model-path /path/to/vicuna-13b \
--delta-path lmsys/vicuna-13b-delta-v1.1
lmdeploy convert vicuna /path/to/vicuna-13b
bash workspace/service_docker_up.sh
```
</details>
### Chat
```
lmdeploy chat turbomind ./workspace --cap chat --meta-instruct "Provide answers in Python"
```
The `--meta-instruct` instruction can be changed to other coding languages as long as codellama supports them.
### Python specialist
Launch inference server by:
```shell
# --tp: the number of GPUs used in tensor parallelism
lmdeploy serve api_server ./workspace --server-name ${server_ip} --server-port ${server_port} --tp 1
```
Then, you can communicate with it by command line,
or through webui after launching gradio,
```shell
# api_server_url is what printed in api_server.py, e.g. http://localhost:23333
# server_ip and server_port here are for gradio ui
# example: lmdeploy serve gradio http://localhost:23333 --server-name localhost --server-port 6006
lmdeploy serve gradio api_server_url --server-name ${gradio_ui_ip} --server-port ${gradio_ui_port}
```
Regarding the detailed information of the RESTful API, you can refer to the [guide](../serving/api_server.md).
# Architecture of TurboMind
TurboMind is an inference engine that supports high throughput inference for conversational LLMs. It's based on NVIDIA's [FasterTransformer](https://github.com/NVIDIA/FasterTransformer). Major features of TurboMind include an efficient LLaMa implementation, the persistent batch inference model and an extendable KV cache manager.
## High level overview of TurboMind
```
+--------------------+
| API |
+--------------------+
| ^
request | | stream callback
v |
+--------------------+ fetch +-------------------+
| Persistent Batch | <-------> | KV Cache Manager |
+--------------------+ update +-------------------+
^
|
v
+------------------------+
| LLaMA implementation |
+------------------------+
| FT kernels & utilities |
+------------------------+
```
## Persistent Batch
You may recognize this feature as "continuous batching" in other repos. But during the concurrent development of the feature, we modeled the inference of a conversational LLM as a persistently running batch whose lifetime spans the entire serving process, hence the name "persistent batch". To put it simply (a toy sketch follows the list below):
- The persistent batch has N pre-configured batch slots.
- Requests join the batch when there are free slots available. A batch slot is released and can be reused once the generation of the requested tokens is finished.
- __On cache-hits (see below), history tokens don't need to be decoded in every round of a conversation; generation of response tokens will start instantly.__
- The batch grows or shrinks automatically to minimize unnecessary computations.
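A toy Python sketch of the slot-based behavior described above (purely illustrative; not TurboMind's actual data structures or API):

```python
class PersistentBatch:
    """Toy model of persistent batching: N slots live for the whole serving process."""

    def __init__(self, num_slots):
        self.slots = [None] * num_slots   # None means a free slot
        self.queue = []                   # requests waiting for a free slot

    def submit(self, request):
        for i, slot in enumerate(self.slots):
            if slot is None:
                self.slots[i] = request   # join the running batch immediately
                return
        self.queue.append(request)        # otherwise wait in the queue

    def step(self):
        """One decoding step over all occupied slots; finished requests free their slot."""
        for i, request in enumerate(self.slots):
            if request is None:
                continue
            request["generated"] += 1
            if request["generated"] >= request["max_new_tokens"]:
                self.slots[i] = self.queue.pop(0) if self.queue else None


batch = PersistentBatch(num_slots=2)
batch.submit({"generated": 0, "max_new_tokens": 1})
batch.submit({"generated": 0, "max_new_tokens": 3})
batch.submit({"generated": 0, "max_new_tokens": 2})  # waits until a slot is released
for _ in range(5):
    batch.step()
```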
## KV Cache Manager
The [KV cache manager](https://github.com/InternLM/lmdeploy/blob/main/src/turbomind/models/llama/SequenceManager.h) of TurboMind is a memory-pool-like object that also implements an LRU policy, so it can be viewed as a form of __cache of KV caches__. It works in the following way (a simplified sketch follows the list):
- All device memory required for KV cache is allocated by the manager. A fixed number of slots is pre-configured to match the memory size of the system. Each slot corresponds to the memory required by the KV cache of a single sequence. The allocation chunk-size can be configured to implement a pre-allocate/on-demand style allocation policy (or something in-between).
- When space for the KV cache of a new sequence is requested but no free slots left in the pool, the least recently used sequence is evicted from the cache and its device memory is directly reused by the new sequence. However, this is not the end of the story.
- Fetching a sequence that currently resides in one of the slots resembles a _cache-hit_; the history KV cache is returned directly and no context decoding is needed.
- Victim (evicted) sequences are not erased entirely but converted to their most compact form, i.e. token IDs. When the same sequence id is fetched later (_cache-miss_), the token IDs will be decoded by the FMHA-backed context decoder and converted back to KV cache.
- The eviction and conversion are handled automatically inside TurboMind and are thus transparent to users. __From the user's perspective, systems that use TurboMind have access to infinite device memory.__
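A simplified Python sketch of such an LRU "cache of KV caches" (illustrative only; the real manager works on device memory blocks rather than Python objects):

```python
from collections import OrderedDict


class KVCacheManager:
    """Toy LRU cache of per-sequence KV caches; evicted sequences keep only their token IDs."""

    def __init__(self, num_slots):
        self.num_slots = num_slots
        self.active = OrderedDict()   # seq_id -> kv cache, most recently used last
        self.evicted = {}             # seq_id -> token ids (the compact form)

    def fetch(self, seq_id, new_token_ids):
        if seq_id in self.active:                   # cache-hit: reuse the history KV cache
            self.active.move_to_end(seq_id)
        else:
            if len(self.active) >= self.num_slots:  # no free slot: evict the LRU sequence
                victim, victim_kv = self.active.popitem(last=False)
                self.evicted[victim] = victim_kv["token_ids"]
            history = self.evicted.pop(seq_id, [])  # cache-miss: history must be re-decoded
            self.active[seq_id] = {"token_ids": list(history)}
        self.active[seq_id]["token_ids"] += list(new_token_ids)
        return self.active[seq_id]


mgr = KVCacheManager(num_slots=2)
mgr.fetch("s1", [1, 2, 3])
mgr.fetch("s2", [4, 5])
mgr.fetch("s3", [6])   # evicts s1, keeping only its token ids
mgr.fetch("s1", [7])   # cache-miss: s1 is rebuilt from its token ids
```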
## LLaMa implementation
Our implementation of the LLaMa family models is modified from the Gpt-NeoX model in FasterTransformer. In addition to basic refactoring and modifications to support the LLaMa family, we made some improvements to enable high performance inference of conversational models, most importantly:
- To support fast context decoding in multi-round conversations, we replaced the attention implementation in the context decoder with a [cutlass](https://github.com/NVIDIA/cutlass)-based FMHA implementation that supports mismatched Q/K lengths.
- We introduced indirect buffer pointers in both context FMHA and generation FMHA to support the discontinuity in KV cache within the batch.
- To support concurrent inference with persistent batch, a new synchronization mechanism was designed to orchestrate the worker threads running in tensor parallel mode.
- To maximize the throughput, we implement INT8 KV cache support to increase the max batch size. It's effective because in real-world serving scenarios, KV cache costs more memory and consumes more memory bandwidth than weights or other activations.
- We resolved an NCCL hang issue when running multiple model instances in TP mode within a single process, NCCL APIs are now guarded by host-side synchronization barriers.
## API
TurboMind supports a Python API that enables streaming output and tensor parallel mode.
The ability to use [tritonserver](https://github.com/triton-inference-server/server) for serving is also inherited from FasterTransformer. However, to support submitting concurrent requests into our persistent batch model, we no longer use sequence batching or dynamic batching as FasterTransformer does. The bookkeeping of request and sequence states is managed by TurboMind instead.
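For reference, a minimal way to drive TurboMind from Python in recent lmdeploy releases is the high-level `pipeline` interface; the exact arguments may differ between versions, and `./workspace` below is assumed to be a converted TurboMind model folder.
```python
from lmdeploy import pipeline

# './workspace' is the default output folder of `lmdeploy convert`;
# a HuggingFace model id can usually be passed instead.
pipe = pipeline('./workspace')
print(pipe(['Hi, please introduce yourself']))
```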
## Difference between FasterTransformer and TurboMind
Apart from the features described above, there are still many minor differences that we don't cover in this document. Notably, many capabilities of FT were dropped in TurboMind because of the difference in objectives (e.g. prefix prompt, beam search, context embedding, sparse GEMM, GPT/T5/other model families, etc.).
## FAQ
### Supporting Huggingface models
For historical reasons, TurboMind's weight layout is based on [the original LLaMa implementation](https://github.com/facebookresearch/llama) (they differ only by a transpose). The implementation in huggingface transformers uses a [different layout](https://github.com/huggingface/transformers/blob/45025d92f815675e483f32812caa28cce3a960e7/src/transformers/models/llama/convert_llama_weights_to_hf.py#L123C76-L123C76) for `W_q` and `W_k`, which is handled in [deploy.py](https://github.com/InternLM/lmdeploy/blob/ff4648a1d09e5aec74cf70efef35bfaeeac552e0/lmdeploy/serve/turbomind/deploy.py#L398).
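For reference, the layout shuffle applied by the huggingface conversion script looks roughly like the sketch below (adapted from the linked `convert_llama_weights_to_hf.py`; the 4096/32 defaults correspond to the 7B model, and the exact code may differ between transformers versions). `deploy.py` has to undo this permutation, plus the transpose mentioned above, to recover the layout that TurboMind expects.
```python
import torch

def hf_permute(w: torch.Tensor, n_heads: int = 32,
               dim1: int = 4096, dim2: int = 4096) -> torch.Tensor:
    """Re-interleave the rotary halves of W_q / W_k the way HF transformers stores them."""
    return w.view(n_heads, dim1 // n_heads // 2, 2, dim2).transpose(1, 2).reshape(dim1, dim2)

w_q = torch.randn(4096, 4096)   # a dummy 7B-sized projection weight
w_q_hf = hf_permute(w_q)
```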
# TurboMind Config
TurboMind is one of the inference engines of LMDeploy. To use it for model inference, you need to convert the input model into a TurboMind model. Besides the model weight files, the TurboMind model folder also contains several other files, the most important of which is the configuration file `triton_models/weights/config.ini`, as it is closely related to inference performance.
If you are using LMDeploy version 0.0.x, please refer to the [turbomind 1.0 config](#turbomind-10-config) section to learn the relevant content in the configuration. Otherwise, please read [turbomind 2.0 config](#turbomind-20-config) to familiarize yourself with the configuration details.
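Since `config.ini` is a plain INI file, it can be inspected or patched with Python's standard `configparser`. A minimal sketch, assuming the default `./workspace` output folder of `lmdeploy convert`:
```python
import configparser

cfg_path = './workspace/triton_models/weights/config.ini'
cfg = configparser.ConfigParser()
cfg.read(cfg_path)

print(dict(cfg['llama']))               # show the current key/value pairs

cfg['llama']['max_batch_size'] = '128'  # example: raise the max batch size
with open(cfg_path, 'w') as f:
    cfg.write(f)
```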
## TurboMind 2.0 config
Take the `llama-2-7b-chat` model as an example. In TurboMind 2.0, its config.ini content is as follows:
```toml
[llama]
model_name = llama2
tensor_para_size = 1
head_num = 32
kv_head_num = 32
vocab_size = 32000
num_layer = 32
inter_size = 11008
norm_eps = 1e-06
attn_bias = 0
start_id = 1
end_id = 2
session_len = 4104
weight_type = fp16
rotary_embedding = 128
rope_theta = 10000.0
size_per_head = 128
group_size = 0
max_batch_size = 64
max_context_token_num = 1
step_length = 1
cache_max_entry_count = 0.5
cache_block_seq_len = 128
cache_chunk_size = 1
use_context_fmha = 1
quant_policy = 0
max_position_embeddings = 2048
rope_scaling_factor = 0.0
use_logn_attn = 0
```
These parameters are composed of model attributes and inference parameters. Model attributes include the number of layers, the number of heads, dimensions, etc., and they are **not modifiable**.
```toml
model_name = llama2
head_num = 32
kv_head_num = 32
vocab_size = 32000
num_layer = 32
inter_size = 11008
norm_eps = 1e-06
attn_bias = 0
start_id = 1
end_id = 2
rotary_embedding = 128
rope_theta = 10000.0
size_per_head = 128
```
Compared to TurboMind 1.0, the model-attribute part of the config remains the same, while the inference parameters have changed.
In the following sections, we will focus on the inference parameters.
### data type
`weight_type` and `group_size` are the relevant parameters, **which cannot be modified**.
`weight_type` represents the data type of the weights. Currently, `fp16` and `int4` are supported, where `int4` denotes 4-bit weights. When `weight_type` is `int4`, `group_size` is the group size used when quantizing weights with `awq`. The LMDeploy prebuilt package includes kernels with `group_size = 128`.
### batch size
The maximum batch size is still set through `max_batch_size`, but its default value has been changed from 32 to 64, and `max_batch_size` is no longer tied to `cache_max_entry_count`.
### k/v cache size
k/v cache memory is determined by `cache_block_seq_len` and `cache_max_entry_count`.
TurboMind 2.0 has implemented Paged Attention, managing the k/v cache in blocks.
`cache_block_seq_len` represents the number of tokens in a k/v block, with a default value of 128. TurboMind calculates the memory size of a k/v block according to the following formula:
```
cache_block_seq_len * num_layer * kv_head_num * size_per_head * 2 * sizeof(kv_data_type)
```
For the llama2-7b model, when storing k/v as the `half` type, the memory of a k/v block is: `128 * 32 * 32 * 128 * 2 * sizeof(half) = 64MB`
The meaning of `cache_max_entry_count` varies depending on its value:
- When it's a decimal in the range (0, 1), `cache_max_entry_count` represents the percentage of GPU memory used by k/v blocks. For example, if TurboMind launches on an A100-80G GPU with `cache_max_entry_count` set to `0.5`, the total memory used by the k/v blocks is `80 * 0.5 = 40 GB`.
- When it's an integer > 0, it represents the total number of k/v blocks.
The `cache_chunk_size` indicates the size of the k/v cache chunk to be allocated each time new k/v cache blocks are needed. Different values represent different meanings:
- When it is an integer > 0, `cache_chunk_size` number of k/v cache blocks are allocated.
- When the value is -1, `cache_max_entry_count` number of k/v cache blocks are allocated.
- When the value is 0, `sqrt(cache_max_entry_count)` number of k/v cache blocks are allocated.
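Putting the formula and the two knobs together, a quick back-of-the-envelope calculation using the llama2-7b attributes from the config above and an 80 GB card:
```python
def kv_block_bytes(cache_block_seq_len=128, num_layer=32, kv_head_num=32,
                   size_per_head=128, kv_bytes=2):
    # k and v (factor 2) for every layer, for one block of cache_block_seq_len tokens
    return cache_block_seq_len * num_layer * kv_head_num * size_per_head * 2 * kv_bytes

block = kv_block_bytes()
print(block / 2**20, 'MB per k/v block')                 # 64.0 MB for llama2-7b in fp16

cache_max_entry_count = 0.5                              # fraction of GPU memory given to k/v blocks
budget = 80 * 2**30 * cache_max_entry_count
print(int(budget // block), 'blocks fit in the budget')  # 640 blocks
```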
### kv int8 switch
To enable 8-bit k/v inference, set `quant_policy = 4`. Please refer to [kv int8](./kv_int8.md) for a guide.
### long context switch
Setting `rope_scaling_factor = 1.0` enables the Dynamic NTK option of RoPE, which allows the model to handle long input and output texts.
Regarding the principle of Dynamic NTK, please refer to:
1. https://www.reddit.com/r/LocalLLaMA/comments/14mrgpr/dynamically_scaled_rope_further_increases
2. https://kexue.fm/archives/9675
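For intuition, Dynamic NTK enlarges the RoPE base as the sequence grows beyond the trained context length. The sketch below follows the formulation used in huggingface transformers' dynamic-NTK rotary embedding and is shown only for reference; TurboMind's kernel takes `rope_theta`, `max_position_embeddings` and `rope_scaling_factor` from the config, and its in-kernel math may differ in details.
```python
def dynamic_ntk_base(seq_len, base=10000.0, max_position_embeddings=2048,
                     scaling_factor=1.0, dim=128):
    # The base is only rescaled once the sequence exceeds the trained context length.
    if seq_len <= max_position_embeddings:
        return base
    scale = scaling_factor * seq_len / max_position_embeddings - (scaling_factor - 1)
    return base * scale ** (dim / (dim - 2))

print(dynamic_ntk_base(8192))   # a larger base stretches RoPE to cover the longer context
```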
You can also turn on [LogN attention scaling](https://kexue.fm/archives/8823) by setting `use_logn_attn = 1`.
## TurboMind 1.0 config
Taking the `llama-2-7b-chat` model as an example, in TurboMind 1.0, its `config.ini` content is as follows:
```toml
[llama]
model_name = llama2
tensor_para_size = 1
head_num = 32
kv_head_num = 32
vocab_size = 32000
num_layer = 32
inter_size = 11008
norm_eps = 1e-06
attn_bias = 0
start_id = 1
end_id = 2
session_len = 4104
weight_type = fp16
rotary_embedding = 128
rope_theta = 10000.0
size_per_head = 128
group_size = 0
max_batch_size = 32
max_context_token_num = 4
step_length = 1
cache_max_entry_count = 48
cache_chunk_size = 1
use_context_fmha = 1
quant_policy = 0
max_position_embeddings = 2048
use_dynamic_ntk = 0
use_logn_attn = 0
```
These parameters are composed of model attributes and inference parameters. Model attributes include the number of layers, the number of heads, dimensions, etc., and they are **not modifiable**.
```toml
model_name = llama2
head_num = 32
kv_head_num = 32
vocab_size = 32000
num_layer = 32
inter_size = 11008
norm_eps = 1e-06
attn_bias = 0
start_id = 1
end_id = 2
rotary_embedding = 128
rope_theta = 10000.0
size_per_head = 128
```
In the following sections, we will focus on introducing the inference parameters.
### data type
`weight_type` and `group_size` are the relevant parameters, **which cannot be modified**.
`weight_type` represents the data type of the weights. Currently, `fp16` and `int4` are supported, where `int4` denotes 4-bit weights. When `weight_type` is `int4`, `group_size` is the group size used when quantizing weights with `awq`. The LMDeploy prebuilt package includes kernels with `group_size = 128`.
### batch size
`max_batch_size` determines the maximum batch size during inference. In general, the larger the batch size, the higher the throughput. But make sure that `max_batch_size <= cache_max_entry_count`.
### k/v cache size
TurboMind allocates k/v cache memory based on `session_len`, `cache_chunk_size`, and `cache_max_entry_count`.
- `session_len` denotes the maximum length of a sequence, i.e., the size of the context window.
- `cache_chunk_size` indicates the number of k/v cache sequences to allocate each time new sequences are added.
- `cache_max_entry_count` signifies the maximum number of k/v sequences that can be cached.
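Since each cached sequence reserves k/v space for up to `session_len` tokens, its footprint can be estimated the same way as the block formula in the 2.0 section. A sketch assuming fp16 k/v and the llama2-7b attributes above:
```python
def kv_seq_bytes(session_len=4104, num_layer=32, kv_head_num=32,
                 size_per_head=128, kv_bytes=2):
    # k and v (factor 2) for every layer, for a full session_len-token sequence
    return session_len * num_layer * kv_head_num * size_per_head * 2 * kv_bytes

print(kv_seq_bytes() / 2**30, 'GB per cached sequence')   # ~2 GB for llama2-7b in fp16
# cache_max_entry_count caps how many such sequences may be cached at once,
# and memory is claimed in cache_chunk_size-sized increments as sequences arrive.
```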
### kv int8 switch
To enable 8-bit k/v inference, set `quant_policy = 4` and `use_context_fmha = 0`. Please refer to [kv int8](./kv_int8.md) for a guide.
### long context switch
Setting `use_dynamic_ntk = 1` enables the Dynamic NTK option of RoPE, which allows the model to handle long input and output texts.
Regarding the principle of Dynamic NTK, please refer to:
1. https://www.reddit.com/r/LocalLLaMA/comments/14mrgpr/dynamically_scaled_rope_further_increases
2. https://kexue.fm/archives/9675
You can also turn on [LogN attention scaling](https://kexue.fm/archives/8823) by setting `use_logn_attn = 1`.
# W4A16 LLM Model Deployment
LMDeploy supports inference of LLMs with 4-bit weights. The minimum requirement is an NVIDIA GPU with compute capability sm80 or higher, such as the A10, A100, or GeForce 30/40 series.
Before proceeding with the inference, please ensure that lmdeploy is installed.
```shell
pip install lmdeploy[all]
```
## 4-bit LLM model Inference
You can download the pre-quantized 4-bit weight models from LMDeploy's [model zoo](https://huggingface.co/lmdeploy) and conduct inference using the following command.
Alternatively, you can quantize 16-bit weights to 4-bit weights following the ["4-bit Weight Quantization"](#4-bit-weight-quantization) section, and then perform inference as per the below instructions.
Take the 4-bit Llama-2-chat-7B model from the model zoo as an example:
```shell
git-lfs install
git clone https://huggingface.co/lmdeploy/llama2-chat-7b-w4
```
As demonstrated in the commands below, first convert the model's layout with `lmdeploy convert`, and then you can interact with the AI assistant in the terminal:
```shell
## Convert the model's layout and store it in the default path, ./workspace.
lmdeploy convert \
--model-name llama2 \
--model-path ./llama2-chat-7b-w4 \
--model-format awq \
--group-size 128
## inference
lmdeploy chat turbomind ./workspace
```
## Serve with gradio
If you wish to interact with the model via a web UI, please launch the gradio server as shown below:
```shell
lmdeploy serve gradio ./workspace --server_name {ip_addr} --server_port {port}
```
Subsequently, you can open the website `http://{ip_addr}:{port}` in your browser and interact with the model.
## Inference Performance
We benchmarked the Llama-2-7B-chat and Llama-2-13B-chat models with 4-bit quantization on an NVIDIA GeForce RTX 4090 using [profile_generation.py](https://github.com/InternLM/lmdeploy/blob/main/benchmark/profile_generation.py). We measure the token generation throughput (tokens/s) with a single prompt token and 512 generated tokens. All results are measured with single-batch inference.
| model | llm-awq | mlc-llm | turbomind |
| ---------------- | ------- | ------- | --------- |
| Llama-2-7B-chat | 112.9 | 159.4 | 206.4 |
| Llama-2-13B-chat | N/A | 90.7 | 115.8 |
Memory (GB) comparison between the 4-bit and 16-bit models with context sizes of 2048 and 4096, respectively:
| model | 16bit(2048) | 4bit(2048) | 16bit(4096) | 4bit(4096) |
| ---------------- | ----------- | ---------- | ----------- | ---------- |
| Llama-2-7B-chat | 15.1 | 6.3 | 16.2 | 7.5 |
| Llama-2-13B-chat | OOM | 10.3 | OOM | 12.0 |
Before reproducing the memory benchmark, install `nvidia-ml-py`, which the benchmark script needs for collecting GPU memory statistics:
```shell
pip install nvidia-ml-py
```
Then run the profiling script against the converted 4-bit model:
```shell
python benchmark/profile_generation.py \
--model-path ./workspace \
--concurrency 1 8 --prompt-tokens 1 512 --completion-tokens 2048 512
```
## 4-bit Weight Quantization
The quantization includes two steps:
- generate the quantization parameters
- quantize the model according to the parameters
### Step 1: Generate Quantization Parameter
```shell
# --calib_dataset: calibration dataset; c4, ptb, wikitext2 and pileval are supported
# --calib_samples: number of samples in the calibration set; reduce it if memory is insufficient
# --calib_seqlen:  length of a single piece of text; reduce it if memory is insufficient
# --work_dir:      folder storing the Pytorch-format quantization statistics and post-quantization weights
lmdeploy lite calibrate \
  --model $HF_MODEL \
  --calib_dataset 'c4' \
  --calib_samples 128 \
  --calib_seqlen 2048 \
  --work_dir $WORK_DIR
```
### Step2: Quantize Weights
LMDeploy employs AWQ algorithm for model weight quantization.
```shell
# --w_bits:       bit width for weight quantization
# --w_group_size: group size for weight quantization statistics
# --work_dir:     directory holding the quantization parameters from Step 1
lmdeploy lite auto_awq \
  --model $HF_MODEL \
  --w_bits 4 \
  --w_group_size 128 \
  --work_dir $WORK_DIR
```
After the quantization is complete, the quantized model is saved to `$WORK_DIR`. Then you can proceed with model inference according to the instructions in the ["4-Bit Weight Model Inference"](#4-bit-llm-model-inference) section.
.header-logo {
  background-image: url("../image/lmdeploy-logo.svg");
  background-size: 257px 60px;
  height: 60px;
  width: 257px;
}
@media screen and (min-width: 1100px) {
.header-logo {
top: -15px;
}
}
pre {
white-space: pre;
}
@media screen and (min-width: 2000px) {
.pytorch-content-left {
width: 1200px;
margin-left: 30px;
}
article.pytorch-article {
max-width: 1200px;
}
.pytorch-breadcrumbs-wrapper {
width: 1200px;
}
.pytorch-right-menu.scrolling-fixed {
position: fixed;
top: 45px;
left: 1580px;
}
}
article.pytorch-article section code {
padding: .2em .4em;
background-color: #f3f4f7;
border-radius: 5px;
}
/* Disable the change in tables */
article.pytorch-article section table code {
padding: unset;
background-color: unset;
border-radius: unset;
}
table.autosummary td {
width: 50%
}
img.align-center {
display: block;
margin-left: auto;
margin-right: auto;
}
article.pytorch-article p.rubric {
font-weight: bold;
}