Sync 0.2.6 code

d7117b95 · zhouxiang · 5f83e392
Commit d7117b95 authored Mar 22, 2024 by zhouxiang
20 changed files
--- a/docs/zh_cn/api.rst
+++ b/docs/zh_cn/api.rst
-lmdeploy.lite
-------------
-.. automodule:: lmdeploy.lite
-    :members:
-lmdeploy.pytorch
----------------
-.. automodule:: lmdeploy.pytorch
-    :members:
-lmdeploy.serve
--------------
-.. automodule:: lmdeploy.serve
-    :members:
--- a/docs/zh_cn/benchmark/profile_api_server.md
+++ b/docs/zh_cn/benchmark/profile_api_server.md
@@ -5,17 +5,12 @@ api_server 的测试方式与[求吞吐量测试方法](./profile_throughput.md)
 The test script is `profile_restful_api.py`. Before testing, install the lmdeploy prebuilt package and download the benchmark script and the test dataset.
 ```shell
-pip install 'lmdeploy[serve]>=0.1.0a1'
+pip install lmdeploy
 git clone --depth=1 https://github.com/InternLM/lmdeploy
 cd lmdeploy/benchmark
 wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
 ```
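-When benchmarking, you need to specify a concrete model. We recommend downloading the model locally and converting it to the turbomind format with `lmdeploy convert` before testing.
-The reason is that this makes it easy to tune inference-engine parameters for better performance, such as the batch size (max_batch_size) and the K/V cache size (max_cache_entry_count). For details on these parameters, see [here](../turbomind_config.md).
-In the following sections, we assume the model is in the turbomind format.
 ## Metrics
 LMDeploy reports first-token latency (first_token_latency), token throughput (tokens/s), and request throughput (RPM).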
@@ -36,70 +31,22 @@ $$
 The total time includes the prefill time.
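As a rough illustration of the metrics named in this hunk, the sketch below derives mean first-token latency, token throughput, and request throughput (RPM) from per-request records. The record layout and the names `RequestRecord` and `summarize` are assumptions made for this example and are not taken from `profile_restful_api.py`. The elapsed time is the wall-clock span of the whole run, so prefill time is included.

```python
from dataclasses import dataclass


@dataclass
class RequestRecord:
    # Hypothetical per-request record; the field names are illustrative only.
    first_token_latency: float  # seconds until the first token arrived
    completion_tokens: int      # number of generated tokens


def summarize(records, elapsed_seconds):
    """Return (mean first-token latency, tokens per second, requests per minute)."""
    if not records or elapsed_seconds <= 0:
        raise ValueError("need at least one record and a positive elapsed time")
    total_tokens = sum(r.completion_tokens for r in records)
    mean_ftl = sum(r.first_token_latency for r in records) / len(records)
    token_throughput = total_tokens / elapsed_seconds
    rpm = len(records) / (elapsed_seconds / 60.0)
    return mean_ftl, token_throughput, rpm


if __name__ == "__main__":
    demo = [RequestRecord(0.1, 100), RequestRecord(0.3, 200)]
    print(summarize(demo, elapsed_seconds=30.0))
```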
-## Test case
+## Methodology
-Taking `internlm-7b` as an example, the full api_server speed-test workflow is as follows:
-```shell
-pip install 'lmdeploy[serve]>=0.1.0a1'
-git clone --depth=1 https://github.com/InternLM/lmdeploy
-cd lmdeploy/benchmark
-wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
-# Download internlm-7b from huggingface and convert it to the turbomind model format
+Taking [internlm/internlm-7b](https://huggingface.co/internlm/internlm-7b) as an example, we show the performance-testing workflow for api_server.
-lmdeploy convert internlm internlm/internlm-7b --dst-path ./internlm-7b
-# Start the server
+### Start the service
-lmdeploy serve api_server ./internlm-7b --server-port 23333
-# In another terminal, run the benchmark script from the `lmdeploy/benchmark` directory
+```shell
-python3 ./profile_restful_api.py http://0.0.0.0:23333 ./internlm-7b/triton_models/tokenizer ./ShareGPT_V3_unfiltered_cleaned_split.json
+lmdeploy serve api_server internlm/internlm-7b
 ```
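-## Test method
+To change the server port, or parameters such as the inference engine or the maximum batch size, run `lmdeploy serve api_server -h` or read [this document](../serving/api_server.md) for a detailed description of the arguments.
-See [here](../restful_api.md) to start the inference service. The startup argument `--instance-num` is the number of inference instances in the service. When more requests than that reach api_server at the same moment, the excess requests wait in the inference queue.
+### Benchmark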
 ```shell
-python3 profile_restful_api.py <server_addr> <tokenizer_path> <dataset> <optional arguments>
+python3 profile_restful_api.py http://0.0.0.0:23333 internlm/internlm-7b ./ShareGPT_V3_unfiltered_cleaned_split.json
 ```
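-The required arguments are:
+For the arguments of the `profile_restful_api.py` script, such as request concurrency and sampling parameters, run `python3 profile_restful_api.py -h`.
- `server_addr`
-  The address of api_server, in the form `http://{server_ip}:{server_port}`
- `tokenizer_path`
-  The path to the tokenizer model. It is used to pre-encode the test dataset and obtain the token length of the dialogue data
- `dataset`
-  The path to the downloaded test dataset
-The optional arguments are:
- `--concurrency`
-  The number of client request threads; concurrent requests are batched together by the inference engine. The default is 64. The concurrency should not exceed the api_server's `--instance-num`; otherwise, the excess requests wait in the inference queue.
- `--num-prompts`
-  The number of prompts sampled from the dataset. The default is 2000
- `--top_p` and `--temperature`
-  These two parameters are used to sample the generated token_id
- `--stream_output`
-  The switch for streaming inference. The default is `False`
- `--csv`
-  The path of a csv file in which the test results are stored. The default is `./profile_api_server.csv`
- `--seed`
-  The seed used when randomly sampling prompts from the test dataset. The default is 0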
--- a/docs/zh_cn/benchmark/profile_generation.md
+++ b/docs/zh_cn/benchmark/profile_generation.md
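-# Static inference performance test method
+# Static inference performance test
 We call inference performed by the engine with a fixed batch size and fixed numbers of input and output tokens static inference.
 The benchmark script is `profile_generation.py`. Before running it, install the lmdeploy prebuilt package and download the benchmark script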
 ```shell
-pip install 'lmdeploy>=0.1.0a1'
+pip install lmdeploy
 git clone --depth=1 https://github.com/InternLM/lmdeploy
 ```
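-When benchmarking, you need to specify a concrete model. We recommend downloading the model locally and converting it to the turbomind format with `lmdeploy convert` before testing.
-The reason is that this makes it easy to tune inference-engine parameters for better performance, such as the batch size (max_batch_size) and the K/V cache size (max_cache_entry_count). For details on these parameters, see [here](../turbomind_config.md).
-In the following sections, we assume the model is in the turbomind format.
 ## Metrics
 LMDeploy reports first-token latency (first_token_latency), token throughput (tokens/s), per-token latency percentiles (P50, P75, P95, P99), GPU memory usage, and other test results.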
@@ -30,58 +25,22 @@ $$
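 During the test, no other program should run on any GPU of the node; otherwise the GPU memory statistics will be inaccurate.
-## Test case
+## Methodology
+Taking [internlm/internlm-7b](https://huggingface.co/internlm/internlm-7b) as an example, we describe how to test the static inference performance of LMDeploy's two inference engines, turbomind and pytorch.
-Taking `internlm-7b` as an example, the full api_server speed-test workflow is as follows:
+### Turbomind engine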
 ```shell
-pip install 'lmdeploy>=0.1.0a1'
-git clone --depth=1 https://github.com/InternLM/lmdeploy
 cd lmdeploy/benchmark
+python3 profile_generation.py internlm/internlm-7b
-# Download internlm-7b from huggingface and convert it to the turbomind model format
-lmdeploy convert internlm internlm/internlm-7b --dst-path ./internlm-7b
-# Run the benchmark script
-python3 profile_generation ./internlm-7b
 ```
-## Test method
+### PyTorch engine
 ```shell
-python3 profile_generation.py <model_path> <optional arguments>
+cd lmdeploy/benchmark
+python3 profile_generation.py internlm/internlm-7b --backend pytorch
 ```
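-Here, `model_path` is the path on localhost to the model in turbomind format.
+For the arguments of the `profile_generation.py` script, such as the batch size and the numbers of input and output tokens, run `python3 profile_generation.py -h`.
-The optional arguments are:
- `--concurrency`
-  The number of request threads; concurrent requests are batched together by the inference engine. The default is `[1, 16, 32, 64]`, meaning that performance is tested at 4 different concurrency levels by default. The concurrency should not exceed `max_batch_size` in `config.ini`; otherwise, the excess requests wait in the inference queue.
- `--prompt-tokens` and `--completion-tokens`
-  The numbers of input and output tokens. Each is a list, and the elements correspond one to one, i.e. `(--prompt-tokens[i], --completion-tokens[i])` forms a pair. For example, with the default lists `[1, 128, 128, 2048, 2048]` and `[128, 128, 2048, 128, 2048]`, the tested combinations are `(1, 128)`, `(128, 128)`, `(128, 2048)`, `(2048, 128)` and `(2048, 2048)`
- `--tp`
-  The number of GPUs used for tensor parallelism. It must be a power of 2. The default is 1.
- `--top_k`, `--top_p` and `--temperature`
-  These three parameters are used to sample the generated token_id.
- `--csv`
-  The path of a csv file in which the test results are stored. The default is `./profile_generation.csv`
- `--log-level`
-  The log level. The default is `ERROR`
- `--test-round`
-  The number of test rounds. The default is 10, meaning that each test setting is run for 10 rounds and the average is reported.
-We call one `(concurrency, prompt_token count, completion-token count)` tuple a test case. The script therefore runs `total test cases = length of the concurrency list x length of the prompt_token list`, and `test scale = total test cases x test rounds`. Users can adjust the test parameters flexibly to suit their needs.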
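The test-case arithmetic in the removed parameter description can be checked with a short sketch. It uses the default lists quoted in that description; the helper name `build_test_cases` is hypothetical and does not come from `profile_generation.py`.

```python
from itertools import product


def build_test_cases(concurrency, prompt_tokens, completion_tokens):
    """Enumerate (concurrency, prompt_tokens, completion_tokens) test cases.

    Prompt and completion lengths are paired index by index, and every pair
    is combined with every concurrency level.
    """
    if len(prompt_tokens) != len(completion_tokens):
        raise ValueError("prompt and completion lists must have equal length")
    pairs = list(zip(prompt_tokens, completion_tokens))
    return [(c, p, o) for c, (p, o) in product(concurrency, pairs)]


# Defaults quoted in the removed documentation.
cases = build_test_cases(
    concurrency=[1, 16, 32, 64],
    prompt_tokens=[1, 128, 128, 2048, 2048],
    completion_tokens=[128, 128, 2048, 128, 2048],
)
test_round = 10
print(len(cases), len(cases) * test_round)  # prints: 20 200
```

With the defaults, 4 concurrency levels times 5 token pairs give 20 test cases, and 10 rounds each give a test scale of 200 runs.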
--- a/docs/zh_cn/benchmark/profile_throughput.md
+++ b/docs/zh_cn/benchmark/profile_throughput.md
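-# Request throughput test method
+# Request throughput performance test
 In real applications, the length of the user's prompt and the number of tokens in the model's reply vary dynamically. Static inference performance is not enough to reflect how well the inference engine handles dynamic inputs and outputs.
 Real dialogue data is therefore needed to evaluate the engine's dynamic inference capability. This article describes how to test LMDeploy's dynamic inference performance on localhost.
-The test script is `profile_restful_api.py`. Before testing, install the lmdeploy prebuilt package and download the benchmark script and the test dataset.
+The test script is `profile_throughput.py`. Before testing, install the lmdeploy prebuilt package and download the benchmark script and the test dataset.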
 ```shell
-pip install 'lmdeploy>=0.1.0a1'
+pip install lmdeploy
 git clone --depth=1 https://github.com/InternLM/lmdeploy
+cd lmdeploy/benchmark
 wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
 ```
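-When benchmarking, you need to specify a concrete model. We recommend downloading the model locally and converting it to the turbomind format with `lmdeploy convert` before testing.
-The reason is that this makes it easy to tune inference-engine parameters for better performance, such as the batch size (max_batch_size) and the K/V cache size (max_cache_entry_count). For details on these parameters, see [here](../turbomind_config.md).
-In the following sections, we assume the model is in the turbomind format.
 ## Metrics
 LMDeploy reports first-token latency (first_token_latency), token throughput (tokens/s), and request throughput (RPM).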
@@ -37,69 +33,20 @@ $$
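 The total time includes the prefill time.
-## Test case
+## Methodology
-Taking `internlm-7b` as an example, the full api_server speed-test workflow is as follows:
-```shell
+Taking [internlm/internlm-7b](https://huggingface.co/internlm/internlm-7b) as an example, we describe how to test the offline request-processing speed of LMDeploy's two inference engines, turbomind and pytorch.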
-pip install 'lmdeploy>=0.1.0a1'
-git clone --depth=1 https://github.com/InternLM/lmdeploy
-cd lmdeploy/benchmark
-wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
-# Download internlm-7b from huggingface and convert it to the turbomind model format
+### Turbomind engine
-lmdeploy convert internlm internlm/internlm-7b --dst-path ./internlm-7b
-# Run the benchmark script
+```shell
-python3 profile_throughput.py ./ShareGPT_V3_unfiltered_cleaned_split.json ./internlm-7b
+python3 profile_throughput.py ./ShareGPT_V3_unfiltered_cleaned_split.json internlm/internlm-7b
 ```
-## Test method
+### PyTorch engine
 ```shell
-python3 profile_throughput.py <dataset> <model_path> <optional arguments>
+python3 profile_throughput.py ./ShareGPT_V3_unfiltered_cleaned_split.json internlm/internlm-7b  --backend pytorch
 ```
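-The required arguments are:
+For the detailed arguments of `profile_throughput.py`, such as concurrency, sampling parameters, and the k/v memory allocation ratio, run `python3 profile_throughput.py -h`.
- `dataset`
-  The path to the test dataset
- `model_path`
-  The path on localhost to the model in turbomind format
-The optional arguments are:
- `--concurrency`
-  The number of request threads; concurrent requests are batched together by the inference engine. The default is 64. The concurrency should not exceed `max_batch_size` in `config.ini`; otherwise, the excess requests wait in the inference queue.
- `--num-prompts`
-  The number of prompts sampled from the dataset. The default is 2000
- `--tp`
-  The number of GPUs used for tensor parallelism. It must be a power of 2. The default is 1
- `--top_k`, `--top_p` and `--temperature`
-  These three parameters are used to sample the generated token_id
- `--stream_output`
-  The switch for streaming inference. The default is `True`
- `--csv`
-  The path of a csv file in which the test results are stored. The default is `./profile_throughput.csv`
- `--log-level`
-  The log level. The default is `ERROR`
- `--seed`
-  The seed used when randomly sampling prompts from the test dataset. The default is 0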
--- a/docs/zh_cn/benchmark/profile_triton_server.md
+++ b/docs/zh_cn/benchmark/profile_triton_server.md
-# Triton Inference Server performance test method
+# Triton Inference Server performance test
 Triton Inference Server (TIS) is another serving method supported by LMDeploy besides api_server. Its performance-testing procedure and metrics are similar to those of [api_server](./profile_api_server.md).
@@ -9,16 +9,11 @@ LMDeploy 尚未实现 Triton Inference Server 的 ensemble 推理模式，所以
 The TIS performance test script is `profile_serving.py`. Before testing, install the lmdeploy prebuilt package and download the benchmark script and the test dataset.
 ```shell
-pip install 'lmdeploy[serve]>=0.1.0a1'
+pip install 'lmdeploy[serve]'
 git clone --depth=1 https://github.com/InternLM/lmdeploy
 wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
 ```
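-When benchmarking, you need to specify a concrete model. We recommend downloading the model locally and converting it to the turbomind format with `lmdeploy convert` before testing.
-The reason is that this makes it easy to tune inference-engine parameters for better performance, such as the batch size (max_batch_size) and the K/V cache size (max_cache_entry_count). For details on these parameters, see [here](../turbomind_config.md).
-In the following sections, we assume the model is in the turbomind format.
 ## Metrics
 LMDeploy reports first-token latency (first_token_latency), token throughput (tokens/s), and request throughput (RPM).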
@@ -39,71 +34,28 @@ $$
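 The total time includes the prefill time.
-## Test case
+## Methodology
-Taking `internlm-7b` as an example, the full api_server speed-test workflow is as follows:
+Taking [internlm/internlm-7b](https://huggingface.co/internlm/internlm-7b) as an example, we show the performance-testing workflow for triton inference server.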
-```shell
-pip install 'lmdeploy[serve]>=0.1.0a1'
-git clone --depth=1 https://github.com/InternLM/lmdeploy
-cd lmdeploy/benchmark
-wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
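-# Download internlm-7b from huggingface and convert it to the turbomind model format
+### Start the service
-lmdeploy convert internlm internlm/internlm-7b --dst-path ./internlm-7b
-# Start the server
+Before starting the service, the model must first be converted to the turbomind model format:
-bash ./internlm-7b/service_docker_up.sh
-# In another terminal, run the benchmark script from the `lmdeploy/benchmark` directory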
+```shell
-python3 ./profile_serving 0.0.0.0:33337 ./internlm-7b/triton_models/tokenizer ./ShareGPT_V3_unfiltered_cleaned_split.json
+lmdeploy convert internlm internlm/internlm-7b --dst-path ./internlm-7b
 ```
-## Test method
+Then run the following command to start the service:
-Start the service
 ```shell
-python3 profile_restful_api.py <server_addr> <tokenizer_path> <dataset> <optional arguments>
+bash ./internlm-7b/service_docker_up.sh
 ```
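-The required arguments are:
+### Benchmark
- `server_addr`
-  The address of api_server, in the form `{server_ip}:{server_port}`
- `tokenizer_path`
-  The path to the tokenizer model. It is used to pre-encode the test dataset and obtain the token length of the dialogue data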
- `dataset`
+```shell
+python3 profile_serving.py 0.0.0.0:33337 ./internlm-7b/triton_models/tokenizer ./ShareGPT_V3_unfiltered_cleaned_split.json
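-  The path to the downloaded test dataset
+```
-The optional arguments are:
- `--concurrency`
-  The number of client request threads; concurrent requests are batched together by the inference engine. The default is 32. We recommend that the concurrency exceed neither the inference engine's `max_batch_size` nor the number of inference instances in triton_models.
-  The number of inference instances is configured by `instance_group` in the file `{model_path}/triton_models/interactive/config.pbtxt`; the default is 48.
- `--num-prompts`
-  The number of prompts sampled from the dataset. The default is 1000
- `--top_k`, `--top_p` and `--temperature`
-  These three parameters are used to sample the generated token_id
- `--stream_output`
-  The switch for streaming inference. The default is `False`
- `--csv`
-  The path of a csv file in which the test results are stored. The default is `./profile_tis.csv`
- `--seed`
-  The seed used when randomly sampling prompts from the test dataset. The default is 0
+For the arguments of the `profile_serving.py` script, such as request concurrency and sampling parameters, run `python3 profile_serving.py -h`.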
--- a/docs/zh_cn/build.md
+++ b/docs/zh_cn/build.md
@@ -17,7 +17,8 @@ LMDeploy 提供了编译镜像 `openmmlab/lmdeploy-builder:cuda11.8`。使用之
 Run the following commands in the root directory of the lmdeploy source code:
 ```shell
-cd lmdeploy # root directory of the lmdeploy source code
+# root directory of the lmdeploy source code
+cd lmdeploy
 bash builder/manywheel/build_all_wheel.sh
 ```
@@ -67,8 +68,10 @@ wheel 文件存放在目录 `builder/manywheel/cuda11.8_dist` 下。
  ```
 - Build and install lmdeploy:
  ```shell
-  apt install ninja-build # install Ninja for faster builds
+  # install Ninja for faster builds
-  cd lmdeploy # root directory of the lmdeploy source code
+  apt install ninja-build
+  # root directory of the lmdeploy source code
+  cd lmdeploy
  mkdir build && cd build
  sh ../generate.sh
  ninja && ninja install

--- a/docs/zh_cn/conf.py
+++ b/docs/zh_cn/conf.py
@@ -53,6 +53,7 @@ extensions = [
    'sphinx.ext.napoleon',
    'sphinx.ext.viewcode',
    'sphinx.ext.autosectionlabel',
+    'sphinx_tabs.tabs',
    'sphinx_markdown_tables',
    'myst_parser',
    'sphinx_copybutton',
@@ -106,10 +107,10 @@ html_theme_path = [pytorch_sphinx_theme.get_html_theme_path()]
 # documentation.
 #
 html_theme_options = {
-    'logo_url': 'https://lmdeploy.readthedocs.io/zh_CN/latest/',
+    'logo_url': 'https://lmdeploy.readthedocs.io/zh-cn/latest/',
    'menu': [{
        'name': 'GitHub',
-        'url': 'https://github.com/open-mmlab/lmdeploy'
+        'url': 'https://github.com/InternLM/lmdeploy'
    }],
    'menu_lang': 'cn',
 }

--- a/docs/zh_cn/faq.md
+++ b/docs/zh_cn/faq.md
@@ -41,9 +41,34 @@ export LD_LIBRARY_PATH={Location}/nvidia/nccl/lib:$LD_LIBRARY_PATH
 This is most likely because the cuda version on the machine is too old. The LMDeploy runtime requires cuda 11.2 or later
-## Turbomind inference
+## Inference
-## Pytorch inference
+### RuntimeError: \[TM\]\[ERROR\] CUDA runtime error: out of memory /workspace/lmdeploy/src/turbomind/utils/allocator.h
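+This usually happens because the proportion of memory given to the k/v cache is too large. The proportion is controlled by `TurbomindEngineConfig.cache_max_entry_count`. The meaning of this parameter differs slightly between lmdeploy versions; see the [evolution note](https://github.com/InternLM/lmdeploy/blob/52419bd5b6fb419a5e3aaf3c3b4dea874b17e094/lmdeploy/messages.py#L107) in the code for details.
+If you hit this problem when using the pipeline interface, lower the ratio, for example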
+```python
+from lmdeploy import pipeline, TurbomindEngineConfig
+backend_config = TurbomindEngineConfig(cache_max_entry_count=0.2)
+pipe = pipeline('internlm/internlm2-chat-7b',
+                backend_config=backend_config)
+response = pipe(['Hi, pls intro yourself', 'Shanghai is'])
+print(response)
+```
+If you hit this problem when using the CLI tools, pass `--cache-max-entry-count` to lower the k/v cache usage ratio. For example,
+```shell
+# chat command
+lmdeploy chat turbomind internlm/internlm2-chat-7b --cache-max-entry-count 0.2
+# server command
+lmdeploy serve api_server internlm/internlm2-chat-7b --cache-max-entry-count 0.2
+```
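To see why lowering the ratio helps, the sketch below estimates the k/v cache budget for a given `cache_max_entry_count`. It rests on an assumption about the version-dependent meaning mentioned above: that the ratio applies either to total GPU memory or to free GPU memory, depending on the lmdeploy version. Check the linked note for the version you run. The function name and the memory figures are hypothetical.

```python
def kv_cache_budget_gib(ratio, total_gib, free_gib, of_free_memory=True):
    """Estimate the memory reserved for the k/v cache, in GiB.

    Assumption for illustration: the ratio is applied to free memory when
    `of_free_memory` is True, and to total memory otherwise.
    """
    if not 0 < ratio <= 1:
        raise ValueError("ratio must be in (0, 1]")
    base = free_gib if of_free_memory else total_gib
    return ratio * base


# Hypothetical 80 GiB card with 60 GiB free after the weights are loaded.
for ratio in (0.8, 0.5, 0.2):
    budget = kv_cache_budget_gib(ratio, total_gib=80, free_gib=60)
    print(f"ratio={ratio}: about {budget:.1f} GiB for the k/v cache")
```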
 ## Serving
@@ -54,3 +79,7 @@ export LD_LIBRARY_PATH={Location}/nvidia/nccl/lib:$LD_LIBRARY_PATH
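 Please check your disk space.
 This error occurs because there is not enough disk space when saving the weights; it may happen when quantizing a 70B model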
+### ModuleNotFoundError: No module named 'flash_attn'
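+Quantizing `qwen` models requires `flash-attn`. However, according to feedback from community users, `flash-attn` is hard to install, so lmdeploy removed it from its dependency list. Users can install it manually when they need it.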
--- a/docs/zh_cn/index.rst
+++ b/docs/zh_cn/index.rst
-欢迎来到 LMDeploy 的中文文档！
+欢迎来到 LMDeploy 的中文教程！
 ====================================
-点击页面左下角切换中英文。
+.. _快速上手:
 .. toctree::
   :maxdepth: 2
-   :caption: 编译
+   :caption: 快速上手
+   get_started.md
+.. _编译和安装:
+.. toctree::
+   :maxdepth: 1
+   :caption: 编译和安装
   build.md
+.. _测试基准:
 .. toctree::
-   :maxdepth: 2
+   :maxdepth: 1
-   :caption: 使用PyTorch对话
+   :caption: 测试基准
-   pytorch.md
+   benchmark/profile_generation.md
+   benchmark/profile_throughput.md
+   benchmark/profile_api_server.md
+   benchmark/profile_triton_server.md
+   benchmark/evaluate_with_opencompass.md
+.. _支持的模型:
 .. toctree::
-   :maxdepth: 2
+   :maxdepth: 1
-   :caption: 量化
+   :caption: 模型列表
-   quantization.md
+   supported_models/supported_models.md
+.. _推理:
 .. toctree::
-   :maxdepth: 2
+   :maxdepth: 1
+   :caption: 推理
+   inference/pipeline.md
+   inference/vl_pipeline.md
+.. _服务:
+.. toctree::
+   :maxdepth: 1
   :caption: 服务
-   serving.md
+   serving/api_server.md
+   serving/api_server_vl.md
+   serving/gradio.md
+   serving/proxy_server.md
+.. _量化:
 .. toctree::
-   :maxdepth: 2
+   :maxdepth: 1
-   :caption: TurboMind
+   :caption: 量化
-   turbomind.md
+   quantization/w4a16.md
+   quantization/kv_int8.md
+   quantization/w8a8.md
 .. toctree::
-   :caption: 语言切换
+   :maxdepth: 1
+   :caption: 进阶指南
-   switch_language.md
+   inference/turbomind.md
+   inference/pytorch.md
+   advance/pytorch_new_model.md
+   advance/long_context.md
+   advance/chat_template.md
+   advance/debug_turbomind.md
+   serving/qos.md
+.. toctree::
+   :maxdepth: 1
+   :caption: API 文档
+   api/pipeline.rst
-Indices and tables
+索引与表格
 ==================
 * :ref:`genindex`

--- a/docs/zh_cn/kv_int8.md
+++ b/docs/zh_cn/kv_int8.md
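-# KV Cache quantization and test results
-For a LLaMa-7B fp16 model with a maximum length of 2048, the server needs about 1030MB of GPU memory to hold the kv_cache for each concurrent session it creates; even an A100 80G can serve only a very limited number of users.
-To reduce runtime GPU memory, we implemented PTQ quantization of the kv cache, using the following formulas: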
-```bash
-zp = (min+max) / 2
-scale = (max-min) / 255
-quant: q = round( (f-zp) / scale)
-dequant: f = q * scale + zp
-```
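The four formulas above translate directly into code. The following is a minimal sketch in plain Python, not the TurboMind kernel; the clamp to the int8 range `[-128, 127]` is an assumption added here so that values at the edge of the range still fit in 8 bits.

```python
def kv_qparams(min_val, max_val):
    """Zero point and scale from the observed min/max, as in the formulas."""
    if max_val <= min_val:
        raise ValueError("max_val must be greater than min_val")
    zp = (min_val + max_val) / 2
    scale = (max_val - min_val) / 255
    return zp, scale


def quantize(f, zp, scale):
    """q = round((f - zp) / scale), clamped to int8 (the clamp is an assumption)."""
    q = round((f - zp) / scale)
    return max(-128, min(127, q))


def dequantize(q, zp, scale):
    """f = q * scale + zp"""
    return q * scale + zp


zp, scale = kv_qparams(-1.0, 1.0)
q = quantize(0.5, zp, scale)
print(q, dequantize(q, zp, scale))
```

For values inside `[min, max]` the round-trip error stays within about half a quantization step (`scale / 2`), apart from the clamped upper edge.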
-## How to enable KV Cache INT8
-### **Step 1**
-Convert the huggingface-format model to the turbomind inference format to obtain a workspace directory
-```bash
-lmdeploy convert internlm-chat-7b /path/to/internlm-chat-7b
-```
-If you already have a workspace directory, skip this step.
-### **Step 2**
-Obtain the quantization parameters with the following 2 steps
-```bash
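-# Compute minmax
-# --calib_dataset: calibration dataset; supports c4, ptb, wikitext2, pileval
-# --calib_samples: number of calibration samples; reduce it if GPU memory is insufficient
-# --calib_seqlen: length of each text sample; reduce it if GPU memory is insufficient
-# --work_dir: folder for the PyTorch-format quantization statistics and the quantized weights
-lmdeploy lite calibrate \
-  --model $HF_MODEL \
-  --calib_dataset 'c4' \
-  --calib_samples 128 \
-  --calib_seqlen 2048 \
-  --work_dir $WORK_DIR
-# Derive the quantization parameters from minmax
-# --work_dir: the result of the previous step
-# --turbomind_dir: directory where the quantization parameters are saved; needed for inference
-# --kv_sym: symmetric or asymmetric quantization; the default is False
-# --num_tp: number of GPUs used for tensor parallelism; keep it consistent with deploy.py
-lmdeploy lite kv_qparams \
-  --work_dir $WORK_DIR \
-  --turbomind_dir workspace/triton_models/weights/ \
-  --kv_sym False \
-  --num_tp 1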
-```
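-`kv_qparams` generates fp32 scaling factors in the `weights` directory; the file format is the binary produced by `numpy.tofile`.
-You can also first set `turbomind_dir` to a private directory and then copy the scaling factors into `workspace/triton_models/weights/`.
-### **Step 3**
-Edit `workspace/triton_models/weights/config.ini`:
- Set quant_policy to 4, which turns on kv_cache int8
-### **Step 4**
-Test the chat quality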
-```bash
-lmdeploy chat turbomind ./workspace
-```
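-## GPU memory test
-The test subject is the [internlm-chat-7b](https://huggingface.co/internlm/internlm-chat-7b) model.
-Test method:
-1. Convert the model with `deploy.py` and modify the maximum concurrency in the `workspace` configuration; adjust the number of requests in `llama_config.ini`
-2. Build and run `bin/llama_triton_example` to obtain the GPU memory usage of the fp16 version at different batch_size values
-3. Enable quantization and run `bin/llama_triton_example` again to obtain the GPU memory usage of the int8 version at different batch_size values
-The GPU memory comparison of the two versions is as follows: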
-| batch_size | fp16 memory(MiB) | int8 memory(MiB) | diff(MiB) |
-| :--------: | :--------------: | :--------------: | :-------: |
-|     8      |      22337       |      18241       |   -4096   |
-|     16     |      30593       |      22369       |   -8224   |
-|     32     |      47073       |      30625       |  -16448   |
-|     48     |      63553       |      38881       |  -24672   |
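-Compared with quantizing the weights directly (e.g. [GPTQ-for-LLaMa](https://github.com/qwopqwop200/GPTQ-for-LLaMa/)), we estimated the memory growth of the two approaches on a 7B model; part of the data comes from [llama.cpp](https://github.com/ggerganov/llama.cpp).
-![](../../resources/batch_memory.png)
-As shown, the fp16 version needs 1030MB of GPU memory per concurrent session, so quantizing the kv_cache significantly slows the growth of runtime GPU memory.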
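The per-session figure can be reproduced with back-of-the-envelope arithmetic. The sketch below assumes the standard LLaMA-7B shape (32 layers, hidden size 4096) and counts only the K and V tensors, ignoring quantization parameters and allocator overhead; the function name is hypothetical.

```python
def kv_cache_bytes(num_layers, hidden_size, seq_len, bytes_per_elem):
    """Bytes needed to hold K and V for one sequence across all layers."""
    return 2 * num_layers * hidden_size * seq_len * bytes_per_elem


MIB = 2 ** 20
fp16 = kv_cache_bytes(num_layers=32, hidden_size=4096, seq_len=2048, bytes_per_elem=2)
int8 = kv_cache_bytes(num_layers=32, hidden_size=4096, seq_len=2048, bytes_per_elem=1)
print(fp16 / MIB, int8 / MIB)  # prints: 1024.0 512.0
```

The 1024 MiB result is in line with the roughly 1030MB quoted above, and the 512 MiB saved per session matches the table: at batch_size 8 the difference is 4096 MiB, i.e. 8 x 512 MiB.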
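-## Accuracy test
-The test subject is the [internlm-chat-7b](https://huggingface.co/internlm/internlm-chat-7b) instruction model.
-The results below are for the `kCacheKVInt8` method, PTQ-quantized using only 128 samples randomly selected from the c4 dataset. Accuracy was measured with [opencompass](https://github.com/InternLM/opencompass) both before and after quantization.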
-|     task      |     dataset     |    metric     | int8  | fp16  | diff  |
-| :-----------: | :-------------: | :-----------: | :---: | :---: | :---: |
-|   Language    |   winogrande    |   accuracy    | 60.77 | 61.48 | -0.71 |
-|   Knowledge   |       nq        |     score     | 2.69  | 2.60  | +0.09 |
-|   Reasoning   |      gsm8k      |   accuracy    | 33.28 | 34.72 | -1.44 |
-|   Reasoning   |       bbh       | naive_average | 20.12 | 20.51 | -0.39 |
-| Understanding | openbookqa_fact |   accuracy    | 82.40 | 82.20 | +0.20 |
-| Understanding |   eprstmt-dev   |   accuracy    | 90.62 | 88.75 | +1.87 |
-|    Safety     |   crows_pairs   |   accuracy    | 32.56 | 31.43 | +1.13 |
-Note that the `kCacheKVInt8` and `WeightInt4` schemes can be enabled at the same time.
--- a/docs/zh_cn/load_hf.md
+++ b/docs/zh_cn/load_hf.md
-# Load huggingface models directly
-Starting from v0.1.0, TurboMind can read weights in the Huggingface format directly.
-## Supported types
-TurboMind currently supports loading three types of models:
-1. Models on huggingface.co quantized by lmdeploy, such as [llama2-70b-4bit](https://huggingface.co/lmdeploy/llama2-chat-70b-4bit) and [internlm-chat-20b-4bit](https://huggingface.co/internlm/internlm-chat-20b-4bit)
-2. Other LM models on huggingface.co, such as Qwen/Qwen-7B-Chat
-3. Models converted with the `lmdeploy convert` command, compatible with the old format
-## Usage
-### 1) Models quantized by lmdeploy
-Models quantized with `lmdeploy.lite` can be loaded by TurboMind directly, for example [llama2-70b-4bit](https://huggingface.co/lmdeploy/llama2-chat-70b-4bit) and [internlm-chat-20b-4bit](https://huggingface.co/internlm/internlm-chat-20b-4bit).
-```
-repo_id=internlm/internlm-chat-20b-4bit
-model_name=internlm-chat-20b
-# or
-# repo_id=/path/to/downloaded_model
-# Inference by TurboMind
-lmdeploy chat turbomind $repo_id --model-name $model_name
-# Serving with gradio
-lmdeploy serve gradio $repo_id --model-name $model_name
-# Serving with Restful API
-lmdeploy serve api_server $repo_id --model-name $model_name --instance_num 32 --tp 1
-```
-### 2) Other LM models
-Other LM models include Qwen/Qwen-7B-Chat and baichuan-inc/Baichuan2-7B-Chat. The models LMDeploy supports can be listed with `lmdeploy list`.
-```
-repo_id=Qwen/Qwen-7B-Chat
-model_name=qwen-7b
-# or
-# repo_id=/path/to/Qwen-7B-Chat/local_path
-# Inference by TurboMind
-lmdeploy chat turbomind $repo_id --model-name $model_name
-# Serving with gradio
-lmdeploy serve gradio $repo_id --model-name $model_name
-# Serving with Restful API
-lmdeploy serve api_server $repo_id --model-name $model_name --instance_num 32 --tp 1
-```
-### 3) Models converted with the `lmdeploy convert` command
-Usage is the same as before
-```
-# Convert a model
-lmdeploy convert /path/to/model ./workspace --model-name MODEL_NAME
-# Inference by TurboMind
-lmdeploy chat turbomind ./workspace
-# Serving with gradio
-lmdeploy serve gradio ./workspace
-# Serving with Restful API
-lmdeploy serve api_server ./workspace --instance_num 32 --tp 1
-```
--- a/docs/zh_cn/restful_api.md
+++ b/docs/zh_cn/restful_api.md
-# Restful API
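-### Start the service
-Copy the http url printed by the command below into a browser to view all the APIs and their usage in detail.
-Be sure to check `http://{server_ip}:{server_port}`!!!
-Be sure to check `http://{server_ip}:{server_port}`!!!
-Be sure to check `http://{server_ip}:{server_port}`!!!
-Important things are worth saying three times.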
-```shell
-lmdeploy serve api_server ./workspace 0.0.0.0 --server_port ${server_port} --instance_num 64 --tp 1
-```
-Three of the restful apis we provide follow the OpenAI format.
- /v1/chat/completions
- /v1/models
- /v1/completions
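-However, we recommend another API that we provide: `/v1/chat/interactive`.
-It performs better and exposes more parameters for users to customize.
-### python
-We have integrated the client-side functionality of these services into the `APIClient` class. The examples below show how to call the `api_server` service from the client.
-If you want to use the `/v1/chat/completions` endpoint, you can try the following code: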
-```python
-from lmdeploy.serve.openai.api_client import APIClient
-api_client = APIClient('http://{server_ip}:{server_port}')
-model_name = api_client.available_models[0]
-messages = [{"role": "user", "content": "Say this is a test!"}]
-for item in api_client.chat_completions_v1(model=model_name, messages=messages):
-    print(item)
-```
-If you want to use the `/v1/completions` endpoint, you can try:
-```python
-from lmdeploy.serve.openai.api_client import APIClient
-api_client = APIClient('http://{server_ip}:{server_port}')
-model_name = api_client.available_models[0]
-for item in api_client.completions_v1(model=model_name, prompt='hi'):
-    print(item)
-```
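-LMDeploy's `/v1/chat/interactive` api can keep the dialogue history on the server side, but this is disabled by default. If you want to try it, read the following:
- In interactive mode, the dialogue history is stored on the server. Within one complete multi-turn dialogue, every request sets `interactive_mode = True` and keeps the same `session_id` (which must not be -1, the default value).
- In non-interactive mode, the server does not keep the history.
-Interactive mode is controlled by the boolean `interactive_mode` parameter. Below is an example in normal mode;
-to try interactive mode, simply pass `interactive_mode=True`.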
-```python
-from lmdeploy.serve.openai.api_client import APIClient
-api_client = APIClient('http://{server_ip}:{server_port}')
-for item in api_client.chat_interactive_v1(prompt='hi'):
-    print(item)
-```
-### Java/Golang/Rust
-You can use the code generator [openapi-generator-cli](https://github.com/OpenAPITools/openapi-generator-cli) to turn `http://{server_ip}:{server_port}/openapi.json` into a java/rust/golang client.
-Here is a usage example:
-```shell
-$ docker run -it --rm -v ${PWD}:/local openapitools/openapi-generator-cli generate -i /local/openapi.json -g rust -o /local/rust
-$ ls rust/*
-rust/Cargo.toml  rust/git_push.sh  rust/README.md
-rust/docs:
-ChatCompletionRequest.md  EmbeddingsRequest.md  HttpValidationError.md  LocationInner.md  Prompt.md
-DefaultApi.md             GenerateRequest.md    Input.md                Messages.md       ValidationError.md
-rust/src:
-apis  lib.rs  models
-```
-### cURL
-cURL can also be used to inspect the API output
-List the models:
-```bash
-curl http://{server_ip}:{server_port}/v1/models
-```
-Interactive Chat:
-```bash
-curl http://{server_ip}:{server_port}/v1/chat/interactive \
-  -H "Content-Type: application/json" \
-  -d '{
-    "prompt": "Hello! How are you?",
-    "session_id": 1,
-    "interactive_mode": true
-  }'
-```
-Chat Completions:
-```bash
-curl http://{server_ip}:{server_port}/v1/chat/completions \
-  -H "Content-Type: application/json" \
-  -d '{
-    "model": "internlm-chat-7b",
-    "messages": [{"role": "user", "content": "Hello! How are you?"}]
-  }'
-```
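The Chat Completions cURL call above can also be issued from Python without extra dependencies. The sketch below only builds the request with the standard library; the helper name `build_chat_request` and the base URL `http://localhost:23333` are assumptions for illustration, and sending the request requires a running api_server.

```python
import json
from urllib import request


def build_chat_request(base_url, model, user_message):
    """Build a POST request for the /v1/chat/completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
    }
    return request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_chat_request("http://localhost:23333", "internlm-chat-7b",
                         "Hello! How are you?")
print(req.get_method(), req.full_url)

# To send it against a running api_server:
# with request.urlopen(req) as resp:
#     print(json.load(resp))
```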
-Text Completions:
-```shell
-curl http://{server_ip}:{server_port}/v1/completions \
-  -H 'Content-Type: application/json' \
-  -d '{
-  "model": "llama",
-  "prompt": "two steps to build a house:"
-}'
-```
-### CLI client
-The restful api service can be tested with the client, for example
-```shell
-# restful_api_url is what printed in api_server.py, e.g. http://localhost:23333
-lmdeploy serve api_client api_server_url
-```
-### webui
-You can also test the restful-api directly with the webui.
-```shell
-# api_server_url is the address produced by api_server, e.g. http://localhost:23333
-# server_name and server_port are used to serve the gradio ui
-# Example: lmdeploy serve gradio http://localhost:23333 --server_name localhost --server_port 6006
-lmdeploy serve gradio api_server_url --server_name ${gradio_ui_ip} --server_port ${gradio_ui_port}
-```
-### FAQ
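-1. When the returned finish reason is `"finish_reason":"length"`, the session length has exceeded the maximum. To adjust the maximum session length, set the `--session_len` argument when starting `api_server`.
-2. If the server runs out of GPU memory (OOM), reduce `instance_num` when starting the service
-3. If requests with the same `session_id` sent to `/v1/chat/interactive` return an empty string and a negative `tokens` value, the `session_id` has probably become inconsistent; turn interactive mode off and then on again.
-4. The `/v1/chat/interactive` api supports multi-turn dialogue, but it is disabled by default. The `messages` or `prompt` argument can be either a simple string representing a single user question or a dialogue history.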
--- a/docs/zh_cn/serving.md
+++ b/docs/zh_cn/serving.md
-# Model serving
-## Deploy the [LLaMA-2](https://github.com/facebookresearch/llama) service
-Download the llama2 models from [here](https://huggingface.co/meta-llama) and deploy the service with the following commands:
-<details open>
-<summary><b>7B</b></summary>
-```shell
-lmdeploy convert llama2 /path/to/llama-2-7b-chat-hf
-bash workspace/service_docker_up.sh
-```
-</details>
-<details open>
-<summary><b>13B</b></summary>
-```shell
-lmdeploy convert llama2 /path/to/llama-2-13b-chat-hf --tp 2
-bash workspace/service_docker_up.sh
-```
-</details>
-<details open>
-<summary><b>70B</b></summary>
-```shell
-lmdeploy convert llama2 /path/to/llama-2-70b-chat-hf --tp 8
-bash workspace/service_docker_up.sh
-```
-</details>
-## Deploy the [LLaMA](https://github.com/facebookresearch/llama) service
-Fill in [this form](https://docs.google.com/forms/d/e/1FAIpQLSfqNECQnMkycAp2jP4Z9TFX0cGR4uf7b_fBxjY_OjhJILlKGA/viewform) to obtain the LLaMA model weights
-<details open>
-<summary><b>7B</b></summary>
-```shell
-lmdeploy convert llama /path/to/llama-7b llama \
-    --tokenizer_path /path/to/tokenizer/model
-bash workspace/service_docker_up.sh
-```
-</details>
-<details open>
-<summary><b>13B</b></summary>
-```shell
-lmdeploy convert llama /path/to/llama-13b llama \
-    --tokenizer_path /path/to/tokenizer/model --tp 2
-bash workspace/service_docker_up.sh
-```
-</details>
-<details open>
-<summary><b>30B</b></summary>
-```shell
-lmdeploy convert llama /path/to/llama-30b llama \
-    --tokenizer_path /path/to/tokenizer/model --tp 4
-bash workspace/service_docker_up.sh
-```
-</details>
-<details open>
-<summary><b>65B</b></summary>
-```shell
-lmdeploy convert llama /path/to/llama-65b llama \
-    --tokenizer_path /path/to/tokenizer/model --tp 8
-bash workspace/service_docker_up.sh
-```
-</details>
-### Deploy the [Vicuna](https://lmsys.org/blog/2023-03-30-vicuna/) service
-<details open>
-<summary><b>7B</b></summary>
-```shell
-python3 -m pip install fschat
-python3 -m fastchat.model.apply_delta \
-  --base-model-path /path/to/llama-7b \
-  --target-model-path /path/to/vicuna-7b \
-  --delta-path lmsys/vicuna-7b-delta-v1.1
-lmdeploy convert vicuna /path/to/vicuna-7b
-bash workspace/service_docker_up.sh
-```
-</details>
-<details open>
-<summary><b>13B</b></summary>
-```shell
-python3 -m pip install fschat
-python3 -m fastchat.model.apply_delta \
-  --base-model-path /path/to/llama-13b \
-  --target-model-path /path/to/vicuna-13b \
-  --delta-path lmsys/vicuna-13b-delta-v1.1
-lmdeploy convert vicuna /path/to/vicuna-13b
-bash workspace/service_docker_up.sh
-```
-</details>
--- a/docs/zh_cn/supported_models/codellama.md
+++ b/docs/zh_cn/supported_models/codellama.md
@@ -64,10 +64,10 @@ def remove_non_ascii(s: str) -> str:
 ### Chat
 ```
-lmdeploy chat turbomind ./workspace --cap chat --sys-instruct "Provide answers in Python"
+lmdeploy chat turbomind ./workspace --cap chat --meta-instruct "Provide answers in Python"
 ```
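-The instruction passed to `--sys-instruct` can be changed to any other programming language that codellama supports.
+The instruction passed to `--meta-instruct` can be changed to any other programming language that codellama supports.
 ### Python specialist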
@@ -88,9 +88,8 @@ TBD
 Start the server as follows:
 ```shell
-# --instance_num: the number of turbomind inference instances; it can be understood as the maximum supported concurrency
 # --tp: the number of GPUs used for tensor parallelism
-lmdeploy serve api_server ./workspace --server_name 0.0.0.0 --server_port ${server_port} --instance_num 32 --tp 1
+lmdeploy serve api_server ./workspace --server-name 0.0.0.0 --server-port ${server_port} --tp 1
 ```
 Open `http://{server_ip}:{server_port}` to access swagger and consult the details of the RESTful API.
@@ -107,8 +106,8 @@ lmdeploy serve api_client api_server_url
 ```shell
 # api_server_url is the address produced by api_server, e.g. http://localhost:23333
 # server_ip and server_port are used to serve the gradio ui
-# Example: lmdeploy serve gradio http://localhost:23333 --server_name localhost --server_port 6006
+# Example: lmdeploy serve gradio http://localhost:23333 --server-name localhost --server-port 6006
-lmdeploy serve gradio api_server_url --server_name ${gradio_ui_ip} --server_port ${gradio_ui_port}
+lmdeploy serve gradio api_server_url --server-name ${gradio_ui_ip} --server-port ${gradio_ui_port}
 ```
-For a detailed introduction to the RESTful API, see [this document](../restful_api.md).
+For a detailed introduction to the RESTful API, see [this document](../serving/api_server.md).
--- a/docs/zh_cn/turbomind.md
+++ b/docs/zh_cn/turbomind.md
-# TurboMind
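-TurboMind is an efficient inference engine for LLMs, developed on the basis of NVIDIA's [FasterTransformer](https://github.com/NVIDIA/FasterTransformer). Its main features include support for LLaMa-architecture models, the persistent batch inference mode, and an extensible KV cache manager.
-## TurboMind architecture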
-```
-  +--------------------+
-  |        API         |
-  +--------------------+
-          |    ^
-    请 求  |    | 流式回调
-          v    |
-  +--------------------+    获取   +-------------------+
-  |  Persistent Batch  | <-------> |  KV Cache 管理器 |
-  +--------------------+    更新   +-------------------+
-             ^
-             |
-             v
-+------------------------+
-|      LLaMa推理实现      |
-+------------------------+
-| FT kernels & utilities |
-+------------------------+
-```
-## Persistent Batch
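-You may have seen this mechanism under another name in other projects: `continuous batching`. While developing this feature, we modeled the inference of a conversational LLM as a continuously running batch whose lifetime spans the entire serving process, hence the name `persistent batch`. In short, it works as follows:
- N batch slots are prepared in advance.
- Whenever a slot is free, a request joins the batch. Once all tokens of a request have been generated, its batch slot is released immediately and accepts a new request.
- **When a sequence hits the cache (see below), its history tokens need not be decoded in every round, so its token generation starts right away**.
- The whole batch grows and shrinks automatically to avoid unnecessary computation.
-## KV cache manager
-TurboMind's [KV cache manager](https://github.com/InternLM/lmdeploy/blob/main/src/turbomind/models/llama/SequenceManager.h) is a memory-pool type object with an LRU policy built in, so the whole manager can be viewed as a **cache of KV caches**. It works roughly as follows:
- The KV cache is allocated by the manager. The manager reserves space according to a preconfigured number of slots. Each slot corresponds to the KV cache needed by one sequence. The size of the allocated memory blocks is configurable, allowing pre-allocation, on-demand allocation, or something in between.
- When a new request arrives and the cache pool has no free slot, the manager evicts the least recently used sequence according to the LRU policy and gives its slot to the new request. Moreover,
- a sequence that obtains a slot behaves like a cache hit: its history KV in the cache is returned directly, without another round of context decoding.
- Evicted sequences are not deleted entirely; they are converted to their most compact form, such as token IDs. When the same sequence id is requested later (i.e. the _cache-miss_ state), those token IDs are decoded by the FMHA context decoder and turned back into KV cache.
- Eviction and conversion are managed automatically inside TurboMind, so they are transparent to the user. __From the user's point of view, a system using TurboMind appears to have access to unlimited device memory__.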
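The eviction behaviour described in the list above can be mimicked with a toy model. This is a simplified sketch, not TurboMind's `SequenceManager`: the class name `ToySlotManager` and its methods are hypothetical, a Python list of token IDs stands in for the real KV cache, and the "recompute on cache miss" step is reduced to restoring the saved token IDs.

```python
from collections import OrderedDict


class ToySlotManager:
    """Toy LRU manager: a fixed number of slots, one per live sequence."""

    def __init__(self, num_slots):
        self.num_slots = num_slots
        self.slots = OrderedDict()  # seq_id -> token ids whose KV is "cached"
        self.evicted = {}           # seq_id -> token ids kept after eviction

    def acquire(self, seq_id, new_tokens):
        """Return 'hit', 'miss' (history must be re-decoded), or 'new'."""
        if seq_id in self.slots:
            # Cache hit: the history KV is reused; only new tokens are added.
            self.slots.move_to_end(seq_id)
            self.slots[seq_id].extend(new_tokens)
            return "hit"
        if len(self.slots) >= self.num_slots:
            # No free slot: evict the least recently used sequence and keep
            # only its compact form (the token ids).
            victim, tokens = self.slots.popitem(last=False)
            self.evicted[victim] = tokens
        history = self.evicted.pop(seq_id, None)
        status = "miss" if history is not None else "new"
        self.slots[seq_id] = (history or []) + list(new_tokens)
        return status


manager = ToySlotManager(num_slots=2)
print(manager.acquire(1, [10]), manager.acquire(2, [20]), manager.acquire(1, [11]))
print(manager.acquire(3, [30]))  # evicts sequence 2, the least recently used
print(manager.acquire(2, [21]))  # sequence 2 returns: history restored from token ids
```

Because evicted sequences keep their token IDs, a returning sequence never loses its history; it only pays the cost of decoding that history again.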
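-## TurboMind's LLaMa Implementation
-Our implementation of the LLaMa family of models is modified from the GPT-NeoX model in FasterTransformer. Besides basic refactoring and modifications for the LLaMa family, we made several improvements to enable high-performance inference of conversational models, the most important of which are:
- Fast text decoding in multi-round conversations. We replaced the attention implementation in the context decoder with a [cutlass](https://github.com/NVIDIA/cutlass)-based FMHA implementation, which supports mismatched Q/K lengths.
- We added indirect buffer pointers to both context FMHA and generation FMHA, which supports discontiguous KV caches within a batch.
- To support concurrent inference with persistent batch, we designed a new synchronization mechanism to coordinate worker threads running in tensor parallel mode.
- We implemented INT8 KV cache, which reduces memory overhead and increases batch size and system throughput. This is very useful in practice, because the KV cache consumes more memory and memory bandwidth than the weights and other activations.
- We resolved an NCCL hang that occurred when multiple model instances ran in TP mode within a single process. NCCL APIs are now guarded by host-side synchronization barriers.
-## API
-TurboMind's Python API supports streaming output and tensor parallel mode.
-TurboMind also inherits FasterTransformer's ability to register as a [Triton Inference Server](https://github.com/triton-inference-server/server) inference backend. However, to support concurrent requests in persistent batch, we no longer use sequence batching or dynamic batching as FasterTransformer does. Instead, TurboMind records and manages the state of request sequences itself.
-## Differences between TurboMind and FasterTransformer
-Apart from the features mentioned above, TurboMind differs from FasterTransformer in many other ways. For example, many FasterTransformer features were removed in TurboMind, including prefix prompts, beam search, context embedding, sparse GEMM, and support for models with GPT, T5, or other architectures.
-## FAQ
-### Support for Huggingface models
-For historical reasons, TurboMind's weight layout is based on [the official LLaMa implementation](https://github.com/facebookresearch/llama); the two differ only by a transpose. The Huggingface implementation, however, uses [a different layout](https://github.com/huggingface/transformers/blob/45025d92f815675e483f32812caa28cce3a960e7/src/transformers/models/llama/convert_llama_weights_to_hf.py#L123C76-L123C76). The difference between the two layouts in `W_q` and `W_k` is handled in [deploy.py](https://github.com/InternLM/lmdeploy/blob/ff4648a1d09e5aec74cf70efef35bfaeeac552e0/lmdeploy/serve/turbomind/deploy.py#L398), which users can consult.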
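The slot, LRU eviction, and cache-hit/cache-miss behaviour described in the removed `turbomind.md` can be illustrated with a toy model. This is only a sketch under stated assumptions: `ToySequenceManager` and its `fetch` method are hypothetical names, not TurboMind's API, and plain Python lists of token IDs stand in for real KV cache memory.

```python
from collections import OrderedDict


class ToySequenceManager:
    """Toy model of an LRU-managed pool of KV cache slots (hypothetical, illustrative only)."""

    def __init__(self, num_slots):
        self.num_slots = num_slots
        self.resident = OrderedDict()  # seq_id -> token IDs whose KV currently occupies a slot
        self.evicted = {}              # seq_id -> token IDs kept after eviction

    def fetch(self, seq_id, new_tokens):
        """Return (status, tokens that need context decoding) for one request."""
        new_tokens = list(new_tokens)
        if seq_id in self.resident:
            # Cache hit: the history KV is already in a slot, only new tokens are decoded.
            self.resident.move_to_end(seq_id)
            self.resident[seq_id].extend(new_tokens)
            return "hit", new_tokens
        if len(self.resident) >= self.num_slots:
            # No free slot: evict the least recently used sequence, keeping only its token IDs.
            victim, tokens = self.resident.popitem(last=False)
            self.evicted[victim] = tokens
        # Cache miss: any saved history must be decoded again together with the new tokens.
        history = self.evicted.pop(seq_id, [])
        self.resident[seq_id] = history + new_tokens
        return "miss", history + new_tokens


manager = ToySequenceManager(num_slots=2)
print(manager.fetch("a", [1, 2]))  # ('miss', [1, 2])
print(manager.fetch("b", [3]))     # ('miss', [3])
print(manager.fetch("a", [4]))     # ('hit', [4])
print(manager.fetch("c", [5]))     # evicts "b": ('miss', [5])
print(manager.fetch("b", [6]))     # evicts "a": ('miss', [3, 6])
```

The last call shows the point made in the text: the evicted sequence `"b"` is not lost, its token IDs are decoded again on the next fetch, so the caller never sees the eviction.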
--- a/docs/zh_cn/turbomind_config.md
+++ b/docs/zh_cn/turbomind_config.md
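-# TurboMind Configuration
-TurboMind is LMDeploy's inference engine. To run an LLM with it, the input model must first be converted to a TurboMind model. Besides the model weights, the TurboMind model folder contains several other files, the most important of which is the configuration file `triton_models/weights/config.ini`, which is closely tied to inference performance.
-If you are using LMDeploy 0.0.x, please refer to the [turbomind 1.0 config](#turbomind-10-配置) section to learn about the configuration. If you are using LMDeploy 0.1.x, please read [turbomind 2.0 config](#turbomind-20-配置) for the configuration details.
-## TurboMind 2.0 Configuration
-Taking the `llama-2-7b-chat` model as an example, its `config.ini` in TurboMind 2.0 looks like this: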
-```toml
-[llama]
-model_name = llama2
-tensor_para_size = 1
-head_num = 32
-kv_head_num = 32
-vocab_size = 32000
-num_layer = 32
-inter_size = 11008
-norm_eps = 1e-06
-attn_bias = 0
-start_id = 1
-end_id = 2
-session_len = 4104
-weight_type = fp16
-rotary_embedding = 128
-rope_theta = 10000.0
-size_per_head = 128
-group_size = 0
-max_batch_size = 64
-max_context_token_num = 1
-step_length = 1
-cache_max_entry_count = 0.5
-cache_block_seq_len = 128
-cache_chunk_size = 1
-use_context_fmha = 1
-quant_policy = 0
-max_position_embeddings = 2048
-rope_scaling_factor = 0.0
-use_logn_attn = 0
-```
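-These parameters consist of model attributes and inference parameters. Model attributes include the number of layers, the number of heads, dimensions, and so on; they **must not be modified**.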
-```toml
-model_name = llama2
-head_num = 32
-kv_head_num = 32
-vocab_size = 32000
-num_layer = 32
-inter_size = 11008
-norm_eps = 1e-06
-attn_bias = 0
-start_id = 1
-end_id = 2
-rotary_embedding = 128
-rope_theta = 10000.0
-size_per_head = 128
-```
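-Compared with the TurboMind 1.0 config, the model attributes in the TurboMind 2.0 config are the same as in 1.0, but the inference parameters have changed.
-The following sections focus on the inference parameters.
-### Data type
-The parameters related to data type are `weight_type` and `group_size`. They **must not be modified**.
-`weight_type` is the data type of the weights. Currently fp16 and int4 are supported; int4 means 4-bit weights. When `weight_type` is 4-bit, `group_size` is the group size used when the weights were quantized with `awq`. The LMDeploy prebuilt package currently uses `group_size = 128`.
-### Batch size
-The maximum batch size is still set through `max_batch_size`. Its default value has changed from 32 to 64.
-In TurboMind 2.0, `max_batch_size` is independent of `cache_max_entry_count`.
-### k/v cache size
-`cache_block_seq_len` and `cache_max_entry_count` control the memory size of the k/v cache.
-TurboMind 2.0 implements Paged Attention and manages the k/v cache in blocks.
-`cache_block_seq_len` is the length of the token sequence that one k/v block can hold; the default is 128. TurboMind computes the memory size of a k/v block with the following formula: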
-```
-cache_block_seq_len * num_layer * kv_head_num * size_per_head * 2 * sizeof(kv_data_type)
-```
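-For the llama2-7b model, when k/v is stored as half, one k/v block takes `128 * 32 * 32 * 128 * 2 * sizeof(half) = 64MB`.
-The meaning of `cache_max_entry_count` depends on its value:
- When it is a decimal in (0, 1), `cache_max_entry_count` is the percentage of memory used by k/v blocks. For example, an A100-80G has 80G of GPU memory, so `cache_max_entry_count = 0.5` means the k/v blocks use 80 * 0.5 = 40G in total.
- When it is an integer > 1, it is the number of k/v blocks.
-`cache_chunk_size` is the size of the allocation made each time new k/v cache blocks are needed. Its meaning depends on its value:
- When it is an integer > 0, `cache_chunk_size` k/v cache blocks are allocated.
- When it is -1, `cache_max_entry_count` k/v cache blocks are allocated.
- When it is 0, `sqrt(cache_max_entry_count)` k/v cache blocks are allocated.
-### kv int8 switch
-`quant_policy` is the switch for KV-int8 inference. For usage, please refer to the [kv int8](./kv_int8.md) deployment document.
-### Long-context extrapolation switch
-By default `rope_scaling_factor = 0`, which provides no extrapolation ability. Setting it to 1.0 enables RoPE's Dynamic NTK feature, which supports long-text inference.
-For the principle of Dynamic NTK, please refer to:
-1. https://www.reddit.com/r/LocalLLaMA/comments/14mrgpr/dynamically_scaled_rope_further_increases
-2. https://kexue.fm/archives/9675
-Setting `use_logn_attn = 1` enables [LogN attention scaling](https://kexue.fm/archives/8823).
-## TurboMind 1.0 Configuration
-Taking the `llama-2-7b-chat` model as an example, its `config.ini` in TurboMind 1.0 looks like this: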
-```toml
-[llama]
-model_name = llama2
-tensor_para_size = 1
-head_num = 32
-kv_head_num = 32
-vocab_size = 32000
-num_layer = 32
-inter_size = 11008
-norm_eps = 1e-06
-attn_bias = 0
-start_id = 1
-end_id = 2
-session_len = 4104
-weight_type = fp16
-rotary_embedding = 128
-rope_theta = 10000.0
-size_per_head = 128
-group_size = 0
-max_batch_size = 32
-max_context_token_num = 4
-step_length = 1
-cache_max_entry_count = 48
-cache_chunk_size = 1
-use_context_fmha = 1
-quant_policy = 0
-max_position_embeddings = 2048
-use_dynamic_ntk = 0
-use_logn_attn = 0
-```
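-These parameters consist of model attributes and inference parameters. Model attributes include the number of layers, the number of heads, dimensions, and so on; they **must not be modified**.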
-```toml
-model_name = llama2
-head_num = 32
-kv_head_num = 32
-vocab_size = 32000
-num_layer = 32
-inter_size = 11008
-norm_eps = 1e-06
-attn_bias = 0
-start_id = 1
-end_id = 2
-rotary_embedding = 128
-rope_theta = 10000.0
-size_per_head = 128
-```
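-The following sections focus on the inference parameters.
-### Data type
-The parameters related to data type are `weight_type` and `group_size`. They **must not be modified**.
-`weight_type` is the data type of the weights. Currently fp16 and int4 are supported; int4 means 4-bit weights. When `weight_type` is 4-bit, `group_size` is the group size used when the weights were quantized with `awq`. The LMDeploy prebuilt package currently uses `group_size = 128`.
-### Batch size
-`max_batch_size` adjusts the maximum batch size during inference. In general, a larger batch gives higher throughput. But make sure that `max_batch_size <= cache_max_entry_count`.
-### k/v cache size
-TurboMind allocates k/v cache memory according to `session_len`, `cache_chunk_size`, and `cache_max_entry_count`.
- `session_len` is the maximum length of a sequence, that is, the size of the context window.
- `cache_chunk_size` is the number of sequences whose k/v cache is allocated each time new conversation sequences are added.
- `cache_max_entry_count` is the maximum number of conversation sequences that can be cached.
-### kv int8 switch
-To enable 8-bit k/v inference, change the parameters `quant_policy` and `use_context_fmha`. For details, see the [kv int8](./kv_int8.md) deployment document.
-### Long-context extrapolation switch
-Setting `use_dynamic_ntk = 1` enables RoPE's Dynamic NTK option, which supports long-text inference.
-For the principle of Dynamic NTK, please refer to:
-1. https://www.reddit.com/r/LocalLLaMA/comments/14mrgpr/dynamically_scaled_rope_further_increases
-2. https://kexue.fm/archives/9675
-Setting `use_logn_attn = 1` enables [LogN attention scaling](https://kexue.fm/archives/8823).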
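The k/v block size formula and the two meanings of `cache_max_entry_count` from the TurboMind 2.0 part of the removed `turbomind_config.md` can be checked with a few lines of arithmetic. The sketch below is not TurboMind code: the helper names `kv_block_bytes` and `kv_block_count` are hypothetical, "80G" is assumed to mean 80 GiB, and the ratio is applied to total GPU memory because that is what the document's A100 example does.

```python
def kv_block_bytes(cache_block_seq_len, num_layer, kv_head_num, size_per_head, kv_dtype_bytes):
    """Memory of one k/v block, following the formula in the removed document."""
    # The factor 2 accounts for storing both K and V.
    return cache_block_seq_len * num_layer * kv_head_num * size_per_head * 2 * kv_dtype_bytes


def kv_block_count(cache_max_entry_count, total_memory_bytes, block_bytes):
    """Number of k/v blocks implied by cache_max_entry_count (hypothetical helper)."""
    if 0 < cache_max_entry_count < 1:
        # A decimal in (0, 1): share of memory given to k/v blocks.
        return int(total_memory_bytes * cache_max_entry_count) // block_bytes
    if cache_max_entry_count > 1 and float(cache_max_entry_count).is_integer():
        # An integer > 1: the block count itself.
        return int(cache_max_entry_count)
    raise ValueError("cache_max_entry_count must be in (0, 1) or an integer > 1")


MIB = 2 ** 20
GIB = 2 ** 30

# llama2-7b values from the config.ini in the document; half precision is 2 bytes.
block = kv_block_bytes(cache_block_seq_len=128, num_layer=32, kv_head_num=32,
                       size_per_head=128, kv_dtype_bytes=2)
print(block // MIB)                          # 64
print(kv_block_count(0.5, 80 * GIB, block))  # 640
```

Under these assumptions, half of an 80 GiB card holds 640 blocks of 128 tokens each, which is 81920 cached tokens across all sequences.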
--- a/docs/zh_cn/w4a16.md
+++ b/docs/zh_cn/w4a16.md
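-# W4A16 LLM Deployment
-LMDeploy supports inference of 4-bit weight models. **The minimum requirement for NVIDIA GPUs is sm80**, for example A10, A100, and the GeForce 30/40 series.
-Before running inference, make sure lmdeploy is installed: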
-```shell
-pip install lmdeploy[all]
-```
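-## 4-bit weight model inference
-You can download already-quantized 4-bit weight models directly from LMDeploy's [model zoo](https://huggingface.co/lmdeploy) and run inference with the commands below. Alternatively, follow the ["4-bit weight quantization"](#4bit-权重量化) section to quantize 16-bit weights to 4-bit, and then run inference as described below.
-Taking the 4-bit Llama-2-chat-7B model as an example, it can be downloaded directly from the model zoo: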
-```shell
-git-lfs install
-git clone https://huggingface.co/lmdeploy/llama2-chat-7b-w4
-```
-Run the following commands to chat with the model in the terminal:
-```shell
-## Convert the model's layout; the result is stored in the default path ./workspace
-lmdeploy convert \
-    --model-name llama2 \
-    --model-path ./llama2-chat-7b-w4 \
-    --model-format awq \
-    --group-size 128
-## Inference
-lmdeploy chat turbomind ./workspace
-```
-## Start the gradio service
-To chat with the model through a web UI, run the following command to start the gradio service:
-```shell
-lmdeploy serve gradio ./workspace --server_name {ip_addr} --server_port {port}
-```
-Then open http://{ip_addr}:{port} in a browser to chat online.
-## Inference speed
-We used [profile_generation.py](https://github.com/InternLM/lmdeploy/blob/main/benchmark/profile_generation.py) on an NVIDIA GeForce RTX 4090 to measure the token generation speed of the 4-bit Llama-2-7B-chat and Llama-2-13B-chat models. The test configuration is batch size = 1 and (prompt_tokens, completion_tokens) = (1, 512).
-| model            | llm-awq | mlc-llm | turbomind |
-| ---------------- | ------- | ------- | --------- |
-| Llama-2-7B-chat  | 112.9   | 159.4   | 206.4     |
-| Llama-2-13B-chat | N/A     | 90.7    | 115.8     |
-For the 16-bit and 4-bit weights of the two models above, the GPU memory used during turbomind inference with context sizes of 2048 and 4096 compares as follows:
-| model            | 16bit(2048) | 4bit(2048) | 16bit(4096) | 4bit(4096) |
-| ---------------- | ----------- | ---------- | ----------- | ---------- |
-| Llama-2-7B-chat  | 15.1        | 6.3        | 16.2        | 7.5        |
-| Llama-2-13B-chat | OOM         | 10.3       | OOM         | 12.0       |
-```
-pip install nvidia-ml-py
-```
-```shell
-python benchmark/profile_generation.py \
- --model-path ./workspace \
- --concurrency 1 8 --prompt-tokens 1 512 --completion-tokens 2048 512
-```
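-## 4-bit weight quantization
-4-bit weight quantization consists of 2 steps:
- Generate the quantization parameters
- Quantize the model weights according to the quantization parameters
-### Step 1: Generate the quantization parameters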
-```shell
-lmdeploy lite calibrate \
-  --model $HF_MODEL \
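-  --calib_dataset 'c4' \             # Calibration dataset; supports c4, ptb, wikitext2, pileval
-  --calib_samples 128 \              # Number of samples in the calibration set; reduce it if GPU memory is insufficient
-  --calib_seqlen 2048 \              # Length of a single text sample; reduce it if GPU memory is insufficient
-  --work_dir $WORK_DIR \             # Folder that stores the PyTorch-format quantization statistics and the quantized weights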
-```
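-### Step 2: Quantize the model weights
-LMDeploy uses the AWQ algorithm to quantize model weights. When running the command below, pass in the `$WORK_DIR` from step 1. After quantization, the weight files are also stored in this directory. You can then run inference as described in the ["4-bit weight model inference"](#4bit-权重模型推理) section.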
-```shell
-lmdeploy lite auto_awq \
-  --model $HF_MODEL \
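-  --w_bits 4 \                       # Number of bits for weight quantization
-  --w_group_size 128 \               # Group size for weight quantization statistics
-  --work_dir $WORK_DIR \             # Directory where step 1 saved the quantization parameters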
-```
--- a/examples/cpp/llama/llama_config.ini
+++ b/examples/cpp/llama/llama_config.ini
@@ -3,6 +3,7 @@ data_type=fp16
 enable_custom_all_reduce=0
 pipeline_para_size=1
 tensor_para_size=1
+; update model_dir path according to the actual situation
 model_dir=/workspace/models/triton_models/weights/

--- a/examples/cpp/llama/llama_triton_example.cc
+++ b/examples/cpp/llama/llama_triton_example.cc
@@ -255,7 +255,7 @@ int read_start_ids(size_t            batch_size,
                   std::string       file_name);
 std::vector<std::shared_ptr<std::unordered_map<std::string, triton::Tensor>>>
-prepareRequest(std::string ini_name, const int node_id, const int gpu_count, std::vector<void*>* pointer_record)
+prepareRequest(std::string ini_name, const int node_id, const int gpu_count, std::vector<void*>* pointer_record, const std::string& csv_name)
 {
    INIReader reader = INIReader(ini_name);
    if (reader.ParseError() < 0) {
@@ -279,7 +279,7 @@ prepareRequest(std::string ini_name, const int node_id, const int gpu_count, std
                   max_input_len,
                   end_id,
                   1,
-                   "../examples/cpp/llama/start_ids.csv");
+                   csv_name);
    // drop requests > request_batch_size
    if (v_start_lengths.size() > request_batch_size) {
        v_start_lengths.resize(request_batch_size);
@@ -363,6 +363,7 @@ int main(int argc, char* argv[])
    // Note: Only supports that all nodes have same gpu count
    const int   gpu_count  = ft::getDeviceCount();
    const int   world_size = node_num * gpu_count;
+    printf("Recommend to specify the first parameter on the command line as the path to llama_config.ini\n");
    std::string ini_name   = argc >= 2 ? std::string(argv[1]) : "../examples/cpp/llama/llama_config.ini";
    // step 1: Create model
@@ -372,7 +373,7 @@ int main(int argc, char* argv[])
    printf(
        "world_size=%d tensor_para_size=%d pipeline_para_size=%d\n", world_size, tensor_para_size, pipeline_para_size);
    FT_CHECK_WITH_INFO(world_size == (tensor_para_size * pipeline_para_size),
-                       "World Size != Tensor Parallel Size * Pipeline Parallel Size !");
+                       "World Size != Tensor Parallel Size * Pipeline Parallel Size ! Maybe you can use CUDA_VISIBLE_DEVICES.");
    std::cout << model->toString();
@@ -402,10 +403,12 @@ int main(int argc, char* argv[])
    }
    // step 4: prepare request
+    printf("Recommend to specify the second parameter on the command line as the path to start_ids.csv\n");
+    std::string csv_name = argc >= 3 ? std::string(argv[2]) : "../examples/cpp/llama/start_ids.csv";
    std::vector<void*> pointer_record;  // Used to prevent the pointers are
                                        // release after leaving functions
    std::vector<std::shared_ptr<std::unordered_map<std::string, triton::Tensor>>> request_list =
-        prepareRequest(ini_name, node_id, gpu_count, &pointer_record);
+        prepareRequest(ini_name, node_id, gpu_count, &pointer_record, csv_name);
    printf("[INFO] request is created \n");
    // step 5: Forward

--- a/examples/cpp/llama/tokenizer.py
+++ b/examples/cpp/llama/tokenizer.py
@@ -38,7 +38,8 @@ class Tokenizer:
 def main(model_file: str = '/data/llama/model/tokenizer.model',
         encode_file: str = None,
-         decode_file: str = None):
+         decode_file: str = None,
+         encode_line: str = None):
    tokenizer = Tokenizer(model_file)
    if encode_file:
        with open(encode_file, 'r') as f:
@@ -59,6 +60,13 @@ def main(model_file: str = '/data/llama/model/tokenizer.model',
                _token_ids = [int(token_id) for token_id in _token_ids]
                ys = tokenizer.decode(_token_ids)
                print(ys)
+    elif encode_line:
+        xs = tokenizer.encode(encode_line)
+        xs = ','.join(map(str, xs))
+        print(xs)
+        output_dir = osp.dirname(osp.abspath(__file__))
+        with open(osp.join(output_dir, 'start_ids.csv'), 'w') as f:
+            f.write(xs)
    else:
        first = True
        while True: