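vllm 0.9.2: update README.md, upgrade the Docker image and vLLM version, simplify the inference example commands, and remove the unnecessary environment variable prefix.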

Commit 3835b62e authored Oct 11, 2025 by laibao

Showing with 9 additions and 9 deletions

README.md +9 -9

--- a/README.md
+++ b/README.md
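@@ -34,7 +34,7 @@ Qwen2.5 is the latest generation of large language models open-sourced by Alibaba Cloud, marking the Qwen series
 Pull the inference Docker image from [光源](https://www.sourcefind.cn/#/image/dcu/custom):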

 ```
-docker pull image.sourcefind.cn:5000/dcu/admin/base/custom:vllm0.8.5-ubuntu22.04-dtk25.04.1-rc5-das1.6-py3.10-20250724
+docker pull image.sourcefind.cn:5000/dcu/admin/base/custom:vllm0.9.2-ubuntu22.04-dtk25.04.1-rc5-rocblas104381-0915-das1.6-py3.10-20250916-rc2
 # <Image ID>: replace with the ID of the Docker image pulled above
 # <Host Path>: path on the host
 # <Container Path>: mapped path inside the container
@@ -67,7 +67,7 @@ conda create -n qwen2.5_vllm python=3.10
 * lmslim: 0.2.1
 * flash_attn: 2.6.1
 * flash_mla: 1.0.0
-* vllm: 0.8.5
+* vllm: 0.9.2
 * python: python3.10

 `Tip: install the dependencies first, then install the vllm package last`  
@@ -108,7 +108,7 @@ export VLLM_RANK7_NUMA=7
 ### Offline batch inference

 ```bash
-VLLM_USE_FLASH_ATTN_PA=1 python examples/offline_inference/basic/basic.py
+ python examples/offline_inference/basic/basic.py
 ```

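 Here, `prompts` holds the prompts; `temperature` controls sampling randomness: lower values make generation more deterministic, higher values make it more random, 0 means greedy sampling, and the default is 1; `max_tokens=16` sets the generation length, which defaults to 16;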
@@ -119,7 +119,7 @@ VLLM_USE_FLASH_ATTN_PA=1 python examples/offline_inference/basic/basic.py
 1. Specify input and output

 ```bash
-VLLM_USE_FLASH_ATTN_PA=1 python benchmarks/benchmark_throughput.py --num-prompts 1 --input-len 32 --output-len 128 --model /your/model/path -tp 1 --trust-remote-code --enforce-eager --dtype float16
+ python benchmarks/benchmark_throughput.py --num-prompts 1 --input-len 32 --output-len 128 --model /your/model/path -tp 1 --trust-remote-code --enforce-eager --dtype float16
 ```

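 Here `--num-prompts` is the batch size, `--input-len` is the input sequence length, `--output-len` is the number of output tokens, `--model` is the model path, `-tp` is the number of cards used, and `--dtype float16` is the inference data type. If the model weights are bfloat16, set `--dtype bfloat16` or use `--dtype auto` to match the weight precision. Setting `--output-len 1` measures first-token latency.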
@@ -129,7 +129,7 @@ VLLM_USE_FLASH_ATTN_PA=1 python benchmarks/benchmark_throughput.py --num-prompts
 [sharegpt_v3_unfiltered_cleaned_split](https://huggingface.co/datasets/learnanything/sharegpt_v3_unfiltered_cleaned_split)

 ```bash
-VLLM_USE_FLASH_ATTN_PA=1 python benchmarks/benchmark_throughput.py --num-prompts 1 --model /your/model/path --dataset-name sharegpt --dataset-path /path/to/ShareGPT_V3_unfiltered_cleaned_split.json -tp 1 --trust-remote-code --enforce-eager --dtype float16
+ python benchmarks/benchmark_throughput.py --num-prompts 1 --model /your/model/path --dataset-name sharegpt --dataset-path /path/to/ShareGPT_V3_unfiltered_cleaned_split.json -tp 1 --trust-remote-code --enforce-eager --dtype float16
 ```

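 Here `--num-prompts` is the batch size, `--model` is the model path, `--dataset-name` is the name of the dataset to use, `--dataset-path` is the dataset path, `-tp` is the number of cards used, and `--dtype float16` is the inference data type. If the model weights are bfloat16, set `--dtype bfloat16` or use `--dtype auto` to match the weight precision. `-q gptq` runs inference with a GPTQ-quantized model.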
@@ -139,7 +139,7 @@ VLLM_USE_FLASH_ATTN_PA=1 python benchmarks/benchmark_throughput.py --num-prompts
 1. Start the service:

 ```bash
-VLLM_USE_FLASH_ATTN_PA=1 vllm serve --model /your/model/path --enforce-eager --dtype float16 --trust-remote-code --tensor-parallel-size 1
+ vllm serve --model /your/model/path --enforce-eager --dtype float16 --trust-remote-code --tensor-parallel-size 1
 ```

 2. Start the client
@@ -148,14 +148,14 @@ VLLM_USE_FLASH_ATTN_PA=1 vllm serve --model /your/model/path --enforce-eager --d
 python benchmarks/benchmark_serving.py --model /your/model/path --dataset-name sharegpt --dataset-path /path/to/ShareGPT_V3_unfiltered_cleaned_split.json  --num-prompts 1 --trust-remote-code
 ```

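-The parameters are the same as in the dataset-based offline batch inference benchmark; for details, see [benchmarks/benchmark_serving.py](/codes/modelzoo/qwen1.5_vllm/-/blob/master/benchmarks/benchmark_serving.py)
+The parameters are the same as in the dataset-based offline batch inference benchmark; for details, see [benchmarks/benchmark_serving.py](benchmarks/benchmark_serving.py)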

 ### OpenAI-compatible service

 Start the service:

 ```bash
-VLLM_USE_FLASH_ATTN_PA=1 vllm serve /your/model/path --enforce-eager --dtype float16 --trust-remote-code
+ vllm serve /your/model/path --enforce-eager --dtype float16 --trust-remote-code
 ```

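 Here the argument after `serve` is the path of the model to load, and `--dtype` is the data type (float16). By default the chat template predefined in the tokenizer is used; `--chat-template` supplies a new template that overrides the default, `-q gptq` runs inference with a GPTQ-quantized model, and `-q awq` runs inference with an AWQ-quantized model.
@@ -231,7 +231,7 @@ ssh -L 8000:compute-node-IP:8000 -L 8001:compute-node-IP:8001 username@login-node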
 3. Start the OpenAI-compatible service

 ```
-VLLM_USE_FLASH_ATTN_PA=1 vllm serve /your/model/path --enforce-eager --dtype float16 --trust-remote-code --host "0.0.0.0"
+ vllm serve /your/model/path --enforce-eager --dtype float16 --trust-remote-code --host "0.0.0.0"
 ```

 4. Start the Gradio service