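Update model.properties to add the accelerator card type; update README.md to include the technical report link, hardware requirements, and pretrained weight information; and adjust the example commands to support multi-card inference and a GPU memory utilization setting.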

Commit a75f7b9d authored Oct 29, 2025 by laibao

Showing with 63 additions and 34 deletions

README.md +43 -31

examples/offline_inference/basic/basic.py +18 -3

model.properties +2 -0

--- a/README.md
+++ b/README.md
@@ -8,7 +8,7 @@
 ## Paper
-None
+[Qwen3 Technical Report](https://arxiv.org/abs/2505.09388)
 ## Model Architecture
@@ -28,6 +28,10 @@ Qwen3 is a decoder-only transformer model that uses the SwiGLU activation function, RoPE
 ## Environment Setup
+### Hardware Requirements
+DCU model: BW1000; nodes: 1; cards: 8
 ### Docker (Method 1)
 Pull the inference Docker image from [光源](https://www.sourcefind.cn/#/image/dcu/custom):
@@ -56,12 +60,12 @@ docker run -it --name qwen3_vllm --privileged --shm-size=64G  --device=/dev/kfd
 conda create -n qwen3_vllm python=3.10
 ```
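-The special deep learning libraries that this project needs for DCU cards can be downloaded and installed from the [光合](https://developer.hpccube.com/tool/) developer community.
+The special deep learning libraries that this project needs for DCU cards can be downloaded and installed from the [光合](https://developer.sourcefind.cn/tool/) developer community.
 * DTK driver: dtk25.04.01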
-* Pytorch: 2.4.0
+* Pytorch: 2.5.1
-* triton: 3.0.0
+* triton: 3.1
-* lmslim: 0.2.1
+* lmslim: 0.3.1
 * flash_attn: 2.6.1
 * flash_mla: 1.0.0
 * vllm: 0.9.2
@@ -85,55 +89,48 @@ export VLLM_RANK7_NUMA=7
 None
-## Inference
-### Model Download
-| Base model                                                  |
-| ----------------------------------------------------------- |
-| [Qwen3-0.6B](https://huggingface.co/Qwen/Qwen3-0.6B)           |
-| [Qwen3-1.7B](https://huggingface.co/Qwen/Qwen3-1.7B)           |
-| [Qwen3-4B](https://huggingface.co/Qwen/Qwen3-4B)               |
-| [Qwen3-8B](https://huggingface.co/Qwen/Qwen3-8B)               |
-| [Qwen3-14B](https://huggingface.co/Qwen/Qwen3-14B)             |
-| [Qwen3-32B](https://huggingface.co/Qwen/Qwen3-32B)             |
-| [Qwen3-30B-A3B](https://huggingface.co/Qwen/Qwen3-30B-A3B)     |
-| [Qwen3-235B-A22B](https://huggingface.co/Qwen/Qwen3-235B-A22B) |
+## Training
+None
+## Inference
+Using Qwen3-235B-A22B as an example
 ### Offline Batch Inference
 ```bash
- python examples/offline_inference/basic/basic.py
+ python examples/offline_inference/basic/basic.py -tp 8 --model_path xxx
 ```
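-This example script defines `prompts` directly in the code and sets `temperature=0.8`, `top_p=0.95`, and `max_tokens=16`; edit the script to change them. `model` is set in the script to a local model path; `tensor_parallel_size=1` means one card is used; `dtype="float16"` is the inference data type (adjust it if the weights are bfloat16). This example does not use the `quantization` parameter; for quantized inference, see the performance test examples below.
+This example script defines `prompts` directly in the code and sets `temperature=0.8`, `top_p=0.95`, and `max_tokens=16`; edit the script to change them. `--model_path` is the local model path; `-tp` sets `tensor_parallel_size`, which defaults to 1 (one card); `--dtype` is the inference data type and defaults to `float16` (adjust it if the weights are bfloat16). This example does not use the `quantization` parameter; for quantized inference, see the performance test examples below.
 ### Offline Batch Inference Performance Test
 1. Specify input and output lengths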
 ```bash
- python benchmarks/benchmark_throughput.py --num-prompts 1 --input-len 32 --output-len 128 --model /your/model/path -tp 1 --trust-remote-code --enforce-eager --dtype float16
+ python benchmarks/benchmark_throughput.py --num-prompts 1 --input-len 32 --output-len 128 --model /your/model/path -tp 8 --trust-remote-code --enforce-eager --dtype float16 --gpu-memory-utilization 0.98
 ```
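-Here `--num-prompts` is the batch size, `--input-len` is the input sequence length, `--output-len` is the number of output tokens, `--model` is the model path, `-tp` is the number of cards, and `dtype="float16"` is the inference data type. If the model weights are bfloat16, set `--dtype bfloat16` or use `--dtype auto` to match the weight precision. Setting `--output-len 1` measures first-token latency.
+Here `--num-prompts` is the batch size, `--input-len` is the input sequence length, `--output-len` is the number of output tokens, `--model` is the model path, `-tp` is the number of cards, and `--dtype float16` is the inference data type. If the model weights are bfloat16, set `--dtype bfloat16` or use `--dtype auto` to match the weight precision. Setting `--output-len 1` measures first-token latency. `--gpu-memory-utilization` is the fraction of GPU memory that vLLM may use.
 2. Use a dataset
 Download the dataset: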
 [sharegpt_v3_unfiltered_cleaned_split](https://huggingface.co/datasets/learnanything/sharegpt_v3_unfiltered_cleaned_split)
 ```bash
- python benchmarks/benchmark_throughput.py --num-prompts 1 --model /your/model/path --dataset-name sharegpt --dataset-path /path/to/ShareGPT_V3_unfiltered_cleaned_split.json -tp 1 --trust-remote-code --enforce-eager --dtype float16
+ python benchmarks/benchmark_throughput.py --num-prompts 1 --model /your/model/path --dataset-name sharegpt --dataset-path /path/to/ShareGPT_V3_unfiltered_cleaned_split.json -tp 8 --trust-remote-code --enforce-eager --dtype float16 --gpu-memory-utilization 0.98
 ```
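-Here `--num-prompts` is the batch size, `--model` is the model path, `--dataset` is the dataset to use, `-tp` is the number of cards, and `dtype="float16"` is the inference data type. If the model weights are bfloat16, set `--dtype bfloat16` or use `--dtype auto` to match the weight precision.
+Here `--num-prompts` is the batch size, `--model` is the model path, `--dataset-name` and `--dataset-path` select the dataset to use, `-tp` is the number of cards, and `--dtype float16` is the inference data type. If the model weights are bfloat16, set `--dtype bfloat16` or use `--dtype auto` to match the weight precision. `--gpu-memory-utilization` is the fraction of GPU memory that vLLM may use.
 ### OpenAI API Server Inference Performance Test
 1. Start the server: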
 ```bash
- vllm serve --model /your/model/path --enforce-eager --dtype float16 --trust-remote-code --tensor-parallel-size 1
+ vllm serve --model /your/model/path --enforce-eager --dtype float16 --trust-remote-code --tensor-parallel-size 8 --gpu-memory-utilization 0.98
 ```
 2. Start the client
@@ -149,10 +146,10 @@ python benchmarks/benchmark_serving.py --model /your/model/path --dataset-name s
 Start the server:
 ```bash
- vllm serve /your/model/path --enforce-eager --dtype float16 --trust-remote-code
+ vllm serve /your/model/path --enforce-eager --dtype float16 --trust-remote-code --tensor-parallel-size 8 --gpu-memory-utilization 0.98
 ```
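-Here the argument after `serve` is the path of the model to load, and `--dtype` is the data type (float16). By default, the chat template predefined in the tokenizer is used.
+Here `/your/model/path` is the local model path; `--dtype` specifies the inference data type; `--tensor-parallel-size` enables multi-card tensor parallelism; `--gpu-memory-utilization 0.98` sets GPU memory utilization to 98%. By default, the model's built-in chat template is used.
 ### Using the OpenAI Chat API with vLLM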
@@ -207,7 +204,7 @@ ssh -L 8000:计算节点IP:8000 -L 8001:计算节点IP:8001 用户名@登录节
 3. Start the OpenAI-compatible server
 ```
- vllm serve /your/model/path --enforce-eager --dtype float16 --trust-remote-code --host "0.0.0.0"
+ vllm serve /your/model/path --enforce-eager --dtype float16 --trust-remote-code --host "0.0.0.0" --gpu-memory-utilization 0.98 --tensor-parallel-size 8
 ```
 4. Start the Gradio service
@@ -222,15 +219,17 @@ python examples/online_serving/gradio_openai_chatbot_webserver.py --model "/your
 ## result
-Accelerator used: 1 × DCU-K100_AI-64G
 ```
 Prompt: 'What is deep learning?', Generated text: ' Deep learning is a subset of machine learning that involves the use of neural networks to model and solve complex problems. Neural networks are a network of interconnected nodes or " neurons" that are designed to recognize patterns in data, learn from examples, and make predictions or decisions.\nThe term "deep" in deep learning refers to the use of multiple layers or hidden layers in these neural networks. Each layer processes the input data in a different way, extracting increasingly abstract features as the data passes through.'
 ```
-### Accuracy
-None
+## Accuracy
+| Model           | Dataset   | Score |
+| --------------- | --------- | ----- |
+| Qwen3-235B-A22B | gsm8k     | 95.83 |
+| Qwen3-235B-A22B | math500   | 94.2  |
+| Qwen3-235B-A22B | humaneval | 95.73 |
 ## Application Scenarios
@@ -242,6 +241,19 @@ Prompt: 'What is deep learning?', Generated text: ' Deep learning is a subset of
 Finance, scientific research, education
+## Pretrained Weights
+| Base model                                                  |
+| ----------------------------------------------------------- |
+| [Qwen3-0.6B](https://huggingface.co/Qwen/Qwen3-0.6B)           |
+| [Qwen3-1.7B](https://huggingface.co/Qwen/Qwen3-1.7B)           |
+| [Qwen3-4B](https://huggingface.co/Qwen/Qwen3-4B)               |
+| [Qwen3-8B](https://huggingface.co/Qwen/Qwen3-8B)               |
+| [Qwen3-14B](https://huggingface.co/Qwen/Qwen3-14B)             |
+| [Qwen3-32B](https://huggingface.co/Qwen/Qwen3-32B)             |
+| [Qwen3-30B-A3B](https://huggingface.co/Qwen/Qwen3-30B-A3B)     |
+| [Qwen3-235B-A22B](https://huggingface.co/Qwen/Qwen3-235B-A22B) |
 ## Source Repository and Issue Feedback
 * [https://developer.hpccube.com/codes/modelzoo/qwen3_vllm](https://developer.hpccube.com/codes/modelzoo/qwen3_vllm)
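
The README changes above move the example commands from `-tp 1` to `-tp 8` for Qwen3-235B-A22B. As a rough sanity check, the sketch below estimates the per-card weight footprint under tensor parallelism. It assumes about 235 billion parameters (read off the model name), 2 bytes per parameter for float16, and an even split of weights across cards. The 64 GiB per-card capacity is a hypothetical figure used only for illustration, and KV cache, activations, and runtime overhead are ignored.

```python
GIB = 1024 ** 3


def weight_bytes_per_card(num_params: float, bytes_per_param: int, tp_size: int) -> float:
    """Approximate weight bytes held by each card when weights are split evenly."""
    return num_params * bytes_per_param / tp_size


def memory_budget_bytes(card_capacity_bytes: float, gpu_memory_utilization: float) -> float:
    """Bytes available to the engine on one card for a given utilization fraction."""
    return card_capacity_bytes * gpu_memory_utilization


NUM_PARAMS = 235e9            # assumption: taken from the model name
BYTES_PER_PARAM = 2           # float16
CARD_CAPACITY = 64 * GIB      # hypothetical per-card memory, for illustration only
UTILIZATION = 0.98            # value used in the example commands

budget = memory_budget_bytes(CARD_CAPACITY, UTILIZATION)
for tp in (1, 8):
    weights = weight_bytes_per_card(NUM_PARAMS, BYTES_PER_PARAM, tp)
    fits = weights < budget
    print(f"tp={tp}: weights {weights / GIB:.1f} GiB per card, "
          f"budget {budget / GIB:.1f} GiB, fits={fits}")
```

Under these assumptions the weights alone exceed a single card's budget at `tp=1` and fit at `tp=8`, which matches the direction of the change. Real headroom depends on the actual card memory and on KV cache size.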

--- a/examples/offline_inference/basic/basic.py
+++ b/examples/offline_inference/basic/basic.py
 # SPDX-License-Identifier: Apache-2.0
 # SPDX-FileCopyrightText: Copyright contributors to the vLLM project
+import argparse
 from vllm import LLM, SamplingParams
 # Sample prompts.
@@ -15,9 +16,11 @@ prompts = [
 sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=16)
-def main():
+def main(model_path, tensor_parallel_size, gpu_memory_utilization, dtype):
    # Create an LLM.
-    llm = LLM(model="/mnt/data/llm-models/qwen3/Qwen3-8B",tensor_parallel_size=1, dtype="float16",trust_remote_code=True, enforce_eager=True, block_size=64, enable_prefix_caching=False)
+    llm = LLM(model=model_path, tensor_parallel_size=tensor_parallel_size, dtype=dtype,
+              trust_remote_code=True, enforce_eager=True, block_size=64,
+              enable_prefix_caching=False, gpu_memory_utilization=gpu_memory_utilization)
    # Generate texts from the prompts.
    # The output is a list of RequestOutput objects
    # that contain the prompt, generated text, and other information.
@@ -33,5 +36,17 @@ def main():
 if __name__ == "__main__":
-    main()
+    parser = argparse.ArgumentParser(description="vLLM Offline Inference Example")
+    parser.add_argument("--model_path", type=str, default="/mnt/data/llm-models/qwen3/Qwen3-8B",
+                        help="Path to the model")
+    parser.add_argument("-tp", "--tp", "--tensor_parallel_size", dest="tensor_parallel_size",
+                        type=int, default=1, help="Tensor parallel size")
+    parser.add_argument("--gpu_memory_utilization", type=float, default=0.98,
+                        help="GPU memory utilization (0.0-1.0)")
+    parser.add_argument("--dtype", type=str, default="float16",
+                        choices=["auto", "float16", "bfloat16", "float32"],
+                        help="Data type for model weights and activations")
+    args = parser.parse_args()
+    main(args.model_path, args.tensor_parallel_size, args.gpu_memory_utilization, args.dtype)
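
A note on the argument parsing in this hunk: when `dest` is not given, `argparse` derives the attribute name from the first long option string, so `add_argument("--tp", "--tensor_parallel_size", ...)` stores the value as `args.tp`, not `args.tensor_parallel_size`. The minimal sketch below shows that behavior next to a variant with an explicit `dest` and an extra `-tp` alias. The parser and function names are hypothetical and are not part of the commit.

```python
import argparse


def build_implicit_parser() -> argparse.ArgumentParser:
    """Parser without an explicit dest: the attribute name comes from "--tp"."""
    parser = argparse.ArgumentParser(description="implicit dest demo")
    parser.add_argument("--tp", "--tensor_parallel_size", type=int, default=1)
    return parser


def build_explicit_parser() -> argparse.ArgumentParser:
    """Parser with an explicit dest and a single-dash "-tp" alias."""
    parser = argparse.ArgumentParser(description="explicit dest demo")
    parser.add_argument("-tp", "--tp", "--tensor_parallel_size",
                        dest="tensor_parallel_size", type=int, default=1)
    return parser


implicit = build_implicit_parser().parse_args(["--tp", "8"])
explicit = build_explicit_parser().parse_args(["-tp", "8"])

# The implicit parser exposes only `tp`; the explicit one exposes the long name.
print(hasattr(implicit, "tensor_parallel_size"), implicit.tp)
print(explicit.tensor_parallel_size)
```

With the explicit `dest`, the three spellings `-tp`, `--tp`, and `--tensor_parallel_size` all populate the same attribute.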
--- a/model.properties
+++ b/model.properties
@@ -8,3 +8,5 @@ modelDescription=Qwen3是阿里云开源大型语言模型系列。
 appScenario=推理,对话问答,科研,教育,政府,金融
 # Framework type
 frameType=vllm
+# Accelerator type
+accelerateType=BW1000,K100A
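
The new `accelerateType` entry holds a comma-separated list. As an illustration of how such a value could be read, here is a minimal sketch that assumes `model.properties` is a plain `key=value` file with `#` comment lines. How the hosting platform actually consumes the file is not shown in this commit, and `parse_properties` is a hypothetical helper.

```python
def parse_properties(text: str) -> dict[str, str]:
    """Parse simple key=value lines, skipping blank lines and # comments."""
    props: dict[str, str] = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:  # ignore lines that contain no "="
            props[key.strip()] = value.strip()
    return props


# Sample content mirroring the two entries visible in the diff.
sample = """\
# Framework type
frameType=vllm
# Accelerator type
accelerateType=BW1000,K100A
"""

props = parse_properties(sample)
accelerators = [name.strip() for name in props["accelerateType"].split(",")]
print(props["frameType"], accelerators)
```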