Update README.md: fix the Mixture of Experts link format, update the Docker image and vllm library versions, and simplify the inference example commands.

28a51b04 · xuxz · 52f97799 · 28a51b04
Commit 28a51b04 authored Oct 10, 2025 by xuxz

Showing with 12 additions and 15 deletions

README.md +12 -15

--- a/README.md
+++ b/README.md
@@ -8,9 +8,7 @@
 ## Paper
-`Mixture of Experts`
+[Mixture of Experts](https://arxiv.org/pdf/2401.04088)
-[https://arxiv.org/pdf/2401.04088]()
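 ## Model Architecture
@@ -38,7 +36,7 @@ Mixtral is a sparse mixture-of-experts (SMoE) language model that, in each layer, contains
 The inference Docker image can be pulled from [光源](https://www.sourcefind.cn/#/image/dcu/custom):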
 ```
-docker pull image.sourcefind.cn:5000/dcu/admin/base/custom:vllm0.8.5-ubuntu22.04-dtk25.04.1-rc5-das1.6-py3.10-20250724
+docker pull image.sourcefind.cn:5000/dcu/admin/base/custom:vllm0.9.2-ubuntu22.04-dtk25.04.1-rc5-rocblas104381-0915-das1.6-py3.10-20250916-rc2
 # <Image ID>: replace with the ID of the Docker image pulled above
 # <Host Path>: path on the host
 # <Container Path>: mapped path inside the container
@@ -54,7 +52,7 @@ docker run -it --name mixtral_vllm --privileged --shm-size=64G  --device=/dev/kf
 # <Host Path>: path on the host
 # <Container Path>: mapped path inside the container
 docker build -t mixtral:latest .
-docker run -it --name mixtral_vllm --privileged --shm-size=64G  --device=/dev/kfd --device=/dev/dri/ --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --ulimit memlock=-1:-1 --ipc=host --network host --group-add video -v /opt/hyhal:/opt/hyhal:ro -v <Host Path>:<Container Path> llama:latest /bin/bash
+docker run -it --name mixtral_vllm --privileged --shm-size=64G  --device=/dev/kfd --device=/dev/dri/ --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --ulimit memlock=-1:-1 --ipc=host --network host --group-add video -v /opt/hyhal:/opt/hyhal:ro -v <Host Path>:<Container Path> mixtral:latest /bin/bash
 ```
 ### Anaconda (Method 3)
@@ -71,7 +69,7 @@ conda create -n mixtral_vllm python=3.10
 * lmslim: 0.2.1
 * flash_attn: 2.6.1
 * flash_mla: 1.0.0
-* vllm: 0.8.5
+* vllm: 0.9.2
 * python: python3.10
 `Tips: install the related dependencies first, and install the vllm package last`  
@@ -103,18 +101,17 @@ export VLLM_RANK7_NUMA=7
 ### Offline Batch Inference
 ```bash
-VLLM_USE_FLASH_ATTN_PA=1 python examples/offline_inference/basic/basic.py
+ python examples/offline_inference/basic/basic.py
 ```
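-Here, `prompts` is the list of prompts; `temperature` controls sampling randomness: lower values make generation more deterministic, higher values make it more random, 0 means greedy sampling, and the default is 1; `max_tokens=16` is the generation length, default 1;
+Here, the example script defines `prompts` directly in code and sets `temperature=0.8`, `top_p=0.95`, and `max_tokens=16`; edit the parameters in the script to change them. `model` is set in the script to a local model path; `tensor_parallel_size=1` means one card is used; `dtype="float16"` is the inference data type (adjust accordingly if the weights are bfloat16). This example does not use the `quantization` parameter; for quantized inference, see the performance test examples below.
-`model` is the model path; `tensor_parallel_size=1` is the number of cards used, default 1; `dtype="float16"` is the inference data type, and if the model weights are bfloat16 it must be changed to float16 for inference; `quantization="gptq"` runs inference with GPTQ quantization and requires downloading the GPTQ model above.
 ### Offline Batch Inference Performance Test
 1. Specify input and output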
 ```bash
-VLLM_USE_FLASH_ATTN_PA=1 python benchmarks/benchmark_throughput.py --num-prompts 1 --input-len 32 --output-len 128 --model mixtral/Mixtral-8x7B-Instruct-v0.1 -tp 4 --trust-remote-code --enforce-eager --dtype float16
+ python benchmarks/benchmark_throughput.py --num-prompts 1 --input-len 32 --output-len 128 --model mixtral/Mixtral-8x7B-Instruct-v0.1 -tp 4 --trust-remote-code --enforce-eager --dtype float16
 ```
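 Here `--num-prompts` is the batch size, `--input-len` is the input sequence length, `--output-len` is the number of output tokens, `--model` is the model path, `-tp` is the number of cards used, and `dtype="float16"` is the inference data type; if the model weights are bfloat16, it must be changed to float16 for inference. Setting `--output-len  1` measures first-token latency. `-q gptq` runs inference with a GPTQ-quantized model.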
@@ -125,7 +122,7 @@ VLLM_USE_FLASH_ATTN_PA=1 python benchmarks/benchmark_throughput.py --num-prompts
 [sharegpt_v3_unfiltered_cleaned_split](https://huggingface.co/datasets/learnanything/sharegpt_v3_unfiltered_cleaned_split)
 ```bash
-VLLM_USE_FLASH_ATTN_PA=1 python benchmarks/benchmark_throughput.py --num-prompts 1 --model mixtral/Mixtral-8x7B-Instruct-v0.1 --dataset-name sharegpt --dataset-path /path/to/ShareGPT_V3_unfiltered_cleaned_split.json -tp 4 --trust-remote-code --enforce-eager --dtype float16
+ python benchmarks/benchmark_throughput.py --num-prompts 1 --model mixtral/Mixtral-8x7B-Instruct-v0.1 --dataset-name sharegpt --dataset-path /path/to/ShareGPT_V3_unfiltered_cleaned_split.json -tp 4 --trust-remote-code --enforce-eager --dtype float16
 ```
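 Here `--num-prompts` is the batch size, `--model` is the model path, `--dataset` is the dataset to use, `-tp` is the number of cards used, and `dtype="float16"` is the inference data type; if the model weights are bfloat16, it must be changed to float16 for inference. `-q gptq` runs inference with a GPTQ-quantized model.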
@@ -135,7 +132,7 @@ VLLM_USE_FLASH_ATTN_PA=1 python benchmarks/benchmark_throughput.py --num-prompts
 1. Start the server:
 ```bash
-VLLM_USE_FLASH_ATTN_PA=1 vllm serve --model mixtral/Mixtral-8x7B-Instruct-v0.1 --enforce-eager --dtype float16 --trust-remote-code --tensor-parallel-size 4
+ vllm serve --model mixtral/Mixtral-8x7B-Instruct-v0.1 --enforce-eager --dtype float16 --trust-remote-code --tensor-parallel-size 4
 ```
 2. Start the client:
@@ -144,14 +141,14 @@ VLLM_USE_FLASH_ATTN_PA=1 vllm serve --model mixtral/Mixtral-8x7B-Instruct-v0.1 -
 python benchmarks/benchmark_serving.py --model mixtral/Mixtral-8x7B-Instruct-v0.1 --dataset-name sharegpt --dataset-path /path/to/ShareGPT_V3_unfiltered_cleaned_split.json  --num-prompts 1 --trust-remote-code
 ```
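-The parameters are the same as in the dataset-based offline batch inference performance test; for details, see [benchmarks/benchmark_serving.py]（benchmarks/benchmark_serving.py）
+The parameters are the same as in the dataset-based offline batch inference performance test; for details, see [benchmarks/benchmark_serving.py](benchmarks/benchmark_serving.py)
 ### OpenAI-Compatible Service
 Start the service: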
 ```bash
-VLLM_USE_FLASH_ATTN_PA=1 vllm serve mixtral/Mixtral-8x7B-Instruct-v0.1 --enforce-eager --dtype float16 --trust-remote-code
+ vllm serve mixtral/Mixtral-8x7B-Instruct-v0.1 --enforce-eager --dtype float16 --trust-remote-code
 ```
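 Here, the argument after `serve` is the path of the model to load; `--dtype` is the data type (float16); by default the chat template predefined in the tokenizer is used, and `--chat-template` can supply a new template that overrides the default; `-q gptq` runs inference with a GPTQ-quantized model.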
@@ -230,7 +227,7 @@ ssh -L 8000:计算节点IP:8000 -L 8001:计算节点IP:8001 用户名@登录节
 3. Start the OpenAI-compatible service
 ```
-VLLM_USE_FLASH_ATTN_PA=1 vllm serve mixtral/Mixtral-8x7B-Instruct-v0.1 --enforce-eager --dtype float16 --trust-remote-code --host "0.0.0.0"
+ vllm serve mixtral/Mixtral-8x7B-Instruct-v0.1 --enforce-eager --dtype float16 --trust-remote-code --host "0.0.0.0"
 ```
 4. Start the gradio service