Commit 9d6c8867 authored by chenych's avatar chenych

Add vllm serve

parent 8011b5aa
@@ -61,6 +61,7 @@ pip install transformers>=4.51.0
## Inference
### Inference with vLLM
#### offline
```bash
## The HF_ENDPOINT environment variable is required
export HF_ENDPOINT=https://hf-mirror.com
@@ -68,6 +69,37 @@ export HF_ENDPOINT=https://hf-mirror.com
python infer_vllm.py --model_name_or_path /path/your_model_path/
```
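Once scores come back from offline inference, the usual next step is to rank the candidate documents by relevance. Below is a minimal sketch: the ranking helper is plain Python, while the guarded `__main__` section assumes vLLM's `LLM(..., task="score")` offline API (the repo's `infer_vllm.py` may differ in detail).

```python
"""Offline rerank sketch. The vLLM call is an assumption based on the
`score` task shown in the serve command; adapt it to infer_vllm.py."""
from typing import List, Tuple


def rank_documents(docs: List[str], scores: List[float]) -> List[Tuple[str, float]]:
    # Pair each document with its relevance score and sort descending,
    # so the most relevant document comes first.
    return sorted(zip(docs, scores), key=lambda p: p[1], reverse=True)


if __name__ == "__main__":
    # Requires a GPU environment with vLLM installed.
    from vllm import LLM

    llm = LLM(model="/path/your_model_path/", task="score")
    query = "What is the capital of France?"
    docs = ["The capital of France is Paris.", "Berlin is in Germany."]
    outputs = llm.score([query] * len(docs), docs)
    scores = [o.outputs.score for o in outputs]
    for doc, score in rank_documents(docs, scores):
        print(f"{score:.4f}  {doc}")
```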
#### server
vLLM added an `hf-overrides` mechanism for the official Qwen3 rerank models; for the details, see [[New Model]: Support Qwen3 Embedding & Reranker by noooop · Pull Request #19260 · vllm-project/vllm](https://github.com/vllm-project/vllm/pull/19260)
1. Start the server with:
```bash
vllm serve Qwen/Qwen3-Reranker-0.6B \
--host 0.0.0.0 --port 8080 --block-size 16 \
--api-key 123456 --dtype auto \
--trust-remote-code \
--served-model-name Qwen3-reranker \
--enable-prefix-caching \
--gpu-memory-utilization 0.9 \
--task score --disable-log-requests \
  --hf-overrides '{"architectures":["Qwen3ForSequenceClassification"],"classifier_from_token": ["no", "yes"],"is_original_qwen3_reranker": true}'
```
2. Query the server with:
```bash
curl -X 'POST' 'http://127.0.0.1:8080/score' \
-H 'accept: application/json' \
-H 'Authorization: Bearer 123456' \
-H 'Content-Type: application/json' \
-d '{
"model": "Qwen3-reranker",
"encoding_format": "float",
"text_1": "What is the capital of France?",
"text_2": "The capital of France is Paris."
}'
```
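The same request can be issued from Python. The sketch below builds the request with only the standard library, assuming the server from step 1 is running at `127.0.0.1:8080` with api-key `123456`; the response shape follows vLLM's score endpoint, so inspect it rather than relying on the exact fields.

```python
"""Python equivalent of the curl call above (standard library only)."""
import json
import urllib.request


def build_score_request(query: str, document: str,
                        model: str = "Qwen3-reranker",
                        api_key: str = "123456",
                        base_url: str = "http://127.0.0.1:8080") -> urllib.request.Request:
    # Mirror the JSON body and headers from the curl example.
    payload = {
        "model": model,
        "encoding_format": "float",
        "text_1": query,
        "text_2": document,
    }
    return urllib.request.Request(
        f"{base_url}/score",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
    )


if __name__ == "__main__":
    req = build_score_request("What is the capital of France?",
                              "The capital of France is Paris.")
    with urllib.request.urlopen(req) as resp:
        # The body contains the relevance score(s) for the text pair.
        print(json.load(resp))
```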
## Results
<div align=center>
<img src="./doc/results-dcu.png"/>