Commit 8bd58449 (unverified), authored by Christian Berge, committed by GitHub

correct LWS deployment yaml (#23104)


Signed-off-by: cberge908 <42270330+cberge908@users.noreply.github.com>
parent ce30dca5
@@ -22,7 +22,7 @@ Deploy the following yaml file `lws.yaml`
 metadata:
 name: vllm
 spec:
-replicas: 2
+replicas: 1
 leaderWorkerTemplate:
 size: 2
 restartPolicy: RecreateGroupOnPodRestart
@@ -41,7 +41,7 @@ Deploy the following yaml file `lws.yaml`
 - sh
 - -c
 - "bash /vllm-workspace/examples/online_serving/multi-node-serving.sh leader --ray_cluster_size=$(LWS_GROUP_SIZE);
-python3 -m vllm.entrypoints.openai.api_server --port 8080 --model meta-llama/Meta-Llama-3.1-405B-Instruct --tensor-parallel-size 8 --pipeline_parallel_size 2"
+vllm serve meta-llama/Meta-Llama-3.1-405B-Instruct --port 8080 --tensor-parallel-size 8 --pipeline_parallel_size 2"
 resources:
 limits:
 nvidia.com/gpu: "8"
@@ -126,8 +126,6 @@ Should get an output similar to this:
 NAME       READY   STATUS    RESTARTS   AGE
 vllm-0     1/1     Running   0          2s
 vllm-0-1   1/1     Running   0          2s
-vllm-1     1/1     Running   0          2s
-vllm-1-1   1/1     Running   0          2s
 ```
 Verify that the distributed tensor-parallel inference works:
......
@@ -11,7 +11,7 @@
 # Example usage:
 # On the head node machine, start the Ray head node process and run a vLLM server.
 # ./multi-node-serving.sh leader --ray_port=6379 --ray_cluster_size=<SIZE> [<extra ray args>] && \
-# python3 -m vllm.entrypoints.openai.api_server --port 8080 --model meta-llama/Meta-Llama-3.1-405B-Instruct --tensor-parallel-size 8 --pipeline_parallel_size 2
+# vllm serve meta-llama/Meta-Llama-3.1-405B-Instruct --port 8080 --tensor-parallel-size 8 --pipeline_parallel_size 2
 #
 # On each worker node, start the Ray worker node process.
 # ./multi-node-serving.sh worker --ray_address=<HEAD_NODE_IP> --ray_port=6379 [<extra ray args>]
......
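
Review note: a minimal sketch for sanity-checking the corrected manifest against the sample output. It assumes the pod naming pattern visible in the diff (`<name>-<group>` for a leader, `<name>-<group>-<worker>` for workers numbered from 1) and that one vLLM instance needs tensor-parallel size times pipeline-parallel size GPUs. The helper names `expected_pods` and `gpus_required` are hypothetical and exist only for this illustration.

```python
def expected_pods(name: str, replicas: int, size: int) -> list[str]:
    """List pod names for `replicas` groups of `size` pods each.

    The naming pattern is inferred from the sample `kubectl get pods`
    output in this diff; it is an assumption, not taken from LWS docs.
    """
    pods = []
    for group in range(replicas):
        pods.append(f"{name}-{group}")  # leader pod of the group
        # Workers are numbered from 1 within their group.
        pods.extend(f"{name}-{group}-{worker}" for worker in range(1, size))
    return pods


def gpus_required(tensor_parallel: int, pipeline_parallel: int) -> int:
    """GPUs one vLLM instance needs: tensor-parallel ranks per pipeline stage."""
    return tensor_parallel * pipeline_parallel


GPUS_PER_POD = 8  # from `nvidia.com/gpu: "8"` in the manifest

print(expected_pods("vllm", replicas=1, size=2))
# → ['vllm-0', 'vllm-0-1']

# One group of size 2 offers 2 * 8 = 16 GPUs, which matches
# --tensor-parallel-size 8 --pipeline_parallel_size 2.
assert gpus_required(8, 2) == 2 * GPUS_PER_POD
```

With `replicas: 2`, the same helper yields the four pods shown on the old side of the diff, that is, two independent 16-GPU groups rather than one.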