OpenDAS / vllm_cscc

Commit 8bd58449 (Unverified)
Authored Sep 02, 2025 by Christian Berge, committed by GitHub on Sep 02, 2025

correct LWS deployment yaml (#23104)

Signed-off-by: cberge908 <42270330+cberge908@users.noreply.github.com>

Parent: ce30dca5
2 changed files with 3 additions and 5 deletions (+3 -5):
  docs/deployment/frameworks/lws.md  +2 -4
  examples/online_serving/multi-node-serving.sh  +1 -1
docs/deployment/frameworks/lws.md

@@ -22,7 +22,7 @@ Deploy the following yaml file `lws.yaml`
 metadata:
   name: vllm
 spec:
-  replicas: 2
+  replicas: 1
   leaderWorkerTemplate:
     size: 2
     restartPolicy: RecreateGroupOnPodRestart

@@ -41,7 +41,7 @@ Deploy the following yaml file `lws.yaml`
               - sh
               - -c
               - "bash /vllm-workspace/examples/online_serving/multi-node-serving.sh leader --ray_cluster_size=$(LWS_GROUP_SIZE);
-                python3 -m vllm.entrypoints.openai.api_server --port 8080 --model meta-llama/Meta-Llama-3.1-405B-Instruct --tensor-parallel-size 8 --pipeline_parallel_size 2"
+                vllm serve meta-llama/Meta-Llama-3.1-405B-Instruct --port 8080 --tensor-parallel-size 8 --pipeline_parallel_size 2"
             resources:
               limits:
                 nvidia.com/gpu: "8"

@@ -126,8 +126,6 @@ Should get an output similar to this:
 NAME       READY   STATUS    RESTARTS   AGE
 vllm-0     1/1     Running   0          2s
 vllm-0-1   1/1     Running   0          2s
-vllm-1     1/1     Running   0          2s
-vllm-1-1   1/1     Running   0          2s
 ```

 Verify that the distributed tensor-parallel inference works:
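
The pod listing shrinks because of `replicas: 1`: a LeaderWorkerSet creates `replicas` groups of `size` pods each, so the example now yields one group of two pods instead of two groups. The naming pattern used below (leader `<name>-<group>`, workers `<name>-<group>-<index>` for index 1 to size-1) is inferred from the `kubectl get pods` output in the docs, and `lws_pod_names` is a hypothetical helper, not part of LWS or vLLM. A minimal sketch under those assumptions:

```shell
#!/bin/sh
# Hypothetical helper (not part of LWS or vLLM): print the pod names a
# LeaderWorkerSet named $1 with $2 replicas and group size $3 is expected
# to create. Naming pattern inferred from the kubectl output in lws.md:
#   leader pod:  <name>-<group>
#   worker pods: <name>-<group>-<index>, index = 1 .. size-1
lws_pod_names() (
  name=$1 replicas=$2 size=$3
  g=0
  while [ "$g" -lt "$replicas" ]; do
    echo "${name}-${g}"            # leader of group g
    w=1
    while [ "$w" -lt "$size" ]; do
      echo "${name}-${g}-${w}"     # worker w of group g
      w=$((w + 1))
    done
    g=$((g + 1))
  done
)

# Corrected manifest: replicas 1, size 2
lws_pod_names vllm 1 2
```

Running `lws_pod_names vllm 2 2` reproduces the four pods of the old listing (`vllm-0`, `vllm-0-1`, `vllm-1`, `vllm-1-1`), which is why the `vllm-1` and `vllm-1-1` lines were dropped.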
examples/online_serving/multi-node-serving.sh

@@ -11,7 +11,7 @@
 # Example usage:
 # On the head node machine, start the Ray head node process and run a vLLM server.
 # ./multi-node-serving.sh leader --ray_port=6379 --ray_cluster_size=<SIZE> [<extra ray args>] && \
-# python3 -m vllm.entrypoints.openai.api_server --port 8080 --model meta-llama/Meta-Llama-3.1-405B-Instruct --tensor-parallel-size 8 --pipeline_parallel_size 2
+# vllm serve meta-llama/Meta-Llama-3.1-405B-Instruct --port 8080 --tensor-parallel-size 8 --pipeline_parallel_size 2
 #
 # On each worker node, start the Ray worker node process.
 # ./multi-node-serving.sh worker --ray_address=<HEAD_NODE_IP> --ray_port=6379 [<extra ray args>]
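
The changed command lines only swap the entry point from `python3 -m vllm.entrypoints.openai.api_server` to the `vllm serve` CLI, which starts the same OpenAI-compatible server; the parallelism flags are unchanged. They ask for 8 × 2 = 16 GPUs, which lines up with one LWS group of `size: 2` pods with `nvidia.com/gpu: "8"` each. In the manifest, `$(LWS_GROUP_SIZE)` uses Kubernetes' `$(VAR)` expansion for `command`, resolved before `sh -c` runs, so the leader passes `--ray_cluster_size=2`. The sketch below checks that arithmetic; `check_parallelism` is a hypothetical helper, and it assumes one vLLM worker per GPU and treats an exact match as the intended layout.

```shell
#!/bin/sh
# Hypothetical sanity check (not part of vLLM or LWS).
# Assumption: one vLLM worker per GPU, and the group should offer exactly
# TP x PP GPUs (any extra GPUs would sit idle, too few cannot host the model).
check_parallelism() (
  tp=$1 pp=$2 group_size=$3 gpus_per_pod=$4
  required=$((tp * pp))                    # GPUs requested by TP x PP
  available=$((group_size * gpus_per_pod)) # GPUs one LWS group provides
  if [ "$required" -ne "$available" ]; then
    echo "mismatch: need $required GPUs, group provides $available" >&2
    exit 1  # the ( ) body runs in a subshell, so this only ends the function
  fi
  echo "ok: $required GPUs"
)

# Values from lws.yaml: TP 8, PP 2, size 2, nvidia.com/gpu "8"
check_parallelism 8 2 2 8
```

Using a subshell body keeps the helper's variables from leaking into the calling shell, which matters if it is sourced from a larger launch script.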