OpenDAS / vllm_cscc, commit 7297941b
Authored Mar 20, 2025 by Edwin Hernandez; committed by GitHub, Mar 20, 2025

[Doc] Update LWS docs (#15163)

Signed-off-by: Edwinhr716 <Edandres249@gmail.com>

Parent: f8a08cb9
Showing 1 changed file with 189 additions and 2 deletions: docs/source/deployment/frameworks/lws.md
@@ -7,5 +7,192 @@ A major use case is for multi-host/multi-node distributed inference.
vLLM can be deployed with [LWS](https://github.com/kubernetes-sigs/lws) on Kubernetes for distributed model serving.

Please see [this guide](https://github.com/kubernetes-sigs/lws/tree/main/docs/examples/vllm) for more details on deploying vLLM on Kubernetes using LWS.

## Prerequisites

* At least two Kubernetes nodes, each with 8 GPUs, are required.
* Install LWS by following the instructions found [here](https://lws.sigs.k8s.io/docs/installation/).
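
The node and GPU requirement follows from the parallelism settings in `lws.yaml` in the next section. The sketch below reproduces that arithmetic; the constants are copied from the manifest, while the function names are hypothetical helpers written for this illustration. It assumes that the engine needs one GPU per tensor-parallel rank per pipeline stage, and that a pod requesting 8 GPUs occupies a whole 8-GPU node.

```python
# Values copied from lws.yaml in this guide.
REPLICAS = 2           # spec.replicas: number of leader/worker groups
GROUP_SIZE = 2         # leaderWorkerTemplate.size: pods per group, leader included
GPUS_PER_POD = 8       # nvidia.com/gpu limit on each container
TENSOR_PARALLEL = 8    # --tensor-parallel-size
PIPELINE_PARALLEL = 2  # --pipeline_parallel_size


def gpus_required_by_engine(tensor_parallel: int, pipeline_parallel: int) -> int:
    """GPUs one model instance needs: one per tensor-parallel rank per pipeline stage."""
    return tensor_parallel * pipeline_parallel


def gpus_provided_by_group(group_size: int, gpus_per_pod: int) -> int:
    """GPUs available to one leader/worker group."""
    return group_size * gpus_per_pod


def pods_created(replicas: int, group_size: int) -> int:
    """Total pods the LeaderWorkerSet creates across all groups."""
    return replicas * group_size


needed = gpus_required_by_engine(TENSOR_PARALLEL, PIPELINE_PARALLEL)
available = gpus_provided_by_group(GROUP_SIZE, GPUS_PER_POD)
print(f"GPUs needed per model instance: {needed}")
print(f"GPUs provided per group:        {available}")
print(f"Pods created in total:          {pods_created(REPLICAS, GROUP_SIZE)}")
```

One group therefore fits on two 8-GPU nodes. Note that with `replicas: 2` the manifest creates two such groups (four pods in total), so on this reading of the manifest, scheduling every pod would take four 8-GPU nodes; set `replicas: 1` if only two nodes are available.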
## Deploy and Serve
Deploy the following YAML file, `lws.yaml`:
```yaml
apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: vllm
spec:
  replicas: 2
  leaderWorkerTemplate:
    size: 2
    restartPolicy: RecreateGroupOnPodRestart
    leaderTemplate:
      metadata:
        labels:
          role: leader
      spec:
        containers:
          - name: vllm-leader
            image: docker.io/vllm/vllm-openai:latest
            env:
              - name: HUGGING_FACE_HUB_TOKEN
                value: <your-hf-token>
            command:
              - sh
              - -c
              - "bash /vllm-workspace/examples/online_serving/multi-node-serving.sh leader --ray_cluster_size=$(LWS_GROUP_SIZE);
                 python3 -m vllm.entrypoints.openai.api_server --port 8080 --model meta-llama/Meta-Llama-3.1-405B-Instruct --tensor-parallel-size 8 --pipeline_parallel_size 2"
            resources:
              limits:
                nvidia.com/gpu: "8"
                memory: 1124Gi
                ephemeral-storage: 800Gi
              requests:
                ephemeral-storage: 800Gi
                cpu: 125
            ports:
              - containerPort: 8080
            readinessProbe:
              tcpSocket:
                port: 8080
              initialDelaySeconds: 15
              periodSeconds: 10
            volumeMounts:
              - mountPath: /dev/shm
                name: dshm
        volumes:
          - name: dshm
            emptyDir:
              medium: Memory
              sizeLimit: 15Gi
    workerTemplate:
      spec:
        containers:
          - name: vllm-worker
            image: docker.io/vllm/vllm-openai:latest
            command:
              - sh
              - -c
              - "bash /vllm-workspace/examples/online_serving/multi-node-serving.sh worker --ray_address=$(LWS_LEADER_ADDRESS)"
            resources:
              limits:
                nvidia.com/gpu: "8"
                memory: 1124Gi
                ephemeral-storage: 800Gi
              requests:
                ephemeral-storage: 800Gi
                cpu: 125
            env:
              - name: HUGGING_FACE_HUB_TOKEN
                value: <your-hf-token>
            volumeMounts:
              - mountPath: /dev/shm
                name: dshm
        volumes:
          - name: dshm
            emptyDir:
              medium: Memory
              sizeLimit: 15Gi
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-leader
spec:
  ports:
    - name: http
      port: 8080
      protocol: TCP
      targetPort: 8080
  selector:
    leaderworkerset.sigs.k8s.io/name: vllm
    role: leader
  type: ClusterIP
```
```bash
kubectl apply -f lws.yaml
```
Verify the status of the pods:
```bash
kubectl get pods
```
You should see output similar to this:
```bash
NAME       READY   STATUS    RESTARTS   AGE
vllm-0     1/1     Running   0          2s
vllm-0-1   1/1     Running   0          2s
vllm-1     1/1     Running   0          2s
vllm-1-1   1/1     Running   0          2s
```
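
The pod names in this listing follow a pattern that can be read off the sample output: each group's leader is named `<name>-<group>` and its workers `<name>-<group>-<index>`, with worker indices starting at 1. As a minimal sketch, assuming that pattern holds in general (it is inferred from the output above, not taken from the LWS specification), the expected names for a given `replicas` and `size` can be generated as follows. `expected_pod_names` is a hypothetical helper.

```python
from typing import List


def expected_pod_names(name: str, replicas: int, size: int) -> List[str]:
    """List the pod names expected for a LeaderWorkerSet.

    Assumes the naming pattern seen in the sample `kubectl get pods`
    output: leader `<name>-<group>`, workers `<name>-<group>-<index>`.
    """
    names = []
    for group in range(replicas):
        names.append(f"{name}-{group}")  # leader pod of this group
        for worker in range(1, size):    # remaining size - 1 pods are workers
            names.append(f"{name}-{group}-{worker}")
    return names


# Values from lws.yaml: metadata.name=vllm, replicas=2, size=2.
for pod in expected_pod_names("vllm", replicas=2, size=2):
    print(pod)
```

Comparing this list against `kubectl get pods` is a quick way to spot a group whose worker never got scheduled.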
Verify that the distributed tensor-parallel inference works:
```bash
kubectl logs vllm-0 | grep -i "Loading model weights took"
```
You should see something similar to this:
```text
INFO 05-08 03:20:24 model_runner.py:173] Loading model weights took 0.1189 GB
(RayWorkerWrapper pid=169, ip=10.20.0.197) INFO 05-08 03:20:28 model_runner.py:173] Loading model weights took 0.1189 GB
```
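
When a group has many workers, it can be easier to summarize these log lines than to read them. The sketch below extracts the reported sizes and counts the lines emitted by Ray workers. It assumes the log format matches the sample above (the `Loading model weights took ... GB` message and the `(RayWorkerWrapper` prefix); both function names are hypothetical.

```python
import re
from typing import List

# Matches the message shown in the sample log output above.
WEIGHTS_RE = re.compile(r"Loading model weights took ([0-9.]+) GB")


def weight_load_sizes(log_text: str) -> List[float]:
    """Return the size in GB from every weight-loading line, in log order."""
    return [float(m.group(1)) for m in WEIGHTS_RE.finditer(log_text)]


def ray_worker_line_count(log_text: str) -> int:
    """Count weight-loading lines that carry the RayWorkerWrapper prefix."""
    return sum(
        1
        for line in log_text.splitlines()
        if line.startswith("(RayWorkerWrapper") and WEIGHTS_RE.search(line)
    )


SAMPLE_LOG = """\
INFO 05-08 03:20:24 model_runner.py:173] Loading model weights took 0.1189 GB
(RayWorkerWrapper pid=169, ip=10.20.0.197) INFO 05-08 03:20:28 model_runner.py:173] Loading model weights took 0.1189 GB
"""

print(weight_load_sizes(SAMPLE_LOG))
print(ray_worker_line_count(SAMPLE_LOG))
```

A line with the `RayWorkerWrapper` prefix and a different IP than the leader suggests that a remote worker joined the Ray cluster and loaded its shard.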
## Access ClusterIP service
```bash
# Listen on port 8080 locally, forwarding to the targetPort of the service's port 8080 in a pod selected by the service
kubectl port-forward svc/vllm-leader 8080:8080
```
The output should be similar to the following:
```text
Forwarding from 127.0.0.1:8080 -> 8080
Forwarding from [::1]:8080 -> 8080
```
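
Before sending requests, it can help to confirm that the forwarded port accepts connections. The sketch below does a plain TCP connect, which is the same kind of check as the `tcpSocket` readiness probe in `lws.yaml`. `port_is_open` is a hypothetical helper, and the host and port are the ones from the port-forward output above.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


# With the port-forward running, this is expected to print True.
print(port_is_open("127.0.0.1", 8080))
```

A successful connect only shows that something is listening; it does not confirm that the model has finished loading.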
## Serve the model
Open another terminal and send a request:
```text
curl http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3.1-405B-Instruct",
    "prompt": "San Francisco is a",
    "max_tokens": 7,
    "temperature": 0
  }'
```
The output should be similar to the following:
```text
{
  "id": "cmpl-1bb34faba88b43f9862cfbfb2200949d",
  "object": "text_completion",
  "created": 1715138766,
  "model": "meta-llama/Meta-Llama-3.1-405B-Instruct",
  "choices": [
    {
      "index": 0,
      "text": " top destination for foodies, with",
      "logprobs": null,
      "finish_reason": "length",
      "stop_reason": null
    }
  ],
  "usage": {
    "prompt_tokens": 5,
    "total_tokens": 12,
    "completion_tokens": 7
  }
}
```
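
The same request can be sent from Python using only the standard library. This is a minimal sketch, assuming the port-forward from the previous section is still running on `localhost:8080`. The function names are hypothetical; the URL, payload fields, and response fields are the ones shown in the `curl` example and sample output above.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080"  # local end of the kubectl port-forward
MODEL = "meta-llama/Meta-Llama-3.1-405B-Instruct"


def build_completion_request(prompt: str, max_tokens: int = 7,
                             temperature: float = 0) -> urllib.request.Request:
    """Build the POST request that mirrors the curl command above."""
    body = json.dumps({
        "model": MODEL,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/v1/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def completion_text(response_body: str) -> str:
    """Extract the generated text of the first choice."""
    return json.loads(response_body)["choices"][0]["text"]


def usage_is_consistent(response_body: str) -> bool:
    """Check that prompt and completion token counts add up to the total."""
    usage = json.loads(response_body)["usage"]
    return usage["prompt_tokens"] + usage["completion_tokens"] == usage["total_tokens"]


def fetch_completion(prompt: str) -> str:
    """Send the request to the forwarded service and return the generated text."""
    with urllib.request.urlopen(build_completion_request(prompt)) as resp:
        return completion_text(resp.read().decode("utf-8"))
```

Calling `fetch_completion("San Francisco is a")` while the port-forward is active should return a short continuation like the one in the sample output. In that sample, `finish_reason` is `length` because generation stopped at the `max_tokens` limit of 7.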