Skip to content
GitLab
Menu
Projects
Groups
Snippets
Loading...
Help
Help
Support
Community forum
Keyboard shortcuts
?
Submit feedback
Contribute to GitLab
Sign in
Toggle navigation
Menu
Open sidebar
OpenDAS
vllm_cscc
Commits
150a1ffb
Unverified
Commit
150a1ffb
authored
Jul 26, 2024
by
Zhanghao Wu
Committed by
GitHub
Jul 26, 2024
Browse files
[Doc] Update SkyPilot doc for wrong indents and instructions for update service (#4283)
parent
281977bd
Changes
1
Show whitespace changes
Inline
Side-by-side
Showing
1 changed file
with
243 additions
and
187 deletions
+243
-187
docs/source/serving/run_on_sky.rst
docs/source/serving/run_on_sky.rst
+243
-187
No files found.
docs/source/serving/run_on_sky.rst
View file @
150a1ffb
...
...
@@ -159,18 +159,7 @@ SkyPilot can scale up the service to multiple service replicas with built-in aut
--model $MODEL_NAME \
--trust-remote-code \
--tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \
2>&1 | tee api_server.log &
echo 'Waiting for vllm api server to start...'
while ! `cat api_server.log | grep -q 'Uvicorn running on'`; do sleep 1; done
echo 'Starting gradio server...'
git clone https://github.com/vllm-project/vllm.git || true
python vllm/examples/gradio_openai_chatbot_webserver.py \
-m $MODEL_NAME \
--port 8811 \
--model-url http://localhost:8081/v1 \
--stop-token-ids 128009,128001
2>&1 | tee api_server.log
.. raw:: html
...
...
@@ -203,8 +192,8 @@ Wait until the service is ready:
Service Replicas
SERVICE_NAME ID VERSION IP LAUNCHED RESOURCES STATUS REGION
vllm 1 1 xx.yy.zz.121 18 mins ago 1x GCP({'L4': 1}) READY us-east4
vllm 2 1 xx.yy.zz.245 18 mins ago 1x GCP({'L4': 1}) READY us-east4
vllm 1 1 xx.yy.zz.121 18 mins ago 1x GCP(
[Spot]
{'L4': 1}) READY us-east4
vllm 2 1 xx.yy.zz.245 18 mins ago 1x GCP(
[Spot]
{'L4': 1}) READY us-east4
.. raw:: html
...
...
@@ -232,19 +221,91 @@ After the service is READY, you can find a single endpoint for the service and a
"stop_token_ids": [128009, 128001]
}'
To enable autoscaling, you could
specify additional
configs in `service
s
`:
To enable autoscaling, you could
replace the `replicas` with the following
configs in `service`:
.. code-block:: yaml
service
s
:
service:
replica_policy:
min_replicas:
0
max_replicas:
3
min_replicas:
2
max_replicas:
4
target_qps_per_replica: 2
This will scale the service up to when the QPS exceeds 2 for each replica.
.. raw:: html
<details>
<summary>Click to see the full recipe YAML</summary>
.. code-block:: yaml
service:
replica_policy:
min_replicas: 2
max_replicas: 4
target_qps_per_replica: 2
# An actual request for readiness probe.
readiness_probe:
path: /v1/chat/completions
post_data:
model: $MODEL_NAME
messages:
- role: user
content: Hello! What is your name?
max_tokens: 1
resources:
accelerators: {L4, A10g, A10, L40, A40, A100, A100-80GB} # We can use cheaper accelerators for 8B model.
use_spot: True
disk_size: 512 # Ensure model checkpoints can fit.
disk_tier: best
ports: 8081 # Expose to internet traffic.
envs:
MODEL_NAME: meta-llama/Meta-Llama-3-8B-Instruct
HF_TOKEN: <your-huggingface-token> # Change to your own huggingface token, or use --env to pass.
setup: |
conda create -n vllm python=3.10 -y
conda activate vllm
pip install vllm==0.4.0.post1
# Install Gradio for web UI.
pip install gradio openai
pip install flash-attn==2.5.7
run: |
conda activate vllm
echo 'Starting vllm api server...'
python -u -m vllm.entrypoints.openai.api_server \
--port 8081 \
--model $MODEL_NAME \
--trust-remote-code \
--tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \
2>&1 | tee api_server.log
.. raw:: html
</details>
To update the service with the new config:
.. code-block:: console
HF_TOKEN="your-huggingface-token" sky serve update vllm serving.yaml --env HF_TOKEN
To stop the service:
.. code-block:: console
sky serve down vllm
**Optional**: Connect a GUI to the endpoint
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
...
...
@@ -259,18 +320,15 @@ It is also possible to access the Llama-3 service with a separate GUI frontend,
.. code-block:: yaml
envs:
MODEL_NAME: meta-llama/Meta-Llama-3-
70
B-Instruct
MODEL_NAME: meta-llama/Meta-Llama-3-
8
B-Instruct
ENDPOINT: x.x.x.x:3031 # Address of the API server running vllm.
resources:
cpus: 2
setup: |
conda activate vllm
if [ $? -ne 0 ]; then
conda create -n vllm python=3.10 -y
conda activate vllm
fi
# Install Gradio for web UI.
pip install gradio openai
...
...
@@ -278,9 +336,6 @@ It is also possible to access the Llama-3 service with a separate GUI frontend,
run: |
conda activate vllm
export PATH=$PATH:/sbin
WORKER_IP=$(hostname -I | cut -d' ' -f1)
CONTROLLER_PORT=21001
WORKER_PORT=21002
echo 'Starting gradio server...'
git clone https://github.com/vllm-project/vllm.git || true
...
...
@@ -290,6 +345,7 @@ It is also possible to access the Llama-3 service with a separate GUI frontend,
--model-url http://$ENDPOINT/v1 \
--stop-token-ids 128009,128001 | tee ~/gradio.log
.. raw:: html
</details>
...
...
Write
Preview
Markdown
is supported
0%
Try again
or
attach a new file
.
Attach a file
Cancel
You are about to add
0
people
to the discussion. Proceed with caution.
Finish editing this message first!
Cancel
Please
register
or
sign in
to comment