vLLM can be **run and scaled to multiple service replicas on clouds and Kubernetes** with `SkyPilot <https://github.com/skypilot-org/skypilot>`__, an open-source framework for running LLMs on any cloud. More examples for various open models, such as Llama-3 and Mixtral, can be found in the `SkyPilot AI gallery <https://skypilot.readthedocs.io/en/latest/gallery/index.html>`__.
...
@@ -21,8 +21,8 @@ Prerequisites
.. code-block:: console

    pip install skypilot-nightly
    sky check
Run on a single instance
...
@@ -32,64 +32,64 @@ See the vLLM SkyPilot YAML for serving, `serving.yaml <https://github.com/skypil
.. code-block:: yaml

    resources:
      accelerators: {L4, A10g, A10, L40, A40, A100, A100-80GB}  # Cheaper accelerators are sufficient for the 8B model.
      use_spot: True
      disk_size: 512  # Ensure model checkpoints can fit.
      disk_tier: best
      ports: 8081  # Expose to internet traffic.

    envs:
      MODEL_NAME: meta-llama/Meta-Llama-3-8B-Instruct
      HF_TOKEN: <your-huggingface-token>  # Change to your own Hugging Face token, or use --env to pass it.
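
The ``--env`` option mentioned in the ``HF_TOKEN`` comment can be used to keep the token out of the YAML file. The following is a minimal sketch, assuming the task YAML has been saved locally as ``serving.yaml`` (the local file name is an assumption) and that the placeholder is replaced with a real token:

.. code-block:: console

    # Export the token in the current shell; the value here is a placeholder.
    export HF_TOKEN="<your-huggingface-token>"
    # Passing --env with only the variable name forwards its value from the local environment.
    sky launch serving.yaml --env HF_TOKEN

Supplying the token this way means the YAML can be shared or committed without the secret in it.
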
Check the output of the command. It includes a shareable Gradio link (like the last line of the following output). Open it in your browser to use the Llama model for text completion.
.. code-block:: console

    (task, pid=7431) Running on public URL: https://<gradio-hash>.gradio.live

**Optional**: Serve the 70B model instead of the default 8B model and use more GPUs: