# SkyPilot

<p align="center">
  <img src="https://imgur.com/yxtzPEu.png" alt="vLLM"/>
</p>

vLLM can be **run and scaled to multiple service replicas on clouds and Kubernetes** with [SkyPilot](https://github.com/skypilot-org/skypilot), an open-source framework for running LLMs on any cloud. More examples for various open models, such as Llama-3, Mixtral, etc., can be found in [SkyPilot AI gallery](https://skypilot.readthedocs.io/en/latest/gallery/index.html).

## Prerequisites

- Go to the [HuggingFace model page](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) and request access to the model `meta-llama/Meta-Llama-3-8B-Instruct`.
- Check that you have installed SkyPilot ([docs](https://skypilot.readthedocs.io/en/latest/getting-started/installation.html)).
- Check that `sky check` shows clouds or Kubernetes are enabled.

```bash
pip install skypilot-nightly
sky check
```

## Run on a single instance

See [serving.yaml](https://github.com/skypilot-org/skypilot/blob/master/llm/vllm/serve.yaml) for the complete SkyPilot YAML used to serve vLLM.

??? code "Yaml"

    ```yaml
    resources:
      accelerators: {L4, A10g, A10, L40, A40, A100, A100-80GB} # We can use cheaper accelerators for the 8B model.
      use_spot: True
      disk_size: 512  # Ensure model checkpoints can fit.
      disk_tier: best
      ports: 8081  # Expose to internet traffic.

    envs:
      PYTHONUNBUFFERED: 1
      MODEL_NAME: meta-llama/Meta-Llama-3-8B-Instruct
      HF_TOKEN: <your-huggingface-token>  # Change to your own huggingface token, or use --env to pass.

    setup: |
      conda create -n vllm python=3.10 -y
      conda activate vllm

      pip install vllm==0.4.0.post1
      # Install Gradio for web UI.
      pip install gradio openai
      pip install flash-attn==2.5.7

    run: |
      conda activate vllm
      echo 'Starting vllm api server...'
      vllm serve $MODEL_NAME \
        --port 8081 \
        --trust-remote-code \
        --tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \
        2>&1 | tee api_server.log &

      echo 'Waiting for vllm api server to start...'
      while ! grep -q 'Uvicorn running on' api_server.log; do sleep 1; done

      echo 'Starting gradio server...'
      git clone https://github.com/vllm-project/vllm.git || true
      python vllm/examples/online_serving/gradio_openai_chatbot_webserver.py \
        -m $MODEL_NAME \
        --port 8811 \
        --model-url http://localhost:8081/v1 \
        --stop-token-ids 128009,128001
    ```

Start serving the Llama-3 8B model on any of the candidate GPUs listed (L4, A10g, ...):

```bash
HF_TOKEN="your-huggingface-token" sky launch serving.yaml --env HF_TOKEN
```

Check the output of the command. There will be a shareable Gradio link (like the last line of the following). Open it in your browser to use the Llama-3 model for text completion.

```console
(task, pid=7431) Running on public URL: https://<gradio-hash>.gradio.live
```
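You can also query the OpenAI-compatible API on port 8081 directly, without going through the Gradio UI. A minimal sketch, assuming the cluster was launched with an explicit name via `sky launch -c vllm-serve ...` (the name `vllm-serve` is illustrative; `sky status` lists your actual cluster names):

```bash
# Look up the head node IP of the cluster (cluster name is illustrative).
IP=$(sky status --ip vllm-serve)

# Send a chat completion request to the vLLM server on port 8081.
curl http://$IP:8081/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3-8B-Instruct",
    "messages": [{"role": "user", "content": "Who are you?"}]
  }'
```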

**Optional**: Serve the 70B model instead of the default 8B and use more GPUs:

```bash
HF_TOKEN="your-huggingface-token" \
  sky launch serving.yaml \
  --gpus A100:8 \
  --env HF_TOKEN \
  --env MODEL_NAME=meta-llama/Meta-Llama-3-70B-Instruct
```
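Before launching, you can optionally check which clouds offer the requested accelerators, and at what price, with SkyPilot's built-in catalog:

```bash
# List clouds/regions that offer 8x A100 GPUs, with hourly pricing.
sky show-gpus A100:8
```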

## Scale up to multiple replicas

SkyPilot can scale the service up to multiple replicas with built-in autoscaling, load balancing, and fault tolerance. To do so, add a `service` section to the YAML file:

??? code "Yaml"

    ```yaml
    service:
      replicas: 2
      # An actual request for readiness probe.
      readiness_probe:
        path: /v1/chat/completions
        post_data:
          model: $MODEL_NAME
          messages:
            - role: user
              content: Hello! What is your name?
          max_completion_tokens: 1
    ```

The complete YAML with the `service` section added:

??? code "Yaml"

    ```yaml
    service:
      replicas: 2
      # An actual request for readiness probe.
      readiness_probe:
        path: /v1/chat/completions
        post_data:
          model: $MODEL_NAME
          messages:
            - role: user
              content: Hello! What is your name?
          max_completion_tokens: 1

    resources:
      accelerators: {L4, A10g, A10, L40, A40, A100, A100-80GB} # We can use cheaper accelerators for the 8B model.
      use_spot: True
      disk_size: 512  # Ensure model checkpoints can fit.
      disk_tier: best
      ports: 8081  # Expose to internet traffic.

    envs:
      PYTHONUNBUFFERED: 1
      MODEL_NAME: meta-llama/Meta-Llama-3-8B-Instruct
      HF_TOKEN: <your-huggingface-token>  # Change to your own huggingface token, or use --env to pass.

    setup: |
      conda create -n vllm python=3.10 -y
      conda activate vllm

      pip install vllm==0.4.0.post1
      # Install Gradio for web UI.
      pip install gradio openai
      pip install flash-attn==2.5.7

    run: |
      conda activate vllm
      echo 'Starting vllm api server...'
      vllm serve $MODEL_NAME \
        --port 8081 \
        --trust-remote-code \
        --tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \
        2>&1 | tee api_server.log
    ```

Start serving the Llama-3 8B model on multiple replicas:

```bash
HF_TOKEN="your-huggingface-token" \
  sky serve up -n vllm serving.yaml \
  --env HF_TOKEN
```

Wait until the service is ready:

```bash
watch -n10 sky serve status vllm
```
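While waiting, you can stream the logs of an individual replica to watch the server start up (replica ID `1` below is illustrative):

```bash
# Tail the logs of replica 1 of the vllm service.
sky serve logs vllm 1
```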

Example output:

```console
Services
NAME  VERSION  UPTIME  STATUS  REPLICAS  ENDPOINT
vllm  1        35s     READY   2/2       xx.yy.zz.100:30001

Service Replicas
SERVICE_NAME  ID  VERSION  IP            LAUNCHED     RESOURCES                STATUS  REGION
vllm          1   1        xx.yy.zz.121  18 mins ago  1x GCP([Spot]{'L4': 1})  READY   us-east4
vllm          2   1        xx.yy.zz.245  18 mins ago  1x GCP([Spot]{'L4': 1})  READY   us-east4
```

After the service is READY, you can find a single endpoint for the service and send requests to it:

??? console "Commands"

    ```bash
    ENDPOINT=$(sky serve status --endpoint 8081 vllm)
    curl -L http://$ENDPOINT/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
        "model": "meta-llama/Meta-Llama-3-8B-Instruct",
        "messages": [
        {
          "role": "system",
          "content": "You are a helpful assistant."
        },
        {
          "role": "user",
          "content": "Who are you?"
        }
        ],
        "stop_token_ids": [128009,  128001]
      }'
    ```
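As a quick sanity check, the `/v1/models` route of vLLM's OpenAI-compatible API lists the models the endpoint serves:

```bash
ENDPOINT=$(sky serve status --endpoint 8081 vllm)
curl -L http://$ENDPOINT/v1/models
```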

To enable autoscaling, replace `replicas` with a `replica_policy` in the `service` section:

```yaml
service:
  replica_policy:
    min_replicas: 2
    max_replicas: 4
    target_qps_per_replica: 2
```

This will scale the service between 2 and 4 replicas, adding a replica whenever the QPS per replica exceeds 2.

The complete YAML with autoscaling enabled:

??? code "Yaml"

    ```yaml
    service:
      replica_policy:
        min_replicas: 2
        max_replicas: 4
        target_qps_per_replica: 2
      # An actual request for readiness probe.
      readiness_probe:
        path: /v1/chat/completions
        post_data:
          model: $MODEL_NAME
          messages:
            - role: user
              content: Hello! What is your name?
          max_completion_tokens: 1

    resources:
      accelerators: {L4, A10g, A10, L40, A40, A100, A100-80GB} # We can use cheaper accelerators for the 8B model.
      use_spot: True
      disk_size: 512  # Ensure model checkpoints can fit.
      disk_tier: best
      ports: 8081  # Expose to internet traffic.

    envs:
      PYTHONUNBUFFERED: 1
      MODEL_NAME: meta-llama/Meta-Llama-3-8B-Instruct
      HF_TOKEN: <your-huggingface-token>  # Change to your own huggingface token, or use --env to pass.

    setup: |
      conda create -n vllm python=3.10 -y
      conda activate vllm

      pip install vllm==0.4.0.post1
      # Install Gradio for web UI.
      pip install gradio openai
      pip install flash-attn==2.5.7

    run: |
      conda activate vllm
      echo 'Starting vllm api server...'
      vllm serve $MODEL_NAME \
        --port 8081 \
        --trust-remote-code \
        --tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \
        2>&1 | tee api_server.log
    ```

To update the service with the new config:

```bash
HF_TOKEN="your-huggingface-token" sky serve update vllm serving.yaml --env HF_TOKEN
```

To stop the service:

```bash
sky serve down vllm
```

### **Optional**: Connect a GUI to the endpoint

It is also possible to access the Llama-3 service with a separate GUI frontend, so that user requests sent to the GUI are load-balanced across replicas.

??? code "Yaml"

    ```yaml
    envs:
      MODEL_NAME: meta-llama/Meta-Llama-3-8B-Instruct
      ENDPOINT: x.x.x.x:3031 # Address of the API server running vllm.

    resources:
      cpus: 2

    setup: |
      conda create -n vllm python=3.10 -y
      conda activate vllm

      # Install Gradio for web UI.
      pip install gradio openai

    run: |
      conda activate vllm
      export PATH=$PATH:/sbin

      echo 'Starting gradio server...'
      git clone https://github.com/vllm-project/vllm.git || true
      python vllm/examples/online_serving/gradio_openai_chatbot_webserver.py \
        -m $MODEL_NAME \
        --port 8811 \
        --model-url http://$ENDPOINT/v1 \
        --stop-token-ids 128009,128001 | tee ~/gradio.log
    ```

1. Start the chat web UI:

    ```bash
    sky launch \
      -c gui ./gui.yaml \
      --env ENDPOINT=$(sky serve status --endpoint vllm)
    ```

2. Then, we can access the GUI at the returned Gradio link:

    ```console
    | INFO | stdout | Running on public URL: https://6141e84201ce0bb4ed.gradio.live
    ```
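When you are done with the GUI, tear down its cluster (named `gui` by the `-c gui` flag above):

```bash
sky down gui
```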