# Production stack

Deploying vLLM on Kubernetes is a scalable and efficient way to serve machine learning models. This guide walks you through deploying vLLM using the [vLLM production stack](https://github.com/vllm-project/production-stack). Born out of a Berkeley-UChicago collaboration, the vLLM production stack is an officially released, production-optimized codebase under the [vLLM project](https://github.com/vllm-project), designed for LLM deployment with:

* **Upstream vLLM compatibility** – It wraps around upstream vLLM without modifying its code.
* **Ease of use** – Simplified deployment via Helm charts and observability through Grafana dashboards.
* **High performance** – Optimized for LLM workloads with features like multi-model support, model-aware and prefix-aware routing, fast vLLM bootstrapping, and KV cache offloading with [LMCache](https://github.com/LMCache/LMCache), among others.

If you are new to Kubernetes, don't worry: in the vLLM production stack [repo](https://github.com/vllm-project/production-stack), we provide a step-by-step [guide](https://github.com/vllm-project/production-stack/blob/main/tutorials/00-install-kubernetes-env.md) and a [short video](https://www.youtube.com/watch?v=EsTJbQtzj0g) to set up everything and get started in **4 minutes**!

## Pre-requisite

Ensure that you have a running Kubernetes environment with GPU support (you can follow [this tutorial](https://github.com/vllm-project/production-stack/blob/main/tutorials/00-install-kubernetes-env.md) to install a Kubernetes environment on a bare-metal GPU machine).

## Deployment using vLLM production stack

The standard vLLM production stack is installed using a Helm chart. You can run this [bash script](https://github.com/vllm-project/production-stack/blob/main/utils/install-helm.sh) to install Helm on your GPU server.

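To install the vLLM production stack, run the following commands on your desktop. The values file path is relative to the root of a clone of the [production-stack repo](https://github.com/vllm-project/production-stack):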

```bash
sudo helm repo add vllm https://vllm-project.github.io/production-stack
sudo helm install vllm vllm/vllm-stack -f tutorials/assets/values-01-minimal-example.yaml
```

This creates a vLLM production stack deployment named `vllm` that runs a small LLM (the `facebook/opt-125m` model).

### Validate Installation

Monitor the deployment status using:

```bash
sudo kubectl get pods
```

The pods for the `vllm` deployment should transition to the `Running` state:

```text
NAME                                           READY   STATUS    RESTARTS   AGE
vllm-deployment-router-859d8fb668-2x2b7        1/1     Running   0          2m38s
vllm-opt125m-deployment-vllm-84dfc9bd7-vb9bs   1/1     Running   0          2m38s
```
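
Rather than re-running the command by hand, the status check above can be scripted. The following is a minimal sketch, not part of the production stack: `all_pods_running` is a hypothetical helper that reads the tabular output of `kubectl get pods` on stdin and succeeds only when every listed pod reports `Running`.

```bash
# Hypothetical helper (not part of the production stack).
# Reads `kubectl get pods` output on stdin: a header row, then one row per pod.
# Exits 0 only if at least one pod is listed and every STATUS column is "Running".
all_pods_running() {
  awk 'NR > 1 && NF { seen = 1; if ($3 != "Running") bad = 1 }
       END { exit (seen && !bad) ? 0 : 1 }'
}

# Usage sketch (needs a live cluster, so it is left commented out):
#   until sudo kubectl get pods | all_pods_running; do sleep 5; done
```

If you prefer a built-in command, `kubectl wait --for=condition=Ready pod --all --timeout=300s` blocks in a similar way; the timeout value here is only an example.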

!!! note
    It may take some time for the containers to download the Docker images and LLM weights.

### Send a Query to the Stack

Forward the `vllm-router-service` port to the host machine:

```bash
sudo kubectl port-forward svc/vllm-router-service 30080:80
```

Then send a query to the OpenAI-compatible API to list the available models:

```bash
curl -o- http://localhost:30080/v1/models
```

??? console "Output"

    ```json
    {
      "object": "list",
      "data": [
        {
          "id": "facebook/opt-125m",
          "object": "model",
          "created": 1737428424,
          "owned_by": "vllm",
          "root": null
        }
      ]
    }
    ```
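
When scripting against the stack, you usually only need the model IDs from this response. The sketch below pulls them out with `grep` and `sed`; `list_model_ids` is a hypothetical helper, and it assumes the IDs contain no escaped quotes. A JSON-aware tool such as `jq` would be more robust if it is available.

```bash
# Hypothetical helper (not part of the production stack).
# Reads a /v1/models JSON response on stdin and prints one model id per line.
# Assumption: ids contain no escaped double quotes.
list_model_ids() {
  grep -o '"id"[[:space:]]*:[[:space:]]*"[^"]*"' | sed 's/.*"\([^"]*\)"$/\1/'
}

# Usage sketch (needs the port-forward from above, so it is left commented out):
#   curl -s http://localhost:30080/v1/models | list_model_ids
```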

To send an actual completion request, issue a curl request to the OpenAI-compatible `/v1/completions` endpoint:

```bash
curl -X POST http://localhost:30080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "facebook/opt-125m",
    "prompt": "Once upon a time,",
    "max_tokens": 10
  }'
```

??? console "Output"

    ```json
    {
      "id": "completion-id",
      "object": "text_completion",
      "created": 1737428424,
      "model": "facebook/opt-125m",
      "choices": [
        {
          "text": " there was a brave knight who...",
          "index": 0,
          "finish_reason": "length"
        }
      ]
    }
    ```
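
To try different prompts without editing the JSON by hand, the request body can be generated. This is a minimal sketch: `completion_payload` is a hypothetical helper, and it does no JSON escaping, so it assumes the model name and prompt contain no double quotes or backslashes.

```bash
# Hypothetical helper (not part of the production stack).
# Prints a JSON body for /v1/completions.
#   $1 = model name, $2 = prompt, $3 = max_tokens (defaults to 10)
# Assumption: $1 and $2 contain no double quotes or backslashes.
completion_payload() {
  printf '{"model": "%s", "prompt": "%s", "max_tokens": %d}\n' "$1" "$2" "${3:-10}"
}

# Usage sketch (needs the port-forward from above, so it is left commented out):
#   curl -X POST http://localhost:30080/v1/completions \
#     -H "Content-Type: application/json" \
#     -d "$(completion_payload facebook/opt-125m 'Once upon a time,' 10)"
```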

### Uninstall

To remove the deployment, run:

```bash
sudo helm uninstall vllm
```

---

### (Advanced) Configuring vLLM production stack

The core vLLM production stack configuration is managed with YAML. Here is the example configuration used in the installation above:

??? code "Yaml"

    ```yaml
    servingEngineSpec:
      runtimeClassName: ""
      modelSpec:
      - name: "opt125m"
        repository: "vllm/vllm-openai"
        tag: "latest"
        modelURL: "facebook/opt-125m"

        replicaCount: 1

        requestCPU: 6
        requestMemory: "16Gi"
        requestGPU: 1

        pvcStorage: "10Gi"
    ```

In this YAML configuration:

* **`modelSpec`** includes:
    * `name`: A nickname that you prefer to call the model.
    * `repository`: Docker repository of vLLM.
    * `tag`: Docker image tag.
    * `modelURL`: The LLM model that you want to use.
* **`replicaCount`**: Number of replicas.
* **`requestCPU` and `requestMemory`**: Specifies the CPU and memory resource requests for the pod.
* **`requestGPU`**: Specifies the number of GPUs required.
* **`pvcStorage`**: Allocates persistent storage for the model.
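
To experiment with these fields, you can generate your own values file and pass it to Helm. The sketch below reuses the example configuration and only varies `replicaCount`; `write_values` and the file name `my-values.yaml` are hypothetical, and whether a given change applies cleanly to a running release is an assumption you should verify on your cluster.

```bash
# Hypothetical helper (not part of the production stack).
# Writes a values file matching the example configuration to the path in $1,
# with replicaCount set to $2 (defaults to 1).
write_values() {
  cat > "$1" <<EOF
servingEngineSpec:
  runtimeClassName: ""
  modelSpec:
  - name: "opt125m"
    repository: "vllm/vllm-openai"
    tag: "latest"
    modelURL: "facebook/opt-125m"
    replicaCount: ${2:-1}
    requestCPU: 6
    requestMemory: "16Gi"
    requestGPU: 1
    pvcStorage: "10Gi"
EOF
}

# Usage sketch (needs Helm and a live cluster, so it is left commented out):
#   write_values my-values.yaml 2
#   sudo helm upgrade vllm vllm/vllm-stack -f my-values.yaml
```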

!!! note
    If you intend to set up two pods, please refer to this [YAML file](https://github.com/vllm-project/production-stack/blob/main/tutorials/assets/values-01-2pods-minimal-example.yaml).

!!! tip
    vLLM production stack offers many more features (*e.g.* CPU offloading and a wide range of routing algorithms). Please check out these [examples and tutorials](https://github.com/vllm-project/production-stack/tree/main/tutorials) and our [repo](https://github.com/vllm-project/production-stack) for more details!