# Kthena

[**Kthena**](https://github.com/volcano-sh/kthena) is a Kubernetes-native LLM inference platform that transforms how organizations deploy and manage Large Language Models in production. Built with declarative model lifecycle management and intelligent request routing, it provides high performance and enterprise-grade scalability for LLM inference workloads.

This guide shows how to deploy a production-grade, **multi-node vLLM** service on Kubernetes.

We’ll:

- Install the required components (Kthena + Volcano).
- Deploy a multi-node vLLM model via Kthena’s `ModelServing` CR.
- Validate the deployment.

---

## 1. Prerequisites

You need:

- A Kubernetes cluster with **GPU nodes**.
- `kubectl` access with cluster-admin or equivalent permissions.
- **Volcano** installed for gang scheduling.
- **Kthena** installed with the `ModelServing` CRD available.
- A valid **Hugging Face token** if loading models from Hugging Face Hub.

### 1.1 Install Volcano

```bash
helm repo add volcano-sh https://volcano-sh.github.io/helm-charts
helm repo update
helm install volcano volcano-sh/volcano -n volcano-system --create-namespace
```

This provides the gang-scheduling and network topology features used by Kthena.
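Before proceeding, it is worth confirming the Volcano components are healthy (namespace and resource names assume the default chart values from the install command above):

```shell
# Scheduler, controllers, and admission webhook should all be Running
kubectl get pods -n volcano-system

# The PodGroup CRD is what Kthena relies on for gang scheduling
kubectl get crd podgroups.scheduling.volcano.sh
```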

### 1.2 Install Kthena

```bash
helm install kthena oci://ghcr.io/volcano-sh/charts/kthena --version v0.1.0 --namespace kthena-system --create-namespace
```

- The `kthena-system` namespace is created.
- Kthena controllers and CRDs, including `ModelServing`, are installed and healthy.

Validate:

```bash
kubectl get crd | grep modelserving
```

You should see:

```text
modelservings.workload.serving.volcano.sh   ...
```

---

## 2. The Multi-Node vLLM `ModelServing` Example

Kthena provides an example manifest to deploy a **multi-node vLLM cluster running Llama**. Conceptually this is equivalent to the vLLM production stack Helm deployment, but expressed with `ModelServing`.

A simplified version of the example (`llama-multinode`) looks like:

- `spec.replicas: 1` – one `ServingGroup` (one logical model deployment).
- `roles`:
    - `entryTemplate` – defines **leader** pods that run:
        - vLLM’s **multi-node cluster bootstrap script** (Ray cluster).
        - vLLM **OpenAI-compatible API server**.
    - `workerTemplate` – defines **worker** pods that join the leader’s Ray cluster.

Key points from the example YAML:

- **Image**: `vllm/vllm-openai:latest` (matches upstream vLLM images).
- **Command** (leader):

  ```yaml
  command:
    - sh
    - -c
    - >
      bash /vllm-workspace/examples/online_serving/multi-node-serving.sh leader --ray_cluster_size=2;
      python3 -m vllm.entrypoints.openai.api_server
      --port 8080
      --model meta-llama/Llama-3.1-405B-Instruct
      --tensor-parallel-size 8
      --pipeline-parallel-size 2
  ```

- **Command** (worker):

  ```yaml
  command:
    - sh
    - -c
    - >
      bash /vllm-workspace/examples/online_serving/multi-node-serving.sh worker --ray_address=$(ENTRY_ADDRESS)
  ```

---

## 3. Deploying Multi-Node Llama vLLM via Kthena

### 3.1 Prepare the Manifest

**Recommended**: use a Secret instead of a raw env var:

```bash
kubectl create secret generic hf-token \
  -n default \
  --from-literal=HUGGING_FACE_HUB_TOKEN='<your-token>'
```
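A quick sanity check that the Secret exists with the expected key (the manifest below references it via `secretKeyRef`):

```shell
# Should list one data entry named HUGGING_FACE_HUB_TOKEN
kubectl describe secret hf-token -n default
```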

### 3.2 Apply the `ModelServing`

```bash
cat <<EOF | kubectl apply -f -
apiVersion: workload.serving.volcano.sh/v1alpha1
kind: ModelServing
metadata:
  name: llama-multinode
  namespace: default
spec:
  schedulerName: volcano
  replicas: 1  # group replicas
  template:
    restartGracePeriodSeconds: 60
    gangPolicy:
      minRoleReplicas:
        405b: 1
    roles:
      - name: 405b
        replicas: 2
        entryTemplate:
          spec:
            containers:
              - name: leader
                image: vllm/vllm-openai:latest
                env:
                  - name: HUGGING_FACE_HUB_TOKEN
                    valueFrom:
                      secretKeyRef:
                        name: hf-token
                        key: HUGGING_FACE_HUB_TOKEN
                command:
                  - sh
                  - -c
                  - "bash /vllm-workspace/examples/online_serving/multi-node-serving.sh leader --ray_cluster_size=2; 
                    python3 -m vllm.entrypoints.openai.api_server --port 8080 --model meta-llama/Llama-3.1-405B-Instruct --tensor-parallel-size 8 --pipeline-parallel-size 2"
                resources:
                  limits:
                    nvidia.com/gpu: "8"
                    memory: 1124Gi
                    ephemeral-storage: 800Gi
                  requests:
                    ephemeral-storage: 800Gi
                    cpu: 125
                ports:
                  - containerPort: 8080
                readinessProbe:
                  tcpSocket:
                    port: 8080
                  initialDelaySeconds: 15
                  periodSeconds: 10
                volumeMounts:
                  - mountPath: /dev/shm
                    name: dshm
            volumes:
            - name: dshm
              emptyDir:
                medium: Memory
                sizeLimit: 15Gi
        workerReplicas: 1
        workerTemplate:
          spec:
            containers:
              - name: worker
                image: vllm/vllm-openai:latest
                command:
                  - sh
                  - -c
                  - "bash /vllm-workspace/examples/online_serving/multi-node-serving.sh worker --ray_address=$(ENTRY_ADDRESS)"
                resources:
                  limits:
                    nvidia.com/gpu: "8"
                    memory: 1124Gi
                    ephemeral-storage: 800Gi
                  requests:
                    ephemeral-storage: 800Gi
                    cpu: 125
                env:
                  - name: HUGGING_FACE_HUB_TOKEN
                    valueFrom:
                      secretKeyRef:
                        name: hf-token
                        key: HUGGING_FACE_HUB_TOKEN
                volumeMounts:
                  - mountPath: /dev/shm
                    name: dshm
            volumes:
            - name: dshm
              emptyDir:
                medium: Memory
                sizeLimit: 15Gi
EOF
```

Kthena will:

- Create a `ModelServing` object.
- Derive a `PodGroup` for Volcano gang scheduling.
- Create the leader and worker pods for each `ServingGroup` and `Role`.

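You can watch the derived PodGroup to see the gang-scheduling state (the PodGroup is created by Kthena, not by you; its exact name is assigned by the controller):

```shell
# An Inqueue/Running phase means Volcano could place all leader and worker
# pods together; Pending usually means insufficient GPU capacity.
kubectl get podgroup -n default
```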
---

## 4. Verifying the Deployment

### 4.1 Check ModelServing Status

Use the snippet from the Kthena docs:

```bash
kubectl get modelserving -oyaml | grep status -A 10
```

You should see something like:

```yaml
status:
  availableReplicas: 1
  conditions:
    - type: Available
      status: "True"
      reason: AllGroupsReady
      message: All Serving groups are ready
    - type: Progressing
      status: "False"
      ...
  replicas: 1
  updatedReplicas: 1
```

### 4.2 Check Pods

List pods for your deployment:

```bash
kubectl get pod -owide -l modelserving.volcano.sh/name=llama-multinode
```

Example output (from docs):

```text
NAMESPACE   NAME                          READY   STATUS    RESTARTS   AGE   IP            NODE           ...
default     llama-multinode-0-405b-0-0    1/1     Running   0          15m   10.244.0.56   192.168.5.12   ...
default     llama-multinode-0-405b-0-1    1/1     Running   0          15m   10.244.0.58   192.168.5.43   ...
default     llama-multinode-0-405b-1-0    1/1     Running   0          15m   10.244.0.57   192.168.5.58   ...
default     llama-multinode-0-405b-1-1    1/1     Running   0          15m   10.244.0.53   192.168.5.36   ...
```

Pod name pattern:

- `llama-multinode-<group-idx>-<role-name>-<replica-idx>-<ordinal>`.

The first number is the `ServingGroup` index, `405b` is the `Role` name, the next number is the replica index within the role, and the final ordinal distinguishes the individual pods (entry and workers) of that replica.
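As a quick illustration, the index fields can be split off with shell parameter expansion, working from the right so that hyphens inside the `ModelServing` name itself (`llama-multinode`) are left untouched:

```shell
# Split a Kthena pod name into its trailing index fields
pod="llama-multinode-0-405b-0-0"
ordinal="${pod##*-}"; rest="${pod%-*}"
replica="${rest##*-}"; rest="${rest%-*}"
role="${rest##*-}";    rest="${rest%-*}"
group="${rest##*-}"
echo "group=$group role=$role replica=$replica ordinal=$ordinal"
# → group=0 role=405b replica=0 ordinal=0
```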

---

## 5. Accessing the vLLM OpenAI-Compatible API

Expose the entry via a Service:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: llama-multinode-openai
  namespace: default
spec:
  selector:
    modelserving.volcano.sh/name: llama-multinode
    modelserving.volcano.sh/entry: "true"
    # optionally further narrow to leader role if you label it
  ports:
    - name: http
      port: 80
      targetPort: 8080
  type: ClusterIP
```
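Save the manifest to a file (the filename here is illustrative) and apply it:

```shell
kubectl apply -f llama-multinode-openai.yaml

# Confirm the Service exists and has picked up endpoints
kubectl get svc llama-multinode-openai -n default
```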

Port-forward from your local machine:

```bash
kubectl port-forward svc/llama-multinode-openai 30080:80 -n default
```

Then:

- List models:

  ```bash
  curl -s http://localhost:30080/v1/models
  ```

- Send a completion request (mirroring vLLM production stack docs):

  ```bash
  curl -X POST http://localhost:30080/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "meta-llama/Llama-3.1-405B-Instruct",
      "prompt": "Once upon a time,",
      "max_tokens": 10
    }'
  ```

You should see an OpenAI-style response from vLLM.
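Since this is an Instruct model, you can also exercise the chat endpoint (`/v1/chat/completions` is the standard OpenAI-compatible route served by vLLM):

```shell
curl -X POST http://localhost:30080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-405B-Instruct",
    "messages": [{"role": "user", "content": "Say hello in one sentence."}],
    "max_tokens": 32
  }'
```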

---

## 6. Clean Up

To remove the deployment and its resources:

```bash
kubectl delete modelserving llama-multinode -n default
```
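To confirm the pods are gone once deletion completes (same label selector as used for verification earlier):

```shell
# Should report no resources after the ModelServing is deleted
kubectl get pod -l modelserving.volcano.sh/name=llama-multinode -n default
```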

If you’re done with the entire stack:

```bash
helm uninstall kthena -n kthena-system   # or your Kthena release name
helm uninstall volcano -n volcano-system
```