create_deployment.md 6.99 KB
Newer Older
1
2
# Creating Kubernetes Deployments

3
4
The scripts in the `examples/<backend>/launch` folder like [agg.sh](../../../examples/backends/vllm/launch/agg.sh) demonstrate how you can serve your models locally.
The corresponding YAML files like [agg.yaml](../../../examples/backends/vllm/deploy/agg.yaml) show you how you could create a Kubernetes deployment for your inference graph.
5
6
7
8
9

This guide explains how to create your own deployment files.

## Step 1: Choose Your Architecture Pattern

10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
Before choosing a template, understand the different architecture patterns:

### Aggregated Serving (agg.yaml)

**Pattern**: Prefill and decode on the same GPU in a single process.

**Suggested to use for**:
- Small to medium models (under 70B parameters)
- Development and testing
- Low to moderate traffic
- Simplicity is prioritized over maximum throughput

**Tradeoffs**:
- Simpler setup and debugging
- Lower operational complexity
- GPU utilization may not be optimal (prefill and decode compete for resources)
- Lower throughput ceiling compared to disaggregated

28
**Example**: [`agg.yaml`](../../../examples/backends/vllm/deploy/agg.yaml)
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44

### Aggregated + Router (agg_router.yaml)

**Pattern**: Load balancer routing across multiple aggregated worker instances.

**Suggested to use for**:
- Medium traffic requiring high availability
- Need horizontal scaling
- Want some load balancing without disaggregation complexity

**Tradeoffs**:
- Better scalability than plain aggregated
- High availability through multiple replicas
- Still has GPU underutilization issues of aggregated serving
- More complex than plain aggregated but simpler than disaggregated

45
**Example**: [`agg_router.yaml`](../../../examples/backends/vllm/deploy/agg_router.yaml)
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63

### Disaggregated Serving (disagg_router.yaml)

**Pattern**: Separate prefill and decode workers with specialized optimization.

**Suggested to use for**:
- Production-style deployments
- High throughput requirements
- Large models (70B+ parameters)
- Maximum GPU utilization needed

**Tradeoffs**:
- Maximum performance and throughput
- Better GPU utilization (prefill and decode specialized)
- Independent scaling of prefill and decode
- More complex setup and debugging
- Requires understanding of prefill/decode separation

64
**Example**: [`disagg_router.yaml`](../../../examples/backends/vllm/deploy/disagg_router.yaml)
65
66
67

### Quick Selection Guide

68
69
Select the architecture pattern as your template that best fits your use case.

70
For example, when using the `vLLM` backend:
71

72
- **Development / Testing**: Use [`agg.yaml`](../../../examples/backends/vllm/deploy/agg.yaml) as the base configuration.
73

74
- **Production with Load Balancing**: Use [`agg_router.yaml`](../../../examples/backends/vllm/deploy/agg_router.yaml) to enable scalable, load-balanced inference.
75

76
- **High Performance / Disaggregated Deployment**: Use [`disagg_router.yaml`](../../../examples/backends/vllm/deploy/disagg_router.yaml) for maximum throughput and modular scalability.
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144


## Step 2: Customize the Template

You can run the Frontend on one machine, for example a CPU node, and the worker on a different machine (a GPU node).
The Frontend serves as a framework-agnostic HTTP entry point and is likely not to need many changes.

It serves the following roles:
1. OpenAI-Compatible HTTP Server
  * Provides `/v1/chat/completions` endpoint
  * Handles HTTP request/response formatting
  * Supports streaming responses
  * Validates incoming requests

2. Service Discovery and Routing
  * Auto-discovers backend workers via etcd
  * Routes requests to the appropriate Processor/Worker components
  * Handles load balancing between multiple workers

3. Request Preprocessing
  * Initial request validation
  * Model name verification
  * Request format standardization

You should then pick a worker and specialize the config. For example,

```yaml
VllmWorker:         # vLLM-specific config
  enforce-eager: true
  enable-prefix-caching: true

SglangWorker:       # SGLang-specific config
  router-mode: kv
  disagg-mode: true

TrtllmWorker:       # TensorRT-LLM-specific config
  engine-config: ./engine.yaml
  kv-cache-transfer: ucx
```

Here's a template structure based on the examples:

```yaml
    YourWorker:
      dynamoNamespace: your-namespace
      componentType: worker
      replicas: N
      envFromSecret: your-secrets  # e.g., hf-token-secret
      # Health checks for worker initialization
      readinessProbe:
        exec:
          command: ["/bin/sh", "-c", 'grep "Worker.*initialized" /tmp/worker.log']
      resources:
        requests:
          gpu: "1"  # GPU allocation
      extraPodSpec:
        mainContainer:
          image: your-image
          command:
            - /bin/sh
            - -c
          args:
            - python -m dynamo.YOUR_INFERENCE_ENGINE --model YOUR_MODEL --your-flags
```

Consult the corresponding sh file. Each of the python commands to launch a component will go into your yaml spec under the
`extraPodSpec: -> mainContainer: -> args:`

145
The front end is launched with "python3 -m dynamo.frontend [--http-port 8000] [--router-mode kv]"
146
Each worker will launch `python -m dynamo.YOUR_INFERENCE_BACKEND --model YOUR_MODEL --your-flags `command.
147
If you are a Dynamo contributor the [dynamo run guide](../../reference/cli.md) for details on how to run this command.
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186


## Step 3: Key Customization Points

### Model Configuration

```yaml
   args:
     - "python -m dynamo.YOUR_INFERENCE_BACKEND --model YOUR_MODEL --your-flag"
```

### Resource Allocation

```yaml
   resources:
     requests:
       cpu: "N"
       memory: "NGi"
       gpu: "N"
```

### Scaling

```yaml
   replicas: N  # Number of worker instances
```

### Routing Mode
```yaml
   args:
     - --router-mode
     - kv  # Enable KV-cache routing
```

### Worker Specialization

```yaml
   args:
     - --is-prefill-worker  # For disaggregated prefill workers
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
```

### Image Pull Secret Configuration

#### Automatic Discovery and Injection

By default, the Dynamo operator automatically discovers and injects image pull secrets based on container registry host matching. The operator scans Docker config secrets within the same namespace and matches their registry hostnames to the container image URLs, automatically injecting the appropriate secrets into the pod's `imagePullSecrets`.

**Disabling Automatic Discovery:**
To disable this behavior for a component and manually control image pull secrets:

```yaml
    YourWorker:
      dynamoNamespace: your-namespace
      componentType: worker
      annotations:
        nvidia.com/disable-image-pull-secret-discovery: "true"
```

When disabled, you can manually specify secrets as you would for a normal pod spec via:
```yaml
    YourWorker:
      dynamoNamespace: your-namespace
      componentType: worker
      annotations:
        nvidia.com/disable-image-pull-secret-discovery: "true"
      extraPodSpec:
        imagePullSecrets:
          - name: my-registry-secret
          - name: another-secret
        mainContainer:
          image: your-image
```

This automatic discovery eliminates the need to manually configure image pull secrets for each deployment.