"docs/components/frontend/frontend-guide.md" did not exist on "03360b84756931e13113656711817094fb97799e"
footer.md 17.5 KB
Newer Older
1
2
3
4
5
6
# Operator Default Values Injection

The Dynamo operator automatically applies default values to various fields when they are not explicitly specified in your deployments. These defaults include:

- **Health Probes**: Startup, liveness, and readiness probes are configured differently for frontend, worker, and planner components. For example, worker components receive a startup probe with a 2-hour timeout (720 failures × 10 seconds) to accommodate long model loading times.

7
8
- **Security Context**: All components receive `fsGroup: 1000` by default to ensure proper file permissions for mounted volumes. This can be overridden via the `extraPodSpec.securityContext` field.

9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
- **Shared Memory**: All components receive an 8Gi shared memory volume mounted at `/dev/shm` by default (can be disabled or resized via the `sharedMemory` field).

- **Environment Variables**: Components automatically receive environment variables like `DYN_NAMESPACE`, `DYN_PARENT_DGD_K8S_NAME`, `DYNAMO_PORT`, and backend-specific variables.

- **Pod Configuration**: Default `terminationGracePeriodSeconds` of 60 seconds and `restartPolicy: Always`.

- **Autoscaling**: When enabled without explicit metrics, defaults to CPU-based autoscaling with 80% target utilization.

- **Backend-Specific Behavior**: For multinode deployments, probes are automatically modified or removed for worker nodes depending on the backend framework (VLLM, SGLang, or TensorRT-LLM).

## Pod Specification Defaults

All components receive the following pod-level defaults unless overridden:

- **`terminationGracePeriodSeconds`**: `60` seconds
- **`restartPolicy`**: `Always`

26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
## Security Context

The operator automatically applies default security context settings to all components to ensure proper file permissions, particularly for mounted volumes:

- **`fsGroup`**: `1000` - Sets the group ownership of mounted volumes and any files created in those volumes

This default ensures that non-root containers can write to mounted volumes (like model caches or persistent storage) without permission issues. The `fsGroup` setting is particularly important for:
- Model downloads and caching
- Compilation cache directories
- Persistent volume claims (PVCs)
- SSH key generation in multinode deployments

### Overriding Security Context

To override the default security context, specify your own `securityContext` in the `extraPodSpec` of your component:

```yaml
services:
  YourWorker:
    extraPodSpec:
      securityContext:
        fsGroup: 2000  # Custom group ID
        runAsUser: 1000
        runAsGroup: 1000
        runAsNonRoot: true
```

**Important**: When you provide *any* `securityContext` object in `extraPodSpec`, the operator will not inject any defaults. This gives you complete control over the security context, including the ability to run as root (by omitting `runAsNonRoot` or setting it to `false`).

### OpenShift and Security Context Constraints

In OpenShift environments with Security Context Constraints (SCCs), you may need to omit explicit UID/GID values to allow OpenShift's admission controllers to assign them dynamically:

```yaml
services:
  YourWorker:
    extraPodSpec:
      securityContext:
        # Omit fsGroup to let OpenShift assign it based on SCC
        # OpenShift will inject the appropriate UID range
```

Alternatively, if you want to keep the default `fsGroup: 1000` behavior and are certain your cluster allows it, you don't need to specify anything - the operator defaults will work.

70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
## Shared Memory Configuration

Shared memory is enabled by default for all components:

- **Enabled**: `true` (unless explicitly disabled via `sharedMemory.disabled`)
- **Size**: `8Gi`
- **Mount Path**: `/dev/shm`
- **Volume Type**: `emptyDir` with `memory` medium

To disable shared memory or customize the size, use the `sharedMemory` field in your component specification.

## Health Probes by Component Type

The operator applies different default health probes based on the component type.

### Frontend Components

Frontend components receive the following probe configurations:

**Liveness Probe:**
- **Type**: HTTP GET
- **Path**: `/health`
- **Port**: `http` (8000)
- **Initial Delay**: 60 seconds
- **Period**: 60 seconds
- **Timeout**: 30 seconds
- **Failure Threshold**: 10

**Readiness Probe:**
- **Type**: Exec command
- **Command**: `curl -s http://localhost:${DYNAMO_PORT}/health | jq -e ".status == \"healthy\""`
- **Initial Delay**: 60 seconds
- **Period**: 60 seconds
- **Timeout**: 30 seconds
- **Failure Threshold**: 10

### Worker Components

Worker components receive the following probe configurations:

**Liveness Probe:**
- **Type**: HTTP GET
- **Path**: `/live`
- **Port**: `system` (9090)
- **Period**: 5 seconds
- **Timeout**: 30 seconds
- **Failure Threshold**: 1

**Readiness Probe:**
- **Type**: HTTP GET
- **Path**: `/health`
- **Port**: `system` (9090)
- **Period**: 10 seconds
- **Timeout**: 30 seconds
- **Failure Threshold**: 60

**Startup Probe:**
- **Type**: HTTP GET
- **Path**: `/live`
- **Port**: `system` (9090)
- **Period**: 10 seconds
- **Timeout**: 5 seconds
- **Failure Threshold**: 720 (allows up to 2 hours for startup: 10s × 720 = 7200s)

:::{note}
For larger models (typically >70B parameters) or slower storage systems, you may need to increase the `failureThreshold` to allow more time for model loading. Calculate the required threshold based on your expected startup time: `failureThreshold = (expected_startup_seconds / period)`. Override the startup probe in your component specification if the default 2-hour window is insufficient.
:::

### Multinode Deployment Probe Modifications

For multinode deployments, the operator modifies probes based on the backend framework and node role:

#### VLLM Backend
143
144
145

The operator automatically selects between two deployment modes based on parallelism configuration:

146
147
148
149
**Tensor/Pipeline Parallel Mode** (when `world_size > GPUs_per_node`):
- Uses Ray for distributed execution (`--distributed-executor-backend ray`)
- **Leader nodes**: Starts Ray head and runs vLLM; all probes remain active
- **Worker nodes**: Run Ray agents only; all probes (liveness, readiness, startup) are removed
150
151

**Data Parallel Mode** (when `world_size × data_parallel_size > GPUs_per_node`):
152
- **Worker nodes**: All probes (liveness, readiness, startup) are removed
153
- **Leader nodes**: All probes remain active
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169

#### SGLang Backend
- **Worker nodes**: All probes (liveness, readiness, startup) are removed

#### TensorRT-LLM Backend
- **Leader nodes**: All probes remain unchanged
- **Worker nodes**:
  - Liveness and startup probes are removed
  - Readiness probe is replaced with a TCP socket check on SSH port (2222):
    - **Initial Delay**: 20 seconds
    - **Period**: 20 seconds
    - **Timeout**: 5 seconds
    - **Failure Threshold**: 10

## Environment Variables

170
The operator automatically injects environment variables into component containers based on component type, backend framework, and operator configuration. User-provided `envs` values always take precedence over operator defaults.
171
172
173

### All Components

174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
These environment variables are injected into every component container regardless of type.

| Variable | Purpose | Default | Type | Source |
| --- | --- | --- | --- | --- |
| `DYN_NAMESPACE` | Dynamo service namespace used for service discovery and routing | Derived from DGD spec | `string` | Downward API annotation on checkpoint-restored pods |
| `DYN_COMPONENT` | Identifies the component type for runtime behavior | One of: `frontend`, `worker`, `prefill`, `decode`, `planner`, `epp` | `string` | Set from component spec |
| `DYN_PARENT_DGD_K8S_NAME` | Kubernetes name of the parent DynamoGraphDeployment resource | — | `string` | Set from DGD metadata |
| `DYN_PARENT_DGD_K8S_NAMESPACE` | Kubernetes namespace of the parent DynamoGraphDeployment resource | — | `string` | Set from DGD metadata |
| `POD_NAME` | Current pod name | — | `string` | Downward API (`metadata.name`) |
| `POD_NAMESPACE` | Current pod namespace | — | `string` | Downward API (`metadata.namespace`) |
| `POD_UID` | Current pod UID | — | `string` | Downward API (`metadata.uid`) |
| `DYN_DISCOVERY_BACKEND` | Service discovery backend for inter-component communication | `kubernetes` | `string` | Options: `kubernetes`, `etcd` |

### Infrastructure (Conditional)

These are injected into all components when the corresponding infrastructure service is configured in the operator's `OperatorConfiguration`.

| Variable | Purpose | Default | Type | Condition |
| --- | --- | --- | --- | --- |
| `NATS_SERVER` | NATS messaging server address | — | `string` | Set when `infrastructure.natsAddress` is configured |
| `ETCD_ENDPOINTS` | etcd endpoint addresses for distributed state | — | `string` | Set when `infrastructure.etcdAddress` is configured |
| `MODEL_EXPRESS_URL` | Model Express service URL for model management | — | `string` | Set when `infrastructure.modelExpressURL` is configured |
| `PROMETHEUS_ENDPOINT` | Prometheus endpoint for metrics collection | — | `string` | Set when `infrastructure.prometheusEndpoint` is configured |
197
198
199

### Frontend Components

200
201
202
203
204
| Variable | Purpose | Default | Type |
| --- | --- | --- | --- |
| `DYNAMO_PORT` | HTTP port the frontend listens on | `8000` | `int` |
| `DYN_HTTP_PORT` | HTTP port for the frontend service (alias) | `8000` | `int` |
| `DYN_NAMESPACE_PREFIX` | Namespace prefix used for frontend request routing | Same as `DYN_NAMESPACE` | `string` |
205
206
207

### Worker Components

208
209
210
211
212
213
214
215
216
217
| Variable | Purpose | Default | Type |
| --- | --- | --- | --- |
| `DYN_SYSTEM_ENABLED` | Enables the system HTTP server for health checks and metrics | `true` | `string` (boolean) |
| `DYN_SYSTEM_USE_ENDPOINT_HEALTH_STATUS` | Endpoints whose health status is used for readiness | `["generate"]` | `string` (JSON array) |
| `DYN_SYSTEM_PORT` | Port for the system HTTP server (health, metrics) | `9090` | `int` |
| `DYN_HEALTH_CHECK_ENABLED` | Disables the legacy health check mechanism in favor of the system server | `false` | `string` (boolean) |
| `NIXL_TELEMETRY_ENABLE` | Enables or disables NIXL telemetry collection | `n` | `string` | Options: `y`, `n` |
| `NIXL_TELEMETRY_EXPORTER` | Telemetry exporter format for NIXL metrics | `prometheus` | `string` |
| `NIXL_TELEMETRY_PROMETHEUS_PORT` | Port for NIXL Prometheus metrics endpoint | `19090` | `int` |
| `DYN_NAMESPACE_WORKER_SUFFIX` | Hash suffix appended to worker namespace for rolling updates | — | `string` | Only set during rolling update transitions |
218
219
220

### Planner Components

221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
| Variable | Purpose | Default | Type |
| --- | --- | --- | --- |
| `PLANNER_PROMETHEUS_PORT` | Port for the planner's Prometheus metrics endpoint | `9085` | `int` |

### EPP (Endpoint Picker Plugin) Components

| Variable | Purpose | Default | Type |
| --- | --- | --- | --- |
| `USE_STREAMING` | Enables streaming mode for inference request proxying | `true` | `string` (boolean) |
| `RUST_LOG` | Rust log level and filter configuration | `debug,dynamo_llm::kv_router=trace` | `string` |

### VLLM Backend

| Variable | Purpose | Default | Type | Condition |
| --- | --- | --- | --- | --- |
| `VLLM_CACHE_ROOT` | Directory for vLLM compilation cache artifacts | — | `string` | Set when a volume mount has `useAsCompilationCache: true` |
| `VLLM_NIXL_SIDE_CHANNEL_HOST` | Host IP for the NIXL side channel in multiprocessing mode | Pod IP | `string` | Multinode mp backend only (Downward API: `status.podIP`) |

### TensorRT-LLM Backend
240

241
242
243
| Variable | Purpose | Default | Type | Condition |
| --- | --- | --- | --- | --- |
| `OMPI_MCA_orte_keep_fqdn_hostnames` | Instructs OpenMPI to preserve FQDN hostnames for inter-node communication | `1` | `string` | Multinode deployments only |
244

245
### Checkpoint / Restore
246

247
These environment variables are injected when checkpoint/restore is enabled for a component.
248

249
250
251
252
253
254
| Variable | Purpose | Default | Type | Condition |
| --- | --- | --- | --- | --- |
| `DYN_CHECKPOINT_PATH` | Base directory where checkpoint data is stored | From operator checkpoint config `storage.pvc.basePath` | `string` | PVC storage type |
| `DYN_CHECKPOINT_LOCATION` | Full checkpoint URI (for non-PVC backends) | — | `string` | S3 or OCI storage type |
| `DYN_CHECKPOINT_HASH` | Identity hash that uniquely identifies the checkpoint | — | `string` | Always set when checkpoint is enabled |
| `SKIP_WAIT_FOR_CHECKPOINT` | Skips the checkpoint readiness polling loop; checks once and proceeds | — | `string` | Set on restored and DGD pods |
255

256
257
258
259
260
261
## Service Accounts

The following component types automatically receive dedicated service accounts:

- **Planner**: `planner-serviceaccount`
- **EPP**: `epp-serviceaccount`
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297

## Image Pull Secrets

The operator automatically discovers and injects image pull secrets for container images. When a component specifies a container image, the operator:

1. Scans all Kubernetes secrets of type `kubernetes.io/dockerconfigjson` in the component's namespace
2. Extracts the docker registry server URLs from each secret's authentication configuration
3. Matches the container image's registry host against the discovered registry URLs
4. Automatically injects matching secrets as `imagePullSecrets` in the pod specification

This eliminates the need to manually specify image pull secrets for each component. The operator maintains an internal index of docker secrets and their associated registries, refreshing this index periodically.

**To disable automatic image pull secret discovery** for a specific component, add the following annotation:

```yaml
annotations:
  nvidia.com/disable-image-pull-secret-discovery: "true"
```

## Autoscaling Defaults

When autoscaling is enabled but no metrics are specified, the operator applies:

- **Default Metric**: CPU utilization
- **Target Average Utilization**: `80%`

## Port Configurations

Default container ports are configured based on component type:

### Frontend Components
- **Port**: 8000
- **Protocol**: TCP
- **Name**: `http`

### Worker Components
298
- **Port**: 9090 (system)
299
300
- **Protocol**: TCP
- **Name**: `system`
301
302
303
- **Port**: 19090 (NIXL)
- **Protocol**: TCP
- **Name**: `nixl`
304
305
306
307
308
309

### Planner Components
- **Port**: 9085
- **Protocol**: TCP
- **Name**: `metrics`

310
311
312
313
314
315
316
317
318
319
320
### EPP Components
- **Port**: 9002 (gRPC)
- **Protocol**: TCP
- **Name**: `grpc`
- **Port**: 9003 (gRPC health)
- **Protocol**: TCP
- **Name**: `grpc-health`
- **Port**: 9090 (metrics)
- **Protocol**: TCP
- **Name**: `metrics`

321
322
323
## Backend-Specific Configurations

### VLLM
324
- **Ray Head Port**: 6379 (for Ray cluster coordination in multinode TP/PP deployments)
325
- **Data Parallel RPC Port**: 13445 (for data parallel multinode deployments)
326
327
328
329
330
331
332
333
334
335
336
337

### SGLang
- **Distribution Init Port**: 29500 (for multinode deployments)

### TensorRT-LLM
- **SSH Port**: 2222 (for multinode MPI communication)
- **OpenMPI Environment**: `OMPI_MCA_orte_keep_fqdn_hostnames=1`

## Implementation Reference

For users who want to understand the implementation details or contribute to the operator, the default values described in this document are set in the following source files:

338
- **Health Probes, Security Context & Pod Specifications**: [`internal/dynamo/graph.go`](https://github.com/ai-dynamo/dynamo/blob/main/deploy/operator/internal/dynamo/graph.go) - Contains the main logic for applying default probes, security context, environment variables, shared memory, and pod configurations
339
- **Component-Specific Defaults**:
340
  - [`internal/dynamo/component_common.go`](https://github.com/ai-dynamo/dynamo/blob/main/deploy/operator/internal/dynamo/component_common.go) - Base container and pod spec shared by all component types
341
342
343
  - [`internal/dynamo/component_frontend.go`](https://github.com/ai-dynamo/dynamo/blob/main/deploy/operator/internal/dynamo/component_frontend.go)
  - [`internal/dynamo/component_worker.go`](https://github.com/ai-dynamo/dynamo/blob/main/deploy/operator/internal/dynamo/component_worker.go)
  - [`internal/dynamo/component_planner.go`](https://github.com/ai-dynamo/dynamo/blob/main/deploy/operator/internal/dynamo/component_planner.go)
344
  - [`internal/dynamo/component_epp.go`](https://github.com/ai-dynamo/dynamo/blob/main/deploy/operator/internal/dynamo/component_epp.go)
345
- **Image Pull Secrets**: [`internal/secrets/docker.go`](https://github.com/ai-dynamo/dynamo/blob/main/deploy/operator/internal/secrets/docker.go) - Implements the docker secret indexer and automatic discovery
346
- **Backend-Specific Behavior**:
347
348
349
  - [`internal/dynamo/backend_vllm.go`](https://github.com/ai-dynamo/dynamo/blob/main/deploy/operator/internal/dynamo/backend_vllm.go)
  - [`internal/dynamo/backend_sglang.go`](https://github.com/ai-dynamo/dynamo/blob/main/deploy/operator/internal/dynamo/backend_sglang.go)
  - [`internal/dynamo/backend_trtllm.go`](https://github.com/ai-dynamo/dynamo/blob/main/deploy/operator/internal/dynamo/backend_trtllm.go)
350
- **Checkpoint / Restore**: [`internal/checkpoint/dgd_integration.go`](https://github.com/ai-dynamo/dynamo/blob/main/deploy/operator/internal/checkpoint/dgd_integration.go) - Checkpoint env var injection and volume setup
351
- **Constants & Annotations**: [`internal/consts/consts.go`](https://github.com/ai-dynamo/dynamo/blob/main/deploy/operator/internal/consts/consts.go) - Defines annotation keys and other constants
352
353
354
355
356

## Notes

- All these defaults can be overridden by explicitly specifying values in your DynamoComponentDeployment or DynamoGraphDeployment resources
- User-specified probes (via `livenessProbe`, `readinessProbe`, or `startupProbe` fields) take precedence over operator defaults
357
- For security context, if you provide *any* `securityContext` in `extraPodSpec`, no defaults will be injected, giving you full control
358
- For multinode deployments, some defaults are modified or removed as described above to accommodate distributed execution patterns
359
- The `extraPodSpec.mainContainer` field can be used to override probe configurations set by the operator