# Working with Dynamo Kubernetes Operator

## Overview

The Dynamo operator is a Kubernetes operator that simplifies the deployment, configuration, and lifecycle management of DynamoGraphs. It continuously reconciles custom resources so that the cluster converges on the desired state you declare. It is intended for users who want to manage complex deployments with declarative YAML definitions and Kubernetes-native tooling.

## Architecture

- **Operator Deployment:**
  Deployed as a Kubernetes `Deployment` in a specific namespace.

- **Controllers:**
  - `DynamoGraphDeploymentController`: Watches `DynamoGraphDeployment` CRs and orchestrates graph deployments.
  - `DynamoComponentDeploymentController`: Watches `DynamoComponentDeployment` CRs and handles individual component deployments.

- **Workflow:**
  1. A custom resource is created by the user or API server.
  2. The corresponding controller detects the change and runs reconciliation.
  3. Kubernetes resources (Deployments, Services, etc.) are created or updated to match the CR spec.
  4. Status fields are updated to reflect the current state.


## Custom Resource Definitions (CRDs)

### CRD: `DynamoGraphDeployment`


| Field            | Type   | Description                                                                                                                                          | Required | Default |
|------------------|--------|------------------------------------------------------------------------------------------------------------------------------------------------------|----------|---------|
| `services`       | map    | Map of service names to runtime configurations. This allows the user to override the service configuration defined in the DynamoComponentDeployment. | Yes      |         |
| `envs`           | list   | List of global environment variables.                                                                                                                | No       |         |


**API Version:** `nvidia.com/v1alpha1`

**Scope:** Namespaced

#### Example
```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: disagg
spec:
  envs:
  - name: GLOBAL_ENV_VAR
    value: some_global_value
  services:
    Frontend:
      replicas: 1
      envs:
      - name: SPECIFIC_ENV_VAR
        value: some_specific_value
    Processor:
      replicas: 1
      envs:
      - name: SPECIFIC_ENV_VAR
        value: some_specific_value
    VllmWorker:
      replicas: 1
      envs:
      - name: SPECIFIC_ENV_VAR
        value: some_specific_value
    PrefillWorker:
      replicas: 1
      envs:
      - name: SPECIFIC_ENV_VAR
        value: some_specific_value
```

## Installation

[See installation steps](dynamo_cloud.md#overview)


## GitOps Deployment with FluxCD

This section describes how to use FluxCD for GitOps-based deployment of Dynamo inference graphs. GitOps enables you to manage your Dynamo deployments declaratively using Git as the source of truth. We'll use the [aggregated vLLM example](../../../components/backends/vllm/README.md) to demonstrate the workflow.

### Prerequisites

- A Kubernetes cluster with [Dynamo Cloud](dynamo_cloud.md) installed
- [FluxCD](https://fluxcd.io/flux/installation/) installed in your cluster
- A Git repository to store your deployment configurations

### Workflow Overview

The GitOps workflow for Dynamo deployments consists of three main steps:

1. Build and push the Dynamo Operator
2. Create and commit a DynamoGraphDeployment custom resource for initial deployment
3. Update the graph by building a new version and updating the CR for subsequent updates

### Step 1: Build and Push Dynamo Cloud Operator

First, follow the steps in [Install Dynamo Cloud](README.md) to build and push the operator.

### Step 2: Create Initial Deployment

Create a new file in your Git repository (e.g., `deployments/llm-agg.yaml`) with the following content:

```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: llm-agg
spec:
  services:
    Frontend:
      replicas: 1
      envs:
      - name: SPECIFIC_ENV_VAR
        value: some_specific_value
    Processor:
      replicas: 1
      envs:
      - name: SPECIFIC_ENV_VAR
        value: some_specific_value
    VllmWorker:
      replicas: 1
      envs:
      - name: SPECIFIC_ENV_VAR
        value: some_specific_value
      # Add PVC for model storage
      pvc:
        name: vllm-model-storage
        mountPath: /models
        size: 100Gi
```

Commit and push this file to your Git repository. FluxCD will detect the new CR and create the initial deployment in your cluster. The operator will:
- Create the specified PVCs
- Build container images for all components
- Deploy the services with the configured resources
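
For FluxCD to pick up the file, the cluster needs a source and a reconciler pointing at your repository. The manifests below are a minimal sketch using Flux's `GitRepository` and `Kustomization` resources. The repository URL, branch, resource names, intervals, and the `dynamo` target namespace are placeholders and assumptions, not values taken from the Dynamo documentation.

```yaml
# Hypothetical Flux source: tells FluxCD which Git repository to watch.
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: dynamo-deployments          # assumed name
  namespace: flux-system
spec:
  interval: 1m                      # how often to poll the repository
  url: https://example.com/your-org/dynamo-deployments.git  # placeholder URL
  ref:
    branch: main                    # assumed branch
---
# Hypothetical Flux reconciler: applies everything under ./deployments.
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: dynamo-deployments          # assumed name
  namespace: flux-system
spec:
  interval: 5m
  sourceRef:
    kind: GitRepository
    name: dynamo-deployments
  path: ./deployments               # directory holding llm-agg.yaml
  prune: true                       # remove cluster objects deleted from Git
  targetNamespace: dynamo           # assumed namespace of the Dynamo operator
```

A private repository would additionally need a `secretRef` on the `GitRepository`. With `prune: true`, deleting the file from Git also removes the deployment from the cluster.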

### Step 3: Update Existing Deployment

To update your pipeline, edit the associated `DynamoGraphDeployment` CR in your Git repository, then commit and push the change.

The Dynamo operator will automatically reconcile it.
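
As an illustration, scaling the worker only requires changing one field in `deployments/llm-agg.yaml`. The excerpt below assumes the file from Step 2; the new replica count is an arbitrary example, and the omitted services stay unchanged in the real file.

```yaml
# deployments/llm-agg.yaml (excerpt: Frontend, Processor, and the remaining
# VllmWorker fields are unchanged and omitted here for brevity)
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: llm-agg
spec:
  services:
    VllmWorker:
      replicas: 2  # previously 1; illustrative value
```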

### Monitoring the Deployment

You can monitor the deployment status using:

```bash
export NAMESPACE=<namespace-with-the-dynamo-cloud-operator>

# Check the DynamoGraphDeployment status
kubectl get dynamographdeployment llm-agg -n "$NAMESPACE"
```


## Reconciliation Logic

### DynamoGraphDeployment

- **Actions:**
  - Create a DynamoComponent CR to build the Docker image
  - Create a DynamoComponentDeployment CR for each component defined in the Dynamo graph being deployed
- **Status Management:**
  - `.status.conditions`: Reflects readiness, failure, and progress states
  - `.status.state`: Overall state of the deployment, derived from the state of its DynamoComponentDeployments

### DynamoComponentDeployment

- **Actions:**
  - Create a Deployment, Service, and Ingress for the service
- **Status Management:**
  - `.status.conditions`: Reflects readiness, failure, and progress states

## Configuration


- **Environment Variables:**

| Name                                               | Description                          | Default                                                |
|----------------------------------------------------|--------------------------------------|--------------------------------------------------------|
| `LOG_LEVEL`                                        | Logging verbosity level              | `info`                                                 |
| `DYNAMO_SYSTEM_NAMESPACE`                          | System namespace                     | `dynamo`                                               |

- **Flags:**
  | Flag                  | Description                                | Default |
  |-----------------------|--------------------------------------------|---------|
  | `--natsAddr`          | Address of NATS server                     | ""      |
  | `--etcdAddr`          | Address of etcd server                     | ""      |
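
A sketch of how these settings might appear in the operator's `Deployment` follows. Only the variable and flag names come from the tables above. The container name, the `--flag=value` form, the `debug` level, and the address placeholders are assumptions to adapt to your installation.

```yaml
# Excerpt of the operator Deployment's pod template (illustrative only)
spec:
  template:
    spec:
      containers:
      - name: manager                        # assumed container name
        args:
        - --natsAddr=<nats-server-address>   # placeholder; default is ""
        - --etcdAddr=<etcd-server-address>   # placeholder; default is ""
        env:
        - name: LOG_LEVEL
          value: debug                       # assumed to be an accepted level; default is info
        - name: DYNAMO_SYSTEM_NAMESPACE
          value: dynamo                      # the documented default
```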



## Troubleshooting

| Symptom                | Possible Cause                | Solution                          |
|------------------------|-------------------------------|-----------------------------------|
| Resource not created   | RBAC missing                  | Ensure correct ClusterRole/Binding|
| Status not updated     | CRD schema mismatch           | Regenerate CRDs with kubebuilder  |
| Image build hangs      | Misconfigured DynamoComponent | Check image build logs            |
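
For the RBAC case, the operator's service account needs access both to the Dynamo custom resources and to the workload resources it creates. The manifests below are a hedged sketch, not the role shipped with the operator: the role, binding, and service account names, the `dynamo` namespace, and the plural resource names are assumptions derived from the CR kinds.

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: dynamo-operator-example     # hypothetical name
rules:
# Custom resources watched by the controllers (plural names are assumed)
- apiGroups: ["nvidia.com"]
  resources:
  - dynamographdeployments
  - dynamographdeployments/status
  - dynamocomponentdeployments
  - dynamocomponentdeployments/status
  verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
# Workload resources created during reconciliation
- apiGroups: ["apps"]
  resources: ["deployments"]
  verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
- apiGroups: [""]
  resources: ["services"]
  verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
- apiGroups: ["networking.k8s.io"]
  resources: ["ingresses"]
  verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: dynamo-operator-example     # hypothetical name
subjects:
- kind: ServiceAccount
  name: dynamo-operator             # hypothetical service account
  namespace: dynamo                 # assumed operator namespace
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: dynamo-operator-example
```

Comparing a role like this against the one installed in your cluster can help spot a missing resource or verb.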


## Development

- **Code Structure:**

The operator is built using Kubebuilder and the operator-sdk, with the following structure:

- `controllers/`: Reconciliation logic
- `api/v1alpha1/`: CRD types
- `config/`: Manifests and Helm charts


## References

- [Kubernetes Operator Pattern](https://kubernetes.io/docs/concepts/extend-kubernetes/operator/)
- [Custom Resource Definitions](https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/custom-resources/)
- [Operator SDK](https://sdk.operatorframework.io/)
- [Helm Best Practices for CRDs](https://helm.sh/docs/chart_best_practices/custom_resource_definitions/)