README.md 7.97 KB
Newer Older
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
<!--
SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->

18
# Deploying Inference Graphs to Kubernetes
19

20
High-level guide to Dynamo Kubernetes deployments. Start here, then dive into specific guides.
atchernych's avatar
atchernych committed
21

22
23
24
25
26
## Pre-deployment Checks

Before deploying the platform, it is recommended to run the pre-deployment checks to ensure the cluster is ready for deployment. Please refer to the [pre-deployment checks](/deploy/cloud/pre-deployment/README.md) for more details.


27
## 1. Install Platform First
28
29
30

```bash
# 1. Set environment
31
export NAMESPACE=dynamo-system
32
33
34
35
36
37
38
39
export RELEASE_VERSION=0.x.x # any version of Dynamo 0.3.2+ listed at https://github.com/ai-dynamo/dynamo/releases

# 2. Install CRDs
helm fetch https://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/dynamo-crds-${RELEASE_VERSION}.tgz
helm install dynamo-crds dynamo-crds-${RELEASE_VERSION}.tgz --namespace default

# 3. Install Platform
helm fetch https://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/dynamo-platform-${RELEASE_VERSION}.tgz
40
helm install dynamo-platform dynamo-platform-${RELEASE_VERSION}.tgz --namespace ${NAMESPACE} --create-namespace
41
42
```

43
For more details or customization options (including multinode deployments), see **[Installation Guide for Dynamo Kubernetes Platform](/docs/kubernetes/installation_guide.md)**.
atchernych's avatar
atchernych committed
44

45
## 2. Choose Your Backend
46

47
Each backend has deployment examples and configuration options:
48

49
50
| Backend | Available Configurations |
|---------|--------------------------|
51
| **[vLLM](/components/backends/vllm/deploy/README.md)** | Aggregated, Aggregated + Router, Disaggregated, Disaggregated + Router, Disaggregated + Planner, Disaggregated Multi-node |
52
| **[SGLang](/components/backends/sglang/deploy/README.md)** | Aggregated, Aggregated + Router, Disaggregated, Disaggregated + Planner, Disaggregated Multi-node |
53
| **[TensorRT-LLM](/components/backends/trtllm/deploy/README.md)** | Aggregated, Aggregated + Router, Disaggregated, Disaggregated + Router, Disaggregated Multi-node |
atchernych's avatar
atchernych committed
54

55
## 3. Deploy Your First Model
atchernych's avatar
atchernych committed
56

57
58
```bash
export NAMESPACE=dynamo-cloud
59
kubectl create namespace ${NAMESPACE}
atchernych's avatar
atchernych committed
60

Alec's avatar
Alec committed
61
62
63
64
65
66
# to pull model from HF
export HF_TOKEN=<Token-Here>
kubectl create secret generic hf-token-secret \
  --from-literal=HF_TOKEN="$HF_TOKEN" \
  -n ${NAMESPACE};

67
68
# Deploy any example (this uses vLLM with Qwen model using aggregated serving)
kubectl apply -f components/backends/vllm/deploy/agg.yaml -n ${NAMESPACE}
atchernych's avatar
atchernych committed
69

70
71
# Check status
kubectl get dynamoGraphDeployment -n ${NAMESPACE}
atchernych's avatar
atchernych committed
72

73
# Test it
Alec's avatar
Alec committed
74
kubectl port-forward svc/vllm-agg-frontend 8000:8000 -n ${NAMESPACE}
75
curl http://localhost:8000/v1/models
atchernych's avatar
atchernych committed
76
77
```

78
## Understanding Dynamo's Custom Resources
atchernych's avatar
atchernych committed
79

80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
Dynamo provides two main Kubernetes Custom Resources for deploying models:

### DynamoGraphDeploymentRequest (DGDR) - Simplified SLA-Driven Configuration

The **recommended approach** for generating optimal configurations. DGDR provides a high-level interface where you specify:
- Model name and backend framework
- SLA targets (latency requirements)
- GPU type (optional)

Dynamo automatically handles profiling and generates an optimized DGD spec in the status. Perfect for:
- SLA-driven configuration generation
- Automated resource optimization
- Users who want simplicity over control

**Note**: DGDR generates a DGD spec which you can then use to deploy.

### DynamoGraphDeployment (DGD) - Direct Configuration

A lower-level interface that defines your complete inference pipeline:
99
100
101
102
- Model configuration
- Resource allocation (GPUs, memory)
- Scaling policies
- Frontend/backend connections
atchernych's avatar
atchernych committed
103

104
105
Use this when you need fine-grained control or have already completed profiling.

106
Refer to the [API Reference and Documentation](/docs/kubernetes/api_reference.md) for more details.
atchernych's avatar
atchernych committed
107

108
109
110
111
## 📖 API Reference & Documentation

For detailed technical specifications of Dynamo's Kubernetes resources:

112
113
- **[API Reference](/docs/kubernetes/api_reference.md)** - Complete CRD field specifications for all Dynamo resources
- **[Create Deployment](/docs/kubernetes/create_deployment.md)** - Step-by-step deployment creation with DynamoGraphDeployment
114
- **[Operator Guide](/docs/kubernetes/dynamo_operator.md)** - Dynamo operator configuration and management
115

116
### Choosing Your Architecture Pattern
atchernych's avatar
atchernych committed
117

118
When creating a deployment, select the architecture pattern that best fits your use case:
atchernych's avatar
atchernych committed
119

120
121
122
- **Development / Testing** - Use `agg.yaml` as the base configuration
- **Production with Load Balancing** - Use `agg_router.yaml` to enable scalable, load-balanced inference
- **High Performance / Disaggregated** - Use `disagg_router.yaml` for maximum throughput and modular scalability
atchernych's avatar
atchernych committed
123

124
### Frontend and Worker Components
atchernych's avatar
atchernych committed
125

126
You can run the Frontend on one machine (e.g., a CPU node) and workers on different machines (GPU nodes). The Frontend serves as a framework-agnostic HTTP entry point that:
atchernych's avatar
atchernych committed
127

128
129
130
131
- Provides OpenAI-compatible `/v1/chat/completions` endpoint
- Auto-discovers backend workers via etcd
- Routes requests and handles load balancing
- Validates and preprocesses requests
atchernych's avatar
atchernych committed
132

133
### Customizing Your Deployment
atchernych's avatar
atchernych committed
134

135
136
137
138
139
140
141
142
143
144
145
146
147
Example structure:
```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: my-llm
spec:
  services:
    Frontend:
      dynamoNamespace: my-llm
      componentType: frontend
      replicas: 1
      extraPodSpec:
atchernych's avatar
atchernych committed
148
        mainContainer:
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
          image: your-image
    VllmDecodeWorker:  # or SGLangDecodeWorker, TrtllmDecodeWorker
      dynamoNamespace: dynamo-dev
      componentType: worker
      replicas: 1
      envFromSecret: hf-token-secret  # for HuggingFace models
      resources:
        limits:
          gpu: "1"
      extraPodSpec:
        mainContainer:
          image: your-image
          command: ["/bin/sh", "-c"]
          args:
            - python3 -m dynamo.vllm --model YOUR_MODEL [--your-flags]
atchernych's avatar
atchernych committed
164
165
```

166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
Worker command examples per backend:
```yaml
# vLLM worker
args:
  - python3 -m dynamo.vllm --model Qwen/Qwen3-0.6B

# SGLang worker
args:
  - >-
    python3 -m dynamo.sglang
    --model-path deepseek-ai/DeepSeek-R1-Distill-Llama-8B
    --tp 1
    --trust-remote-code

# TensorRT-LLM worker
args:
  - python3 -m dynamo.trtllm
    --model-path deepseek-ai/DeepSeek-R1-Distill-Llama-8B
    --served-model-name deepseek-ai/DeepSeek-R1-Distill-Llama-8B
    --extra-engine-args engine_configs/agg.yaml
atchernych's avatar
atchernych committed
186
```
187

188
189
190
191
192
193
Key customization points include:
- **Model Configuration**: Specify model in the args command
- **Resource Allocation**: Configure GPU requirements under `resources.limits`
- **Scaling**: Set `replicas` for number of worker instances
- **Routing Mode**: Enable KV-cache routing by setting `DYN_ROUTER_MODE=kv` in Frontend envs
- **Worker Specialization**: Add `--is-prefill-worker` flag for disaggregated prefill workers
194

195
## Additional Resources
196

197
- **[Examples](/examples/README.md)** - Complete working examples
198
199
- **[Create Custom Deployments](/docs/kubernetes/create_deployment.md)** - Build your own CRDs
- **[Operator Documentation](/docs/kubernetes/dynamo_operator.md)** - How the platform works
200
- **[Helm Charts](/deploy/helm/README.md)** - For advanced users
201
202
203
204
205
- **[GitOps Deployment with FluxCD](/docs/kubernetes/fluxcd.md)** - For advanced users
- **[Logging](/docs/kubernetes/logging.md)** - For logging setup
- **[Multinode Deployment](/docs/kubernetes/multinode-deployment.md)** - For multinode deployment
- **[Grove](/docs/kubernetes/grove.md)** - For grove details and custom installation
- **[Monitoring](/docs/kubernetes/metrics.md)** - For monitoring setup
Alec's avatar
Alec committed
206
- **[Model Caching with Fluid](/docs/kubernetes/model_caching_with_fluid.md)** - For model caching with Fluid