"lib/runtime/vscode:/vscode.git/clone" did not exist on "b98188c8f9ebef986bc05f5abf0c51a0ec291193"
README.md 6.95 KB
Newer Older
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
<!--
SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->

18
# Deploying Inference Graphs to Kubernetes
19

20
High-level guide to Dynamo Kubernetes deployments. Start here, then dive into specific guides.
atchernych's avatar
atchernych committed
21

22
## 1. Install Platform First
23
24
25

```bash
# 1. Set environment
26
export NAMESPACE=dynamo-system
27
28
29
30
31
32
33
34
export RELEASE_VERSION=0.x.x # any version of Dynamo 0.3.2+ listed at https://github.com/ai-dynamo/dynamo/releases

# 2. Install CRDs
helm fetch https://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/dynamo-crds-${RELEASE_VERSION}.tgz
helm install dynamo-crds dynamo-crds-${RELEASE_VERSION}.tgz --namespace default

# 3. Install Platform
helm fetch https://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/dynamo-platform-${RELEASE_VERSION}.tgz
35
helm install dynamo-platform dynamo-platform-${RELEASE_VERSION}.tgz --namespace ${NAMESPACE} --create-namespace
36
37
```

38
For more details or customization options (including multinode deployments), see **[Installation Guide for Dynamo Kubernetes Platform](/docs/kubernetes/installation_guide.md)**.
atchernych's avatar
atchernych committed
39

40
## 2. Choose Your Backend
41

42
Each backend has deployment examples and configuration options:
43

44
45
| Backend | Available Configurations |
|---------|--------------------------|
46
| **[vLLM](/components/backends/vllm/deploy/README.md)** | Aggregated, Aggregated + Router, Disaggregated, Disaggregated + Router, Disaggregated + Planner, Disaggregated Multi-node |
47
| **[SGLang](/components/backends/sglang/deploy/README.md)** | Aggregated, Aggregated + Router, Disaggregated, Disaggregated + Planner, Disaggregated Multi-node |
48
| **[TensorRT-LLM](/components/backends/trtllm/deploy/README.md)** | Aggregated, Aggregated + Router, Disaggregated, Disaggregated + Router, Disaggregated Multi-node |
atchernych's avatar
atchernych committed
49

50
## 3. Deploy Your First Model
atchernych's avatar
atchernych committed
51

52
53
```bash
export NAMESPACE=dynamo-cloud
54
kubectl create namespace ${NAMESPACE}
atchernych's avatar
atchernych committed
55

Alec's avatar
Alec committed
56
57
58
59
60
61
# to pull model from HF
export HF_TOKEN=<Token-Here>
kubectl create secret generic hf-token-secret \
  --from-literal=HF_TOKEN="$HF_TOKEN" \
  -n ${NAMESPACE};

62
63
# Deploy any example (this uses vLLM with Qwen model using aggregated serving)
kubectl apply -f components/backends/vllm/deploy/agg.yaml -n ${NAMESPACE}
atchernych's avatar
atchernych committed
64

65
66
# Check status
kubectl get dynamoGraphDeployment -n ${NAMESPACE}
atchernych's avatar
atchernych committed
67

68
# Test it
Alec's avatar
Alec committed
69
kubectl port-forward svc/vllm-agg-frontend 8000:8000 -n ${NAMESPACE}
70
curl http://localhost:8000/v1/models
atchernych's avatar
atchernych committed
71
72
```

73
## What's a DynamoGraphDeployment?
atchernych's avatar
atchernych committed
74

75
76
77
78
79
It's a Kubernetes Custom Resource that defines your inference pipeline:
- Model configuration
- Resource allocation (GPUs, memory)
- Scaling policies
- Frontend/backend connections
atchernych's avatar
atchernych committed
80

81
Refer to the [API Reference and Documentation](/docs/kubernetes/api_reference.md) for more details.
atchernych's avatar
atchernych committed
82

83
84
85
86
## 📖 API Reference & Documentation

For detailed technical specifications of Dynamo's Kubernetes resources:

87
88
89
- **[API Reference](/docs/kubernetes/api_reference.md)** - Complete CRD field specifications for `DynamoGraphDeployment` and `DynamoComponentDeployment`
- **[Operator Guide](/docs/kubernetes/dynamo_operator.md)** - Dynamo operator configuration and management
- **[Create Deployment](/docs/kubernetes/create_deployment.md)** - Step-by-step deployment creation examples
90

91
### Choosing Your Architecture Pattern
atchernych's avatar
atchernych committed
92

93
When creating a deployment, select the architecture pattern that best fits your use case:
atchernych's avatar
atchernych committed
94

95
96
97
- **Development / Testing** - Use `agg.yaml` as the base configuration
- **Production with Load Balancing** - Use `agg_router.yaml` to enable scalable, load-balanced inference
- **High Performance / Disaggregated** - Use `disagg_router.yaml` for maximum throughput and modular scalability
atchernych's avatar
atchernych committed
98

99
### Frontend and Worker Components
atchernych's avatar
atchernych committed
100

101
You can run the Frontend on one machine (e.g., a CPU node) and workers on different machines (GPU nodes). The Frontend serves as a framework-agnostic HTTP entry point that:
atchernych's avatar
atchernych committed
102

103
104
105
106
- Provides OpenAI-compatible `/v1/chat/completions` endpoint
- Auto-discovers backend workers via etcd
- Routes requests and handles load balancing
- Validates and preprocesses requests
atchernych's avatar
atchernych committed
107

108
### Customizing Your Deployment
atchernych's avatar
atchernych committed
109

110
111
112
113
114
115
116
117
118
119
120
121
122
Example structure:
```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: my-llm
spec:
  services:
    Frontend:
      dynamoNamespace: my-llm
      componentType: frontend
      replicas: 1
      extraPodSpec:
atchernych's avatar
atchernych committed
123
        mainContainer:
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
          image: your-image
    VllmDecodeWorker:  # or SGLangDecodeWorker, TrtllmDecodeWorker
      dynamoNamespace: dynamo-dev
      componentType: worker
      replicas: 1
      envFromSecret: hf-token-secret  # for HuggingFace models
      resources:
        limits:
          gpu: "1"
      extraPodSpec:
        mainContainer:
          image: your-image
          command: ["/bin/sh", "-c"]
          args:
            - python3 -m dynamo.vllm --model YOUR_MODEL [--your-flags]
atchernych's avatar
atchernych committed
139
140
```

141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
Worker command examples per backend:
```yaml
# vLLM worker
args:
  - python3 -m dynamo.vllm --model Qwen/Qwen3-0.6B

# SGLang worker
args:
  - >-
    python3 -m dynamo.sglang
    --model-path deepseek-ai/DeepSeek-R1-Distill-Llama-8B
    --tp 1
    --trust-remote-code

# TensorRT-LLM worker
args:
  - python3 -m dynamo.trtllm
    --model-path deepseek-ai/DeepSeek-R1-Distill-Llama-8B
    --served-model-name deepseek-ai/DeepSeek-R1-Distill-Llama-8B
    --extra-engine-args engine_configs/agg.yaml
atchernych's avatar
atchernych committed
161
```
162

163
164
165
166
167
168
Key customization points include:
- **Model Configuration**: Specify model in the args command
- **Resource Allocation**: Configure GPU requirements under `resources.limits`
- **Scaling**: Set `replicas` for number of worker instances
- **Routing Mode**: Enable KV-cache routing by setting `DYN_ROUTER_MODE=kv` in Frontend envs
- **Worker Specialization**: Add `--is-prefill-worker` flag for disaggregated prefill workers
169

170
## Additional Resources
171

172
- **[Examples](/examples/README.md)** - Complete working examples
173
174
- **[Create Custom Deployments](/docs/kubernetes/create_deployment.md)** - Build your own CRDs
- **[Operator Documentation](/docs/kubernetes/dynamo_operator.md)** - How the platform works
175
- **[Helm Charts](/deploy/helm/README.md)** - For advanced users
176
177
178
179
180
- **[GitOps Deployment with FluxCD](/docs/kubernetes/fluxcd.md)** - For advanced users
- **[Logging](/docs/kubernetes/logging.md)** - For logging setup
- **[Multinode Deployment](/docs/kubernetes/multinode-deployment.md)** - For multinode deployment
- **[Grove](/docs/kubernetes/grove.md)** - For grove details and custom installation
- **[Monitoring](/docs/kubernetes/metrics.md)** - For monitoring setup
Alec's avatar
Alec committed
181
- **[Model Caching with Fluid](/docs/kubernetes/model_caching_with_fluid.md)** - For model caching with Fluid