<!--
SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->

# Deploying Inference Graphs to Kubernetes

High-level guide to Dynamo Kubernetes deployments. Start here, then dive into specific guides.

## 1. Install Platform First
**[Dynamo Kubernetes Platform](dynamo_cloud.md)** - Main installation guide with 3 paths

## 2. Choose Your Backend

Each backend has deployment examples and configuration options:

| Backend | Available Configurations |
|---------|--------------------------|
| **[vLLM](../../../components/backends/vllm/deploy/README.md)** | Aggregated, Aggregated + Router, Disaggregated, Disaggregated + Router, Disaggregated + Planner |
| **[SGLang](../../../components/backends/sglang/deploy/README.md)** | Aggregated, Aggregated + Router, Disaggregated, Disaggregated + Planner, Disaggregated Multi-node |
| **[TensorRT-LLM](../../../components/backends/trtllm/deploy/README.md)** | Aggregated, Aggregated + Router, Disaggregated, Disaggregated + Router |

## 3. Deploy Your First Model

```bash
# Use the same namespace as the platform install
export NAMESPACE=dynamo-cloud

# Deploy an example (vLLM serving a Qwen model in aggregated mode)
kubectl apply -f components/backends/vllm/deploy/agg.yaml -n ${NAMESPACE}

# Check status
kubectl get dynamoGraphDeployment -n ${NAMESPACE}

# Test it: port-forward blocks, so run it in the background (or in a second terminal)
kubectl port-forward svc/agg-vllm-frontend 8000:8000 -n ${NAMESPACE} &
sleep 2  # give the port-forward a moment to start
curl http://localhost:8000/v1/models
```

## What's a DynamoGraphDeployment?

It's a Kubernetes Custom Resource that defines your inference pipeline:
- Model configuration
- Resource allocation (GPUs, memory)
- Scaling policies
- Frontend/backend connections

The scripts in the `components/<backend>/launch` folder, such as `agg.sh`, show how to serve a model locally. The corresponding YAML files, such as `agg.yaml`, show how to deploy the same inference graph to Kubernetes.

### Choosing Your Architecture Pattern

When creating a deployment, select the architecture pattern that best fits your use case:

- **Development / Testing** - Use `agg.yaml` as the base configuration
- **Production with Load Balancing** - Use `agg_router.yaml` to enable scalable, load-balanced inference
- **High Performance / Disaggregated** - Use `disagg_router.yaml` for maximum throughput and modular scalability

### Frontend and Worker Components

You can run the Frontend on one machine (e.g., a CPU node) and workers on different machines (GPU nodes). The Frontend serves as a framework-agnostic HTTP entry point that:

- Provides an OpenAI-compatible `/v1/chat/completions` endpoint
- Auto-discovers backend workers via etcd
- Routes requests and handles load balancing
- Validates and preprocesses requests
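
As a rough sketch of that CPU/GPU split, the fragment below keeps the Frontend free of GPU limits and requests one GPU for the worker. It reuses the field names from the example structure in this guide. The `nodeSelector` entries and the `example.com/pool` label are hypothetical, and the sketch assumes `extraPodSpec` accepts standard pod-spec fields such as `nodeSelector`; check the operator documentation before relying on it.

```yaml
spec:
  services:
    Frontend:
      componentType: frontend
      replicas: 1
      extraPodSpec:
        nodeSelector:               # assumption: standard pod-spec field is accepted here
          example.com/pool: cpu     # hypothetical node label
        mainContainer:
          image: your-image
    VllmDecodeWorker:
      componentType: worker
      replicas: 1
      resources:
        limits:
          gpu: "1"                  # GPU request already steers this pod to GPU nodes
      extraPodSpec:
        nodeSelector:               # assumption, as above
          example.com/pool: gpu     # hypothetical node label
        mainContainer:
          image: your-image
```

If your cluster only exposes GPUs on GPU nodes, the `resources.limits.gpu` entry alone is usually enough to place workers, and the selectors can be dropped.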

### Customizing Your Deployment

Example structure:
```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: my-llm
spec:
  services:
    Frontend:
      dynamoNamespace: my-llm
      componentType: frontend
      replicas: 1
      extraPodSpec:
        mainContainer:
          image: your-image
    VllmDecodeWorker:  # or SGLangDecodeWorker, TrtllmDecodeWorker
      dynamoNamespace: my-llm  # same Dynamo namespace as the Frontend
      componentType: worker
      replicas: 1
      envFromSecret: hf-token-secret  # for HuggingFace models
      resources:
        limits:
          gpu: "1"
      extraPodSpec:
        mainContainer:
          image: your-image
          command: ["/bin/sh", "-c"]
          args:
            - python3 -m dynamo.vllm --model YOUR_MODEL [--your-flags]
```
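
The worker above reads its Hugging Face credentials from `hf-token-secret` through `envFromSecret`. A minimal sketch of such a Secret follows. The key name `HF_TOKEN` is an assumption (it is the variable Hugging Face tooling commonly reads) and the value is a placeholder.

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: hf-token-secret      # must match the envFromSecret value
type: Opaque
stringData:
  HF_TOKEN: YOUR_HF_TOKEN    # assumed key name; replace the placeholder value
```

Create it in the same Kubernetes namespace as the deployment (for example with `kubectl apply -f <file> -n ${NAMESPACE}`) before applying the graph, so the worker pod can start.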

Worker command examples per backend:
```yaml
# vLLM worker
args:
  - python3 -m dynamo.vllm --model Qwen/Qwen3-0.6B

# SGLang worker
args:
  - >-
    python3 -m dynamo.sglang
    --model-path deepseek-ai/DeepSeek-R1-Distill-Llama-8B
    --tp 1
    --trust-remote-code

# TensorRT-LLM worker
args:
  - >-
    python3 -m dynamo.trtllm
    --model-path deepseek-ai/DeepSeek-R1-Distill-Llama-8B
    --served-model-name deepseek-ai/DeepSeek-R1-Distill-Llama-8B
    --extra-engine-args engine_configs/agg.yaml
```

Key customization points include:
- **Model Configuration**: Specify the model in the worker's `args` command
- **Resource Allocation**: Configure GPU requirements under `resources.limits`
- **Scaling**: Set `replicas` to the number of worker instances
- **Routing Mode**: Enable KV-cache routing by setting `DYN_ROUTER_MODE=kv` in the Frontend `envs`
- **Worker Specialization**: Add the `--is-prefill-worker` flag for disaggregated prefill workers
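
The fragment below sketches how several of these points might combine in one spec. It is illustrative only: the `VllmPrefillWorker` service name is hypothetical, and the name/value layout under `envs` is an assumption modeled on standard Kubernetes container environment entries.

```yaml
spec:
  services:
    Frontend:
      componentType: frontend
      replicas: 1
      envs:                          # assumed name/value list layout
        - name: DYN_ROUTER_MODE
          value: kv                  # enables KV-cache routing
    VllmPrefillWorker:               # hypothetical service name
      componentType: worker
      replicas: 2                    # scaling: two prefill worker instances
      resources:
        limits:
          gpu: "1"                   # resource allocation: one GPU per instance
      extraPodSpec:
        mainContainer:
          image: your-image
          command: ["/bin/sh", "-c"]
          args:
            - python3 -m dynamo.vllm --model YOUR_MODEL --is-prefill-worker
```

Compare it against the `disagg_router.yaml` example for your backend before use, since the shipped examples are the reference for exact field names.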

## Additional Resources

- **[Examples](../../examples/README.md)** - Complete working examples
- **[Create Custom Deployments](create_deployment.md)** - Build your own CRDs
- **[Operator Documentation](dynamo_operator.md)** - How the platform works
- **[Helm Charts](../../../deploy/helm/README.md)** - For advanced users