Commit 5c7e66ec authored by hhzhang16, committed by GitHub

docs: add docs for DGDR usage -- golden path (#6946)


Signed-off-by: Hannah Zhang <hannahz@nvidia.com>
parent 38bf037b
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# DynamoGraphDeploymentRequest for AI Configurator-based profiling
apiVersion: nvidia.com/v1beta1
kind: DynamoGraphDeploymentRequest
metadata:
  name: sla-aic
spec:
  model: Qwen/Qwen3-32B
  backend: trtllm
  image: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:my-tag"
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# DynamoGraphDeploymentRequest for online profiling (actual deployment testing)
apiVersion: nvidia.com/v1beta1
kind: DynamoGraphDeploymentRequest
metadata:
  name: sla-online
spec:
  model: Qwen/Qwen3-0.6B
  backend: vllm
  image: "nvcr.io/nvidia/ai-dynamo/vllm-runtime:my-tag" # tag must be at least 1.0.0
  searchStrategy: thorough
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# DynamoGraphDeploymentRequest for MoE model profiling
apiVersion: nvidia.com/v1beta1
kind: DynamoGraphDeploymentRequest
metadata:
  name: sla-moe
spec:
  model: deepseek-ai/DeepSeek-R1
  backend: sglang
  image: "nvcr.io/nvidia/ai-dynamo/dynamo-frontend:my-tag"
  searchStrategy: rapid
  modelCache:
    pvcName: "model-cache" # Name of PVC containing model weights
    pvcModelPath: "deepseek-r1" # Subpath within PVC where model is stored
  hardware:
    # for h200, sweep over 8-16 GPUs per engine
    numGpusPerNode: 8 # Override auto-discovered value if different
......@@ -8,60 +8,45 @@ Complete examples for profiling with DGDRs.
## DGDR Examples
### Dense Model: AIPerf on Real Engines
### Dense Model: Rapid
Standard online profiling with real GPU measurements:
Fast profiling (~30 seconds):
```yaml
apiVersion: nvidia.com/v1beta1
kind: DynamoGraphDeploymentRequest
metadata:
  name: vllm-dense-online
  name: qwen-0-6b
spec:
  model: "Qwen/Qwen3-0.6B"
  backend: vllm
  image: "nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.9.0"
  workload:
    isl: 3000
    osl: 150
  sla:
    ttft: 200.0
    itl: 20.0
  autoApply: true
  image: "nvcr.io/nvidia/ai-dynamo/dynamo-frontend:1.0.0"
```
### Dense Model: AI Configurator Simulation
### Dense Model: Thorough
Fast offline profiling (~30 seconds, TensorRT-LLM only):
Profiling with real GPU measurements:
```yaml
apiVersion: nvidia.com/v1beta1
kind: DynamoGraphDeploymentRequest
metadata:
  name: trtllm-aic-offline
  name: vllm-dense-online
spec:
  model: "Qwen/Qwen3-32B"
  backend: trtllm
  image: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.9.0"
  workload:
    isl: 4000
    osl: 500
  sla:
    ttft: 300.0
    itl: 10.0
  autoApply: true
  model: "Qwen/Qwen3-0.6B"
  backend: vllm
  image: "nvcr.io/nvidia/ai-dynamo/dynamo-frontend:1.0.0"
  searchStrategy: thorough
```
### MoE Model
Multi-node MoE profiling with SGLang:
> [!IMPORTANT]
> The PVC referenced by `modelCache.pvcName` must already exist in the same namespace and contain
> the model weights at the specified `pvcModelPath`. The DGDR controller does not create or
> populate the PVC — it only mounts it into the profiling job and deployed workers.
```yaml
apiVersion: nvidia.com/v1beta1
kind: DynamoGraphDeploymentRequest
......@@ -70,53 +55,138 @@ metadata:
spec:
  model: "deepseek-ai/DeepSeek-R1"
  backend: sglang
  image: "nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.9.0"
  workload:
    isl: 2048
    osl: 512
  sla:
    ttft: 300.0
    itl: 25.0
  image: "nvcr.io/nvidia/ai-dynamo/dynamo-frontend:1.0.0"
  hardware:
    numGpusPerNode: 8
  autoApply: true
  modelCache:
    pvcName: "model-cache"
    pvcModelPath: "deepseek-r1" # path within the PVC
```
### Using Existing DGD Config (ConfigMap)
### Private Model
Reference a custom DGD configuration via ConfigMap:
For gated or private HuggingFace models, pass your token via an environment variable injected
into the profiling job. Create the secret first:
```bash
# Create ConfigMap from your DGD config file
kubectl create configmap deepseek-r1-config \
  --from-file=/path/to/your/disagg.yaml \
  --namespace $NAMESPACE \
  --dry-run=client -o yaml | kubectl apply -f -
kubectl create secret generic hf-token-secret \
  --from-literal=HF_TOKEN="${HF_TOKEN}" \
  -n ${NAMESPACE}
```
Then reference it in your DGDR:
```yaml
apiVersion: nvidia.com/v1beta1
kind: DynamoGraphDeploymentRequest
metadata:
  name: deepseek-r1
  name: llama-private
spec:
  model: deepseek-ai/DeepSeek-R1
  backend: sglang
  image: "nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.9.0"
  model: "meta-llama/Llama-3.1-8B-Instruct"
  image: "nvcr.io/nvidia/ai-dynamo/dynamo-frontend:1.0.0"
  overrides:
    profilingJob:
      template:
        spec:
          containers: [] # required placeholder; leave empty to inherit defaults
          initContainers:
            - name: profiler
              env:
                - name: HF_TOKEN
                  valueFrom:
                    secretKeyRef:
                      name: hf-token-secret
                      key: HF_TOKEN
```
### Custom SLA Targets
Control how the profiler optimizes your deployment by specifying latency targets and workload
characteristics.
**Explicit TTFT + ITL targets** (default mode):
```yaml
apiVersion: nvidia.com/v1beta1
kind: DynamoGraphDeploymentRequest
metadata:
  name: low-latency-dense
spec:
  model: "Qwen/Qwen3-0.6B"
  image: "nvcr.io/nvidia/ai-dynamo/dynamo-frontend:1.0.0"
  sla:
    ttft: 500 # Time To First Token target in milliseconds
    itl: 20 # Inter-Token Latency target in milliseconds
  workload:
    isl: 4000
    osl: 500
    isl: 2000 # expected input sequence length (tokens)
    osl: 500 # expected output sequence length (tokens)
```
**End-to-end latency target** (alternative to ttft+itl):
```yaml
spec:
  ...
  sla:
    e2eLatency: 10000 # total request latency budget in milliseconds
```
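To get a feel for how an end-to-end budget compares with separate TTFT and ITL targets, a back-of-envelope estimate can help. The sketch below assumes the common approximation that total latency is the first-token time plus one inter-token gap for each remaining output token. This is an illustrative assumption, not the profiler's own definition, and `estimate_e2e_ms` is a hypothetical helper.

```bash
# Rough estimate only (an assumption, not how the profiler computes it):
#   e2e_ms ~= ttft_ms + itl_ms * (osl - 1)
# Integer inputs: TTFT in ms, ITL in ms, output sequence length in tokens.
estimate_e2e_ms() {
  local ttft_ms="$1" itl_ms="$2" osl="$3"
  echo $(( ttft_ms + itl_ms * (osl - 1) ))
}

# With the explicit targets shown earlier (ttft 500, itl 20, osl 500):
estimate_e2e_ms 500 20 500   # prints 10480
```

Under this approximation, the explicit targets above correspond to roughly 10.5 seconds per request, slightly above the 10000 ms budget in the `e2eLatency` example.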
**Optimization objective without explicit targets** (maximize throughput or minimize latency):
```yaml
spec:
  ...
  sla:
    ttft: 300
    itl: 10
    optimizationType: throughput # or: latency
```
### Overrides
Use `overrides` to customize the profiling job pod spec — for example to add tolerations for
GPU node taints or inject environment variables.
**GPU node toleration** (common on GKE and shared clusters):
autoApply: true
```yaml
apiVersion: nvidia.com/v1beta1
kind: DynamoGraphDeploymentRequest
metadata:
  name: dense-with-tolerations
spec:
  model: "Qwen/Qwen3-0.6B"
  image: "nvcr.io/nvidia/ai-dynamo/dynamo-frontend:1.0.0"
  overrides:
    profilingJob:
      template:
        spec:
          containers: [] # required placeholder; leave empty to inherit defaults
          tolerations:
            - key: nvidia.com/gpu
              operator: Exists
              effect: NoSchedule
```
**Override the generated DynamoGraphDeployment** (e.g., to use a custom worker image):
```yaml
spec:
  ...
  overrides:
    dgd:
      apiVersion: nvidia.com/v1alpha1
      kind: DynamoGraphDeployment
      spec:
        services:
          VllmWorker:
            extraEnvs:
              - name: CUSTOM_ENV
                value: "my-value"
```
## SGLang Runtime Profiling
......
......@@ -45,6 +45,8 @@ navigation:
contents:
- page: Detailed Installation Guide
path: kubernetes/installation-guide.md
- page: Deploying Your First Model
path: kubernetes/dgdr.md
- page: Dynamo Operator
path: kubernetes/dynamo-operator.md
- page: Service Discovery
......
......@@ -82,26 +82,12 @@ Each backend has deployment examples and configuration options:
## 3. Deploy Your First Model
```bash
export NAMESPACE=dynamo-system
kubectl create namespace ${NAMESPACE}
# to pull model from HF
export HF_TOKEN=<Token-Here>
kubectl create secret generic hf-token-secret \
--from-literal=HF_TOKEN="$HF_TOKEN" \
-n ${NAMESPACE};
Follow the **[Deploying Your First Model](dgdr.md)** guide for a complete end-to-end
walkthrough using `DynamoGraphDeploymentRequest` (DGDR) — Dynamo's recommended path that
handles profiling and configuration automatically.
# Deploy any example (this uses vLLM with Qwen model using aggregated serving)
kubectl apply -f examples/backends/vllm/deploy/agg.yaml -n ${NAMESPACE}
# Check status
kubectl get dynamoGraphDeployment -n ${NAMESPACE}
# Test it
kubectl port-forward svc/vllm-agg-frontend 8000:8000 -n ${NAMESPACE}
curl http://localhost:8000/v1/models
```
The tutorial deploys `Qwen/Qwen3-0.6B` with vLLM and walks you through every step: creating
the DGDR, watching the profiling lifecycle, and sending your first inference request.
For SLA-based autoscaling, see [SLA Planner Guide](../components/planner/planner-guide.md).
......
---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
title: Deploying Your First Model
---
# Deploying Your First Model
End-to-end tutorial for deploying `Qwen/Qwen3-0.6B` on Kubernetes using Dynamo's recommended
`DynamoGraphDeploymentRequest` (DGDR) workflow — from zero to your first inference response.
> [!NOTE]
> This guide assumes you have already completed the
> [platform installation](installation-guide.md) and that the Dynamo operator and CRDs are
> running in your cluster.
## What is a DynamoGraphDeploymentRequest?
A `DynamoGraphDeploymentRequest` (DGDR) is Dynamo's **deploy-by-intent** API. You describe what
you want to run and your performance targets; Dynamo's profiler determines the optimal
configuration automatically, then creates the live deployment for you.
| | DGDR (this guide) | DGD (manual) |
|---|---|---|
| **You provide** | Model + optional SLA targets | Full deployment spec |
| **Profiling** | Automated | You bring your own config |
| **Best for** | Getting started, SLA-driven deployments | Fine-grained control |
For a deeper comparison, see [Understanding Dynamo's Custom Resources](README.md#understanding-dynamos-custom-resources).
## Prerequisites
Before starting, confirm:
- Platform installed: `kubectl get pods -n ${NAMESPACE}` shows operator pods `Running`
- CRDs present: `kubectl get crd | grep dynamo` shows `dynamographdeploymentrequests.nvidia.com`
- `kubectl` and `helm` available in your shell
Set these variables once — they are referenced throughout the guide:
```bash
export NAMESPACE=dynamo-system # namespace where the platform is installed
export RELEASE_VERSION=1.x.x # match the installed platform version (e.g. 1.0.0)
export HF_TOKEN="<your-hf-token>" # HuggingFace token (quoted so the placeholder is not parsed as a shell redirect)
```
> [!TIP]
> `Qwen/Qwen3-0.6B` is a public model. A HuggingFace token is not strictly required to download
> it, but is recommended to avoid rate limiting.
## Step 1: Configure Namespace and Secrets
```bash
# Create the namespace (idempotent — safe to run even if it already exists)
kubectl create namespace ${NAMESPACE} --dry-run=client -o yaml | kubectl apply -f -
# Create the HuggingFace token secret for model download
kubectl create secret generic hf-token-secret \
  --from-literal=HF_TOKEN="${HF_TOKEN}" \
  -n ${NAMESPACE}
```
Verify the secret was created:
```bash
kubectl get secret hf-token-secret -n ${NAMESPACE}
```
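Unlike the namespace command, `kubectl create secret` fails if the secret already exists. One way to make this step repeatable is to reuse the `--dry-run=client -o yaml | kubectl apply -f -` pattern from the namespace command. The sketch below is a minimal example under that assumption; `create_hf_secret` is a hypothetical helper, not part of Dynamo, and it also refuses to run while `HF_TOKEN` is empty or still a `<...>` placeholder.

```bash
# Hypothetical helper: create or update hf-token-secret in ${NAMESPACE}.
create_hf_secret() {
  # Guard against an unset token or an unedited "<your-hf-token>" placeholder.
  case "${HF_TOKEN:-}" in
    "" | "<"*">")
      echo "HF_TOKEN is empty or still a placeholder" >&2
      return 1
      ;;
  esac
  # Render the Secret locally, then let kubectl apply create or update it.
  kubectl create secret generic hf-token-secret \
    --from-literal=HF_TOKEN="${HF_TOKEN}" \
    -n "${NAMESPACE}" \
    --dry-run=client -o yaml | kubectl apply -f -
}

# Usage: create_hf_secret
```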
## Step 2: Create the DynamoGraphDeploymentRequest
Save the following as `qwen3-first-model.yaml`:
```yaml
apiVersion: nvidia.com/v1beta1
kind: DynamoGraphDeploymentRequest
metadata:
  name: qwen3-first-model
spec:
  # Model to profile and deploy
  model: Qwen/Qwen3-0.6B
  # Container image for the profiling job; must match your installed platform version.
  # This is the same dynamo-frontend image used by the deployed inference service.
  image: "nvcr.io/nvidia/ai-dynamo/dynamo-frontend:${RELEASE_VERSION}"
```
Apply it (uses `envsubst` to substitute the `RELEASE_VERSION` shell variable into the YAML):
```bash
envsubst < qwen3-first-model.yaml | kubectl apply -f - -n ${NAMESPACE}
```
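Plain `envsubst` replaces every `$VAR` reference it finds, and an unset `RELEASE_VERSION` silently yields an image tag ending in `:`. It can therefore help to render the manifest and inspect it before applying. The following is a minimal sketch, not part of the Dynamo tooling: `render_dgdr` is a hypothetical helper that uses `sed` to substitute only `${RELEASE_VERSION}` and fails when the variable is unset.

```bash
# Hypothetical helper: print the manifest with only ${RELEASE_VERSION} replaced.
# Any other "$" text in the file is left untouched.
render_dgdr() {
  local manifest="$1"
  if [ -z "${RELEASE_VERSION:-}" ]; then
    echo "RELEASE_VERSION is not set" >&2
    return 1
  fi
  # \$ keeps the placeholder literal; the second expansion is the real value.
  sed "s|\${RELEASE_VERSION}|${RELEASE_VERSION}|g" "$manifest"
}

# Usage (assumes qwen3-first-model.yaml from this step exists):
#   render_dgdr qwen3-first-model.yaml        # inspect the rendered manifest
#   render_dgdr qwen3-first-model.yaml | kubectl apply -f - -n "${NAMESPACE}"
```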
### Field reference
| Field | Required | Default | Purpose |
|---|---|---|---|
| `model` | Yes | — | HuggingFace model ID (e.g. `Qwen/Qwen3-0.6B`) |
| `image` | No | — | Container image for the profiling job (`dynamo-frontend`) |
| `backend` | No | `auto` | Inference engine (`auto`, `vllm`, `sglang`, `trtllm`) |
| `searchStrategy` | No | `rapid` | Profiling depth — `rapid` (~30s, AIC simulation) or `thorough` (2–4h, real GPUs) |
| `autoApply` | No | `true` | Automatically create and start the deployment after profiling |
| `sla` | No | — | Target latency (TTFT, ITL in ms) for profiler optimization |
| `workload` | No | — | Expected traffic shape (ISL, OSL, request rate) |
| `hardware` | No | auto-detected | GPU SKU and count override; required when GPU discovery is disabled. When not set, the auto-discovered GPU count is capped at 32 — set `hardware.totalGpus` explicitly to use more. |
For the full spec reference, see the [DGDR API Reference](api-reference.md) and
[Profiler Guide](../components/profiler/profiler-guide.md).
> [!IMPORTANT]
> If you are using a **namespace-scoped operator** with GPU discovery disabled, you must also
> provide explicit hardware info or the DGDR will be rejected at admission:
>
> ```yaml
> spec:
>   ...
>   hardware:
>     numGpusPerNode: 1
>     gpuSku: "H100-SXM5-80GB"
>     vramMb: 81920
> ```
>
> See the [installation guide](installation-guide.md#gpu-discovery-for-dynamographdeploymentrequests-with-namespace-scoped-operators)
> for details.
## Step 3: Monitor Profiling Progress
Profiling is the automated step where Dynamo sweeps across candidate configurations (parallelism, batching, scheduling strategies) to find the one that best meets your SLA and hardware — so you don't have to tune it manually.
Watch the DGDR status in real time:
```bash
kubectl get dynamographdeploymentrequest qwen3-first-model -n ${NAMESPACE} -w
```
The `PHASE` column progresses through:
| Phase | What is happening |
|---|---|
| `Pending` (condition: `DiscoveringHardware`) | Spec validated; operator is discovering GPU hardware and preparing the profiling job |
| `Profiling` | Profiling job is running (AIC simulation or real-GPU sweep) |
| `Ready` | Profiling complete; optimal config stored in `.status`. Terminal state when `autoApply: false` |
| `Deploying` | Creating the `DynamoGraphDeployment` (only when `autoApply: true`) |
| `Deployed` | DGD is running and healthy |
| `Failed` | Unrecoverable error — check events for details |
> [!TIP]
> `Deployed` is the success terminal state when `autoApply: true` (the default).
> If you set `autoApply: false`, the phase stops at `Ready` — profiling is complete and the
> generated DGD spec is stored in `.status`, but no deployment is created automatically.
> To inspect and deploy it manually:
>
> ```bash
> # View the generated DGD spec
> kubectl get dynamographdeploymentrequest qwen3-first-model -n ${NAMESPACE} \
>   -o jsonpath='{.status.profilingResults.selectedConfig}' | python3 -m json.tool
>
> # Save it and apply
> kubectl get dynamographdeploymentrequest qwen3-first-model -n ${NAMESPACE} \
>   -o jsonpath='{.status.profilingResults.selectedConfig}' > generated-dgd.yaml
> kubectl apply -f generated-dgd.yaml -n ${NAMESPACE}
> ```
For a full status summary and events:
```bash
kubectl describe dynamographdeploymentrequest qwen3-first-model -n ${NAMESPACE}
```
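In scripts or CI, a polling loop may be more convenient than watching interactively. The sketch below is one possible approach, not an official tool. It assumes the `PHASE` column is backed by `.status.phase` on the DGDR object, and both `get_phase` and `wait_for_dgdr` are hypothetical helper names.

```bash
# Assumption: the PHASE column reads from .status.phase.
get_phase() {
  kubectl get dynamographdeploymentrequest "$1" -n "${NAMESPACE}" \
    -o jsonpath='{.status.phase}'
}

# Poll until the DGDR reaches the target phase, fails, or attempts run out.
# Args: <dgdr-name> [target-phase] [max-attempts] [sleep-seconds]
# Returns: 0 on target phase, 1 on Failed, 2 on timeout.
wait_for_dgdr() {
  local name="$1" target="${2:-Deployed}" attempts="${3:-120}" interval="${4:-10}"
  local phase i=0
  while [ "$i" -lt "$attempts" ]; do
    phase="$(get_phase "$name")"
    echo "phase: ${phase:-unknown}"
    if [ "$phase" = "$target" ]; then return 0; fi
    if [ "$phase" = "Failed" ]; then return 1; fi
    sleep "$interval"
    i=$((i + 1))
  done
  echo "timed out waiting for ${name} to reach ${target}" >&2
  return 2
}

# Usage:
#   wait_for_dgdr qwen3-first-model            # waits for Deployed
#   wait_for_dgdr qwen3-first-model Ready      # when autoApply: false
```

The target phase is a parameter because `Ready` is only terminal when `autoApply: false`; with the default `autoApply: true`, the loop should keep waiting until `Deployed`.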
To follow the profiling job logs:
```bash
# Find the profiling pod
kubectl get pods -n ${NAMESPACE} -l nvidia.com/dgdr-name=qwen3-first-model
# Stream its logs
kubectl logs -f <profiling-pod-name> -n ${NAMESPACE}
```
> [!TIP]
> With `searchStrategy: rapid`, profiling typically completes in under 15 minutes on a single GPU.
## Step 4: Verify the Deployment
Once the DGDR reaches `Deployed`, the `DynamoGraphDeployment` has been created automatically.
Check that everything is running:
```bash
# See the auto-created DGD
kubectl get dynamographdeployment -n ${NAMESPACE}
# Confirm all pods are Running
kubectl get pods -n ${NAMESPACE}
```
Wait until pods are ready:
```bash
kubectl wait --for=condition=ready pod \
  -l nvidia.com/dynamo-deployment=qwen3-first-model \
  -n ${NAMESPACE} \
  --timeout=600s
```
Find the frontend service name:
```bash
kubectl get svc -n ${NAMESPACE} | grep frontend
```
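To avoid copying the service name by hand, the lookup can be captured in a variable. This is a small sketch under the assumption that exactly one service in the namespace has `frontend` in its name; `pick_frontend` is a hypothetical helper, and the service names in the comments are made up for illustration.

```bash
# Hypothetical helper: read "service/<name>" lines (the format printed by
# `kubectl get svc -o name`) and print the first name containing "frontend".
pick_frontend() {
  awk -F/ '/frontend/ { print $2; exit }'
}

# Usage:
#   FRONTEND_SVC="$(kubectl get svc -n "${NAMESPACE}" -o name | pick_frontend)"
#   echo "${FRONTEND_SVC}"
```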
## Step 5: Send Your First Request
Port-forward to the frontend and send an inference request:
```bash
# Start port-forward (replace <frontend-service-name> with the name from Step 4)
kubectl port-forward svc/<frontend-service-name> 8000:8000 -n ${NAMESPACE} &
# Confirm the model is available
curl http://localhost:8000/v1/models
# Send a chat completion request
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-0.6B",
    "messages": [{"role": "user", "content": "What is NVIDIA Dynamo?"}],
    "max_tokens": 200
  }'
```
A successful response looks like:
```json
{
  "id": "chatcmpl-...",
  "object": "chat.completion",
  "model": "Qwen/Qwen3-0.6B",
  "choices": [{
    "message": {
      "role": "assistant",
      "content": "NVIDIA Dynamo is a high-performance inference framework..."
    }
  }]
}
```
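When scripting against the endpoint, usually only the assistant text is needed. The sketch below shows one way to pull it out of a response shaped like the one above. `extract_reply` is a hypothetical helper; it relies on `python3`, which this guide already uses for `json.tool`.

```bash
# Hypothetical helper: read a chat completion response on stdin and print
# choices[0].message.content.
extract_reply() {
  python3 -c 'import json, sys; print(json.load(sys.stdin)["choices"][0]["message"]["content"])'
}

# Usage (assumes the port-forward from this step is still running):
#   curl -s http://localhost:8000/v1/chat/completions \
#     -H "Content-Type: application/json" \
#     -d '{"model": "Qwen/Qwen3-0.6B", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 50}' \
#     | extract_reply
```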
Your first model is now live.
## Cleanup
To remove the deployment and profiling artifacts:
```bash
kubectl delete dynamographdeploymentrequest qwen3-first-model -n ${NAMESPACE}
```
> [!NOTE]
> Deleting a DGDR does **not** delete the `DynamoGraphDeployment` it created. The DGD persists
> independently so it can continue serving traffic.
## Troubleshooting
**DGDR stuck in `Pending`**
```bash
kubectl describe dynamographdeploymentrequest qwen3-first-model -n ${NAMESPACE}
# Check the Events section at the bottom
```
Common causes: no available GPU nodes, image pull failure (check image tag; NGC credentials are
optional but may be needed if you hit rate limits pulling from public NGC), missing `hardware`
config for a namespace-scoped operator.
> [!TIP]
> **GPU node taints** are a frequent cause of pods staying `Pending`. Many clusters (including
> GKE by default and most shared/HPC environments) taint GPU nodes with
> `nvidia.com/gpu:NoSchedule` so that only GPU-aware workloads land on them. If the profiling
> job pod is stuck with a `0/N nodes are available: … node(s) had untolerated taint` event,
> add a toleration to your DGDR via `overrides.profilingJob`. The operator and profiler
> automatically forward it to every candidate and deployed pod:
>
> ```yaml
> spec:
>   ...
>   overrides:
>     profilingJob:
>       template:
>         spec:
>           containers: [] # required placeholder; leave empty to inherit defaults
>           tolerations:
>             - key: nvidia.com/gpu
>               operator: Exists
>               effect: NoSchedule
> ```
**Profiling job fails**
```bash
kubectl get pods -n ${NAMESPACE} -l nvidia.com/dgdr-name=qwen3-first-model
kubectl logs <profiling-pod-name> -n ${NAMESPACE}
# If the container has restarted, read the logs from its previous instance:
kubectl logs <profiling-pod-name> -n ${NAMESPACE} --previous
```
**Pods not starting after profiling**
```bash
kubectl describe pod <pod-name> -n ${NAMESPACE}
# Look for ImagePullBackOff, OOMKilled, or Insufficient resources
```
**Model not responding after port-forward**
```bash
# Check frontend is ready
kubectl get pods -n ${NAMESPACE} | grep frontend
# Check frontend logs
kubectl logs <frontend-pod-name> -n ${NAMESPACE}
```
## Next Steps
- **Tune for production SLAs**: Add `sla` (TTFT, ITL) and `workload` (ISL, OSL) targets to
your DGDR so the profiler optimizes for your specific traffic. See the
[Profiler Guide](../components/profiler/profiler-guide.md) for the full configuration
reference and picking modes. For ready-to-use YAML — including SLA targets, private models,
MoE, and overrides — see [DGDR Examples](../components/profiler/profiler-examples.md).
- **Scale the deployment**: [Autoscaling guide](autoscaling.md)
- **SLA-aware autoscaling**: Enable the Planner via `features.planner` in the DGDR —
see the [Planner Guide](../components/planner/planner-guide.md).
- **Inspect the generated config**: Set `autoApply: false` and extract the DGD spec with
`kubectl get dgdr <name> -o jsonpath='{.status.profilingResults.selectedConfig}'`
before deploying.
- **Direct control**: [Creating Deployments](deployment/create-deployment.md) — write your own
`DynamoGraphDeployment` spec for full customization.
- **Monitor performance**: [Observability](observability/metrics.md)
- **Try specific backends**: [vLLM](../backends/vllm/README.md),
[SGLang](../backends/sglang/README.md), [TensorRT-LLM](../backends/trtllm/README.md)
......@@ -578,6 +578,6 @@ kubectl create secret docker-registry nvcr-imagepullsecret \
## See Also
- [DGDR Examples](../../../components/src/dynamo/profiler/deploy/) - Complete DGDR YAML examples
- [DGDR Examples](../../../docs/components/profiler/profiler-examples.md) - Complete DGDR YAML examples
- [DGDR API Reference](/docs/kubernetes/api-reference.md) - DGDR specification
- [Profiler Arguments Reference](https://github.com/ai-dynamo/dynamo/blob/main/components/src/dynamo/profiler/utils/dgdr_v1beta1_types.py) - Full Configuration Reference