"lib/parsers/vscode:/vscode.git/clone" did not exist on "656b4c443dfe1d2b5e80b8be5acf18edd712e089"
Unverified Commit 6a84ffd3 authored by hhzhang16's avatar hhzhang16 Committed by GitHub
Browse files

feat: turn profiling k8s jobs into sample DGDR requests (#3864)


Signed-off-by: default avatarHannah Zhang <hannahz@nvidia.com>
Signed-off-by: default avatarhongkuanz <hongkuanz@nvidia.com>
Signed-off-by: default avatarHongkuan Zhou <tedzhouhk@gmail.com>
Co-authored-by: default avatarhongkuanz <hongkuanz@nvidia.com>
Co-authored-by: default avatarHongkuan Zhou <tedzhouhk@gmail.com>
parent 0d07e2c3
# SLA Planner Quick Start Guide # SLA-Driven Profiling and Planner Deployment Quick Start Guide
Complete workflow to deploy SLA-based autoscaling for Dynamo deployments. This guide consolidates all necessary steps into a clear, sequential process. Complete workflow to deploy SLA-optimized Dynamo models using DynamoGraphDeploymentRequests (DGDR). This guide shows how to automatically profile models and deploy them with optimal configurations that meet your Service Level Agreements (SLAs).
> [!IMPORTANT] > [!IMPORTANT]
> **Prerequisites**: This guide assumes you have a Kubernetes cluster with GPU nodes and have completed the [Dynamo Platform installation](/docs/kubernetes/installation_guide.md). > **Prerequisites**: This guide assumes you have a Kubernetes cluster with GPU nodes and have completed the [Dynamo Platform installation](/docs/kubernetes/installation_guide.md).
## Overview ## Overview
The SLA Planner automatically scales prefill and decode workers to meet your TTFT (Time To First Token) and ITL (Inter-Token Latency) targets. The DGDR workflow automates the entire process from SLA specification to deployment:
The deployment process consists of two mandatory phases: 1. **Define SLAs**: Specify performance requirements (TTFT, ITL) and model information in a DGDR Custom Resource
2. **Automatic Profiling**: The Dynamo Operator automatically profiles your model to find optimal configurations
1. **Pre-Deployment Profiling** (2-4 hours) - Generates performance data 3. **Auto-Deploy**: The system automatically deploys the optimal configuration that meets your SLAs
2. **SLA Planner Deployment** (5-10 minutes) - Enables autoscaling
> [!TIP]
> **Fast Profiling with AI Configurator**: For TensorRT-LLM users, we provide AI Configurator (AIC) that can complete profiling in 20-30 seconds using performance simulation instead of real deployments. Support for vLLM and SGLang coming soon. See [AI Configurator section](/docs/benchmarks/pre_deployment_profiling.md#running-the-profiling-script-with-ai-configurator) in the Profiling Guide.
```mermaid ```mermaid
flowchart TD flowchart TD
A[Start Setup] --> B{Profiling Done?} A[Create DGDR] --> B[DGDR Controller]
B -->|No| C[Run Profiling<br/>2-4 hours] B --> C{Profiling Method}
C --> D[Verify Results] C -->|Online| D[Run Profiling Job<br/>2-4 hours]
D --> E[Deploy Planner<br/>5-10 minutes] C -->|Offline/AIC| E[AI Configurator<br/>20-30 seconds]
B -->|Yes| E D --> F[Generate DGD Config]
E --> F[Test System] E --> F
F --> G[Ready!] F --> G[Auto-Deploy DGD]
G --> H[Monitor & Scale]
style A fill:#e1f5fe style A fill:#e1f5fe
style C fill:#fff3e0 style D fill:#fff3e0
style E fill:#e8f5e8 style E fill:#e8f5e8
style G fill:#f3e5f5 style G fill:#f3e5f5
style B fill:#fff8e1 style H fill:#fff8e1
``` ```
## Prerequisites ## What is a DynamoGraphDeploymentRequest (DGDR)?
Before deploying the SLA planner, ensure: A **DynamoGraphDeploymentRequest (DGDR)** is a Kubernetes Custom Resource that serves as the primary interface for users to request model deployments with specific performance and resource constraints. Think of it as a "deployment order" where you specify:
- **Dynamo platform installed** (see [Installation Guide](/docs/kubernetes/installation_guide.md))
- **[kube-prometheus-stack](/docs/kubernetes/observability/metrics.md) installed and running.** By default, the prometheus server is not deployed in the `monitoring` namespace. If it is deployed to a different namespace, set `dynamo-operator.dynamo.metrics.prometheusEndpoint="http://prometheus-kube-prometheus-prometheus.<namespace>.svc.cluster.local:9090"`.
- **Benchmarking resources setup** (see [Kubernetes utilities for Dynamo Benchmarking and Profiling](../../deploy/utils/README.md)) The script will create a `dynamo-pvc` with `ReadWriteMany` access, if your cluster's default storageClassName does not allow `ReadWriteMany`, you need to specify a different storageClassName in `deploy/utils/manifests/pvc.yaml` which does support `ReadWriteMany`.
- **What** model you want to deploy (`model`)
- **How** it should perform (SLA targets: `ttft`, `itl`)
- **Where** it should run (optional GPU preferences)
- **Which** backend to use (`backend`: vllm, sglang, or trtllm)
- **Which** images to use (`profilingConfig.profilerImage`, `deploymentOverrides.workersImage`)
## Pre-Deployment Profiling The Dynamo Operator watches for DGDRs and automatically:
1. Discovers available GPU resources in your cluster
2. Runs profiling (online or offline) to find optimal configurations
3. Generates an optimized DynamoGraphDeployment (DGD) configuration
4. Deploys the DGD to your cluster
Deploying planner starts with running pre-deployment profiling. **Key Benefits:**
- **Declarative**: Specify what you want, not how to achieve it
- **Automated**: No manual profiling job setup or result processing
- **SLA-Driven**: Ensures deployments meet your performance requirements
- **Integrated**: Works seamlessly with the Dynamo Operator
> [!WARNING] ## Prerequisites
> **MANDATORY**: Pre-deployment profiling must be completed before deploying SLA planner. This process analyzes your model's performance characteristics to determine optimal tensor parallelism configurations and scaling parameters.
### Step 1.1: Set Up Profiling Environment Before creating a DGDR, ensure:
- **Dynamo platform installed** with the operator running (see [Installation Guide](/docs/kubernetes/installation_guide.md))
- **[kube-prometheus-stack](/docs/kubernetes/observability/metrics.md) installed and running** (required for SLA planner)
- **Profiling PVC created** (see [Benchmarking Resource Setup](/deploy/utils/README.md#benchmarking-resource-setup#BenchmarkingResourceSetup))
- **Image pull secrets configured** if using private registries (typically `nvcr-imagepullsecret` for NVIDIA images)
- **Sufficient GPU resources** available in your cluster for profiling
- **Runtime images available** that contain both profiler and runtime components
Set up your Kubernetes namespace for profiling (one-time per namespace). If your namespace is already set up, skip this step. ### Container Images
```bash Each DGDR requires you to specify container images for the profiling and deployment process:
export NAMESPACE=your-namespace
```
**Prerequisites**: Ensure all dependencies are installed: **profilingConfig.profilerImage** (Required):
```bash Specifies the container image used for the profiling job itself. This image must contain the profiler code and dependencies needed for SLA-based profiling.
pip install -r deploy/utils/requirements.txt
```
### Step 1.2: Inject Your Configuration **deploymentOverrides.workersImage** (Optional):
Specifies the container image used for DynamoGraphDeployment worker components (frontend, workers, planner). This image is used for:
- Temporary DGDs created during online profiling (for performance measurements)
- The final DGD deployed after profiling completes
Use the injector utility to place your DGD manifest into the PVC: If `workersImage` is omitted, the image from the base config file (e.g., `disagg.yaml`) is used. You may use our public images (0.6.1 and later) or build and push your own.
```bash ```yaml
# Use default disagg.yaml config spec:
python3 -m deploy.utils.inject_manifest --namespace $NAMESPACE --src components/backends/vllm/deploy/disagg.yaml --dest /data/configs/disagg.yaml profilingConfig:
profilerImage: "nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.6.1"
# Or use a custom disagg config file deploymentOverrides:
python3 -m deploy.utils.inject_manifest --namespace $NAMESPACE --src my-custom-disagg.yaml --dest /data/configs/disagg.yaml workersImage: "nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.6.1" # Optional
``` ```
> **Note**: All paths must start with `/data/` for security reasons. ## Quick Start: Deploy with DGDR
### Step 1: Create Your DGDR
Dynamo provides sample DGDR configurations in `benchmarks/profiler/deploy/`. You can use these as starting points:
### Step 1.3: Configure SLA Targets **Available Sample DGDRs:**
- **`profile_sla_dgdr.yaml`**: Standard online profiling for dense models
- **`profile_sla_aic_dgdr.yaml`**: Fast offline profiling using AI Configurator (TensorRT-LLM)
- **`profile_sla_moe_dgdr.yaml`**: Online profiling for MoE models (SGLang)
For dense models, edit `$DYNAMO_HOME/benchmarks/profiler/deploy/profile_sla_job.yaml`: Or, you can create your own DGDR for your own needs:
```yaml ```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeploymentRequest
metadata:
name: my-model-deployment # Change the name
namespace: default # Change the namespace
spec: spec:
template: model: "Qwen/Qwen3-0.6B" # Update to your model
spec: backend: vllm # Backend: vllm, sglang, or trtllm
containers:
- name: profile-sla
args:
- --isl
- "3000" # average ISL is 3000 tokens
- --osl
- "150" # average OSL is 150 tokens
- --ttft
- "200" # target TTFT is 200ms
- --itl
- "20" # target ITL is 20ms
- --backend
- <vllm/sglang>
```
For MoE models, edit `$DYNAMO_HOME/benchmarks/profiler/deploy/profile_sla_moe_job.yaml` instead. profilingConfig:
profilerImage: "nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.6.1" # Required
config:
sla:
isl: 3000 # Adjust to your workload
osl: 150 # Adjust to your workload
ttft: 200 # Your target (ms)
itl: 20 # Your target (ms)
### Step 1.4: Run Profiling sweep:
use_ai_configurator: false # Set to true for fast profiling (TensorRT-LLM only)
Set the container image and config path: deploymentOverrides:
workersImage: "nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.6.1" # Optional
```bash autoApply: true # Auto-deploy after profiling
export DOCKER_IMAGE=nvcr.io/nvidia/ai-dynamo/vllm-runtime:my-tag
export DGD_CONFIG_FILE=/data/configs/disagg.yaml
``` ```
Run profiling: > [!TIP]
> For detailed explanations of all configuration options (SLA, hardware, sweep, AIC, planner), see the [DGDR Configuration Reference](/docs/benchmarks/sla_driven_profiling.md#dgdr-configuration-reference).
```bash ### Step 2: Apply the DGDR
# for dense models
envsubst < benchmarks/profiler/deploy/profile_sla_job.yaml | kubectl apply -f -
# for MoE models The rest of this quickstart will use the DGDR sample that uses AIC profiling. If you use a different DGDR file and/or name, be sure to adjust the commands accordingly.
envsubst < benchmarks/profiler/deploy/profile_sla_moe_job.yaml | kubectl apply -f -
# using aiconfigurator instead of real sweeping (see below for more details) ```bash
envsubst < benchmarks/profiler/deploy/profile_sla_aic_job.yaml | kubectl apply -f - export NAMESPACE=your-namespace
kubectl apply -f benchmarks/profiler/deploy/profile_sla_aic_dgdr.yaml -n $NAMESPACE
``` ```
### Step 1.5: Monitor Profiling Progress The Dynamo Operator will immediately begin processing your request.
### Step 3: Monitor Progress
Watch the DGDR status:
```bash ```bash
kubectl get jobs -n $NAMESPACE # View status
kubectl logs job/profile-sla -n $NAMESPACE kubectl get dgdr -n $NAMESPACE
# Detailed status
kubectl describe dgdr sla-aic -n $NAMESPACE
# Watch profiling job logs
kubectl logs -f job/profile-sla-aic -n $NAMESPACE
``` ```
**DGDR Status States:**
- `Pending`: Initial state, preparing to profile
- `Profiling`: Running profiling job (20-30 seconds for AIC, 2-4 hours for online)
- `Deploying`: Generating and applying DGD configuration
- `Ready`: DGD successfully deployed and running
- `Failed`: Error occurred (check events for details)
> [!NOTE] > [!NOTE]
> **Time Investment**: This profiling process is comprehensive and typically takes **2-4 hours** to complete. The script systematically tests multiple tensor parallelism configurations and load conditions to find optimal performance settings. > With AI Configurator, profiling completes in **20-30 seconds**! This is much faster than online profiling which takes 2-4 hours.
### Step 1.6: Download Profiling Results ### Step 4: Access Your Deployment
If you want to view the profiling results and performance plots: Once the DGDR reaches `Ready` state, your model is deployed and ready to serve:
```bash ```bash
# Download to directory # Find the frontend service
python3 -m deploy.utils.download_pvc_results --namespace $NAMESPACE --output-dir ./results --folder /data/profiling_results kubectl get svc -n $NAMESPACE | grep trtllm-disagg
```
For detailed information about the output structure, performance plots, and how to analyze the results, see the [Viewing Profiling Results](/docs/benchmarks/pre_deployment_profiling.md#viewing-profiling-results) section in the Profiling Guide. # Port-forward to access locally
kubectl port-forward svc/trtllm-disagg-frontend 8000:8000 -n $NAMESPACE
**Verify Success**: Look for terminal output like: # Test the endpoint
``` curl http://localhost:8000/v1/models
Suggested prefill TP:4 (TTFT 48.37 ms, throughput 15505.23 tokens/s/GPU)
Suggested decode TP:4 (ITL 4.83 ms, throughput 51.22 tokens/s/GPU)
...
Final DGD config with planner: {...}
Deploying the optimized DGD with planner...
``` ```
### Step 1.7: Deploy the DGD with Planner ## DGDR Configuration Details
```bash ### Required Fields
kubectl apply -f ./results/config_with_planner.yaml
```
### Step 1.8: Wait for Deployment to be Ready | Field | Type | Description |
|-------|------|-------------|
| `spec.model` | string | Model identifier (e.g., "meta-llama/Llama-3-70b") |
| `spec.backend` | enum | Inference backend: `vllm`, `sglang`, or `trtllm` |
| `spec.profilingConfig.profilerImage` | string | Container image for profiling job |
| `spec.profilingConfig.config.sla` | object | SLA targets (isl, osl, ttft, itl) |
```bash ### Optional Fields
kubectl get pods -n $NAMESPACE
| Field | Type | Description |
|-------|------|-------------|
| `spec.deploymentOverrides.workersImage` | string | Container image for DGD worker components. If omitted, uses image from base config file. |
| `spec.autoApply` | boolean | Automatically deploy DGD after profiling (default: false) |
| `spec.deploymentOverrides` | object | Customize metadata (name, namespace, labels, annotations) and image for auto-created DGD |
### SLA Configuration
The `sla` section defines performance requirements and workload characteristics:
```yaml
sla:
isl: 3000 # Average input sequence length (tokens)
osl: 150 # Average output sequence length (tokens)
ttft: 200 # Target Time To First Token (milliseconds, float)
itl: 20 # Target Inter-Token Latency (milliseconds, float)
``` ```
**Expected pods** (all should be `1/1 Running`): **Choosing SLA Values:**
- **ISL/OSL**: Based on your expected traffic patterns
- **TTFT**: First token latency target (lower = more GPUs needed)
- **ITL**: Token generation latency target (lower = more GPUs needed)
- **Trade-offs**: Tighter SLAs require more GPU resources
### Profiling Methods
Choose between **online profiling** (real measurements, 2-4 hours) or **offline profiling** with AI Configurator (estimated, 20-30 seconds):
```yaml
# Online Profiling (Default)
sweep:
use_ai_configurator: false
# Offline Profiling (AI Configurator - TensorRT-LLM only)
sweep:
use_ai_configurator: true
aic:
system: h200_sxm
model_name: QWEN3_32B
backend_version: "0.20.0"
``` ```
vllm-disagg-planner-frontend-* 1/1 Running
vllm-disagg-planner-planner-* 1/1 Running > [!NOTE]
vllm-disagg-planner-backend-* 1/1 Running > For detailed comparison, supported configurations, and limitations, see [SLA-Driven Profiling Documentation](/docs/benchmarks/sla_driven_profiling.md#profiling-methods).
vllm-disagg-planner-prefill-* 1/1 Running
### GPU Discovery
By default, the DGDR controller automatically discovers available GPU resources. Optionally specify preferences:
```yaml
spec:
gpu:
type: h200 # GPU type (e.g., h100, h200)
count: 8 # Number of GPUs to use
memoryGB: 141 # GPU memory in GB
``` ```
### Step 1.9: Test the System ### Advanced Configuration
#### Using Existing DGD Configs (Recommended for Custom Setups)
If you have an existing DynamoGraphDeployment config (e.g., from `components/backends/*/deploy/disagg.yaml` or custom recipes), you can reference it via ConfigMap:
**Step 1: Create ConfigMap from your DGD config file:**
```bash ```bash
# Port forward to frontend kubectl create configmap deepseek-r1-config \
kubectl port-forward -n $NAMESPACE deployment/vllm-disagg-planner-frontend 8000:8000 --from-file=disagg.yaml=/path/to/your/disagg.yaml \
--namespace $NAMESPACE \
# Send a request --dry-run=client -o yaml | kubectl apply -f -
curl -N http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen3-0.6B",
"messages": [
{
"role": "user",
"content": "Hello, how are you?"
}
],
"stream":true,
"max_tokens": 30
}'
``` ```
### Step 1.10: Monitor Scaling **Step 2: Reference the ConfigMap in your DGDR:**
```bash ```yaml
# Check planner logs for scaling decisions apiVersion: nvidia.com/v1alpha1
kubectl logs -n $NAMESPACE deployment/vllm-disagg-planner-planner --tail=10 kind: DynamoGraphDeploymentRequest
metadata:
name: deepseek-r1
spec:
model: deepseek-ai/DeepSeek-R1
backend: sglang
profilingConfig:
profilerImage: "nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.6.1"
configMapRef:
name: deepseek-r1-config
key: disagg.yaml # Must match the key used in --from-file
config:
sla:
isl: 4000
osl: 500
ttft: 300
itl: 10
sweep:
use_ai_configurator: true
aic:
system: h200_sxm
model_name: DEEPSEEK_V3
backend_version: "0.20.0"
deploymentOverrides:
workersImage: "nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.6.1"
autoApply: true
``` ```
**Expected successful output** (after streaming requests): > **What's happening**: The profiler uses the DGD config from the ConfigMap as a **base template**, then optimizes it based on your SLA targets. The controller automatically injects `spec.model` into `deployment.model` and `spec.backend` into `engine.backend` in the final configuration.
#### Inline Configuration (Simple Use Cases)
For simple use cases without a custom DGD config, provide profiler configuration directly. The profiler will auto-generate a basic DGD configuration from your `model` and `backend`:
```yaml
profilingConfig:
config:
# SLA targets (required for profiling)
sla:
isl: 8000 # Input sequence length
osl: 200 # Output sequence length
ttft: 200.0 # Time To First Token (ms)
itl: 10.0 # Inter-Token Latency (ms)
# Hardware constraints (optional)
hardware:
min_num_gpus_per_engine: 2
max_num_gpus_per_engine: 8
gpu_type: h200_sxm
# Profiling sweep settings (optional)
sweep:
skip_existing_results: false
force_rerun: false
``` ```
New adjustment interval started!
Observed num_req: X.XXX isl: X.XXX osl: X.XXX > **Note**: `engine.config` is a **file path** to a DGD YAML file, not inline configuration. Use ConfigMapRef (recommended) or leave it unset to auto-generate.
Observed ttft: X.XXms itl: X.XXms
Number of prefill workers: 1, number of decode workers: 1 #### Planner Configuration Passthrough
Add planner-specific settings. Planner arguments use a `planner_` prefix:
```yaml
profilingConfig:
config:
planner:
planner_min_endpoint: 2
``` ```
## Production Readiness ## Understanding Profiling Results
### Monitoring Metrics For details about the profiling process, performance plots, and interpolation data, see [SLA-Driven Profiling Documentation](/docs/benchmarks/sla_driven_profiling.md#profiling-process-details).
- **Basic metrics** (request count): Available with any request type ## Advanced Topics
- **Latency metrics** (TTFT/ITL): Available for both streaming and non-streaming requests
- **Scaling decisions**: Require sufficient request volume
### Troubleshooting ### DGDR Immutability
**Connection Issues:** DGDRs are **immutable** - if you need to update SLAs or configuration:
```bash
# Verify Prometheus is accessible 1. Delete the existing DGDR: `kubectl delete dgdr sla-aic`
kubectl port-forward svc/prometheus-kube-prometheus-prometheus -n monitoring 9090:9090 2. Create a new DGDR with updated specifications
curl "http://localhost:9090/api/v1/query?query=up"
### Manual Deployment Control
Disable auto-deployment to review configurations before deploying:
```yaml
spec:
autoApply: false
``` ```
**Missing Metrics:** Then manually apply the generated DGD:
```bash ```bash
# Check frontend metrics # Extract generated config
kubectl port-forward -n $NAMESPACE deployment/vllm-disagg-planner-frontend 8000:8000 kubectl get dgdr sla-aic -n $NAMESPACE -o jsonpath='{.status.generatedConfig}' > my-dgd.yaml
curl http://localhost:8000/metrics | grep nv_llm_http_service
# Review and modify if needed
vi my-dgd.yaml
# Deploy manually
kubectl apply -f my-dgd.yaml -n $NAMESPACE
``` ```
**Worker Issues:** ### Relationship to DynamoGraphDeployment (DGD)
- Large models can take 10+ minutes to initialize
- Check worker logs: `kubectl logs -n $NAMESPACE deployment/vllm-disagg-planner-backend`
- Ensure GPU resources are available for workers
**Unknown Field subComponentType:** - **DGDR**: High-level "intent" - what you want deployed
- **DGD**: Low-level "implementation" - how it's deployed
The DGDR controller generates a DGD that:
- Uses optimal TP configurations from profiling
- Includes SLA planner for autoscaling
- Has deployment and engine settings tuned for your SLAs
The generated DGD is tracked via labels:
```yaml
metadata:
labels:
dgdr.nvidia.com/name: sla-aic
dgdr.nvidia.com/namespace: your-namespace
```
## Troubleshooting
### Quick Diagnostics
If you encounter the following error when applying the deployment:
```bash ```bash
Error from server (BadRequest): error when creating "components/backends/vllm/deploy/disagg.yaml": DynamoGraphDeployment in version "v1alpha1" cannot be handled as a DynamoGraphDeployment: strict decoding error: unknown field "spec.services.DecodeWorker.subComponentType", unknown field "spec.services.PrefillWorker.subComponentType" # Check DGDR status and events
kubectl describe dgdr sla-aic -n $NAMESPACE
# Check operator logs
kubectl logs -n $NAMESPACE -l app.kubernetes.io/name=dynamo-operator --tail=100
# Check profiling job logs
kubectl logs -l job-name=profile-sla-aic -n $NAMESPACE
``` ```
This is because the `subComponentType` field has only been added in newer versions of the DynamoGraphDeployment CRD (> 0.5.0). You can upgrade the CRD version by following the instructions [here](/docs/kubernetes/installation_guide.md).
## Next Steps ### Common Issues
- **Architecture Details**: See [SLA-based Planner Architecture](/docs/planner/sla_planner.md) for technical details | Issue | Quick Fix |
- **Performance Tuning**: See [Pre-Deployment Profiling Guide](/docs/benchmarks/pre_deployment_profiling.md) for advanced profiling options |-------|-----------|
- **Load Testing**: See [SLA Planner Load Test](/tests/planner/README.md) for comprehensive testing tools | **DGDR stuck in Pending** | Check GPU availability: `kubectl get nodes -o jsonpath='{.items[*].status.allocatable.nvidia\.com/gpu}'` |
| **Image pull errors** | Verify secret exists: `kubectl get secret nvcr-imagepullsecret -n $NAMESPACE` |
| **Profiling fails** | Check job logs: `kubectl logs -l job-name=profile-sla-aic -n $NAMESPACE` |
| **SLA cannot be met** | Relax TTFT/ITL targets or add more GPUs |
| **DGD not deployed** | Verify `autoApply: true` in DGDR spec |
## Quick Reference > [!NOTE]
> For comprehensive troubleshooting including AI Configurator constraints, performance debugging, and backend-specific issues, see [SLA-Driven Profiling Troubleshooting](/docs/benchmarks/sla_driven_profiling.md#troubleshooting).
| Phase | Duration | Purpose | Status Check | ## Configuration Reference
|-------|----------|---------|--------------|
| Profiling | 2-4 hours | Generate performance data | `kubectl logs job/profile-sla` |
| Deployment | 5-10 minutes | Enable autoscaling | `kubectl get pods` |
| Testing | 5 minutes | Verify functionality | `kubectl logs deployment/planner` |
--- For comprehensive documentation of all DGDR configuration options, see the [DGDR Configuration Reference](/docs/benchmarks/sla_driven_profiling.md#dgdr-configuration-reference).
> [!TIP] This includes detailed explanations of:
> **Need Help?** If you encounter issues, check the [troubleshooting section](#troubleshooting) or refer to the detailed guides linked in [Next Steps](#next-steps). - **SLA Configuration**: ISL, OSL, TTFT, ITL with use cases and trade-offs
- **Hardware Configuration**: GPU constraints and search space control
- **Sweep Configuration**: Profiling behavior and interpolation settings
- **AI Configurator Configuration**: System types, model mappings, backend versions
- **Planner Configuration**: Autoscaling and adjustment parameters
- **Complete Examples**: Full DGDRs for online, offline (AIC), and MoE profiling
## Related Documentation
- [DGDR API Reference](/docs/kubernetes/api_reference.md)
- [Pre-Deployment Profiling Details](/docs/benchmarks/sla_driven_profiling.md)
- [SLA Planner Architecture](/docs/planner/sla_planner.md)
- [Dynamo Operator Guide](/docs/kubernetes/dynamo_operator.md)
...@@ -23,7 +23,7 @@ Use the pre-configured test deployment with sample profiling data, we provide th ...@@ -23,7 +23,7 @@ Use the pre-configured test deployment with sample profiling data, we provide th
### Option B: Use Your Own Profiling Results ### Option B: Use Your Own Profiling Results
1. Run pre-deployment profiling for your specific setup. See the [pre-deployment profiling documentation](../../docs/benchmarks/pre_deployment_profiling.md) for detailed instructions. 1. Run pre-deployment profiling for your specific setup. See the [pre-deployment profiling documentation](../../docs/benchmarks/sla_driven_profiling.md) for detailed instructions.
## Interpolator Testing ## Interpolator Testing
......
...@@ -27,6 +27,8 @@ class TestProfileSlaAiconfigurator: ...@@ -27,6 +27,8 @@ class TestProfileSlaAiconfigurator:
def trtllm_args(self): def trtllm_args(self):
class Args: class Args:
def __init__(self): def __init__(self):
self.model = ""
self.dgd_image = ""
self.backend = "trtllm" self.backend = "trtllm"
self.config = "components/backends/trtllm/deploy/disagg.yaml" self.config = "components/backends/trtllm/deploy/disagg.yaml"
self.output_dir = "/tmp/test_profiling_results" self.output_dir = "/tmp/test_profiling_results"
......
...@@ -49,6 +49,8 @@ class TestProfileSLADryRun: ...@@ -49,6 +49,8 @@ class TestProfileSLADryRun:
self.config = "components/backends/vllm/deploy/disagg.yaml" self.config = "components/backends/vllm/deploy/disagg.yaml"
self.output_dir = "/tmp/test_profiling_results" self.output_dir = "/tmp/test_profiling_results"
self.namespace = "test-namespace" self.namespace = "test-namespace"
self.model = ""
self.dgd_image = ""
self.min_num_gpus_per_engine = 1 self.min_num_gpus_per_engine = 1
self.max_num_gpus_per_engine = 8 self.max_num_gpus_per_engine = 8
self.skip_existing_results = False self.skip_existing_results = False
...@@ -83,6 +85,8 @@ class TestProfileSLADryRun: ...@@ -83,6 +85,8 @@ class TestProfileSLADryRun:
self.config = "components/backends/sglang/deploy/disagg.yaml" self.config = "components/backends/sglang/deploy/disagg.yaml"
self.output_dir = "/tmp/test_profiling_results" self.output_dir = "/tmp/test_profiling_results"
self.namespace = "test-namespace" self.namespace = "test-namespace"
self.model = ""
self.dgd_image = ""
self.min_num_gpus_per_engine = 1 self.min_num_gpus_per_engine = 1
self.max_num_gpus_per_engine = 8 self.max_num_gpus_per_engine = 8
self.skip_existing_results = False self.skip_existing_results = False
...@@ -131,6 +135,8 @@ class TestProfileSLADryRun: ...@@ -131,6 +135,8 @@ class TestProfileSLADryRun:
self.config = "components/backends/trtllm/deploy/disagg.yaml" self.config = "components/backends/trtllm/deploy/disagg.yaml"
self.output_dir = "/tmp/test_profiling_results" self.output_dir = "/tmp/test_profiling_results"
self.namespace = "test-namespace" self.namespace = "test-namespace"
self.model = ""
self.dgd_image = ""
self.min_num_gpus_per_engine = 1 self.min_num_gpus_per_engine = 1
self.max_num_gpus_per_engine = 8 self.max_num_gpus_per_engine = 8
self.skip_existing_results = False self.skip_existing_results = False
...@@ -172,6 +178,8 @@ class TestProfileSLADryRun: ...@@ -172,6 +178,8 @@ class TestProfileSLADryRun:
self.config = "recipes/deepseek-r1/sglang/disagg-16gpu/deploy.yaml" self.config = "recipes/deepseek-r1/sglang/disagg-16gpu/deploy.yaml"
self.output_dir = "/tmp/test_profiling_results" self.output_dir = "/tmp/test_profiling_results"
self.namespace = "test-namespace" self.namespace = "test-namespace"
self.model = ""
self.dgd_image = ""
self.min_num_gpus_per_engine = 8 self.min_num_gpus_per_engine = 8
self.max_num_gpus_per_engine = 32 self.max_num_gpus_per_engine = 32
self.skip_existing_results = False self.skip_existing_results = False
...@@ -233,6 +241,7 @@ class TestProfileSLADryRun: ...@@ -233,6 +241,7 @@ class TestProfileSLADryRun:
self.output_dir = "/tmp/test_profiling_results" self.output_dir = "/tmp/test_profiling_results"
self.namespace = "test-namespace" self.namespace = "test-namespace"
self.model = "deepseek-ai/DeepSeek-R1-Distill-Llama-8B" # Specify model for autogen self.model = "deepseek-ai/DeepSeek-R1-Distill-Llama-8B" # Specify model for autogen
self.dgd_image = ""
self.min_num_gpus_per_engine = 0 # Will be auto-generated self.min_num_gpus_per_engine = 0 # Will be auto-generated
self.max_num_gpus_per_engine = 0 # Will be auto-generated self.max_num_gpus_per_engine = 0 # Will be auto-generated
self.skip_existing_results = False self.skip_existing_results = False
...@@ -294,6 +303,7 @@ class TestProfileSLADryRun: ...@@ -294,6 +303,7 @@ class TestProfileSLADryRun:
self.output_dir = "/tmp/test_profiling_results" self.output_dir = "/tmp/test_profiling_results"
self.namespace = "test-namespace" self.namespace = "test-namespace"
self.model = "deepseek-ai/DeepSeek-R1-Distill-Llama-8B" # Specify model for autogen self.model = "deepseek-ai/DeepSeek-R1-Distill-Llama-8B" # Specify model for autogen
self.dgd_image = ""
self.min_num_gpus_per_engine = 0 # Will be auto-generated self.min_num_gpus_per_engine = 0 # Will be auto-generated
self.max_num_gpus_per_engine = 0 # Will be auto-generated self.max_num_gpus_per_engine = 0 # Will be auto-generated
self.skip_existing_results = False self.skip_existing_results = False
...@@ -355,6 +365,7 @@ class TestProfileSLADryRun: ...@@ -355,6 +365,7 @@ class TestProfileSLADryRun:
self.output_dir = "/tmp/test_profiling_results" self.output_dir = "/tmp/test_profiling_results"
self.namespace = "test-namespace" self.namespace = "test-namespace"
self.model = "deepseek-ai/DeepSeek-R1-Distill-Llama-8B" # Specify model for autogen self.model = "deepseek-ai/DeepSeek-R1-Distill-Llama-8B" # Specify model for autogen
self.dgd_image = ""
self.min_num_gpus_per_engine = 0 # Will be auto-generated self.min_num_gpus_per_engine = 0 # Will be auto-generated
self.max_num_gpus_per_engine = 0 # Will be auto-generated self.max_num_gpus_per_engine = 0 # Will be auto-generated
self.skip_existing_results = False self.skip_existing_results = False
......
Markdown is supported
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment