Unverified Commit 3023c625 authored by dagil-nvidia's avatar dagil-nvidia Committed by GitHub
Browse files

docs: migrate Profiler docs to three-tier structure (#6003)


Signed-off-by: default avatarDan Gil <dagil@nvidia.com>
Signed-off-by: default avatardagil-nvidia <dagil@nvidia.com>
Co-authored-by: default avatarCursor <cursoragent@cursor.com>
Co-authored-by: default avatarJonathan Tong <jt572@cornell.edu>
parent 4c3eba2a
......@@ -5,6 +5,9 @@ SPDX-License-Identifier: Apache-2.0
# Profiling SGLang Workers in Dynamo
> [!NOTE]
> **See also**: [Profiler Component Overview](/docs/components/profiler/README.md) for SLA-driven profiling and deployment optimization.
Dynamo exposes profiling endpoints for SGLang workers via the system server's `/engine/*` routes. This allows you to start and stop PyTorch profiling on running inference workers without restarting them.
These endpoints wrap SGLang's internal `TokenizerManager.start_profile()` and `stop_profile()` methods. See SGLang's documentation for the full list of supported parameters.
......
......@@ -3,6 +3,9 @@
> [!TIP]
> **New to DGDR and SLA-Driven Profiling?** Start with the [SLA-Driven Profiling and Planner Deployment Quick Start Guide](/docs/planner/sla_planner_quickstart.md) for step-by-step instructions. This document provides deeper technical details about the profiling process.
> [!NOTE]
> **See also**: [Profiler Component Overview](/docs/components/profiler/README.md) for a quick start guide and feature matrix.
## Overview
Dynamo provides automated SLA-driven profiling through **DynamoGraphDeploymentRequests (DGDR)**. Instead of manually running profiling scripts, you declare your performance requirements and let the Dynamo Operator handle profiling and deployment automatically.
......
<!--
SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
-->
# Profiler
The Dynamo Profiler is an automated performance analysis tool that measures model inference characteristics to optimize deployment configurations. It determines optimal tensor parallelism (TP) settings for prefill and decode phases, generates performance interpolation data, and enables SLA-driven autoscaling through the Planner.
## Feature Matrix
| Feature | vLLM | SGLang | TensorRT-LLM |
|---------|------|--------|--------------|
| Dense Model Profiling | ✅ | ✅ | ✅ |
| MoE Model Profiling | 🚧 | ✅ | 🚧 |
| AI Configurator (Offline) | ❌ | ❌ | ✅ |
| Online Profiling (AIPerf) | ✅ | ✅ | ✅ |
| Interactive WebUI | ✅ | ✅ | ✅ |
| Runtime Profiling Endpoints | ❌ | ✅ | ❌ |
## Quick Start
### Prerequisites
- Dynamo platform installed (see [Installation Guide](/docs/kubernetes/installation_guide.md))
- Kubernetes cluster with GPU nodes (for DGDR-based profiling)
- kube-prometheus-stack installed (required for SLA planner)
### Using DynamoGraphDeploymentRequest (Recommended)
The recommended way to profile models is through DGDRs, which automate the entire profiling and deployment workflow.
```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeploymentRequest
metadata:
name: my-model-profiling
spec:
model: "Qwen/Qwen3-0.6B"
backend: vllm
profilingConfig:
profilerImage: "nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.9.0"
config:
sla:
isl: 3000 # Average input sequence length
osl: 150 # Average output sequence length
ttft: 200.0 # Target Time To First Token (ms)
itl: 20.0 # Target Inter-Token Latency (ms)
deploymentOverrides:
workersImage: "nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.9.0"
autoApply: true
```
```bash
kubectl apply -f my-profiling-dgdr.yaml -n $NAMESPACE
```
### Using AI Configurator (Fast Offline Profiling)
For TensorRT-LLM, use AI Configurator for rapid profiling (~30 seconds):
```yaml
profilingConfig:
config:
sweep:
useAiConfigurator: true
aicSystem: h200_sxm
aicHfId: Qwen/Qwen3-32B
aicBackendVersion: "0.20.0"
```
### Direct Script Usage (Advanced)
For advanced scenarios, run the profiler directly:
```bash
python -m benchmarks.profiler.profile_sla \
--backend vllm \
--config path/to/disagg.yaml \
--model meta-llama/Llama-3-8B \
--ttft 200 --itl 15 \
--isl 3000 --osl 150
```
## Configuration
| Parameter | Default | Description |
|-----------|---------|-------------|
| `sla.isl` | - | Average input sequence length (tokens) |
| `sla.osl` | - | Average output sequence length (tokens) |
| `sla.ttft` | - | Target Time To First Token (milliseconds) |
| `sla.itl` | - | Target Inter-Token Latency (milliseconds) |
| `sweep.useAiConfigurator` | `false` | Use offline simulation instead of real profiling |
| `hardware.minNumGpusPerEngine` | auto | Minimum GPUs per engine (auto-detected from model size) |
| `hardware.maxNumGpusPerEngine` | 8 | Maximum GPUs per engine |
## Profiling Methods
| Method | Duration | Accuracy | GPU Required | Backends |
|--------|----------|----------|--------------|----------|
| Online (AIPerf) | 2-4 hours | Highest | Yes | All |
| Offline (AI Configurator) | 20-30 seconds | Estimated | No | TensorRT-LLM |
## Output
The profiler generates:
1. **Optimal Configuration**: Recommended TP sizes for prefill and decode engines
2. **Performance Data**: Interpolation models for the SLA Planner
3. **Generated DGD**: Complete deployment manifest with optimized settings
Example recommendations:
```text
Suggested prefill TP:4 (TTFT 48.37 ms, throughput 15505.23 tokens/s/GPU)
Suggested decode TP:4 (ITL 4.83 ms, throughput 51.22 tokens/s/GPU)
```
## Next Steps
| Document | Description |
|----------|-------------|
| [Profiler Guide](profiler_guide.md) | Configuration, methods, and troubleshooting |
| [Profiler Examples](profiler_examples.md) | Complete DGDR YAMLs, WebUI, script examples |
| [SLA Planner Quick Start](/docs/planner/sla_planner_quickstart.md) | End-to-end deployment workflow |
| [SLA Planner Architecture](/docs/planner/sla_planner.md) | How the Planner uses profiling data |
```{toctree}
:hidden:
profiler_guide
profiler_examples
```
<!--
SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
-->
# Profiler Examples
Complete examples for profiling with DGDRs, the interactive WebUI, and direct script usage.
## DGDR Examples
### Dense Model: AIPerf on Real Engines
Standard online profiling with real GPU measurements:
```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeploymentRequest
metadata:
name: vllm-dense-online
spec:
model: "Qwen/Qwen3-0.6B"
backend: vllm
profilingConfig:
profilerImage: "nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.9.0"
config:
sla:
isl: 3000
osl: 150
ttft: 200.0
itl: 20.0
hardware:
minNumGpusPerEngine: 1
maxNumGpusPerEngine: 8
sweep:
useAiConfigurator: false
deploymentOverrides:
workersImage: "nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.9.0"
autoApply: true
```
### Dense Model: AI Configurator Simulation
Fast offline profiling (~30 seconds, TensorRT-LLM only):
```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeploymentRequest
metadata:
name: trtllm-aic-offline
spec:
model: "Qwen/Qwen3-32B"
backend: trtllm
profilingConfig:
profilerImage: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.9.0"
config:
sla:
isl: 4000
osl: 500
ttft: 300.0
itl: 10.0
sweep:
useAiConfigurator: true
aicSystem: h200_sxm # Also supports h100_sxm, b200_sxm, gb200_sxm, a100_sxm
aicHfId: Qwen/Qwen3-32B
aicBackendVersion: "0.20.0"
deploymentOverrides:
workersImage: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.9.0"
autoApply: true
```
### MoE Model
Multi-node MoE profiling with SGLang:
```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeploymentRequest
metadata:
name: sglang-moe
spec:
model: "deepseek-ai/DeepSeek-R1"
backend: sglang
profilingConfig:
profilerImage: "nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.9.0"
config:
sla:
isl: 2048
osl: 512
ttft: 300.0
itl: 25.0
hardware:
numGpusPerNode: 8
maxNumGpusPerEngine: 32
engine:
isMoeModel: true
deploymentOverrides:
workersImage: "nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.9.0"
autoApply: true
```
### Using Existing DGD Config (ConfigMap)
Reference a custom DGD configuration via ConfigMap:
```bash
# Create ConfigMap from your DGD config file
kubectl create configmap deepseek-r1-config \
--from-file=/path/to/your/disagg.yaml \
--namespace $NAMESPACE \
--dry-run=client -o yaml | kubectl apply -f -
```
```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeploymentRequest
metadata:
name: deepseek-r1
spec:
model: deepseek-ai/DeepSeek-R1
backend: sglang
profilingConfig:
profilerImage: "nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.9.0"
configMapRef:
name: deepseek-r1-config
key: disagg.yaml
config:
sla:
isl: 4000
osl: 500
ttft: 300
itl: 10
sweep:
useAiConfigurator: true
aicSystem: h200_sxm
aicHfId: deepseek-ai/DeepSeek-V3
aicBackendVersion: "0.20.0"
deploymentOverrides:
workersImage: "nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.9.0"
autoApply: true
```
## Interactive WebUI
Launch an interactive configuration selection interface:
```bash
python -m benchmarks.profiler.profile_sla \
--backend trtllm \
--config path/to/disagg.yaml \
--pick-with-webui \
--use-ai-configurator \
--model Qwen/Qwen3-32B-FP8 \
--aic-system h200_sxm \
--ttft 200 --itl 15
```
The WebUI launches on port 8000 by default (configurable with `--webui-port`).
### Features
- **Interactive Charts**: Visualize prefill TTFT, decode ITL, and GPU hours analysis with hover-to-highlight synchronization between charts and tables
- **Pareto-Optimal Analysis**: The GPU Hours table shows pareto-optimal configurations balancing latency and throughput
- **DGD Config Preview**: Click "Show Config" on any row to view the corresponding DynamoGraphDeployment YAML
- **GPU Cost Estimation**: Toggle GPU cost display to convert GPU hours to cost ($/1000 requests)
- **SLA Visualization**: Red dashed lines indicate your TTFT and ITL targets
### Selection Methods
1. **GPU Hours Table** (recommended): Click any row to select both prefill and decode configurations at once based on the pareto-optimal combination
2. **Individual Selection**: Click one row in the Prefill table AND one row in the Decode table to manually choose each
### Example DGD Config Output
When you click "Show Config", you see a DynamoGraphDeployment configuration:
```yaml
# DynamoGraphDeployment Configuration
# Prefill: 1 GPU(s), TP=1
# Decode: 4 GPU(s), TP=4
# Model: Qwen/Qwen3-32B-FP8
# Backend: trtllm
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
spec:
services:
PrefillWorker:
subComponentType: prefill
replicas: 1
extraPodSpec:
mainContainer:
args:
- --tensor-parallel-size=1
DecodeWorker:
subComponentType: decode
replicas: 1
extraPodSpec:
mainContainer:
args:
- --tensor-parallel-size=4
```
Once you select a configuration, the full DGD CRD is saved as `config_with_planner.yaml`.
## Direct Script Examples
### Basic Profiling
```bash
python -m benchmarks.profiler.profile_sla \
--backend vllm \
--config path/to/disagg.yaml \
--model meta-llama/Llama-3-8B \
--ttft 200 --itl 15 \
--isl 3000 --osl 150
```
### With GPU Constraints
```bash
python -m benchmarks.profiler.profile_sla \
--backend sglang \
--config examples/backends/sglang/deploy/disagg.yaml \
--model deepseek-ai/DeepSeek-R1-Distill-Llama-8B \
--ttft 200 --itl 15 \
--isl 3000 --osl 150 \
--min-num-gpus 2 \
--max-num-gpus 8
```
### AI Configurator (Offline)
```bash
python -m benchmarks.profiler.profile_sla \
--backend trtllm \
--config path/to/disagg.yaml \
--use-ai-configurator \
--model Qwen/Qwen3-32B-FP8 \
--aic-system h200_sxm \
--ttft 200 --itl 15 \
--isl 4000 --osl 500
```
## SGLang Runtime Profiling
Profile SGLang workers at runtime via HTTP endpoints:
```bash
# Start profiling
curl -X POST http://localhost:9090/engine/start_profile \
-H "Content-Type: application/json" \
-d '{"output_dir": "/tmp/profiler_output"}'
# Run inference requests to generate profiling data...
# Stop profiling
curl -X POST http://localhost:9090/engine/stop_profile
```
A test script is provided at `examples/backends/sglang/test_sglang_profile.py`:
```bash
python examples/backends/sglang/test_sglang_profile.py
```
View traces using Chrome's `chrome://tracing`, [Perfetto UI](https://ui.perfetto.dev/), or TensorBoard.
This diff is collapsed.
......@@ -78,6 +78,7 @@ Quickstart
Frontends <_sections/frontends>
Router <router/README>
Planner <planner/planner_intro>
Profiler <components/profiler/README>
KVBM <kvbm/kvbm_intro>
.. toctree::
......
Markdown is supported
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment