Unverified commit cf92be04, authored by hhzhang16, committed by GitHub
parent 73a9a53f
...@@ -12,7 +12,7 @@ The Dynamo Profiler is an automated performance analysis tool that measures mode ...@@ -12,7 +12,7 @@ The Dynamo Profiler is an automated performance analysis tool that measures mode
|---------|--------|--------------|------| |---------|--------|--------------|------|
| Dense Model Profiling | ✅ | ✅ | ✅ | | Dense Model Profiling | ✅ | ✅ | ✅ |
| MoE Model Profiling | ✅ | 🚧 | 🚧 | | MoE Model Profiling | ✅ | 🚧 | 🚧 |
| AI Configurator (Offline) | | ✅ | | | AI Configurator (Offline) | | ✅ | |
| Online Profiling (AIPerf) | ✅ | ✅ | ✅ | | Online Profiling (AIPerf) | ✅ | ✅ | ✅ |
| Interactive WebUI | ✅ | ✅ | ✅ | | Interactive WebUI | ✅ | ✅ | ✅ |
| Runtime Profiling Endpoints | ✅ | ❌ | ❌ | | Runtime Profiling Endpoints | ✅ | ❌ | ❌ |
...@@ -37,7 +37,7 @@ metadata: ...@@ -37,7 +37,7 @@ metadata:
spec: spec:
model: "Qwen/Qwen3-0.6B" model: "Qwen/Qwen3-0.6B"
backend: vllm backend: vllm
image: "nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.0.0" image: "nvcr.io/nvidia/ai-dynamo/dynamo-frontend:1.1.0"
workload: workload:
isl: 3000 # Average input sequence length isl: 3000 # Average input sequence length
...@@ -74,7 +74,7 @@ AI Configurator enables rapid offline profiling (~30 seconds) and supports all b ...@@ -74,7 +74,7 @@ AI Configurator enables rapid offline profiling (~30 seconds) and supports all b
| Method | Duration | Accuracy | GPU Required | Backends | | Method | Duration | Accuracy | GPU Required | Backends |
|--------|----------|----------|--------------|----------| |--------|----------|----------|--------------|----------|
| Online (AIPerf) | 2-4 hours | Highest | Yes | All | | Online (AIPerf) | 2-4 hours | Highest | Yes | All |
| Offline (AI Configurator) | 20-30 seconds | Estimated | No | TensorRT-LLM | | Offline (AI Configurator) | 20-30 seconds | Estimated | No | All |
## Output ## Output
......
...@@ -19,7 +19,7 @@ metadata: ...@@ -19,7 +19,7 @@ metadata:
name: qwen-0-6b name: qwen-0-6b
spec: spec:
model: "Qwen/Qwen3-0.6B" model: "Qwen/Qwen3-0.6B"
image: "nvcr.io/nvidia/ai-dynamo/dynamo-frontend:1.0.0" image: "nvcr.io/nvidia/ai-dynamo/dynamo-frontend:1.1.0"
``` ```
### Dense Model: Thorough ### Dense Model: Thorough
...@@ -34,7 +34,7 @@ metadata: ...@@ -34,7 +34,7 @@ metadata:
spec: spec:
model: "Qwen/Qwen3-0.6B" model: "Qwen/Qwen3-0.6B"
backend: vllm backend: vllm
image: "nvcr.io/nvidia/ai-dynamo/dynamo-frontend:1.0.0" image: "nvcr.io/nvidia/ai-dynamo/dynamo-frontend:1.1.0"
searchStrategy: thorough searchStrategy: thorough
``` ```
...@@ -55,7 +55,7 @@ metadata: ...@@ -55,7 +55,7 @@ metadata:
spec: spec:
model: "deepseek-ai/DeepSeek-R1" model: "deepseek-ai/DeepSeek-R1"
backend: sglang backend: sglang
image: "nvcr.io/nvidia/ai-dynamo/dynamo-frontend:1.0.0" image: "nvcr.io/nvidia/ai-dynamo/dynamo-frontend:1.1.0"
hardware: hardware:
numGpusPerNode: 8 numGpusPerNode: 8
...@@ -85,7 +85,7 @@ metadata: ...@@ -85,7 +85,7 @@ metadata:
name: llama-private name: llama-private
spec: spec:
model: "meta-llama/Llama-3.1-8B-Instruct" model: "meta-llama/Llama-3.1-8B-Instruct"
image: "nvcr.io/nvidia/ai-dynamo/dynamo-frontend:1.0.0" image: "nvcr.io/nvidia/ai-dynamo/dynamo-frontend:1.1.0"
overrides: overrides:
profilingJob: profilingJob:
...@@ -116,7 +116,7 @@ metadata: ...@@ -116,7 +116,7 @@ metadata:
name: low-latency-dense name: low-latency-dense
spec: spec:
model: "Qwen/Qwen3-0.6B" model: "Qwen/Qwen3-0.6B"
image: "nvcr.io/nvidia/ai-dynamo/dynamo-frontend:1.0.0" image: "nvcr.io/nvidia/ai-dynamo/dynamo-frontend:1.1.0"
sla: sla:
ttft: 500 # Time To First Token target in milliseconds ttft: 500 # Time To First Token target in milliseconds
...@@ -136,15 +136,6 @@ spec: ...@@ -136,15 +136,6 @@ spec:
e2eLatency: 10000 # total request latency budget in milliseconds e2eLatency: 10000 # total request latency budget in milliseconds
``` ```
**Optimization objective without explicit targets** (maximize throughput or minimize latency):
```yaml
spec:
...
sla:
optimizationType: throughput # or: latency
```
### Overrides ### Overrides
Use `overrides` to customize the profiling job pod spec — for example to add tolerations for Use `overrides` to customize the profiling job pod spec — for example to add tolerations for
...@@ -159,7 +150,7 @@ metadata: ...@@ -159,7 +150,7 @@ metadata:
name: dense-with-tolerations name: dense-with-tolerations
spec: spec:
model: "Qwen/Qwen3-0.6B" model: "Qwen/Qwen3-0.6B"
image: "nvcr.io/nvidia/ai-dynamo/dynamo-frontend:1.0.0" image: "nvcr.io/nvidia/ai-dynamo/dynamo-frontend:1.1.0"
overrides: overrides:
profilingJob: profilingJob:
......
...@@ -155,9 +155,9 @@ The profiler enforces these rules at startup: ...@@ -155,9 +155,9 @@ The profiler enforces these rules at startup:
| Condition | Behavior | | Condition | Behavior |
|-----------|----------| |-----------|----------|
| `searchStrategy: thorough` + `backend: auto` | Rejected. Specify a concrete backend. | | `searchStrategy: thorough` + `backend: auto` | Rejected. Specify a concrete backend. |
| AIC unsupported + `enable_throughput_scaling: true` | Rejected. Throughput planner requires AIC support. | | `enable_throughput_scaling: true` + `pre_deployment_sweeping_mode: none` (or unset) | Rejected. Throughput-based scaling requires pre-deployment sweeping. |
| AIC unsupported + `pre_deployment_sweeping_mode: rapid` | Falls back to `none` with a warning. | | `enable_throughput_scaling: true` + `pre_deployment_sweeping_mode: rapid` + AIC unsupported | Rejected. AIC does not support this model/hardware/backend combination; switch `pre_deployment_sweeping_mode` to `thorough`. |
| `e2eLatency` provided without `ttft: null, itl: null` | Rejected by SLA validator. When using `e2eLatency`, explicitly null out `ttft` and `itl`. | | `e2eLatency` provided together with an explicitly-set `ttft` or `itl` | Rejected by SLA validator. Provide only `e2eLatency`; `ttft` and `itl` do not need to be explicitly nulled. |
| SLA unachievable | Warning logged, SLA updated to best achievable value. | | SLA unachievable | Warning logged, SLA updated to best achievable value. |
| Load-match needs more GPUs than available | Warning logged. | | Load-match needs more GPUs than available | Warning logged. |
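
The rejection rules in the updated (right-hand) column of the table above can be sketched as a small validation function. This is only an illustrative sketch, not the profiler's actual code: the function name `validate_profiler_config`, its flat keyword arguments, and the `aic_supported` flag are all hypothetical, and the real profiler may structure its configuration and checks differently. The two warning-only rows (unachievable SLA, load-match GPU shortfall) are left out because they depend on profiling results rather than on the input config.

```python
from typing import List, Optional


def validate_profiler_config(
    search_strategy: str,
    backend: str,
    enable_throughput_scaling: bool = False,
    pre_deployment_sweeping_mode: Optional[str] = None,
    aic_supported: bool = True,  # hypothetical flag: does AIC cover this model/hardware/backend?
    sla: Optional[dict] = None,
) -> List[str]:
    """Return a list of rejection reasons; an empty list means the config passes."""
    errors: List[str] = []

    # Rule: thorough search needs a concrete backend.
    if search_strategy == "thorough" and backend == "auto":
        errors.append("thorough search requires a concrete backend")

    if enable_throughput_scaling:
        # Rule: throughput-based scaling requires pre-deployment sweeping.
        if pre_deployment_sweeping_mode in (None, "none"):
            errors.append("throughput scaling requires pre-deployment sweeping")
        # Rule: rapid sweeping relies on AIC, so AIC must support the combination.
        elif pre_deployment_sweeping_mode == "rapid" and not aic_supported:
            errors.append("AIC unsupported; switch sweeping mode to thorough")

    # Rule: e2eLatency must not be combined with an explicitly set ttft or itl.
    sla = sla or {}
    if sla.get("e2eLatency") is not None and (
        sla.get("ttft") is not None or sla.get("itl") is not None
    ):
        errors.append("provide only e2eLatency, not ttft or itl")

    return errors
```

Under these assumptions, `validate_profiler_config("rapid", "vllm", sla={"e2eLatency": 10000})` returns an empty list, while adding `"ttft": 500` to the same `sla` dict yields one rejection reason.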
...@@ -184,13 +184,7 @@ The profiler sweeps over the following parallelization mappings for prefill and ...@@ -184,13 +184,7 @@ The profiler sweeps over the following parallelization mappings for prefill and
### Kubernetes Deployment (DGDR) ### Kubernetes Deployment (DGDR)
The recommended deployment method is through DGDRs. Sample configurations are provided in `components/src/dynamo/profiler/deploy/`: The recommended deployment method is through DGDRs. See [Profiler Examples](profiler-examples.md) for complete DGDR YAML examples covering rapid, thorough, MoE, custom SLA, and override use cases.
| Sample | Description |
|--------|-------------|
| `profile_sla_dgdr.yaml` | Standard online profiling with AIPerf |
| `profile_sla_aic_dgdr.yaml` | Fast offline profiling with AI Configurator |
| `profile_sla_moe_dgdr.yaml` | MoE model profiling (SGLang) |
#### Container Images #### Container Images
...@@ -200,7 +194,7 @@ Each DGDR requires a container image for profiling and deployment: ...@@ -200,7 +194,7 @@ Each DGDR requires a container image for profiling and deployment:
```yaml ```yaml
spec: spec:
image: "nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.0.0" image: "nvcr.io/nvidia/ai-dynamo/dynamo-frontend:1.1.0"
``` ```
#### Quick Start: Deploy with DGDR #### Quick Start: Deploy with DGDR
...@@ -217,7 +211,7 @@ metadata: ...@@ -217,7 +211,7 @@ metadata:
spec: spec:
model: "Qwen/Qwen3-0.6B" model: "Qwen/Qwen3-0.6B"
backend: vllm backend: vllm
image: "nvcr.io/nvidia/ai-dynamo/dynamo-frontend:1.0.0" image: "nvcr.io/nvidia/ai-dynamo/dynamo-frontend:1.1.0"
``` ```
**Step 2: Apply the DGDR** **Step 2: Apply the DGDR**
...@@ -308,7 +302,7 @@ Uses performance simulation to rapidly estimate optimal configurations without r ...@@ -308,7 +302,7 @@ Uses performance simulation to rapidly estimate optimal configurations without r
- **Duration**: 20-30 seconds - **Duration**: 20-30 seconds
- **Accuracy**: Estimated (may have errors for unusual configurations) - **Accuracy**: Estimated (may have errors for unusual configurations)
- **GPU Requirements**: None - **GPU Requirements**: None
- **Backends**: TensorRT-LLM only (vLLM/SGLang coming soon) - **Backends**: All (vLLM, SGLang, TensorRT-LLM)
AI Configurator is used by default with `searchStrategy: rapid`: AI Configurator is used by default with `searchStrategy: rapid`:
...@@ -321,7 +315,7 @@ spec: ...@@ -321,7 +315,7 @@ spec:
> `aicBackendVersion` specifies the TensorRT-LLM version that AI Configurator simulates. See the [AI Configurator supported features](https://github.com/ai-dynamo/aiconfigurator#supported-features) for available versions. > `aicBackendVersion` specifies the TensorRT-LLM version that AI Configurator simulates. See the [AI Configurator supported features](https://github.com/ai-dynamo/aiconfigurator#supported-features) for available versions.
**Currently supports:** **Currently supports:**
- **Backends**: TensorRT-LLM (versions 0.20.0, 1.0.0rc3, 1.0.0rc6) - **Backends**: vLLM, SGLang, TensorRT-LLM
- **Systems**: H100 SXM, H200 SXM, B200 SXM, GB200 SXM, A100 SXM - **Systems**: H100 SXM, H200 SXM, B200 SXM, GB200 SXM, A100 SXM
- **Models**: Wide range including GPT, Llama, Mixtral, DeepSeek, Qwen, and more - **Models**: Wide range including GPT, Llama, Mixtral, DeepSeek, Qwen, and more
...@@ -374,7 +368,7 @@ metadata: ...@@ -374,7 +368,7 @@ metadata:
spec: spec:
model: "Qwen/Qwen3-0.6B" model: "Qwen/Qwen3-0.6B"
backend: vllm backend: vllm
image: "nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.0.0" image: "nvcr.io/nvidia/ai-dynamo/dynamo-frontend:1.1.0"
searchStrategy: rapid # or thorough searchStrategy: rapid # or thorough
autoApply: true autoApply: true
......