docs: update docs for 1.1.0 DGDR (#8511)

Signed-off-by: Hannah Zhang <hannahz@nvidia.com>
Signed-off-by: hhzhang16 <54051230+hhzhang16@users.noreply.github.com>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

Commit cf92be04 (unverified) · authored Apr 22, 2026 by hhzhang16 · committed by GitHub Apr 22, 2026
3 changed files
--- a/docs/components/profiler/README.md
+++ b/docs/components/profiler/README.md
@@ -12,7 +12,7 @@ The Dynamo Profiler is an automated performance analysis tool that measures mode
 |---------|--------|--------------|------|
 | Dense Model Profiling | ✅ | ✅ | ✅ |
 | MoE Model Profiling | ✅ | 🚧 | 🚧 |
-| AI Configurator (Offline) | ❌ | ✅ | ❌ |
+| AI Configurator (Offline) | ✅ | ✅ | ✅ |
 | Online Profiling (AIPerf) | ✅ | ✅ | ✅ |
 | Interactive WebUI | ✅ | ✅ | ✅ |
 | Runtime Profiling Endpoints | ✅ | ❌ | ❌ |
@@ -37,7 +37,7 @@ metadata:
 spec:
  model: "Qwen/Qwen3-0.6B"
  backend: vllm
-  image: "nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.0.0"
+  image: "nvcr.io/nvidia/ai-dynamo/dynamo-frontend:1.1.0"

  workload:
    isl: 3000      # Average input sequence length
@@ -74,7 +74,7 @@ AI Configurator enables rapid offline profiling (~30 seconds) and supports all b
 | Method | Duration | Accuracy | GPU Required | Backends |
 |--------|----------|----------|--------------|----------|
 | Online (AIPerf) | 2-4 hours | Highest | Yes | All |
-| Offline (AI Configurator) | 20-30 seconds | Estimated | No | TensorRT-LLM |
+| Offline (AI Configurator) | 20-30 seconds | Estimated | No | All |

 ## Output


--- a/docs/components/profiler/profiler-examples.md
+++ b/docs/components/profiler/profiler-examples.md
@@ -19,7 +19,7 @@ metadata:
  name: qwen-0-6b
 spec:
  model: "Qwen/Qwen3-0.6B"
-  image: "nvcr.io/nvidia/ai-dynamo/dynamo-frontend:1.0.0"
+  image: "nvcr.io/nvidia/ai-dynamo/dynamo-frontend:1.1.0"
 ```

 ### Dense Model: Thorough
@@ -34,7 +34,7 @@ metadata:
 spec:
  model: "Qwen/Qwen3-0.6B"
  backend: vllm
-  image: "nvcr.io/nvidia/ai-dynamo/dynamo-frontend:1.0.0"
+  image: "nvcr.io/nvidia/ai-dynamo/dynamo-frontend:1.1.0"
  searchStrategy: thorough
 ```

@@ -55,7 +55,7 @@ metadata:
 spec:
  model: "deepseek-ai/DeepSeek-R1"
  backend: sglang
-  image: "nvcr.io/nvidia/ai-dynamo/dynamo-frontend:1.0.0"
+  image: "nvcr.io/nvidia/ai-dynamo/dynamo-frontend:1.1.0"

  hardware:
    numGpusPerNode: 8
@@ -85,7 +85,7 @@ metadata:
  name: llama-private
 spec:
  model: "meta-llama/Llama-3.1-8B-Instruct"
-  image: "nvcr.io/nvidia/ai-dynamo/dynamo-frontend:1.0.0"
+  image: "nvcr.io/nvidia/ai-dynamo/dynamo-frontend:1.1.0"

  overrides:
    profilingJob:
@@ -116,7 +116,7 @@ metadata:
  name: low-latency-dense
 spec:
  model: "Qwen/Qwen3-0.6B"
-  image: "nvcr.io/nvidia/ai-dynamo/dynamo-frontend:1.0.0"
+  image: "nvcr.io/nvidia/ai-dynamo/dynamo-frontend:1.1.0"

  sla:
    ttft: 500      # Time To First Token target in milliseconds
@@ -136,15 +136,6 @@ spec:
    e2eLatency: 10000    # total request latency budget in milliseconds
 ```

-**Optimization objective without explicit targets** (maximize throughput or minimize latency):
-
-```yaml
-spec:
-  ...
-  sla:
-    optimizationType: throughput    # or: latency
-```
-
 ### Overrides

 Use `overrides` to customize the profiling job pod spec — for example to add tolerations for
@@ -159,7 +150,7 @@ metadata:
  name: dense-with-tolerations
 spec:
  model: "Qwen/Qwen3-0.6B"
-  image: "nvcr.io/nvidia/ai-dynamo/dynamo-frontend:1.0.0"
+  image: "nvcr.io/nvidia/ai-dynamo/dynamo-frontend:1.1.0"

  overrides:
    profilingJob:

--- a/docs/components/profiler/profiler-guide.md
+++ b/docs/components/profiler/profiler-guide.md
@@ -155,9 +155,9 @@ The profiler enforces these rules at startup:
 | Condition | Behavior |
 |-----------|----------|
 | `searchStrategy: thorough` + `backend: auto` | Rejected. Specify a concrete backend. |
-| AIC unsupported + `enable_throughput_scaling: true` | Rejected. Throughput planner requires AIC support. |
-| AIC unsupported + `pre_deployment_sweeping_mode: rapid` | Falls back to `none` with a warning. |
-| `e2eLatency` provided without `ttft: null, itl: null` | Rejected by SLA validator. When using `e2eLatency`, explicitly null out `ttft` and `itl`. |
+| `enable_throughput_scaling: true` + `pre_deployment_sweeping_mode: none` (or unset) | Rejected. Throughput-based scaling requires pre-deployment sweeping. |
+| `enable_throughput_scaling: true` + `pre_deployment_sweeping_mode: rapid` + AIC unsupported | Rejected. AIC does not support this model/hardware/backend combination; switch `pre_deployment_sweeping_mode` to `thorough`. |
+| `e2eLatency` provided together with an explicitly-set `ttft` or `itl` | Rejected by SLA validator. Provide only `e2eLatency`; `ttft` and `itl` do not need to be explicitly nulled. |
 | SLA unachievable | Warning logged, SLA updated to best achievable value. |
 | Load-match needs more GPUs than available | Warning logged. |

@@ -184,13 +184,7 @@ The profiler sweeps over the following parallelization mappings for prefill and

 ### Kubernetes Deployment (DGDR)

-The recommended deployment method is through DGDRs. Sample configurations are provided in `components/src/dynamo/profiler/deploy/`:
-
-| Sample | Description |
-|--------|-------------|
-| `profile_sla_dgdr.yaml` | Standard online profiling with AIPerf |
-| `profile_sla_aic_dgdr.yaml` | Fast offline profiling with AI Configurator |
-| `profile_sla_moe_dgdr.yaml` | MoE model profiling (SGLang) |
+The recommended deployment method is through DGDRs. See [Profiler Examples](profiler-examples.md) for complete DGDR YAML examples covering rapid, thorough, MoE, custom SLA, and override use cases.

 #### Container Images

@@ -200,7 +194,7 @@ Each DGDR requires a container image for profiling and deployment:

 ```yaml
 spec:
-  image: "nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.0.0"
+  image: "nvcr.io/nvidia/ai-dynamo/dynamo-frontend:1.1.0"
 ```

 #### Quick Start: Deploy with DGDR
@@ -217,7 +211,7 @@ metadata:
 spec:
  model: "Qwen/Qwen3-0.6B"
  backend: vllm
-  image: "nvcr.io/nvidia/ai-dynamo/dynamo-frontend:1.0.0"
+  image: "nvcr.io/nvidia/ai-dynamo/dynamo-frontend:1.1.0"
 ```

 **Step 2: Apply the DGDR**
@@ -308,7 +302,7 @@ Uses performance simulation to rapidly estimate optimal configurations without r
 - **Duration**: 20-30 seconds
 - **Accuracy**: Estimated (may have errors for unusual configurations)
 - **GPU Requirements**: None
-- **Backends**: TensorRT-LLM only (vLLM/SGLang coming soon)
+- **Backends**: All (vLLM, SGLang, TensorRT-LLM)

 AI Configurator is used by default with `searchStrategy: rapid`:

@@ -321,7 +315,7 @@ spec:
 > `aicBackendVersion` specifies the TensorRT-LLM version that AI Configurator simulates. See the [AI Configurator supported features](https://github.com/ai-dynamo/aiconfigurator#supported-features) for available versions.

 **Currently supports:**
-- **Backends**: TensorRT-LLM (versions 0.20.0, 1.0.0rc3, 1.0.0rc6)
+- **Backends**: vLLM, SGLang, TensorRT-LLM
 - **Systems**: H100 SXM, H200 SXM, B200 SXM, GB200 SXM, A100 SXM
 - **Models**: Wide range including GPT, Llama, Mixtral, DeepSeek, Qwen, and more

@@ -374,7 +368,7 @@ metadata:
 spec:
  model: "Qwen/Qwen3-0.6B"
  backend: vllm
-  image: "nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.0.0"
+  image: "nvcr.io/nvidia/ai-dynamo/dynamo-frontend:1.1.0"

  searchStrategy: rapid  # or thorough
  autoApply: true
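
Reviewer note: the updated rejection rules in `profiler-guide.md` (the startup-rules table) can be read as the following minimal Python sketch. It is not the profiler's actual validator. The function name `validate_profiler_config`, the flat `spec` dict layout, and the `aic_supported` flag are all hypothetical, and treating a `null` `ttft`/`itl` as "not explicitly set" is an assumption drawn from the new table wording.

```python
def validate_profiler_config(spec: dict, aic_supported: bool) -> list[str]:
    """Return rejection messages for a config; an empty list means it passes.

    Hypothetical helper: `spec` is a flat dict that mixes DGDR fields and
    planner flags purely for illustration. The real config layout may differ.
    """
    errors = []

    # Rule: `searchStrategy: thorough` needs a concrete backend.
    if spec.get("searchStrategy") == "thorough" and spec.get("backend") == "auto":
        errors.append("searchStrategy 'thorough' requires a concrete backend")

    # Rule: throughput-based scaling requires pre-deployment sweeping.
    # An unset mode is treated the same as "none".
    sweeping_mode = spec.get("pre_deployment_sweeping_mode") or "none"
    if spec.get("enable_throughput_scaling"):
        if sweeping_mode == "none":
            errors.append("throughput scaling requires pre-deployment sweeping")
        elif sweeping_mode == "rapid" and not aic_supported:
            # AIC cannot simulate this combination; `thorough` is the way out.
            errors.append("AIC unsupported; set pre_deployment_sweeping_mode to 'thorough'")

    # Rule: `e2eLatency` must not be combined with an explicitly set ttft/itl.
    # Assumption: a null (None) value counts as "not set".
    sla = spec.get("sla") or {}
    if sla.get("e2eLatency") is not None:
        if sla.get("ttft") is not None or sla.get("itl") is not None:
            errors.append("provide only e2eLatency, not ttft or itl")

    return errors
```

Under these assumptions, `validate_profiler_config({"sla": {"e2eLatency": 10000}}, aic_supported=True)` returns an empty list, while adding `"ttft": 500` to the same `sla` block yields one rejection. The warning-only rows (unachievable SLA, load-match exceeding available GPUs) are left out because they do not reject the config.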