docs: migrate Profiler docs to three-tier structure (#6003)

Signed-off-by: Dan Gil <dagil@nvidia.com>
Signed-off-by: dagil-nvidia <dagil@nvidia.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Jonathan Tong <jt572@cornell.edu>

Unverified commit 3023c625, authored Feb 05, 2026 by dagil-nvidia; committed by GitHub Feb 05, 2026
6 changed files
--- a/docs/backends/sglang/profiling.md
+++ b/docs/backends/sglang/profiling.md
@@ -5,6 +5,9 @@ SPDX-License-Identifier: Apache-2.0

 # Profiling SGLang Workers in Dynamo

+> [!NOTE]
+> **See also**: [Profiler Component Overview](/docs/components/profiler/README.md) for SLA-driven profiling and deployment optimization.
+
 Dynamo exposes profiling endpoints for SGLang workers via the system server's `/engine/*` routes. This allows you to start and stop PyTorch profiling on running inference workers without restarting them.

 These endpoints wrap SGLang's internal `TokenizerManager.start_profile()` and `stop_profile()` methods. See SGLang's documentation for the full list of supported parameters.

--- a/docs/benchmarks/sla_driven_profiling.md
+++ b/docs/benchmarks/sla_driven_profiling.md
@@ -3,6 +3,9 @@
 > [!TIP]
 > **New to DGDR and SLA-Driven Profiling?** Start with the [SLA-Driven Profiling and Planner Deployment Quick Start Guide](/docs/planner/sla_planner_quickstart.md) for step-by-step instructions. This document provides deeper technical details about the profiling process.

+> [!NOTE]
+> **See also**: [Profiler Component Overview](/docs/components/profiler/README.md) for a quick start guide and feature matrix.
+
 ## Overview

 Dynamo provides automated SLA-driven profiling through **DynamoGraphDeploymentRequests (DGDR)**. Instead of manually running profiling scripts, you declare your performance requirements and let the Dynamo Operator handle profiling and deployment automatically.
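
To make "declaring your performance requirements" concrete, here is a minimal sketch that assembles a DGDR manifest as a plain Python dictionary. The field names mirror the YAML examples in `docs/components/profiler/README.md` from this commit. The helper `build_dgdr` and its positivity checks are hypothetical illustrations, not part of Dynamo.

```python
import json


def build_dgdr(name, model, backend, image, isl, osl, ttft_ms, itl_ms):
    """Assemble a DGDR manifest dict (hypothetical helper, not a Dynamo API)."""
    # Assumed sanity check: every SLA target must be a positive number.
    for label, value in (("isl", isl), ("osl", osl), ("ttft", ttft_ms), ("itl", itl_ms)):
        if value <= 0:
            raise ValueError(f"sla.{label} must be positive, got {value}")
    return {
        "apiVersion": "nvidia.com/v1alpha1",
        "kind": "DynamoGraphDeploymentRequest",
        "metadata": {"name": name},
        "spec": {
            "model": model,
            "backend": backend,
            "profilingConfig": {
                "profilerImage": image,
                "config": {
                    "sla": {
                        "isl": isl,              # average input sequence length (tokens)
                        "osl": osl,              # average output sequence length (tokens)
                        "ttft": float(ttft_ms),  # target Time To First Token (ms)
                        "itl": float(itl_ms),    # target Inter-Token Latency (ms)
                    }
                },
            },
            "deploymentOverrides": {"workersImage": image},
            "autoApply": True,
        },
    }


manifest = build_dgdr(
    name="my-model-profiling",
    model="Qwen/Qwen3-0.6B",
    backend="vllm",
    image="nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.9.0",
    isl=3000, osl=150, ttft_ms=200, itl_ms=20,
)
print(json.dumps(manifest, indent=2))
```

The JSON output can be saved to a file and passed to `kubectl apply -f`, which accepts JSON manifests as well as YAML.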

--- a/docs/components/profiler/README.md
+++ b/docs/components/profiler/README.md
+<!--
+SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+SPDX-License-Identifier: Apache-2.0
+-->
+
+# Profiler
+
+The Dynamo Profiler is an automated performance analysis tool that measures model inference characteristics to optimize deployment configurations. It determines optimal tensor parallelism (TP) settings for prefill and decode phases, generates performance interpolation data, and enables SLA-driven autoscaling through the Planner.
+
+## Feature Matrix
+
+| Feature | vLLM | SGLang | TensorRT-LLM |
+|---------|------|--------|--------------|
+| Dense Model Profiling | ✅ | ✅ | ✅ |
+| MoE Model Profiling | 🚧 | ✅ | 🚧 |
+| AI Configurator (Offline) | ❌ | ❌ | ✅ |
+| Online Profiling (AIPerf) | ✅ | ✅ | ✅ |
+| Interactive WebUI | ✅ | ✅ | ✅ |
+| Runtime Profiling Endpoints | ❌ | ✅ | ❌ |
+
+## Quick Start
+
+### Prerequisites
+
+- Dynamo platform installed (see [Installation Guide](/docs/kubernetes/installation_guide.md))
+- Kubernetes cluster with GPU nodes (for DGDR-based profiling)
+- kube-prometheus-stack installed (required for SLA planner)
+
+### Using DynamoGraphDeploymentRequest (Recommended)
+
+The recommended way to profile models is through DGDRs, which automate the entire profiling and deployment workflow.
+
+```yaml
+apiVersion: nvidia.com/v1alpha1
+kind: DynamoGraphDeploymentRequest
+metadata:
+  name: my-model-profiling
+spec:
+  model: "Qwen/Qwen3-0.6B"
+  backend: vllm
+
+  profilingConfig:
+    profilerImage: "nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.9.0"
+    config:
+      sla:
+        isl: 3000      # Average input sequence length
+        osl: 150       # Average output sequence length
+        ttft: 200.0    # Target Time To First Token (ms)
+        itl: 20.0      # Target Inter-Token Latency (ms)
+
+  deploymentOverrides:
+    workersImage: "nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.9.0"
+
+  autoApply: true
+```
+
+```bash
+kubectl apply -f my-profiling-dgdr.yaml -n $NAMESPACE
+```
+
+### Using AI Configurator (Fast Offline Profiling)
+
+For TensorRT-LLM, use AI Configurator for rapid profiling (~30 seconds):
+
+```yaml
+profilingConfig:
+  config:
+    sweep:
+      useAiConfigurator: true
+      aicSystem: h200_sxm
+      aicHfId: Qwen/Qwen3-32B
+      aicBackendVersion: "0.20.0"
+```
+
+### Direct Script Usage (Advanced)
+
+For advanced scenarios, run the profiler directly:
+
+```bash
+python -m benchmarks.profiler.profile_sla \
+  --backend vllm \
+  --config path/to/disagg.yaml \
+  --model meta-llama/Llama-3-8B \
+  --ttft 200 --itl 15 \
+  --isl 3000 --osl 150
+```
+
+## Configuration
+
+| Parameter | Default | Description |
+|-----------|---------|-------------|
+| `sla.isl` | - | Average input sequence length (tokens) |
+| `sla.osl` | - | Average output sequence length (tokens) |
+| `sla.ttft` | - | Target Time To First Token (milliseconds) |
+| `sla.itl` | - | Target Inter-Token Latency (milliseconds) |
+| `sweep.useAiConfigurator` | `false` | Use offline simulation instead of real profiling |
+| `hardware.minNumGpusPerEngine` | auto | Minimum GPUs per engine (auto-detected from model size) |
+| `hardware.maxNumGpusPerEngine` | 8 | Maximum GPUs per engine |
+
+## Profiling Methods
+
+| Method | Duration | Accuracy | GPU Required | Backends |
+|--------|----------|----------|--------------|----------|
+| Online (AIPerf) | 2-4 hours | Highest | Yes | All |
+| Offline (AI Configurator) | 20-30 seconds | Estimated | No | TensorRT-LLM |
+
+## Output
+
+The profiler generates:
+
+1. **Optimal Configuration**: Recommended TP sizes for prefill and decode engines
+2. **Performance Data**: Interpolation models for the SLA Planner
+3. **Generated DGD**: Complete deployment manifest with optimized settings
+
+Example recommendations:
+```text
+Suggested prefill TP:4 (TTFT 48.37 ms, throughput 15505.23 tokens/s/GPU)
+Suggested decode TP:4 (ITL 4.83 ms, throughput 51.22 tokens/s/GPU)
+```
+
+## Next Steps
+
+| Document | Description |
+|----------|-------------|
+| [Profiler Guide](profiler_guide.md) | Configuration, methods, and troubleshooting |
+| [Profiler Examples](profiler_examples.md) | Complete DGDR YAMLs, WebUI, script examples |
+| [SLA Planner Quick Start](/docs/planner/sla_planner_quickstart.md) | End-to-end deployment workflow |
+| [SLA Planner Architecture](/docs/planner/sla_planner.md) | How the Planner uses profiling data |
+
+```{toctree}
+:hidden:
+
+profiler_guide
+profiler_examples
+```
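
The recommendation lines shown in the README's Output section are easy to consume from a script. The following is a minimal sketch that assumes the profiler prints lines in exactly the format of that example; the function `parse_recommendations` and the regular expression are illustrative and would need adjusting if the real log format differs.

```python
import re

# Pattern modeled on the two example lines in the README (an assumption
# about the log format, not a documented contract).
PATTERN = re.compile(
    r"Suggested (?P<phase>prefill|decode) TP:(?P<tp>\d+) "
    r"\((?P<metric>TTFT|ITL) (?P<latency_ms>[\d.]+) ms, "
    r"throughput (?P<throughput>[\d.]+) tokens/s/GPU\)"
)


def parse_recommendations(text):
    """Return {phase: {...}} for each 'Suggested ...' line found in text."""
    results = {}
    for match in PATTERN.finditer(text):
        results[match["phase"]] = {
            "tp": int(match["tp"]),
            "metric": match["metric"],
            "latency_ms": float(match["latency_ms"]),
            "throughput_per_gpu": float(match["throughput"]),
        }
    return results


SAMPLE = """\
Suggested prefill TP:4 (TTFT 48.37 ms, throughput 15505.23 tokens/s/GPU)
Suggested decode TP:4 (ITL 4.83 ms, throughput 51.22 tokens/s/GPU)
"""

recs = parse_recommendations(SAMPLE)
for phase, info in recs.items():
    print(phase, info)
```
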
--- a/docs/components/profiler/profiler_examples.md
+++ b/docs/components/profiler/profiler_examples.md
+<!--
+SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+SPDX-License-Identifier: Apache-2.0
+-->
+
+# Profiler Examples
+
+Complete examples for profiling with DGDRs, the interactive WebUI, and direct script usage.
+
+## DGDR Examples
+
+### Dense Model: AIPerf on Real Engines
+
+Standard online profiling with real GPU measurements:
+
+```yaml
+apiVersion: nvidia.com/v1alpha1
+kind: DynamoGraphDeploymentRequest
+metadata:
+  name: vllm-dense-online
+spec:
+  model: "Qwen/Qwen3-0.6B"
+  backend: vllm
+
+  profilingConfig:
+    profilerImage: "nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.9.0"
+    config:
+      sla:
+        isl: 3000
+        osl: 150
+        ttft: 200.0
+        itl: 20.0
+
+      hardware:
+        minNumGpusPerEngine: 1
+        maxNumGpusPerEngine: 8
+
+      sweep:
+        useAiConfigurator: false
+
+  deploymentOverrides:
+    workersImage: "nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.9.0"
+
+  autoApply: true
+```
+
+### Dense Model: AI Configurator Simulation
+
+Fast offline profiling (~30 seconds, TensorRT-LLM only):
+
+```yaml
+apiVersion: nvidia.com/v1alpha1
+kind: DynamoGraphDeploymentRequest
+metadata:
+  name: trtllm-aic-offline
+spec:
+  model: "Qwen/Qwen3-32B"
+  backend: trtllm
+
+  profilingConfig:
+    profilerImage: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.9.0"
+    config:
+      sla:
+        isl: 4000
+        osl: 500
+        ttft: 300.0
+        itl: 10.0
+
+      sweep:
+        useAiConfigurator: true
+        aicSystem: h200_sxm  # Also supports h100_sxm, b200_sxm, gb200_sxm, a100_sxm
+        aicHfId: Qwen/Qwen3-32B
+        aicBackendVersion: "0.20.0"
+
+  deploymentOverrides:
+    workersImage: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.9.0"
+
+  autoApply: true
+```
+
+### MoE Model
+
+Multi-node MoE profiling with SGLang:
+
+```yaml
+apiVersion: nvidia.com/v1alpha1
+kind: DynamoGraphDeploymentRequest
+metadata:
+  name: sglang-moe
+spec:
+  model: "deepseek-ai/DeepSeek-R1"
+  backend: sglang
+
+  profilingConfig:
+    profilerImage: "nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.9.0"
+    config:
+      sla:
+        isl: 2048
+        osl: 512
+        ttft: 300.0
+        itl: 25.0
+
+      hardware:
+        numGpusPerNode: 8
+        maxNumGpusPerEngine: 32
+
+      engine:
+        isMoeModel: true
+
+  deploymentOverrides:
+    workersImage: "nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.9.0"
+
+  autoApply: true
+```
+
+### Using Existing DGD Config (ConfigMap)
+
+Reference a custom DGD configuration via ConfigMap:
+
+```bash
+# Create ConfigMap from your DGD config file
+kubectl create configmap deepseek-r1-config \
+  --from-file=/path/to/your/disagg.yaml \
+  --namespace $NAMESPACE \
+  --dry-run=client -o yaml | kubectl apply -f -
+```
+
+```yaml
+apiVersion: nvidia.com/v1alpha1
+kind: DynamoGraphDeploymentRequest
+metadata:
+  name: deepseek-r1
+spec:
+  model: deepseek-ai/DeepSeek-R1
+  backend: sglang
+
+  profilingConfig:
+    profilerImage: "nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.9.0"
+    configMapRef:
+      name: deepseek-r1-config
+      key: disagg.yaml
+    config:
+      sla:
+        isl: 4000
+        osl: 500
+        ttft: 300
+        itl: 10
+      sweep:
+        useAiConfigurator: true
+        aicSystem: h200_sxm
+        aicHfId: deepseek-ai/DeepSeek-V3
+        aicBackendVersion: "0.20.0"
+
+  deploymentOverrides:
+    workersImage: "nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.9.0"
+
+  autoApply: true
+```
+
+## Interactive WebUI
+
+Launch an interactive configuration selection interface:
+
+```bash
+python -m benchmarks.profiler.profile_sla \
+  --backend trtllm \
+  --config path/to/disagg.yaml \
+  --pick-with-webui \
+  --use-ai-configurator \
+  --model Qwen/Qwen3-32B-FP8 \
+  --aic-system h200_sxm \
+  --ttft 200 --itl 15
+```
+
+The WebUI launches on port 8000 by default (configurable with `--webui-port`).
+
+### Features
+
+- **Interactive Charts**: Visualize prefill TTFT, decode ITL, and GPU hours analysis with hover-to-highlight synchronization between charts and tables
+- **Pareto-Optimal Analysis**: The GPU Hours table shows Pareto-optimal configurations balancing latency and throughput
+- **DGD Config Preview**: Click "Show Config" on any row to view the corresponding DynamoGraphDeployment YAML
+- **GPU Cost Estimation**: Toggle GPU cost display to convert GPU hours to cost ($/1000 requests)
+- **SLA Visualization**: Red dashed lines indicate your TTFT and ITL targets
+
+### Selection Methods
+
+1. **GPU Hours Table** (recommended): Click any row to select both prefill and decode configurations at once based on the Pareto-optimal combination
+2. **Individual Selection**: Click one row in the Prefill table AND one row in the Decode table to manually choose each
+
+### Example DGD Config Output
+
+When you click "Show Config", you see a DynamoGraphDeployment configuration:
+
+```yaml
+# DynamoGraphDeployment Configuration
+# Prefill: 1 GPU(s), TP=1
+# Decode: 4 GPU(s), TP=4
+# Model: Qwen/Qwen3-32B-FP8
+# Backend: trtllm
+apiVersion: nvidia.com/v1alpha1
+kind: DynamoGraphDeployment
+spec:
+  services:
+    PrefillWorker:
+      subComponentType: prefill
+      replicas: 1
+      extraPodSpec:
+        mainContainer:
+          args:
+          - --tensor-parallel-size=1
+    DecodeWorker:
+      subComponentType: decode
+      replicas: 1
+      extraPodSpec:
+        mainContainer:
+          args:
+          - --tensor-parallel-size=4
+```
+
+Once you select a configuration, the full DGD CRD is saved as `config_with_planner.yaml`.
+
+## Direct Script Examples
+
+### Basic Profiling
+
+```bash
+python -m benchmarks.profiler.profile_sla \
+  --backend vllm \
+  --config path/to/disagg.yaml \
+  --model meta-llama/Llama-3-8B \
+  --ttft 200 --itl 15 \
+  --isl 3000 --osl 150
+```
+
+### With GPU Constraints
+
+```bash
+python -m benchmarks.profiler.profile_sla \
+  --backend sglang \
+  --config examples/backends/sglang/deploy/disagg.yaml \
+  --model deepseek-ai/DeepSeek-R1-Distill-Llama-8B \
+  --ttft 200 --itl 15 \
+  --isl 3000 --osl 150 \
+  --min-num-gpus 2 \
+  --max-num-gpus 8
+```
+
+### AI Configurator (Offline)
+
+```bash
+python -m benchmarks.profiler.profile_sla \
+  --backend trtllm \
+  --config path/to/disagg.yaml \
+  --use-ai-configurator \
+  --model Qwen/Qwen3-32B-FP8 \
+  --aic-system h200_sxm \
+  --ttft 200 --itl 15 \
+  --isl 4000 --osl 500
+```
+
+## SGLang Runtime Profiling
+
+Profile SGLang workers at runtime via HTTP endpoints:
+
+```bash
+# Start profiling
+curl -X POST http://localhost:9090/engine/start_profile \
+  -H "Content-Type: application/json" \
+  -d '{"output_dir": "/tmp/profiler_output"}'
+
+# Run inference requests to generate profiling data...
+
+# Stop profiling
+curl -X POST http://localhost:9090/engine/stop_profile
+```
+
+A test script is provided at `examples/backends/sglang/test_sglang_profile.py`:
+
+```bash
+python examples/backends/sglang/test_sglang_profile.py
+```
+
+View traces using Chrome's `chrome://tracing`, [Perfetto UI](https://ui.perfetto.dev/), or TensorBoard.
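
The same start/stop calls can be scripted without `curl`. Below is a minimal sketch using only the Python standard library. The URL, routes, and `output_dir` payload come from the curl example above; the helpers `build_profile_request` and `send` are hypothetical names, and the sketch only builds the requests unless `send` is called against a running worker.

```python
import json
import urllib.request

BASE_URL = "http://localhost:9090"  # system server address from the curl example


def build_profile_request(action, payload=None, base_url=BASE_URL):
    """Build (but do not send) a POST request for an /engine/* profiling route."""
    if action not in ("start_profile", "stop_profile"):
        raise ValueError(f"unknown action: {action}")
    # Send a JSON body only when a payload is given, as in the curl example.
    data = json.dumps(payload).encode("utf-8") if payload is not None else b""
    headers = {"Content-Type": "application/json"} if payload is not None else {}
    return urllib.request.Request(
        f"{base_url}/engine/{action}", data=data, headers=headers, method="POST"
    )


def send(request, timeout=10.0):
    """Send a prepared request and return (HTTP status, response body)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status, response.read().decode("utf-8")


start = build_profile_request("start_profile", {"output_dir": "/tmp/profiler_output"})
stop = build_profile_request("stop_profile")
print(start.get_method(), start.full_url)
print(stop.get_method(), stop.full_url)

# Against a running worker, the sequence would be:
#   send(start)  ->  run inference requests  ->  send(stop)
```
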
--- a/docs/components/profiler/profiler_guide.md
+++ b/docs/components/profiler/profiler_guide.md
--- a/docs/index.rst
+++ b/docs/index.rst
@@ -78,6 +78,7 @@ Quickstart
   Frontends <_sections/frontends>
   Router <router/README>
   Planner <planner/planner_intro>
+   Profiler <components/profiler/README>
   KVBM <kvbm/kvbm_intro>

 .. toctree::