dgdImage: String (container image to use for DGD components (frontend, planner, workers), overrides images in config file)
model_cache_pvc_name: String (name of the PVC to mount the model cache,
if not provided, model must be HF name and will download from HF, default: "")
modelCache:
pvcName: String (name of the PVC to mount the model cache,
if not provided, model must be HF name and will download from HF, default: "")
model_cache_pvc_path: String (path to the model cache in the PVC, default: "")
pvcPath: String (path to the model cache in the PVC, default: "")
model_cache_pvc_mount_path: String (path to the model cache in the container,
note that the PVC must be mounted to the same path for the profiling job,
default: "/opt/model-cache")
mountPath: String (path to the model cache in the container,
note that the PVC must be mounted to the same path for the profiling job,
default: "/opt/model-cache")
engine:
backend: String (backend type, currently support [vllm, sglang, trtllm], default: vllm)
config: String (path to the DynamoGraphDeployment config file, default: "")
max_context_length: Int (maximum context length supported by the served model, default: 0)
maxContextLength: Int (maximum context length supported by the served model, default: 0)
is_moe_model: Boolean (enable MoE (Mixture of Experts) model support, use TEP for prefill and DEP for decode, default: False)
isMoeModel: Boolean (enable MoE (Mixture of Experts) model support, use TEP for prefill and DEP for decode, default: False)
hardware:
min_num_gpus_per_engine: Int (minimum number of GPUs per engine, default: 0)
minNumGpusPerEngine: Int (minimum number of GPUs per engine, default: 0)
max_num_gpus_per_engine: Int (maximum number of GPUs per engine, default: 0)
maxNumGpusPerEngine: Int (maximum number of GPUs per engine, default: 0)
num_gpus_per_node: Int (number of GPUs per node for MoE models - this will be the granularity when searching for the best TEP/DEP size, default: 0)
numGpusPerNode: Int (number of GPUs per node for MoE models - this will be the granularity when searching for the best TEP/DEP size, default: 0)
enableGpuDiscovery: Boolean (enable automatic GPU discovery from Kubernetes cluster nodes, when enabled overrides any manually specified hardware configuration, requires cluster-wide node access permissions, default: False)
sweep:
prefill_interpolation_granularity: Int (how many samples to benchmark to interpolate TTFT under different ISL, default: 16)
prefillInterpolationGranularity: Int (how many samples to benchmark to interpolate TTFT under different ISL, default: 16)
decode_interpolation_granularity: Int (how many samples to benchmark to interpolate ITL under different active kv cache size and decode context length, default: 6)
decodeInterpolationGranularity: Int (how many samples to benchmark to interpolate ITL under different active kv cache size and decode context length, default: 6)
use_ai_configurator: Boolean (use ai-configurator to estimate benchmarking results instead of running actual deployment, default: False)
useAiConfigurator: Boolean (use ai-configurator to estimate benchmarking results instead of running actual deployment, default: False)
aic_system: String (target system for use with aiconfigurator, default: None)
aicSystem: String (target system for use with aiconfigurator, default: None)
aic_hf_id: String (aiconfigurator huggingface id of the target model, default: None)
aicHfId: String (aiconfigurator huggingface id of the target model, default: None)
aic_backend: String (aiconfigurator backend of the target model, if not provided, will use args.backend, default: "")
aicBackend: String (aiconfigurator backend of the target model, if not provided, will use args.backend, default: "")
aic_backend_version: String (specify backend version when using aiconfigurator to estimate perf, default: None)
aicBackendVersion: String (specify backend version when using aiconfigurator to estimate perf, default: None)
dry_run: Boolean (dry run the profile job, default: False)
dryRun: Boolean (dry run the profile job, default: False)
pick_with_webui: Boolean (pick the best parallelization mapping using webUI, default: False)
pickWithWebui: Boolean (pick the best parallelization mapping using webUI, default: False)
webui_port: Int (webUI port, default: $PROFILER_WEBUI_PORT or 8000)
webuiPort: Int (webUI port, default: $PROFILER_WEBUI_PORT or 8000)
sla:
isl: Int (target input sequence length, default: 3000)
osl: Int (target output sequence length, default: 500)
ttft: Float (target Time To First Token in milliseconds, default: 50)
itl: Float (target Inter Token Latency in milliseconds, default: 10)
planner: (planner-bypass arguments, use hyphens or underscores)
planner: (planner arguments)
i.e., planner-min-endpoint: 2 # or planner_min_endpoint: 2 (both work)
e.g., plannerMinEndpoint: 2
"""
"""
# Step 1: Pre-parse to check if --profile-config is provided
# Only needed when using AI Configurator (sweep.use_ai_configurator: true)
# Only needed when using AI Configurator (sweep.useAiConfigurator: true)
sweep:
aic_system: h200_sxm  # GPU type for AI Configurator (h100_sxm, h200_sxm, etc.)
aicSystem: h200_sxm  # GPU type for AI Configurator (h100_sxm, h200_sxm, etc.)
```
```
### Automatic GPU Discovery (Optional Feature)
...
@@ -120,7 +120,7 @@ Profiles your model by creating real test deployments in Kubernetes and measurin
profilingConfig:
config:
sweep:
use_ai_configurator: false  # Default
useAiConfigurator: false  # Default
```
```
### AI Configurator Simulation
...
@@ -138,11 +138,10 @@ Uses performance simulation to rapidly estimate optimal configurations without r
profilingConfig:
config:
sweep:
use_ai_configurator: true
aic:
system: h200_sxm  # GPU system type
model_name: QWEN3_32B  # AIC model identifier
backend_version: "0.20.0"
useAiConfigurator: true
aicSystem: h200_sxm  # GPU system type
aicHfId: Qwen/Qwen3-32B  # HuggingFace model ID
aicBackendVersion: "0.20.0"
```
```
**Supported Configurations:**
...
@@ -290,8 +289,7 @@ spec:
config:  # Profiler configuration
sla: {...}
hardware: {...}
sweep: {...}
aic: {...}
sweep: {...}  # AIC settings go here (aicSystem, aicHfId, etc.)
planner: {...}
deploymentOverrides:  # Optional
...
@@ -326,16 +324,16 @@ Control GPU search space and constraints:
profilingConfig:
config:
hardware:
min_num_gpus_per_engine: 2  # if not provided, will automatically determine based on model and VRAM size
minNumGpusPerEngine: 2  # if not provided, will automatically determine based on model and VRAM size
max_num_gpus_per_engine: 8  # Maximum GPUs to test
maxNumGpusPerEngine: 8  # Maximum GPUs to test
num_gpus_per_node: 8  # GPUs per node (for multi-node MoE)
numGpusPerNode: 8  # GPUs per node (for multi-node MoE)
gpu_type: h200_sxm  # GPU type hint
gpuType: h200_sxm  # GPU type hint
```
```
**When to use:**
- **min_num_gpus_per_engine**: Skip small TP sizes if your model is large
- **minNumGpusPerEngine**: Skip small TP sizes if your model is large
- **max_num_gpus_per_engine**: Limit search space or work around constraints (e.g., [AIC attention heads](#ai-configurator-attention-head-constraint-error))
- **maxNumGpusPerEngine**: Limit search space or work around constraints (e.g., [AIC attention heads](#ai-configurator-attention-head-constraint-error))
- **num_gpus_per_node**: Determine the upper bound of number of GPUs per node for dense models and configure Grove for multi-node MoE engines.
- **numGpusPerNode**: Determine the upper bound of number of GPUs per node for dense models and configure Grove for multi-node MoE engines.
- **gpu_type**: Informational, auto-detected by controller
> [!TIP]
...
@@ -349,17 +347,17 @@ Control profiling behavior:
profilingConfig:
config:
sweep:
use_ai_configurator: false  # Use offline profiling (default: false)
useAiConfigurator: false  # Use offline profiling (default: false)
prefill_interpolation_granularity: 16  # Samples for prefill TTFT curve
prefillInterpolationGranularity: 16  # Samples for prefill TTFT curve
decode_interpolation_granularity: 6  # Samples for decode ITL curve
decodeInterpolationGranularity: 6  # Samples for decode ITL curve
```
```
**Use cases:**
- **use_ai_configurator**: Set to `true` for 20-30 second profiling (TensorRT-LLM only)
- **useAiConfigurator**: Set to `true` for 20-30 second profiling (TensorRT-LLM only)
- **prefill_interpolation_granularity**: How many samples to benchmark for prefill TTFT curve (lower = faster but may be less accurate)
- **prefillInterpolationGranularity**: How many samples to benchmark for prefill TTFT curve (lower = faster but may be less accurate)
- **decode_interpolation_granularity**: How many samples to benchmark for decode ITL curve (lower = faster but may be less accurate). Since ITL interpolation is a 3d plot and takes longer to run, we default to a smaller number of samples. Increasing this value might quadratically increase the profiling time.
- **decodeInterpolationGranularity**: How many samples to benchmark for decode ITL curve (lower = faster but may be less accurate). Since ITL interpolation is a 3d plot and takes longer to run, we default to a smaller number of samples. Increasing this value might quadratically increase the profiling time.
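
To make the "quadratic" remark concrete, here is a rough back-of-the-envelope sketch in Python. It assumes the decode sweep samples a full grid over its two inputs (active KV cache size and decode context length), which the text above implies but does not state outright; `estimated_sample_counts` is a hypothetical helper, not part of the profiler.

```python
def estimated_sample_counts(prefill_granularity: int = 16,
                            decode_granularity: int = 6) -> dict:
    """Rough benchmark-point counts for one candidate configuration.

    Assumption (not confirmed against the profiler source): the decode
    sweep samples a full grid over active KV cache size and context length.
    """
    if prefill_granularity < 1 or decode_granularity < 1:
        raise ValueError("granularities must be positive integers")
    return {
        "prefill": prefill_granularity,     # 1-D sweep over ISL
        "decode": decode_granularity ** 2,  # assumed 2-D grid
    }


defaults = estimated_sample_counts()
doubled = estimated_sample_counts(decode_granularity=12)
```

Under that assumption, doubling the decode granularity from 6 to 12 quadruples the number of decode benchmark points, while doubling the prefill granularity only doubles the prefill points.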
### AI Configurator Configuration (Required if `use_ai_configurator: true`)
### AI Configurator Configuration (Required if `useAiConfigurator: true`)
Configure AI Configurator profiling mode:
...
@@ -367,10 +365,10 @@ Configure AI Configurator profiling mode:
| `model` _string_ | Model specifies the model to deploy (e.g., "Qwen/Qwen3-0.6B", "meta-llama/Llama-3-70b").<br/>This is a high-level identifier for easy reference in kubectl output and logs.<br/>The controller automatically sets this value in profilingConfig.config.deployment.model. | | Required: \{\}<br/> |
| `backend` _string_ | Backend specifies the inference backend for profiling.<br/>The controller automatically sets this value in profilingConfig.config.engine.backend.<br/>Profiling runs on real GPUs or via AIC simulation to collect performance data. | | Enum: [vllm sglang trtllm] <br/>Required: \{\}<br/> |
| `useMocker` _boolean_ | UseMocker indicates whether to deploy a mocker DynamoGraphDeployment instead of<br/>a real backend deployment. When true, the deployment uses simulated engines that<br/>don't require GPUs, using the profiling data to simulate realistic timing behavior.<br/>Mocker is available in all backend images and useful for large-scale experiments.<br/>Profiling still runs against the real backend (specified above) to collect performance data. | false | |
| `enableGpuDiscovery` _boolean_ | EnableGpuDiscovery controls whether the profiler should automatically discover GPU<br/>resources from the Kubernetes cluster nodes. When enabled, the profiler will override<br/>any manually specified hardware configuration (min_num_gpus_per_engine, max_num_gpus_per_engine,<br/>num_gpus_per_node) with values detected from the cluster.<br/>Requires cluster-wide node access permissions - only available with cluster-scoped operators. | false | Optional: \{\}<br/> |
| `enableGpuDiscovery` _boolean_ | EnableGpuDiscovery controls whether the profiler should automatically discover GPU<br/>resources from the Kubernetes cluster nodes. When enabled, the profiler will override<br/>any manually specified hardware configuration (minNumGpusPerEngine, maxNumGpusPerEngine,<br/>numGpusPerNode) with values detected from the cluster.<br/>Requires cluster-wide node access permissions - only available with cluster-scoped operators. | false | Optional: \{\}<br/> |
| `profilingConfig` _[ProfilingConfigSpec](#profilingconfigspec)_ | ProfilingConfig provides the complete configuration for the profiling job.<br />This configuration is passed directly to the profiler.<br />The structure matches the profile_sla config format exactly (see ProfilingConfigSpec for schema).<br/>Note: deployment.model and engine.backend are automatically set from the high-level<br/>modelName and backend fields and should not be specified in this config. | | Required: \{\}<br/> |
| `autoApply` _boolean_ | AutoApply indicates whether to automatically create a DynamoGraphDeployment<br/>after profiling completes. If false, only the spec is generated and stored in status.<br/>Users can then manually create a DGD using the generated spec. | false | |
| `deploymentOverrides` _[DeploymentOverridesSpec](#deploymentoverridesspec)_ | DeploymentOverrides allows customizing metadata for the auto-created DGD.<br/>Only applicable when AutoApply is true. | | Optional: \{\}<br/> |
> For detailed explanations of all configuration options (SLA, hardware, sweep, AIC, planner), see the [DGDR Configuration Reference](/docs/benchmarks/sla_driven_profiling.md#dgdr-configuration-reference).
...
@@ -242,14 +214,14 @@ Choose between **online profiling** (real measurements, 2-4 hours) or **offline
prefill_interpolation_granularity: 16  # Number of samples for prefill ISL sweep
prefillInterpolationGranularity: 16  # Number of samples for prefill ISL sweep
decode_interpolation_granularity: 6  # Number of samples for decode sweep
decodeInterpolationGranularity: 6  # Number of samples for decode sweep
```
```
> **Note**: `engine.config` is a **file path** to a DGD YAML file, not inline configuration. Use ConfigMapRef (recommended) or leave it unset to auto-generate.
#### Planner Configuration Passthrough
Add planner-specific settings. Planner arguments use a `planner_` prefix:
Add planner-specific settings:
```yaml
```yaml
profilingConfig:
config:
planner:
planner_min_endpoint: 2
plannerMinEndpoint: 2
```
```
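
For existing configs written with the old snake_case keys, a small migration helper can do most of the renaming. The following is a minimal sketch, not part of the profiler: `snake_to_camel` and `camelize_keys` are hypothetical names, and the sketch only renames keys in place. Structural moves shown in this change (for example `aic.system` becoming `sweep.aicSystem`, `model_name` becoming `aicHfId`, or the `model_cache_pvc_*` fields moving under `modelCache`) still need to be handled by hand.

```python
import re


def snake_to_camel(key: str) -> str:
    """Convert a snake_case or hyphenated key to camelCase."""
    head, *rest = re.split(r"[_-]", key)
    return head + "".join(part[:1].upper() + part[1:] for part in rest)


def camelize_keys(node):
    """Recursively rename dict keys; values are left untouched."""
    if isinstance(node, dict):
        return {snake_to_camel(k): camelize_keys(v) for k, v in node.items()}
    if isinstance(node, list):
        return [camelize_keys(item) for item in node]
    return node


# Illustrative input using key names from the old config format.
old_config = {
    "sweep": {
        "use_ai_configurator": False,
        "prefill_interpolation_granularity": 16,
    },
    "planner": {"planner_min_endpoint": 2},
}
new_config = camelize_keys(old_config)
```

Keys that are already single words (`isl`, `osl`, `ttft`, `itl`) pass through unchanged, so the helper can be applied to the whole `profilingConfig.config` mapping.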
## Understanding Profiling Results
...
@@ -378,6 +349,10 @@ spec:
Profiling still runs against the real backend (via GPUs or AIC) to collect performance data. The mocker deployment then uses this data to simulate realistic timing behavior.
### Using a Model Cache PVC
For large models, you can use a pre-populated PVC containing model weights instead of downloading from HuggingFace. See [Model Cache PVC](/docs/benchmarks/sla_driven_profiling.md#model-cache-pvc-advanced) for configuration details.
### DGDR Immutability
DGDRs are **immutable** - if you need to update SLAs or configuration: