model_cache_pvc_name: String (name of the PVC to mount the model cache,
dgdImage: String (container image to use for DGD components (frontend, planner, workers), overrides images in config file)
modelCache:
pvcName: String (name of the PVC to mount the model cache,
if not provided, model must be HF name and will download from HF, default: "")
model_cache_pvc_path: String (path to the model cache in the PVC, default: "")
model_cache_pvc_mount_path: String (path to the model cache in the container,
pvcPath: String (path to the model cache in the PVC, default: "")
mountPath: String (path to the model cache in the container,
note that the PVC must be mounted to the same path for the profiling job,
default: "/opt/model-cache")
engine:
backend: String (backend type, currently supports [vllm, sglang, trtllm], default: vllm)
config: String (path to the DynamoGraphDeployment config file, default: "")
max_context_length: Int (maximum context length supported by the served model, default: 0)
is_moe_model: Boolean (enable MoE (Mixture of Experts) model support, use TEP for prefill and DEP for decode, default: False)
maxContextLength: Int (maximum context length supported by the served model, default: 0)
isMoeModel: Boolean (enable MoE (Mixture of Experts) model support, use TEP for prefill and DEP for decode, default: False)
hardware:
min_num_gpus_per_engine: Int (minimum number of GPUs per engine, default: 0)
max_num_gpus_per_engine: Int (maximum number of GPUs per engine, default: 0)
num_gpus_per_node: Int (number of GPUs per node for MoE models - this will be the granularity when searching for the best TEP/DEP size, default: 0)
minNumGpusPerEngine: Int (minimum number of GPUs per engine, default: 0)
maxNumGpusPerEngine: Int (maximum number of GPUs per engine, default: 0)
numGpusPerNode: Int (number of GPUs per node for MoE models - this will be the granularity when searching for the best TEP/DEP size, default: 0)
enableGpuDiscovery: Boolean (enable automatic GPU discovery from Kubernetes cluster nodes, when enabled overrides any manually specified hardware configuration, requires cluster-wide node access permissions, default: False)
sweep:
prefill_interpolation_granularity: Int (how many samples to benchmark to interpolate TTFT under different ISL, default: 16)
decode_interpolation_granularity: Int (how many samples to benchmark to interpolate ITL under different active kv cache size and decode context length, default: 6)
use_ai_configurator: Boolean (use ai-configurator to estimate benchmarking results instead of running actual deployment, default: False)
aic_system: String (target system for use with aiconfigurator, default: None)
aic_hf_id: String (aiconfigurator huggingface id of the target model, default: None)
aic_backend: String (aiconfigurator backend of the target model, if not provided, will use args.backend, default: "")
aic_backend_version: String (specify backend version when using aiconfigurator to estimate perf, default: None)
dry_run: Boolean (dry run the profile job, default: False)
pick_with_webui: Boolean (pick the best parallelization mapping using webUI, default: False)
webui_port: Int (webUI port, default: $PROFILER_WEBUI_PORT or 8000)
prefillInterpolationGranularity: Int (how many samples to benchmark to interpolate TTFT under different ISL, default: 16)
decodeInterpolationGranularity: Int (how many samples to benchmark to interpolate ITL under different active kv cache size and decode context length, default: 6)
useAiConfigurator: Boolean (use ai-configurator to estimate benchmarking results instead of running actual deployment, default: False)
aicSystem: String (target system for use with aiconfigurator, default: None)
aicHfId: String (aiconfigurator huggingface id of the target model, default: None)
aicBackend: String (aiconfigurator backend of the target model, if not provided, will use args.backend, default: "")
aicBackendVersion: String (specify backend version when using aiconfigurator to estimate perf, default: None)
dryRun: Boolean (dry run the profile job, default: False)
pickWithWebui: Boolean (pick the best parallelization mapping using webUI, default: False)
webuiPort: Int (webUI port, default: $PROFILER_WEBUI_PORT or 8000)
sla:
isl: Int (target input sequence length, default: 3000)
osl: Int (target output sequence length, default: 500)
ttft: Float (target Time To First Token in milliseconds, default: 50)
itl: Float (target Inter Token Latency in milliseconds, default: 10)
planner: (planner-bypass arguments, use hyphens or underscores)
i.e., planner-min-endpoint: 2 # or planner_min_endpoint: 2 (both work)
planner: (planner arguments)
e.g., plannerMinEndpoint: 2
"""
# Step 1: Pre-parse to check if --profile-config is provided
# Only needed when using AI Configurator (sweep.use_ai_configurator: true)
# Only needed when using AI Configurator (sweep.useAiConfigurator: true)
sweep:
aic_system: h200_sxm  # GPU type for AI Configurator (h100_sxm, h200_sxm, etc.)
aicSystem: h200_sxm  # GPU type for AI Configurator (h100_sxm, h200_sxm, etc.)
```
### Automatic GPU Discovery (Optional Feature)
...
...
@@ -120,7 +120,7 @@ Profiles your model by creating real test deployments in Kubernetes and measurin
profilingConfig:
  config:
    sweep:
      use_ai_configurator: false  # Default
      useAiConfigurator: false  # Default
```
### AI Configurator Simulation
...
...
@@ -138,11 +138,10 @@ Uses performance simulation to rapidly estimate optimal configurations without r
profilingConfig:
  config:
    sweep:
      use_ai_configurator: true
    aic:
      system: h200_sxm  # GPU system type
      model_name: QWEN3_32B  # AIC model identifier
      backend_version: "0.20.0"
      useAiConfigurator: true
      aicSystem: h200_sxm  # GPU system type
      aicHfId: Qwen/Qwen3-32B  # HuggingFace model ID
      aicBackendVersion: "0.20.0"
```
**Supported Configurations:**
...
...
@@ -290,8 +289,7 @@ spec:
    config:  # Profiler configuration
      sla: {...}
      hardware: {...}
      sweep: {...}
      aic: {...}
      sweep: {...}  # AIC settings go here (aicSystem, aicHfId, etc.)
      planner: {...}
  deploymentOverrides:  # Optional
...
...
@@ -326,16 +324,16 @@ Control GPU search space and constraints:
profilingConfig:
  config:
    hardware:
      min_num_gpus_per_engine: 2  # if not provided, will automatically determine based on model and VRAM size
      max_num_gpus_per_engine: 8  # Maximum GPUs to test
      num_gpus_per_node: 8  # GPUs per node (for multi-node MoE)
      gpu_type: h200_sxm  # GPU type hint
      minNumGpusPerEngine: 2  # if not provided, will automatically determine based on model and VRAM size
      maxNumGpusPerEngine: 8  # Maximum GPUs to test
      numGpusPerNode: 8  # GPUs per node (for multi-node MoE)
      gpuType: h200_sxm  # GPU type hint
```
**When to use:**
- **min_num_gpus_per_engine**: Skip small TP sizes if your model is large
- **max_num_gpus_per_engine**: Limit search space or work around constraints (e.g., [AIC attention heads](#ai-configurator-attention-head-constraint-error))
- **num_gpus_per_node**: Determine the upper bound of number of GPUs per node for dense models and configure Grove for multi-node MoE engines.
- **minNumGpusPerEngine**: Skip small TP sizes if your model is large
- **maxNumGpusPerEngine**: Limit search space or work around constraints (e.g., [AIC attention heads](#ai-configurator-attention-head-constraint-error))
- **numGpusPerNode**: Determine the upper bound of number of GPUs per node for dense models and configure Grove for multi-node MoE engines.
- **gpu_type**: Informational, auto-detected by controller
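
To make the effect of these bounds concrete, the sketch below enumerates the GPU counts a search could consider between the minimum and maximum values. It assumes candidates are powers of two, which is common for tensor parallelism but is an assumption here and not documented profiler behavior. `candidate_gpu_counts` is a hypothetical helper, and it only covers explicitly set bounds (not the default of 0, which triggers automatic selection).

```python
def candidate_gpu_counts(min_gpus: int, max_gpus: int) -> list[int]:
    """Return power-of-two GPU counts in [min_gpus, max_gpus].

    The power-of-two grid is an illustrative assumption, not the
    profiler's documented search strategy.
    """
    if min_gpus < 1 or max_gpus < min_gpus:
        raise ValueError("expected 1 <= min_gpus <= max_gpus")
    counts = []
    size = 1
    while size <= max_gpus:
        if size >= min_gpus:
            counts.append(size)
        size *= 2
    return counts


# With the example values above (minimum 2, maximum 8), single-GPU engines are skipped.
print(candidate_gpu_counts(2, 8))
```

Raising the minimum trims the small end of the list, and lowering the maximum trims the large end, which is the lever the two settings provide.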
> [!TIP]
...
...
@@ -349,17 +347,17 @@ Control profiling behavior:
profilingConfig:
  config:
    sweep:
      use_ai_configurator: false  # Use offline profiling (default: false)
      prefill_interpolation_granularity: 16  # Samples for prefill TTFT curve
      decode_interpolation_granularity: 6  # Samples for decode ITL curve
      useAiConfigurator: false  # Use offline profiling (default: false)
      prefillInterpolationGranularity: 16  # Samples for prefill TTFT curve
      decodeInterpolationGranularity: 6  # Samples for decode ITL curve
```
**Use cases:**
- **use_ai_configurator**: Set to `true` for 20-30 second profiling (TensorRT-LLM only)
- **prefill_interpolation_granularity**: How many samples to benchmark for prefill TTFT curve (lower = faster but may be less accurate)
- **decode_interpolation_granularity**: How many samples to benchmark for decode ITL curve (lower = faster but may be less accurate). Since ITL interpolation is a 3D plot and takes longer to run, we default to a smaller number of samples. Increasing this value might quadratically increase the profiling time.
- **useAiConfigurator**: Set to `true` for 20-30 second profiling (TensorRT-LLM only)
- **prefillInterpolationGranularity**: How many samples to benchmark for prefill TTFT curve (lower = faster but may be less accurate)
- **decodeInterpolationGranularity**: How many samples to benchmark for decode ITL curve (lower = faster but may be less accurate). Since ITL interpolation is a 3D plot and takes longer to run, we default to a smaller number of samples. Increasing this value might quadratically increase the profiling time.
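
The cost trade-off described above can be estimated with a rough count of benchmark points. This sketch assumes the prefill sweep is one-dimensional (over ISL) and the decode sweep covers a two-dimensional grid (active KV cache size by decode context length), so the number of decode points grows with the square of the granularity. That grid shape is an assumption inferred from the description of ITL interpolation, and `estimate_sample_counts` is a hypothetical helper, not part of the profiler.

```python
def estimate_sample_counts(prefill_granularity: int = 16,
                           decode_granularity: int = 6) -> dict[str, int]:
    """Estimate benchmark points per sweep under the assumed grid shapes.

    Prefill: one point per ISL sample (1D).
    Decode: one point per (KV cache size, context length) pair (2D, assumed).
    """
    if prefill_granularity < 1 or decode_granularity < 1:
        raise ValueError("granularities must be positive")
    return {
        "prefill": prefill_granularity,
        "decode": decode_granularity ** 2,
    }


defaults = estimate_sample_counts()
doubled = estimate_sample_counts(decode_granularity=12)
# Doubling the decode granularity quadruples the assumed decode point count.
print(defaults["decode"], doubled["decode"])
```

Under these assumptions the defaults give 16 prefill points and 36 decode points; actual profiling time also depends on how long each benchmark point takes to run.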
### AI Configurator Configuration (Required if `use_ai_configurator: true`)
### AI Configurator Configuration (Required if `useAiConfigurator: true`)
Configure AI Configurator profiling mode:
...
...
@@ -367,10 +365,10 @@ Configure AI Configurator profiling mode:
| `model` _string_ | Model specifies the model to deploy (e.g., "Qwen/Qwen3-0.6B", "meta-llama/Llama-3-70b").<br/>This is a high-level identifier for easy reference in kubectl output and logs.<br/>The controller automatically sets this value in profilingConfig.config.deployment.model. | | Required: \{\}<br/> |
| `backend` _string_ | Backend specifies the inference backend for profiling.<br/>The controller automatically sets this value in profilingConfig.config.engine.backend.<br/>Profiling runs on real GPUs or via AIC simulation to collect performance data. | | Enum: [vllm sglang trtllm] <br/>Required: \{\}<br/> |
| `useMocker` _boolean_ | UseMocker indicates whether to deploy a mocker DynamoGraphDeployment instead of<br/>a real backend deployment. When true, the deployment uses simulated engines that<br/>don't require GPUs, using the profiling data to simulate realistic timing behavior.<br/>Mocker is available in all backend images and useful for large-scale experiments.<br/>Profiling still runs against the real backend (specified above) to collect performance data. | false | |
| `enableGpuDiscovery` _boolean_ | EnableGpuDiscovery controls whether the profiler should automatically discover GPU<br/>resources from the Kubernetes cluster nodes. When enabled, the profiler will override<br/>any manually specified hardware configuration (min_num_gpus_per_engine, max_num_gpus_per_engine,<br/>num_gpus_per_node) with values detected from the cluster.<br/>Requires cluster-wide node access permissions - only available with cluster-scoped operators. | false | Optional: \{\}<br/> |
| `enableGpuDiscovery` _boolean_ | EnableGpuDiscovery controls whether the profiler should automatically discover GPU<br/>resources from the Kubernetes cluster nodes. When enabled, the profiler will override<br/>any manually specified hardware configuration (minNumGpusPerEngine, maxNumGpusPerEngine,<br/>numGpusPerNode) with values detected from the cluster.<br/>Requires cluster-wide node access permissions - only available with cluster-scoped operators. | false | Optional: \{\}<br/> |
| `profilingConfig` _[ProfilingConfigSpec](#profilingconfigspec)_ | ProfilingConfig provides the complete configuration for the profiling job.<br />This configuration is passed directly to the profiler.<br />The structure matches the profile_sla config format exactly (see ProfilingConfigSpec for schema).<br/>Note: deployment.model and engine.backend are automatically set from the high-level<br/>modelName and backend fields and should not be specified in this config. | | Required: \{\}<br/> |
| `autoApply` _boolean_ | AutoApply indicates whether to automatically create a DynamoGraphDeployment<br/>after profiling completes. If false, only the spec is generated and stored in status.<br/>Users can then manually create a DGD using the generated spec. | false | |
| `deploymentOverrides` _[DeploymentOverridesSpec](#deploymentoverridesspec)_ | DeploymentOverrides allows customizing metadata for the auto-created DGD.<br/>Only applicable when AutoApply is true. | | Optional: \{\}<br/> |
Alternatively, you can write your own DGDR to suit your needs.
> [!TIP]
> For detailed explanations of all configuration options (SLA, hardware, sweep, AIC, planner), see the [DGDR Configuration Reference](/docs/benchmarks/sla_driven_profiling.md#dgdr-configuration-reference).
...
...
@@ -242,14 +214,14 @@ Choose between **online profiling** (real measurements, 2-4 hours) or **offline
prefill_interpolation_granularity: 16  # Number of samples for prefill ISL sweep
decode_interpolation_granularity: 6  # Number of samples for decode sweep
prefillInterpolationGranularity: 16  # Number of samples for prefill ISL sweep
decodeInterpolationGranularity: 6  # Number of samples for decode sweep
```
> **Note**: `engine.config` is a **file path** to a DGD YAML file, not inline configuration. Use ConfigMapRef (recommended) or leave it unset to auto-generate.
#### Planner Configuration Passthrough
Add planner-specific settings. Planner arguments use a `planner_` prefix:
Add planner-specific settings:
```yaml
profilingConfig:
  config:
    planner:
      planner_min_endpoint: 2
      plannerMinEndpoint: 2
```
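
For existing configs that still use the older snake_case keys (for example `planner_min_endpoint`), a small conversion script may help with migration. The following is a minimal sketch, not a tool shipped with the profiler: `snake_to_camel` and `migrate_keys` are hypothetical helpers that rename dictionary keys only and leave values such as `h200_sxm` untouched.

```python
import re


def snake_to_camel(key: str) -> str:
    """Convert a snake_case key to camelCase, e.g. webui_port -> webuiPort."""
    return re.sub(r"_([a-z0-9])", lambda m: m.group(1).upper(), key)


def migrate_keys(config):
    """Recursively rename string keys in nested dicts and lists; values are unchanged."""
    if isinstance(config, dict):
        return {snake_to_camel(k): migrate_keys(v) for k, v in config.items()}
    if isinstance(config, list):
        return [migrate_keys(item) for item in config]
    return config


old_config = {
    "sweep": {"use_ai_configurator": False, "aic_system": "h200_sxm"},
    "planner": {"planner_min_endpoint": 2},
}
print(migrate_keys(old_config))
```

A plain case conversion does not cover structural changes, such as a separate `aic:` block being folded into `sweep` or `model_name` being replaced by `aicHfId`, so those need to be migrated by hand.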
## Understanding Profiling Results
...
...
@@ -378,6 +349,10 @@ spec:
Profiling still runs against the real backend (via GPUs or AIC) to collect performance data. The mocker deployment then uses this data to simulate realistic timing behavior.
### Using a Model Cache PVC
For large models, you can use a pre-populated PVC containing model weights instead of downloading from HuggingFace. See [Model Cache PVC](/docs/benchmarks/sla_driven_profiling.md#model-cache-pvc-advanced) for configuration details.
### DGDR Immutability
DGDRs are **immutable** - if you need to update SLAs or configuration: