Commit 36f58e36 authored by hhzhang16, committed by GitHub

feat: add an optional PVC mounting option to DGDR for profiling (#4503)


Signed-off-by: Hannah Zhang <hannahz@nvidia.com>
Signed-off-by: hhzhang16 <54051230+hhzhang16@users.noreply.github.com>
Co-authored-by: Hongkuan Zhou <tedzhouhk@gmail.com>
parent 179ee38b
......@@ -183,12 +183,124 @@ spec:
required:
- name
type: object
outputPVC:
description: |-
OutputPVC is an optional PersistentVolumeClaim name for storing profiling output.
If specified, all profiling artifacts (logs, plots, configs, raw data) will be written
to this PVC instead of an ephemeral emptyDir volume. This allows users to access
complete profiling results after the job completes by mounting the PVC.
The PVC must exist in the same namespace as the DGDR.
If not specified, profiling uses emptyDir and only essential data is saved to ConfigMaps.
Note: ConfigMaps are still created regardless of this setting for planner integration.
type: string
profilerImage:
description: |-
ProfilerImage specifies the container image to use for profiling jobs.
This image contains the profiler code and dependencies needed for SLA-based profiling.
Example: "nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.6.1"
type: string
resources:
description: |-
Resources specifies the compute resource requirements for the profiling job container.
If not specified, no resource requests or limits are set.
properties:
claims:
description: |-
Claims lists the names of resources, defined in spec.resourceClaims,
that are used by this container.
This is an alpha field and requires enabling the
DynamicResourceAllocation feature gate.
This field is immutable. It can only be set for containers.
items:
description: ResourceClaim references one entry in PodSpec.ResourceClaims.
properties:
name:
description: |-
Name must match the name of one entry in pod.spec.resourceClaims of
the Pod where this field is used. It makes that resource available
inside a container.
type: string
request:
description: |-
Request is the name chosen for a request in the referenced claim.
If empty, everything from the claim is made available, otherwise
only the result of this request.
type: string
required:
- name
type: object
type: array
x-kubernetes-list-map-keys:
- name
x-kubernetes-list-type: map
limits:
additionalProperties:
anyOf:
- type: integer
- type: string
pattern: ^(\+|-)?(([0-9]+(\.[0-9]*)?)|(\.[0-9]+))(([KMGTPE]i)|[numkMGTPE]|([eE](\+|-)?(([0-9]+(\.[0-9]*)?)|(\.[0-9]+))))?$
x-kubernetes-int-or-string: true
description: |-
Limits describes the maximum amount of compute resources allowed.
More info: https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/
type: object
requests:
additionalProperties:
anyOf:
- type: integer
- type: string
pattern: ^(\+|-)?(([0-9]+(\.[0-9]*)?)|(\.[0-9]+))(([KMGTPE]i)|[numkMGTPE]|([eE](\+|-)?(([0-9]+(\.[0-9]*)?)|(\.[0-9]+))))?$
x-kubernetes-int-or-string: true
description: |-
Requests describes the minimum amount of compute resources required.
If Requests is omitted for a container, it defaults to Limits if that is explicitly specified,
otherwise to an implementation-defined value. Requests cannot exceed Limits.
More info: https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/
type: object
type: object
tolerations:
description: |-
Tolerations allows the profiling job to be scheduled on nodes with matching taints.
For example, to schedule on GPU nodes, add a toleration for the nvidia.com/gpu taint.
items:
description: |-
The pod this Toleration is attached to tolerates any taint that matches
the triple <key,value,effect> using the matching operator <operator>.
properties:
effect:
description: |-
Effect indicates the taint effect to match. Empty means match all taint effects.
When specified, allowed values are NoSchedule, PreferNoSchedule and NoExecute.
type: string
key:
description: |-
Key is the taint key that the toleration applies to. Empty means match all taint keys.
If the key is empty, operator must be Exists; this combination means to match all values and all keys.
type: string
operator:
description: |-
Operator represents a key's relationship to the value.
Valid operators are Exists and Equal. Defaults to Equal.
Exists is equivalent to wildcard for value, so that a pod can
tolerate all taints of a particular category.
type: string
tolerationSeconds:
description: |-
TolerationSeconds represents the period of time the toleration (which must be
of effect NoExecute, otherwise this field is ignored) tolerates the taint. By default,
it is not set, which means tolerate the taint forever (do not evict). Zero and
negative values will be treated as 0 (evict immediately) by the system.
format: int64
type: integer
value:
description: |-
Value is the taint value the toleration matches to.
If the operator is Exists, the value should be empty, otherwise just a regular string.
type: string
type: object
type: array
required:
- profilerImage
type: object
......
......@@ -24,6 +24,7 @@ a high-level, SLA-driven interface for deploying machine learning models on Dyna
package v1alpha1
import (
corev1 "k8s.io/api/core/v1"
apiextensionsv1 "k8s.io/apiextensions-apiserver/pkg/apis/apiextensions/v1"
metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
runtime "k8s.io/apimachinery/pkg/runtime"
......@@ -66,6 +67,26 @@ type ProfilingConfigSpec struct {
// Example: "nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.6.1"
// +kubebuilder:validation:Required
ProfilerImage string `json:"profilerImage"`
// OutputPVC is an optional PersistentVolumeClaim name for storing profiling output.
// If specified, all profiling artifacts (logs, plots, configs, raw data) will be written
// to this PVC instead of an ephemeral emptyDir volume. This allows users to access
// complete profiling results after the job completes by mounting the PVC.
// The PVC must exist in the same namespace as the DGDR.
// If not specified, profiling uses emptyDir and only essential data is saved to ConfigMaps.
// Note: ConfigMaps are still created regardless of this setting for planner integration.
// +kubebuilder:validation:Optional
OutputPVC string `json:"outputPVC,omitempty"`
// Resources specifies the compute resource requirements for the profiling job container.
// If not specified, no resource requests or limits are set.
// +kubebuilder:validation:Optional
Resources *corev1.ResourceRequirements `json:"resources,omitempty"`
// Tolerations allows the profiling job to be scheduled on nodes with matching taints.
// For example, to schedule on GPU nodes, add a toleration for the nvidia.com/gpu taint.
// +kubebuilder:validation:Optional
Tolerations []corev1.Toleration `json:"tolerations,omitempty"`
}
// DeploymentOverridesSpec allows users to customize metadata for auto-created DynamoGraphDeployments.
......
......@@ -1009,6 +1009,18 @@ func (in *ProfilingConfigSpec) DeepCopyInto(out *ProfilingConfigSpec) {
*out = new(ConfigMapKeySelector)
**out = **in
}
if in.Resources != nil {
in, out := &in.Resources, &out.Resources
*out = new(v1.ResourceRequirements)
(*in).DeepCopyInto(*out)
}
if in.Tolerations != nil {
in, out := &in.Tolerations, &out.Tolerations
*out = make([]v1.Toleration, len(*in))
for i := range *in {
(*in)[i].DeepCopyInto(&(*out)[i])
}
}
}
// DeepCopy is an autogenerated deepcopy function, copying the receiver, creating a new ProfilingConfigSpec.
......
......@@ -183,12 +183,124 @@ spec:
required:
- name
type: object
outputPVC:
description: |-
OutputPVC is an optional PersistentVolumeClaim name for storing profiling output.
If specified, all profiling artifacts (logs, plots, configs, raw data) will be written
to this PVC instead of an ephemeral emptyDir volume. This allows users to access
complete profiling results after the job completes by mounting the PVC.
The PVC must exist in the same namespace as the DGDR.
If not specified, profiling uses emptyDir and only essential data is saved to ConfigMaps.
Note: ConfigMaps are still created regardless of this setting for planner integration.
type: string
profilerImage:
description: |-
ProfilerImage specifies the container image to use for profiling jobs.
This image contains the profiler code and dependencies needed for SLA-based profiling.
Example: "nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.6.1"
type: string
resources:
description: |-
Resources specifies the compute resource requirements for the profiling job container.
If not specified, no resource requests or limits are set.
properties:
claims:
description: |-
Claims lists the names of resources, defined in spec.resourceClaims,
that are used by this container.
This is an alpha field and requires enabling the
DynamicResourceAllocation feature gate.
This field is immutable. It can only be set for containers.
items:
description: ResourceClaim references one entry in PodSpec.ResourceClaims.
properties:
name:
description: |-
Name must match the name of one entry in pod.spec.resourceClaims of
the Pod where this field is used. It makes that resource available
inside a container.
type: string
request:
description: |-
Request is the name chosen for a request in the referenced claim.
If empty, everything from the claim is made available, otherwise
only the result of this request.
type: string
required:
- name
type: object
type: array
x-kubernetes-list-map-keys:
- name
x-kubernetes-list-type: map
limits:
additionalProperties:
anyOf:
- type: integer
- type: string
pattern: ^(\+|-)?(([0-9]+(\.[0-9]*)?)|(\.[0-9]+))(([KMGTPE]i)|[numkMGTPE]|([eE](\+|-)?(([0-9]+(\.[0-9]*)?)|(\.[0-9]+))))?$
x-kubernetes-int-or-string: true
description: |-
Limits describes the maximum amount of compute resources allowed.
More info: https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/
type: object
requests:
additionalProperties:
anyOf:
- type: integer
- type: string
pattern: ^(\+|-)?(([0-9]+(\.[0-9]*)?)|(\.[0-9]+))(([KMGTPE]i)|[numkMGTPE]|([eE](\+|-)?(([0-9]+(\.[0-9]*)?)|(\.[0-9]+))))?$
x-kubernetes-int-or-string: true
description: |-
Requests describes the minimum amount of compute resources required.
If Requests is omitted for a container, it defaults to Limits if that is explicitly specified,
otherwise to an implementation-defined value. Requests cannot exceed Limits.
More info: https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/
type: object
type: object
tolerations:
description: |-
Tolerations allows the profiling job to be scheduled on nodes with matching taints.
For example, to schedule on GPU nodes, add a toleration for the nvidia.com/gpu taint.
items:
description: |-
The pod this Toleration is attached to tolerates any taint that matches
the triple <key,value,effect> using the matching operator <operator>.
properties:
effect:
description: |-
Effect indicates the taint effect to match. Empty means match all taint effects.
When specified, allowed values are NoSchedule, PreferNoSchedule and NoExecute.
type: string
key:
description: |-
Key is the taint key that the toleration applies to. Empty means match all taint keys.
If the key is empty, operator must be Exists; this combination means to match all values and all keys.
type: string
operator:
description: |-
Operator represents a key's relationship to the value.
Valid operators are Exists and Equal. Defaults to Equal.
Exists is equivalent to wildcard for value, so that a pod can
tolerate all taints of a particular category.
type: string
tolerationSeconds:
description: |-
TolerationSeconds represents the period of time the toleration (which must be
of effect NoExecute, otherwise this field is ignored) tolerates the taint. By default,
it is not set, which means tolerate the taint forever (do not evict). Zero and
negative values will be treated as 0 (evict immediately) by the system.
format: int64
type: integer
value:
description: |-
Value is the taint value the toleration matches to.
If the operator is Exists, the value should be empty, otherwise just a regular string.
type: string
type: object
type: array
required:
- profilerImage
type: object
......
......@@ -28,7 +28,6 @@ import (
 	corev1 "k8s.io/api/core/v1"
 	apierrors "k8s.io/apimachinery/pkg/api/errors"
 	"k8s.io/apimachinery/pkg/api/meta"
-	"k8s.io/apimachinery/pkg/api/resource"
 	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
 	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
 	"k8s.io/apimachinery/pkg/runtime"
......@@ -892,66 +891,10 @@ func (r *DynamoGraphDeploymentRequestReconciler) createProfilingJob(ctx context.
 	jobName := getProfilingJobName(dgdr)
 	outputConfigMapName := getOutputConfigMapName(dgdr)
-	// Parse the profiling config from JSON
-	var config map[string]interface{}
-	if err := yaml.Unmarshal(dgdr.Spec.ProfilingConfig.Config.Raw, &config); err != nil {
-		return nil, false, fmt.Errorf("failed to parse profiling config: %w", err)
-	}
-	// Set deployment.namespace if not already set
-	deploymentVal, hasDeployment := config["deployment"]
-	var deploymentConfig map[string]interface{}
-	if !hasDeployment || deploymentVal == nil {
-		deploymentConfig = make(map[string]interface{})
-		config["deployment"] = deploymentConfig
-	} else {
-		var ok bool
-		deploymentConfig, ok = deploymentVal.(map[string]interface{})
-		if !ok {
-			return nil, false, fmt.Errorf("profilingConfig.config.deployment must be an object, got %T", deploymentVal)
-		}
-	}
-	if _, hasNamespace := deploymentConfig["namespace"]; !hasNamespace {
-		deploymentConfig["namespace"] = dgdr.Namespace
-	}
-	// Set deployment.model from spec.model
-	deploymentConfig["model"] = dgdr.Spec.Model
-	// Set deployment.dgd_image from deploymentOverrides.workersImage if provided
-	if dgdr.Spec.DeploymentOverrides != nil && dgdr.Spec.DeploymentOverrides.WorkersImage != "" {
-		deploymentConfig["dgd_image"] = dgdr.Spec.DeploymentOverrides.WorkersImage
-	}
-	// Set output_dir if not already set
-	if _, hasOutputDir := config["output_dir"]; !hasOutputDir {
-		config["output_dir"] = ProfilingOutputPath
-	}
-	// Set engine.backend from spec.backend
-	engineVal, hasEngine := config["engine"]
-	var engineConfig map[string]interface{}
-	if !hasEngine || engineVal == nil {
-		engineConfig = make(map[string]interface{})
-		config["engine"] = engineConfig
-	} else {
-		var ok bool
-		engineConfig, ok = engineVal.(map[string]interface{})
-		if !ok {
-			return nil, false, fmt.Errorf("profilingConfig.config.engine must be an object, got %T", engineVal)
-		}
-	}
-	engineConfig["backend"] = dgdr.Spec.Backend
-	// If ConfigMapRef is provided, set engine.config path
-	if dgdr.Spec.ProfilingConfig.ConfigMapRef != nil {
-		engineConfig["config"] = fmt.Sprintf("%s/%s", ProfilingConfigPath, ProfilingConfigFile)
-	}
-	// Serialize config to YAML for passing to profiler
-	configYAML, err := sigsyaml.Marshal(config)
+	// Parse and prepare profiling config
+	configYAML, err := r.prepareProfilingConfig(dgdr)
 	if err != nil {
-		return nil, false, fmt.Errorf("failed to marshal profiling config to YAML: %w", err)
+		return nil, false, err
 	}
 	// Common environment variables
......@@ -1023,20 +966,19 @@ func (r *DynamoGraphDeploymentRequestReconciler) createProfilingJob(ctx context.
 	logger.Info("Using profiler image", "image", imageName)
 	profilerContainer := corev1.Container{
-		Name:    ContainerNameProfiler,
-		Image:   imageName,
-		Command: []string{"python", "-m", "benchmarks.profiler.profile_sla"},
-		Args:    profilerArgs,
-		Resources: corev1.ResourceRequirements{
-			Requests: corev1.ResourceList{
-				corev1.ResourceCPU:    resource.MustParse("16"),
-				corev1.ResourceMemory: resource.MustParse("10Gi"),
-			},
-		},
+		Name:         ContainerNameProfiler,
+		Image:        imageName,
+		Command:      []string{"python", "-m", "benchmarks.profiler.profile_sla"},
+		Args:         profilerArgs,
 		Env:          profilerEnv,
 		VolumeMounts: volumeMounts,
 	}
+	// Apply resource requirements if specified in the DGDR
+	if dgdr.Spec.ProfilingConfig.Resources != nil {
+		profilerContainer.Resources = *dgdr.Spec.ProfilingConfig.Resources
+	}
 	// Generate sidecar script from template
 	tmpl, err := template.New("sidecar").Parse(sidecarScriptTemplate)
 	if err != nil {
......@@ -1067,14 +1009,27 @@ func (r *DynamoGraphDeploymentRequestReconciler) createProfilingJob(ctx context.
 		}},
 	}
-	// Build volumes - use emptyDir for profiling output
-	// The sidecar saves all needed data to ConfigMaps, so persistence is not needed
-	volumes := []corev1.Volume{{
-		Name: VolumeNameProfilingOutput,
-		VolumeSource: corev1.VolumeSource{
-			EmptyDir: &corev1.EmptyDirVolumeSource{},
-		},
-	}}
+	// Use PVC if specified, otherwise use emptyDir for profiling output
+	var profilingOutputVolume corev1.Volume
+	if dgdr.Spec.ProfilingConfig.OutputPVC != "" {
+		logger.Info("Using PVC for profiling output", "pvc", dgdr.Spec.ProfilingConfig.OutputPVC)
+		profilingOutputVolume = corev1.Volume{
+			Name: VolumeNameProfilingOutput,
+			VolumeSource: corev1.VolumeSource{
+				PersistentVolumeClaim: &corev1.PersistentVolumeClaimVolumeSource{
+					ClaimName: dgdr.Spec.ProfilingConfig.OutputPVC,
+				},
+			},
+		}
+	} else {
+		profilingOutputVolume = corev1.Volume{
+			Name: VolumeNameProfilingOutput,
+			VolumeSource: corev1.VolumeSource{
+				EmptyDir: &corev1.EmptyDirVolumeSource{},
+			},
+		}
+	}
+	volumes := []corev1.Volume{profilingOutputVolume}
 	// Add ConfigMap volume if provided
 	if dgdr.Spec.ProfilingConfig.ConfigMapRef != nil {
......@@ -1108,6 +1063,27 @@ func (r *DynamoGraphDeploymentRequestReconciler) createProfilingJob(ctx context.
labelValue = LabelValueAICProfiler
}
podSpec := corev1.PodSpec{
ServiceAccountName: ServiceAccountProfilingJob,
RestartPolicy: corev1.RestartPolicyNever,
SecurityContext: &corev1.PodSecurityContext{
RunAsNonRoot: ptr.To(true), // Enforces that container cannot run as root
RunAsUser: ptr.To[int64](1000), // Run as UID 1000 (non-privileged user)
RunAsGroup: ptr.To[int64](1000), // Run with GID 1000 (non-privileged group)
FSGroup: ptr.To[int64](1000), // Volume files owned by GID 1000
},
Containers: []corev1.Container{profilerContainer, sidecarContainer},
Volumes: volumes,
ImagePullSecrets: []corev1.LocalObjectReference{
{Name: "nvcr-imagepullsecret"},
},
}
// Apply tolerations if specified in the DGDR
if len(dgdr.Spec.ProfilingConfig.Tolerations) > 0 {
podSpec.Tolerations = dgdr.Spec.ProfilingConfig.Tolerations
}
job := &batchv1.Job{
ObjectMeta: metav1.ObjectMeta{
Name: jobName,
......@@ -1121,20 +1097,7 @@ func (r *DynamoGraphDeploymentRequestReconciler) createProfilingJob(ctx context.
 		Spec: batchv1.JobSpec{
 			BackoffLimit: &backoffLimit,
 			Template: corev1.PodTemplateSpec{
-				Spec: corev1.PodSpec{
-					ServiceAccountName: ServiceAccountProfilingJob,
-					RestartPolicy: corev1.RestartPolicyNever, SecurityContext: &corev1.PodSecurityContext{
-						RunAsNonRoot: ptr.To(true), // Enforces that container cannot run as root
-						RunAsUser: ptr.To[int64](1000), // Run as UID 1000 (non-privileged user)
-						RunAsGroup: ptr.To[int64](1000), // Run with GID 1000 (non-privileged group)
-						FSGroup: ptr.To[int64](1000), // Volume files owned by GID 1000
-					},
-					Containers: []corev1.Container{profilerContainer, sidecarContainer},
-					Volumes: volumes,
-					ImagePullSecrets: []corev1.LocalObjectReference{
-						{Name: "nvcr-imagepullsecret"},
-					},
-				},
+				Spec: podSpec,
 			},
 		},
 	}
......@@ -1153,6 +1116,73 @@ func (r *DynamoGraphDeploymentRequestReconciler) createProfilingJob(ctx context.
return nil
}
// prepareProfilingConfig parses and modifies the profiling config
func (r *DynamoGraphDeploymentRequestReconciler) prepareProfilingConfig(dgdr *nvidiacomv1alpha1.DynamoGraphDeploymentRequest) ([]byte, error) {
// Parse the profiling config from JSON
var config map[string]interface{}
if err := yaml.Unmarshal(dgdr.Spec.ProfilingConfig.Config.Raw, &config); err != nil {
return nil, fmt.Errorf("failed to parse profiling config: %w", err)
}
// Set deployment.namespace if not already set
deploymentVal, hasDeployment := config["deployment"]
var deploymentConfig map[string]interface{}
if !hasDeployment || deploymentVal == nil {
deploymentConfig = make(map[string]interface{})
config["deployment"] = deploymentConfig
} else {
var ok bool
deploymentConfig, ok = deploymentVal.(map[string]interface{})
if !ok {
return nil, fmt.Errorf("profilingConfig.config.deployment must be an object, got %T", deploymentVal)
}
}
if _, hasNamespace := deploymentConfig["namespace"]; !hasNamespace {
deploymentConfig["namespace"] = dgdr.Namespace
}
// Set deployment.model from spec.model
deploymentConfig["model"] = dgdr.Spec.Model
// Set deployment.dgd_image from deploymentOverrides.workersImage if provided
if dgdr.Spec.DeploymentOverrides != nil && dgdr.Spec.DeploymentOverrides.WorkersImage != "" {
deploymentConfig["dgd_image"] = dgdr.Spec.DeploymentOverrides.WorkersImage
}
// Set output_dir if not already set
if _, hasOutputDir := config["output_dir"]; !hasOutputDir {
config["output_dir"] = ProfilingOutputPath
}
// Set engine.backend from spec.backend
engineVal, hasEngine := config["engine"]
var engineConfig map[string]interface{}
if !hasEngine || engineVal == nil {
engineConfig = make(map[string]interface{})
config["engine"] = engineConfig
} else {
var ok bool
engineConfig, ok = engineVal.(map[string]interface{})
if !ok {
return nil, fmt.Errorf("profilingConfig.config.engine must be an object, got %T", engineVal)
}
}
engineConfig["backend"] = dgdr.Spec.Backend
// If ConfigMapRef is provided, set engine.config path
if dgdr.Spec.ProfilingConfig.ConfigMapRef != nil {
engineConfig["config"] = fmt.Sprintf("%s/%s", ProfilingConfigPath, ProfilingConfigFile)
}
// Serialize config to YAML for passing to profiler
configYAML, err := sigsyaml.Marshal(config)
if err != nil {
return nil, fmt.Errorf("failed to marshal profiling config to YAML: %w", err)
}
return configYAML, nil
}
// checkProfilingJobStatus checks if the profiling job has completed
func (r *DynamoGraphDeploymentRequestReconciler) checkProfilingJobStatus(ctx context.Context, dgdr *nvidiacomv1alpha1.DynamoGraphDeploymentRequest) (bool, error) {
logger := log.FromContext(ctx)
......
......@@ -29,16 +29,18 @@ This includes:
 After setting up Dynamo Cloud, use this script to prepare your namespace with the additional resources needed for benchmarking and profiling workflows:
-The setup script creates a `dynamo-pvc` with `ReadWriteMany` (RWX). If your cluster's default `storageClassName` does not support RWX, set `storageClassName` in `deploy/utils/manifests/pvc.yaml` to an RWX-capable class before running the script.
+The setup script creates a `dynamo-pvc` with `ReadWriteOnce` (RWO) access mode using your cluster's default storage class. This is sufficient for profiling workflows where only one job writes at a time.
+If you want to use `ReadWriteMany` (RWX) for concurrent access, modify `deploy/utils/manifests/pvc.yaml` before running the script:
-Example (add under `spec` in `deploy/utils/manifests/pvc.yaml`):
 ```yaml
-...
 spec:
   accessModes:
   - ReadWriteMany
-  storageClassName: <your-rwx-storageclass>
-...
+  storageClassName: <your-rwx-capable-storageclass> # e.g., NFS-based storage
+  resources:
+    requests:
+      storage: 50Gi
 ```
 > [!TIP]
......
......@@ -7,7 +7,10 @@ metadata:
   namespace: ${NAMESPACE}
 spec:
   accessModes:
-    - ReadWriteMany
+    - ReadWriteOnce
+  # Uncomment and set storageClassName if you need to use a specific storage class
+  # For ReadWriteMany support, use an RWX-capable storage class (e.g., NFS-based)
+  # storageClassName: your-storageclass
   resources:
     requests:
       storage: 50Gi
......@@ -593,6 +593,9 @@ _Appears in:_
| `config` _[JSON](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.28/#json-v1-apiextensions-k8s-io)_ | Config is the profiling configuration as arbitrary JSON/YAML. This will be passed directly to the profiler.<br />The profiler will validate the configuration and report any errors. | | Optional: \{\} <br />Type: object <br /> |
| `configMapRef` _[ConfigMapKeySelector](#configmapkeyselector)_ | ConfigMapRef is an optional reference to a ConfigMap containing the DynamoGraphDeployment<br />base config file (disagg.yaml). This is separate from the profiling config above.<br />The path to this config will be set as engine.config in the profiling config. | | Optional: \{\} <br /> |
| `profilerImage` _string_ | ProfilerImage specifies the container image to use for profiling jobs.<br />This image contains the profiler code and dependencies needed for SLA-based profiling.<br />Example: "nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.6.1" | | Required: \{\} <br /> |
| `outputPVC` _string_ | OutputPVC is an optional PersistentVolumeClaim name for storing profiling output.<br />If specified, all profiling artifacts (logs, plots, configs, raw data) will be written<br />to this PVC instead of an ephemeral emptyDir volume. This allows users to access<br />complete profiling results after the job completes by mounting the PVC.<br />The PVC must exist in the same namespace as the DGDR.<br />If not specified, profiling uses emptyDir and only essential data is saved to ConfigMaps.<br />Note: ConfigMaps are still created regardless of this setting for planner integration. | | Optional: \{\} <br /> |
| `resources` _[ResourceRequirements](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.28/#resourcerequirements-v1-core)_ | Resources specifies the compute resource requirements for the profiling job container.<br />If not specified, no resource requests or limits are set. | | Optional: \{\} <br /> |
| `tolerations` _[Toleration](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.28/#toleration-v1-core) array_ | Tolerations allows the profiling job to be scheduled on nodes with matching taints.<br />For example, to schedule on GPU nodes, add a toleration for the nvidia.com/gpu taint. | | Optional: \{\} <br /> |
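As a rough illustration of the `resources` and `tolerations` rows above, a `profilingConfig` might set both fields as sketched here. The values are assumptions for illustration only: the CPU and memory requests mirror the `16` / `10Gi` requests that the controller hardcoded before this change, and the toleration assumes your GPU nodes carry an `nvidia.com/gpu` taint with effect `NoSchedule`. Adjust both to your cluster.

```yaml
spec:
  profilingConfig:
    profilerImage: "nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.6.1"
    # Optional. If omitted, the profiler container gets no requests or limits.
    resources:
      requests:
        cpu: "16"        # illustrative; matches the previously hardcoded request
        memory: 10Gi     # illustrative; matches the previously hardcoded request
    # Optional. Applied to the profiling job's pod spec when non-empty.
    tolerations:
      - key: nvidia.com/gpu   # assumed taint key on GPU nodes
        operator: Exists
        effect: NoSchedule
    config:
      # ... rest of config
```

Since the hardcoded request is gone, a DGDR that omits `resources` runs the profiler container without any requests or limits, so setting them explicitly may be worthwhile on shared clusters.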
#### ResourceItem
......
......@@ -414,6 +414,46 @@ metadata:
dgdr.nvidia.com/namespace: your-namespace
```
### Accessing Detailed Profiling Artifacts
By default, profiling jobs save essential data to ConfigMaps for Planner integration. Advanced users who need access to detailed artifacts (logs, performance plots, AIPerf results, etc.) can configure the DGDR to use `dynamo-pvc`. This is optional and does not affect the functionality of the profiler or Planner.
**What's available in ConfigMaps (always created):**
- Generated DGD configuration
- Profiling data for Planner (`.json` files)
**What's available in PVC if attached to DGDR (optional):**
- Performance plots (PNGs)
- DGD configuration and logs of all services for each profiled deployment
- AIPerf profiling artifacts for each AIPerf run
- Raw profiling data (`.npz` files)
- Profiler log
**Setup:**
1. Set up the benchmarking PVC:
```bash
export NAMESPACE=your-namespace
deploy/utils/setup_benchmarking_resources.sh
```
2. Add `outputPVC` to your DGDR's `profilingConfig`:
```yaml
spec:
profilingConfig:
outputPVC: "dynamo-pvc"
config:
# ... rest of config
```
3. After profiling completes, access results:
```bash
kubectl apply -f deploy/utils/manifests/pvc-access-pod.yaml -n $NAMESPACE
kubectl wait --for=condition=Ready pod/pvc-access-pod -n $NAMESPACE --timeout=60s
kubectl cp $NAMESPACE/pvc-access-pod:/data ./profiling-results
kubectl delete pod pvc-access-pod -n $NAMESPACE
```
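If you would rather create the claim by hand than run the setup script in step 1, a PVC along these lines should be usable as `outputPVC: "dynamo-pvc"`. This is only a sketch built from the defaults described for the setup script (RWO access mode, 50Gi, cluster default storage class); the namespace is a placeholder, and the manifest checked in under `deploy/utils/manifests/` remains the authoritative version.

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: dynamo-pvc
  namespace: your-namespace   # placeholder; must be the DGDR's namespace
spec:
  accessModes:
    - ReadWriteOnce           # one profiling job writes at a time
  resources:
    requests:
      storage: 50Gi
```

The claim has to exist in the same namespace as the DGDR before the profiling job starts, because the job mounts it by name.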
## Troubleshooting
### Quick Diagnostics
......