feat: add an optional PVC mounting option to DGDR for profiling (#4503)

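A minimal sketch of a DGDR that uses the three new `profilingConfig` fields. The field names `outputPVC`, `resources`, and `tolerations` come from the CRD changes in this commit. The metadata, the `model` and `backend` values, the empty `config`, and the toleration effect are illustrative assumptions, and the spec may require other fields not shown here. The PVC name `dynamo-pvc` is the one created by the setup script under `deploy/utils`.

```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeploymentRequest
metadata:
  name: example-dgdr            # hypothetical name
  namespace: my-namespace       # hypothetical; the PVC must live in this namespace
spec:
  model: example-org/example-model   # placeholder model identifier
  backend: vllm                      # assumed backend value
  profilingConfig:
    profilerImage: "nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.6.1"
    config: {}                       # profiler settings omitted in this sketch
    # Write all profiling artifacts to an existing PVC instead of emptyDir
    outputPVC: dynamo-pvc
    # Requests are no longer hard-coded; these values mirror the removed defaults
    resources:
      requests:
        cpu: "16"
        memory: 10Gi
    # Let the profiling pod schedule onto tainted GPU nodes
    tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule           # assumed taint effect
```

Omitting `resources` leaves the profiler container with no requests or limits, so clusters that relied on the previous 16 CPU / 10Gi request need to set it explicitly.
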
Signed-off-by: Hannah Zhang <hannahz@nvidia.com>
Signed-off-by: hhzhang16 <54051230+hhzhang16@users.noreply.github.com>
Co-authored-by: Hongkuan Zhou <tedzhouhk@gmail.com>

Unverified commit 36f58e36, authored Nov 20, 2025 by hhzhang16; committed by GitHub Nov 21, 2025
9 changed files
--- a/deploy/cloud/helm/crds/templates/nvidia.com_dynamographdeploymentrequests.yaml
+++ b/deploy/cloud/helm/crds/templates/nvidia.com_dynamographdeploymentrequests.yaml
@@ -183,12 +183,124 @@ spec:
                      required:
                        - name
                      type: object
+                    outputPVC:
+                      description: |-
+                        OutputPVC is an optional PersistentVolumeClaim name for storing profiling output.
+                        If specified, all profiling artifacts (logs, plots, configs, raw data) will be written
+                        to this PVC instead of an ephemeral emptyDir volume. This allows users to access
+                        complete profiling results after the job completes by mounting the PVC.
+                        The PVC must exist in the same namespace as the DGDR.
+                        If not specified, profiling uses emptyDir and only essential data is saved to ConfigMaps.
+                        Note: ConfigMaps are still created regardless of this setting for planner integration.
+                      type: string
                    profilerImage:
                      description: |-
                        ProfilerImage specifies the container image to use for profiling jobs.
                        This image contains the profiler code and dependencies needed for SLA-based profiling.
                        Example: "nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.6.1"
                      type: string
+                    resources:
+                      description: |-
+                        Resources specifies the compute resource requirements for the profiling job container.
+                        If not specified, no resource requests or limits are set.
+                      properties:
+                        claims:
+                          description: |-
+                            Claims lists the names of resources, defined in spec.resourceClaims,
+                            that are used by this container.
+
+                            This is an alpha field and requires enabling the
+                            DynamicResourceAllocation feature gate.
+
+                            This field is immutable. It can only be set for containers.
+                          items:
+                            description: ResourceClaim references one entry in PodSpec.ResourceClaims.
+                            properties:
+                              name:
+                                description: |-
+                                  Name must match the name of one entry in pod.spec.resourceClaims of
+                                  the Pod where this field is used. It makes that resource available
+                                  inside a container.
+                                type: string
+                              request:
+                                description: |-
+                                  Request is the name chosen for a request in the referenced claim.
+                                  If empty, everything from the claim is made available, otherwise
+                                  only the result of this request.
+                                type: string
+                            required:
+                              - name
+                            type: object
+                          type: array
+                          x-kubernetes-list-map-keys:
+                            - name
+                          x-kubernetes-list-type: map
+                        limits:
+                          additionalProperties:
+                            anyOf:
+                              - type: integer
+                              - type: string
+                            pattern: ^(\+|-)?(([0-9]+(\.[0-9]*)?)|(\.[0-9]+))(([KMGTPE]i)|[numkMGTPE]|([eE](\+|-)?(([0-9]+(\.[0-9]*)?)|(\.[0-9]+))))?$
+                            x-kubernetes-int-or-string: true
+                          description: |-
+                            Limits describes the maximum amount of compute resources allowed.
+                            More info: https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/
+                          type: object
+                        requests:
+                          additionalProperties:
+                            anyOf:
+                              - type: integer
+                              - type: string
+                            pattern: ^(\+|-)?(([0-9]+(\.[0-9]*)?)|(\.[0-9]+))(([KMGTPE]i)|[numkMGTPE]|([eE](\+|-)?(([0-9]+(\.[0-9]*)?)|(\.[0-9]+))))?$
+                            x-kubernetes-int-or-string: true
+                          description: |-
+                            Requests describes the minimum amount of compute resources required.
+                            If Requests is omitted for a container, it defaults to Limits if that is explicitly specified,
+                            otherwise to an implementation-defined value. Requests cannot exceed Limits.
+                            More info: https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/
+                          type: object
+                      type: object
+                    tolerations:
+                      description: |-
+                        Tolerations allows the profiling job to be scheduled on nodes with matching taints.
+                        For example, to schedule on GPU nodes, add a toleration for the nvidia.com/gpu taint.
+                      items:
+                        description: |-
+                          The pod this Toleration is attached to tolerates any taint that matches
+                          the triple <key,value,effect> using the matching operator <operator>.
+                        properties:
+                          effect:
+                            description: |-
+                              Effect indicates the taint effect to match. Empty means match all taint effects.
+                              When specified, allowed values are NoSchedule, PreferNoSchedule and NoExecute.
+                            type: string
+                          key:
+                            description: |-
+                              Key is the taint key that the toleration applies to. Empty means match all taint keys.
+                              If the key is empty, operator must be Exists; this combination means to match all values and all keys.
+                            type: string
+                          operator:
+                            description: |-
+                              Operator represents a key's relationship to the value.
+                              Valid operators are Exists and Equal. Defaults to Equal.
+                              Exists is equivalent to wildcard for value, so that a pod can
+                              tolerate all taints of a particular category.
+                            type: string
+                          tolerationSeconds:
+                            description: |-
+                              TolerationSeconds represents the period of time the toleration (which must be
+                              of effect NoExecute, otherwise this field is ignored) tolerates the taint. By default,
+                              it is not set, which means tolerate the taint forever (do not evict). Zero and
+                              negative values will be treated as 0 (evict immediately) by the system.
+                            format: int64
+                            type: integer
+                          value:
+                            description: |-
+                              Value is the taint value the toleration matches to.
+                              If the operator is Exists, the value should be empty, otherwise just a regular string.
+                            type: string
+                        type: object
+                      type: array
                  required:
                    - profilerImage
                  type: object

--- a/deploy/cloud/operator/api/v1alpha1/dynamographdeploymentrequest_types.go
+++ b/deploy/cloud/operator/api/v1alpha1/dynamographdeploymentrequest_types.go
@@ -24,6 +24,7 @@ a high-level, SLA-driven interface for deploying machine learning models on Dyna
 package v1alpha1

 import (
+	corev1 "k8s.io/api/core/v1"
 	apiextensionsv1 "k8s.io/apiextensions-apiserver/pkg/apis/apiextensions/v1"
 	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
 	runtime "k8s.io/apimachinery/pkg/runtime"
@@ -66,6 +67,26 @@ type ProfilingConfigSpec struct {
 	// Example: "nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.6.1"
 	// +kubebuilder:validation:Required
 	ProfilerImage string `json:"profilerImage"`
+
+	// OutputPVC is an optional PersistentVolumeClaim name for storing profiling output.
+	// If specified, all profiling artifacts (logs, plots, configs, raw data) will be written
+	// to this PVC instead of an ephemeral emptyDir volume. This allows users to access
+	// complete profiling results after the job completes by mounting the PVC.
+	// The PVC must exist in the same namespace as the DGDR.
+	// If not specified, profiling uses emptyDir and only essential data is saved to ConfigMaps.
+	// Note: ConfigMaps are still created regardless of this setting for planner integration.
+	// +kubebuilder:validation:Optional
+	OutputPVC string `json:"outputPVC,omitempty"`
+
+	// Resources specifies the compute resource requirements for the profiling job container.
+	// If not specified, no resource requests or limits are set.
+	// +kubebuilder:validation:Optional
+	Resources *corev1.ResourceRequirements `json:"resources,omitempty"`
+
+	// Tolerations allows the profiling job to be scheduled on nodes with matching taints.
+	// For example, to schedule on GPU nodes, add a toleration for the nvidia.com/gpu taint.
+	// +kubebuilder:validation:Optional
+	Tolerations []corev1.Toleration `json:"tolerations,omitempty"`
 }

 // DeploymentOverridesSpec allows users to customize metadata for auto-created DynamoGraphDeployments.

--- a/deploy/cloud/operator/api/v1alpha1/zz_generated.deepcopy.go
+++ b/deploy/cloud/operator/api/v1alpha1/zz_generated.deepcopy.go
@@ -1009,6 +1009,18 @@ func (in *ProfilingConfigSpec) DeepCopyInto(out *ProfilingConfigSpec) {
 		*out = new(ConfigMapKeySelector)
 		**out = **in
 	}
+	if in.Resources != nil {
+		in, out := &in.Resources, &out.Resources
+		*out = new(v1.ResourceRequirements)
+		(*in).DeepCopyInto(*out)
+	}
+	if in.Tolerations != nil {
+		in, out := &in.Tolerations, &out.Tolerations
+		*out = make([]v1.Toleration, len(*in))
+		for i := range *in {
+			(*in)[i].DeepCopyInto(&(*out)[i])
+		}
+	}
 }

 // DeepCopy is an autogenerated deepcopy function, copying the receiver, creating a new ProfilingConfigSpec.

--- a/deploy/cloud/operator/config/crd/bases/nvidia.com_dynamographdeploymentrequests.yaml
+++ b/deploy/cloud/operator/config/crd/bases/nvidia.com_dynamographdeploymentrequests.yaml
@@ -183,12 +183,124 @@ spec:
                      required:
                        - name
                      type: object
+                    outputPVC:
+                      description: |-
+                        OutputPVC is an optional PersistentVolumeClaim name for storing profiling output.
+                        If specified, all profiling artifacts (logs, plots, configs, raw data) will be written
+                        to this PVC instead of an ephemeral emptyDir volume. This allows users to access
+                        complete profiling results after the job completes by mounting the PVC.
+                        The PVC must exist in the same namespace as the DGDR.
+                        If not specified, profiling uses emptyDir and only essential data is saved to ConfigMaps.
+                        Note: ConfigMaps are still created regardless of this setting for planner integration.
+                      type: string
                    profilerImage:
                      description: |-
                        ProfilerImage specifies the container image to use for profiling jobs.
                        This image contains the profiler code and dependencies needed for SLA-based profiling.
                        Example: "nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.6.1"
                      type: string
+                    resources:
+                      description: |-
+                        Resources specifies the compute resource requirements for the profiling job container.
+                        If not specified, no resource requests or limits are set.
+                      properties:
+                        claims:
+                          description: |-
+                            Claims lists the names of resources, defined in spec.resourceClaims,
+                            that are used by this container.
+
+                            This is an alpha field and requires enabling the
+                            DynamicResourceAllocation feature gate.
+
+                            This field is immutable. It can only be set for containers.
+                          items:
+                            description: ResourceClaim references one entry in PodSpec.ResourceClaims.
+                            properties:
+                              name:
+                                description: |-
+                                  Name must match the name of one entry in pod.spec.resourceClaims of
+                                  the Pod where this field is used. It makes that resource available
+                                  inside a container.
+                                type: string
+                              request:
+                                description: |-
+                                  Request is the name chosen for a request in the referenced claim.
+                                  If empty, everything from the claim is made available, otherwise
+                                  only the result of this request.
+                                type: string
+                            required:
+                              - name
+                            type: object
+                          type: array
+                          x-kubernetes-list-map-keys:
+                            - name
+                          x-kubernetes-list-type: map
+                        limits:
+                          additionalProperties:
+                            anyOf:
+                              - type: integer
+                              - type: string
+                            pattern: ^(\+|-)?(([0-9]+(\.[0-9]*)?)|(\.[0-9]+))(([KMGTPE]i)|[numkMGTPE]|([eE](\+|-)?(([0-9]+(\.[0-9]*)?)|(\.[0-9]+))))?$
+                            x-kubernetes-int-or-string: true
+                          description: |-
+                            Limits describes the maximum amount of compute resources allowed.
+                            More info: https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/
+                          type: object
+                        requests:
+                          additionalProperties:
+                            anyOf:
+                              - type: integer
+                              - type: string
+                            pattern: ^(\+|-)?(([0-9]+(\.[0-9]*)?)|(\.[0-9]+))(([KMGTPE]i)|[numkMGTPE]|([eE](\+|-)?(([0-9]+(\.[0-9]*)?)|(\.[0-9]+))))?$
+                            x-kubernetes-int-or-string: true
+                          description: |-
+                            Requests describes the minimum amount of compute resources required.
+                            If Requests is omitted for a container, it defaults to Limits if that is explicitly specified,
+                            otherwise to an implementation-defined value. Requests cannot exceed Limits.
+                            More info: https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/
+                          type: object
+                      type: object
+                    tolerations:
+                      description: |-
+                        Tolerations allows the profiling job to be scheduled on nodes with matching taints.
+                        For example, to schedule on GPU nodes, add a toleration for the nvidia.com/gpu taint.
+                      items:
+                        description: |-
+                          The pod this Toleration is attached to tolerates any taint that matches
+                          the triple <key,value,effect> using the matching operator <operator>.
+                        properties:
+                          effect:
+                            description: |-
+                              Effect indicates the taint effect to match. Empty means match all taint effects.
+                              When specified, allowed values are NoSchedule, PreferNoSchedule and NoExecute.
+                            type: string
+                          key:
+                            description: |-
+                              Key is the taint key that the toleration applies to. Empty means match all taint keys.
+                              If the key is empty, operator must be Exists; this combination means to match all values and all keys.
+                            type: string
+                          operator:
+                            description: |-
+                              Operator represents a key's relationship to the value.
+                              Valid operators are Exists and Equal. Defaults to Equal.
+                              Exists is equivalent to wildcard for value, so that a pod can
+                              tolerate all taints of a particular category.
+                            type: string
+                          tolerationSeconds:
+                            description: |-
+                              TolerationSeconds represents the period of time the toleration (which must be
+                              of effect NoExecute, otherwise this field is ignored) tolerates the taint. By default,
+                              it is not set, which means tolerate the taint forever (do not evict). Zero and
+                              negative values will be treated as 0 (evict immediately) by the system.
+                            format: int64
+                            type: integer
+                          value:
+                            description: |-
+                              Value is the taint value the toleration matches to.
+                              If the operator is Exists, the value should be empty, otherwise just a regular string.
+                            type: string
+                        type: object
+                      type: array
                  required:
                    - profilerImage
                  type: object

--- a/deploy/cloud/operator/internal/controller/dynamographdeploymentrequest_controller.go
+++ b/deploy/cloud/operator/internal/controller/dynamographdeploymentrequest_controller.go
@@ -28,7 +28,6 @@ import (
 	corev1 "k8s.io/api/core/v1"
 	apierrors "k8s.io/apimachinery/pkg/api/errors"
 	"k8s.io/apimachinery/pkg/api/meta"
-	"k8s.io/apimachinery/pkg/api/resource"
 	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
 	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
 	"k8s.io/apimachinery/pkg/runtime"
@@ -892,66 +891,10 @@ func (r *DynamoGraphDeploymentRequestReconciler) createProfilingJob(ctx context.
 		jobName := getProfilingJobName(dgdr)
 		outputConfigMapName := getOutputConfigMapName(dgdr)

-		// Parse the profiling config from JSON
-		var config map[string]interface{}
-		if err := yaml.Unmarshal(dgdr.Spec.ProfilingConfig.Config.Raw, &config); err != nil {
-			return nil, false, fmt.Errorf("failed to parse profiling config: %w", err)
-		}
-
-		// Set deployment.namespace if not already set
-		deploymentVal, hasDeployment := config["deployment"]
-		var deploymentConfig map[string]interface{}
-		if !hasDeployment || deploymentVal == nil {
-			deploymentConfig = make(map[string]interface{})
-			config["deployment"] = deploymentConfig
-		} else {
-			var ok bool
-			deploymentConfig, ok = deploymentVal.(map[string]interface{})
-			if !ok {
-				return nil, false, fmt.Errorf("profilingConfig.config.deployment must be an object, got %T", deploymentVal)
-			}
-		}
-		if _, hasNamespace := deploymentConfig["namespace"]; !hasNamespace {
-			deploymentConfig["namespace"] = dgdr.Namespace
-		}
-
-		// Set deployment.model from spec.model
-		deploymentConfig["model"] = dgdr.Spec.Model
-
-		// Set deployment.dgd_image from deploymentOverrides.workersImage if provided
-		if dgdr.Spec.DeploymentOverrides != nil && dgdr.Spec.DeploymentOverrides.WorkersImage != "" {
-			deploymentConfig["dgd_image"] = dgdr.Spec.DeploymentOverrides.WorkersImage
-		}
-
-		// Set output_dir if not already set
-		if _, hasOutputDir := config["output_dir"]; !hasOutputDir {
-			config["output_dir"] = ProfilingOutputPath
-		}
-
-		// Set engine.backend from spec.backend
-		engineVal, hasEngine := config["engine"]
-		var engineConfig map[string]interface{}
-		if !hasEngine || engineVal == nil {
-			engineConfig = make(map[string]interface{})
-			config["engine"] = engineConfig
-		} else {
-			var ok bool
-			engineConfig, ok = engineVal.(map[string]interface{})
-			if !ok {
-				return nil, false, fmt.Errorf("profilingConfig.config.engine must be an object, got %T", engineVal)
-			}
-		}
-		engineConfig["backend"] = dgdr.Spec.Backend
-
-		// If ConfigMapRef is provided, set engine.config path
-		if dgdr.Spec.ProfilingConfig.ConfigMapRef != nil {
-			engineConfig["config"] = fmt.Sprintf("%s/%s", ProfilingConfigPath, ProfilingConfigFile)
-		}
-
-		// Serialize config to YAML for passing to profiler
-		configYAML, err := sigsyaml.Marshal(config)
+		// Parse and prepare profiling config
+		configYAML, err := r.prepareProfilingConfig(dgdr)
 		if err != nil {
-			return nil, false, fmt.Errorf("failed to marshal profiling config to YAML: %w", err)
+			return nil, false, err
 		}

 		// Common environment variables
@@ -1023,20 +966,19 @@ func (r *DynamoGraphDeploymentRequestReconciler) createProfilingJob(ctx context.
 		logger.Info("Using profiler image", "image", imageName)

 		profilerContainer := corev1.Container{
-			Name:    ContainerNameProfiler,
-			Image:   imageName,
-			Command: []string{"python", "-m", "benchmarks.profiler.profile_sla"},
-			Args:    profilerArgs,
-			Resources: corev1.ResourceRequirements{
-				Requests: corev1.ResourceList{
-					corev1.ResourceCPU:    resource.MustParse("16"),
-					corev1.ResourceMemory: resource.MustParse("10Gi"),
-				},
-			},
+			Name:         ContainerNameProfiler,
+			Image:        imageName,
+			Command:      []string{"python", "-m", "benchmarks.profiler.profile_sla"},
+			Args:         profilerArgs,
 			Env:          profilerEnv,
 			VolumeMounts: volumeMounts,
 		}

+		// Apply resource requirements if specified in the DGDR
+		if dgdr.Spec.ProfilingConfig.Resources != nil {
+			profilerContainer.Resources = *dgdr.Spec.ProfilingConfig.Resources
+		}
+
 		// Generate sidecar script from template
 		tmpl, err := template.New("sidecar").Parse(sidecarScriptTemplate)
 		if err != nil {
@@ -1067,14 +1009,27 @@ func (r *DynamoGraphDeploymentRequestReconciler) createProfilingJob(ctx context.
 			}},
 		}

-		// Build volumes - use emptyDir for profiling output
-		// The sidecar saves all needed data to ConfigMaps, so persistence is not needed
-		volumes := []corev1.Volume{{
-			Name: VolumeNameProfilingOutput,
-			VolumeSource: corev1.VolumeSource{
-				EmptyDir: &corev1.EmptyDirVolumeSource{},
-			},
-		}}
+		// Use PVC if specified, otherwise use emptyDir for profiling output
+		var profilingOutputVolume corev1.Volume
+		if dgdr.Spec.ProfilingConfig.OutputPVC != "" {
+			logger.Info("Using PVC for profiling output", "pvc", dgdr.Spec.ProfilingConfig.OutputPVC)
+			profilingOutputVolume = corev1.Volume{
+				Name: VolumeNameProfilingOutput,
+				VolumeSource: corev1.VolumeSource{
+					PersistentVolumeClaim: &corev1.PersistentVolumeClaimVolumeSource{
+						ClaimName: dgdr.Spec.ProfilingConfig.OutputPVC,
+					},
+				},
+			}
+		} else {
+			profilingOutputVolume = corev1.Volume{
+				Name: VolumeNameProfilingOutput,
+				VolumeSource: corev1.VolumeSource{
+					EmptyDir: &corev1.EmptyDirVolumeSource{},
+				},
+			}
+		}
+		volumes := []corev1.Volume{profilingOutputVolume}

 		// Add ConfigMap volume if provided
 		if dgdr.Spec.ProfilingConfig.ConfigMapRef != nil {
@@ -1108,6 +1063,27 @@ func (r *DynamoGraphDeploymentRequestReconciler) createProfilingJob(ctx context.
 			labelValue = LabelValueAICProfiler
 		}

+		podSpec := corev1.PodSpec{
+			ServiceAccountName: ServiceAccountProfilingJob,
+			RestartPolicy:      corev1.RestartPolicyNever,
+			SecurityContext: &corev1.PodSecurityContext{
+				RunAsNonRoot: ptr.To(true),        // Enforces that container cannot run as root
+				RunAsUser:    ptr.To[int64](1000), // Run as UID 1000 (non-privileged user)
+				RunAsGroup:   ptr.To[int64](1000), // Run with GID 1000 (non-privileged group)
+				FSGroup:      ptr.To[int64](1000), // Volume files owned by GID 1000
+			},
+			Containers: []corev1.Container{profilerContainer, sidecarContainer},
+			Volumes:    volumes,
+			ImagePullSecrets: []corev1.LocalObjectReference{
+				{Name: "nvcr-imagepullsecret"},
+			},
+		}
+
+		// Apply tolerations if specified in the DGDR
+		if len(dgdr.Spec.ProfilingConfig.Tolerations) > 0 {
+			podSpec.Tolerations = dgdr.Spec.ProfilingConfig.Tolerations
+		}
+
 		job := &batchv1.Job{
 			ObjectMeta: metav1.ObjectMeta{
 				Name:      jobName,
@@ -1121,20 +1097,7 @@ func (r *DynamoGraphDeploymentRequestReconciler) createProfilingJob(ctx context.
 			Spec: batchv1.JobSpec{
 				BackoffLimit: &backoffLimit,
 				Template: corev1.PodTemplateSpec{
-					Spec: corev1.PodSpec{
-						ServiceAccountName: ServiceAccountProfilingJob,
-						RestartPolicy:      corev1.RestartPolicyNever, SecurityContext: &corev1.PodSecurityContext{
-							RunAsNonRoot: ptr.To(true),        // Enforces that container cannot run as root
-							RunAsUser:    ptr.To[int64](1000), // Run as UID 1000 (non-privileged user)
-							RunAsGroup:   ptr.To[int64](1000), // Run with GID 1000 (non-privileged group)
-							FSGroup:      ptr.To[int64](1000), // Volume files owned by GID 1000
-						},
-						Containers: []corev1.Container{profilerContainer, sidecarContainer},
-						Volumes:    volumes,
-						ImagePullSecrets: []corev1.LocalObjectReference{
-							{Name: "nvcr-imagepullsecret"},
-						},
-					},
+					Spec: podSpec,
 				},
 			},
 		}
@@ -1153,6 +1116,73 @@ func (r *DynamoGraphDeploymentRequestReconciler) createProfilingJob(ctx context.
 	return nil
 }

+// prepareProfilingConfig parses and modifies the profiling config
+func (r *DynamoGraphDeploymentRequestReconciler) prepareProfilingConfig(dgdr *nvidiacomv1alpha1.DynamoGraphDeploymentRequest) ([]byte, error) {
+	// Parse the profiling config from JSON
+	var config map[string]interface{}
+	if err := yaml.Unmarshal(dgdr.Spec.ProfilingConfig.Config.Raw, &config); err != nil {
+		return nil, fmt.Errorf("failed to parse profiling config: %w", err)
+	}
+
+	// Set deployment.namespace if not already set
+	deploymentVal, hasDeployment := config["deployment"]
+	var deploymentConfig map[string]interface{}
+	if !hasDeployment || deploymentVal == nil {
+		deploymentConfig = make(map[string]interface{})
+		config["deployment"] = deploymentConfig
+	} else {
+		var ok bool
+		deploymentConfig, ok = deploymentVal.(map[string]interface{})
+		if !ok {
+			return nil, fmt.Errorf("profilingConfig.config.deployment must be an object, got %T", deploymentVal)
+		}
+	}
+	if _, hasNamespace := deploymentConfig["namespace"]; !hasNamespace {
+		deploymentConfig["namespace"] = dgdr.Namespace
+	}
+
+	// Set deployment.model from spec.model
+	deploymentConfig["model"] = dgdr.Spec.Model
+
+	// Set deployment.dgd_image from deploymentOverrides.workersImage if provided
+	if dgdr.Spec.DeploymentOverrides != nil && dgdr.Spec.DeploymentOverrides.WorkersImage != "" {
+		deploymentConfig["dgd_image"] = dgdr.Spec.DeploymentOverrides.WorkersImage
+	}
+
+	// Set output_dir if not already set
+	if _, hasOutputDir := config["output_dir"]; !hasOutputDir {
+		config["output_dir"] = ProfilingOutputPath
+	}
+
+	// Set engine.backend from spec.backend
+	engineVal, hasEngine := config["engine"]
+	var engineConfig map[string]interface{}
+	if !hasEngine || engineVal == nil {
+		engineConfig = make(map[string]interface{})
+		config["engine"] = engineConfig
+	} else {
+		var ok bool
+		engineConfig, ok = engineVal.(map[string]interface{})
+		if !ok {
+			return nil, fmt.Errorf("profilingConfig.config.engine must be an object, got %T", engineVal)
+		}
+	}
+	engineConfig["backend"] = dgdr.Spec.Backend
+
+	// If ConfigMapRef is provided, set engine.config path
+	if dgdr.Spec.ProfilingConfig.ConfigMapRef != nil {
+		engineConfig["config"] = fmt.Sprintf("%s/%s", ProfilingConfigPath, ProfilingConfigFile)
+	}
+
+	// Serialize config to YAML for passing to profiler
+	configYAML, err := sigsyaml.Marshal(config)
+	if err != nil {
+		return nil, fmt.Errorf("failed to marshal profiling config to YAML: %w", err)
+	}
+
+	return configYAML, nil
+}
+
 // checkProfilingJobStatus checks if the profiling job has completed
 func (r *DynamoGraphDeploymentRequestReconciler) checkProfilingJobStatus(ctx context.Context, dgdr *nvidiacomv1alpha1.DynamoGraphDeploymentRequest) (bool, error) {
 	logger := log.FromContext(ctx)

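With `outputPVC` set, the artifacts stay on the claim after the job finishes. One way to inspect them is a short-lived pod that mounts the same claim, sketched below. The pod name, namespace, image, and mount path are hypothetical and not part of this commit. The security context mirrors the UID/GID 1000 settings the controller applies to the profiling pod, so that the files are readable.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: profiling-output-reader   # hypothetical name
  namespace: my-namespace         # hypothetical; must match the PVC's namespace
spec:
  restartPolicy: Never
  securityContext:
    runAsNonRoot: true
    runAsUser: 1000               # same UID the profiling job runs as
    runAsGroup: 1000
    fsGroup: 1000
  containers:
    - name: reader
      image: busybox:1.36         # any small image with a shell works
      # List the artifacts, then stay up so files can be copied out
      command: ["sh", "-c", "ls -lR /profiling-output && sleep 3600"]
      volumeMounts:
        - name: profiling-output
          mountPath: /profiling-output   # arbitrary path chosen for this sketch
          readOnly: true
  volumes:
    - name: profiling-output
      persistentVolumeClaim:
        claimName: dynamo-pvc     # the claim named in outputPVC
        readOnly: true
```

With a `ReadWriteOnce` claim, the volume can be mounted from only one node at a time, so start a reader like this after the profiling job has completed.
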
--- a/deploy/utils/README.md
+++ b/deploy/utils/README.md
@@ -29,16 +29,18 @@ This includes:

 After setting up Dynamo Cloud, use this script to prepare your namespace with the additional resources needed for benchmarking and profiling workflows:

-The setup script creates a `dynamo-pvc` with `ReadWriteMany` (RWX). If your cluster's default `storageClassName` does not support RWX, set `storageClassName` in `deploy/utils/manifests/pvc.yaml` to an RWX-capable class before running the script.
+The setup script creates a `dynamo-pvc` with `ReadWriteOnce` (RWO) access mode using your cluster's default storage class. This is sufficient for profiling workflows where only one job writes at a time.
+
+If you want to use `ReadWriteMany` (RWX) for concurrent access, modify `deploy/utils/manifests/pvc.yaml` before running the script:

-Example (add under `spec` in `deploy/utils/manifests/pvc.yaml`):
 ```yaml
-...
 spec:
  accessModes:
  - ReadWriteMany
-  storageClassName: <your-rwx-storageclass>
-...
+  storageClassName: <your-rwx-capable-storageclass>  # e.g., NFS-based storage
+  resources:
+    requests:
+      storage: 50Gi
 ```

 > [!TIP]

--- a/deploy/utils/manifests/pvc.yaml
+++ b/deploy/utils/manifests/pvc.yaml
@@ -7,7 +7,10 @@ metadata:
  namespace: ${NAMESPACE}
 spec:
  accessModes:
-    - ReadWriteMany
+    - ReadWriteOnce
+  # Uncomment and set storageClassName if you need to use a specific storage class
+  # For ReadWriteMany support, use an RWX-capable storage class (e.g., NFS-based)
+  # storageClassName: your-storageclass
  resources:
    requests:
      storage: 50Gi
--- a/docs/kubernetes/api_reference.md
+++ b/docs/kubernetes/api_reference.md
@@ -593,6 +593,9 @@ _Appears in:_
 | `config` _[JSON](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.28/#json-v1-apiextensions-k8s-io)_ | Config is the profiling configuration as arbitrary JSON/YAML. This will be passed directly to the profiler.<br />The profiler will validate the configuration and report any errors. |  | Optional: \{\} <br />Type: object <br /> |
 | `configMapRef` _[ConfigMapKeySelector](#configmapkeyselector)_ | ConfigMapRef is an optional reference to a ConfigMap containing the DynamoGraphDeployment<br />base config file (disagg.yaml). This is separate from the profiling config above.<br />The path to this config will be set as engine.config in the profiling config. |  | Optional: \{\} <br /> |
 | `profilerImage` _string_ | ProfilerImage specifies the container image to use for profiling jobs.<br />This image contains the profiler code and dependencies needed for SLA-based profiling.<br />Example: "nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.6.1" |  | Required: \{\} <br /> |
+| `outputPVC` _string_ | OutputPVC is an optional PersistentVolumeClaim name for storing profiling output.<br />If specified, all profiling artifacts (logs, plots, configs, raw data) will be written<br />to this PVC instead of an ephemeral emptyDir volume. This allows users to access<br />complete profiling results after the job completes by mounting the PVC.<br />The PVC must exist in the same namespace as the DGDR.<br />If not specified, profiling uses emptyDir and only essential data is saved to ConfigMaps.<br />Note: ConfigMaps are still created regardless of this setting for planner integration. |  | Optional: \{\} <br /> |
+| `resources` _[ResourceRequirements](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.28/#resourcerequirements-v1-core)_ | Resources specifies the compute resource requirements for the profiling job container.<br />If not specified, no resource requests or limits are set. |  | Optional: \{\} <br /> |
+| `tolerations` _[Toleration](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.28/#toleration-v1-core) array_ | Tolerations allows the profiling job to be scheduled on nodes with matching taints.<br />For example, to schedule on GPU nodes, add a toleration for the nvidia.com/gpu taint. |  | Optional: \{\} <br /> |


 #### ResourceItem

--- a/docs/planner/sla_planner_quickstart.md
+++ b/docs/planner/sla_planner_quickstart.md
@@ -414,6 +414,46 @@ metadata:
    dgdr.nvidia.com/namespace: your-namespace
 ```

+### Accessing Detailed Profiling Artifacts
+
+By default, profiling jobs save essential data to ConfigMaps for Planner integration. If you need access to detailed artifacts (logs, performance plots, AIPerf results, etc.), set `outputPVC` in the DGDR to an existing PVC such as `dynamo-pvc`. This is optional and does not affect the functionality of the profiler or Planner.
+
+**What's available in ConfigMaps (always created):**
+- Generated DGD configuration
+- Profiling data for Planner (`.json` files)
+
+**What's available in the PVC when `outputPVC` is set on the DGDR (optional):**
+- Performance plots (PNGs)
+- DGD configuration and logs of all services for each profiled deployment
+- AIPerf profiling artifacts for each AIPerf run
+- Raw profiling data (`.npz` files)
+- Profiler log
+
+**Setup:**
+
+1. Set up the benchmarking PVC:
+```bash
+export NAMESPACE=your-namespace
+deploy/utils/setup_benchmarking_resources.sh
+```
+
+2. Add `outputPVC` to your DGDR's `profilingConfig`:
+```yaml
+spec:
+  profilingConfig:
+    outputPVC: "dynamo-pvc"
+    config:
+      # ... rest of config
+```
+
+3. After profiling completes, access results:
+```bash
+kubectl apply -f deploy/utils/manifests/pvc-access-pod.yaml -n $NAMESPACE
+kubectl wait --for=condition=Ready pod/pvc-access-pod -n $NAMESPACE --timeout=60s
+kubectl cp $NAMESPACE/pvc-access-pod:/data ./profiling-results
+kubectl delete pod pvc-access-pod -n $NAMESPACE
+```
+
 ## Troubleshooting

 ### Quick Diagnostics