docs(operator): clarify Interconnect and RDMA field semantics in HardwareSpec (#8300)

Signed-off-by: Dr. Stefan Schimanski <sschimanski@nvidia.com>

Commit 19ecf46f, authored Apr 21, 2026 by Dr. Stefan Schimanski, committed by GitHub Apr 21, 2026
5 changed files
--- a/components/src/dynamo/profiler/utils/dgdr_v1beta1_types.py
+++ b/components/src/dynamo/profiler/utils/dgdr_v1beta1_types.py
@@ -221,11 +221,11 @@ class HardwareSpec(BaseModel):
    )
    interconnect: Optional[str] = Field(
        default=None,
-        description='Interconnect describes the GPU interconnect type within a node. Examples: "pcie", "nvlink", "infiniband".',
+        description='Interconnect describes the primary GPU-to-GPU interconnect *within a node*.  Semantics / usage: - This is capability metadata used for profiling, planning, and deployment decisions. - It does NOT configure or enable any GPU interconnect; it only describes what is available/assumed. - When omitted, the operator may attempt best-effort discovery (currently distinguishes "nvlink" vs "pcie" based on DCGM NVLink link count). If discovery is unavailable, it may remain empty.  Impact of wrong / missing values: - If set more optimistically than reality (e.g., "nvlink" when only PCIe is present), performance models may overestimate intra-node bandwidth and choose overly aggressive parallelism or layouts, resulting in degraded performance compared to expectations. - If set more pessimistically than reality (e.g., "pcie" when NVLink is present), the system may choose conservative plans and leave performance on the table. - If unset and undiscovered, consumers should treat the interconnect as unknown and fall back to conservative assumptions.  Example values: "pcie", "nvlink". Other values may be accepted but may not be auto-detected. ',
    )
    rdma: Optional[bool] = Field(
        default=None,
-        description="RDMA indicates whether RDMA is available on the cluster.",
+        description="RDMA indicates whether the cluster has RDMA-capable networking available for Dynamo data movement.  Semantics / usage: - This is capability metadata used for profiling, planning, and deployment decisions. - It does NOT install, enable, or configure RDMA (e.g., drivers, SR-IOV, NVIDIA network operator, GPUDirect settings). It only expresses availability/intent. - When omitted, the operator may attempt best-effort discovery (e.g., via node labels indicating RDMA/SR-IOV capability and/or presence of NVIDIA network-operator RDMA components). If discovery is unavailable, it may remain unset.  Impact of wrong / missing values: - False positive (set true when RDMA is not actually usable end-to-end) may cause plans or deployments to assume RDMA is available; depending on the runtime transport selection and fallback behavior, this can lead to connection/setup failures or performance regressions. - False negative (set false when RDMA is available) will typically avoid RDMA-optimized paths and fall back to non-RDMA transports, usually remaining functional but potentially slower. - If unset and undiscovered, consumers should treat RDMA availability as unknown and use conservative defaults / fallback transports. ",
    )
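
Reviewer note: the new `interconnect` description implies a precedence of explicit value, then best-effort discovery, then "unknown". The sketch below illustrates that reading only; it is not taken from the Dynamo code. `resolve_interconnect` and `planning_assumption` are hypothetical helpers. Lowercasing the value and mapping "unknown" to `"pcie"` as the conservative assumption are assumptions made for this example.

```python
from typing import Optional


def resolve_interconnect(declared: Optional[str], discovered: Optional[str]) -> Optional[str]:
    """Pick the declared value if set, else the discovered one, else None (unknown).

    Hypothetical helper. Empty strings count as unset, mirroring the Go field,
    which is a plain string with `omitempty`.
    """
    for candidate in (declared, discovered):
        value = (candidate or "").strip().lower()
        if value:
            return value
    return None


def planning_assumption(resolved: Optional[str]) -> str:
    """Map an unknown interconnect to a conservative planning assumption.

    Assumption for this sketch: "pcie" is the conservative choice.
    """
    return resolved if resolved is not None else "pcie"


if __name__ == "__main__":
    # An explicit value wins over discovery.
    print(resolve_interconnect("nvlink", "pcie"))
    # Omitted value: discovery result is used.
    print(resolve_interconnect(None, "pcie"))
    # Neither available: unknown, so plan conservatively.
    print(planning_assumption(resolve_interconnect(None, None)))
```

Keeping "unknown" as `None` until the last step lets callers distinguish a declared `"pcie"` from a missing value, which matters when reporting why a conservative plan was chosen.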

--- a/deploy/helm/charts/platform/components/operator/crds/nvidia.com_dynamographdeploymentrequests.yaml
+++ b/deploy/helm/charts/platform/components/operator/crds/nvidia.com_dynamographdeploymentrequests.yaml
@@ -615,15 +615,49 @@ spec:
                      type: string
                    interconnect:
                      description: |-
-                        Interconnect describes the GPU interconnect type within a node.
+                        Interconnect describes the primary GPU-to-GPU interconnect *within a node*.
-                        Examples: "pcie", "nvlink", "infiniband".
+                        Semantics / usage:
+                          - This is capability metadata used for profiling, planning, and deployment decisions.
+                          - It does NOT configure or enable any GPU interconnect; it only describes what is available/assumed.
+                          - When omitted, the operator may attempt best-effort discovery (currently distinguishes "nvlink"
+                            vs "pcie" based on DCGM NVLink link count). If discovery is unavailable, it may remain empty.
+                        Impact of wrong / missing values:
+                          - If set more optimistically than reality (e.g., "nvlink" when only PCIe is present), performance
+                            models may overestimate intra-node bandwidth and choose overly aggressive parallelism or layouts,
+                            resulting in degraded performance compared to expectations.
+                          - If set more pessimistically than reality (e.g., "pcie" when NVLink is present), the system may
+                            choose conservative plans and leave performance on the table.
+                          - If unset and undiscovered, consumers should treat the interconnect as unknown and fall back to
+                            conservative assumptions.
+                        Example values: "pcie", "nvlink". Other values may be accepted but may not be auto-detected.
                      type: string
                    numGpusPerNode:
                      description: NumGPUsPerNode is the number of GPUs per node.
                      format: int32
                      type: integer
                    rdma:
-                      description: RDMA indicates whether RDMA is available on the cluster.
+                      description: |-
+                        RDMA indicates whether the cluster has RDMA-capable networking available for Dynamo data movement.
+                        Semantics / usage:
+                          - This is capability metadata used for profiling, planning, and deployment decisions.
+                          - It does NOT install, enable, or configure RDMA (e.g., drivers, SR-IOV, NVIDIA network operator,
+                            GPUDirect settings). It only expresses availability/intent.
+                          - When omitted, the operator may attempt best-effort discovery (e.g., via node labels indicating
+                            RDMA/SR-IOV capability and/or presence of NVIDIA network-operator RDMA components). If discovery
+                            is unavailable, it may remain unset.
+                        Impact of wrong / missing values:
+                          - False positive (set true when RDMA is not actually usable end-to-end) may cause plans or
+                            deployments to assume RDMA is available; depending on the runtime transport selection and
+                            fallback behavior, this can lead to connection/setup failures or performance regressions.
+                          - False negative (set false when RDMA is available) will typically avoid RDMA-optimized paths and
+                            fall back to non-RDMA transports, usually remaining functional but potentially slower.
+                          - If unset and undiscovered, consumers should treat RDMA availability as unknown and use
+                            conservative defaults / fallback transports.
                      type: boolean
                    totalGpus:
                      description: TotalGPUs is the total number of GPUs available in the cluster.

--- a/deploy/operator/api/v1beta1/dynamographdeploymentrequest_types.go
+++ b/deploy/operator/api/v1beta1/dynamographdeploymentrequest_types.go
@@ -353,11 +353,47 @@ type HardwareSpec struct {
 	// NumGPUsPerNode is the number of GPUs per node.
 	// +optional
 	NumGPUsPerNode *int32 `json:"numGpusPerNode,omitempty"`
-	// Interconnect describes the GPU interconnect type within a node.
+	// Interconnect describes the primary GPU-to-GPU interconnect *within a node*.
-	// Examples: "pcie", "nvlink", "infiniband".
+	//
+	// Semantics / usage:
+	//   - This is capability metadata used for profiling, planning, and deployment decisions.
+	//   - It does NOT configure or enable any GPU interconnect; it only describes what is available/assumed.
+	//   - When omitted, the operator may attempt best-effort discovery (currently distinguishes "nvlink"
+	//     vs "pcie" based on DCGM NVLink link count). If discovery is unavailable, it may remain empty.
+	//
+	// Impact of wrong / missing values:
+	//   - If set more optimistically than reality (e.g., "nvlink" when only PCIe is present), performance
+	//     models may overestimate intra-node bandwidth and choose overly aggressive parallelism or layouts,
+	//     resulting in degraded performance compared to expectations.
+	//   - If set more pessimistically than reality (e.g., "pcie" when NVLink is present), the system may
+	//     choose conservative plans and leave performance on the table.
+	//   - If unset and undiscovered, consumers should treat the interconnect as unknown and fall back to
+	//     conservative assumptions.
+	//
+	// Example values: "pcie", "nvlink". Other values may be accepted but may not be auto-detected.
+	//
 	// +optional
 	Interconnect string `json:"interconnect,omitempty"`
-	// RDMA indicates whether RDMA is available on the cluster.
+	// RDMA indicates whether the cluster has RDMA-capable networking available for Dynamo data movement.
+	//
+	// Semantics / usage:
+	//   - This is capability metadata used for profiling, planning, and deployment decisions.
+	//   - It does NOT install, enable, or configure RDMA (e.g., drivers, SR-IOV, NVIDIA network operator,
+	//     GPUDirect settings). It only expresses availability/intent.
+	//   - When omitted, the operator may attempt best-effort discovery (e.g., via node labels indicating
+	//     RDMA/SR-IOV capability and/or presence of NVIDIA network-operator RDMA components). If discovery
+	//     is unavailable, it may remain unset.
+	//
+	// Impact of wrong / missing values:
+	//   - False positive (set true when RDMA is not actually usable end-to-end) may cause plans or
+	//     deployments to assume RDMA is available; depending on the runtime transport selection and
+	//     fallback behavior, this can lead to connection/setup failures or performance regressions.
+	//   - False negative (set false when RDMA is available) will typically avoid RDMA-optimized paths and
+	//     fall back to non-RDMA transports, usually remaining functional but potentially slower.
+	//   - If unset and undiscovered, consumers should treat RDMA availability as unknown and use
+	//     conservative defaults / fallback transports.
+	//
 	// +optional
 	RDMA *bool `json:"rdma,omitempty"`
 }
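
Reviewer note: `RDMA` is a `*bool` in Go and `Optional[bool]` in the Python types, so it carries three states: true, false, and unset. The sketch below shows one way a consumer might honor that, assuming an explicit value takes precedence over discovery and that "unknown" is handled like "not available". `resolve_rdma` and `use_rdma_path` are hypothetical names, not part of the operator or profiler.

```python
from typing import Optional


def resolve_rdma(declared: Optional[bool], discovered: Optional[bool]) -> Optional[bool]:
    """Return the declared value when set (True or False), else the discovered one.

    Hypothetical helper. None means "unset and undiscovered", i.e. unknown.
    """
    if declared is not None:
        return declared
    return discovered


def use_rdma_path(resolved: Optional[bool]) -> bool:
    """Choose an RDMA-optimized path only when RDMA is positively known.

    Assumption for this sketch: unknown (None) is treated like False, which
    corresponds to the "conservative defaults / fallback transports" wording.
    """
    return resolved is True


if __name__ == "__main__":
    # An explicit False is respected even if discovery reports RDMA.
    print(use_rdma_path(resolve_rdma(False, True)))
    # Omitted value with successful discovery.
    print(use_rdma_path(resolve_rdma(None, True)))
    # Omitted value and no discovery: fall back to non-RDMA transports.
    print(use_rdma_path(resolve_rdma(None, None)))
```

Checking `is not None` rather than truthiness is the important detail here: a plain `if declared:` would silently discard an explicit `False`.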

--- a/deploy/operator/config/crd/bases/nvidia.com_dynamographdeploymentrequests.yaml
+++ b/deploy/operator/config/crd/bases/nvidia.com_dynamographdeploymentrequests.yaml
@@ -615,15 +615,49 @@ spec:
                      type: string
                    interconnect:
                      description: |-
-                        Interconnect describes the GPU interconnect type within a node.
+                        Interconnect describes the primary GPU-to-GPU interconnect *within a node*.
-                        Examples: "pcie", "nvlink", "infiniband".
+                        Semantics / usage:
+                          - This is capability metadata used for profiling, planning, and deployment decisions.
+                          - It does NOT configure or enable any GPU interconnect; it only describes what is available/assumed.
+                          - When omitted, the operator may attempt best-effort discovery (currently distinguishes "nvlink"
+                            vs "pcie" based on DCGM NVLink link count). If discovery is unavailable, it may remain empty.
+                        Impact of wrong / missing values:
+                          - If set more optimistically than reality (e.g., "nvlink" when only PCIe is present), performance
+                            models may overestimate intra-node bandwidth and choose overly aggressive parallelism or layouts,
+                            resulting in degraded performance compared to expectations.
+                          - If set more pessimistically than reality (e.g., "pcie" when NVLink is present), the system may
+                            choose conservative plans and leave performance on the table.
+                          - If unset and undiscovered, consumers should treat the interconnect as unknown and fall back to
+                            conservative assumptions.
+                        Example values: "pcie", "nvlink". Other values may be accepted but may not be auto-detected.
                      type: string
                    numGpusPerNode:
                      description: NumGPUsPerNode is the number of GPUs per node.
                      format: int32
                      type: integer
                    rdma:
-                      description: RDMA indicates whether RDMA is available on the cluster.
+                      description: |-
+                        RDMA indicates whether the cluster has RDMA-capable networking available for Dynamo data movement.
+                        Semantics / usage:
+                          - This is capability metadata used for profiling, planning, and deployment decisions.
+                          - It does NOT install, enable, or configure RDMA (e.g., drivers, SR-IOV, NVIDIA network operator,
+                            GPUDirect settings). It only expresses availability/intent.
+                          - When omitted, the operator may attempt best-effort discovery (e.g., via node labels indicating
+                            RDMA/SR-IOV capability and/or presence of NVIDIA network-operator RDMA components). If discovery
+                            is unavailable, it may remain unset.
+                        Impact of wrong / missing values:
+                          - False positive (set true when RDMA is not actually usable end-to-end) may cause plans or
+                            deployments to assume RDMA is available; depending on the runtime transport selection and
+                            fallback behavior, this can lead to connection/setup failures or performance regressions.
+                          - False negative (set false when RDMA is available) will typically avoid RDMA-optimized paths and
+                            fall back to non-RDMA transports, usually remaining functional but potentially slower.
+                          - If unset and undiscovered, consumers should treat RDMA availability as unknown and use
+                            conservative defaults / fallback transports.
                      type: boolean
                    totalGpus:
                      description: TotalGPUs is the total number of GPUs available in the cluster.

--- a/docs/kubernetes/api-reference.md
+++ b/docs/kubernetes/api-reference.md
@@ -1584,8 +1584,8 @@ _Appears in:_
 | `vramMb` _float_ | VRAMMB is the VRAM per GPU in MiB. |  | Optional: \{\} <br /> |
 | `totalGpus` _integer_ | TotalGPUs is the total number of GPUs available in the cluster. |  | Optional: \{\} <br /> |
 | `numGpusPerNode` _integer_ | NumGPUsPerNode is the number of GPUs per node. |  | Optional: \{\} <br /> |
-| `interconnect` _string_ | Interconnect describes the GPU interconnect type within a node.<br />Examples: "pcie", "nvlink", "infiniband". |  | Optional: \{\} <br /> |
+| `interconnect` _string_ | Interconnect describes the primary GPU-to-GPU interconnect *within a node*.<br />Semantics / usage:<br />  - This is capability metadata used for profiling, planning, and deployment decisions.<br />  - It does NOT configure or enable any GPU interconnect; it only describes what is available/assumed.<br />  - When omitted, the operator may attempt best-effort discovery (currently distinguishes "nvlink"<br />    vs "pcie" based on DCGM NVLink link count). If discovery is unavailable, it may remain empty.<br />Impact of wrong / missing values:<br />  - If set more optimistically than reality (e.g., "nvlink" when only PCIe is present), performance<br />    models may overestimate intra-node bandwidth and choose overly aggressive parallelism or layouts,<br />    resulting in degraded performance compared to expectations.<br />  - If set more pessimistically than reality (e.g., "pcie" when NVLink is present), the system may<br />    choose conservative plans and leave performance on the table.<br />  - If unset and undiscovered, consumers should treat the interconnect as unknown and fall back to<br />    conservative assumptions.<br />Example values: "pcie", "nvlink". Other values may be accepted but may not be auto-detected. |  | Optional: \{\} <br /> |
-| `rdma` _boolean_ | RDMA indicates whether RDMA is available on the cluster. |  | Optional: \{\} <br /> |
+| `rdma` _boolean_ | RDMA indicates whether the cluster has RDMA-capable networking available for Dynamo data movement.<br />Semantics / usage:<br />  - This is capability metadata used for profiling, planning, and deployment decisions.<br />  - It does NOT install, enable, or configure RDMA (e.g., drivers, SR-IOV, NVIDIA network operator,<br />    GPUDirect settings). It only expresses availability/intent.<br />  - When omitted, the operator may attempt best-effort discovery (e.g., via node labels indicating<br />    RDMA/SR-IOV capability and/or presence of NVIDIA network-operator RDMA components). If discovery<br />    is unavailable, it may remain unset.<br />Impact of wrong / missing values:<br />  - False positive (set true when RDMA is not actually usable end-to-end) may cause plans or<br />    deployments to assume RDMA is available; depending on the runtime transport selection and<br />    fallback behavior, this can lead to connection/setup failures or performance regressions.<br />  - False negative (set false when RDMA is available) will typically avoid RDMA-optimized paths and<br />    fall back to non-RDMA transports, usually remaining functional but potentially slower.<br />  - If unset and undiscovered, consumers should treat RDMA availability as unknown and use<br />    conservative defaults / fallback transports. |  | Optional: \{\} <br /> |