feat: add pre-deployment check for storageclass (#3573)

Unverified Commit 9bb2bc93 authored Oct 13, 2025 by Biswa Panda Committed by GitHub Oct 14, 2025
7 changed files
--- a/benchmarks/nixl/README.md
+++ b/benchmarks/nixl/README.md
-# NIXL Benchmark Technical Documentation (Kubernetes)
-This guide describes how to run the NIXL benchmark using the provided Docker image on a Kubernetes (K8s) cluster.
---
-## Prerequisites
- A running Kubernetes cluster with access to NVIDIA GPUs (e.g., using NVIDIA GPU Operator or device plugin)
- `kubectl` configured to access your cluster
- deploy dynamo cloud in a namespace
---
-## 1. Prepare the Kubernetes Deployment
-A sample deployment YAML is provided in this repository:
-`benchmarks/nixl/nixl-benchmark-deployment.yaml`
-Update the image field in sample yaml to appropiate image in your registry.
-You can use the `yq` tool to update the image field in the deployment YAML
-```bash
-yq -i '.spec.template.spec.containers[] |= select(.name == "nixl-benchmark") .image = "your-registry/your-nixl-benchmark:your-tag"' benchmarks/nixl/nixl-benchmark-deployment.yaml > nixl-benchmark-deployment.yaml
-```
-## 2. Deploy using kubectl
-Launch using the command below:
-```bash
-kubectl apply -f  nixl-benchmark-deployment.yaml
-```
\ No newline at end of file
--- a/deploy/cloud/pre-deployment/README.md
+++ b/deploy/cloud/pre-deployment/README.md
+<!--
+SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+SPDX-License-Identifier: Apache-2.0
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+http://www.apache.org/licenses/LICENSE-2.0
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+# Pre-Deployment Check Script
+This directory contains a pre-deployment check script that verifies your Kubernetes cluster meets the requirements for deploying Dynamo.
+- For NCCL tests, see the [NCCL tests](https://docs.nebius.com/kubernetes/gpu/nccl-test#run-tests) guide.
+- For the NIXL benchmark, see the [NIXL benchmark pre-deployment checks](/deploy/cloud/pre-deployment/nixl/README.md).
+## Usage
+Run the pre-deployment check before deploying Dynamo:
+```bash
+./pre-deployment-check.sh
+```
+## What it checks
+The script performs a few checks and prints a detailed summary:
+### 1. kubectl Connectivity
+- Verifies that `kubectl` is installed and can connect to your Kubernetes cluster
+### 2. Default StorageClass
+- Verifies that a default StorageClass is configured in your cluster
+- If no default StorageClass is found:
+  - Lists all available StorageClasses in the cluster with full details
+  - Provides a sample command to set a StorageClass as default
+  - References the official Kubernetes documentation for detailed guidance
+### 3. Cluster GPU Resources
+- Checks for GPU-enabled nodes in the cluster using label `nvidia.com/gpu.present=true`
+## Sample Output
+### Complete Script Output Example:
+```
+========================================
+  Dynamo Pre-Deployment Check Script
+========================================
+--- Checking kubectl connectivity ---
+✅ kubectl is available and cluster is accessible
+--- Checking for default StorageClass ---
+❌ No default StorageClass found
+Dynamo requires a default StorageClass for persistent volume provisioning.
+Please follow the instructions below to configure a default StorageClass before proceeding with deployment.
+Available StorageClasses in your cluster:
+NAME                                 PROVISIONER                     RECLAIMPOLICY   VOLUMEBINDINGMODE      ALLOWVOLUMEEXPANSION   AGE
+my-default-storage-class             compute.csi.mock                Delete          WaitForFirstConsumer   true                   65d
+fast-ssd-storage                     kubernetes.io/gce-pd            Delete          Immediate              true                   30d
+To set a StorageClass as default, use the following command:
+kubectl patch storageclass <storage-class-name> -p '{"metadata": {"annotations":{"storageclass.kubernetes.io/is-default-class":"true"}}}'
+Example with your first available StorageClass:
+kubectl patch storageclass my-default-storage-class -p '{"metadata": {"annotations":{"storageclass.kubernetes.io/is-default-class":"true"}}}'
+For more information on managing default StorageClasses, visit:
+https://kubernetes.io/docs/tasks/administer-cluster/change-default-storage-class/
+--- Checking cluster gpu resources ---
+✅ Found 17 gpu node(s) in the cluster
+Node information:
+--- Pre-Deployment Check Summary ---
+✅ kubectl Connectivity: PASSED
+❌ Default StorageClass: FAILED
+✅ Cluster Resources: PASSED
+Summary: 2 passed, 1 failed
+❌ 1 pre-deployment check(s) failed.
+Please address the issues above before proceeding with deployment.
+```
+### When all checks pass:
+```
+========================================
+  Dynamo Pre-Deployment Check Script
+========================================
+--- Checking kubectl connectivity ---
+✅ kubectl is available and cluster is accessible
+--- Checking for default StorageClass ---
+✅ Default StorageClass found
+  - NAME                               PROVISIONER      RECLAIMPOLICY   VOLUMEBINDINGMODE      ALLOWVOLUMEEXPANSION   AGE
+my-default-storage-class (default)   compute.csi.mock   Delete          WaitForFirstConsumer   true                   65d
+--- Checking cluster gpu resources ---
+✅ Found 17 gpu node(s) in the cluster
+Node information:
+--- Pre-Deployment Check Summary ---
+✅ kubectl Connectivity: PASSED
+✅ Default StorageClass: PASSED
+✅ Cluster Resources: PASSED
+Summary: 3 passed, 0 failed
+🎉 All pre-deployment checks passed!
+Your cluster is ready for Dynamo deployment.
+```
+## Check Status Summary
+The script provides a comprehensive summary showing the status of each check:
+| Check Name | Description | Pass/Fail Status |
+|------------|-------------|------------------|
+| **kubectl Connectivity** | Verifies kubectl installation and cluster access | ✅ PASSED / ❌ FAILED |
+| **Default StorageClass** | Checks for default StorageClass annotation | ✅ PASSED / ❌ FAILED |
+| **Cluster Resources** | Validates GPU nodes availability | ✅ PASSED / ❌ FAILED |
+## Setting a Default StorageClass
+If you need to set a default StorageClass, use the following command:
+```bash
+kubectl patch storageclass <storage-class-name> -p '{"metadata": {"annotations":{"storageclass.kubernetes.io/is-default-class":"true"}}}'
+```
+Replace `<storage-class-name>` with the name of your desired StorageClass.
+## Troubleshooting
+### Multiple Default StorageClasses
+If you have multiple StorageClasses marked as default, the script will warn you:
+```
+⚠️  Warning: Multiple default StorageClasses detected
+   This may cause unpredictable behavior. Consider having only one default StorageClass.
+```
+To remove the default annotation from a StorageClass:
+```bash
+kubectl patch storageclass <storage-class-name> -p '{"metadata": {"annotations":{"storageclass.kubernetes.io/is-default-class":"false"}}}'
+```
+### No GPU Nodes Found
+If no GPU nodes are found, ensure your cluster has nodes with the `nvidia.com/gpu.present=true` label.
+### No StorageClasses Available
+If no StorageClasses are available in your cluster, you'll need to:
+1. Install a storage provisioner (e.g., for cloud providers, local storage, etc.)
+2. Create appropriate StorageClass resources
+3. Mark one as default
+## Reference
+For more information on managing default StorageClasses, visit:
+[Kubernetes Documentation - Change the default StorageClass](https://kubernetes.io/docs/tasks/administer-cluster/change-default-storage-class/)
\ No newline at end of file
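A note on the Default StorageClass check described in the README above: its three outcomes (no default, exactly one default, multiple defaults) can be reproduced with a small helper. The sketch below is illustrative only and is not part of this commit. `classify_default_storageclasses` is a hypothetical name, and the helper assumes its input is one `name<TAB>annotation-value` pair per line.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of this commit): classify a StorageClass
# listing into the three outcomes the README describes.
# Input (stdin): one "<name><TAB><is-default-class value>" pair per line.
# Output: "none", "one:<name>", or "multiple:<name1>,<name2>,..."
classify_default_storageclasses() {
    local name value
    local defaults=()
    while IFS=$'\t' read -r name value; do
        # Only the literal string "true" marks a class as default.
        if [[ -n "$name" && "$value" == "true" ]]; then
            defaults+=("$name")
        fi
    done
    case "${#defaults[@]}" in
        0) echo "none" ;;
        1) echo "one:${defaults[0]}" ;;
        *) local IFS=','; echo "multiple:${defaults[*]}" ;;
    esac
}

# Example against a live cluster (requires kubectl access):
#   kubectl get storageclass -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.metadata.annotations.storageclass\.kubernetes\.io/is-default-class}{"\n"}{end}' \
#     | classify_default_storageclasses
```

Keeping the classification separate from the `kubectl` call makes the logic easy to exercise with canned input, for example `printf 'a\ttrue\nb\ttrue\n' | classify_default_storageclasses`.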
--- a/deploy/cloud/pre-deployment/nixl/README.md
+++ b/deploy/cloud/pre-deployment/nixl/README.md
+# NIXL Benchmark Documentation
+This guide describes how to build and deploy the NIXL benchmark using the provided scripts on a Kubernetes (K8s) cluster.
+> **Note**: NIXL benchmark is part of the Dynamo platform. Before proceeding, ensure your cluster meets the basic Dynamo requirements by running the pre-deployment check script located in the parent directory (`../pre-deployment-check.sh`).
+---
+## Prerequisites
+### Cluster Requirements
+Before deploying NIXL benchmark, ensure your cluster meets the Dynamo platform requirements by running the pre-deployment check:
+```bash
+# Run from the parent directory
+../pre-deployment-check.sh
+```
+This script verifies:
+- `kubectl` connectivity and cluster access
+- Default StorageClass configuration
+- GPU nodes availability (`nvidia.com/gpu.present=true` label)
+- GPU Operator installation and status
+### NIXL-Specific Requirements
+In addition to the cluster requirements above, NIXL benchmark requires:
+- **Docker** installed and configured on your local machine (for building images)
+- **Docker registry access** to push the built nixlbench images
+- **ETCD service** deployed and accessible as `etcd:2379`
+- **Build utilities**: `wget` and `unzip` for downloading NIXL source code
+### Verification Steps
+1. **Run pre-deployment check** (recommended):
+   ```bash
+   ../pre-deployment-check.sh
+   ```
+   Ensure all checks pass before proceeding.
+2. **Verify ETCD availability** (NIXL-specific):
+   ```bash
+   kubectl get svc etcd
+   ```
+3. **Confirm Docker registry access**:
+   ```bash
+   docker login your-registry.com  # if using private registry
+   ```
+---
+## Quick Start
+For the easiest experience, use the interactive build and deploy script:
+```bash
+./build_and_deploy.sh
+```
+This script provides a flexible workflow where you can:
+1. **Select architecture**: Choose between x86_64 (Intel/AMD 64-bit) or aarch64 (ARM64)
+2. **Choose which steps to execute**: Select any combination of:
+   - Build nixlbench Docker image
+   - Update deployment YAML file
+   - Deploy to Kubernetes
+3. **Provide Docker registry** (only when needed for building or updating deployment)
+---
+## Interactive Script Features
+### Architecture Selection
+The script supports two architectures:
+- **Option 1**: x86_64 (Intel/AMD 64-bit)
+- **Option 2**: aarch64 (ARM64)
+You can select by entering:
+- `1` or `x86_64` for x86_64 architecture
+- `2` or `aarch64` for aarch64 architecture
+### Step Selection
+Choose which steps to execute by entering comma-separated numbers:
+- **All steps**: `1,2,3`
+- **Build and update only**: `1,2` (skips Kubernetes deployment)
+- **Deploy only**: `3` (useful if image is already built and deployment file exists)
+- **Build only**: `1` (useful for just creating the Docker image)
+- **Update deployment only**: `2` (useful for updating deployment file with new registry/version)
+### Smart Registry Prompting
+The script only prompts for Docker registry information when needed:
+- **Steps 1 or 2**: Registry required for building image or updating deployment
+- **Step 3 only**: No registry prompt (uses existing deployment file)
+---
+## What Each Step Does
+### Step 1: Build nixlbench Docker Image
+- Downloads NIXL source code (version 0.6.0) from GitHub
+- Extracts and navigates to the build directory
+- Pauses for user confirmation before building
+- Builds Docker image with specified registry and architecture
+- Tags image as: `{registry}/nixlbench:0.6.0-{arch}`
+### Step 2: Update Deployment YAML File
+- Copies base deployment template (`nixlbench-deployment.yaml`)
+- Creates architecture-specific deployment file (`nixlbench-deployment-{arch}.yaml`)
+- Updates image reference with your registry and architecture
+- Preserves all other deployment configurations
+### Step 3: Deploy to Kubernetes
+- Validates deployment file exists
+- Applies deployment to Kubernetes cluster
+- Provides monitoring commands for checking status
+---
+## Deployment Configuration
+The deployment creates:
+- **2 replicas** of the nixlbench pod
+- **Resource requests/limits**:
+  - CPU: 10 cores
+  - Memory: 5Gi
+  - GPU: 1 NVIDIA GPU per pod
+- **Environment variables**:
+  - `ETCD_ENDPOINTS`: Points to `etcd:2379`
+- **Command**: Runs nixlbench with VRAM segments and keeps container alive
+---
+## Usage Examples
+### Example 1: Complete Workflow
+```bash
+./build_and_deploy.sh
+# Select: 1 (x86_64)
+# Steps: 1,2,3
+# Registry: docker.io/myusername
+# Confirm: y
+```
+### Example 2: Build Image Only
+```bash
+./build_and_deploy.sh
+# Select: 2 (aarch64)
+# Steps: 1
+# Registry: my-private-registry.com
+# Confirm: y
+```
+### Example 3: Deploy Existing Image
+```bash
+./build_and_deploy.sh
+# Select: 1 (x86_64)
+# Steps: 3
+# Confirm: y
+```
+### Example 4: Update Deployment File Only
+```bash
+./build_and_deploy.sh
+# Select: 2 (aarch64)
+# Steps: 2
+# Registry: new-registry.com
+# Confirm: y
+```
+---
+## Generated Files
+The script creates architecture-specific deployment files:
+- `nixlbench-deployment-x86_64.yaml` - For x86_64 builds
+- `nixlbench-deployment-aarch64.yaml` - For aarch64 builds
+These files are customized versions of the base template with your specific:
+- Docker registry
+- Image tag
+- Architecture
+---
+## Monitoring Your Deployment
+After deployment, monitor your NIXL benchmark:
+```bash
+# Check pod status
+kubectl get pods -l app=nixl-benchmark
+# View logs
+kubectl logs -l app=nixl-benchmark -f
+# Check resource usage
+kubectl top pods -l app=nixl-benchmark
+# Get detailed pod information
+kubectl describe pods -l app=nixl-benchmark
+```
+If deployed to a specific namespace:
+```bash
+kubectl get pods -l app=nixl-benchmark -n your-namespace
+kubectl logs -l app=nixl-benchmark -f -n your-namespace
+```
+---
+## Troubleshooting
+### Cluster-Level Issues
+For cluster-related problems, first run the pre-deployment check to identify issues:
+```bash
+../pre-deployment-check.sh
+```
+This will help diagnose:
+- kubectl connectivity problems
+- Missing default StorageClass
+- GPU node availability issues
+- GPU Operator status problems
+### NIXL-Specific Issues
+1. **ETCD Connection**:
+   - Ensure the etcd service is running: `kubectl get svc etcd`
+   - Verify etcd endpoints are accessible from pods
+   - Check if etcd is in the correct namespace
+2. **Image Pull Issues**:
+   - Verify registry credentials are configured
+   - Check image exists: `docker pull {registry}/nixlbench:0.6.0-{arch}`
+   - Ensure image was pushed successfully after build
+3. **Build Failures**:
+   - Ensure Docker daemon is running
+   - Check available disk space in `/tmp`
+   - Verify network connectivity to GitHub
+   - Confirm build utilities are installed: `which wget unzip`
+4. **Deployment File Not Found**:
+   - Run step 2 to create deployment file before step 3
+   - Check file permissions in script directory
+   - Verify script directory path is correct
+### Debug Commands
+```bash
+# Check script-generated files
+ls -la nixlbench-deployment-*.yaml
+# Verify deployment status
+kubectl get deployment nixl-benchmark -o yaml
+# Check events for issues
+kubectl get events --sort-by=.metadata.creationTimestamp
+```
+### Cleanup
+To remove the deployment:
+```bash
+kubectl delete deployment nixl-benchmark
+```
+Or if deployed to a specific namespace:
+```bash
+kubectl delete deployment nixl-benchmark -n your-namespace
+```
+To clean up generated files:
+```bash
+rm -f nixlbench-deployment-*.yaml
+```
+---
+## Script Reference
+### build_and_deploy.sh
+Interactive script that provides flexible build and deployment workflow:
+- **Architecture selection**: x86_64 or aarch64
+- **Step selection**: Choose any combination of build, update, deploy
+- **Validation**: Checks for deployment files before deploying
+### nixlbench-deployment.yaml
+Base Kubernetes deployment template that gets customized by the script:
+- **Template image**: `my-registry/nixlbench:version-arch`
+- **Resource allocation**: 10 CPU, 5Gi memory, 1 GPU per pod
+- **ETCD integration**: Pre-configured environment variables
+- **Benchmark command**: Runs with VRAM segment configuration
\ No newline at end of file
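The image naming scheme and the template substitution described under "Step 2" in the README above can be sketched as a pair of small functions. This is an illustrative sketch, not the committed script: `image_ref` and `render_deployment` are hypothetical names, and the sketch writes to stdout instead of editing a file in place. It assumes the registry string contains no `|` or `&` characters, which would need escaping for `sed`.

```bash
#!/usr/bin/env bash
# Illustrative sketch (not the committed script): build the image reference
# and substitute it for the template placeholder without editing in place.
PLACEHOLDER_IMAGE="my-registry/nixlbench:version-arch"

# image_ref <registry> <version> <arch>  ->  <registry>/nixlbench:<version>-<arch>
image_ref() {
    printf '%s/nixlbench:%s-%s\n' "$1" "$2" "$3"
}

# render_deployment <registry> <version> <arch>
# Reads the template on stdin and writes the customized YAML to stdout.
render_deployment() {
    local image
    image="$(image_ref "$1" "$2" "$3")"
    # "|" is the sed delimiter because image references contain "/".
    sed "s|${PLACEHOLDER_IMAGE}|${image}|g"
}
```

A possible invocation would be `render_deployment docker.io/myusername 0.6.0 x86_64 < nixlbench-deployment.yaml > nixlbench-deployment-x86_64.yaml`, which leaves the base template untouched.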
--- a/deploy/cloud/pre-deployment/nixl/build_and_deploy.sh
+++ b/deploy/cloud/pre-deployment/nixl/build_and_deploy.sh
+#!/bin/bash
+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+set -euo pipefail
+NIXL_VERSION="0.6.0"
+SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+# Function to check if a command exists
+command_exists() {
+    command -v "$1" >/dev/null 2>&1
+}
+# Function to check Docker daemon status
+check_docker_daemon() {
+    if ! docker info >/dev/null 2>&1; then
+        return 1
+    fi
+    return 0
+}
+# Function to check all required dependencies
+check_dependencies() {
+    echo "Checking required dependencies..."
+    local missing_deps=()
+    local warnings=()
+    # Check wget
+    if ! command_exists wget; then
+        missing_deps+=("wget")
+    else
+        echo "✅ wget is available"
+    fi
+    # Check unzip
+    if ! command_exists unzip; then
+        missing_deps+=("unzip")
+    else
+        echo "✅ unzip is available"
+    fi
+    # Check kubectl
+    if ! command_exists kubectl; then
+        missing_deps+=("kubectl")
+    else
+        echo "✅ kubectl is available"
+        # Test kubectl connectivity
+        if ! kubectl cluster-info >/dev/null 2>&1; then
+            warnings+=("kubectl is installed but cannot connect to cluster")
+        else
+            echo "✅ kubectl can connect to cluster"
+        fi
+    fi
+    # Check Docker
+    if ! command_exists docker; then
+        missing_deps+=("docker")
+    else
+        echo "✅ docker is available"
+        # Check Docker daemon
+        if ! check_docker_daemon; then
+            warnings+=("Docker is installed but daemon is not running or accessible")
+        else
+            echo "✅ Docker daemon is running"
+            # Additional Docker toolchain checks
+            if ! docker ps >/dev/null 2>&1; then
+                warnings+=("Docker requires sudo or user is not in docker group - consider adding user to docker group")
+            fi
+            if ! docker buildx version >/dev/null 2>&1; then
+                warnings+=("Docker buildx not available (may affect multi-architecture builds)")
+            fi
+        fi
+    fi
+    # Report missing dependencies
+    if [ ${#missing_deps[@]} -gt 0 ]; then
+        echo
+        echo "❌ Missing required dependencies:"
+        for dep in "${missing_deps[@]}"; do
+            echo "  - $dep"
+        done
+        echo
+        echo "Please install the missing dependencies and try again."
+        echo
+        echo "Installation suggestions:"
+        for dep in "${missing_deps[@]}"; do
+            case "$dep" in
+                wget)
+                    echo "  wget: sudo apt-get install wget (Ubuntu/Debian) or yum install wget (RHEL/CentOS)"
+                    ;;
+                unzip)
+                    echo "  unzip: sudo apt-get install unzip (Ubuntu/Debian) or yum install unzip (RHEL/CentOS)"
+                    ;;
+                kubectl)
+                    echo "  kubectl: https://kubernetes.io/docs/tasks/tools/install-kubectl/"
+                    ;;
+                docker)
+                    echo "  docker: https://docs.docker.com/get-docker/"
+                    ;;
+            esac
+        done
+        return 1
+    fi
+    # Report warnings
+    if [ ${#warnings[@]} -gt 0 ]; then
+        echo
+        echo "⚠️  Warnings:"
+        for warning in "${warnings[@]}"; do
+            echo "  - $warning"
+        done
+        echo
+        printf "Do you want to continue despite these warnings? (y/N): "
+        read continue_with_warnings
+        case "$continue_with_warnings" in
+            [Yy]|[Yy][Ee][Ss])
+                echo "Continuing with warnings..."
+                ;;
+            *)
+                echo "Please resolve the warnings and try again."
+                return 1
+                ;;
+        esac
+    fi
+    echo "✅ All required dependencies are available"
+    return 0
+}
+# Function to display available architectures
+show_architectures() {
+    echo "Available architectures:"
+    echo "1) x86_64 (Intel/AMD 64-bit)"
+    echo "2) aarch64 (ARM64)"
+}
+# Function to validate architecture input
+validate_architecture() {
+    local arch=$1
+    case $arch in
+        1|x86_64)
+            echo "x86_64"
+            return 0
+            ;;
+        2|aarch64)
+            echo "aarch64"
+            return 0
+            ;;
+        *)
+            return 1
+            ;;
+    esac
+}
+# Function to prompt for registry
+prompt_for_registry() {
+    echo
+    printf "Enter your Docker registry (e.g., my-registry, docker.io/username): "
+    read REGISTRY
+    if [ -z "$REGISTRY" ]; then
+        echo "Error: Registry cannot be empty"
+        exit 1
+    fi
+}
+# Function to build nixlbench image
+build_nixlbench() {
+    local arch=$1
+    local registry=$2
+    echo "Building nixlbench image for architecture: $arch"
+    echo "Registry: $registry"
+    NIXL_BUILD_DIR="/tmp/nixlbench-${NIXL_VERSION}"
+    rm -rf "${NIXL_BUILD_DIR}"
+    mkdir -p "${NIXL_BUILD_DIR}"
+    cd "${NIXL_BUILD_DIR}"
+    echo "Downloading NIXL source..."
+    wget https://github.com/ai-dynamo/nixl/archive/refs/tags/${NIXL_VERSION}.zip
+    unzip "${NIXL_VERSION}.zip"
+    cd "nixl-${NIXL_VERSION}/benchmark/nixlbench/contrib"
+    read -p "Press Enter to continue"
+    echo "Building Docker image..."
+    ./build.sh --tag "${registry}/nixlbench:${NIXL_VERSION}-${arch}" --arch "${arch}"
+    echo "Build completed successfully!"
+    echo "Image: ${registry}/nixlbench:${NIXL_VERSION}-${arch}"
+}
+# Function to update deployment yaml
+update_deployment() {
+    local arch=$1
+    local registry=$2
+    local deployment_file="${SCRIPT_DIR}/nixlbench-deployment-${arch}.yaml"
+    echo "Creating deployment file: $deployment_file"
+    # Copy the original deployment file and update the image
+    cp "${SCRIPT_DIR}/nixlbench-deployment.yaml" "$deployment_file"
+    # Update the image field using sed
+    sed -i "s|my-registry/nixlbench:version-arch|${registry}/nixlbench:${NIXL_VERSION}-${arch}|g" "$deployment_file"
+    echo "Deployment file updated with image: ${registry}/nixlbench:${NIXL_VERSION}-${arch}"
+}
+# Function to prompt for steps to execute
+prompt_for_steps() {
+    echo
+    echo "Select which steps to execute:"
+    echo "1) Build nixlbench Docker image"
+    echo "2) Update deployment YAML file"
+    echo "3) Deploy to Kubernetes"
+    echo
+    echo "Enter the steps you want to execute (e.g., '1,2,3' for all, '1,2' to skip deployment, '3' for deployment only):"
+    printf "Steps to execute: "
+    read steps_input
+    if [ -z "$steps_input" ]; then
+        echo "Error: Please select at least one step"
+        return 1
+    fi
+    # Parse the input and set flags
+    EXECUTE_BUILD=false
+    EXECUTE_UPDATE=false
+    EXECUTE_DEPLOY=false
+    # Convert comma-separated input to array
+    IFS=',' read -ra STEPS <<< "$steps_input"
+    for step in "${STEPS[@]}"; do
+        # Remove whitespace
+        step=$(echo "$step" | tr -d ' ')
+        case "$step" in
+            1)
+                EXECUTE_BUILD=true
+                ;;
+            2)
+                EXECUTE_UPDATE=true
+                ;;
+            3)
+                EXECUTE_DEPLOY=true
+                ;;
+            *)
+                echo "Warning: Invalid step '$step' ignored. Valid steps are 1, 2, 3"
+                ;;
+        esac
+    done
+    # Check if at least one valid step was selected
+    if [ "$EXECUTE_BUILD" = false ] && [ "$EXECUTE_UPDATE" = false ] && [ "$EXECUTE_DEPLOY" = false ]; then
+        echo "Error: No valid steps selected"
+        return 1
+    fi
+    return 0
+}
+# Function to deploy to Kubernetes
+deploy_to_k8s() {
+    local arch=$1
+    local deployment_file="${SCRIPT_DIR}/nixlbench-deployment-${arch}.yaml"
+    echo "Deploying to Kubernetes..."
+    kubectl apply -f "$deployment_file"
+    echo "Deployment applied successfully!"
+    echo
+    echo "To check the status of your deployment:"
+    echo "kubectl get pods -l app=nixl-benchmark"
+    echo
+    echo "To view logs:"
+    echo "kubectl logs -l app=nixl-benchmark -f"
+}
+# Main script
+main() {
+    echo "NIXL Benchmark Build and Deploy Script"
+    echo "======================================"
+    echo
+    # Check dependencies first
+    if ! check_dependencies; then
+        exit 1
+    fi
+    echo
+    # Show available architectures
+    show_architectures
+    echo
+    # Prompt for architecture
+    while true; do
+        printf "Select architecture (1-2 or enter x86_64/aarch64): "
+        read arch_input
+        if [ -z "$arch_input" ]; then
+            echo "Error: Please select an architecture"
+            continue
+        fi
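+        # Use the assignment as the `if` condition: under `set -e`, a bare
+        # assignment whose command substitution fails would exit the script
+        # before the error message below could be shown.
+        if SELECTED_ARCH=$(validate_architecture "$arch_input"); then
+            break
+        else
+            echo "Error: Invalid architecture. Please select 1, 2, x86_64, or aarch64"
+        fi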
+    done
+    echo "Selected architecture: $SELECTED_ARCH"
+    # The registry is requested later, and only if a build or update step is selected
+    REGISTRY=""
+    # Prompt for steps to execute
+    while true; do
+        if prompt_for_steps; then
+            break
+        fi
+        echo "Please try again."
+        echo
+    done
+    # Only prompt for registry if we need it
+    if [ "$EXECUTE_BUILD" = true ] || [ "$EXECUTE_UPDATE" = true ]; then
+        prompt_for_registry
+    fi
+    echo
+    echo "Summary:"
+    echo "- Architecture: $SELECTED_ARCH"
+    if [ -n "$REGISTRY" ]; then
+        echo "- Registry: $REGISTRY"
+        echo "- Image will be: $REGISTRY/nixlbench:$NIXL_VERSION-$SELECTED_ARCH"
+    fi
+    echo "- Steps to execute:"
+    if [ "$EXECUTE_BUILD" = true ]; then
+        echo "  ✓ Build nixlbench Docker image"
+    else
+        echo "  ✗ Build nixlbench Docker image (skipped)"
+    fi
+    if [ "$EXECUTE_UPDATE" = true ]; then
+        echo "  ✓ Update deployment YAML file"
+    else
+        echo "  ✗ Update deployment YAML file (skipped)"
+    fi
+    if [ "$EXECUTE_DEPLOY" = true ]; then
+        echo "  ✓ Deploy to Kubernetes"
+    else
+        echo "  ✗ Deploy to Kubernetes (skipped)"
+    fi
+    echo
+    printf "Proceed with selected steps? (y/N): "
+    read confirm
+    case "$confirm" in
+        [Yy]|[Yy][Ee][Ss])
+            ;;
+        *)
+            echo "Process cancelled."
+            exit 0
+            ;;
+    esac
+    # Execute selected steps
+    if [ "$EXECUTE_BUILD" = true ]; then
+        echo
+        echo "=== Building nixlbench Docker image ==="
+        build_nixlbench "$SELECTED_ARCH" "$REGISTRY"
+    fi
+    if [ "$EXECUTE_UPDATE" = true ]; then
+        echo
+        echo "=== Updating deployment YAML file ==="
+        update_deployment "$SELECTED_ARCH" "$REGISTRY"
+    fi
+    if [ "$EXECUTE_DEPLOY" = true ]; then
+        echo
+        echo "=== Deploying to Kubernetes ==="
+        # Check if deployment file exists
+        deployment_file="${SCRIPT_DIR}/nixlbench-deployment-${SELECTED_ARCH}.yaml"
+        if [ ! -f "$deployment_file" ]; then
+            echo "Warning: Deployment file not found at $deployment_file"
+            echo "You may need to run step 2 (Update deployment YAML file) first."
+            printf "Do you want to continue with deployment anyway? (y/N): "
+            read deploy_confirm
+            case "$deploy_confirm" in
+                [Yy]|[Yy][Ee][Ss])
+                    ;;
+                *)
+                    echo "Deployment skipped."
+                    EXECUTE_DEPLOY=false
+                    ;;
+            esac
+        fi
+        if [ "$EXECUTE_DEPLOY" = true ]; then
+            deploy_to_k8s "$SELECTED_ARCH"
+        fi
+    fi
+    echo
+    echo "Process completed successfully!"
+}
+# Run main function
+main "$@"
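One detail of the validate-and-retry pattern used for the architecture prompt is worth illustrating: when a helper signals invalid input through its exit status and `set -e` is active, capturing its output inside an `if` condition keeps a non-zero status from ending the script. The sketch below is a minimal, self-contained illustration; `normalize_arch` and `pick_arch` are hypothetical names that mirror the contract of `validate_architecture`.

```bash
#!/usr/bin/env bash
# Illustrative sketch: validating input under `set -e` without aborting.
set -e

# Same contract as validate_architecture in the script above:
# print the canonical name and return 0, or return 1 for unknown input.
normalize_arch() {
    case "$1" in
        1|x86_64)  echo "x86_64" ;;
        2|aarch64) echo "aarch64" ;;
        *)         return 1 ;;
    esac
}

# pick_arch prints the first valid architecture among its arguments.
# Using the assignment as the `if` condition keeps `set -e` from ending
# the script when normalize_arch returns non-zero.
pick_arch() {
    local candidate arch
    for candidate in "$@"; do
        if arch="$(normalize_arch "$candidate")"; then
            echo "$arch"
            return 0
        fi
        echo "Invalid architecture: $candidate" >&2
    done
    return 1
}
```

For instance, `pick_arch bogus 2` reports the invalid candidate on stderr and then prints `aarch64`, instead of exiting at the first failed validation.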
--- a/benchmarks/nixl/nixl-benchmark-deployment.yaml
+++ b/benchmarks/nixl/nixl-benchmark-deployment.yaml
@@ -14,16 +14,22 @@ spec:
      labels:
        app: nixl-benchmark
    spec:
-      imagePullSecrets:
-        - name: nvcr-imagepullsecret
      containers:
      - name: nixl-benchmark
-        image: my-registry/vllm-runtime:nixlbench-e42c07a8
+        image: "my-registry/nixlbench:version-arch"
        command: ["sh", "-c"]
+        env:
+          - name: ETCD_ENDPOINTS
+            value: etcd:2379
        args:
-          - "nixlbench -etcd_endpoints http://dynamo-platform-etcd:2379 --target_seg_type VRAM --initiator_seg_type VRAM && sleep infinity"
+        - |
+          nixlbench -etcd_endpoints ${ETCD_ENDPOINTS} --target_seg_type VRAM --initiator_seg_type VRAM && sleep infinity
        resources:
          requests:
-            nvidia.com/gpu: "1"
+              cpu: "10"
+              memory: "5Gi"
+              nvidia.com/gpu: "1"
          limits:
-            nvidia.com/gpu: "1"
+              cpu: "10"
+              memory: "5Gi"
+              nvidia.com/gpu: "1"
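The requests and limits in this manifest apply per pod, so the capacity the benchmark needs scales with the replica count. A quick way to sanity-check this before deploying is sketched below. `total_requests` is a hypothetical helper, and the replica count of 2 is taken from the accompanying README, not from the hunk shown here.

```bash
#!/usr/bin/env bash
# Illustrative sketch: total capacity a Deployment requests across replicas.
# total_requests <replicas> <cpu_cores_per_pod> <memory_gi_per_pod> <gpus_per_pod>
total_requests() {
    local replicas=$1 cpu=$2 mem_gi=$3 gpus=$4
    printf 'cpu=%d memory=%dGi gpu=%d\n' \
        "$((replicas * cpu))" "$((replicas * mem_gi))" "$((replicas * gpus))"
}

# Values from this change: 10 CPU, 5Gi, and 1 GPU per pod; the accompanying
# README states 2 replicas.
total_requests 2 10 5 1   # prints: cpu=20 memory=10Gi gpu=2
```

The total is only a lower bound on cluster capacity: each pod must also fit on a single node that has 10 CPU cores, 5Gi of memory, and one GPU free.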
--- a/deploy/cloud/pre-deployment/pre-deployment-check.sh
+++ b/deploy/cloud/pre-deployment/pre-deployment-check.sh
+#!/usr/bin/env bash
+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+# Pre-deployment check script for Dynamo
+# This script verifies that the Kubernetes cluster has the necessary prerequisites
+# before deploying Dynamo platform.
+#
+# Checks performed:
+# 1. kubectl connectivity - Verifies kubectl is installed and can connect to cluster
+# 2. Default StorageClass - Ensures a default StorageClass is configured
+# 3. Cluster GPU Resources - Validates GPU nodes are available
+# 4. GPU Operator - Confirms GPU operator is installed and running
+set -e
+# Colors for output
+RED='\033[0;31m'
+GREEN='\033[0;32m'
+YELLOW='\033[1;33m'
+BLUE='\033[0;34m'
+NC='\033[0m' # No Color
+# Function to print colored output
+print_status() {
+    local color=$1
+    local message=$2
+    echo -e "${color}${message}${NC}"
+}
+print_header() {
+    echo -e "\n${BLUE}========================================${NC}"
+    echo -e "${BLUE}  Dynamo Pre-Deployment Check Script  ${NC}"
+    echo -e "${BLUE}========================================${NC}\n"
+}
+print_section() {
+    echo -e "\n${BLUE}--- $1 ---${NC}"
+}
+# Function to check if kubectl is available and cluster is accessible
+check_kubectl() {
+    print_section "Checking kubectl connectivity"
+    if ! command -v kubectl &> /dev/null; then
+        print_status $RED "❌ kubectl is not installed or not in PATH"
+        print_status $YELLOW "Please install kubectl and ensure it's in your PATH"
+        return 1
+    fi
+    if ! kubectl cluster-info &> /dev/null; then
+        print_status $RED "❌ Cannot connect to Kubernetes cluster"
+        print_status $YELLOW "Please ensure kubectl is configured to connect to your cluster"
+        return 1
+    fi
+    print_status $GREEN "✅ kubectl is available and cluster is accessible"
+    return 0
+}
+# Function to check for default storage class
+check_default_storage_class() {
+    print_section "Checking for default StorageClass"
+    # Use JSONPath to find storage classes with the default annotation set to "true"
+    local default_storage_classes
+    default_storage_classes=$(kubectl get storageclass -o jsonpath='{range .items[?(@.metadata.annotations.storageclass\.kubernetes\.io/is-default-class=="true")]}{.metadata.name}{"\n"}{end}' 2>/dev/null || echo "")
+    if [[ -z "$default_storage_classes" ]]; then
+        print_status $RED "❌ No default StorageClass found"
+        print_status $YELLOW "\nDynamo requires a default StorageClass for persistent volume provisioning."
+        print_status $BLUE "Please follow the instructions below to configure a default StorageClass before proceeding with deployment.\n"
+        # Show available storage classes
+        print_status $BLUE "Available StorageClasses in your cluster:"
+        local all_storage_classes
+        all_storage_classes=$(kubectl get storageclass 2>/dev/null || echo "")
+        if [[ -z "$all_storage_classes" ]]; then
+            print_status $YELLOW "  No StorageClasses found in the cluster"
+        else
+            echo -e "$all_storage_classes"
+            local all_storage_class_names
+            all_storage_class_names=$(kubectl get storageclass -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}' 2>/dev/null || echo "")
+            print_status $BLUE "\nTo set a StorageClass as default, use the following command:"
+            print_status $YELLOW "kubectl patch storageclass <storage-class-name> -p '{\"metadata\": {\"annotations\":{\"storageclass.kubernetes.io/is-default-class\":\"true\"}}}'"
+            if [[ -n "$all_storage_class_names" ]]; then
+                local first_sc_name
+                first_sc_name=$(echo "$all_storage_class_names" | head -n1)
+                print_status $BLUE "\nExample with your first available StorageClass:"
+                print_status $YELLOW "kubectl patch storageclass ${first_sc_name} -p '{\"metadata\": {\"annotations\":{\"storageclass.kubernetes.io/is-default-class\":\"true\"}}}'"
+            fi
+        fi
+        print_status $BLUE "\nFor more information on managing default StorageClasses, visit:"
+        print_status $BLUE "https://kubernetes.io/docs/tasks/administer-cluster/change-default-storage-class/"
+        return 1
+    else
+        print_status $GREEN "✅ Default StorageClass found"
+        while IFS= read -r sc_name; do
+            if [[ -n "$sc_name" ]]; then
+                # Fetch the one-line description of this StorageClass (header row suppressed)
+                local default_sc
+                default_sc=$(kubectl get storageclass "$sc_name" --no-headers 2>/dev/null || echo "$sc_name")
+                print_status $GREEN "  - ${default_sc}"
+            fi
+        done <<< "$default_storage_classes"
+        # Check if there are multiple default storage classes (which can cause issues)
+        local default_count
+        # grep -c prints 0 itself when nothing matches (and exits 1), so only neutralize the exit status
+        default_count=$(echo "$default_storage_classes" | grep -c . || true)
+        if [[ $default_count -gt 1 ]]; then
+            print_status $YELLOW "⚠️  Warning: Multiple default StorageClasses detected"
+            print_status $YELLOW "   This may cause unpredictable behavior. Consider having only one default StorageClass."
+        fi
+        return 0
+    fi
+}
+check_cluster_resources() {
+    print_section "Checking cluster GPU resources"
+    local node_count
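+    # wc -l always succeeds and prints 0 for empty input; strip the padding that BSD/macOS wc adds
+    node_count=$(kubectl get nodes -l nvidia.com/gpu.present=true -o name 2>/dev/null | wc -l | tr -d '[:space:]')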
+    if [[ $node_count -eq 0 ]]; then
+        print_status $RED "❌ No GPU nodes found in the cluster"
+        print_status $YELLOW "Dynamo requires nodes with nvidia.com/gpu.present=true label."
+        print_status $BLUE "Please ensure your cluster has GPU-enabled nodes properly labeled."
+        return 1
+    else
+        print_status $GREEN "✅ Found ${node_count} GPU node(s) in the cluster"
+        return 0
+    fi
+    # Show basic node information (commented out for cleaner output)
+    # print_status $BLUE "GPU Node information:"
+    # kubectl get nodes -l nvidia.com/gpu.present=true -o custom-columns=NAME:.metadata.name,STATUS:.status.conditions[-1].type,ROLES:.metadata.labels.'node-role\.kubernetes\.io/.*',VERSION:.status.nodeInfo.kubeletVersion 2>/dev/null || true
+}
+check_gpu_operator() {
+    print_section "Checking GPU operator"
+    # Check if GPU operator pods exist and are running
+    local gpu_operator_pods
+    gpu_operator_pods=$(kubectl get pods -A -lapp=gpu-operator --no-headers 2>/dev/null || echo "")
+    if [[ -z "$gpu_operator_pods" ]]; then
+        print_status $RED "❌ GPU operator not found in the cluster"
+        print_status $YELLOW "Dynamo requires GPU operator to be installed and running."
+        print_status $BLUE "Please install GPU operator before proceeding with deployment."
+        return 1
+    fi
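+    # Count GPU operator pods in the Running state.
+    # grep -c prints 0 itself when nothing matches (and exits 1), so use "|| true";
+    # "|| echo 0" would append a second 0 and break the numeric comparison below.
+    local running_pods
+    running_pods=$(echo "$gpu_operator_pods" | grep -c "Running" || true)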
+    local total_pods
+    total_pods=$(echo "$gpu_operator_pods" | wc -l)
+    if [[ $running_pods -eq 0 ]]; then
+        print_status $RED "❌ GPU operator pods are not running"
+        print_status $YELLOW "Found $total_pods GPU operator pod(s) but none are in Running state:"
+        echo "$gpu_operator_pods"
+        return 1
+    elif [[ $running_pods -lt $total_pods ]]; then
+        print_status $YELLOW "⚠️  GPU operator partially running: $running_pods/$total_pods pods running"
+        echo "$gpu_operator_pods"
+        print_status $GREEN "✅ GPU operator is available (with warnings)"
+        return 0
+    else
+        print_status $GREEN "✅ GPU operator is running ($running_pods/$total_pods pods)"
+        return 0
+    fi
+}
+# Global variables to track check results (stored as '|'-delimited strings for compatibility)
+CHECK_RESULTS=""
+CHECK_ORDER=""
+# Function to record check result
+record_check_result() {
+    local check_name="$1"
+    local status="$2"
+    # Append to results string with delimiter
+    if [[ -z "$CHECK_RESULTS" ]]; then
+        CHECK_RESULTS="${check_name}:${status}"
+        CHECK_ORDER="${check_name}"
+    else
+        CHECK_RESULTS="${CHECK_RESULTS}|${check_name}:${status}"
+        CHECK_ORDER="${CHECK_ORDER}|${check_name}"
+    fi
+}
+# Function to get check result by name
+get_check_result() {
+    local check_name="$1"
+    echo "$CHECK_RESULTS" | tr '|' '\n' | grep "^${check_name}:" | cut -d':' -f2
+}
+# Function to display check summary
+display_check_summary() {
+    print_section "Pre-Deployment Check Summary"
+    local passed=0
+    local failed=0
+    # Split CHECK_ORDER by delimiter and iterate
+    IFS='|' read -ra CHECKS <<< "$CHECK_ORDER"
+    for check_name in "${CHECKS[@]}"; do
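+        # Declare and assign separately so "local" does not mask the command's exit status
+        local status
+        status=$(get_check_result "$check_name")
+        if [[ "$status" == "PASS" ]]; then
+            print_status $GREEN "✅ $check_name: PASSED"
+            # Plain assignment: ((passed++)) returns non-zero when the old value is 0
+            passed=$((passed + 1))
+        else
+            print_status $RED "❌ $check_name: FAILED"
+            failed=$((failed + 1))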
+        fi
+    done
+    echo ""
+    print_status $BLUE "Summary: $passed passed, $failed failed"
+    if [[ $failed -eq 0 ]]; then
+        print_status $GREEN "🎉 All pre-deployment checks passed!"
+        print_status $GREEN "Your cluster is ready for Dynamo deployment."
+        return 0
+    else
+        print_status $RED "❌ $failed pre-deployment check(s) failed."
+        print_status $RED "Please address the issues above before proceeding with deployment."
+        return 1
+    fi
+}
+# Main execution
+main() {
+    print_header
+    local overall_exit_code=0
+    # Run checks and capture results
+    if check_kubectl; then
+        record_check_result "kubectl Connectivity" "PASS"
+    else
+        record_check_result "kubectl Connectivity" "FAIL"
+        overall_exit_code=1
+    fi
+    if check_default_storage_class; then
+        record_check_result "Default StorageClass" "PASS"
+    else
+        record_check_result "Default StorageClass" "FAIL"
+        overall_exit_code=1
+    fi
+    if check_cluster_resources; then
+        record_check_result "Cluster GPU Resources" "PASS"
+    else
+        record_check_result "Cluster GPU Resources" "FAIL"
+        overall_exit_code=1
+    fi
+    if check_gpu_operator; then
+        record_check_result "GPU Operator" "PASS"
+    else
+        record_check_result "GPU Operator" "FAIL"
+        overall_exit_code=1
+    fi
+    # Display summary
+    echo ""
+    if ! display_check_summary; then
+        overall_exit_code=1
+    fi
+    exit $overall_exit_code
+}
+# Run the script
+main "$@"
--- a/docs/kubernetes/README.md
+++ b/docs/kubernetes/README.md
@@ -19,6 +19,11 @@ limitations under the License.
 High-level guide to Dynamo Kubernetes deployments. Start here, then dive into specific guides.
+## Pre-deployment Checks
+Before deploying the platform, run the pre-deployment checks to confirm that the cluster is ready. See the [pre-deployment checks](/deploy/cloud/pre-deployment/README.md) guide for details.
 ## 1. Install Platform First
 ```bash