Commit 9bb2bc93 authored by Biswa Panda, committed by GitHub

feat: add pre-deployment check for storageclass (#3573)

parent de3ca70b
# NIXL Benchmark Technical Documentation (Kubernetes)
This guide describes how to run the NIXL benchmark using the provided Docker image on a Kubernetes (K8s) cluster.
---
## Prerequisites
- A running Kubernetes cluster with access to NVIDIA GPUs (e.g., using NVIDIA GPU Operator or device plugin)
- `kubectl` configured to access your cluster
- Dynamo Cloud deployed in a namespace
---
## 1. Prepare the Kubernetes Deployment
A sample deployment YAML is provided in this repository:
`benchmarks/nixl/nixl-benchmark-deployment.yaml`
Update the `image` field in the sample YAML to point to the appropriate image in your registry.
You can use the `yq` tool to update the `image` field in the deployment YAML:
```bash
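# Writes the updated manifest to a new file (yq v4 syntax); the sample in the repository is left unchanged
yq '(.spec.template.spec.containers[] | select(.name == "nixl-benchmark") | .image) = "your-registry/your-nixl-benchmark:your-tag"' benchmarks/nixl/nixl-benchmark-deployment.yaml > nixl-benchmark-deployment.yaml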
```
## 2. Deploy using kubectl
Launch using the command below:
```bash
kubectl apply -f nixl-benchmark-deployment.yaml
```
<!--
SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
# Pre-Deployment Check Script
This directory contains a pre-deployment check script that verifies your Kubernetes cluster meets the requirements for deploying Dynamo.
- For NCCL tests, see the [NCCL tests](https://docs.nebius.com/kubernetes/gpu/nccl-test#run-tests) guide.
- For the NIXL benchmark, see the [NIXL benchmark pre-deployment checks](/deploy/cloud/pre-deployment/nixl/README.md).
## Usage
Run the pre-deployment check before deploying Dynamo:
```bash
./pre-deployment-check.sh
```
## What it checks
The script runs the following checks and prints a summary of the results:
### 1. kubectl Connectivity
- Verifies that `kubectl` is installed and can connect to your Kubernetes cluster
### 2. Default StorageClass
- Verifies that a default StorageClass is configured in your cluster
- If no default StorageClass is found:
  - Lists all available StorageClasses in the cluster with full details
  - Provides a sample command to set a StorageClass as default
  - References the official Kubernetes documentation for detailed guidance
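### 3. Cluster GPU Resources
- Checks for GPU-enabled nodes in the cluster using the label `nvidia.com/gpu.present=true`
### 4. GPU Operator
- Checks that GPU Operator pods (label `app=gpu-operator`, any namespace) exist and that at least one is in the `Running` state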
## Sample Output
### Complete Script Output Example:
```
========================================
Dynamo Pre-Deployment Check Script
========================================
--- Checking kubectl connectivity ---
✅ kubectl is available and cluster is accessible
--- Checking for default StorageClass ---
❌ No default StorageClass found
Dynamo requires a default StorageClass for persistent volume provisioning.
Please configure a default StorageClass before proceeding with deployment.
Available StorageClasses in your cluster:
NAME                       PROVISIONER            RECLAIMPOLICY   VOLUMEBINDINGMODE      ALLOWVOLUMEEXPANSION   AGE
my-default-storage-class   compute.csi.mock       Delete          WaitForFirstConsumer   true                   65d
fast-ssd-storage           kubernetes.io/gce-pd   Delete          Immediate              true                   30d
To set a StorageClass as default, use the following command:
kubectl patch storageclass <storage-class-name> -p '{"metadata": {"annotations":{"storageclass.kubernetes.io/is-default-class":"true"}}}'
Example with your first available StorageClass:
kubectl patch storageclass my-default-storage-class -p '{"metadata": {"annotations":{"storageclass.kubernetes.io/is-default-class":"true"}}}'
For more information on managing default StorageClasses, visit:
https://kubernetes.io/docs/tasks/administer-cluster/change-default-storage-class/
--- Checking cluster gpu resources ---
✅ Found 17 gpu node(s) in the cluster
Node information:
--- Pre-Deployment Check Summary ---
✅ kubectl Connectivity: PASSED
❌ Default StorageClass: FAILED
✅ Cluster Resources: PASSED
Summary: 2 passed, 1 failed
❌ 1 pre-deployment check(s) failed.
Please address the issues above before proceeding with deployment.
```
### When all checks pass:
```
========================================
Dynamo Pre-Deployment Check Script
========================================
--- Checking kubectl connectivity ---
✅ kubectl is available and cluster is accessible
--- Checking for default StorageClass ---
✅ Default StorageClass found
 - NAME                              PROVISIONER        RECLAIMPOLICY   VOLUMEBINDINGMODE      ALLOWVOLUMEEXPANSION   AGE
my-default-storage-class (default)   compute.csi.mock   Delete          WaitForFirstConsumer   true                   65d
--- Checking cluster gpu resources ---
✅ Found 17 gpu node(s) in the cluster
Node information:
--- Pre-Deployment Check Summary ---
✅ kubectl Connectivity: PASSED
✅ Default StorageClass: PASSED
✅ Cluster Resources: PASSED
Summary: 3 passed, 0 failed
🎉 All pre-deployment checks passed!
Your cluster is ready for Dynamo deployment.
```
## Check Status Summary
The script provides a comprehensive summary showing the status of each check:
| Check Name | Description | Pass/Fail Status |
|------------|-------------|------------------|
| **kubectl Connectivity** | Verifies kubectl installation and cluster access | ✅ PASSED / ❌ FAILED |
| **Default StorageClass** | Checks for default StorageClass annotation | ✅ PASSED / ❌ FAILED |
| **Cluster GPU Resources** | Validates GPU nodes availability | ✅ PASSED / ❌ FAILED |
| **GPU Operator** | Verifies GPU Operator pods exist and are running | ✅ PASSED / ❌ FAILED |
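Because the script exits with a non-zero status when any check fails, it can gate a later step in a shell script or CI job. The following is a minimal sketch of that pattern; `run_gate` is a hypothetical helper name, and the follow-up command is a placeholder rather than a real deployment command.

```bash
# Hypothetical helper: run a check command and only run the follow-up
# command when the check exits with status 0.
run_gate() {
  local check_cmd="$1"
  shift
  if "$check_cmd"; then
    echo "checks passed, continuing"
    "$@"
  else
    echo "checks failed, stopping" >&2
    return 1
  fi
}

# Example (not executed here; the follow-up command is a placeholder):
#   run_gate ./pre-deployment-check.sh echo "deploy step goes here"
```

The helper returns the follow-up command's exit status on success, so it can itself be chained with `&&`.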
## Setting a Default StorageClass
If you need to set a default StorageClass, use the following command:
```bash
kubectl patch storageclass <storage-class-name> -p '{"metadata": {"annotations":{"storageclass.kubernetes.io/is-default-class":"true"}}}'
```
Replace `<storage-class-name>` with the name of your desired StorageClass.
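To avoid hand-editing the JSON patch, the command can be generated. This is a small sketch only: `default_sc_patch_cmd` is a hypothetical helper that prints the `kubectl patch` command for review and does not apply anything to the cluster.

```bash
# Hypothetical helper: print the kubectl command that sets or clears the
# default-class annotation on a StorageClass. It only prints the command.
default_sc_patch_cmd() {
  local sc_name="$1"
  local value="${2:-true}"   # "true" marks the class as default, "false" clears it
  printf "kubectl patch storageclass %s -p '%s'\n" "$sc_name" \
    "{\"metadata\": {\"annotations\":{\"storageclass.kubernetes.io/is-default-class\":\"${value}\"}}}"
}

# Example (not executed here):
#   default_sc_patch_cmd fast-ssd-storage
#   default_sc_patch_cmd old-default-class false
```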
## Troubleshooting
### Multiple Default StorageClasses
If you have multiple StorageClasses marked as default, the script will warn you:
```
⚠️ Warning: Multiple default StorageClasses detected
This may cause unpredictable behavior. Consider having only one default StorageClass.
```
To remove the default annotation from a StorageClass:
```bash
kubectl patch storageclass <storage-class-name> -p '{"metadata": {"annotations":{"storageclass.kubernetes.io/is-default-class":"false"}}}'
```
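Before removing an annotation it helps to see which classes are currently marked as default. A minimal sketch, assuming the plain `kubectl get storageclass` table format in which default classes are shown as `name (default)`; `list_default_storage_classes` is a hypothetical helper that reads that output on stdin.

```bash
# Hypothetical helper: read `kubectl get storageclass --no-headers` output on
# stdin and print the names of the classes that kubectl marks "(default)".
list_default_storage_classes() {
  awk '$2 == "(default)" { print $1 }'
}

# Example (not executed here):
#   kubectl get storageclass --no-headers | list_default_storage_classes
```

If more than one name is printed, clear the annotation on all but the class you want to keep as default.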
### No GPU Nodes Found
If no GPU nodes are found, ensure your cluster has nodes with the `nvidia.com/gpu.present=true` label.
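To reproduce this check by hand, list the nodes that carry the label and count them. The sketch below mirrors what the check does in spirit; `count_gpu_nodes` is a hypothetical helper, and it assumes the `node/<name>` format printed by `kubectl get nodes -o name`.

```bash
# Hypothetical helper: count node names read from stdin, one per line, as
# printed by `kubectl get nodes -l nvidia.com/gpu.present=true -o name`.
count_gpu_nodes() {
  # grep -c prints 0 and exits non-zero when nothing matches; keep the 0.
  grep -c '^node/' || true
}

# Examples (not executed here):
#   kubectl get nodes -l nvidia.com/gpu.present=true -o name | count_gpu_nodes
#   kubectl get nodes --show-labels   # inspect which labels the nodes carry
```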
### No StorageClasses Available
If no StorageClasses are available in your cluster, you'll need to:
1. Install a storage provisioner (e.g., for cloud providers, local storage, etc.)
2. Create appropriate StorageClass resources
3. Mark one as default
## Reference
For more information on managing default StorageClasses, visit:
[Kubernetes Documentation - Change the default StorageClass](https://kubernetes.io/docs/tasks/administer-cluster/change-default-storage-class/)
# NIXL Benchmark Documentation
This guide describes how to build and deploy the NIXL benchmark using the provided scripts on a Kubernetes (K8s) cluster.
> **Note**: NIXL benchmark is part of the Dynamo platform. Before proceeding, ensure your cluster meets the basic Dynamo requirements by running the pre-deployment check script located in the parent directory (`../pre-deployment-check.sh`).
---
## Prerequisites
### Cluster Requirements
Before deploying NIXL benchmark, ensure your cluster meets the Dynamo platform requirements by running the pre-deployment check:
```bash
# Run from the parent directory
../pre-deployment-check.sh
```
This script verifies:
- `kubectl` connectivity and cluster access
- A default StorageClass is configured
- GPU node availability (`nvidia.com/gpu.present=true` label)
- GPU Operator installation and status
### NIXL-Specific Requirements
In addition to the cluster requirements above, NIXL benchmark requires:
- **Docker** installed and configured on your local machine (for building images)
- **Docker registry access** to push the built nixlbench images
- **ETCD service** deployed and accessible as `etcd:2379`
- **Build utilities**: `wget` and `unzip` for downloading NIXL source code
### Verification Steps
1. **Run pre-deployment check** (recommended):
   ```bash
   ../pre-deployment-check.sh
   ```
   Ensure all checks pass before proceeding.
2. **Verify ETCD availability** (NIXL-specific):
   ```bash
   kubectl get svc etcd
   ```
3. **Confirm Docker registry access**:
   ```bash
   docker login your-registry.com # if using private registry
   ```
---
## Quick Start
For the easiest experience, use the interactive build and deploy script:
```bash
./build_and_deploy.sh
```
This script provides a flexible workflow where you can:
1. **Select architecture**: Choose between x86_64 (Intel/AMD 64-bit) or aarch64 (ARM64)
2. **Choose which steps to execute**: Select any combination of:
   - Build nixlbench Docker image
   - Update deployment YAML file
   - Deploy to Kubernetes
3. **Provide Docker registry** (only when needed for building or updating deployment)
---
## Interactive Script Features
### Architecture Selection
The script supports two architectures:
- **Option 1**: x86_64 (Intel/AMD 64-bit)
- **Option 2**: aarch64 (ARM64)
You can select by entering:
- `1` or `x86_64` for x86_64 architecture
- `2` or `aarch64` for aarch64 architecture
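The architecture should match the nodes that will run the benchmark, not necessarily the machine running the script. Kubernetes reports node architectures as `amd64` or `arm64` in `.status.nodeInfo.architecture`, so a small mapping can help pick the right option. This is a sketch only; `k8s_arch_to_nixl_arch` is a hypothetical helper and is not part of the script.

```bash
# Hypothetical helper: map a Kubernetes node architecture name to the
# architecture name this script expects.
k8s_arch_to_nixl_arch() {
  case "$1" in
    amd64) echo "x86_64" ;;
    arm64) echo "aarch64" ;;
    *) echo "unsupported:$1"; return 1 ;;
  esac
}

# Example (not executed here): list node architectures, then map one of them.
#   kubectl get nodes -o jsonpath='{.items[*].status.nodeInfo.architecture}'
#   k8s_arch_to_nixl_arch amd64
```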
### Step Selection
Choose which steps to execute by entering comma-separated numbers:
- **All steps**: `1,2,3`
- **Build and update only**: `1,2` (skips Kubernetes deployment)
- **Deploy only**: `3` (useful if image is already built and deployment file exists)
- **Build only**: `1` (useful for just creating the Docker image)
- **Update deployment only**: `2` (useful for updating deployment file with new registry/version)
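The step selection is a comma-separated list in which spaces are ignored and unknown values are skipped with a warning. The following simplified sketch shows that parsing idea in isolation; `parse_steps` is a hypothetical function, and it prints step names instead of setting the script's internal flags.

```bash
# Simplified sketch of comma-separated step parsing.
# Prints one word per recognized step, in the order given.
parse_steps() {
  printf '%s\n' "$1" | tr ',' '\n' | tr -d ' ' | while IFS= read -r step; do
    case "$step" in
      1) echo "build" ;;
      2) echo "update" ;;
      3) echo "deploy" ;;
      *) echo "ignored:$step" ;;
    esac
  done
}

# Example (not executed here):
#   parse_steps "1, 3"
```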
### Smart Registry Prompting
The script only prompts for Docker registry information when needed:
- **Steps 1 or 2**: Registry required for building image or updating deployment
- **Step 3 only**: No registry prompt (uses existing deployment file)
---
## What Each Step Does
### Step 1: Build nixlbench Docker Image
- Downloads NIXL source code (version 0.6.0) from GitHub
- Extracts and navigates to the build directory
- Pauses for user confirmation before building
- Builds Docker image with specified registry and architecture
- Tags image as: `{registry}/nixlbench:0.6.0-{arch}`
### Step 2: Update Deployment YAML File
- Copies base deployment template (`nixlbench-deployment.yaml`)
- Creates architecture-specific deployment file (`nixlbench-deployment-{arch}.yaml`)
- Updates image reference with your registry and architecture
- Preserves all other deployment configurations
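The image update amounts to replacing the template placeholder `my-registry/nixlbench:version-arch` with the concrete image reference. A minimal sketch of that substitution, assuming the placeholder appears verbatim in the template; `render_deployment` is a hypothetical helper that writes to stdout instead of editing a file in place.

```bash
# Hypothetical helper: render a deployment manifest from the base template by
# swapping the placeholder image for a concrete registry/version/arch.
render_deployment() {
  local template="$1" registry="$2" version="$3" arch="$4"
  sed "s|my-registry/nixlbench:version-arch|${registry}/nixlbench:${version}-${arch}|g" "$template"
}

# Example (not executed here):
#   render_deployment nixlbench-deployment.yaml docker.io/myusername 0.6.0 x86_64 \
#     > nixlbench-deployment-x86_64.yaml
```

Note that a registry value containing `|` or `&` would be misinterpreted by `sed` in this sketch.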
### Step 3: Deploy to Kubernetes
- Validates deployment file exists
- Applies deployment to Kubernetes cluster
- Provides monitoring commands for checking status
---
## Deployment Configuration
The deployment creates:
- **2 replicas** of the nixlbench pod
- **Resource requests/limits**:
  - CPU: 10 cores
  - Memory: 5Gi
  - GPU: 1 NVIDIA GPU per pod
- **Environment variables**:
  - `ETCD_ENDPOINTS`: Points to `etcd:2379`
- **Command**: Runs nixlbench with VRAM segments and keeps container alive
---
## Usage Examples
### Example 1: Complete Workflow
```bash
./build_and_deploy.sh
# Select: 1 (x86_64)
# Steps: 1,2,3
# Registry: docker.io/myusername
# Confirm: y
```
### Example 2: Build Image Only
```bash
./build_and_deploy.sh
# Select: 2 (aarch64)
# Steps: 1
# Registry: my-private-registry.com
# Confirm: y
```
### Example 3: Deploy Existing Image
```bash
./build_and_deploy.sh
# Select: 1 (x86_64)
# Steps: 3
# Confirm: y
```
### Example 4: Update Deployment File Only
```bash
./build_and_deploy.sh
# Select: 2 (aarch64)
# Steps: 2
# Registry: new-registry.com
# Confirm: y
```
---
## Generated Files
The script creates architecture-specific deployment files:
- `nixlbench-deployment-x86_64.yaml` - For x86_64 builds
- `nixlbench-deployment-aarch64.yaml` - For aarch64 builds
These files are customized versions of the base template with your specific:
- Docker registry
- Image tag
- Architecture
---
## Monitoring Your Deployment
After deployment, monitor your NIXL benchmark:
```bash
# Check pod status
kubectl get pods -l app=nixl-benchmark
# View logs
kubectl logs -l app=nixl-benchmark -f
# Check resource usage
kubectl top pods -l app=nixl-benchmark
# Get detailed pod information
kubectl describe pods -l app=nixl-benchmark
```
If deployed to a specific namespace:
```bash
kubectl get pods -l app=nixl-benchmark -n your-namespace
kubectl logs -l app=nixl-benchmark -f -n your-namespace
```
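To wait for the rollout to finish instead of polling pod status, `kubectl rollout status` can be used. A minimal sketch, assuming the Deployment is named `nixl-benchmark`; `wait_for_nixlbench` is a hypothetical helper, and the default namespace and timeout are illustrative values only.

```bash
# Hypothetical helper: block until the Deployment finishes rolling out, or
# fail once the timeout expires. Defaults are illustrative assumptions.
wait_for_nixlbench() {
  local namespace="${1:-default}"
  local timeout="${2:-300s}"
  kubectl rollout status deployment/nixl-benchmark -n "$namespace" --timeout="$timeout"
}

# Example (not executed here):
#   wait_for_nixlbench your-namespace 120s
```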
---
## Troubleshooting
### Cluster-Level Issues
For cluster-related problems, first run the pre-deployment check to identify issues:
```bash
../pre-deployment-check.sh
```
This will help diagnose:
- kubectl connectivity problems
- Missing default StorageClass
- GPU node availability issues
- GPU Operator status problems
### NIXL-Specific Issues
1. **ETCD Connection**:
   - Ensure the etcd service is running: `kubectl get svc etcd`
   - Verify etcd endpoints are accessible from pods
   - Check that etcd is in the same namespace as the benchmark pods, since the deployment refers to it by the short name `etcd:2379`
2. **Image Pull Issues**:
   - Verify registry credentials are configured
   - Check image exists: `docker pull {registry}/nixlbench:0.6.0-{arch}`
   - Ensure image was pushed successfully after build
3. **Build Failures**:
   - Ensure Docker daemon is running
   - Check available disk space in `/tmp`
   - Verify network connectivity to GitHub
   - Confirm build utilities are installed: `which wget unzip`
4. **Deployment File Not Found**:
   - Run step 2 to create deployment file before step 3
   - Check file permissions in script directory
   - Verify script directory path is correct
### Debug Commands
```bash
# Check script-generated files
ls -la nixlbench-deployment-*.yaml
# Verify deployment status
kubectl get deployment nixl-benchmark -o yaml
# Check events for issues
kubectl get events --sort-by=.metadata.creationTimestamp
```
### Cleanup
To remove the deployment:
```bash
kubectl delete deployment nixl-benchmark
```
Or if deployed to a specific namespace:
```bash
kubectl delete deployment nixl-benchmark -n your-namespace
```
To clean up generated files:
```bash
rm -f nixlbench-deployment-*.yaml
```
---
## Script Reference
### build_and_deploy.sh
Interactive script that provides flexible build and deployment workflow:
- **Architecture selection**: x86_64 or aarch64
- **Step selection**: Choose any combination of build, update, deploy
- **Validation**: Checks for deployment files before deploying
### nixlbench-deployment.yaml
Base Kubernetes deployment template that gets customized by the script:
- **Template image**: `my-registry/nixlbench:version-arch`
- **Resource allocation**: 10 CPU, 5Gi memory, 1 GPU per pod
- **ETCD integration**: Pre-configured environment variables
- **Benchmark command**: Runs with VRAM segment configuration
#!/bin/bash
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
set -euo pipefail
NIXL_VERSION="0.6.0"
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
# Function to check if a command exists
command_exists() {
command -v "$1" >/dev/null 2>&1
}
# Function to check Docker daemon status
check_docker_daemon() {
if ! docker info >/dev/null 2>&1; then
return 1
fi
return 0
}
# Function to check all required dependencies
check_dependencies() {
echo "Checking required dependencies..."
local missing_deps=()
local warnings=()
# Check wget
if ! command_exists wget; then
missing_deps+=("wget")
else
echo "✅ wget is available"
fi
# Check unzip
if ! command_exists unzip; then
missing_deps+=("unzip")
else
echo "✅ unzip is available"
fi
# Check kubectl
if ! command_exists kubectl; then
missing_deps+=("kubectl")
else
echo "✅ kubectl is available"
# Test kubectl connectivity
if ! kubectl cluster-info >/dev/null 2>&1; then
warnings+=("kubectl is installed but cannot connect to cluster")
else
echo "✅ kubectl can connect to cluster"
fi
fi
# Check Docker
if ! command_exists docker; then
missing_deps+=("docker")
else
echo "✅ docker is available"
# Check Docker daemon
if ! check_docker_daemon; then
warnings+=("Docker is installed but daemon is not running or accessible")
else
echo "✅ Docker daemon is running"
# Additional Docker toolchain checks
if ! docker ps >/dev/null 2>&1; then
warnings+=("Docker requires sudo or user is not in docker group - consider adding user to docker group")
fi
if ! docker buildx version >/dev/null 2>&1; then
warnings+=("Docker buildx not available (may affect multi-architecture builds)")
fi
fi
fi
# Report missing dependencies
if [ ${#missing_deps[@]} -gt 0 ]; then
echo
echo "❌ Missing required dependencies:"
for dep in "${missing_deps[@]}"; do
echo " - $dep"
done
echo
echo "Please install the missing dependencies and try again."
echo
echo "Installation suggestions:"
for dep in "${missing_deps[@]}"; do
case "$dep" in
wget)
echo " wget: sudo apt-get install wget (Ubuntu/Debian) or yum install wget (RHEL/CentOS)"
;;
unzip)
echo " unzip: sudo apt-get install unzip (Ubuntu/Debian) or yum install unzip (RHEL/CentOS)"
;;
kubectl)
echo " kubectl: https://kubernetes.io/docs/tasks/tools/install-kubectl/"
;;
docker)
echo " docker: https://docs.docker.com/get-docker/"
;;
esac
done
return 1
fi
# Report warnings
if [ ${#warnings[@]} -gt 0 ]; then
echo
echo "⚠️ Warnings:"
for warning in "${warnings[@]}"; do
echo " - $warning"
done
echo
printf "Do you want to continue despite these warnings? (y/N): "
read continue_with_warnings
case "$continue_with_warnings" in
[Yy]|[Yy][Ee][Ss])
echo "Continuing with warnings..."
;;
*)
echo "Please resolve the warnings and try again."
return 1
;;
esac
fi
echo "✅ All required dependencies are available"
return 0
}
# Function to display available architectures
show_architectures() {
echo "Available architectures:"
echo "1) x86_64 (Intel/AMD 64-bit)"
echo "2) aarch64 (ARM64)"
}
# Function to validate architecture input
validate_architecture() {
local arch=$1
case $arch in
1|x86_64)
echo "x86_64"
return 0
;;
2|aarch64)
echo "aarch64"
return 0
;;
*)
return 1
;;
esac
}
# Function to prompt for registry
prompt_for_registry() {
echo
printf "Enter your Docker registry (e.g., my-registry, docker.io/username): "
read REGISTRY
if [ -z "$REGISTRY" ]; then
echo "Error: Registry cannot be empty"
exit 1
fi
}
# Function to build nixlbench image
build_nixlbench() {
local arch=$1
local registry=$2
echo "Building nixlbench image for architecture: $arch"
echo "Registry: $registry"
NIXL_BUILD_DIR="/tmp/nixlbench-${NIXL_VERSION}"
rm -rf "${NIXL_BUILD_DIR}"
mkdir -p "${NIXL_BUILD_DIR}"
cd "${NIXL_BUILD_DIR}"
echo "Downloading NIXL source..."
wget https://github.com/ai-dynamo/nixl/archive/refs/tags/${NIXL_VERSION}.zip
unzip "${NIXL_VERSION}.zip"
cd "nixl-${NIXL_VERSION}/benchmark/nixlbench/contrib"
read -p "Press Enter to continue"
echo "Building Docker image..."
./build.sh --tag "${registry}/nixlbench:${NIXL_VERSION}-${arch}" --arch "${arch}"
echo "Build completed successfully!"
echo "Image: ${registry}/nixlbench:${NIXL_VERSION}-${arch}"
}
# Function to update deployment yaml
update_deployment() {
local arch=$1
local registry=$2
local deployment_file="${SCRIPT_DIR}/nixlbench-deployment-${arch}.yaml"
echo "Creating deployment file: $deployment_file"
# Copy the original deployment file and update the image
cp "${SCRIPT_DIR}/nixlbench-deployment.yaml" "$deployment_file"
# Update the image field using sed
sed -i "s|my-registry/nixlbench:version-arch|${registry}/nixlbench:${NIXL_VERSION}-${arch}|g" "$deployment_file"
echo "Deployment file updated with image: ${registry}/nixlbench:${NIXL_VERSION}-${arch}"
}
# Function to prompt for steps to execute
prompt_for_steps() {
echo
echo "Select which steps to execute:"
echo "1) Build nixlbench Docker image"
echo "2) Update deployment YAML file"
echo "3) Deploy to Kubernetes"
echo
echo "Enter the steps you want to execute (e.g., '1,2,3' for all, '1,2' to skip deployment, '3' for deployment only):"
printf "Steps to execute: "
read steps_input
if [ -z "$steps_input" ]; then
echo "Error: Please select at least one step"
return 1
fi
# Parse the input and set flags
EXECUTE_BUILD=false
EXECUTE_UPDATE=false
EXECUTE_DEPLOY=false
# Convert comma-separated input to array
IFS=',' read -ra STEPS <<< "$steps_input"
for step in "${STEPS[@]}"; do
# Remove whitespace
step=$(echo "$step" | tr -d ' ')
case "$step" in
1)
EXECUTE_BUILD=true
;;
2)
EXECUTE_UPDATE=true
;;
3)
EXECUTE_DEPLOY=true
;;
*)
echo "Warning: Invalid step '$step' ignored. Valid steps are 1, 2, 3"
;;
esac
done
# Check if at least one valid step was selected
if [ "$EXECUTE_BUILD" = false ] && [ "$EXECUTE_UPDATE" = false ] && [ "$EXECUTE_DEPLOY" = false ]; then
echo "Error: No valid steps selected"
return 1
fi
return 0
}
# Function to deploy to Kubernetes
deploy_to_k8s() {
local arch=$1
local deployment_file="${SCRIPT_DIR}/nixlbench-deployment-${arch}.yaml"
echo "Deploying to Kubernetes..."
kubectl apply -f "$deployment_file"
echo "Deployment applied successfully!"
echo
echo "To check the status of your deployment:"
echo "kubectl get pods -l app=nixl-benchmark"
echo
echo "To view logs:"
echo "kubectl logs -l app=nixl-benchmark -f"
}
# Main script
main() {
echo "NIXL Benchmark Build and Deploy Script"
echo "======================================"
echo
# Check dependencies first
if ! check_dependencies; then
exit 1
fi
echo
# Show available architectures
show_architectures
echo
# Prompt for architecture
while true; do
printf "Select architecture (1-2 or enter x86_64/aarch64): "
read arch_input
if [ -z "$arch_input" ]; then
echo "Error: Please select an architecture"
continue
fi
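# Test the assignment directly: with `set -e`, a failing command substitution
# in a bare assignment would exit the script before `$?` could be checked.
if SELECTED_ARCH=$(validate_architecture "$arch_input"); then
break
else
echo "Error: Invalid architecture. Please select 1, 2, x86_64, or aarch64"
fi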
done
echo "Selected architecture: $SELECTED_ARCH"
# Prompt for registry (only if building or updating deployment)
REGISTRY=""
# Prompt for steps to execute
while true; do
if prompt_for_steps; then
break
fi
echo "Please try again."
echo
done
# Only prompt for registry if we need it
if [ "$EXECUTE_BUILD" = true ] || [ "$EXECUTE_UPDATE" = true ]; then
prompt_for_registry
fi
echo
echo "Summary:"
echo "- Architecture: $SELECTED_ARCH"
if [ -n "$REGISTRY" ]; then
echo "- Registry: $REGISTRY"
echo "- Image will be: $REGISTRY/nixlbench:$NIXL_VERSION-$SELECTED_ARCH"
fi
echo "- Steps to execute:"
if [ "$EXECUTE_BUILD" = true ]; then
echo " ✓ Build nixlbench Docker image"
else
echo " ✗ Build nixlbench Docker image (skipped)"
fi
if [ "$EXECUTE_UPDATE" = true ]; then
echo " ✓ Update deployment YAML file"
else
echo " ✗ Update deployment YAML file (skipped)"
fi
if [ "$EXECUTE_DEPLOY" = true ]; then
echo " ✓ Deploy to Kubernetes"
else
echo " ✗ Deploy to Kubernetes (skipped)"
fi
echo
printf "Proceed with selected steps? (y/N): "
read confirm
case "$confirm" in
[Yy]|[Yy][Ee][Ss])
;;
*)
echo "Process cancelled."
exit 0
;;
esac
# Execute selected steps
if [ "$EXECUTE_BUILD" = true ]; then
echo
echo "=== Building nixlbench Docker image ==="
build_nixlbench "$SELECTED_ARCH" "$REGISTRY"
fi
if [ "$EXECUTE_UPDATE" = true ]; then
echo
echo "=== Updating deployment YAML file ==="
update_deployment "$SELECTED_ARCH" "$REGISTRY"
fi
if [ "$EXECUTE_DEPLOY" = true ]; then
echo
echo "=== Deploying to Kubernetes ==="
# Check if deployment file exists
deployment_file="${SCRIPT_DIR}/nixlbench-deployment-${SELECTED_ARCH}.yaml"
if [ ! -f "$deployment_file" ]; then
echo "Warning: Deployment file not found at $deployment_file"
echo "You may need to run step 2 (Update deployment YAML file) first."
printf "Do you want to continue with deployment anyway? (y/N): "
read deploy_confirm
case "$deploy_confirm" in
[Yy]|[Yy][Ee][Ss])
;;
*)
echo "Deployment skipped."
EXECUTE_DEPLOY=false
;;
esac
fi
if [ "$EXECUTE_DEPLOY" = true ]; then
deploy_to_k8s "$SELECTED_ARCH"
fi
fi
echo
echo "Process completed successfully!"
}
# Run main function
main "$@"
@@ -14,16 +14,22 @@ spec:
       labels:
         app: nixl-benchmark
     spec:
+      imagePullSecrets:
+        - name: nvcr-imagepullsecret
       containers:
         - name: nixl-benchmark
-          image: my-registry/vllm-runtime:nixlbench-e42c07a8
+          image: "my-registry/nixlbench:version-arch"
           command: ["sh", "-c"]
+          env:
+            - name: ETCD_ENDPOINTS
+              value: etcd:2379
           args:
-            - "nixlbench -etcd_endpoints http://dynamo-platform-etcd:2379 --target_seg_type VRAM --initiator_seg_type VRAM && sleep infinity"
+            - |
+              nixlbench -etcd_endpoints ${ETCD_ENDPOINTS} --target_seg_type VRAM --initiator_seg_type VRAM && sleep infinity
           resources:
             requests:
-              nvidia.com/gpu: "1"
+              cpu: "10"
+              memory: "5Gi"
+              nvidia.com/gpu: "1"
             limits:
-              nvidia.com/gpu: "1"
+              cpu: "10"
+              memory: "5Gi"
+              nvidia.com/gpu: "1"
#!/usr/bin/env bash
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
# Pre-deployment check script for Dynamo
# This script verifies that the Kubernetes cluster has the necessary prerequisites
# before deploying Dynamo platform.
#
# Checks performed:
# 1. kubectl connectivity - Verifies kubectl is installed and can connect to cluster
# 2. Default StorageClass - Ensures a default StorageClass is configured
# 3. Cluster GPU Resources - Validates GPU nodes are available
# 4. GPU Operator - Confirms GPU operator is installed and running
set -e
# Colors for output
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
BLUE='\033[0;34m'
NC='\033[0m' # No Color
# Function to print colored output
print_status() {
local color=$1
local message=$2
echo -e "${color}${message}${NC}"
}
print_header() {
echo -e "\n${BLUE}========================================${NC}"
echo -e "${BLUE} Dynamo Pre-Deployment Check Script ${NC}"
echo -e "${BLUE}========================================${NC}\n"
}
print_section() {
echo -e "\n${BLUE}--- $1 ---${NC}"
}
# Function to check if kubectl is available and cluster is accessible
check_kubectl() {
print_section "Checking kubectl connectivity"
if ! command -v kubectl &> /dev/null; then
print_status $RED "❌ kubectl is not installed or not in PATH"
print_status $YELLOW "Please install kubectl and ensure it's in your PATH"
return 1
fi
if ! kubectl cluster-info &> /dev/null; then
print_status $RED "❌ Cannot connect to Kubernetes cluster"
print_status $YELLOW "Please ensure kubectl is configured to connect to your cluster"
return 1
fi
print_status $GREEN "✅ kubectl is available and cluster is accessible"
return 0
}
# Function to check for default storage class
check_default_storage_class() {
print_section "Checking for default StorageClass"
# Use JSONPath to find storage classes with the default annotation set to "true"
local default_storage_classes
default_storage_classes=$(kubectl get storageclass -o jsonpath='{range .items[?(@.metadata.annotations.storageclass\.kubernetes\.io/is-default-class=="true")]}{.metadata.name}{"\n"}{end}' 2>/dev/null || echo "")
if [[ -z "$default_storage_classes" ]]; then
print_status $RED "❌ No default StorageClass found"
print_status $YELLOW "\nDynamo requires a default StorageClass for persistent volume provisioning."
print_status $BLUE "Please follow the instructions below to configure a default StorageClass before proceeding with deployment.\n"
# Show available storage classes
print_status $BLUE "Available StorageClasses in your cluster:"
local all_storage_classes
all_storage_classes=$(kubectl get storageclass 2>/dev/null || echo "")
if [[ -z "$all_storage_classes" ]]; then
print_status $YELLOW " No StorageClasses found in the cluster"
else
echo -e "$all_storage_classes"
local all_storage_class_names
all_storage_class_names=$(kubectl get storageclass -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}' 2>/dev/null || echo "")
print_status $BLUE "\nTo set a StorageClass as default, use the following command:"
print_status $YELLOW "kubectl patch storageclass <storage-class-name> -p '{\"metadata\": {\"annotations\":{\"storageclass.kubernetes.io/is-default-class\":\"true\"}}}'"
if [[ -n "$all_storage_class_names" ]]; then
local first_sc_name
first_sc_name=$(echo "$all_storage_class_names" | head -n1)
print_status $BLUE "\nExample with your first available StorageClass:"
print_status $YELLOW "kubectl patch storageclass ${first_sc_name} -p '{\"metadata\": {\"annotations\":{\"storageclass.kubernetes.io/is-default-class\":\"true\"}}}'"
fi
fi
print_status $BLUE "\nFor more information on managing default StorageClasses, visit:"
print_status $BLUE "https://kubernetes.io/docs/tasks/administer-cluster/change-default-storage-class/"
return 1
else
print_status $GREEN "✅ Default StorageClass found"
while IFS= read -r sc_name; do
if [[ -n "$sc_name" ]]; then
local default_sc
default_sc=$(kubectl get storageclass "$sc_name" 2>/dev/null || echo "unknown")
print_status $GREEN " - ${default_sc}"
fi
done <<< "$default_storage_classes"
# Check if there are multiple default storage classes (which can cause issues)
local default_count
# grep -c already prints 0 when nothing matches; `|| true` only absorbs its non-zero exit status
default_count=$(echo "$default_storage_classes" | grep -c . || true)
if [[ $default_count -gt 1 ]]; then
print_status $YELLOW "⚠️ Warning: Multiple default StorageClasses detected"
print_status $YELLOW " This may cause unpredictable behavior. Consider having only one default StorageClass."
fi
return 0
fi
}
check_cluster_resources() {
print_section "Checking cluster GPU resources"
local node_count
node_count=$(kubectl get nodes -l nvidia.com/gpu.present=true -o name 2>/dev/null | wc -l || echo "0")
if [[ $node_count -eq 0 ]]; then
print_status $RED "❌ No GPU nodes found in the cluster"
print_status $YELLOW "Dynamo requires nodes with nvidia.com/gpu.present=true label."
print_status $BLUE "Please ensure your cluster has GPU-enabled nodes properly labeled."
return 1
else
print_status $GREEN "✅ Found ${node_count} GPU node(s) in the cluster"
return 0
fi
# Show basic node information (commented out for cleaner output)
# print_status $BLUE "GPU Node information:"
# kubectl get nodes -l nvidia.com/gpu.present=true -o custom-columns=NAME:.metadata.name,STATUS:.status.conditions[-1].type,ROLES:.metadata.labels.'node-role\.kubernetes\.io/.*',VERSION:.status.nodeInfo.kubeletVersion 2>/dev/null || true
}
check_gpu_operator() {
print_section "Checking GPU operator"
# Check if GPU operator pods exist and are running
local gpu_operator_pods
gpu_operator_pods=$(kubectl get pods -A -lapp=gpu-operator --no-headers 2>/dev/null || echo "")
if [[ -z "$gpu_operator_pods" ]]; then
print_status $RED "❌ GPU operator not found in the cluster"
print_status $YELLOW "Dynamo requires GPU operator to be installed and running."
print_status $BLUE "Please install GPU operator before proceeding with deployment."
return 1
fi
# Check if any GPU operator pods are running
local running_pods
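# grep -c prints 0 and exits non-zero when no pod is Running; appending another "0"
# via `|| echo` would yield "0\n0" and break the numeric comparisons below
running_pods=$(echo "$gpu_operator_pods" | grep -c "Running" || true)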
local total_pods
total_pods=$(echo "$gpu_operator_pods" | wc -l)
if [[ $running_pods -eq 0 ]]; then
print_status $RED "❌ GPU operator pods are not running"
print_status $YELLOW "Found $total_pods GPU operator pod(s) but none are in Running state:"
echo "$gpu_operator_pods"
return 1
elif [[ $running_pods -lt $total_pods ]]; then
print_status $YELLOW "⚠️ GPU operator partially running: $running_pods/$total_pods pods running"
echo "$gpu_operator_pods"
print_status $GREEN "✅ GPU operator is available (with warnings)"
return 0
else
print_status $GREEN "✅ GPU operator is running ($running_pods/$total_pods pods)"
return 0
fi
}
# Global variables to track check results (pipe-delimited strings rather than arrays, for compatibility)
CHECK_RESULTS=""
CHECK_ORDER=""
# Function to record check result
record_check_result() {
local check_name="$1"
local status="$2"
# Append to results string with delimiter
if [[ -z "$CHECK_RESULTS" ]]; then
CHECK_RESULTS="${check_name}:${status}"
CHECK_ORDER="${check_name}"
else
CHECK_RESULTS="${CHECK_RESULTS}|${check_name}:${status}"
CHECK_ORDER="${CHECK_ORDER}|${check_name}"
fi
}
# Function to get check result by name
get_check_result() {
local check_name="$1"
echo "$CHECK_RESULTS" | tr '|' '\n' | grep "^${check_name}:" | cut -d':' -f2
}
# Function to display check summary
display_check_summary() {
print_section "Pre-Deployment Check Summary"
local passed=0
local failed=0
# Split CHECK_ORDER by delimiter and iterate
IFS='|' read -ra CHECKS <<< "$CHECK_ORDER"
for check_name in "${CHECKS[@]}"; do
local status=$(get_check_result "$check_name")
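if [[ "$status" == "PASS" ]]; then
print_status $GREEN "✅ $check_name: PASSED"
# Plain arithmetic assignment: ((passed++)) returns non-zero when the old value is 0,
# which would abort the script under `set -e` outside a conditional context
passed=$((passed + 1))
else
print_status $RED "❌ $check_name: FAILED"
failed=$((failed + 1))
fi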
done
echo ""
print_status $BLUE "Summary: $passed passed, $failed failed"
if [[ $failed -eq 0 ]]; then
print_status $GREEN "🎉 All pre-deployment checks passed!"
print_status $GREEN "Your cluster is ready for Dynamo deployment."
return 0
else
print_status $RED "❌ $failed pre-deployment check(s) failed."
print_status $RED "Please address the issues above before proceeding with deployment."
return 1
fi
}
# Main execution
main() {
print_header
local overall_exit_code=0
# Run checks and capture results
if check_kubectl; then
record_check_result "kubectl Connectivity" "PASS"
else
record_check_result "kubectl Connectivity" "FAIL"
overall_exit_code=1
fi
if check_default_storage_class; then
record_check_result "Default StorageClass" "PASS"
else
record_check_result "Default StorageClass" "FAIL"
overall_exit_code=1
fi
if check_cluster_resources; then
record_check_result "Cluster GPU Resources" "PASS"
else
record_check_result "Cluster GPU Resources" "FAIL"
overall_exit_code=1
fi
if check_gpu_operator; then
record_check_result "GPU Operator" "PASS"
else
record_check_result "GPU Operator" "FAIL"
overall_exit_code=1
fi
# Display summary
echo ""
if ! display_check_summary; then
overall_exit_code=1
fi
exit $overall_exit_code
}
# Run the script
main "$@"
@@ -19,6 +19,11 @@ limitations under the License.
 High-level guide to Dynamo Kubernetes deployments. Start here, then dive into specific guides.
+## Pre-deployment Checks
+Before deploying the platform, run the pre-deployment checks to confirm that the cluster is ready. See the [pre-deployment checks](/deploy/cloud/pre-deployment/README.md) for details.
 ## 1. Install Platform First
 ```bash
......