# NIXL Benchmark Documentation

This guide describes how to build and deploy the NIXL benchmark using the provided scripts on a Kubernetes (K8s) cluster.

> **Note**: The NIXL benchmark is part of the Dynamo platform. Before proceeding, confirm that your cluster meets the basic Dynamo requirements by running the pre-deployment check script in the parent directory (`../pre-deployment-check.sh`).

---

## Prerequisites

### Cluster Requirements
Before deploying the NIXL benchmark, run the pre-deployment check to confirm that your cluster meets the Dynamo platform requirements:

```bash
# Run from the parent directory
../pre-deployment-check.sh
```

This script verifies:
- `kubectl` connectivity and cluster access
- Default StorageClass presence
- GPU node availability (`nvidia.com/gpu.present=true` label)
- GPU Operator installation and status

### NIXL-Specific Requirements
In addition to the cluster requirements above, the NIXL benchmark requires:
- **Docker** installed and configured on your local machine (for building images)
- **Docker registry access** to push the built nixlbench images
- **ETCD service** deployed and accessible as `etcd:2379`
- **Build utilities**: `wget` and `unzip` for downloading NIXL source code
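
The local tool requirements above can be checked before starting a build. The following is a minimal sketch, not part of the provided scripts; the function name `check_tools` is hypothetical.

```bash
# Report any required tools that are missing from PATH.
# Returns 0 if every tool is found, 1 otherwise.
check_tools() {
  local missing=0 tool
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing: $tool" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Example: check_tools docker wget unzip && echo "all build tools found"
```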

### Verification Steps
1. **Run pre-deployment check** (recommended):
   ```bash
   ../pre-deployment-check.sh
   ```
   Ensure all checks pass before proceeding.

2. **Verify ETCD availability** (NIXL-specific):
   ```bash
   kubectl get svc etcd
   ```

3. **Confirm Docker registry access**:
   ```bash
   docker login your-registry.com  # if using private registry
   ```

---

## Quick Start

The quickest way to get started is the interactive build and deploy script:

```bash
./build_and_deploy.sh
```

This script provides a flexible workflow where you can:
1. **Select architecture**: Choose between x86_64 (Intel/AMD 64-bit) or aarch64 (ARM64)
2. **Choose which steps to execute**: Select any combination of:
   - Build nixlbench Docker image
   - Update deployment YAML file
   - Deploy to Kubernetes
3. **Provide Docker registry** (only when needed for building or updating deployment)

---

## Interactive Script Features

### Architecture Selection
The script supports two architectures:
- **Option 1**: x86_64 (Intel/AMD 64-bit)
- **Option 2**: aarch64 (ARM64)

You can select by entering:
- `1` or `x86_64` for x86_64 architecture
- `2` or `aarch64` for aarch64 architecture
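
To illustrate the accepted inputs, the mapping could be written as a small `case` statement. This is a sketch of the behavior described above, not the code in `build_and_deploy.sh`; the function name `normalize_arch` is hypothetical.

```bash
# Map a menu number or an architecture name to the canonical architecture string.
normalize_arch() {
  case "$1" in
    1|x86_64)  echo "x86_64" ;;
    2|aarch64) echo "aarch64" ;;
    *) echo "unsupported architecture: $1" >&2; return 1 ;;
  esac
}

# Example: arch=$(normalize_arch 2)
```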

### Step Selection
Choose which steps to execute by entering comma-separated numbers:

- **All steps**: `1,2,3`
- **Build and update only**: `1,2` (skips Kubernetes deployment)
- **Deploy only**: `3` (useful if the image is already built and the deployment file exists)
- **Build only**: `1` (useful when you only need the Docker image)
- **Update deployment only**: `2` (useful for pointing the deployment file at a new registry or version)
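
A comma-separated selection like the ones above can be validated with shell pattern matching alone. The sketch below assumes only steps 1 to 3 exist and tolerates spaces after commas; `validate_steps` is a hypothetical helper, and the actual script may validate input differently.

```bash
# Accept only comma-separated lists drawn from 1, 2, and 3 (e.g. "1,2,3" or "3").
# Prints the normalized list (spaces removed) on success; returns 1 on invalid input.
validate_steps() {
  local steps
  steps=$(printf '%s' "$1" | tr -d ' ')
  case "$steps" in
    ''|*[!1-3,]*|*[1-3][1-3]*|,*|*,|*,,*)
      echo "invalid step selection: $1" >&2
      return 1
      ;;
  esac
  printf '%s\n' "$steps"
}

# Example: steps=$(validate_steps "1, 2") || exit 1
```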

### Smart Registry Prompting
The script only prompts for Docker registry information when needed:
- **Steps 1 or 2**: Registry required for building image or updating deployment
- **Step 3 only**: No registry prompt (uses existing deployment file)
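
The prompting rule above reduces to one check: is step 1 or step 2 in the selection? A minimal sketch of that check, with the hypothetical name `needs_registry`:

```bash
# Return 0 if the comma-separated step list in $1 includes step 1 or step 2,
# i.e. a registry is needed; return 1 for a deploy-only selection.
needs_registry() {
  local steps
  steps=$(printf '%s' "$1" | tr -d ' ')
  case ",$steps," in
    *,1,*|*,2,*) return 0 ;;
    *) return 1 ;;
  esac
}

# Example:
#   if needs_registry "$steps"; then read -r -p "Docker registry: " registry; fi
```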

---

## What Each Step Does

### Step 1: Build nixlbench Docker Image
- Downloads NIXL source code (version 0.6.0) from GitHub
- Extracts and navigates to the build directory
- Pauses for user confirmation before building
- Builds Docker image with specified registry and architecture
- Tags image as: `{registry}/nixlbench:0.6.0-{arch}`
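
The tag format above can be reproduced when you need to pull, push, or inspect the image by hand. This sketch assumes version 0.6.0 as stated above; the helper `image_ref` and the variable `NIXL_VERSION` are illustrative names, not identifiers from the script.

```bash
NIXL_VERSION="0.6.0"

# Build the full image reference {registry}/nixlbench:{version}-{arch}.
# A trailing slash on the registry is dropped so the result has no "//".
image_ref() {
  local registry="${1%/}" arch="$2"
  printf '%s/nixlbench:%s-%s\n' "$registry" "$NIXL_VERSION" "$arch"
}

# Example: docker push "$(image_ref docker.io/myusername x86_64)"
```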

### Step 2: Update Deployment YAML File
- Copies base deployment template (`nixlbench-deployment.yaml`)
- Creates architecture-specific deployment file (`nixlbench-deployment-{arch}.yaml`)
- Updates image reference with your registry and architecture
- Preserves all other deployment configurations
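
If you need to reproduce this step manually, a single `sed` substitution is enough. The sketch below assumes the template contains the placeholder image `my-registry/nixlbench:version-arch` and that the registry string contains no `|` or `&` characters; `render_deployment` is a hypothetical name, and the script's own implementation may differ.

```bash
# Copy the base template to an architecture-specific file, replacing only
# the placeholder image reference. Prints the name of the generated file.
render_deployment() {
  local registry="$1" arch="$2" version="${3:-0.6.0}"
  local template="nixlbench-deployment.yaml"
  local output="nixlbench-deployment-${arch}.yaml"
  if [ ! -f "$template" ]; then
    echo "template not found: $template" >&2
    return 1
  fi
  # '|' is used as the sed delimiter because registry paths contain '/'.
  sed "s|my-registry/nixlbench:version-arch|${registry}/nixlbench:${version}-${arch}|g" \
    "$template" > "$output"
  echo "$output"
}

# Example: render_deployment docker.io/myusername x86_64
```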

### Step 3: Deploy to Kubernetes
- Validates deployment file exists
- Applies deployment to Kubernetes cluster
- Provides monitoring commands for checking status
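
The equivalent manual flow is a file check followed by `kubectl apply`. A minimal sketch, assuming the generated file is in the current directory; the `deploy` function and the `KUBECTL` override variable are hypothetical conveniences.

```bash
# Allow the kubectl binary to be overridden, e.g. KUBECTL=echo for a dry run.
KUBECTL="${KUBECTL:-kubectl}"

# Apply the architecture-specific deployment file, failing early if it is missing.
deploy() {
  local file="nixlbench-deployment-$1.yaml"
  if [ ! -f "$file" ]; then
    echo "deployment file not found: $file (run step 2 first)" >&2
    return 1
  fi
  "$KUBECTL" apply -f "$file"
}

# Example: deploy x86_64
```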

---

## Deployment Configuration

The deployment creates:
- **2 replicas** of the nixlbench pod
- **Resource requests/limits**:
  - CPU: 10 cores
  - Memory: 5Gi
  - GPU: 1 NVIDIA GPU per pod
- **Environment variables**:
  - `ETCD_ENDPOINTS`: Points to `etcd:2379`
- **Command**: Runs nixlbench with VRAM segments and keeps the container alive

---

## Usage Examples

### Example 1: Complete Workflow
```bash
./build_and_deploy.sh
# Select: 1 (x86_64)
# Steps: 1,2,3
# Registry: docker.io/myusername
# Confirm: y
```

### Example 2: Build Image Only
```bash
./build_and_deploy.sh
# Select: 2 (aarch64)
# Steps: 1
# Registry: my-private-registry.com
# Confirm: y
```

### Example 3: Deploy Existing Image
```bash
./build_and_deploy.sh
# Select: 1 (x86_64)
# Steps: 3
# Confirm: y
```

### Example 4: Update Deployment File Only
```bash
./build_and_deploy.sh
# Select: 2 (aarch64)
# Steps: 2
# Registry: new-registry.com
# Confirm: y
```

---

## Generated Files

The script creates architecture-specific deployment files:
- `nixlbench-deployment-x86_64.yaml` - For x86_64 builds
- `nixlbench-deployment-aarch64.yaml` - For aarch64 builds

These files are customized versions of the base template with your specific:
- Docker registry
- Image tag
- Architecture

---

## Monitoring Your Deployment

After deployment, monitor your NIXL benchmark:

```bash
# Check pod status
kubectl get pods -l app=nixl-benchmark

# View logs
kubectl logs -l app=nixl-benchmark -f

# Check resource usage
kubectl top pods -l app=nixl-benchmark

# Get detailed pod information
kubectl describe pods -l app=nixl-benchmark
```

If deployed to a specific namespace:
```bash
kubectl get pods -l app=nixl-benchmark -n your-namespace
kubectl logs -l app=nixl-benchmark -f -n your-namespace
```
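
In automation, it can be more convenient to block until the rollout finishes than to poll pod status. The sketch below wraps `kubectl rollout status` and assumes the deployment is named `nixl-benchmark`, as in the debug commands later in this guide; `wait_for_rollout` and the default timeout of 300 seconds are illustrative choices.

```bash
KUBECTL="${KUBECTL:-kubectl}"

# Wait until the nixl-benchmark deployment has rolled out, or fail on timeout.
# First argument: timeout (default 300s). Remaining arguments are passed to kubectl.
wait_for_rollout() {
  local timeout="300s"
  if [ "$#" -gt 0 ]; then
    timeout="$1"
    shift
  fi
  "$KUBECTL" rollout status deployment/nixl-benchmark --timeout="$timeout" "$@"
}

# Example: wait_for_rollout 600s -n your-namespace
```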

---


## Troubleshooting

### Cluster-Level Issues
For cluster-related problems, first run the pre-deployment check to identify issues:

```bash
../pre-deployment-check.sh
```

This will help diagnose:
- kubectl connectivity problems
- Missing default StorageClass
- GPU node availability issues
- GPU Operator status problems

### NIXL-Specific Issues

1. **ETCD Connection**:
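   - Ensure the etcd service is running: `kubectl get svc etcd` (the deployment expects `etcd:2379`; if your Dynamo platform install names the service differently, for example `dynamo-platform-etcd`, check that name instead)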
   - Verify etcd endpoints are accessible from pods
   - Check if etcd is in the correct namespace

2. **Image Pull Issues**:
   - Verify registry credentials are configured
   - Check image exists: `docker pull {registry}/nixlbench:0.6.0-{arch}`
   - Ensure image was pushed successfully after build

3. **Build Failures**:
   - Ensure Docker daemon is running
   - Check available disk space in `/tmp`
   - Verify network connectivity to GitHub
   - Confirm build utilities are installed: `which wget unzip`

4. **Deployment File Not Found**:
   - Run step 2 to create deployment file before step 3
   - Check file permissions in script directory
   - Verify script directory path is correct
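
For the ETCD connection checks in item 1, a TCP probe run from inside a pod (for example through `kubectl exec`) shows whether the endpoint is reachable at all. This is a sketch under two assumptions: the container image provides `bash` and the coreutils `timeout` command. The function name `probe_endpoint` is hypothetical.

```bash
# Probe a host:port endpoint over TCP using bash's /dev/tcp redirection.
# Defaults to the etcd endpoint used by the deployment.
probe_endpoint() {
  local endpoint="${1:-etcd:2379}"
  local host="${endpoint%:*}" port="${endpoint##*:}"
  if timeout 3 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null; then
    echo "reachable: ${host}:${port}"
  else
    echo "unreachable: ${host}:${port}" >&2
    return 1
  fi
}

# Example: probe_endpoint "$ETCD_ENDPOINTS"
```

A successful probe only confirms that the port accepts connections; it does not verify that etcd itself is healthy.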

### Debug Commands
```bash
# Check script-generated files
ls -la nixlbench-deployment-*.yaml

# Verify deployment status
kubectl get deployment nixl-benchmark -o yaml

# Check events for issues
kubectl get events --sort-by=.metadata.creationTimestamp
```

### Cleanup

To remove the deployment:
```bash
kubectl delete deployment nixl-benchmark
```

Or if deployed to a specific namespace:
```bash
kubectl delete deployment nixl-benchmark -n your-namespace
```

To clean up generated files:
```bash
rm -f nixlbench-deployment-*.yaml
```

---

## Script Reference

### build_and_deploy.sh
Interactive script that provides a flexible build and deployment workflow:
- **Architecture selection**: x86_64 or aarch64
- **Step selection**: Choose any combination of build, update, deploy
- **Validation**: Checks for deployment files before deploying

### nixlbench-deployment.yaml
Base Kubernetes deployment template that gets customized by the script:
- **Template image**: `my-registry/nixlbench:version-arch`
- **Resource allocation**: 10 CPU, 5Gi memory, 1 GPU per pod
- **ETCD integration**: Pre-configured environment variables
- **Benchmark command**: Runs with VRAM segment configuration