---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
title: Testing
---

# Fault Tolerance Testing

This document describes the test infrastructure for validating Dynamo's fault tolerance mechanisms. The testing framework supports request cancellation, migration, etcd HA, and hardware fault injection scenarios.

## Overview

Dynamo's fault tolerance test suite is located in `tests/fault_tolerance/` and includes:

| Test Category | Location | Purpose |
|---------------|----------|---------|
| Cancellation | `cancellation/` | Request cancellation during in-flight operations |
| Migration | `migration/` | Request migration when workers fail |
| etcd HA | `etcd_ha/` | etcd failover and recovery |
| Hardware | `hardware/` | GPU and network fault injection |
| Deployment | `deploy/` | End-to-end deployment testing |

## Test Directory Structure

```
tests/fault_tolerance/
├── cancellation/
│   ├── test_vllm.py
│   ├── test_trtllm.py
│   ├── test_sglang.py
│   └── utils.py
├── migration/
│   ├── test_vllm.py
│   ├── test_trtllm.py
│   ├── test_sglang.py
│   └── utils.py
├── etcd_ha/
│   ├── test_vllm.py
│   ├── test_trtllm.py
│   ├── test_sglang.py
│   └── utils.py
├── hardware/
│   └── fault_injection_service/
│       ├── api_service/
│       └── agents/
├── deploy/
│   ├── test_deployment.py
│   ├── scenarios.py
│   ├── base_checker.py
│   └── ...
└── client.py
```

## Request Cancellation Tests

These tests verify that in-flight requests can be canceled cleanly.

### Running Cancellation Tests

```bash
# Run all cancellation tests
pytest tests/fault_tolerance/cancellation/ -v

# Run for specific backend
pytest tests/fault_tolerance/cancellation/test_vllm.py -v
```

### Cancellation Test Utilities

The `cancellation/utils.py` module provides:

#### CancellableRequest

Thread-safe request cancellation via TCP socket manipulation:

```python
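import time
from threading import Thread

from tests.fault_tolerance.cancellation.utils import CancellableRequest

request = CancellableRequest()

# Send request in separate thread (send_request is your own function that
# issues the HTTP call and accepts the CancellableRequest)
thread = Thread(target=send_request, args=(request,))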
thread.start()

# Cancel after some time
time.sleep(1)
request.cancel()  # Closes underlying socket
```
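
A minimal sketch of how socket-based cancellation can be made thread-safe, assuming the request object only needs to hold a reference to the socket carrying the request. `SocketCanceller`, `attach`, and `cancelled` are hypothetical names used for illustration; this is not the actual `CancellableRequest` implementation.

```python
import socket
import threading


class SocketCanceller:
    """Hypothetical stand-in: holds a socket and closes it on cancel()."""

    def __init__(self):
        self._lock = threading.Lock()
        self._sock = None
        self._cancelled = False

    def attach(self, sock):
        # Called by the sending thread once the connection exists.
        with self._lock:
            if self._cancelled:
                sock.close()  # cancel() arrived first: close immediately
            else:
                self._sock = sock

    def cancel(self):
        # Safe to call from any thread, any number of times.
        with self._lock:
            self._cancelled = True
            if self._sock is not None:
                self._sock.close()
                self._sock = None

    @property
    def cancelled(self):
        with self._lock:
            return self._cancelled


# Demonstrate with a local socket pair instead of a real HTTP connection.
client_sock, server_sock = socket.socketpair()
canceller = SocketCanceller()
canceller.attach(client_sock)
canceller.cancel()
peer_saw_eof = server_sock.recv(1) == b""  # a closed peer reads as EOF
server_sock.close()
```

The lock covers both `attach` and `cancel`, so a cancellation that races with connection setup still closes the socket exactly once.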

#### send_completion_request / send_chat_completion_request

Send cancellable completion requests:

```python
from tests.fault_tolerance.cancellation.utils import (
    send_completion_request,
    send_chat_completion_request
)

# Non-streaming
response = send_completion_request(
    base_url="http://localhost:8000",
    model="Qwen/Qwen3-0.6B",
    prompt="Hello, world!",
    max_tokens=100
)

# Streaming with cancellation (request is a CancellableRequest instance)
responses = send_chat_completion_request(
    base_url="http://localhost:8000",
    model="Qwen/Qwen3-0.6B",
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True,
    cancellable_request=request
)
```

#### poll_for_pattern

Wait for specific patterns in logs:

```python
from tests.fault_tolerance.cancellation.utils import poll_for_pattern

# Wait for cancellation confirmation
found = poll_for_pattern(
    log_file="/var/log/dynamo/worker.log",
    pattern="Request cancelled",
    timeout=30,
    interval=0.5
)
```
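
For readers who want to see the polling logic spelled out, here is a minimal sketch of a log poller with the same parameters as the call above. The function name `poll_file_for_pattern` and its body are assumptions for illustration; the real `poll_for_pattern` may differ, for example by reading incrementally.

```python
import re
import time


def poll_file_for_pattern(log_file, pattern, timeout=30.0, interval=0.5):
    """Return True once `pattern` matches the file contents, False on timeout."""
    regex = re.compile(pattern)
    deadline = time.monotonic() + timeout
    while True:
        try:
            with open(log_file, "r", errors="replace") as f:
                if regex.search(f.read()):
                    return True
        except FileNotFoundError:
            pass  # the log may not exist yet; keep polling
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

Using `time.monotonic()` for the deadline keeps the timeout correct even if the wall clock is adjusted during the test.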

## Migration Tests

These tests verify that requests migrate to healthy workers when a worker fails.

### Running Migration Tests

```bash
# Run all migration tests
pytest tests/fault_tolerance/migration/ -v

# Run for specific backend
pytest tests/fault_tolerance/migration/test_vllm.py -v
```

### Migration Test Utilities

The `migration/utils.py` module provides:

- Frontend wrapper with configurable request planes
- Long-running request spawning for migration scenarios
- Health check disabling for controlled testing

### Example Migration Test

```python
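def test_migration_on_worker_failure():
    # start_deployment, spawn_long_request, and kill_worker are not defined in
    # this snippet; they represent test helpers and are shown for illustration.

    # Start deployment with 2 workers
    deployment = start_deployment(workers=2)

    # Send long-running request
    request_thread = spawn_long_request(max_tokens=1000)

    # Kill one worker mid-generation
    kill_worker(deployment.workers[0])

    # Verify request completes on remaining worker
    # (assumes join() returns the response, unlike threading.Thread.join())
    response = request_thread.join()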
    assert response.status_code == 200
    assert len(response.tokens) > 0
```

## etcd HA Tests

These tests exercise system behavior during etcd failures and recovery.

### Running etcd HA Tests

```bash
pytest tests/fault_tolerance/etcd_ha/ -v
```

### Test Scenarios

- **Leader failover**: etcd leader node fails, cluster elects new leader
- **Network partition**: etcd node becomes unreachable
- **Recovery**: System recovers after etcd becomes available again

## Hardware Fault Injection

The fault injection service enables testing under simulated hardware failures.

### Fault Injection Service

Located at `tests/fault_tolerance/hardware/fault_injection_service/`, this FastAPI service orchestrates fault injection:

```bash
# Start the fault injection service
cd tests/fault_tolerance/hardware/fault_injection_service
python -m api_service.main
```

### Supported Fault Types

#### GPU Faults

| Fault Type | Description |
|------------|-------------|
| `XID_ERROR` | Simulate GPU XID error (various codes) |
| `THROTTLE` | GPU thermal throttling |
| `MEMORY_PRESSURE` | GPU memory exhaustion |
| `OVERHEAT` | GPU overheating condition |
| `COMPUTE_OVERLOAD` | GPU compute saturation |

#### Network Faults

| Fault Type | Description |
|------------|-------------|
| `FRONTEND_WORKER` | Partition between frontend and workers |
| `WORKER_NATS` | Partition between workers and NATS |
| `WORKER_WORKER` | Partition between workers |
| `CUSTOM` | Custom network partition |

### Fault Injection API

#### Inject GPU Fault

```bash
curl -X POST http://localhost:8080/api/v1/faults/gpu/inject \
  -H "Content-Type: application/json" \
  -d '{
    "target_pod": "vllm-worker-0",
    "fault_type": "XID_ERROR",
    "severity": "HIGH"
  }'
```

#### Inject Specific XID Error

```bash
# Inject XID 79 (GPU has fallen off the bus)
curl -X POST http://localhost:8080/api/v1/faults/gpu/inject/xid-79 \
  -H "Content-Type: application/json" \
  -d '{"target_pod": "vllm-worker-0"}'
```

Supported XID codes: 43, 48, 74, 79, 94, 95, 119, 120
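
A test helper can reject unsupported codes before calling the service. The sketch below assumes every supported code uses the same `xid-<code>` path shown for XID 79; `xid_inject_url` is a hypothetical helper name.

```python
SUPPORTED_XID_CODES = frozenset({43, 48, 74, 79, 94, 95, 119, 120})


def xid_inject_url(base_url, xid_code):
    """Build the injection URL for a supported XID code (path pattern assumed)."""
    if xid_code not in SUPPORTED_XID_CODES:
        supported = ", ".join(str(c) for c in sorted(SUPPORTED_XID_CODES))
        raise ValueError(f"unsupported XID {xid_code}; supported: {supported}")
    return f"{base_url.rstrip('/')}/api/v1/faults/gpu/inject/xid-{xid_code}"
```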

#### Inject Network Partition

```bash
curl -X POST http://localhost:8080/api/v1/faults/network/inject \
  -H "Content-Type: application/json" \
  -d '{
    "partition_type": "FRONTEND_WORKER",
    "duration_seconds": 30
  }'
```

#### Recover from Fault

```bash
curl -X POST http://localhost:8080/api/v1/faults/{fault_id}/recover
```

#### List Active Faults

```bash
curl http://localhost:8080/api/v1/faults
```
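
The same calls can be issued from Python test code. This sketch uses only the standard library and mirrors the `curl` examples above; the helper names are hypothetical, and the assumption that responses are JSON is not confirmed by this document.

```python
import json
import urllib.request

FAULT_API = "http://localhost:8080/api/v1/faults"


def build_gpu_fault_request(target_pod, fault_type, severity, api=FAULT_API):
    """Build (but do not send) the POST request for a GPU fault."""
    body = json.dumps(
        {"target_pod": target_pod, "fault_type": fault_type, "severity": severity}
    ).encode("utf-8")
    return urllib.request.Request(
        f"{api}/gpu/inject",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def build_recover_request(fault_id, api=FAULT_API):
    """Build the POST request that recovers a single fault."""
    return urllib.request.Request(f"{api}/{fault_id}/recover", data=b"", method="POST")


def send(request, timeout=10):
    """Send a request and decode the body, assuming a JSON response."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Separating request construction from `send` lets unit tests check URLs and payloads without a running service.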

### GPU Fault Injector Agent

The GPU fault injector runs as a DaemonSet on worker nodes:

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: gpu-fault-injector
spec:
  selector:
    matchLabels:
      app: gpu-fault-injector
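  template:
    metadata:
      labels:
        app: gpu-fault-injector  # must match spec.selector.matchLabels
    spec:
      containers:
      - name: agent
        image: dynamo/gpu-fault-injector:latest
        securityContext:
          privileged: true
        volumeMounts:
        - name: dev
          mountPath: /dev
      volumes:
      - name: dev
        hostPath:
          path: /dev  # assumed: host /dev, so the agent can write to /dev/kmsg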
```

The agent injects fake XID messages via `/dev/kmsg` to trigger NVSentinel detection.

## Deployment Testing Framework

The `deploy/` directory contains an end-to-end testing framework.

### Test Phases

Tests run through three phases:

| Phase | Description |
|-------|-------------|
| `STANDARD` | Baseline performance under normal conditions |
| `OVERFLOW` | System behavior during fault/overload |
| `RECOVERY` | System recovery after fault resolution |
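
One way to reason about the phases is to bucket each request by when it was sent relative to the fault. How the framework actually assigns requests to phases is not described here; the sketch below assumes a fault start time and a fault resolution time are known, and `Phase` and `phase_for` are hypothetical names.

```python
from enum import Enum


class Phase(Enum):
    STANDARD = "STANDARD"
    OVERFLOW = "OVERFLOW"
    RECOVERY = "RECOVERY"


def phase_for(timestamp, fault_start, fault_end):
    """Classify a request timestamp relative to the fault window (assumed rule)."""
    if timestamp < fault_start:
        return Phase.STANDARD
    if timestamp < fault_end:
        return Phase.OVERFLOW
    return Phase.RECOVERY
```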

### Scenario Configuration

Define test scenarios in `scenarios.py`:

```python
from tests.fault_tolerance.deploy.scenarios import Scenario, Load, Failure

scenario = Scenario(
    name="worker_failure_migration",
    backend="vllm",
    load=Load(
        clients=10,
        requests_per_client=100,
        max_tokens=256
    ),
    failure=Failure(
        type="pod_kill",
        target="vllm-worker-0",
        trigger_after_requests=50
    )
)
```

### Running Deployment Tests

```bash
# Run all deployment tests
pytest tests/fault_tolerance/deploy/test_deployment.py -v

# Run specific scenario
pytest tests/fault_tolerance/deploy/test_deployment.py::test_worker_failure -v
```

### Validation Checkers

The framework includes pluggable validators:

```python
from tests.fault_tolerance.deploy.base_checker import BaseChecker, ValidationContext

class MigrationChecker(BaseChecker):
    def check(self, context: ValidationContext) -> bool:
        # Verify migrations occurred
        migrations = context.metrics.get("migrations_total", 0)
        return migrations > 0
```
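
To show how several checkers might be combined, here is a self-contained sketch that uses stand-in versions of `BaseChecker` and `ValidationContext`. The real classes in `base_checker.py` may carry more fields and methods; `SuccessRateChecker`, `run_checkers`, and the `success_rate` metric key are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ValidationContext:
    """Stand-in for the real context; only the metrics dict is modeled."""
    metrics: dict = field(default_factory=dict)


class BaseChecker:
    """Stand-in base class: subclasses implement check()."""

    def check(self, context):
        raise NotImplementedError


class MigrationChecker(BaseChecker):
    def check(self, context):
        return context.metrics.get("migrations_total", 0) > 0


class SuccessRateChecker(BaseChecker):
    def __init__(self, minimum):
        self.minimum = minimum

    def check(self, context):
        return context.metrics.get("success_rate", 0.0) >= self.minimum


def run_checkers(checkers, context):
    """Return the class names of the checkers that failed."""
    return [type(c).__name__ for c in checkers if not c.check(context)]
```

Returning the names of failed checkers, rather than a single boolean, makes it easier to report which validation broke.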

### Results Parsing

Parse test results for analysis:

```python
from tests.fault_tolerance.deploy.parse_results import process_overflow_recovery_test

results = process_overflow_recovery_test(log_dir="/path/to/logs")
print(f"Success rate: {results['success_rate']}")
print(f"P99 latency: {results['p99_latency_ms']}ms")
```
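
As an illustration of what the two reported numbers mean, the following sketch computes a success rate and a nearest-rank P99 latency from per-request records. The record layout (`ok`, `latency_ms`) and the `summarize` function are assumptions; they do not describe the internals of `process_overflow_recovery_test`.

```python
def summarize(records):
    """records: list of dicts with 'ok' (bool) and 'latency_ms' (number)."""
    if not records:
        return {"success_rate": 0.0, "p99_latency_ms": None}
    successes = [r for r in records if r["ok"]]
    latencies = sorted(r["latency_ms"] for r in successes)
    p99 = None
    if latencies:
        # Nearest-rank percentile: 1-based rank = ceil(0.99 * n), in integers.
        rank = (99 * len(latencies) + 99) // 100
        p99 = latencies[rank - 1]
    return {
        "success_rate": len(successes) / len(records),
        "p99_latency_ms": p99,
    }
```

Here the latency percentile is taken over successful requests only, which is one common convention; including failed requests would change the result.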

## Client Utilities

The `client.py` module provides shared client functionality:

### Multi-Threaded Load Generation

```python
from tests.fault_tolerance.client import client

# Generate load with multiple clients
results = client(
    base_url="http://localhost:8000",
    num_clients=10,
    requests_per_client=100,
    model="Qwen/Qwen3-0.6B",
    max_tokens=256,
    log_dir="/tmp/test_logs"
)
```

### Request Options

| Parameter | Description |
|-----------|-------------|
| `base_url` | Frontend URL |
| `num_clients` | Number of concurrent clients |
| `requests_per_client` | Requests per client |
| `model` | Model name |
| `max_tokens` | Max tokens per request |
| `log_dir` | Directory for client logs |
| `endpoint` | `completions` or `chat/completions` |
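
The `num_clients` and `requests_per_client` parameters describe a fan-out that can be sketched with a thread pool. This is not the implementation of `client()`; `run_load` and `fake_send` are hypothetical, and the send function is injected so the example runs without a frontend.

```python
from concurrent.futures import ThreadPoolExecutor


def run_load(send_one, num_clients, requests_per_client):
    """Call send_one(client_id, request_id) from num_clients worker threads."""

    def client_loop(client_id):
        # Each client issues its requests sequentially.
        return [send_one(client_id, i) for i in range(requests_per_client)]

    with ThreadPoolExecutor(max_workers=num_clients) as pool:
        # pool.map preserves client order in the returned list.
        return list(pool.map(client_loop, range(num_clients)))


def fake_send(client_id, request_id):
    # Placeholder for a real HTTP call to the frontend.
    return {"client": client_id, "request": request_id, "ok": True}


results = run_load(fake_send, num_clients=4, requests_per_client=3)
```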

## Running the Full Test Suite

### Prerequisites

1. Kubernetes cluster with GPU nodes
2. Dynamo deployment
3. etcd cluster (for HA tests)
4. Fault injection service (for hardware tests)

### Environment Setup

```bash
export KUBECONFIG=/path/to/kubeconfig
export DYNAMO_NAMESPACE=dynamo-test
export FRONTEND_URL=http://localhost:8000
```

### Run All Tests

```bash
# Install test dependencies
pip install pytest pytest-asyncio

# Run all fault tolerance tests
pytest tests/fault_tolerance/ -v --tb=short

# Run with specific markers
pytest tests/fault_tolerance/ -v -m "not slow"
```

### Test Markers

| Marker | Description |
|--------|-------------|
| `slow` | Long-running tests (> 5 minutes) |
| `gpu` | Requires GPU resources |
| `k8s` | Requires Kubernetes cluster |
| `etcd_ha` | Requires multi-node etcd |

## Best Practices

### 1. Isolate Test Environments

Run fault tolerance tests in dedicated namespaces:

```bash
kubectl create namespace dynamo-fault-test
```

### 2. Clean Up After Tests

Make sure every injected fault is recovered before the next test run:

```bash
# List and recover all active faults
curl http://localhost:8080/api/v1/faults | jq -r '.[].id' | \
  xargs -I {} curl -X POST http://localhost:8080/api/v1/faults/{}/recover
```

### 3. Collect Logs

Preserve logs for debugging:

```bash
pytest tests/fault_tolerance/ -v \
  --log-dir=/tmp/fault_test_logs \
  --capture=no
```

### 4. Monitor During Tests

Watch system state during tests:

```bash
# Terminal 1: Watch pods
watch kubectl get pods -n dynamo-test

# Terminal 2: Watch metrics
watch 'curl -s localhost:8000/metrics | grep -E "(migration|rejection)"'
```

## Related Documentation

- [Request Migration](request-migration.md) - Migration implementation details
- [Request Cancellation](request-cancellation.md) - Cancellation implementation
- [Health Checks](../observability/health-checks.md) - Health monitoring
- [Metrics](../observability/metrics.md) - Available metrics for monitoring