Makes CUDA calls return error codes to simulate various GPU failures. Uses LD_PRELOAD to intercept CUDA library calls.
Intercepts CUDA calls to simulate GPU failures using LD_PRELOAD. Faults persist across pod restarts via hostPath volumes, enabling realistic hardware failure testing.
```
```
Pod calls cudaMalloc() → LD_PRELOAD intercepts → Returns error → Pod crashes
Pod calls cudaMalloc() → LD_PRELOAD intercepts → Checks /host-fault/cuda_fault_enabled → Returns error → Pod crashes
```
```
**Result**: Realistic GPU failure testing without hardware damage.
**Key Features**:
-**Persistent faults**: hostPath volume (`/var/lib/cuda-fault-test`) survives pod restarts on same node
-**Runtime toggle**: Enable/disable faults without pod restarts via `/host-fault/cuda_fault_enabled`
-**Node-specific**: Faults only on target node, healthy nodes unaffected
## Scope
## Scope
...
@@ -35,13 +38,20 @@ This library simulates **software/orchestration-level failures** that occur when
...
@@ -35,13 +38,20 @@ This library simulates **software/orchestration-level failures** that occur when