# Kubernetes utilities for Dynamo Benchmarking and Profiling

This directory contains utilities and manifests for Dynamo benchmarking and profiling workflows.

## Prerequisites

**Before using these utilities, you must first set up Dynamo Cloud following the main installation guide:**

👉 **[Follow the Dynamo Cloud installation guide](/docs/kubernetes/installation_guide.md) to install the Dynamo Kubernetes Platform first.**

This includes:
1. Installing the Dynamo CRDs
2. Installing the Dynamo Platform (operator, etcd, NATS)
3. Setting up your target namespace
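
Before exporting `NAMESPACE` for the steps below, a quick local check can catch a malformed name early. The helper below is a hypothetical sketch and is not part of `deploy/utils`. It only verifies that the value is a valid Kubernetes namespace name, which must be an RFC 1123 label: at most 63 characters, made of lowercase alphanumerics and `-`, and starting and ending with an alphanumeric. It does not check that the namespace exists or that the Dynamo platform is installed there.

```bash
#!/usr/bin/env bash
# Hypothetical pre-flight helper (not part of deploy/utils): returns 0 only if
# the argument is a valid RFC 1123 label, as Kubernetes requires for namespaces.
validate_namespace() {
  local ns="$1"
  # Must be non-empty and at most 63 characters long.
  if [[ -z "$ns" || ${#ns} -gt 63 ]]; then
    return 1
  fi
  # Lowercase alphanumerics and '-', starting and ending with an alphanumeric.
  [[ "$ns" =~ ^[a-z0-9]([a-z0-9-]*[a-z0-9])?$ ]]
}
```

For example, `validate_namespace "$NAMESPACE" && kubectl get namespace "$NAMESPACE"` runs the format check first and then asks the cluster whether the namespace actually exists.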

## Contents

- `setup_benchmarking_resources.sh` — Sets up benchmarking and profiling resources in your existing Dynamo namespace
- `manifests/`
  - `serviceaccount.yaml` — ServiceAccount `dynamo-sa` for benchmarking and profiling jobs
  - `role.yaml` — Role `dynamo-role` with necessary permissions
  - `rolebinding.yaml` — RoleBinding `dynamo-binding`
  - `pvc.yaml` — PVC `dynamo-pvc` for storing profiler results and configurations
  - `pvc-access-pod.yaml` — short‑lived pod for copying profiler results from the PVC
- `kubernetes.py` — helper used by tooling to apply/read resources (e.g., access pod for PVC downloads)
- `inject_manifest.py` — utility for injecting deployment configurations into the PVC for profiling
- `download_pvc_results.py` — utility for downloading benchmark/profiling results from the PVC
- `dynamo_deployment.py` — utilities for working with DynamoGraphDeployment resources
- `requirements.txt` — Python dependencies for benchmarking utilities

## Quick start

### Benchmarking Resource Setup

After setting up Dynamo Cloud, use `deploy/utils/setup_benchmarking_resources.sh` to prepare your namespace with the additional resources needed for benchmarking and profiling workflows.

The setup script creates a `dynamo-pvc` with `ReadWriteMany` (RWX) access. If your cluster's default StorageClass does not support RWX, set `storageClassName` in `deploy/utils/manifests/pvc.yaml` to an RWX-capable class before running the script.

Example (add under `spec` in `deploy/utils/manifests/pvc.yaml`):
```yaml
...
spec:
  accessModes:
  - ReadWriteMany
  storageClassName: <your-rwx-storageclass>
...
```

> [!TIP]
> **Check your cluster's storage classes**
>
> - List storage classes and provisioners:
> ```bash
> kubectl get sc -o wide
> ```
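
If you are not sure which class is the default, the hypothetical helper below wraps a `kubectl` JSONPath query that prints the StorageClass carrying the annotation `storageclass.kubernetes.io/is-default-class: "true"`, which is how Kubernetes marks the default class. The function name is an assumption for illustration only.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the name(s) of the cluster's default StorageClass.
# Kubernetes marks the default class with the annotation
# storageclass.kubernetes.io/is-default-class="true"; the dots in the
# annotation key are escaped so JSONPath treats it as a single key.
default_storage_class() {
  kubectl get storageclass -o jsonpath='{range .items[?(@.metadata.annotations.storageclass\.kubernetes\.io/is-default-class=="true")]}{.metadata.name}{"\n"}{end}'
}
```

Once you have a candidate class, `kubectl get storageclass <name> -o jsonpath='{.provisioner}'` shows its provisioner. Whether a class supports `ReadWriteMany` depends on that provisioner, so check the provisioner's documentation rather than relying on the class name.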

Then run the setup script:

```bash
export NAMESPACE=your-dynamo-namespace
export HF_TOKEN=your-hf-token  # Optional: for HuggingFace model access

deploy/utils/setup_benchmarking_resources.sh
```

This script applies the following manifests to your existing Dynamo namespace:

- `deploy/utils/manifests/serviceaccount.yaml` - ServiceAccount `dynamo-sa`
- `deploy/utils/manifests/role.yaml` - Role `dynamo-role`
- `deploy/utils/manifests/rolebinding.yaml` - RoleBinding `dynamo-binding`
- `deploy/utils/manifests/pvc.yaml` - PVC `dynamo-pvc`

If `HF_TOKEN` is provided, it also creates a secret for HuggingFace model access.

After running the setup script, verify the resources by checking:

```bash
kubectl get serviceaccount dynamo-sa -n $NAMESPACE
kubectl get pvc dynamo-pvc -n $NAMESPACE
```
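
`kubectl get pvc` reports a `STATUS` for the claim. A PVC stuck in `Pending` often means no volume with the requested access mode could be provisioned. However, if the StorageClass uses `volumeBindingMode: WaitForFirstConsumer`, the claim stays `Pending` until a pod mounts it, so `Pending` alone is not necessarily an error. The hypothetical helper below reads the phase through JSONPath and succeeds only when the claim is `Bound`. The function name is an assumption, and the PVC name defaults to `dynamo-pvc`.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the phase of the benchmarking PVC and
# return 0 only when it is Bound. Usage: pvc_is_bound <namespace> [pvc-name]
pvc_is_bound() {
  local ns="$1" pvc="${2:-dynamo-pvc}" phase
  # Declared separately above so a kubectl failure is not masked by 'local'.
  phase="$(kubectl get pvc "$pvc" -n "$ns" -o jsonpath='{.status.phase}')" || return 1
  echo "PVC $pvc in namespace $ns is ${phase:-<unknown>}"
  [[ "$phase" == "Bound" ]]
}
```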

### PVC Manipulation Scripts

These scripts interact with the Persistent Volume Claim (PVC) that stores configuration files and benchmark/profiling results. They're essential for the Dynamo benchmarking and profiling workflows.

#### Why These Scripts Are Needed

1. **For Pre-Deployment Profiling**: The profiling job needs access to your Dynamo deployment configurations (DGD manifests) to test different parallelization strategies
2. **For Retrieving Results**: Both benchmarking and profiling jobs write their results to the PVC, which you need to download for analysis

#### Script Usage

**Inject deployment configurations for profiling:**

```bash
# The profiling job reads your DGD config from the PVC
# IMPORTANT: All paths must start with /data/ for security reasons
python3 -m deploy.utils.inject_manifest \
  --namespace $NAMESPACE \
  --src ./my-disagg.yaml \
  --dest /data/configs/disagg.yaml
```
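
If you have several DGD manifests to profile, looping over the same command keeps them all under `/data/configs/`. This is a sketch. The function name and the local directory argument are hypothetical, and it uses only the `--namespace`, `--src`, and `--dest` flags shown above. Like the other commands in this README, it assumes you run it from the repository root so that `deploy.utils` is importable.

```bash
#!/usr/bin/env bash
# Hypothetical helper: inject every *.yaml in a local directory into
# /data/configs/ on the PVC, one inject_manifest call per file.
# Usage: inject_all_configs <namespace> <local-dir>
inject_all_configs() {
  local ns="$1" src_dir="$2" f
  for f in "$src_dir"/*.yaml; do
    [[ -e "$f" ]] || continue   # no matches: the glob stays literal, skip it
    python3 -m deploy.utils.inject_manifest \
      --namespace "$ns" \
      --src "$f" \
      --dest "/data/configs/$(basename "$f")" || return 1
  done
}
```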

**Download benchmark/profiling results:**

```bash
# After benchmarking or profiling completes, download results
python3 -m deploy.utils.download_pvc_results \
  --namespace $NAMESPACE \
  --output-dir ./pvc_files \
  --folder /data/results \
  --no-config   # optional: skip *.yaml/*.yml in the download
```
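
To see what was downloaded while excluding the same `*.yaml`/`*.yml` files that `--no-config` skips, a plain `find` over the output directory is enough. The helper name is hypothetical, and its default directory matches the `--output-dir ./pvc_files` used above.

```bash
#!/usr/bin/env bash
# Hypothetical helper: list downloaded result files, skipping *.yaml/*.yml
# (the same patterns --no-config leaves out), sorted for stable output.
# Usage: list_result_files [dir]   (defaults to ./pvc_files)
list_result_files() {
  local dir="${1:-./pvc_files}"
  find "$dir" -type f ! -name '*.yaml' ! -name '*.yml' | sort
}
```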

#### Path Requirements

**Important**: The PVC is mounted at `/data` in the access pod for security reasons. All destination paths must start with `/data/`.

**Common path patterns:**
- `/data/configs/` - Configuration files (DGD manifests)
- `/data/results/` - Benchmark results
- `/data/profiling_results/` - Profiling data
- `/data/benchmarking/` - Benchmarking artifacts

**User-friendly error messages**: If you forget the `/data/` prefix, the script will show a helpful error message with the correct path and example commands.
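
If you wrap these commands in your own automation, a client-side check can reject a bad path before either Python script is invoked. The helper below is a hypothetical sketch, not the scripts' own validation. It mirrors the `/data/` prefix rule above and also rejects `..` segments, which is an extra assumption of this sketch.

```bash
#!/usr/bin/env bash
# Hypothetical helper: accept only absolute paths under /data/ on the PVC.
# Also rejects '..' segments so a path cannot climb back out of /data.
is_pvc_path() {
  local p="$1"
  [[ "$p" == /data/* ]] || return 1
  # Wrap in slashes so '..' at the start or end is caught as a segment too.
  case "/$p/" in
    */../*) return 1 ;;
  esac
  return 0
}
```

For example, `is_pvc_path "/data/configs/disagg.yaml"` succeeds, while `is_pvc_path "configs/disagg.yaml"` fails because the `/data/` prefix is missing.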

#### Next Steps

For complete benchmarking and profiling workflows:
- **Benchmarking Guide**: See [docs/benchmarks/benchmarking.md](../../docs/benchmarks/benchmarking.md) for comparing DynamoGraphDeployments and external endpoints
- **Pre-Deployment Profiling**: See [docs/benchmarks/pre_deployment_profiling.md](../../docs/benchmarks/pre_deployment_profiling.md) for optimizing configurations before deployment

## Notes

- The profiling job manifest remains in `benchmarks/profiler/deploy/profile_sla_job.yaml` and relies on the ServiceAccount and PVC created by the setup script.
- This setup is focused on benchmarking and profiling resources only - the main Dynamo platform must be installed separately.