README.md 4.76 KB
Newer Older
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
# EPD Correctness Test

This test verifies that EPD (Encoder-Prefill-Decode) disaggregation produces identical outputs to a baseline single instance.

## What It Tests

- **Baseline**: Single vLLM instance serving a multimodal model
- **EPD (1E+1PD)**: 1 Encoder + 1 Prefill-Decode instance
- **Baseline (1P+1D)**: 1 Prefill + 1 Decode instance
- **EPD (1E+1P+1D)**: 1 Encoder + 1 Prefill + 1 Decode instance

The test ensures that disaggregated encoding produces **identical** outputs to the baseline.

Note that currently PD disaggregation set up may give slightly different results from a single instance. Therefore, we need the result from 1P+1D as the baseline for 1E+1P+1D

Please refer to [Disaggregated Encoder Feature](../../../docs/features/disagg_encoder.md) for the detailed explanation for the EPD features.

## Files

- `run_epd_correctness_test.sh` - Main test script (starts all instances and runs tests)
- `test_epd_correctness.py` - Python test script (compares outputs)

## Usage

### Multimodal Prompts (Default)

```bash
cd vllm
./tests/v1/ec_connector/integration/run_epd_correctness_test.sh
```

This runs the test with actual multimodal (image) prompts.

### Text-Only Prompts

```bash
cd vllm
USE_MM_PROMPTS=0 ./tests/v1/ec_connector/integration/run_epd_correctness_test.sh
```

This runs a quick test with text-only prompts to verify the setup works.

### Custom Configuration

```bash
# Use specific GPUs
GPU_E=0 GPU_PD=1 GPU_P=1 GPU_D=2 bash ./tests/v1/ec_connector/integration/run_epd_correctness_test.sh

# Use specific ports
ENDPOINT_PORT=10001 bash ./tests/v1/ec_connector/integration/run_epd_correctness_test.sh

# Use specific model
MODEL="Qwen/Qwen2.5-VL-3B-Instruct" bash ./tests/v1/ec_connector/integration/run_epd_correctness_test.sh

# Use specific storage path
EC_SHARED_STORAGE_PATH="/tmp/my_ec_cache" bash ./tests/v1/ec_connector/integration/run_epd_correctness_test.sh
```

## How It Works

### Step 1: Baseline

1. Start single vLLM instance on GPU
2. Run test prompts (multimodal or text-only)
3. Save outputs to `.vllm_epd_baseline.txt`
4. Shutdown instance

### Step 2: EPD (1E + 1PD)

1. Clear encoder cache storage
2. Start instances and proxy
3. Run same test prompts
4. Assert outputs match baseline exactly
5. Shutdown instances

### Step 3: EPD (1E + 1P + 1D)

1. Clear encoder cache storage
2. Start instances and proxy
3. Run same test prompts
4. Assert outputs match baseline exactly
5. Shutdown instances

## Test Scenarios

### Multimodal Prompts (--use_mm_prompts)

Tests encoder cache transfer:

- Single image query
- Multiple images in one request
- Mixed image and text
- Image with detailed questions

### Text-Only Prompts (default)

Quick sanity check:

- Simple text queries
- Text-only explanations
- Verifies proxy routing works

## Expected Behavior

### ✅ Test Passes When

- All disagg outputs match baseline outputs exactly
- No errors during instance startup
- Encoder cache is properly saved and loaded
- Proxy correctly routes requests

### ❌ Test Fails When

- Outputs differ between baseline and disagg
- Server startup fails
- Encoder cache not found (should fallback to local execution)
- Proxy routing errors

## Notes

- The test uses deterministic generation (`temperature=0.0`, `seed=42`)
- Encoder cache should enable exact output reproduction
- Test cleans up all instances and cache files after completion
- Safe to run multiple times (idempotent)
- We setup the PD disagg part with NixlConnector. Please read details about EPD in `examples/online_serving/disaggregated_encoder/README.md`

## Requirements

- Multiple GPUs (3 for 1E+1P+1D, 2 for 1E+1PD, 1 for baseline)
    - 1E+1P+1D is runnable with 2 GPU by assign E and P on the same GPU now.
- Multimodal model (e.g., Qwen2.5-VL-3B-Instruct)
- Internet access (for accessing vllm test images)

## Debugging

### Check Logs

Logs and baseline output are saved in `/tmp/` by default.
Can be customized by changing the environment variables.

### Check Encoder Cache

```bash
# Verify cache files are created
ls -la $EC_SHARED_STORAGE_PATH/

# Should see directories with mm_hash names
# Each containing encoder_cache.safetensors
```

### Manual Testing

Run individual components:

```bash
# Baseline only
python test_epd_correctness.py \
    --service_url http://localhost:8000 \
    --model_name Qwen/Qwen2.5-VL-3B-Instruct \
    --mode baseline \
    --baseline_file test_output.txt \
    --use_mm_prompts

# Disagg only (requires baseline output file!)
python test_epd_correctness.py \
    --service_url http://localhost:8000 \
    --model_name Qwen/Qwen2.5-VL-3B-Instruct \
    --mode disagg \
    --baseline_file test_output.txt \
    --use_mm_prompts
```