# LMCache Dynamo MMLU Testing Suite

## Overview
Tests the correctness of Dynamo's integration with LMCache by comparing MMLU benchmark results with and without LMCache enabled.

## Testing Principle
Compare MMLU test results under two configurations:
- **Baseline Test**: Dynamo without LMCache (`ENABLE_LMCACHE=0`)
- **LMCache Test**: Dynamo with LMCache enabled (`ENABLE_LMCACHE=1`)

If both configurations produce the same inference results, LMCache is not changing the model's outputs, which indicates that the integration works correctly.
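The comparison can be thought of as a per-question agreement check. This is a minimal sketch, assuming each run yields one predicted answer letter per question ID; the function name `answer_agreement` and the dict-based input format are hypothetical and not taken from the test scripts.

```python
from typing import Dict, List, Tuple


def answer_agreement(
    baseline: Dict[str, str], lmcache: Dict[str, str]
) -> Tuple[float, List[str]]:
    """Return the fraction of shared questions with identical answers,
    plus the IDs of the questions whose answers differ.

    Both arguments map a question ID to a predicted answer letter
    (hypothetical format, for illustration only).
    """
    shared = sorted(set(baseline) & set(lmcache))
    if not shared:
        raise ValueError("no questions in common between the two runs")
    mismatches = [qid for qid in shared if baseline[qid] != lmcache[qid]]
    return 1 - len(mismatches) / len(shared), mismatches


if __name__ == "__main__":
    base = {"q1": "A", "q2": "C", "q3": "B"}
    cached = {"q1": "A", "q2": "C", "q3": "D"}
    rate, diff = answer_agreement(base, cached)
    print(f"agreement={rate:.2%}, mismatched={diff}")
```

Listing the mismatched IDs, not just an aggregate score, makes it easier to inspect the individual prompts where the cached and uncached paths diverge.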

## Quick Start

### Prerequisites
1. Ensure Dynamo and its dependencies are installed, and that NATS and etcd are running
2. Download the MMLU dataset into the `data` directory
3. Ensure the HuggingFace models are accessible

### Download MMLU Dataset

```bash
cd ./tests/lmcache

# Auto-download and organize data
python3 download_mmlu.py
```
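In the original MMLU release, each subject is stored as a header-less CSV file (for example `data/test/abstract_algebra_test.csv`) whose rows hold a question, four choices, and the correct letter. Whether `download_mmlu.py` keeps exactly this layout is an assumption here. The sketch below reads such rows and turns one into a zero-shot prompt; the prompt template and the names `read_mmlu_csv` and `format_prompt` are hypothetical, not necessarily what the test scripts use.

```python
import csv
import io
from typing import Iterator, NamedTuple, Tuple


class MMLUItem(NamedTuple):
    question: str
    choices: Tuple[str, str, str, str]  # answer options in A-D order
    answer: str  # correct letter, "A".."D"


def read_mmlu_csv(handle) -> Iterator[MMLUItem]:
    """Yield items from a header-less MMLU CSV (question, A, B, C, D, answer)."""
    for row in csv.reader(handle):
        if len(row) != 6:
            continue  # skip malformed rows instead of failing the whole file
        question, a, b, c, d, answer = row
        yield MMLUItem(question, (a, b, c, d), answer.strip())


def format_prompt(item: MMLUItem, subject: str) -> str:
    """Build a zero-shot multiple-choice prompt (hypothetical template)."""
    lines = [f"The following is a multiple choice question about {subject}.", "", item.question]
    for letter, choice in zip("ABCD", item.choices):
        lines.append(f"{letter}. {choice}")
    lines.append("Answer:")
    return "\n".join(lines)


if __name__ == "__main__":
    sample = io.StringIO('"What is 2 + 2?",3,4,5,6,B\n')
    for item in read_mmlu_csv(sample):
        print(format_prompt(item, "elementary mathematics"))
```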

### Run Single Model Test
To test another model, replace `Qwen/Qwen3-0.6B` in the commands below with its HuggingFace model name.
```bash
cd ./tests/lmcache

# 1. Baseline test (without LMCache)
./deploy-baseline-dynamo.sh Qwen/Qwen3-0.6B
# Wait for model to load, then run test in another terminal:
python3 mmlu-baseline-dynamo.py --model Qwen/Qwen3-0.6B --number-of-subjects 15
# Stop services with Ctrl+C in the deploy script terminal

# 2. LMCache test (with LMCache enabled)
./deploy-lmcache_enabled-dynamo.sh Qwen/Qwen3-0.6B
# Wait for model to load, then run test in another terminal:
python3 mmlu-lmcache_enabled-dynamo.py --model Qwen/Qwen3-0.6B --number-of-subjects 15
# Stop services with Ctrl+C in the deploy script terminal

# 3. Compare results
python3 summarize_scores_dynamo.py
```
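The "wait for model to load" step can be automated by polling the ingress until the model is registered. This sketch assumes the ingress on port 8000 serves the OpenAI-compatible `GET /v1/models` route and lists the model under `data[].id` once the worker is ready; that endpoint behavior and the helper name `wait_for_model` are assumptions, so adjust them if your Dynamo version behaves differently.

```python
import json
import time
import urllib.request


def wait_for_model(
    model: str,
    base_url: str = "http://localhost:8000",
    timeout: float = 600.0,
    interval: float = 5.0,
) -> bool:
    """Poll GET {base_url}/v1/models until `model` is listed or `timeout` expires.

    Assumes an OpenAI-compatible body of the form {"data": [{"id": ...}, ...]}.
    Returns True once the model is listed, False on timeout.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/v1/models", timeout=interval) as resp:
                body = json.load(resp)
            data = body.get("data", []) if isinstance(body, dict) else []
            if any(isinstance(e, dict) and e.get("id") == model for e in data):
                return True
        except (OSError, ValueError):
            pass  # connection refused, timeout, HTTP error, or non-JSON body: retry
        # Sleep until the next attempt, but never past the deadline.
        time.sleep(min(interval, max(0.0, deadline - time.monotonic())))
    return False
```

For example, calling `wait_for_model("Qwen/Qwen3-0.6B")` before starting `mmlu-baseline-dynamo.py` replaces the manual wait; `OSError` covers `urllib.error.URLError`, `HTTPError`, and socket timeouts, while `ValueError` covers malformed JSON.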

## File Description

### Deployment Scripts
- **`deploy-baseline-dynamo.sh`**: Deploy Dynamo without LMCache (baseline)
- **`deploy-lmcache_enabled-dynamo.sh`**: Deploy Dynamo with LMCache enabled (test)

### Test Scripts
- **`mmlu-baseline-dynamo.py`**: Run MMLU test on baseline Dynamo
- **`mmlu-lmcache_enabled-dynamo.py`**: Run MMLU test on Dynamo with LMCache
- **`summarize_scores_dynamo.py`**: Compare and analyze test results

## Architecture Differences

### Baseline Architecture (deploy-baseline-dynamo.sh)
```
HTTP Request → Dynamo Ingress(8000) → Dynamo Worker → Direct Inference

Environment: ENABLE_LMCACHE=0
```

### LMCache Architecture (deploy-lmcache_enabled-dynamo.sh)
```
HTTP Request → Dynamo Ingress(8000) → Dynamo Worker → LMCache-enabled Inference

Environment: ENABLE_LMCACHE=1
             LMCACHE_CHUNK_SIZE=256
             LMCACHE_LOCAL_CPU=True
             LMCACHE_MAX_LOCAL_CPU_SIZE=1.0
```
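The two deployments differ only in the LMCache environment variables listed above. As a hedged sketch, both configurations can be expressed as dictionaries and applied to a copy of the current environment, for instance before launching a process with `subprocess`. The helper names are hypothetical, and whether the deploy scripts read these variables from the calling shell or set them internally is not specified here. The unit comments reflect our understanding of LMCache's configuration (chunk size in tokens, CPU cache size in GB) and should be checked against the LMCache docs.

```python
import os
from typing import Dict, Optional, Tuple

# Values copied from the two architecture blocks above.
BASELINE_ENV: Dict[str, str] = {"ENABLE_LMCACHE": "0"}
LMCACHE_ENV: Dict[str, str] = {
    "ENABLE_LMCACHE": "1",
    "LMCACHE_CHUNK_SIZE": "256",  # tokens per KV-cache chunk (assumed unit)
    "LMCACHE_LOCAL_CPU": "True",  # keep cached KV chunks in CPU memory
    "LMCACHE_MAX_LOCAL_CPU_SIZE": "1.0",  # CPU cache budget, GB (assumed unit)
}


def build_env(enable_lmcache: bool) -> Dict[str, str]:
    """Return a copy of os.environ with the chosen configuration applied."""
    env = dict(os.environ)
    if not enable_lmcache:
        # Drop LMCache settings so they cannot leak into the baseline run.
        for key in LMCACHE_ENV:
            env.pop(key, None)
    env.update(LMCACHE_ENV if enable_lmcache else BASELINE_ENV)
    return env


def config_diff() -> Dict[str, Tuple[Optional[str], Optional[str]]]:
    """Map each variable to its (baseline, lmcache) value, None if unset."""
    keys = sorted(set(BASELINE_ENV) | set(LMCACHE_ENV))
    return {k: (BASELINE_ENV.get(k), LMCACHE_ENV.get(k)) for k in keys}
```

Removing the LMCache keys for the baseline guards against a shell that still has `LMCACHE_*` exported from an earlier run.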

## API Format

The test scripts send requests to Dynamo's Chat Completions API:

```bash
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-0.6B",
    "messages": [{"role": "user", "content": "question content"}],
    "temperature": 0,
    "max_tokens": 3,
    "stream": false,
    "seed": 42
  }'
```
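For reference, the same request can be sent from Python using only the standard library. The sketch mirrors the curl payload; the helper names and the answer-extraction rule (taking the first standalone letter A to D from the short completion that `max_tokens: 3` produces) are assumptions for illustration, not necessarily what the test scripts do.

```python
import json
import re
import urllib.request
from typing import Optional

API_URL = "http://localhost:8000/v1/chat/completions"


def build_payload(model: str, question: str) -> dict:
    """Mirror the curl example: greedy decoding, fixed seed, very short output."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": question}],
        "temperature": 0,
        "max_tokens": 3,
        "stream": False,
        "seed": 42,
    }


def ask(model: str, question: str, url: str = API_URL, timeout: float = 60.0) -> str:
    """POST one question and return the assistant message text."""
    req = urllib.request.Request(
        url,
        data=json.dumps(build_payload(model, question)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        body = json.load(resp)
    # OpenAI-compatible response shape: choices[0].message.content
    return body["choices"][0]["message"]["content"] or ""


def extract_letter(text: str) -> Optional[str]:
    """Return the first standalone answer letter A-D, or None if absent."""
    match = re.search(r"\b([ABCD])\b", text)
    return match.group(1) if match else None
```

Serializing with `json.dumps` also avoids the quoting mistakes that are easy to make when embedding JSON in a shell string.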


## Result Interpretation

After the tests complete, the following files are generated:
- `dynamo-baseline-{model_name}.jsonl`: Baseline test results
- `dynamo-lmcache-{model_name}.jsonl`: LMCache test results

If the accuracies in the two result files differ by less than 1%, LMCache functionality is considered correct.
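The 1% check can also be scripted directly. The record schema of the `.jsonl` files is not documented in this README, so this sketch assumes, hypothetically, that each line is a JSON object with a boolean `correct` field; adapt the key to whatever the test scripts actually write. The threshold is interpreted here as one percentage point of accuracy.

```python
import json
from pathlib import Path
from typing import Iterable


def accuracy_from_lines(lines: Iterable[str], key: str = "correct") -> float:
    """Fraction of JSONL records whose `key` field is truthy (hypothetical schema)."""
    total = hits = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        total += 1
        hits += bool(record.get(key))
    if total == 0:
        raise ValueError("no records found")
    return hits / total


def accuracy(path: Path, key: str = "correct") -> float:
    """Read a result file such as dynamo-baseline-{model_name}.jsonl."""
    with path.open(encoding="utf-8") as fh:
        return accuracy_from_lines(fh, key)


def within_tolerance(acc_a: float, acc_b: float, tol: float = 0.01) -> bool:
    """True if two accuracies differ by less than `tol` (0.01 = 1 point)."""
    return abs(acc_a - acc_b) < tol
```

Passing the two result files through `accuracy` and then `within_tolerance` reproduces the pass/fail rule stated above.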

## Notes

1. **Determinism**: All tests use the same seed (42) and a temperature of 0 so that results are reproducible across runs.
2. **Prerequisites**: Ensure NATS and etcd are running.
3. **Sequential execution**: Stop the first deployment before starting the second; both serve on port 8000, so running them at the same time causes port conflicts.