# MathVision Benchmark Evaluation

This directory contains the implementation for evaluating vision-language models on the MathVision benchmark using vLLM for high-speed inference.

## Overview

MathVision is a benchmark for mathematical reasoning over images: it evaluates how well models solve math problems whose essential information is presented visually. This implementation provides:

- **High-speed inference** using vLLM with automatic batch optimization
- **Two-stage evaluation** using rule-based and GPT-4o-based answer extraction
- **Support for thinking models** with extended reasoning capabilities
- **Modular code structure** for easy maintenance and extension

## Project Structure

```
MathVision/
├── run_mathv.py          # Main script for inference and evaluation
├── dataset_utils.py      # Dataset loading and preprocessing utilities
├── eval_utils.py         # Evaluation logic and answer extraction
├── common_utils.py       # Common utilities for image processing, file I/O
├── infer_instruct.sh     # Inference script for instruct models
├── infer_think.sh        # Inference script for thinking models
├── eval_instruct.sh      # Evaluation script for instruct model results
├── eval_think.sh         # Evaluation script for thinking model results
├── requirements.txt      # Python dependencies
└── README.md            # This file
```

## Requirements

### Python Dependencies

```bash
pip install -r requirements.txt
```

Key dependencies:
- `vllm` - High-speed LLM inference engine
- `transformers` - HuggingFace transformers
- `qwen_vl_utils` - Qwen VL utilities for vision processing
- `pandas`, `numpy` - Data processing
- `Pillow` - Image processing
- `latex2sympy2` - LaTeX to symbolic math conversion (optional)
- `openpyxl` - Excel file handling
- `requests` - API calls for evaluation

### Environment Variables

For evaluation, you need to set up API credentials for the judge model:

**Option 1: DashScope API (Recommended)**
```bash
export CHATGPT_DASHSCOPE_API_KEY="your-api-key"
export DASHSCOPE_API_BASE="https://dashscope.aliyuncs.com/compatible-mode/v1/chat/completions"
```

**Option 2: Custom OpenAI-compatible API**
```bash
export MIT_SPIDER_TOKEN="your-api-key"
export MIT_SPIDER_URL="your-api-endpoint"
```

## Quick Start

### 1. Inference

Run inference on the MathVision dataset with an instruct model:

```bash
bash infer_instruct.sh
```

Or customize the inference:

```bash
python run_mathv.py infer \
    --model-path /path/to/Qwen3-VL-Instruct \
    --data-dir /path/to/data \
    --dataset MathVision \
    --output-file results/predictions.jsonl \
    --max-new-tokens 32768 \
    --temperature 0.7 \
    --top-p 0.8 \
    --top-k 20 \
    --repetition-penalty 1.0 \
    --presence-penalty 1.5
```

For thinking models with extended reasoning:

```bash
bash infer_think.sh
```

### 2. Evaluation

Evaluate the inference results using GPT-4o as a judge:

```bash
bash eval_instruct.sh
```

Or customize the evaluation:

```bash
python run_mathv.py eval \
    --data-dir /path/to/data \
    --input-file results/predictions.jsonl \
    --output-file results/evaluation.csv \
    --dataset MathVision \
    --eval-model gpt-4o-2024-05-13 \
    --api-type dash \
    --nproc 16
```

## Detailed Usage

### Inference Mode

**Basic Arguments:**
- `--model-path`: Path to the Qwen3-VL model directory (required)
- `--data-dir`: Directory to store/load MathVision dataset (required)
- `--dataset`: Dataset name (default: `MathVision`)
  - `MathVision`: Full dataset with ~3,000 samples
  - `MathVision_MINI`: Mini version for quick testing
- `--output-file`: Path to save inference results in JSONL format (required)

**vLLM Arguments:**
- `--tensor-parallel-size`: Number of GPUs for tensor parallelism (default: auto-detect)
- `--gpu-memory-utilization`: GPU memory utilization ratio, 0.0-1.0 (default: 0.9)
- `--max-model-len`: Maximum model context length (default: 128000)
- `--max-images-per-prompt`: Maximum images per prompt (default: 10)

**Generation Parameters:**
- `--max-new-tokens`: Maximum tokens to generate (default: 32768)
- `--temperature`: Sampling temperature (default: 0.7)
- `--top-p`: Top-p sampling (default: 0.8)
- `--top-k`: Top-k sampling (default: 20)
- `--repetition-penalty`: Repetition penalty (default: 1.0)
- `--presence-penalty`: Presence penalty to reduce repetition (default: 1.5)

**Advanced Options:**
- `--use-cot`: Enable Chain-of-Thought prompting for better reasoning (see the sketch after this list)
- `--cot-prompt`: Custom CoT prompt appended to the question (default: `" Let's think step by step."`, note the leading space)
- `--num-samples`: Number of samples to process (optional, for testing)
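
For reference, here is a minimal sketch of how `--use-cot` and `--cot-prompt` could be combined with the question text. The helper name `build_text_prompt` is illustrative only; the actual prompt assembly lives in `run_mathv.py`.

```python
# Illustrative sketch (not the exact code in run_mathv.py):
# append the CoT suffix to the question when --use-cot is set.
DEFAULT_COT_PROMPT = " Let's think step by step."

def build_text_prompt(question: str, use_cot: bool, cot_prompt: str = DEFAULT_COT_PROMPT) -> str:
    """Return the text part of the prompt sent alongside the image(s)."""
    return question + cot_prompt if use_cot else question

print(build_text_prompt("What is the area of the triangle?", use_cot=True))
# What is the area of the triangle? Let's think step by step.
```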

### Evaluation Mode

**Basic Arguments:**
- `--data-dir`: Directory containing MathVision dataset (required)
- `--input-file`: Inference results file in JSONL format (required)
- `--output-file`: Path to save evaluation results in CSV format (required)
- `--dataset`: Dataset name, must match inference (default: `MathVision`)

**Judge Model Arguments:**
- `--eval-model`: Judge model name (default: `gpt-4o`)
  - Options: `gpt-4o`, `gpt-4o-2024-05-13`, `gpt-3.5-turbo-0125`, etc.
- `--api-type`: API service type (default: `dash`)
  - `dash`: DashScope API (Alibaba Cloud)
  - `mit`: Custom OpenAI-compatible API
- `--nproc`: Number of parallel workers for evaluation (default: 4); each worker issues judge calls like the sketch below
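
Each worker sends an answer-extraction prompt to the judge through an OpenAI-compatible chat-completions endpoint. The sketch below assumes the DashScope variables from the Environment Variables section and the standard OpenAI chat payload; the repository's actual request, retry, and error handling live in `eval_utils.py`.

```python
# Minimal sketch of one judge call against an OpenAI-compatible endpoint
# (DashScope compatible mode, using the environment variables described above).
# The repository's actual request/retry logic lives in eval_utils.py.
import os
import requests

def call_judge(prompt: str, model: str = "gpt-4o-2024-05-13") -> str:
    url = os.environ["DASHSCOPE_API_BASE"]  # .../chat/completions
    headers = {
        "Authorization": f"Bearer {os.environ['CHATGPT_DASHSCOPE_API_KEY']}",
        "Content-Type": "application/json",
    }
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.0,
    }
    resp = requests.post(url, headers=headers, json=payload, timeout=60)
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]
```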

## Output Files

### Inference Output

The inference script generates a JSONL file where each line contains:

```json
{
  "question_id": 123,
  "annotation": {
    "index": 123,
    "question": "What is the area of the triangle?",
    "answer": "12",
    "category": "Geometry",
    "choices": "[]",
    ...
  },
  "task": "MathVision",
  "result": {
    "gen": "The final answer",
    "gen_raw": "Raw model output including thinking process"
  },
  "messages": [...]
}
```
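
Each line is an independent JSON object, so the file can be inspected with a few lines of Python:

```python
# Load the JSONL predictions produced by the inference step.
import json

with open("results/predictions.jsonl", encoding="utf-8") as f:
    records = [json.loads(line) for line in f if line.strip()]

for rec in records[:3]:
    print(rec["question_id"], rec["result"]["gen"][:80])
```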

### Evaluation Output

The evaluation script generates multiple files:

1. **Intermediate results** (`*_eval_results.xlsx`): Raw predictions with metadata
2. **Detailed evaluation** (`*_eval_results_eval.xlsx`): Results with extracted answers
   - Columns: `index`, `question`, `prediction`, `answer`, `res` (extracted), `log`, `extract_model`, `extract_flag`, `category`
3. **Score summary** (`*_eval_results_eval_score.csv`): Accuracy by category

Example score summary:
```
Subject         | tot | prefetch | hit | prefetch_rate | acc
----------------|-----|----------|-----|---------------|------
Overall         | 3000| 2400     | 2100| 80.0          | 70.0
Algebra         | 800 | 640      | 560 | 80.0          | 70.0
Geometry        | 750 | 600      | 525 | 80.0          | 70.0
```
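
Here `tot` is the number of questions per category, `hit` is the number judged correct, and `acc = hit / tot × 100`; `prefetch` and `prefetch_rate` report how many answers the rule-based extraction stage resolved without calling the judge. If you want to recompute per-category accuracy yourself, the detailed evaluation sheet can be aggregated with pandas; the sketch below uses a simple exact-match rule, whereas the built-in scorer may match more tolerantly.

```python
# Recompute per-category accuracy from the detailed evaluation sheet.
# Adjust the filename to match your actual *_eval_results_eval.xlsx output.
import pandas as pd

df = pd.read_excel("results/predictions_eval_results_eval.xlsx")

# Exact-match correctness between extracted answer and ground truth
# (illustrative only; the built-in scorer may be more lenient).
df["hit"] = (
    df["res"].astype(str).str.strip().str.lower()
    == df["answer"].astype(str).str.strip().str.lower()
)

summary = df.groupby("category")["hit"].agg(tot="count", hit="sum")
summary["acc"] = 100 * summary["hit"] / summary["tot"]
print(summary)
```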

## Model-Specific Configurations

### Instruct Models (e.g., Qwen3-VL-2B-Instruct, Qwen3-VL-30B-Instruct)

Use standard parameters for balanced performance:

```bash
--max-new-tokens 32768
--temperature 0.7
--top-p 0.8
--top-k 20
--repetition-penalty 1.0
--presence-penalty 1.5
```

### Thinking Models (e.g., Qwen3-VL-4B-Thinking, Qwen3-VL-30B-Thinking)

Use extended parameters for deeper reasoning:

```bash
--max-new-tokens 40960
--temperature 1.0
--top-p 0.95
--top-k 20
--repetition-penalty 1.0
--presence-penalty 0.0
```

**Note:** Thinking models output reasoning steps wrapped in `<think>...</think>` tags. The evaluation automatically extracts the final answer after `</think>`.
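
In practice this amounts to keeping only the text after the last closing tag; a minimal sketch (the extraction in the repository may differ in detail):

```python
# Keep only the text after the last </think> tag; fall back to the raw output
# when no tag is present (e.g., instruct models).
def strip_thinking(raw: str) -> str:
    tag = "</think>"
    return raw.rsplit(tag, 1)[-1].strip() if tag in raw else raw.strip()

print(strip_thinking("<think>area = (1/2) * 4 * 6</think>The area is 12."))
# The area is 12.
```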

## Performance Tips

1. **GPU Memory**: Adjust `--gpu-memory-utilization` based on your GPU:
   - 0.9: Recommended for most cases
   - 0.95: For maximum throughput (may cause OOM)
   - 0.7-0.8: If experiencing OOM errors

2. **Batch Size**: vLLM automatically optimizes batch size based on available memory

3. **Tensor Parallelism**: Use `--tensor-parallel-size` for large models:
   - 2B/4B models: 1-2 GPUs
   - 7B/14B models: 2-4 GPUs
   - 30B+ models: 4-8 GPUs

4. **Context Length**: Reduce `--max-model-len` if memory is limited:
   - 128000: Default, works well for most cases
   - 64000: Reduces memory usage by ~40%

5. **Image Resolution**: images are resized within a pixel budget of 768×28×28 (`MIN_PIXELS`) to 5120×28×28 (`MAX_PIXELS`)
   - Lower min_pixels for faster processing
   - Higher max_pixels for better accuracy on complex diagrams

## Troubleshooting

### Common Issues

**1. CUDA Out of Memory**
```bash
# Reduce GPU memory utilization
--gpu-memory-utilization 0.7

# Or reduce context length
--max-model-len 64000
```

**2. vLLM Multiprocessing Issues**
The code automatically sets `VLLM_WORKER_MULTIPROC_METHOD=spawn`. If you still encounter issues:
```bash
export VLLM_WORKER_MULTIPROC_METHOD=spawn
```

**3. Evaluation API Errors**
- Verify API credentials are set correctly
- Check API endpoint connectivity
- Monitor rate limits
- Lower `--nproc` if you hit rate limits; raise it (up to 32) only when the API allows higher throughput

**4. Dataset Download Issues**
The dataset is automatically downloaded from:
```
https://opencompass.openxlab.space/utils/VLMEval/MathVision.tsv
```
If the download fails, download the file manually and place it in the directory given by `--data-dir`.

**5. Excel Export Errors**
The code automatically removes illegal Excel characters. If you still encounter issues:
- Check the `clean_for_excel()` function in `run_mathv.py` (a reference sketch follows this list)
- Ensure `openpyxl` is installed
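
For reference, one plausible way to strip characters that XLSX cells cannot store (ASCII control characters); the actual `clean_for_excel()` in `run_mathv.py` may differ:

```python
# Remove ASCII control characters that openpyxl refuses to write into cells.
# One plausible implementation; see clean_for_excel() in run_mathv.py for the real one.
import re

_ILLEGAL_XLSX_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")

def clean_for_excel(value):
    if isinstance(value, str):
        return _ILLEGAL_XLSX_CHARS.sub("", value)
    return value
```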

**6. LaTeX Conversion Errors**
Install `latex2sympy2` for better LaTeX support; a quick sanity check follows the install command:
```bash
pip install latex2sympy2
```
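
Once installed, LaTeX answers can be parsed into SymPy expressions and compared symbolically:

```python
# Quick check that latex2sympy2 can parse LaTeX into a SymPy expression.
from latex2sympy2 import latex2sympy

expr = latex2sympy(r"\frac{3}{4} + \frac{1}{4}")
print(expr.simplify())  # 1
```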

## Advanced Usage

### Custom Image Resolution

Edit `run_mathv.py` to modify image resolution:

```python
MIN_PIXELS = 768*28*28   # ~0.6M pixels
MAX_PIXELS = 5120*28*28  # ~4M pixels
```

### Custom Evaluation Prompts

The evaluation uses in-context examples defined in `eval_utils.py`; a rough sketch of the prompt shape follows the list below:
- Edit `get_gpt4_ICE()` to customize examples
- Edit `build_mathv_gpt4_prompt()` to modify prompt structure
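
The prompt generally stitches the in-context examples together with the current question and the model's response before asking the judge to extract the final answer. The sketch below is only a rough shape under that assumption; the authoritative wording is in the two functions above.

```python
# Rough shape of an extraction prompt (illustrative; see get_gpt4_ICE() and
# build_mathv_gpt4_prompt() in eval_utils.py for the actual wording).
def build_extraction_prompt(examples, question, prediction):
    parts = [
        "Read the examples, then extract the final answer from the model response.",
        *examples,                          # worked in-context examples
        f"Question: {question}",
        f"Model response: {prediction}",
        "Extracted answer:",
    ]
    return "\n\n".join(parts)
```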

### Testing with Limited Samples

Use `--num-samples` for quick testing:

```bash
python run_mathv.py infer \
    --model-path /path/to/model \
    --data-dir /path/to/data \
    --dataset MathVision \
    --output-file results/test.jsonl \
    --num-samples 100
```

### Debugging

Enable debug mode for detailed logs:

```bash
DEBUG=true python run_mathv.py eval ...
```

This processes only the first 5 samples in single-threaded mode.
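
A minimal sketch of how such an environment flag could gate the run (names are illustrative; the real gating lives in `run_mathv.py`):

```python
# Illustrative DEBUG gating: first 5 samples, single worker.
import os

DEBUG = os.environ.get("DEBUG", "").lower() == "true"

def run_eval(samples, nproc=4):
    if DEBUG:
        samples, nproc = samples[:5], 1
    print(f"Evaluating {len(samples)} sample(s) with {nproc} worker(s)")

run_eval(list(range(100)))  # with DEBUG=true -> "Evaluating 5 sample(s) with 1 worker(s)"
```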

## Citation

If you use this code or the MathVision benchmark, please cite:

```bibtex
@article{mathvision,
  title={Measuring Multimodal Mathematical Reasoning with MATH-Vision Dataset}, 
  author={Ke Wang and Junting Pan and Weikang Shi and Zimu Lu and Mingjie Zhan and Hongsheng Li},
  journal={arXiv preprint arXiv:2402.14804},
  year={2024}
}
```

## License

This code is released under the same license as the Qwen3-VL model.

## Support

For issues and questions:
- GitHub Issues: [Qwen3-VL Repository](https://github.com/QwenLM/Qwen3-VL)
- Documentation: See inline code comments and docstrings