README.md

# README for Evaluation

## 🌟 Overview

This script provides an evaluation pipeline for `MMBench` and `CCBench`.

## 🗂️ Data Preparation

Before starting to download the data, please create the `InternVL/internvl_chat/data` folder.

### MMBench and CCBench

Follow the instructions below to prepare the data:

```shell
# Step 1: Create the data directory
mkdir -p data/mmbench && cd data/mmbench

# Step 2: Download csv files
wget http://opencompass.openxlab.space/utils/MMBench/CCBench_legacy.tsv
wget https://download.openmmlab.com/mmclassification/datasets/mmbench/mmbench_dev_20230712.tsv
wget https://download.openmmlab.com/mmclassification/datasets/mmbench/mmbench_dev_cn_20231003.tsv
wget https://download.openmmlab.com/mmclassification/datasets/mmbench/mmbench_dev_en_20231003.tsv
wget https://download.openmmlab.com/mmclassification/datasets/mmbench/mmbench_test_cn_20231003.tsv
wget https://download.openmmlab.com/mmclassification/datasets/mmbench/mmbench_test_en_20231003.tsv

cd ../..
```

After preparation is complete, the directory structure is:

```shell
data/mmbench
 ├── CCBench_legacy.tsv
 ├── mmbench_dev_20230712.tsv
 ├── mmbench_dev_cn_20231003.tsv
 ├── mmbench_dev_en_20231003.tsv
 ├── mmbench_test_cn_20231003.tsv
 └── mmbench_test_en_20231003.tsv
```

## 🏃 Evaluation Execution

> ⚠️ Note: For testing InternVL (1.5, 2.0, 2.5, and later versions), always enable `--dynamic` to perform dynamic resolution testing.

To run the evaluation, execute the following command on an 8-GPU setup:

```shell
# Test the MMBench-Dev-EN
torchrun --nproc_per_node=8 eval/mmbench/evaluate_mmbench.py --checkpoint ${CHECKPOINT} --dynamic --datasets mmbench_dev_20230712
# Test the MMBench-Test-EN
torchrun --nproc_per_node=8 eval/mmbench/evaluate_mmbench.py --checkpoint ${CHECKPOINT} --dynamic --datasets mmbench_test_en_20231003
# Test the MMBench-Dev-CN
torchrun --nproc_per_node=8 eval/mmbench/evaluate_mmbench.py --checkpoint ${CHECKPOINT} --dynamic --datasets mmbench_dev_cn_20231003
# Test the MMBench-Test-CN
torchrun --nproc_per_node=8 eval/mmbench/evaluate_mmbench.py --checkpoint ${CHECKPOINT} --dynamic --datasets mmbench_test_cn_20231003
# Test the CCBench-Dev
torchrun --nproc_per_node=8 eval/mmbench/evaluate_mmbench.py --checkpoint ${CHECKPOINT} --dynamic --datasets ccbench_dev_cn
```

Alternatively, you can run the following simplified command:

```shell
# Test the MMBench-Dev-EN
GPUS=8 sh evaluate.sh ${CHECKPOINT} mmbench-dev-en --dynamic
# Test the MMBench-Test-EN
GPUS=8 sh evaluate.sh ${CHECKPOINT} mmbench-test-en --dynamic
# Test the MMBench-Dev-CN
GPUS=8 sh evaluate.sh ${CHECKPOINT} mmbench-dev-cn --dynamic
# Test the MMBench-Test-CN
GPUS=8 sh evaluate.sh ${CHECKPOINT} mmbench-test-cn --dynamic
# Test the CCBench-Dev
GPUS=8 sh evaluate.sh ${CHECKPOINT} ccbench-dev --dynamic
```

After the test is completed, a file with a name similar to `results/mmbench_dev_20230712_241224214015.xlsx` will be generated. Please upload these files to the [official server](https://mmbench.opencompass.org.cn/mmbench-submission) to obtain the evaluation scores.

### Arguments

The following arguments can be configured for the evaluation script:

| Argument         | Type   | Default                  | Description                                                                                                       |
| ---------------- | ------ | ------------------------ | ----------------------------------------------------------------------------------------------------------------- |
| `--checkpoint`   | `str`  | `''`                     | Path to the model checkpoint.                                                                                     |
| `--datasets`     | `str`  | `'mmbench_dev_20230712'` | Comma-separated list of datasets to evaluate.                                                                     |
| `--dynamic`      | `flag` | `False`                  | Enables dynamic high resolution preprocessing.                                                                    |
| `--max-num`      | `int`  | `6`                      | Maximum tile number for dynamic high resolution.                                                                  |
| `--load-in-8bit` | `flag` | `False`                  | Loads the model weights in 8-bit precision.                                                                       |
| `--auto`         | `flag` | `False`                  | Automatically splits a large model across 8 GPUs when needed, useful for models too large to fit on a single GPU. |