# README for Evaluation

## 🌟 Overview

This script provides an evaluation pipeline for `LLaVA-Bench`.

For scoring, we use **GPT-4-0613** as the evaluation model.
While the provided code can run the benchmark, we recommend using [VLMEvalKit](https://github.com/open-compass/VLMEvalKit) if you aim to reproduce the results in our technical report.

## 🗂️ Data Preparation

Before downloading the data, please create the `InternVL/internvl_chat/data` folder.
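
For example, assuming the repository is cloned into a directory named `InternVL` and you start from the directory that contains it, a minimal sketch looks like this (adjust the path to your own checkout):

```shell
# Create the data directory expected by the evaluation scripts
mkdir -p InternVL/internvl_chat/data
cd InternVL/internvl_chat
```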

### LLaVA-Bench

Follow the instructions below to prepare the data:

```shell
# Step 1: Download the dataset
cd data/
git clone https://huggingface.co/datasets/liuhaotian/llava-bench-in-the-wild
cd ../
```

After preparation is complete, the directory structure is:

```shell
data/llava-bench-in-the-wild
├── images
├── answers_gpt4.jsonl
├── bard_0718.jsonl
├── bing_chat_0629.jsonl
├── context.jsonl
├── questions.jsonl
└── README.md
```
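
As an optional sanity check (not part of the official pipeline), you can list the folder and compare it against the structure above:

```shell
ls data/llava-bench-in-the-wild
# Expect: images/, answers_gpt4.jsonl, bard_0718.jsonl, bing_chat_0629.jsonl,
#         context.jsonl, questions.jsonl, README.md
```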

## 🏃 Evaluation Execution

> ⚠️ Note: For testing InternVL (1.5, 2.0, 2.5, and later versions), always enable `--dynamic` to perform dynamic resolution testing.

To run the evaluation, execute the following commands on a 1-GPU setup:

```shell
# Step 1: Remove old inference results if they exist
rm -rf results/llava_bench_results_review.jsonl

# Step 2: Run the evaluation
torchrun --nproc_per_node=1 eval/llava_bench/evaluate_llava_bench.py --checkpoint ${CHECKPOINT} --dynamic

# Step 3: Score the results using GPT-4-0613
export OPENAI_API_KEY="your_openai_api_key"
python -u eval/llava_bench/eval_gpt_review_bench.py \
  --question data/llava-bench-in-the-wild/questions.jsonl \
  --context data/llava-bench-in-the-wild/context.jsonl \
  --rule eval/llava_bench/rule.json \
  --answer-list \
      data/llava-bench-in-the-wild/answers_gpt4.jsonl \
      results/llava_bench_results.jsonl \
  --output \
      results/llava_bench_results_review.jsonl
python -u eval/llava_bench/summarize_gpt_review.py -f results/llava_bench_results_review.jsonl
```

Alternatively, you can run the following simplified command:

```shell
export OPENAI_API_KEY="your_openai_api_key"
GPUS=1 sh evaluate.sh ${CHECKPOINT} llava-bench --dynamic
```

### Arguments

The following arguments can be configured for the evaluation script:

| Argument         | Type   | Default         | Description                                                                                                       |
| ---------------- | ------ | --------------- | ----------------------------------------------------------------------------------------------------------------- |
| `--checkpoint`   | `str`  | `''`            | Path to the model checkpoint.                                                                                     |
| `--datasets`     | `str`  | `'llava_bench'` | Comma-separated list of datasets to evaluate.                                                                     |
| `--dynamic`      | `flag` | `False`         | Enables dynamic high-resolution preprocessing.                                                                    |
| `--max-num`      | `int`  | `6`             | Maximum number of tiles for dynamic high resolution.                                                              |
| `--load-in-8bit` | `flag` | `False`         | Loads the model weights in 8-bit precision.                                                                       |
| `--auto`         | `flag` | `False`         | Automatically splits a large model across 8 GPUs when needed, useful for models too large to fit on a single GPU. |
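
For instance, these arguments can be combined as in the sketch below; the checkpoint path is a placeholder, and `--max-num 12` is only an illustrative value rather than a recommended setting:

```shell
# Example: dynamic resolution with up to 12 tiles, loading weights in 8-bit
torchrun --nproc_per_node=1 eval/llava_bench/evaluate_llava_bench.py \
  --checkpoint ${CHECKPOINT} \
  --datasets llava_bench \
  --dynamic \
  --max-num 12 \
  --load-in-8bit
```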