# 🛠️ How to implement a new Benchmark / VLM in VLMEvalKit?
## Implement a new benchmark
Example PR: **Math-Vision Benchmark** ([#292](https://github.com/open-compass/VLMEvalKit/pull/292/files))
In VLMEvalKit, benchmarks are organized as dataset classes. When you implement a new benchmark, you can either reuse an existing dataset class (*e.g.*, reuse `ImageMCQDataset` when implementing a new multi-choice benchmark) or add a new dataset class. Each dataset must have the following two member functions (either reuse the parent class's implementation or provide your own):
- `build_prompt(self, line)`: The function input `line` is an integer (the sample index) or a `pd.Series` object (the raw record of the sample). The function outputs a `multi-modal message` that serves as the input of an MLLM. The `multi-modal message` is an interleaved list of image and text items adopting the following format (the example includes an image and a text message): `[dict(type='image', value=IMAGE_PTH), dict(type='text', value=prompt)]`.
- `evaluate(self, eval_file, **judge_kwargs)`: The function input `eval_file` is the path to the MLLM prediction file (typically in `.xlsx` format). If the benchmark requires an external LLM (typically GPT) for evaluation, `judge_kwargs` can pass the arguments for that LLM. The function outputs the benchmark evaluation results (metrics) in the form of a `dict` or a `pd.DataFrame`.
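For orientation, a new dataset class might be organized roughly as in the sketch below. This is a hypothetical `MyBenchmark`; the `TYPE`, URL, and MD5 values are placeholders, and the details are covered step by step afterwards.

```python
# A minimal sketch of a new dataset class (hypothetical names and placeholder URL/MD5).
# For a multi-choice benchmark, inheriting `ImageMCQDataset` is often enough.
from vlmeval.dataset.image_base import ImageBaseDataset

class MyBenchmark(ImageBaseDataset):
    TYPE = 'VQA'                                                          # placeholder dataset type
    DATASET_URL = {'MyBenchmark': 'https://example.com/MyBenchmark.tsv'}  # placeholder
    DATASET_MD5 = {'MyBenchmark': None}                                   # placeholder

    def build_prompt(self, line):
        # Reuse the default prompt building of ImageBaseDataset, or customize it (Step 2).
        return super().build_prompt(line)

    def evaluate(self, eval_file, **judge_kwargs):
        # Compute and return the metrics for this benchmark (Step 3).
        raise NotImplementedError
```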
Below we outline the typical steps to implement a new benchmark under VLMEvalKit:
### 1. Prepare your benchmark tsv file
Currently, we organize a benchmark as one single TSV file. During inference, the data file will be automatically downloaded from the defined `DATASET_URL` link to the `$LMUData` directory (the default path is `$HOME/LMUData` if not set explicitly). You can upload the prepared TSV file to a downloadable address (e.g., Huggingface) or send it to us at <opencompass@pjlab.org.cn>, and we will assist in uploading the dataset to the server. You can also customize the `LMUData` path via the environment variable `LMUData=/path/to/your/data`, as shown below.
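For example, to store the benchmark TSV files in a custom location:

```bash
# Optional: keep benchmark TSV files outside the default $HOME/LMUData
export LMUData=/path/to/your/data
```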
The contents of the TSV file consist of:
| Dataset Name \ Fields | index | image | image_path | question | hint | multi-choice<br>options | answer | category | l2-category | split |
| --------------------- | ----- | ----- | ---------- | -------- | ---- | ----------------------- | ------ | -------- | ----------- | ----- |
- **question**: The question corresponding to the image, a string.
- **answer**: The answer to the question, a string. The `test` split does not need this field.
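As an illustration, a tiny TSV with the minimal fields could be produced as below. This is a sketch with made-up data; in VLMEvalKit TSVs the `image` field usually holds the base64-encoded image content, so the `image_path` column can often be omitted.

```python
# A sketch of building a benchmark TSV (hypothetical sample; adjust fields as needed).
import base64
import pandas as pd

def encode_image_to_base64(pth):
    with open(pth, 'rb') as fin:
        return base64.b64encode(fin.read()).decode('utf-8')

data = dict(
    index=[0],
    image=[encode_image_to_base64('apple.jpg')],   # base64-encoded image content
    question=['How many apples are there in the image?'],
    answer=['3'],
)
pd.DataFrame(data).to_csv('MyBenchmark.tsv', sep='\t', index=False)
```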
### 2. Customize your benchmark prompt
`ImageBaseDataset` defines the default prompt format. If you need to add prompts specific to the dataset or feed interleaved inputs to the model, you can implement this through the `build_prompt(line)` function. This function takes a line from the TSV file as input, containing fields such as index, image, question, etc. It returns a list of dictionaries, the multimodal message `msg`, in the format `[dict(type='image', value=IMAGE_PTH), dict(type='text', value=prompt)]`, including the image path and the text prompt to be fed into the VLM. For interleaved inputs, you can directly place the image-path dictionary at the position where the image should appear.
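A sketch of such a dataset-specific `build_prompt` follows. It assumes the `dump_image` helper of `ImageBaseDataset` for materializing images to local paths, and the appended instruction text is only illustrative.

```python
# A sketch of a dataset-specific build_prompt (method of a class inheriting ImageBaseDataset).
def build_prompt(self, line):
    if isinstance(line, int):
        line = self.data.iloc[line]
    tgt_path = self.dump_image(line)       # local path(s) of the image(s)
    if not isinstance(tgt_path, list):
        tgt_path = [tgt_path]
    prompt = line['question'] + '\nAnswer the question with a single word or phrase.'
    msgs = [dict(type='image', value=p) for p in tgt_path]
    msgs.append(dict(type='text', value=prompt))
    return msgs
```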
### 3. Customize your benchmark metrics
To add evaluation for a new benchmark, you need to customize a class that implements the dataset's metric calculation. Multimodal datasets inherit from the `ImageBaseDataset` class in `vlmeval/dataset/image_base.py`. `TYPE` defines the type of dataset, `DATASET_URL` is the download address of the dataset, and `DATASET_MD5` is the MD5 checksum used to verify the integrity of the dataset file.
In this class, **you need to implement** the `evaluate(eval_file, **judge_kwargs)` class function to calculate metrics and output results for the custom dataset. The function input `eval_file` is the path to the model prediction results file `{model_name}_{dataset}.xlsx`. This file can be read as a `pandas.DataFrame` using the `load(eval_file)` method, containing fields such as index, question, answer, category, prediction, etc. `judge_kwargs` passes a dictionary of evaluation-related arguments, such as the name of the judge model, the number of API request threads, etc. **The return value** of the function is the calculated accuracy and other metrics, formatted as a dictionary of lists and organized into a `pandas.DataFrame`.
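Below is a sketch of a simple exact-match `evaluate`. It assumes the `load` helper exposed in `vlmeval.smp`; real benchmarks may instead parse predictions with a judge LLM configured via `judge_kwargs`.

```python
# A sketch of an exact-match evaluate (method of the custom dataset class).
import pandas as pd
from vlmeval.smp import load

def evaluate(self, eval_file, **judge_kwargs):
    data = load(eval_file)                  # DataFrame with question / answer / prediction, etc.
    data['hit'] = [
        str(a).strip().lower() == str(p).strip().lower()
        for a, p in zip(data['answer'], data['prediction'])
    ]
    acc = dict(split=['overall'], accuracy=[100 * data['hit'].mean()])
    return pd.DataFrame(acc)
```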
## Implement a new model
Example PR: **Support LLaVA-Next-Interleave** ([#294](https://github.com/open-compass/VLMEvalKit/pull/294))
**1. Support `generate_inner` API (mandatory).**
All existing models are implemented in `vlmeval/vlm`. For a minimal model, your model class **must implement the method** `generate_inner(msgs, dataset=None)`. In this function, you feed a multi-modal message to your VLM and return the VLM prediction (a string). The optional argument `dataset` can be used as a flag for the model to switch among various inference strategies (a minimal class sketch is given at the end of this subsection).
The multi-modal message `msgs` is a list of dictionaries; each dictionary has two keys, `type` and `value`:
- `type`: We currently support two types; choices are ["image", "text"].
- `value`: When `type=='text'`, the value is the text message (a single string); when `type=='image'`, the value can be the local path of an image file or an image URL.
Currently a multi-modal message may contain arbitrarily interleaved images and texts. If your model does not support that, a common practice is to take the first image and the concatenated text messages as the input. In that case, you can set `INTERLEAVE = False` in your model class and use `self.message_to_promptimg(message, dataset=dataset)` to build your prompt and obtain the first image's path. An example multi-modal message with interleaved images and text:
```python
msg1 = [
    dict(type='image', value=IMAGE_PTH),
    dict(type='image', value=IMAGE_PTH),
    dict(type='text', value='How many apples are there in these images?')
]
response = model.generate(msg1)
```
For convenience's sake, we also support taking a list of strings as input. In that case, we check whether each string is an image path / image URL or a piece of text, and automatically convert the list to the `list[dict]` format:
```python
msg2 = [IMAGE_URL, IMAGE_URL, 'How many apples are there in these images?']
response = model.generate(msg2)
```
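Putting the pieces together, a minimal model class might look like the sketch below. The class name and the dummy inference logic are placeholders, and it assumes the `BaseModel` parent defined in `vlmeval/vlm/base.py`.

```python
# A sketch of a minimal VLM wrapper; replace the dummy response with real inference.
from vlmeval.vlm.base import BaseModel

class MyVLM(BaseModel):
    INTERLEAVE = True   # set to False if the model cannot handle interleaved inputs

    def generate_inner(self, message, dataset=None):
        images = [x['value'] for x in message if x['type'] == 'image']
        prompt = '\n'.join(x['value'] for x in message if x['type'] == 'text')
        # Placeholder: feed `images` and `prompt` to your model here and return a string.
        response = f'dummy answer ({len(images)} image(s) received)'
        return response
```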
**2. Support custom prompt building (optional).**
Besides, your model can support **custom prompt building** by implementing two optional methods: `use_custom_prompt(dataset)` and `build_prompt(line, dataset=None)`.
Both methods take the dataset name as an input:
- `use_custom_prompt(dataset)` returns a boolean flag indicating whether the model should use its custom prompt building strategy for the given dataset.
- If `use_custom_prompt(dataset)` returns True, `build_prompt(line, dataset)` should return a custom-built multimodal message for the corresponding `dataset`, given `line`, a dictionary that includes the necessary information of a data sample. If `use_custom_prompt(dataset)` returns False, the default prompt building strategy will be used (a sketch of both methods follows this list).
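Below is a sketch of the pair of methods for a hypothetical multi-choice benchmark. The option columns `'A'`, `'B'`, ... and the instruction wording are assumptions, and both functions are methods of your model class.

```python
# A sketch of custom prompt building (methods of the model class).
import string
import pandas as pd

def use_custom_prompt(self, dataset):
    # Hypothetical rule: only customize prompts for multi-choice benchmarks.
    return dataset is not None and 'MCQ' in dataset

def build_prompt(self, line, dataset=None):
    assert self.use_custom_prompt(dataset)
    tgt_path = self.dump_image(line, dataset)      # image-dumping helper from the parent class
    options = {c: line[c] for c in string.ascii_uppercase if c in line and not pd.isna(line[c])}
    prompt = line['question'] + '\n' + '\n'.join(f'{k}. {v}' for k, v in options.items())
    prompt += "\nAnswer with the option's letter from the given choices directly."
    msgs = [dict(type='image', value=p) for p in tgt_path]
    msgs.append(dict(type='text', value=prompt))
    return msgs
```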
**3. Support multi-turn chatting (optional).**
You can also support multi-turn chatting and evaluation with your VLM by implementing the `chat_inner(message, dataset)` function. The function outputs a single string response, and `message` is a list of chat history entries, following the format below.
```python
# Assume msg1, msg2, msg3, ... are multi-modal messages following the previously described format.
# `chat_inner` takes the following chat history list as input:
message = [
    dict(role='user', content=msg1),
    dict(role='assistant', content=msg2),
    dict(role='user', content=msg3),
    dict(role='assistant', content=msg4),
    # ...
    dict(role='user', content=msgn),
]
# `message` should contain an odd number of chat utterances; the roles should alternate
# between "user" and "assistant", and the last utterance must come from "user".
# The chat function will call `chat_inner`:
response = model.chat(message)
```
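The following is a sketch of a `chat_inner` that simply flattens the history into a single prompt and reuses `generate_inner`. This is a naive strategy shown only for illustration; models with native chat templates should build the conversation properly instead.

```python
# A sketch of chat_inner (method of the model class).
def chat_inner(self, message, dataset=None):
    history = []
    for turn in message:
        texts = ' '.join(x['value'] for x in turn['content'] if x['type'] == 'text')
        history.append(f"{turn['role']}: {texts}")
    # Keep the images of the last user turn and pass the textual history as one prompt.
    msgs = [x for x in message[-1]['content'] if x['type'] == 'image']
    msgs.append(dict(type='text', value='\n'.join(history)))
    return self.generate_inner(msgs, dataset=dataset)
```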
### Example PRs:
- VLM that doesn't support interleaved images and texts, and does not use custom prompts: [[Model] Support glm-4v-9b](https://github.com/open-compass/VLMEvalKit/pull/221)
- VLM that supports interleaved images and texts and custom prompts: [Add MiniCPM-Llama3-V-2.5](https://github.com/open-compass/VLMEvalKit/pull/205)
To infer with API models (GPT-4v, Gemini-Pro-V, etc.) or use LLM APIs as the **judge or choice extractor**, you need to first set up API keys. If a key is set, VLMEvalKit will use a judge **LLM** to extract answers from the model outputs; otherwise it uses the **exact matching** mode (searching for "Yes", "No", "A", "B", "C", ... in the output strings). **Exact matching can only be applied to Yes-or-No tasks and multi-choice tasks.**
- You can place the required keys in `$VLMEvalKit/.env` or directly set them as environment variables. If you choose to create a `.env` file, its content will look like:
```bash
# The .env file, place it under $VLMEvalKit
# API Keys of Proprietary VLMs
# QwenVL APIs
DASHSCOPE_API_KEY=
# Gemini w. Google Cloud Backends
GOOGLE_API_KEY=
# OpenAI API
OPENAI_API_KEY=
OPENAI_API_BASE=
# StepAI API
STEPAI_API_KEY=
# REKA API
REKA_API_KEY=
# GLMV API
GLMV_API_KEY=
# CongRong API
CW_API_BASE=
CW_API_KEY=
# SenseChat-V API
SENSECHAT_AK=
SENSECHAT_SK=
# Hunyuan-Vision API
HUNYUAN_SECRET_KEY=
HUNYUAN_SECRET_ID=
# You can also set a proxy for calling api models during the evaluation stage
EVAL_PROXY=
```
- Fill the blanks with your API keys (if necessary). Those API keys will be automatically loaded when doing the inference and evaluation.
## Step 1. Configuration
**VLM Configuration**: All VLMs are configured in `vlmeval/config.py`. For some VLMs, you need to configure the code root (MiniGPT-4, PandaGPT, etc.) or the model weight root (LLaVA-v1-7B, etc.) before conducting the evaluation. During evaluation, you should use the model name specified in `supported_VLM` in `vlmeval/config.py` to select the VLM. For MiniGPT-4 and InstructBLIP, you also need to modify the config files in `vlmeval/vlm/misc` to configure the LLM path and checkpoint path.
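For instance, you can peek at the registry and instantiate a model from Python (the model name used here is only an example):

```python
# Inspect the model registry and instantiate a VLM by its registered name.
from vlmeval.config import supported_VLM

print(sorted(supported_VLM)[:10])                # some registered model names
model = supported_VLM['idefics_80b_instruct']()  # build the model from the registry entry
```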
We use `run.py` for evaluation. You can invoke it as `$VLMEvalKit/run.py` or create a soft link to the script so that you can use it from anywhere.
**Arguments**
- `--data (list[str])`: Set the dataset names that are supported in VLMEvalKit (defined in `vlmeval/utils/dataset_config.py`).
- `--model (list[str])`: Set the VLM names that are supported in VLMEvalKit (defined in `supported_VLM` in `vlmeval/config.py`).
- `--mode (str, default to 'all', choices are ['all', 'infer'])`: When `mode` is set to "all", both inference and evaluation will be performed; when set to "infer", only inference will be performed.
- `--nproc (int, default to 4)`: The number of threads for OpenAI API calling.
- `--work-dir (str, default to '.')`: The directory to save evaluation results.
- `--nframe (int, default to 8)`: The number of frames to sample from a video, only applicable to the evaluation of video benchmarks.
- `--pack (bool, store_true)`: A video may be associated with multiple questions; if `pack==True`, all questions for a video will be asked in a single query.
**Command for Evaluating Image Benchmarks**
You can run the script with `python` or `torchrun`:
```bash
# When running with `python`, only one VLM instance is instantiated, and it might use multiple GPUs (depending on its default behavior).
# That is recommended for evaluating very large VLMs (like IDEFICS-80B-Instruct).
# IDEFICS-80B-Instruct on MMBench_DEV_EN, MME, and SEEDBench_IMG: Inference and Evaluation
python run.py --data MMBench_DEV_EN MME SEEDBench_IMG --model idefics_80b_instruct
```
The evaluation results will be printed as logs. Besides, **result files** will also be generated in the directory `$YOUR_WORKING_DIRECTORY/{model_name}`. Files ending with `.csv` contain the evaluated metrics.
## Deploy a local language model as the judge / choice extractor
The default setting mentioned above uses OpenAI's GPT as the judge LLM. However, you can also deploy a local judge LLM with [LMDeploy](https://github.com/InternLM/lmdeploy).
First, install the required packages:
```bash
pip install lmdeploy openai
```
Then deploy a local judge LLM with a single command. LMDeploy will automatically download the model from Huggingface. Assume we use internlm2-chat-1_8b as the judge, port 23333, and the key sk-123456 (the key must start with "sk-" and can be followed by any number you like).
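The command below is a sketch based on LMDeploy's `api_server` interface; double-check the exact flags against the LMDeploy documentation for your installed version:

```bash
# Serve internlm2-chat-1_8b as an OpenAI-compatible API on port 23333
# (the --api-keys flag sets the key sk-123456 mentioned above)
lmdeploy serve api_server internlm/internlm2-chat-1_8b --server-port 23333 --api-keys sk-123456
```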
You need to get the model name registered by LMDeploy with the following Python code:
```python
from openai import OpenAI

client = OpenAI(
    api_key='sk-123456',
    base_url='http://0.0.0.0:23333/v1'
)
model_name = client.models.list().data[0].id
```
Now set some environment variables to tell VLMEvalKit how to use the local judge LLM (a sketch is given below). As mentioned above, you can also set them in the `$VLMEvalKit/.env` file.
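A sketch of the relevant `.env` entries, reusing the `OPENAI_API_KEY` / `OPENAI_API_BASE` fields from the template above. The exact base-URL path and the variable carrying the judge model name may differ across VLMEvalKit versions, so double-check against your installed copy:

```bash
# Point the OpenAI-compatible judge at the local LMDeploy server
OPENAI_API_KEY=sk-123456
OPENAI_API_BASE=http://0.0.0.0:23333/v1/chat/completions
# Assumption: the registered model name obtained above may need to be passed via a dedicated variable
LOCAL_LLM=<model_name_obtained_above>
```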
Finally, you can run the commands in step 2 to evaluate your VLM with the local judge LLM.
Note that
- If you hope to deploy the judge LLM on a single GPU and evaluate your VLM on other GPUs because of limited GPU memory, try restricting the visible devices with `CUDA_VISIBLE_DEVICES=x`, as in the sketch after this list.
- If the local judge LLM is not good enough at following the instructions, the evaluation may fail. Please report such failures (e.g., by opening issues).
- It's possible to deploy the judge LLM in different ways, e.g., use a private LLM (not from HuggingFace) or use a quantized LLM. Please refer to the [LMDeploy doc](https://lmdeploy.readthedocs.io/en/latest/serving/api_server.html). You can use any other deployment framework if they support OpenAI API.
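A sketch of such a split across GPUs; the serving command and model names mirror the assumptions above, so replace them with your own:

```bash
# GPU 0 serves the judge LLM; the remaining GPUs run the VLM under evaluation.
CUDA_VISIBLE_DEVICES=0 lmdeploy serve api_server internlm/internlm2-chat-1_8b --server-port 23333 --api-keys sk-123456 &
CUDA_VISIBLE_DEVICES=1,2,3 python run.py --data MMBench_DEV_EN --model <your_model_name>
```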
To help users get started quickly, we recommend the following process:
- For users who want to use VLMEvalKit, we recommend reading the "Start Your First Step" section to set up the environment and start a mini-experiment to familiarize yourself with the process.
- If you want to customize more modules, such as adding datasets and models, we provide an "Advanced Tutorial."
We always welcome users' PRs (Pull Requests) and Issues to improve VLMEvalKit!