#!/usr/bin/env bash
# Copy *.md files from ../en/ if they do not yet have a Chinese translation
for filename in $(find ../en/ -name '*.md' -printf "%P\n");
do
mkdir -p $(dirname $filename)
cp -n ../en/$filename ./$filename
done
[html writers]
table_style: colwidths-auto
# Quickstart
Before running the evaluation script, you need to **configure** the VLMs and set the model paths correctly. Then you can use the script `run.py` to run inference and evaluation on multiple VLMs and benchmarks.
## Step 0. Installation and Setup of Essential Keys
**Installation**
```bash
git clone https://github.com/open-compass/VLMEvalKit.git
cd VLMEvalKit
pip install -e .
```
**Setup Keys**
To run inference with API models (such as GPT-4v, Gemini-Pro-V, etc.), or to use an LLM API as the **judge or choice extractor**, you need to set up the API keys first. If the keys are set, VLMEvalKit will use a judge LLM to extract the answer from the model output; otherwise, it falls back to the **exact matching** mode (searching the output string for "Yes", "No", "A", "B", "C"...). **Exact matching can only be applied to Yes-or-No tasks and multiple-choice tasks**; the sketch below gives a rough idea of what that means.
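The following is only a rough, simplified illustration of the exact-matching fallback described above (it is **not** VLMEvalKit's actual implementation): the prediction string is scanned for a yes/no keyword or an option letter, and anything else would require a judge LLM.
```python
import re

def exact_match(prediction, choices=('A', 'B', 'C', 'D')):
    """Toy sketch: look for Yes/No or a bare option letter in the prediction."""
    text = prediction.strip()
    if re.search(r'\byes\b', text, re.IGNORECASE):
        return 'Yes'
    if re.search(r'\bno\b', text, re.IGNORECASE):
        return 'No'
    for choice in choices:
        if re.search(rf'\b{choice}\b', text):
            return choice
    return None  # unmatched -> a judge LLM would be needed

print(exact_match('The answer is B.'))  # -> 'B'
```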
- You can place the required keys in `$VLMEvalKit/.env`, or set them directly as environment variables. If you choose to create a `.env` file, its content will look like:
```bash
# The .env file, place it under $VLMEvalKit
# API keys of proprietary VLMs
# QwenVL APIs
DASHSCOPE_API_KEY=
# Gemini w. Google Cloud Backends
GOOGLE_API_KEY=
# OpenAI API
OPENAI_API_KEY=
OPENAI_API_BASE=
# StepAI API
STEPAI_API_KEY=
# REKA API
REKA_API_KEY=
# GLMV API
GLMV_API_KEY=
# CongRong API
CW_API_BASE=
CW_API_KEY=
# SenseChat-V API
SENSECHAT_AK=
SENSECHAT_SK=
# Hunyuan-Vision API
HUNYUAN_SECRET_KEY=
HUNYUAN_SECRET_ID=
# You can also set a proxy for evaluation; API calls made during the evaluation stage will go through this proxy
EVAL_PROXY=
```
- Fill in the keys for the APIs you plan to use; they will be loaded automatically during inference and evaluation (see the sanity-check sketch below).
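As a quick sanity check, you can verify that the keys are visible from Python. This is a minimal sketch using `python-dotenv` (already listed in `requirements.txt`); VLMEvalKit itself performs an equivalent `load_env()` step when it is imported, so this is only for manual verification.
```python
import os
from dotenv import load_dotenv  # provided by the python-dotenv package

# Load $VLMEvalKit/.env into the process environment.
load_dotenv('.env')

for key in ('OPENAI_API_KEY', 'OPENAI_API_BASE', 'GOOGLE_API_KEY'):
    print(f'{key} is set: {bool(os.environ.get(key))}')
```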
## Step 1. Configuration
**VLM Configuration**: All VLMs are configured in `vlmeval/config.py`. For some VLMs, you need to configure the code root (MiniGPT-4, PandaGPT, etc.) or the model weight root (LLaVA-v1-7B, etc.) before running the evaluation. During evaluation, you should use the model name specified in `supported_VLM` in `vlmeval/config.py` to select the VLM. For MiniGPT-4 and InstructBLIP, you also need to modify the config files in `vlmeval/vlm/misc` to configure the LLM path and ckpt path.
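To check which model names are available, you can inspect `supported_VLM` directly. The sketch below assumes a completed installation; `supported_VLM` maps each model name to a constructor, and instantiating an entry may trigger weight downloads, so the instantiation line is left commented out.
```python
from vlmeval.config import supported_VLM

# List the first few model names that can be passed to `--model`.
print(list(supported_VLM)[:10])

# Instantiating an entry builds the corresponding VLM (may download weights):
# model = supported_VLM['qwen_chat']()
```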
**The following VLMs require additional configuration steps:**
**Code preparation & installation**: InstructBLIP ([LAVIS](https://github.com/salesforce/LAVIS)), LLaVA ([LLaVA](https://github.com/haotian-liu/LLaVA)), MiniGPT-4 ([MiniGPT-4](https://github.com/Vision-CAIR/MiniGPT-4)), mPLUG-Owl2 ([mPLUG-Owl2](https://github.com/X-PLUG/mPLUG-Owl/tree/main/mPLUG-Owl2)), OpenFlamingo-v2 ([OpenFlamingo](https://github.com/mlfoundations/open_flamingo)), PandaGPT-13B ([PandaGPT](https://github.com/yxuansu/PandaGPT)), TransCore-M ([TransCore-M](https://github.com/PCIResearch/TransCore-M)).
**Manual weight preparation & configuration**: InstructBLIP, LLaVA-v1-7B, MiniGPT-4, PandaGPT-13B.
## Step 2. Evaluation
We use `run.py` for evaluation. You can run it as `$VLMEvalKit/run.py`, or create a soft link to the script (so that you can use it from anywhere):
**Arguments**
- `--data (list[str])`: Set the dataset names supported in VLMEvalKit (defined in `vlmeval/utils/dataset_config.py`).
- `--model (list[str])`: Set the VLM names supported in VLMEvalKit (defined in `supported_VLM` in `vlmeval/config.py`).
- `--mode (str, default 'all', choices ['all', 'infer'])`: When `mode` is set to `all`, both inference and evaluation are performed; when set to `infer`, only inference is performed.
- `--nproc (int, default 4)`: The number of threads for API calling.
- `--work-dir (str, default './outputs')`: The directory to store the evaluation results.
- `--nframe (int, default 8)`: The number of frames sampled from each video, only applicable to video multi-modal benchmarks.
- `--pack (bool, store_true)`: A video may be associated with multiple questions; if `pack==True`, all of them will be asked in a single query.
**Command for evaluating image multi-modal benchmarks**
You can run the script with either `python` or `torchrun`:
```bash
# When running with `python`, only one VLM instance is instantiated, and it may use multiple GPUs.
# This is recommended for evaluating very large VLMs (e.g., IDEFICS-80B-Instruct).
# Inference and evaluation with IDEFICS-80B-Instruct on MMBench_DEV_EN, MME, and SEEDBench_IMG
python run.py --data MMBench_DEV_EN MME SEEDBench_IMG --model idefics_80b_instruct --verbose
# Inference only with IDEFICS-80B-Instruct on MMBench_DEV_EN, MME, and SEEDBench_IMG
python run.py --data MMBench_DEV_EN MME SEEDBench_IMG --model idefics_80b_instruct --verbose --mode infer
# When running with `torchrun`, one VLM instance is instantiated per GPU, which can speed up inference.
# However, this only works for VLMs that consume a small amount of GPU memory.
# IDEFICS-9B-Instruct, Qwen-VL-Chat, and mPLUG-Owl2 on MMBench_DEV_EN, MME, and SEEDBench_IMG. Inference and evaluation on a node with 8 GPUs.
torchrun --nproc-per-node=8 run.py --data MMBench_DEV_EN MME SEEDBench_IMG --model idefics_9b_instruct qwen_chat mPLUG-Owl2 --verbose
# Qwen-VL-Chat on MME. Inference and evaluation on a node with 2 GPUs.
torchrun --nproc-per-node=2 run.py --data MME --model qwen_chat --verbose
```
**Command for evaluating video multi-modal benchmarks**
```bash
# When running with `python`, only one VLM instance is instantiated, and it may use multiple GPUs.
# This is recommended for evaluating very large VLMs (e.g., IDEFICS-80B-Instruct).
# Evaluate IDEFICS2-8B on MMBench-Video, sampling 8 frames per video as input, without pack mode
torchrun --nproc-per-node=8 run.py --data MMBench-Video --model idefics2_8b --nframe 8
# Evaluate GPT-4o (an API model) on MMBench-Video, sampling 16 frames per video as input, with pack mode
python run.py --data MMBench-Video --model GPT4o --nframe 16 --pack
```
The evaluation results will be printed as logs. Besides, **result files** will also be generated in the directory `$YOUR_WORKING_DIRECTORY/{model_name}`. Files ending with `.csv` contain the evaluation metrics; a quick way to inspect them is shown below.
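For example, a minimal sketch for inspecting one of the generated metric files with pandas (the file name below is hypothetical; the actual name depends on the model, dataset, and work directory you used):
```python
import pandas as pd

# Hypothetical accuracy file produced by evaluating qwen_chat on MMBench_DEV_EN
# with the default work directory './outputs'.
metrics = pd.read_csv('./outputs/qwen_chat/qwen_chat_MMBench_DEV_EN_acc.csv')
print(metrics)
```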
### Deploy a local language model as the judge / choice extractor
The default setting described above uses OpenAI's GPT as the judge LLM. You can also deploy a local judge LLM with [LMDeploy](https://github.com/InternLM/lmdeploy).
First install the packages:
```
pip install lmdeploy openai
```
Then you can deploy a local judge LLM with a single command. LMDeploy will automatically download the model from Huggingface. Assume we use internlm2-chat-1_8b as the judge, port 23333, and key sk-123456 (the key must start with "sk-", followed by any number of digits):
```
lmdeploy serve api_server internlm/internlm2-chat-1_8b --server-port 23333
```
Get the model name registered by LMDeploy with the following Python code:
```
from openai import OpenAI
client = OpenAI(
api_key='sk-123456',
base_url="http://0.0.0.0:23333/v1"
)
model_name = client.models.list().data[0].id
```
Set the corresponding environment variables to tell VLMEvalKit to use the local judge LLM. As mentioned above, you can also set them in the `$VLMEvalKit/.env` file:
```
OPENAI_API_KEY=sk-123456
OPENAI_API_BASE=http://0.0.0.0:23333/v1/chat/completions
LOCAL_LLM=<model_name you get>
```
Finally, you can run the commands in Step 2 to evaluate your VLM with the local judge LLM.
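Before launching a full evaluation, it may be worth a quick check that the local judge actually responds. Below is a minimal sketch using the `openai` client against the LMDeploy server started above; the model name is the one retrieved in the previous step.
```python
from openai import OpenAI

client = OpenAI(api_key='sk-123456', base_url='http://0.0.0.0:23333/v1')
model_name = client.models.list().data[0].id

# Send a trivial chat request to confirm the judge LLM is reachable.
resp = client.chat.completions.create(
    model=model_name,
    messages=[{'role': 'user', 'content': 'Reply with a single word: OK'}],
)
print(resp.choices[0].message.content)
```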
**Note:**
- If you want to deploy the judge LLM on a dedicated GPU and evaluate your VLM on the other GPUs due to limited GPU memory, you can restrict the visible devices with `CUDA_VISIBLE_DEVICES=x`, for example:
```
CUDA_VISIBLE_DEVICES=0 lmdeploy serve api_server internlm/internlm2-chat-1_8b --server-port 23333
CUDA_VISIBLE_DEVICES=1,2,3 torchrun --nproc-per-node=3 run.py --data HallusionBench --model qwen_chat --verbose
```
- If the local judge LLM is not good enough at following instructions, the evaluation may fail. Please report such failures via issues.
- The judge LLM can be deployed in other ways, for example using a private LLM (instead of one from HuggingFace) or a quantized LLM. Please refer to the [LMDeploy doc](https://lmdeploy.readthedocs.io/en/latest/serving/api_server.html). Any other framework that supports the OpenAI API can also be used.
Welcome to the VLMEvalKit Chinese Tutorial!
=============================================
Getting Started with VLMEvalKit
--------------------------------
To help users get started quickly, we recommend the following workflow:
- If you want to use VLMEvalKit, we recommend reading the `Get Started`_ section first to set up the environment and run a mini experiment to get familiar with the workflow.
- If you want to customize more modules, such as adding new datasets and models, see the `Advanced Tutorials`_.
Your PRs and Issues to further improve VLMEvalKit are always welcome!
.. _Get Started:
.. toctree::
:maxdepth: 1
:caption: Get Started
get_started/Quickstart.md
.. .. _Tutorials:
.. .. toctree::
.. :maxdepth: 1
.. :caption: Tutorials
.. user_guides/framework_overview.md
.. _Advanced Tutorials:
.. toctree::
:maxdepth: 1
:caption: Advanced Tutorials
advanced_guides/Development.md
.. .. _Notes:
.. .. toctree::
.. :maxdepth: 1
.. :caption: Notes
.. notes/contribution_guide.md
Indexes and Tables
==================
* :ref:`genindex`
* :ref:`search`
decord
gradio
huggingface_hub
imageio
matplotlib
moviepy
numpy>=1.23.4
omegaconf
openai==1.3.5
opencv-python>=4.4.0.46
openpyxl
pandas
peft
pillow
portalocker
protobuf
python-dotenv
requests
rich
sentencepiece
setuptools
sty
tabulate
tiktoken
timeout-decorator
torch>=2.0.1
tqdm
transformers
typing_extensions==4.7.1
validators
xlsxwriter
docutils==0.18.1
modelindex
myst-parser
-e git+https://github.com/open-compass/pytorch_sphinx_theme.git#egg=pytorch_sphinx_theme
sphinx==6.1.3
sphinx-copybutton
sphinx-design
sphinx-notfound-page
sphinx-tabs
sphinxcontrib-jquery
tabulate
import torch
import torch.distributed as dist
from vlmeval.config import supported_VLM
from vlmeval.dataset import build_dataset
from vlmeval.inference import infer_data_job
from vlmeval.inference_video import infer_data_job_video
from vlmeval.inference_mt import infer_data_job_mt
from vlmeval.smp import *
from vlmeval.utils.result_transfer import MMMU_result_transfer, MMTBench_result_transfer
def parse_args():
parser = argparse.ArgumentParser()
# Essential Args
parser.add_argument('--data', type=str, nargs='+', required=True)
parser.add_argument('--model', type=str, nargs='+', required=True)
# Args that only apply to Video Dataset
parser.add_argument('--nframe', type=int, default=8)
parser.add_argument('--pack', action='store_true')
parser.add_argument('--use-subtitle', action='store_true')
# Work Dir
parser.add_argument('--work-dir', type=str, default='./outputs', help='select the output directory')
# Infer + Eval or Infer Only
parser.add_argument('--mode', type=str, default='all', choices=['all', 'infer'])
# API Kwargs, Apply to API VLMs and Judge API LLMs
parser.add_argument('--nproc', type=int, default=4, help='Parallel API calling')
parser.add_argument('--retry', type=int, default=None, help='retry numbers for API VLMs')
# Explicitly Set the Judge Model
parser.add_argument('--judge', type=str, default=None)
# Logging Utils
parser.add_argument('--verbose', action='store_true')
# Configuration for Resume
# Ignore: will not rerun failed VLM inference
parser.add_argument('--ignore', action='store_true', help='Ignore failed indices. ')
# Rerun: will remove all evaluation temp files
parser.add_argument('--rerun', action='store_true')
args = parser.parse_args()
return args
def main():
logger = get_logger('RUN')
args = parse_args()
assert len(args.data), '--data should be a list of data files'
if 'MMEVAL_ROOT' in os.environ:
args.work_dir = os.environ['MMEVAL_ROOT']
if args.retry is not None:
for k, v in supported_VLM.items():
if hasattr(v, 'keywords') and 'retry' in v.keywords:
v.keywords['retry'] = args.retry
supported_VLM[k] = v
if hasattr(v, 'keywords') and 'verbose' in v.keywords:
v.keywords['verbose'] = args.verbose
supported_VLM[k] = v
rank, world_size = get_rank_and_world_size()
if world_size > 1:
local_rank = os.environ.get('LOCAL_RANK', 0)
torch.cuda.set_device(int(local_rank))
dist.init_process_group(backend='nccl', timeout=datetime.timedelta(seconds=10800))
for _, model_name in enumerate(args.model):
model = None
pred_root = osp.join(args.work_dir, model_name)
os.makedirs(pred_root, exist_ok=True)
for _, dataset_name in enumerate(args.data):
try:
dataset_kwargs = {}
if dataset_name in ['MMLongBench_DOC', 'DUDE', 'DUDE_MINI', 'SLIDEVQA', 'SLIDEVQA_MINI']:
dataset_kwargs['model'] = model_name
if dataset_name == 'MMBench-Video':
dataset_kwargs['pack'] = args.pack
if dataset_name == 'Video-MME':
dataset_kwargs['use_subtitle'] = args.use_subtitle
# If distributed, first build the dataset on the main process for doing preparation works
if world_size > 1:
dataset = build_dataset(dataset_name, **dataset_kwargs) if rank == 0 else None
dist.barrier()
dataset_list = [dataset]
dist.broadcast_object_list(dataset_list, src=0)
dataset = dataset_list[0]
else:
dataset = build_dataset(dataset_name, **dataset_kwargs)
if dataset is None:
logger.error(f'Dataset {dataset_name} is not valid, will be skipped. ')
continue
result_file = f'{pred_root}/{model_name}_{dataset_name}.xlsx'
if dataset_name in ['MMBench-Video']:
packstr = 'pack' if args.pack else 'nopack'
result_file = f'{pred_root}/{model_name}_{dataset_name}_{args.nframe}frame_{packstr}.xlsx'
elif dataset.MODALITY == 'VIDEO':
if args.pack:
logger.info(f'{dataset_name} does not support Pack Mode, switching to unpack')
args.pack = False
packstr = 'pack' if args.pack else 'nopack'
result_file = f'{pred_root}/{model_name}_{dataset_name}_{args.nframe}frame_{packstr}.xlsx'
if dataset_name in ['Video-MME']:
subtitlestr = 'subs' if args.use_subtitle else 'nosubs'
result_file = result_file.replace('.xlsx', f'_{subtitlestr}.xlsx')
if dataset.TYPE == 'MT':
result_file = result_file.replace('.xlsx', '.tsv')
if osp.exists(result_file) and args.rerun:
for keyword in ['openai', 'gpt', 'auxmatch']:
os.system(f'rm {pred_root}/{model_name}_{dataset_name}_{keyword}*')
if model is None:
model = model_name # which is only a name
# Perform the Inference
if dataset.MODALITY == 'VIDEO':
model = infer_data_job_video(
model,
work_dir=pred_root,
model_name=model_name,
dataset=dataset,
nframe=args.nframe,
pack=args.pack,
verbose=args.verbose,
subtitle=args.use_subtitle,
api_nproc=args.nproc)
elif dataset.TYPE == 'MT':
model = infer_data_job_mt(
model,
work_dir=pred_root,
model_name=model_name,
dataset=dataset,
verbose=args.verbose,
api_nproc=args.nproc,
ignore_failed=args.ignore)
else:
model = infer_data_job(
model,
work_dir=pred_root,
model_name=model_name,
dataset=dataset,
verbose=args.verbose,
api_nproc=args.nproc,
ignore_failed=args.ignore)
# Set the judge kwargs first before evaluation or dumping
judge_kwargs = {
'nproc': args.nproc,
'verbose': args.verbose,
}
if args.retry is not None:
judge_kwargs['retry'] = args.retry
if args.judge is not None:
judge_kwargs['model'] = args.judge
else:
if dataset.TYPE in ['MCQ', 'Y/N'] or listinstr(['MathVerse'], dataset_name):
judge_kwargs['model'] = 'chatgpt-0125'
elif listinstr(['MMVet', 'MathVista', 'LLaVABench', 'MMBench-Video', 'MathVision'],
dataset_name):
judge_kwargs['model'] = 'gpt-4-turbo'
elif listinstr(['MMLongBench', 'MMDU', 'DUDE', 'DUDE_MINI', 'SLIDEVQA', 'SLIDEVQA_MINI'],
dataset_name):
judge_kwargs['model'] = 'gpt-4o'
if 'OPENAI_API_KEY_JUDGE' in os.environ and len(os.environ['OPENAI_API_KEY_JUDGE']):
judge_kwargs['key'] = os.environ['OPENAI_API_KEY_JUDGE']
if 'OPENAI_API_BASE_JUDGE' in os.environ and len(os.environ['OPENAI_API_BASE_JUDGE']):
judge_kwargs['api_base'] = os.environ['OPENAI_API_BASE_JUDGE']
if rank == 0:
if dataset_name in ['MMMU_TEST']:
result_json = MMMU_result_transfer(result_file)
logger.info(f'Transfer MMMU_TEST result to json for official evaluation, '
f'json file saved in {result_json}') # noqa: E501
continue
elif 'MMT-Bench_ALL' in dataset_name:
submission_file = MMTBench_result_transfer(result_file, **judge_kwargs)
logger.info(f'Extract options from prediction of MMT-Bench FULL split for official evaluation '
f'(https://eval.ai/web/challenges/challenge-page/2328/overview), '
f'submission file saved in {submission_file}') # noqa: E501
continue
elif 'MLLMGuard_DS' in dataset_name:
logger.info('The evaluation of MLLMGuard_DS is not supported yet. ') # noqa: E501
continue
elif 'AesBench_TEST' == dataset_name:
logger.info(f'The results are saved in {result_file}. '
f'Please send it to the AesBench Team via huangyipo@hotmail.com.') # noqa: E501
continue
if dataset_name in [
'MMBench_TEST_CN', 'MMBench_TEST_EN', 'MMBench', 'MMBench_CN',
'MMBench_TEST_CN_V11', 'MMBench_TEST_EN_V11', 'MMBench_V11', 'MMBench_CN_V11'
]:
if not MMBenchOfficialServer(dataset_name):
logger.error(
f'Can not evaluate {dataset_name} on non-official servers, '
'will skip the evaluation. '
)
continue
eval_proxy = os.environ.get('EVAL_PROXY', None)
old_proxy = os.environ.get('HTTP_PROXY', '')
if rank == 0 and args.mode == 'all':
if eval_proxy is not None:
proxy_set(eval_proxy)
eval_results = dataset.evaluate(result_file, **judge_kwargs)
if eval_results is not None:
assert isinstance(eval_results, dict) or isinstance(eval_results, pd.DataFrame)
logger.info(f'The evaluation of model {model_name} x dataset {dataset_name} has finished! ')
logger.info('Evaluation Results:')
if isinstance(eval_results, dict):
logger.info('\n' + json.dumps(eval_results, indent=4))
elif isinstance(eval_results, pd.DataFrame):
if len(eval_results) < len(eval_results.columns):
eval_results = eval_results.T
logger.info('\n' + tabulate(eval_results))
if eval_proxy is not None:
proxy_set(old_proxy)
except Exception as e:
logger.exception(f'Model {model_name} x Dataset {dataset_name} combination failed: {e}, '
'skipping this combination.')
continue
if world_size > 1:
dist.barrier()
if __name__ == '__main__':
load_env()
main()
{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import os, cv2\n",
"import string\n",
"import os.path as osp\n",
"import numpy as np\n",
"from collections import defaultdict\n",
"from vlmeval.smp import ls, load, dump, download_file, encode_image_file_to_base64, md5, mrlines\n",
"import pandas as pd\n",
"import matplotlib.pyplot as plt\n",
"import multiprocessing as mp\n",
"from PIL import Image, ImageFont, ImageDraw\n",
"\n",
"font_URL = 'http://opencompass.openxlab.space/utils/Fonts/timesb.ttf'\n",
"font_file = 'timesb.ttf'\n",
"if not osp.exists(font_file):\n",
" download_file(font_URL)\n",
" \n",
"test_split_URL = 'https://s3-us-east-2.amazonaws.com/prior-datasets/ai2d_test_ids.csv'\n",
"test_split_file = 'ai2d_test_ids.csv'\n",
"if not osp.exists(test_split_file):\n",
" download_file(test_split_URL)\n",
" \n",
"test_ids = set(mrlines(test_split_file))\n",
" \n",
"def proper_font_size(font_file, wh, text, ratio=1):\n",
" font_size = 2\n",
" while True:\n",
" font = ImageFont.truetype(font_file, font_size)\n",
" real_box = font.getbbox(text)\n",
" real_wh = (real_box[2] - real_box[0], real_box[3] - real_box[1])\n",
" if real_wh[0] > wh[0] * ratio or real_wh[1] > wh[1] * ratio:\n",
" break\n",
" font_size += 1\n",
" return font_size\n",
"\n",
"def cover_image(ann_path):\n",
" data = load(ann_path)\n",
" texts = list(data['text'].values())\n",
" raw_img = ann_path.replace('annotations', 'images').replace('.json', '')\n",
" tgt_img = raw_img.replace('images', 'images_abc')\n",
" img = Image.open(raw_img)\n",
" draw = ImageDraw.Draw(img)\n",
" for text in texts:\n",
" st, ed = tuple(text['rectangle'][0]), tuple(text['rectangle'][1])\n",
" T = text['replacementText']\n",
" draw.rectangle((st, ed), fill='white')\n",
" font_size = proper_font_size(font_file, (ed[0] - st[0], ed[1] - st[1]), T, ratio=1)\n",
" font = ImageFont.truetype(font_file, font_size)\n",
" text_box = font.getbbox(T)\n",
" text_wh = (text_box[2] - text_box[0], text_box[3] - text_box[1])\n",
" cx, cy = (st[0] + ed[0]) // 2, st[1]\n",
" stx = cx - text_wh[0] // 2\n",
" sty = cy - text_wh[1] // 2\n",
" draw.text((stx, sty), T, font=font, fill='black')\n",
" img.save(tgt_img) "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Process for no mask images\n",
"test_ids = set(mrlines(test_split_file))\n",
"\n",
"def detect_image_color(image):\n",
" gray_image = image.convert('L')\n",
" mean_brightness = np.mean(np.array(gray_image))\n",
" if mean_brightness < 127:\n",
" return 'white'\n",
" else:\n",
" return 'black'\n",
"\n",
"def cover_image(ann_path):\n",
" data = load(ann_path)\n",
" texts = list(data['text'].values())\n",
" raw_img = ann_path.replace('annotations', 'images').replace('.json', '')\n",
" tgt_img = raw_img.replace('images', 'images_abc')\n",
" img = Image.open(raw_img)\n",
" draw = ImageDraw.Draw(img)\n",
" color = detect_image_color(img)\n",
" font_size = 0\n",
" for text in texts:\n",
" st, ed = tuple(text['rectangle'][0]), tuple(text['rectangle'][1])\n",
" font_size += (ed[1] - st[1])\n",
" if len(texts) != 0:\n",
" font_size /= len(texts)\n",
" else:\n",
" font_size = 2\n",
" for text in texts:\n",
" st, ed = tuple(text['rectangle'][0]), tuple(text['rectangle'][1])\n",
" T = text['replacementText']\n",
" for i in range(2):\n",
" draw.rectangle(\n",
" [(st[0] - i, st[1] - i), (ed[0] + i, ed[1] + i)],\n",
" outline=color\n",
" )\n",
" font = ImageFont.truetype(font_file, font_size)\n",
" text_box = font.getbbox(T)\n",
" text_wh = (text_box[2] - text_box[0], text_box[3] - text_box[1])\n",
" cx, cy = (st[0] + ed[0]) // 2, st[1]\n",
" stx = cx - text_wh[0] // 2\n",
" sty = cy - text_wh[1] * 1.5\n",
" if sty < 0:\n",
" sty = cy + text_wh[1] * 1.3\n",
" draw.text((stx, sty), T, font=font, fill=color)\n",
" img.save(tgt_img) "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"download_file('https://ai2-public-datasets.s3.amazonaws.com/diagrams/ai2d-all.zip')\n",
"os.system('unzip -o ai2d-all.zip')\n",
"\n",
"images = ls('ai2d/images/')\n",
"questions = ls('ai2d/questions/')\n",
"annotations = ls('ai2d/annotations/')\n",
"cates = load('ai2d/categories.json')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"pool = mp.Pool(32)\n",
"pool.map(cover_image, annotations)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def puncproc(inText):\n",
" import re\n",
" outText = inText\n",
" punct = [\n",
" ';', r'/', '[', ']', '\"', '{', '}', '(', ')', '=', '+', '\\\\', '_', '-',\n",
" '>', '<', '@', '`', ',', '?', '!'\n",
" ]\n",
" commaStrip = re.compile('(\\d)(,)(\\d)') # noqa: W605\n",
" periodStrip = re.compile('(?!<=\\d)(\\.)(?!\\d)') # noqa: W605\n",
" for p in punct:\n",
" if (p + ' ' in inText or ' ' + p in inText) or (re.search(commaStrip, inText) is not None):\n",
" outText = outText.replace(p, '')\n",
" else:\n",
" outText = outText.replace(p, ' ')\n",
" outText = periodStrip.sub('', outText, re.UNICODE)\n",
" return outText\n",
"\n",
"def check_choices(line):\n",
" def ischar(s):\n",
" s = str(s)\n",
" if s in ['{}', 'Both', 'None of above']:\n",
" return True\n",
" elif s.startswith('Stage ') and ischar(s[6:]):\n",
" return True\n",
" elif ' and ' in s and np.all([ischar(x) for x in s.split(' and ')]):\n",
" return True\n",
" elif len(s) <= 2:\n",
" return True\n",
" elif len(puncproc(s).split()) > 1:\n",
" return np.all([ischar(x) for x in puncproc(s).split()])\n",
" return False\n",
" n_char = sum([ischar(line[x]) for x in 'ABCD'])\n",
" return n_char >= 3\n",
"\n",
"def check_question(question):\n",
" words = puncproc(question).split()\n",
" for ch in string.ascii_lowercase + string.ascii_uppercase:\n",
" if ch in words:\n",
" return True\n",
" return False\n",
"\n",
"def is_abc(abc, choices, question):\n",
" if abc == 0:\n",
" return False\n",
" if check_choices(choices):\n",
" return True\n",
" if check_question(question):\n",
" return True\n",
" return False"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"data_all = defaultdict(list)\n",
"for qfile in questions:\n",
" data = load(qfile)\n",
" idx = data['imageName'].split('.')[0]\n",
" if idx not in test_ids:\n",
" continue\n",
" image_pth = qfile.replace('questions', 'images').replace('.json', '')\n",
" cate = cates[image_pth.split('/')[-1]]\n",
" for q, qmeta in data['questions'].items():\n",
" assert '.png-' in qmeta['questionId']\n",
" main, sub = qmeta['questionId'].split('.png-')\n",
" idx = int(main) * 100 + int(sub)\n",
" \n",
" answers = qmeta['answerTexts']\n",
" correct = qmeta['correctAnswer']\n",
" \n",
" data_all['index'].append(idx)\n",
" data_all['question'].append(q)\n",
" assert len(answers) == 4\n",
" for c, a in zip('ABCD', answers):\n",
" data_all[c].append(a)\n",
" data_all['answer'].append('ABCD'[qmeta['correctAnswer']])\n",
" data_all['category'].append(cate)\n",
" data_all['abcLabel'].append(qmeta['abcLabel'])\n",
" abc = is_abc(qmeta['abcLabel'], {x: data_all[x][-1] for x in 'ABCD'}, q)\n",
" # if qmeta['abcLabel'] and not abc:\n",
" # print(qmeta['abcLabel'], {x: data_all[x][-1] for x in 'ABCD'}, q)\n",
" data_all['image_path'].append(image_pth.replace('images', 'images_abc') if abc else image_pth)\n",
"data = pd.DataFrame(data_all)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"images = []\n",
"image_seen = {}\n",
"for idx, pth in zip(data['index'], data['image_path']):\n",
" images.append(encode_image_file_to_base64(pth))\n",
"\n",
"data['image'] = images\n",
"dump(data, 'AI2D_TEST.tsv')\n",
"print(md5('AI2D_TEST.tsv'))"
]
}
],
"metadata": {
"language_info": {
"name": "python"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
import sys
from vlmeval import *
from vlmeval.dataset import SUPPORTED_DATASETS
FAIL_MSG = 'Failed to obtain answer via API.'
root = sys.argv[1]
if root[-1] in '/\\':
root = root[:-1]
model_name = root.split('/')[-1]
for d in SUPPORTED_DATASETS:
fname = f'{model_name}_{d}.xlsx'
pth = osp.join(root, fname)
if osp.exists(pth):
data = load(pth)
# Detect Failure
assert 'prediction' in data
data['prediction'] = [str(x) for x in data['prediction']]
fail = [FAIL_MSG in x for x in data['prediction']]
if sum(fail):
nfail = sum(fail)
ntot = len(fail)
print(f'Model {model_name} x Dataset {d}: {nfail} out of {ntot} failed. {nfail / ntot * 100: .2f}%. ')
eval_files = ls(root, match=f'{model_name}_{d}_')
eval_files = [x for x in eval_files if listinstr([f'{d}_openai', f'{d}_gpt'], x) and x.endswith('.xlsx')]
if len(eval_files) == 0:
print(f'Model {model_name} x Dataset {d} openai missing')
continue
assert len(eval_files) == 1
eval_file = eval_files[0]
data = load(eval_file)
if 'MMVet' in d:
bad = [x for x in data['log'] if 'All 5 retries failed.' in str(x)]
if len(bad):
print(f'Model {model_name} x Dataset {d} Evaluation: {len(bad)} out of {len(data)} failed.')
elif 'MathVista' in d:
bad = [x for x in data['res'] if FAIL_MSG in str(x)]
if len(bad):
print(f'Model {model_name} x Dataset {d} Evaluation: {len(bad)} out of {len(data)} failed.')
elif d == 'LLaVABench':
sub = data[data['gpt4_score'] == -1]
sub = sub[sub['gpt4_score'] == -1]
if len(sub):
print(f'Model {model_name} x Dataset {d} Evaluation: {len(sub)} out of {len(data)} failed.')
else:
bad = [x for x in data['log'] if FAIL_MSG in str(x)]
if len(bad):
print(f'Model {model_name} x Dataset {d} Evaluation: {len(bad)} out of {len(data)} failed.')
import argparse
from vlmeval.smp import *
from vlmeval.config import supported_VLM
def is_api(x):
return getattr(supported_VLM[x].func, 'is_api', False)
models = list(supported_VLM)
models = [x for x in models if 'fs' not in x]
models = [x for x in models if not is_api(x)]
exclude_list = ['cogvlm-grounding-generalist', 'emu2']
models = [x for x in models if x not in exclude_list]
def is_large(x):
return '80b' in x or 'emu2' in x or '34B' in x
small_models = [x for x in models if not is_large(x)]
large_models = [x for x in models if is_large(x)]
models = small_models + large_models
parser = argparse.ArgumentParser()
parser.add_argument('--data', type=str, nargs='+', required=True)
args = parser.parse_args()
# Skip some models
models = [x for x in models if not listinstr(['MiniGPT', 'grounding-generalist'], x)]
for m in models:
unknown_datasets = [x for x in args.data if not osp.exists(f'{m}/{m}_{x}.xlsx')]
if len(unknown_datasets) == 0:
continue
dataset_str = ' '.join(unknown_datasets)
if '80b' in m:
cmd = f'python run.py --data {dataset_str} --model {m}'
else:
cmd = f'bash run.sh --data {dataset_str} --model {m}'
print(cmd)
os.system(cmd)
#!/bin/bash
DIR=$( cd -- "$( dirname -- "${BASH_SOURCE[0]}" )" &> /dev/null && pwd )
cp $DIR/../config.py $DIR/../vlmeval/
cp $DIR/../misc/* $DIR/../vlmeval/vlm/misc/
from vlmeval.smp import *
from vlmeval.evaluate.multiple_choice import multiple_choice_eval
import gradio as gr
HEADER = """
# Welcome to MMBench👏👏
We are delighted that you are willing to submit your evaluation results to the MMBench official website! The evaluation service currently handles submissions for MMBench, MMBench-CN, and CCBench. We use `gpt-3.5-turbo-0125` to help with answer matching. Evaluation Codes in VLMEvalKit: https://github.com/open-compass/VLMEvalKit. Please adopt / follow the implementation of VLMEvalKit to generate the submission files.
Moreover, this is a temporary solution, which **does not support ChatGPT-based answer extraction**. So you need to make sure the values in the `prediction` field of your submission files are single characters among A, B, C, D. Other free-form answers cannot be recognized by the evaluation script and will be marked as **WRONG**.
The evaluation script is available at https://github.com/open-compass/VLMEvalKit/tree/main/scripts/mmb_eval_gradio.py
Please contact `opencompass@pjlab.org.cn` for any inquiries about this script.
"""
def upload_file(file):
file_path = file.name
return file_path
def prepare_file(file_name):
file_md5 = md5(file_name)
root = LMUDataRoot()
root = osp.join(root, 'eval_server')
os.makedirs(root, exist_ok=True)
suffix = file_name.split('.')[-1]
if suffix not in ['xlsx', 'tsv', 'csv']:
return False, "Please submit a file that ends with `.xlsx`, `.tsv`, or `.csv`"
new_file_name = osp.join(root, f'{file_md5}.{suffix}')
shutil.move(file_name, new_file_name)
eval_file = new_file_name
try:
data = load(eval_file)
except:
return False, "Your excel file can not be successfully loaded by `pd.read_excel`, please double check and submit again. "
for k in data.keys():
data[k.lower() if k not in 'ABCD' else k] = data.pop(k)
if "index" not in data:
return False, "Your excel file should have a column named `index`, please double check and submit again" , {}
if "prediction" not in data:
return False, "Your excel file should have a column named `prediction`, please double check and submit again" , {}
for ch in 'ABCD':
if ch not in data:
return False, f"Your excel file should have a column named `{ch}`, please double check and submit again" , {}
dump(data, eval_file)
return True, eval_file
def determine_dataset(eval_file):
data = load(eval_file)
def cn_ratio(data):
iscn = [cn_string(x) for x in data['question']]
return np.mean(iscn)
if len(data) < 2500 and 'l2-category' not in data:
return 'CCBench' if cn_ratio(data) > 0.5 else "Unknown"
else:
return 'MMBench_CN' if cn_ratio(data) > 0.5 else "MMBench"
def reformat_acc(acc):
splits = set(acc['split'])
keys = list(acc.keys())
keys.remove('split')
nacc = {'Category': []}
for sp in splits:
nacc[sp.upper()] = []
for k in keys:
nacc['Category'].append(k)
for sp in splits:
nacc[sp.upper()].append(acc[acc['split'] == sp].iloc[0][k] * 100)
return pd.DataFrame(nacc)
def evaluate(file):
file_name = file.name
flag, eval_file = prepare_file(file_name)
if not flag:
return "Error: " + eval_file
dataset = determine_dataset(eval_file)
if dataset == 'Unknown':
return "Error: Cannot determine the dataset given your submitted file. "
eval_id = eval_file.split('/')[-1].split('.')[0]
ret = f"Evaluation ID: {eval_id}\n"
timestamp = datetime.datetime.now().strftime('%Y.%m.%d %H:%M:%S')
ret += f'Evaluation Timestamp: {timestamp}\n'
acc = multiple_choice_eval(eval_file, dataset=dataset, model='exact_matching')
nacc = reformat_acc(acc).round(1)
return ret, nacc
with gr.Blocks() as demo:
gr.Markdown(HEADER)
file_output = gr.File()
upload_button = gr.UploadButton("Click to upload your prediction files for a supported benchmark")
upload_button.upload(upload_file, upload_button, file_output)
btn = gr.Button("🚀 Evaluate")
eval_log = gr.Textbox(label="Evaluation Log", placeholder="Your evaluation log will be displayed here")
df_empty = pd.DataFrame([], columns=['Evaluation Result'])
eval_result = gr.components.DataFrame(value=df_empty)
btn.click(evaluate, inputs=[file_output], outputs=[eval_log, eval_result])
if __name__ == '__main__':
demo.launch(server_name='0.0.0.0', debug=True, show_error=True)
#!/bin/bash
set -x
export GPU=$(nvidia-smi --list-gpus | wc -l)
torchrun --nproc-per-node=$GPU run.py ${@:1}
#!/bin/bash
set -x
srun -n1 --ntasks-per-node=1 --partition $1 --gres=gpu:8 --quotatype=reserved --job-name vlmeval --cpus-per-task=64 torchrun --nproc-per-node=8 run.py ${@:2}
from vlmeval.smp import *
from vlmeval.dataset import dataset_URLs
def get_score(model, dataset):
file_name = f'{model}/{model}_{dataset}'
if listinstr([
'CCBench', 'MMBench', 'SEEDBench_IMG', 'MMMU', 'ScienceQA',
'AI2D_TEST', 'MMStar', 'RealWorldQA', 'BLINK'
], dataset):
file_name += '_acc.csv'
elif listinstr(['MME', 'Hallusion', 'LLaVABench'], dataset):
file_name += '_score.csv'
elif listinstr(['MMVet', 'MathVista'], dataset):
file_name += '_gpt-4-turbo_score.csv'
elif listinstr(['COCO', 'OCRBench'], dataset):
file_name += '_score.json'
else:
raise NotImplementedError
if not osp.exists(file_name):
return {}
data = load(file_name)
ret = {}
if dataset == 'CCBench':
ret[dataset] = data['Overall'][0] * 100
elif dataset == 'MMBench':
for n, a in zip(data['split'], data['Overall']):
if n == 'dev':
ret['MMBench_DEV_EN'] = a * 100
elif n == 'test':
ret['MMBench_TEST_EN'] = a * 100
elif dataset == 'MMBench_CN':
for n, a in zip(data['split'], data['Overall']):
if n == 'dev':
ret['MMBench_DEV_CN'] = a * 100
elif n == 'test':
ret['MMBench_TEST_CN'] = a * 100
elif listinstr(['SEEDBench', 'ScienceQA', 'MMBench', 'AI2D_TEST', 'MMStar', 'RealWorldQA', 'BLINK'], dataset):
ret[dataset] = data['Overall'][0] * 100
elif 'MME' == dataset:
ret[dataset] = data['perception'][0] + data['reasoning'][0]
elif 'MMVet' == dataset:
data = data[data['Category'] == 'Overall']
ret[dataset] = float(data.iloc[0]['acc'])
elif 'HallusionBench' == dataset:
data = data[data['split'] == 'Overall']
for met in ['aAcc', 'qAcc', 'fAcc']:
ret[dataset + f' ({met})'] = float(data.iloc[0][met])
elif 'MMMU' in dataset:
data = data[data['split'] == 'validation']
ret['MMMU (val)'] = float(data.iloc[0]['Overall']) * 100
elif 'MathVista' in dataset:
data = data[data['Task&Skill'] == 'Overall']
ret[dataset] = float(data.iloc[0]['acc'])
elif 'LLaVABench' in dataset:
data = data[data['split'] == 'overall'].iloc[0]
ret[dataset] = float(data['Relative Score (main)'])
elif 'OCRBench' in dataset:
ret[dataset] = data['Final Score']
return ret
def parse_args():
parser = argparse.ArgumentParser()
parser.add_argument('--data', type=str, nargs='+', default=[])
parser.add_argument("--model", type=str, nargs='+', required=True)
args = parser.parse_args()
return args
def gen_table(models, datasets):
res = defaultdict(dict)
for m in models:
for d in datasets:
try:
res[m].update(get_score(m, d))
except:
pass
keys = []
for m in models:
for d in res[m]:
keys.append(d)
keys = list(set(keys))
keys.sort()
final = defaultdict(list)
for m in models:
final['Model'].append(m)
for k in keys:
if k in res[m]:
final[k].append(res[m][k])
else:
final[k].append(None)
final = pd.DataFrame(final)
dump(final, 'summ.csv')
if len(final) >= len(final.iloc[0].keys()):
print(tabulate(final))
else:
print(tabulate(final.T))
if __name__ == '__main__':
args = parse_args()
if args.data == []:
args.data = list(dataset_URLs)
gen_table(args.model, args.data)
{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import json\n",
"import copy as cp\n",
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"import matplotlib.font_manager as fm\n",
"\n",
"def download_file(url, filename=None):\n",
" from urllib.request import urlretrieve\n",
" if filename is None:\n",
" filename = url.split('/')[-1]\n",
" urlretrieve(url, filename)\n",
"\n",
"font_URL = 'http://opencompass.openxlab.space/utils/Fonts/segoepr.ttf'\n",
"download_file(font_URL)\n",
"\n",
"font12 = fm.FontProperties(fname='segoepr.ttf', size=12)\n",
"font15 = fm.FontProperties(fname='segoepr.ttf', size=15, weight='bold')\n",
"font18 = fm.FontProperties(fname='segoepr.ttf', size=18, weight='bold')\n",
"\n",
"DATA_URL = 'http://opencompass.openxlab.space/utils/OpenVLM.json'\n",
"download_file(DATA_URL)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def pre_normalize(raw_data, labels):\n",
" data_list = cp.deepcopy(raw_data)\n",
" minimum, maximum, max_range, range_map = {}, {}, 0, {}\n",
" for lb in labels:\n",
" minimum[lb] = min([x[lb] for x in data_list])\n",
" maximum[lb] = max([x[lb] for x in data_list])\n",
" max_range = max(max_range, maximum[lb] - minimum[lb])\n",
" max_range *= 1.25\n",
" for lb in labels:\n",
" mid = (minimum[lb] + maximum[lb]) / 2\n",
" new_range = (mid - max_range / 2, mid + max_range / 2) if (mid + max_range / 2) < 100 else (100 - max_range, 100)\n",
" range_map[lb] = new_range\n",
" for item in data_list:\n",
" assert new_range[0] <= item[lb] <= new_range[1]\n",
" item[lb] = (item[lb] - new_range[0]) / max_range * 100\n",
" return data_list, range_map"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Draw MMBench Radar Graph\n",
"data = json.loads(open('OpenVLM.json').read())['results']\n",
"models = list(data)\n",
"print(models)\n",
"\n",
"model2vis = [\n",
" 'GPT-4v (detail: low)', 'GeminiProVision', 'Qwen-VL-Plus', \n",
" 'InternLM-XComposer2-VL', 'LLaVA-v1.5-13B', 'CogVLM-17B-Chat',\n",
" 'mPLUG-Owl2', 'Qwen-VL-Chat', 'IDEFICS-80B-Instruct'\n",
"]\n",
"colors = [\n",
" '#1f77b4', '#ff7f0e', '#2ca02c', '#d62728', '#9467bd', '#8c564b', \n",
" '#e377c2', '#7f7f7f', '#bcbd22'\n",
"]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"split = 'MMBench_TEST_EN'\n",
"data_sub = {k: v[split] for k, v in data.items()}\n",
"\n",
"labels = list(data_sub[model2vis[0]])\n",
"labels.remove('Overall')\n",
"num_vars = len(labels)\n",
"\n",
"raw_data = [data_sub[m] for m in model2vis]\n",
"data_list, range_map = pre_normalize(raw_data, labels)\n",
"\n",
"alpha = 0.25\n",
"angles = np.linspace(0, 2 * np.pi, num_vars, endpoint=False).tolist()\n",
"angles_deg = np.linspace(0, 360, num_vars, endpoint=False).tolist()\n",
"fig, ax_base = plt.subplots(nrows=1, ncols=1, figsize=(10, 10), subplot_kw=dict(polar=True))\n",
"\n",
"for i in range(len(data_list)):\n",
" item = data_list[i]\n",
" model_name = model2vis[i]\n",
" color = colors[i]\n",
" tmp_angles = angles[:] + [angles[0]]\n",
" tmp_values = [item[lb] for lb in labels] + [item[labels[0]]]\n",
" ax_base.plot(tmp_angles, tmp_values, color=color, linewidth=1, linestyle='solid', label=model_name)\n",
" ax_base.fill(tmp_angles, tmp_values, color=color, alpha=alpha)\n",
" \n",
"angles += [angles[0]]\n",
"ax_base.set_ylim(0, 100)\n",
"ax_base.set_yticks([40, 60, 80, 100])\n",
"ax_base.set_yticklabels([''] * 4)\n",
"\n",
"ax_base.tick_params(pad=25)\n",
"ax_base.set_xticks(angles[:-1])\n",
"ax_base.set_xticklabels(labels, fontproperties=font18)\n",
"\n",
"leg = ax_base.legend(loc='center right', bbox_to_anchor=(1.6, 0.5), prop=font15, ncol=1, frameon=True, labelspacing=1.2)\n",
"for line in leg.get_lines():\n",
" line.set_linewidth(2.5)\n",
"\n",
"cx, cy, sz = 0.44, 0.435, 0.34\n",
"axes = [fig.add_axes([cx - sz, cy - sz, cx + sz, cy + sz], projection='polar', label='axes%d' % i) for i in range(num_vars)]\n",
" \n",
"for ax, angle, label in zip(axes, angles_deg, labels):\n",
" ax.patch.set_visible(False)\n",
" ax.grid(False)\n",
" ax.xaxis.set_visible(False)\n",
" cur_range = range_map[label]\n",
" label_list = [cur_range[0] + (cur_range[1] - cur_range[0]) / 5 * i for i in range(2, 6)]\n",
" label_list = [f'{x:.1f}' for x in label_list]\n",
" ax.set_rgrids(range(40, 120, 20), angle=angle, labels=label_list, font_properties=font12)\n",
" ax.spines['polar'].set_visible(False)\n",
" ax.set_ylim(0, 100)\n",
"\n",
"title_text = f'{len(model2vis)} Representative VLMs on MMBench Test.'\n",
"plt.figtext(.7, .95, title_text, fontproperties=font18, ha='center')\n",
"plt.show()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"labels = ['SEEDBench_IMG', 'CCBench', 'MMBench_TEST_EN', 'MMBench_TEST_CN', 'MME', 'MMVet', 'MMMU_VAL', 'MathVista', 'HallusionBench', 'LLaVABench']\n",
"num_vars = len(labels)\n",
"\n",
"raw_data = [{k: data[m][k]['Overall'] for k in labels} for m in model2vis]\n",
"data_list, range_map = pre_normalize(raw_data, labels)\n",
"\n",
"alpha = 0.25\n",
"angles = np.linspace(0, 2 * np.pi, num_vars, endpoint=False).tolist()\n",
"angles_deg = np.linspace(0, 360, num_vars, endpoint=False).tolist()\n",
"fig, ax_base = plt.subplots(nrows=1, ncols=1, figsize=(10, 10), subplot_kw=dict(polar=True))\n",
"\n",
"for i in range(len(data_list)):\n",
" item = data_list[i]\n",
" model_name = model2vis[i]\n",
" color = colors[i]\n",
" tmp_angles = angles[:] + [angles[0]]\n",
" tmp_values = [item[lb] for lb in labels] + [item[labels[0]]]\n",
" ax_base.plot(tmp_angles, tmp_values, color=color, linewidth=1, linestyle='solid', label=model_name)\n",
" ax_base.fill(tmp_angles, tmp_values, color=color, alpha=alpha)\n",
" \n",
"angles += [angles[0]]\n",
"ax_base.set_ylim(0, 100)\n",
"ax_base.set_yticks([40, 60, 80, 100])\n",
"ax_base.set_yticklabels([''] * 4)\n",
"\n",
"ax_base.tick_params(pad=15)\n",
"ax_base.set_xticks(angles[:-1])\n",
"ax_base.set_xticklabels(labels, fontproperties=font18)\n",
"\n",
"dataset_map = {\n",
" 'MMBench_TEST_EN': 'MMBench (Test)', \n",
" 'MMBench_TEST_CN': 'MMBenchCN (Test)', \n",
" 'MathVista': 'MathVista (TestMini)', \n",
" 'MMMU_VAL': 'MMMU (Val)'\n",
"}\n",
"for i, label in enumerate(ax_base.get_xticklabels()):\n",
" x,y = label.get_position()\n",
" text = label.get_text()\n",
" text = dataset_map[text] if text in dataset_map else text\n",
" lab = ax_base.text(x, y, text, transform=label.get_transform(),\n",
" ha=label.get_ha(), va=label.get_va(), font_properties=font15)\n",
" lab.set_rotation(360 / num_vars * i + 270)\n",
" labels.append(lab)\n",
"ax_base.set_xticklabels([])\n",
"\n",
"leg = ax_base.legend(loc='center right', bbox_to_anchor=(1.6, 0.5), prop=font15, ncol=1, frameon=True, labelspacing=1.2)\n",
"for line in leg.get_lines():\n",
" line.set_linewidth(2.5)\n",
"\n",
"cx, cy, sz = 0.44, 0.435, 0.34\n",
"axes = [fig.add_axes([cx - sz, cy - sz, cx + sz, cy + sz], projection='polar', label='axes%d' % i) for i in range(num_vars)]\n",
" \n",
"for ax, angle, label in zip(axes, angles_deg, labels):\n",
" ax.patch.set_visible(False)\n",
" ax.grid(False)\n",
" ax.xaxis.set_visible(False)\n",
" cur_range = range_map[label]\n",
" label_list = [cur_range[0] + (cur_range[1] - cur_range[0]) / 5 * i for i in range(2, 6)]\n",
" label_list = [f'{x:.1f}' for x in label_list]\n",
" ax.set_rgrids(range(40, 120, 20), angle=angle, labels=label_list, font_properties=font12)\n",
" ax.spines['polar'].set_visible(False)\n",
" ax.set_ylim(0, 100)\n",
"\n",
"title_text = f'{len(model2vis)} Representative VLMs on {num_vars} Benchmarks in OpenCompass Multi-Modal Leaderboard.'\n",
"plt.figtext(.7, .95, title_text, fontproperties=font18, ha='center')\n",
"plt.show()"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "base",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.5"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
import re
import sys
from os.path import exists
from setuptools import find_packages, setup
def parse_requirements(fname='requirements.txt', with_version=True):
"""Parse the package dependencies listed in a requirements file but strips
specific versioning information.
Args:
fname (str): path to requirements file
with_version (bool, default=True): if True include version specs
Returns:
List[str]: list of requirements items
CommandLine:
python -c "import setup; print(setup.parse_requirements())"
"""
require_fpath = fname
def parse_line(line):
"""Parse information from a line in a requirements text file."""
if line.startswith('-r '):
# Allow specifying requirements in other files
target = line.split(' ')[1]
for info in parse_require_file(target):
yield info
else:
info = {'line': line}
if line.startswith('-e '):
info['package'] = line.split('#egg=')[1]
elif '@git+' in line:
info['package'] = line
else:
# Remove versioning from the package
pat = '(' + '|'.join(['>=', '==', '>']) + ')'
parts = re.split(pat, line, maxsplit=1)
parts = [p.strip() for p in parts]
info['package'] = parts[0]
if len(parts) > 1:
op, rest = parts[1:]
if ';' in rest:
# Handle platform specific dependencies
# http://setuptools.readthedocs.io/en/latest/setuptools.html#declaring-platform-specific-dependencies
version, platform_deps = map(str.strip,
rest.split(';'))
info['platform_deps'] = platform_deps
else:
version = rest # NOQA
info['version'] = (op, version)
yield info
def parse_require_file(fpath):
with open(fpath, 'r') as f:
for line in f.readlines():
line = line.strip()
if line and not line.startswith('#'):
for info in parse_line(line):
yield info
def gen_packages_items():
if exists(require_fpath):
for info in parse_require_file(require_fpath):
parts = [info['package']]
if with_version and 'version' in info:
parts.extend(info['version'])
if not sys.version.startswith('3.4'):
# apparently package_deps are broken in 3.4
platform_deps = info.get('platform_deps')
if platform_deps is not None:
parts.append(';' + platform_deps)
item = ''.join(parts)
yield item
packages = list(gen_packages_items())
return packages
with open('README.md') as f:
readme = f.read()
def do_setup():
setup(
name='vlmeval',
version='0.1.0',
description='OpenCompass VLM Evaluation Kit',
author='Haodong Duan',
author_email='dhd.efz@gmail.com',
maintainer='Haodong Duan',
maintainer_email='dhd.efz@gmail.com',
long_description=readme,
long_description_content_type='text/markdown',
cmdclass={},
install_requires=parse_requirements('requirements.txt'),
setup_requires=[],
python_requires='>=3.7.0',
packages=find_packages(exclude=[
'test*',
'paper_test*',
]),
keywords=['AI', 'NLP', 'in-context learning'],
entry_points={
'console_scripts': ['vlmutil = vlmeval:cli']
},
classifiers=[
'Programming Language :: Python :: 3.7',
'Programming Language :: Python :: 3.8',
'Programming Language :: Python :: 3.9',
'Programming Language :: Python :: 3.10',
'Intended Audience :: Developers',
'Intended Audience :: Education',
'Intended Audience :: Science/Research',
])
if __name__ == '__main__':
do_setup()
try:
import torch
except ImportError:
pass
from .smp import *
from .api import *
from .dataset import *
from .utils import *
from .vlm import *
from .config import *
from .tools import cli
load_env()
__version__ = '0.2rc1'
from .gpt import OpenAIWrapper, GPT4V
from .hf_chat_model import HFChatModel
from .gemini import GeminiWrapper, GeminiProVision
from .qwen_vl_api import QwenVLWrapper, QwenVLAPI, Qwen2VLAPI
from .qwen_api import QwenAPI
from .claude import Claude_Wrapper, Claude3V
from .reka import Reka
from .glm_vision import GLMVisionAPI
from .cloudwalk import CWWrapper
from .sensechat_vision import SenseChatVisionAPI
from .hunyuan import HunyuanVision
from .bluelm_v_api import BlueLMWrapper, BlueLM_V_API
__all__ = [
'OpenAIWrapper', 'HFChatModel', 'GeminiWrapper', 'GPT4V',
'GeminiProVision', 'QwenVLWrapper', 'QwenVLAPI', 'QwenAPI',
'Claude3V', 'Claude_Wrapper', 'Reka', 'GLMVisionAPI',
'CWWrapper', 'SenseChatVisionAPI', 'HunyuanVision', 'Qwen2VLAPI',
'BlueLMWrapper', 'BlueLM_V_API',
]
import time
import random as rd
from abc import abstractmethod
import os.path as osp
import copy as cp
from ..smp import get_logger, parse_file, concat_images_vlmeval
class BaseAPI:
allowed_types = ['text', 'image']
INTERLEAVE = True
INSTALL_REQ = False
def __init__(self,
retry=10,
wait=3,
system_prompt=None,
verbose=True,
fail_msg='Failed to obtain answer via API.',
**kwargs):
"""Base Class for all APIs.
Args:
retry (int, optional): The retry times for `generate_inner`. Defaults to 10.
wait (int, optional): The wait time after each failed retry of `generate_inner`. Defaults to 3.
system_prompt (str, optional): Defaults to None.
verbose (bool, optional): Defaults to True.
fail_msg (str, optional): The message to return when failed to obtain answer.
Defaults to 'Failed to obtain answer via API.'.
**kwargs: Other kwargs for `generate_inner`.
"""
self.wait = wait
self.retry = retry
self.system_prompt = system_prompt
self.verbose = verbose
self.fail_msg = fail_msg
self.logger = get_logger('ChatAPI')
if len(kwargs):
self.logger.info(f'BaseAPI received the following kwargs: {kwargs}')
self.logger.info('Will try to use them as kwargs for `generate`. ')
self.default_kwargs = kwargs
@abstractmethod
def generate_inner(self, inputs, **kwargs):
"""The inner function to generate the answer.
Returns:
tuple(int, str, str): ret_code, response, log
"""
self.logger.warning('For APIBase, generate_inner is an abstract method. ')
assert 0, 'generate_inner not defined'
ret_code, answer, log = None, None, None
# if ret_code is 0, means succeed
return ret_code, answer, log
def working(self):
"""If the API model is working, return True, else return False.
Returns:
bool: If the API model is working, return True, else return False.
"""
self.old_timeout = None
if hasattr(self, 'timeout'):
self.old_timeout = self.timeout
self.timeout = 120
retry = 5
while retry > 0:
ret = self.generate('hello')
if ret is not None and ret != '' and self.fail_msg not in ret:
if self.old_timeout is not None:
self.timeout = self.old_timeout
return True
retry -= 1
if self.old_timeout is not None:
self.timeout = self.old_timeout
return False
def check_content(self, msgs):
"""Check the content type of the input. Four types are allowed: str, dict, liststr, listdict.
Args:
msgs: Raw input messages.
Returns:
str: The message type.
"""
if isinstance(msgs, str):
return 'str'
if isinstance(msgs, dict):
return 'dict'
if isinstance(msgs, list):
types = [self.check_content(m) for m in msgs]
if all(t == 'str' for t in types):
return 'liststr'
if all(t == 'dict' for t in types):
return 'listdict'
return 'unknown'
def preproc_content(self, inputs):
"""Convert the raw input messages to a list of dicts.
Args:
inputs: raw input messages.
Returns:
list(dict): The preprocessed input messages. Will return None if failed to preprocess the input.
"""
if self.check_content(inputs) == 'str':
return [dict(type='text', value=inputs)]
elif self.check_content(inputs) == 'dict':
assert 'type' in inputs and 'value' in inputs
return [inputs]
elif self.check_content(inputs) == 'liststr':
res = []
for s in inputs:
mime, pth = parse_file(s)
if mime is None or mime == 'unknown':
res.append(dict(type='text', value=s))
else:
res.append(dict(type=mime.split('/')[0], value=pth))
return res
elif self.check_content(inputs) == 'listdict':
for item in inputs:
assert 'type' in item and 'value' in item
mime, s = parse_file(item['value'])
if mime is None:
assert item['type'] == 'text', item['value']
else:
assert mime.split('/')[0] == item['type']
item['value'] = s
return inputs
else:
return None
# May exceed the context window size, so try with different numbers of turns.
def chat_inner(self, inputs, **kwargs):
_ = kwargs.pop('dataset', None)
while len(inputs):
try:
return self.generate_inner(inputs, **kwargs)
except:
inputs = inputs[1:]
while len(inputs) and inputs[0]['role'] != 'user':
inputs = inputs[1:]
continue
return -1, self.fail_msg + ': ' + 'Failed with all possible conversation turns.', None
def chat(self, messages, **kwargs1):
"""The main function for multi-turn chatting. Will call `chat_inner` with the preprocessed input messages."""
assert hasattr(self, 'chat_inner'), 'The API model should have the `chat_inner` method. '
for msg in messages:
assert isinstance(msg, dict) and 'role' in msg and 'content' in msg, msg
assert self.check_content(msg['content']) in ['str', 'dict', 'liststr', 'listdict'], msg
msg['content'] = self.preproc_content(msg['content'])
# merge kwargs
kwargs = cp.deepcopy(self.default_kwargs)
kwargs.update(kwargs1)
answer = None
# a very small random delay [0s - 0.5s]
T = rd.random() * 0.5
time.sleep(T)
assert messages[-1]['role'] == 'user'
for i in range(self.retry):
try:
ret_code, answer, log = self.chat_inner(messages, **kwargs)
if ret_code == 0 and self.fail_msg not in answer and answer != '':
if self.verbose:
print(answer)
return answer
elif self.verbose:
if not isinstance(log, str):
try:
log = log.text
except:
self.logger.warning(f'Failed to parse {log} as an http response. ')
self.logger.info(f'RetCode: {ret_code}\nAnswer: {answer}\nLog: {log}')
except Exception as err:
if self.verbose:
self.logger.error(f'An error occurred during try {i}:')
self.logger.error(err)
# delay before each retry
T = rd.random() * self.wait * 2
time.sleep(T)
return self.fail_msg if answer in ['', None] else answer
def generate(self, message, **kwargs1):
"""The main function to generate the answer. Will call `generate_inner` with the preprocessed input messages.
Args:
message: raw input messages.
Returns:
str: The generated answer, or the fail message if no answer could be obtained.
"""
assert self.check_content(message) in ['str', 'dict', 'liststr', 'listdict'], f'Invalid input type: {message}'
message = self.preproc_content(message)
assert message is not None and self.check_content(message) == 'listdict'
for item in message:
assert item['type'] in self.allowed_types, f'Invalid input type: {item["type"]}'
# merge kwargs
kwargs = cp.deepcopy(self.default_kwargs)
kwargs.update(kwargs1)
answer = None
# a very small random delay [0s - 0.5s]
T = rd.random() * 0.5
time.sleep(T)
for i in range(self.retry):
try:
ret_code, answer, log = self.generate_inner(message, **kwargs)
if ret_code == 0 and self.fail_msg not in answer and answer != '':
if self.verbose:
print(answer)
return answer
elif self.verbose:
if not isinstance(log, str):
try:
log = log.text
except:
self.logger.warning(f'Failed to parse {log} as an http response. ')
self.logger.info(f'RetCode: {ret_code}\nAnswer: {answer}\nLog: {log}')
except Exception as err:
if self.verbose:
self.logger.error(f'An error occurred during try {i}:')
self.logger.error(err)
# delay before each retry
T = rd.random() * self.wait * 2
time.sleep(T)
return self.fail_msg if answer in ['', None] else answer
def message_to_promptimg(self, message, dataset=None):
assert not self.INTERLEAVE
model_name = self.__class__.__name__
import warnings
warnings.warn(
f'Model {model_name} does not support interleaved input. '
'Will use the first image and aggregated texts as prompt. ')
num_images = len([x for x in message if x['type'] == 'image'])
if num_images == 0:
prompt = '\n'.join([x['value'] for x in message if x['type'] == 'text'])
image = None
elif num_images == 1:
prompt = '\n'.join([x['value'] for x in message if x['type'] == 'text'])
image = [x['value'] for x in message if x['type'] == 'image'][0]
else:
prompt = '\n'.join([x['value'] if x['type'] == 'text' else '<image>' for x in message])
if dataset == 'BLINK':
image = concat_images_vlmeval(
[x['value'] for x in message if x['type'] == 'image'],
target_size=512)
else:
image = [x['value'] for x in message if x['type'] == 'image'][0]
return prompt, image