#!/usr/bin/env bash
# Copy *.md files from ../en/ if they do not yet have a Chinese translation
for filename in $(find ../en/ -name '*.md' -printf "%P\n");
do
mkdir -p $(dirname $filename)
cp -n ../en/$filename ./$filename
done
[html writers]
table_style: colwidths-auto
# Quickstart
Before running the evaluation script, you need to **configure** the VLMs and set the model paths correctly. Then you can use the script `run.py` to run inference and evaluation on multiple VLMs and benchmarks.
## Step 0. Installation and Setup of Essential Keys
**Installation**
```bash
git clone https://github.com/open-compass/VLMEvalKit.git
cd VLMEvalKit
pip install -e .
```
**Setup Keys**
To run inference with API models (such as GPT-4v, Gemini-Pro-V, etc.), or to use an LLM API as the **judge or choice extractor**, you need to set up the API keys first. If the keys are set, VLMEvalKit will use a judge LLM to extract the answer from the model output; otherwise, it falls back to the **exact matching** mode (searching the output string for "Yes", "No", "A", "B", "C"...). **Exact matching can only be applied to Yes-or-No tasks and multiple-choice tasks**; the sketch below gives a rough idea of what that means.
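The following is only a rough, simplified illustration of the exact-matching fallback described above (it is **not** VLMEvalKit's actual implementation): the prediction string is scanned for a yes/no keyword or an option letter, and anything else would require a judge LLM.
```python
import re

def exact_match(prediction, choices=('A', 'B', 'C', 'D')):
    """Toy sketch: look for Yes/No or a bare option letter in the prediction."""
    text = prediction.strip()
    if re.search(r'\byes\b', text, re.IGNORECASE):
        return 'Yes'
    if re.search(r'\bno\b', text, re.IGNORECASE):
        return 'No'
    for choice in choices:
        if re.search(rf'\b{choice}\b', text):
            return choice
    return None  # unmatched -> a judge LLM would be needed

print(exact_match('The answer is B.'))  # -> 'B'
```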
- You can place the required keys in `$VLMEvalKit/.env`, or set them directly as environment variables. If you choose to create a `.env` file, its content will look like:
```bash
# The .env file, place it under $VLMEvalKit
# API keys of proprietary VLMs
# QwenVL APIs
DASHSCOPE_API_KEY=
# Gemini w. Google Cloud Backends
GOOGLE_API_KEY=
# OpenAI API
OPENAI_API_KEY=
OPENAI_API_BASE=
# StepAI API
STEPAI_API_KEY=
# REKA API
REKA_API_KEY=
# GLMV API
GLMV_API_KEY=
# CongRong API
CW_API_BASE=
CW_API_KEY=
# SenseChat-V API
SENSECHAT_AK=
SENSECHAT_SK=
# Hunyuan-Vision API
HUNYUAN_SECRET_KEY=
HUNYUAN_SECRET_ID=
# You can also set a proxy for evaluation; API calls made during the evaluation stage will go through this proxy
EVAL_PROXY=
```
- Fill in the keys for the APIs you plan to use; they will be loaded automatically during inference and evaluation (see the sanity-check sketch below).
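As a quick sanity check, you can verify that the keys are visible from Python. This is a minimal sketch using `python-dotenv` (already listed in `requirements.txt`); VLMEvalKit itself performs an equivalent `load_env()` step when it is imported, so this is only for manual verification.
```python
import os
from dotenv import load_dotenv  # provided by the python-dotenv package

# Load $VLMEvalKit/.env into the process environment.
load_dotenv('.env')

for key in ('OPENAI_API_KEY', 'OPENAI_API_BASE', 'GOOGLE_API_KEY'):
    print(f'{key} is set: {bool(os.environ.get(key))}')
```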
## Step 1. Configuration
**VLM Configuration**: All VLMs are configured in `vlmeval/config.py`. For some VLMs, you need to configure the code root (MiniGPT-4, PandaGPT, etc.) or the model weight root (LLaVA-v1-7B, etc.) before running the evaluation. During evaluation, you should use the model name specified in `supported_VLM` in `vlmeval/config.py` to select the VLM. For MiniGPT-4 and InstructBLIP, you also need to modify the config files in `vlmeval/vlm/misc` to configure the LLM path and ckpt path.
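To check which model names are available, you can inspect `supported_VLM` directly. The sketch below assumes a completed installation; `supported_VLM` maps each model name to a constructor, and instantiating an entry may trigger weight downloads, so the instantiation line is left commented out.
```python
from vlmeval.config import supported_VLM

# List the first few model names that can be passed to `--model`.
print(list(supported_VLM)[:10])

# Instantiating an entry builds the corresponding VLM (may download weights):
# model = supported_VLM['qwen_chat']()
```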
**The following VLMs require additional configuration steps:**
**Code preparation & installation**: InstructBLIP ([LAVIS](https://github.com/salesforce/LAVIS)), LLaVA ([LLaVA](https://github.com/haotian-liu/LLaVA)), MiniGPT-4 ([MiniGPT-4](https://github.com/Vision-CAIR/MiniGPT-4)), mPLUG-Owl2 ([mPLUG-Owl2](https://github.com/X-PLUG/mPLUG-Owl/tree/main/mPLUG-Owl2)), OpenFlamingo-v2 ([OpenFlamingo](https://github.com/mlfoundations/open_flamingo)), PandaGPT-13B ([PandaGPT](https://github.com/yxuansu/PandaGPT)), TransCore-M ([TransCore-M](https://github.com/PCIResearch/TransCore-M)).
**Manual weight preparation & configuration**: InstructBLIP, LLaVA-v1-7B, MiniGPT-4, PandaGPT-13B.
## Step 2. Evaluation
We use `run.py` for evaluation. You can run it as `$VLMEvalKit/run.py`, or create a soft link to the script (so that you can use it from anywhere):
**Arguments**
- `--data (list[str])`: Set the dataset names supported in VLMEvalKit (defined in `vlmeval/utils/dataset_config.py`).
- `--model (list[str])`: Set the VLM names supported in VLMEvalKit (defined in `supported_VLM` in `vlmeval/config.py`).
- `--mode (str, default 'all', choices ['all', 'infer'])`: When `mode` is set to `all`, both inference and evaluation are performed; when set to `infer`, only inference is performed.
- `--nproc (int, default 4)`: The number of threads for API calling.
- `--work-dir (str, default './outputs')`: The directory to store the evaluation results.
- `--nframe (int, default 8)`: The number of frames sampled from each video, only applicable to video multi-modal benchmarks.
- `--pack (bool, store_true)`: A video may be associated with multiple questions; if `pack==True`, all of them will be asked in a single query.
**Command for evaluating image multi-modal benchmarks**
You can run the script with either `python` or `torchrun`:
```bash
# When running with `python`, only one VLM instance is instantiated, and it may use multiple GPUs.
# This is recommended for evaluating very large VLMs (e.g., IDEFICS-80B-Instruct).
# Inference and evaluation with IDEFICS-80B-Instruct on MMBench_DEV_EN, MME, and SEEDBench_IMG
python run.py --data MMBench_DEV_EN MME SEEDBench_IMG --model idefics_80b_instruct --verbose
# Inference only with IDEFICS-80B-Instruct on MMBench_DEV_EN, MME, and SEEDBench_IMG
python run.py --data MMBench_DEV_EN MME SEEDBench_IMG --model idefics_80b_instruct --verbose --mode infer
# When running with `torchrun`, one VLM instance is instantiated per GPU, which can speed up inference.
# However, this only works for VLMs that consume a small amount of GPU memory.
# IDEFICS-9B-Instruct, Qwen-VL-Chat, and mPLUG-Owl2 on MMBench_DEV_EN, MME, and SEEDBench_IMG. Inference and evaluation on a node with 8 GPUs.
torchrun --nproc-per-node=8 run.py --data MMBench_DEV_EN MME SEEDBench_IMG --model idefics_9b_instruct qwen_chat mPLUG-Owl2 --verbose
# Qwen-VL-Chat on MME. Inference and evaluation on a node with 2 GPUs.
torchrun --nproc-per-node=2 run.py --data MME --model qwen_chat --verbose
```
**Command for evaluating video multi-modal benchmarks**
```bash
# When running with `python`, only one VLM instance is instantiated, and it may use multiple GPUs.
# This is recommended for evaluating very large VLMs (e.g., IDEFICS-80B-Instruct).
# Evaluate IDEFICS2-8B on MMBench-Video, sampling 8 frames per video as input, without pack mode
torchrun --nproc-per-node=8 run.py --data MMBench-Video --model idefics2_8b --nframe 8
# Evaluate GPT-4o (an API model) on MMBench-Video, sampling 16 frames per video as input, with pack mode
python run.py --data MMBench-Video --model GPT4o --nframe 16 --pack
```
The evaluation results will be printed as logs. Besides, **result files** will also be generated in the directory `$YOUR_WORKING_DIRECTORY/{model_name}`. Files ending with `.csv` contain the evaluation metrics; a quick way to inspect them is shown below.
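For example, a minimal sketch for inspecting one of the generated metric files with pandas (the file name below is hypothetical; the actual name depends on the model, dataset, and work directory you used):
```python
import pandas as pd

# Hypothetical accuracy file produced by evaluating qwen_chat on MMBench_DEV_EN
# with the default work directory './outputs'.
metrics = pd.read_csv('./outputs/qwen_chat/qwen_chat_MMBench_DEV_EN_acc.csv')
print(metrics)
```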
### Deploy a local language model as the judge / choice extractor
The default setting described above uses OpenAI's GPT as the judge LLM. You can also deploy a local judge LLM with [LMDeploy](https://github.com/InternLM/lmdeploy).
First install the packages:
```
pip install lmdeploy openai
```
Then you can deploy a local judge LLM with a single command. LMDeploy will automatically download the model from Huggingface. Assume we use internlm2-chat-1_8b as the judge, port 23333, and key sk-123456 (the key must start with "sk-", followed by any number of digits):
```
lmdeploy serve api_server internlm/internlm2-chat-1_8b --server-port 23333
```
Get the model name registered by LMDeploy with the following Python code:
```
from openai import OpenAI
client = OpenAI(
api_key='sk-123456',
base_url="http://0.0.0.0:23333/v1"
)
model_name = client.models.list().data[0].id
```
Set the corresponding environment variables to tell VLMEvalKit to use the local judge LLM. As mentioned above, you can also set them in the `$VLMEvalKit/.env` file:
```
OPENAI_API_KEY=sk-123456
OPENAI_API_BASE=http://0.0.0.0:23333/v1/chat/completions
LOCAL_LLM=<model_name you get>
```
Finally, you can run the commands in Step 2 to evaluate your VLM with the local judge LLM.
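Before launching a full evaluation, it may be worth a quick check that the local judge actually responds. Below is a minimal sketch using the `openai` client against the LMDeploy server started above; the model name is the one retrieved in the previous step.
```python
from openai import OpenAI

client = OpenAI(api_key='sk-123456', base_url='http://0.0.0.0:23333/v1')
model_name = client.models.list().data[0].id

# Send a trivial chat request to confirm the judge LLM is reachable.
resp = client.chat.completions.create(
    model=model_name,
    messages=[{'role': 'user', 'content': 'Reply with a single word: OK'}],
)
print(resp.choices[0].message.content)
```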
**Note:**
- If you want to deploy the judge LLM on a dedicated GPU and evaluate your VLM on the other GPUs due to limited GPU memory, you can restrict the visible devices with `CUDA_VISIBLE_DEVICES=x`, for example:
```
CUDA_VISIBLE_DEVICES=0 lmdeploy serve api_server internlm/internlm2-chat-1_8b --server-port 23333
CUDA_VISIBLE_DEVICES=1,2,3 torchrun --nproc-per-node=3 run.py --data HallusionBench --model qwen_chat --verbose
```
- If the local judge LLM is not good enough at following instructions, the evaluation may fail. Please report such failures via issues.
- The judge LLM can be deployed in other ways, for example using a private LLM (instead of one from HuggingFace) or a quantized LLM. Please refer to the [LMDeploy doc](https://lmdeploy.readthedocs.io/en/latest/serving/api_server.html). Any other framework that supports the OpenAI API can also be used.
Welcome to the VLMEvalKit Chinese Tutorial!
=============================================
Getting Started with VLMEvalKit
--------------------------------
To help users get started quickly, we recommend the following workflow:
- If you want to use VLMEvalKit, we recommend reading the `Get Started`_ section first to set up the environment and run a mini experiment to get familiar with the workflow.
- If you want to customize more modules, such as adding new datasets and models, see the `Advanced Tutorials`_.
Your PRs and Issues to further improve VLMEvalKit are always welcome!
.. _Get Started:
.. toctree::
:maxdepth: 1
:caption: Get Started
get_started/Quickstart.md
.. .. _Tutorials:
.. .. toctree::
.. :maxdepth: 1
.. :caption: Tutorials
.. user_guides/framework_overview.md
.. _Advanced Tutorials:
.. toctree::
:maxdepth: 1
:caption: Advanced Tutorials
advanced_guides/Development.md
.. .. _Notes:
.. .. toctree::
.. :maxdepth: 1
.. :caption: Notes
.. notes/contribution_guide.md
Indexes and Tables
==================
* :ref:`genindex`
* :ref:`search`
decord
gradio
huggingface_hub
imageio
matplotlib
moviepy
numpy>=1.23.4
omegaconf
openai==1.3.5
opencv-python>=4.4.0.46
openpyxl
pandas
peft
pillow
portalocker
protobuf
python-dotenv
requests
rich
sentencepiece
setuptools
sty
tabulate
tiktoken
timeout-decorator
torch>=2.0.1
tqdm
transformers
typing_extensions==4.7.1
validators
xlsxwriter
docutils==0.18.1
modelindex
myst-parser
-e git+https://github.com/open-compass/pytorch_sphinx_theme.git#egg=pytorch_sphinx_theme
sphinx==6.1.3
sphinx-copybutton
sphinx-design
sphinx-notfound-page
sphinx-tabs
sphinxcontrib-jquery
tabulate
import torch
import torch.distributed as dist
from vlmeval.config import supported_VLM
from vlmeval.dataset import build_dataset
from vlmeval.inference import infer_data_job
from vlmeval.inference_video import infer_data_job_video
from vlmeval.inference_mt import infer_data_job_mt
from vlmeval.smp import *
from vlmeval.utils.result_transfer import MMMU_result_transfer, MMTBench_result_transfer
def parse_args():
parser = argparse.ArgumentParser()
# Essential Args
parser.add_argument('--data', type=str, nargs='+', required=True)
parser.add_argument('--model', type=str, nargs='+', required=True)
# Args that only apply to Video Dataset
parser.add_argument('--nframe', type=int, default=8)
parser.add_argument('--pack', action='store_true')
parser.add_argument('--use-subtitle', action='store_true')
# Work Dir
parser.add_argument('--work-dir', type=str, default='./outputs', help='select the output directory')
# Infer + Eval or Infer Only
parser.add_argument('--mode', type=str, default='all', choices=['all', 'infer'])
# API Kwargs, Apply to API VLMs and Judge API LLMs
parser.add_argument('--nproc', type=int, default=4, help='Parallel API calling')
parser.add_argument('--retry', type=int, default=None, help='retry numbers for API VLMs')
# Explicitly Set the Judge Model
parser.add_argument('--judge', type=str, default=None)
# Logging Utils
parser.add_argument('--verbose', action='store_true')
# Configuration for Resume
# Ignore: will not rerun failed VLM inference
parser.add_argument('--ignore', action='store_true', help='Ignore failed indices. ')
# Rerun: will remove all evaluation temp files
parser.add_argument('--rerun', action='store_true')
args = parser.parse_args()
return args
def main():
logger = get_logger('RUN')
args = parse_args()
assert len(args.data), '--data should be a list of data files'
if 'MMEVAL_ROOT' in os.environ:
args.work_dir = os.environ['MMEVAL_ROOT']
if args.retry is not None:
for k, v in supported_VLM.items():
if hasattr(v, 'keywords') and 'retry' in v.keywords:
v.keywords['retry'] = args.retry
supported_VLM[k] = v
if hasattr(v, 'keywords') and 'verbose' in v.keywords:
v.keywords['verbose'] = args.verbose
supported_VLM[k] = v
rank, world_size = get_rank_and_world_size()
if world_size > 1:
local_rank = os.environ.get('LOCAL_RANK', 0)
torch.cuda.set_device(int(local_rank))
dist.init_process_group(backend='nccl', timeout=datetime.timedelta(seconds=10800))
for _, model_name in enumerate(args.model):
model = None
pred_root = osp.join(args.work_dir, model_name)
os.makedirs(pred_root, exist_ok=True)
for _, dataset_name in enumerate(args.data):
try:
dataset_kwargs = {}
if dataset_name in ['MMLongBench_DOC', 'DUDE', 'DUDE_MINI', 'SLIDEVQA', 'SLIDEVQA_MINI']:
dataset_kwargs['model'] = model_name
if dataset_name == 'MMBench-Video':
dataset_kwargs['pack'] = args.pack
if dataset_name == 'Video-MME':
dataset_kwargs['use_subtitle'] = args.use_subtitle
# If distributed, first build the dataset on the main process for doing preparation works
if world_size > 1:
dataset = build_dataset(dataset_name, **dataset_kwargs) if rank == 0 else None
dist.barrier()
dataset_list = [dataset]
dist.broadcast_object_list(dataset_list, src=0)
dataset = dataset_list[0]
else:
dataset = build_dataset(dataset_name, **dataset_kwargs)
if dataset is None:
logger.error(f'Dataset {dataset_name} is not valid, will be skipped. ')
continue
result_file = f'{pred_root}/{model_name}_{dataset_name}.xlsx'
if dataset_name in ['MMBench-Video']:
packstr = 'pack' if args.pack else 'nopack'
result_file = f'{pred_root}/{model_name}_{dataset_name}_{args.nframe}frame_{packstr}.xlsx'
elif dataset.MODALITY == 'VIDEO':
if args.pack:
logger.info(f'{dataset_name} does not support Pack Mode, switching to unpack')
args.pack = False
packstr = 'pack' if args.pack else 'nopack'
result_file = f'{pred_root}/{model_name}_{dataset_name}_{args.nframe}frame_{packstr}.xlsx'
if dataset_name in ['Video-MME']:
subtitlestr = 'subs' if args.use_subtitle else 'nosubs'
result_file = result_file.replace('.xlsx', f'_{subtitlestr}.xlsx')
if dataset.TYPE == 'MT':
result_file = result_file.replace('.xlsx', '.tsv')
if osp.exists(result_file) and args.rerun:
for keyword in ['openai', 'gpt', 'auxmatch']:
os.system(f'rm {pred_root}/{model_name}_{dataset_name}_{keyword}*')
if model is None:
model = model_name # which is only a name
# Perform the Inference
if dataset.MODALITY == 'VIDEO':
model = infer_data_job_video(
model,
work_dir=pred_root,
model_name=model_name,
dataset=dataset,
nframe=args.nframe,
pack=args.pack,
verbose=args.verbose,
subtitle=args.use_subtitle,
api_nproc=args.nproc)
elif dataset.TYPE == 'MT':
model = infer_data_job_mt(
model,
work_dir=pred_root,
model_name=model_name,
dataset=dataset,
verbose=args.verbose,
api_nproc=args.nproc,
ignore_failed=args.ignore)
else:
model = infer_data_job(
model,
work_dir=pred_root,
model_name=model_name,
dataset=dataset,
verbose=args.verbose,
api_nproc=args.nproc,
ignore_failed=args.ignore)
# Set the judge kwargs first before evaluation or dumping
judge_kwargs = {
'nproc': args.nproc,
'verbose': args.verbose,
}
if args.retry is not None:
judge_kwargs['retry'] = args.retry
if args.judge is not None:
judge_kwargs['model'] = args.judge
else:
if dataset.TYPE in ['MCQ', 'Y/N'] or listinstr(['MathVerse'], dataset_name):
judge_kwargs['model'] = 'chatgpt-0125'
elif listinstr(['MMVet', 'MathVista', 'LLaVABench', 'MMBench-Video', 'MathVision'],
dataset_name):
judge_kwargs['model'] = 'gpt-4-turbo'
elif listinstr(['MMLongBench', 'MMDU', 'DUDE', 'DUDE_MINI', 'SLIDEVQA', 'SLIDEVQA_MINI'],
dataset_name):
judge_kwargs['model'] = 'gpt-4o'
if 'OPENAI_API_KEY_JUDGE' in os.environ and len(os.environ['OPENAI_API_KEY_JUDGE']):
judge_kwargs['key'] = os.environ['OPENAI_API_KEY_JUDGE']
if 'OPENAI_API_BASE_JUDGE' in os.environ and len(os.environ['OPENAI_API_BASE_JUDGE']):
judge_kwargs['api_base'] = os.environ['OPENAI_API_BASE_JUDGE']
if rank == 0:
if dataset_name in ['MMMU_TEST']:
result_json = MMMU_result_transfer(result_file)
logger.info(f'Transfer MMMU_TEST result to json for official evaluation, '
f'json file saved in {result_json}') # noqa: E501
continue
elif 'MMT-Bench_ALL' in dataset_name:
submission_file = MMTBench_result_transfer(result_file, **judge_kwargs)
logger.info(f'Extract options from prediction of MMT-Bench FULL split for official evaluation '
f'(https://eval.ai/web/challenges/challenge-page/2328/overview), '
f'submission file saved in {submission_file}') # noqa: E501
continue
elif 'MLLMGuard_DS' in dataset_name:
logger.info('The evaluation of MLLMGuard_DS is not supported yet. ') # noqa: E501
continue
elif 'AesBench_TEST' == dataset_name:
logger.info(f'The results are saved in {result_file}. '
f'Please send it to the AesBench Team via huangyipo@hotmail.com.') # noqa: E501
continue
if dataset_name in [
'MMBench_TEST_CN', 'MMBench_TEST_EN', 'MMBench', 'MMBench_CN',
'MMBench_TEST_CN_V11', 'MMBench_TEST_EN_V11', 'MMBench_V11', 'MMBench_CN_V11'
]:
if not MMBenchOfficialServer(dataset_name):
logger.error(
f'Can not evaluate {dataset_name} on non-official servers, '
'will skip the evaluation. '
)
continue
eval_proxy = os.environ.get('EVAL_PROXY', None)
old_proxy = os.environ.get('HTTP_PROXY', '')
if rank == 0 and args.mode == 'all':
if eval_proxy is not None:
proxy_set(eval_proxy)
eval_results = dataset.evaluate(result_file, **judge_kwargs)
if eval_results is not None:
assert isinstance(eval_results, dict) or isinstance(eval_results, pd.DataFrame)
logger.info(f'The evaluation of model {model_name} x dataset {dataset_name} has finished! ')
logger.info('Evaluation Results:')
if isinstance(eval_results, dict):
logger.info('\n' + json.dumps(eval_results, indent=4))
elif isinstance(eval_results, pd.DataFrame):
if len(eval_results) < len(eval_results.columns):
eval_results = eval_results.T
logger.info('\n' + tabulate(eval_results))
if eval_proxy is not None:
proxy_set(old_proxy)
except Exception as e:
logger.exception(f'Model {model_name} x Dataset {dataset_name} combination failed: {e}, '
'skipping this combination.')
continue
if world_size > 1:
dist.barrier()
if __name__ == '__main__':
load_env()
main()
{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import os, cv2\n",
"import string\n",
"import os.path as osp\n",
"import numpy as np\n",
"from collections import defaultdict\n",
"from vlmeval.smp import ls, load, dump, download_file, encode_image_file_to_base64, md5, mrlines\n",
"import pandas as pd\n",
"import matplotlib.pyplot as plt\n",
"import multiprocessing as mp\n",
"from PIL import Image, ImageFont, ImageDraw\n",
"\n",
"font_URL = 'http://opencompass.openxlab.space/utils/Fonts/timesb.ttf'\n",
"font_file = 'timesb.ttf'\n",
"if not osp.exists(font_file):\n",
" download_file(font_URL)\n",
" \n",
"test_split_URL = 'https://s3-us-east-2.amazonaws.com/prior-datasets/ai2d_test_ids.csv'\n",
"test_split_file = 'ai2d_test_ids.csv'\n",
"if not osp.exists(test_split_file):\n",
" download_file(test_split_URL)\n",
" \n",
"test_ids = set(mrlines(test_split_file))\n",
" \n",
"def proper_font_size(font_file, wh, text, ratio=1):\n",
" font_size = 2\n",
" while True:\n",
" font = ImageFont.truetype(font_file, font_size)\n",
" real_box = font.getbbox(text)\n",
" real_wh = (real_box[2] - real_box[0], real_box[3] - real_box[1])\n",
" if real_wh[0] > wh[0] * ratio or real_wh[1] > wh[1] * ratio:\n",
" break\n",
" font_size += 1\n",
" return font_size\n",
"\n",
"def cover_image(ann_path):\n",
" data = load(ann_path)\n",
" texts = list(data['text'].values())\n",
" raw_img = ann_path.replace('annotations', 'images').replace('.json', '')\n",
" tgt_img = raw_img.replace('images', 'images_abc')\n",
" img = Image.open(raw_img)\n",
" draw = ImageDraw.Draw(img)\n",
" for text in texts:\n",
" st, ed = tuple(text['rectangle'][0]), tuple(text['rectangle'][1])\n",
" T = text['replacementText']\n",
" draw.rectangle((st, ed), fill='white')\n",
" font_size = proper_font_size(font_file, (ed[0] - st[0], ed[1] - st[1]), T, ratio=1)\n",
" font = ImageFont.truetype(font_file, font_size)\n",
" text_box = font.getbbox(T)\n",
" text_wh = (text_box[2] - text_box[0], text_box[3] - text_box[1])\n",
" cx, cy = (st[0] + ed[0]) // 2, st[1]\n",
" stx = cx - text_wh[0] // 2\n",
" sty = cy - text_wh[1] // 2\n",
" draw.text((stx, sty), T, font=font, fill='black')\n",
" img.save(tgt_img) "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Process for no mask images\n",
"test_ids = set(mrlines(test_split_file))\n",
"\n",
"def detect_image_color(image):\n",
" gray_image = image.convert('L')\n",
" mean_brightness = np.mean(np.array(gray_image))\n",
" if mean_brightness < 127:\n",
" return 'white'\n",
" else:\n",
" return 'black'\n",
"\n",
"def cover_image(ann_path):\n",
" data = load(ann_path)\n",
" texts = list(data['text'].values())\n",
" raw_img = ann_path.replace('annotations', 'images').replace('.json', '')\n",
" tgt_img = raw_img.replace('images', 'images_abc')\n",
" img = Image.open(raw_img)\n",
" draw = ImageDraw.Draw(img)\n",
" color = detect_image_color(img)\n",
" font_size = 0\n",
" for text in texts:\n",
" st, ed = tuple(text['rectangle'][0]), tuple(text['rectangle'][1])\n",
" font_size += (ed[1] - st[1])\n",
" if len(texts) != 0:\n",
" font_size /= len(texts)\n",
" else:\n",
" font_size = 2\n",
" for text in texts:\n",
" st, ed = tuple(text['rectangle'][0]), tuple(text['rectangle'][1])\n",
" T = text['replacementText']\n",
" for i in range(2):\n",
" draw.rectangle(\n",
" [(st[0] - i, st[1] - i), (ed[0] + i, ed[1] + i)],\n",
" outline=color\n",
" )\n",
" font = ImageFont.truetype(font_file, font_size)\n",
" text_box = font.getbbox(T)\n",
" text_wh = (text_box[2] - text_box[0], text_box[3] - text_box[1])\n",
" cx, cy = (st[0] + ed[0]) // 2, st[1]\n",
" stx = cx - text_wh[0] // 2\n",
" sty = cy - text_wh[1] * 1.5\n",
" if sty < 0:\n",
" sty = cy + text_wh[1] * 1.3\n",
" draw.text((stx, sty), T, font=font, fill=color)\n",
" img.save(tgt_img) "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"download_file('https://ai2-public-datasets.s3.amazonaws.com/diagrams/ai2d-all.zip')\n",
"os.system('unzip -o ai2d-all.zip')\n",
"\n",
"images = ls('ai2d/images/')\n",
"questions = ls('ai2d/questions/')\n",
"annotations = ls('ai2d/annotations/')\n",
"cates = load('ai2d/categories.json')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"pool = mp.Pool(32)\n",
"pool.map(cover_image, annotations)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def puncproc(inText):\n",
" import re\n",
" outText = inText\n",
" punct = [\n",
" ';', r'/', '[', ']', '\"', '{', '}', '(', ')', '=', '+', '\\\\', '_', '-',\n",
" '>', '<', '@', '`', ',', '?', '!'\n",
" ]\n",
" commaStrip = re.compile('(\\d)(,)(\\d)') # noqa: W605\n",
" periodStrip = re.compile('(?!<=\\d)(\\.)(?!\\d)') # noqa: W605\n",
" for p in punct:\n",
" if (p + ' ' in inText or ' ' + p in inText) or (re.search(commaStrip, inText) is not None):\n",
" outText = outText.replace(p, '')\n",
" else:\n",
" outText = outText.replace(p, ' ')\n",
" outText = periodStrip.sub('', outText, re.UNICODE)\n",
" return outText\n",
"\n",
"def check_choices(line):\n",
" def ischar(s):\n",
" s = str(s)\n",
" if s in ['{}', 'Both', 'None of above']:\n",
" return True\n",
" elif s.startswith('Stage ') and ischar(s[6:]):\n",
" return True\n",
" elif ' and ' in s and np.all([ischar(x) for x in s.split(' and ')]):\n",
" return True\n",
" elif len(s) <= 2:\n",
" return True\n",
" elif len(puncproc(s).split()) > 1:\n",
" return np.all([ischar(x) for x in puncproc(s).split()])\n",
" return False\n",
" n_char = sum([ischar(line[x]) for x in 'ABCD'])\n",
" return n_char >= 3\n",
"\n",
"def check_question(question):\n",
" words = puncproc(question).split()\n",
" for ch in string.ascii_lowercase + string.ascii_uppercase:\n",
" if ch in words:\n",
" return True\n",
" return False\n",
"\n",
"def is_abc(abc, choices, question):\n",
" if abc == 0:\n",
" return False\n",
" if check_choices(choices):\n",
" return True\n",
" if check_question(question):\n",
" return True\n",
" return False"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"data_all = defaultdict(list)\n",
"for qfile in questions:\n",
" data = load(qfile)\n",
" idx = data['imageName'].split('.')[0]\n",
" if idx not in test_ids:\n",
" continue\n",
" image_pth = qfile.replace('questions', 'images').replace('.json', '')\n",
" cate = cates[image_pth.split('/')[-1]]\n",
" for q, qmeta in data['questions'].items():\n",
" assert '.png-' in qmeta['questionId']\n",
" main, sub = qmeta['questionId'].split('.png-')\n",
" idx = int(main) * 100 + int(sub)\n",
" \n",
" answers = qmeta['answerTexts']\n",
" correct = qmeta['correctAnswer']\n",
" \n",
" data_all['index'].append(idx)\n",
" data_all['question'].append(q)\n",
" assert len(answers) == 4\n",
" for c, a in zip('ABCD', answers):\n",
" data_all[c].append(a)\n",
" data_all['answer'].append('ABCD'[qmeta['correctAnswer']])\n",
" data_all['category'].append(cate)\n",
" data_all['abcLabel'].append(qmeta['abcLabel'])\n",
" abc = is_abc(qmeta['abcLabel'], {x: data_all[x][-1] for x in 'ABCD'}, q)\n",
" # if qmeta['abcLabel'] and not abc:\n",
" # print(qmeta['abcLabel'], {x: data_all[x][-1] for x in 'ABCD'}, q)\n",
" data_all['image_path'].append(image_pth.replace('images', 'images_abc') if abc else image_pth)\n",
"data = pd.DataFrame(data_all)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"images = []\n",
"image_seen = {}\n",
"for idx, pth in zip(data['index'], data['image_path']):\n",
" images.append(encode_image_file_to_base64(pth))\n",
"\n",
"data['image'] = images\n",
"dump(data, 'AI2D_TEST.tsv')\n",
"print(md5('AI2D_TEST.tsv'))"
]
}
],
"metadata": {
"language_info": {
"name": "python"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
import sys
from vlmeval import *
from vlmeval.dataset import SUPPORTED_DATASETS
FAIL_MSG = 'Failed to obtain answer via API.'
root = sys.argv[1]
if root[-1] in '/\\':
root = root[:-1]
model_name = root.split('/')[-1]
for d in SUPPORTED_DATASETS:
fname = f'{model_name}_{d}.xlsx'
pth = osp.join(root, fname)
if osp.exists(pth):
data = load(pth)
# Detect Failure
assert 'prediction' in data
data['prediction'] = [str(x) for x in data['prediction']]
fail = [FAIL_MSG in x for x in data['prediction']]
if sum(fail):
nfail = sum(fail)
ntot = len(fail)
print(f'Model {model_name} x Dataset {d}: {nfail} out of {ntot} failed. {nfail / ntot * 100: .2f}%. ')
eval_files = ls(root, match=f'{model_name}_{d}_')
eval_files = [x for x in eval_files if listinstr([f'{d}_openai', f'{d}_gpt'], x) and x.endswith('.xlsx')]
if len(eval_files) == 0:
print(f'Model {model_name} x Dataset {d} openai missing')
continue
assert len(eval_files) == 1
eval_file = eval_files[0]
data = load(eval_file)
if 'MMVet' in d:
bad = [x for x in data['log'] if 'All 5 retries failed.' in str(x)]
if len(bad):
print(f'Model {model_name} x Dataset {d} Evaluation: {len(bad)} out of {len(data)} failed.')
elif 'MathVista' in d:
bad = [x for x in data['res'] if FAIL_MSG in str(x)]
if len(bad):
print(f'Model {model_name} x Dataset {d} Evaluation: {len(bad)} out of {len(data)} failed.')
elif d == 'LLaVABench':
sub = data[data['gpt4_score'] == -1]
sub = sub[sub['gpt4_score'] == -1]
if len(sub):
print(f'Model {model_name} x Dataset {d} Evaluation: {len(sub)} out of {len(data)} failed.')
else:
bad = [x for x in data['log'] if FAIL_MSG in str(x)]
if len(bad):
print(f'Model {model_name} x Dataset {d} Evaluation: {len(bad)} out of {len(data)} failed.')
import argparse
from vlmeval.smp import *
from vlmeval.config import supported_VLM
def is_api(x):
return getattr(supported_VLM[x].func, 'is_api', False)
models = list(supported_VLM)
models = [x for x in models if 'fs' not in x]
models = [x for x in models if not is_api(x)]
exclude_list = ['cogvlm-grounding-generalist', 'emu2']
models = [x for x in models if x not in exclude_list]
def is_large(x):
return '80b' in x or 'emu2' in x or '34B' in x
small_models = [x for x in models if not is_large(x)]
large_models = [x for x in models if is_large(x)]
models = small_models + large_models
parser = argparse.ArgumentParser()
parser.add_argument('--data', type=str, nargs='+', required=True)
args = parser.parse_args()
# Skip some models
models = [x for x in models if not listinstr(['MiniGPT', 'grounding-generalist'], x)]
for m in models:
unknown_datasets = [x for x in args.data if not osp.exists(f'{m}/{m}_{x}.xlsx')]
if len(unknown_datasets) == 0:
continue
dataset_str = ' '.join(unknown_datasets)
if '80b' in m:
cmd = f'python run.py --data {dataset_str} --model {m}'
else:
cmd = f'bash run.sh --data {dataset_str} --model {m}'
print(cmd)
os.system(cmd)
#!/bin/bash
DIR=$( cd -- "$( dirname -- "${BASH_SOURCE[0]}" )" &> /dev/null && pwd )
cp $DIR/../config.py $DIR/../vlmeval/
cp $DIR/../misc/* $DIR/../vlmeval/vlm/misc/
from vlmeval.smp import *
from vlmeval.evaluate.multiple_choice import multiple_choice_eval
import gradio as gr
HEADER = """
# Welcome to MMBench👏👏
We are delighted that you are willing to submit your evaluation results to the MMBench official website! The evaluation service currently handles submissions for MMBench, MMBench-CN, and CCBench. We use `gpt-3.5-turbo-0125` to help with answer matching. Evaluation Codes in VLMEvalKit: https://github.com/open-compass/VLMEvalKit. Please adopt / follow the implementation of VLMEvalKit to generate the submission files.
Moreover, this is a temporary solution, which **does not support ChatGPT-based answer extraction**. So you need to make sure the values in the `prediction` field of your submission files are single characters among A, B, C, D. Other free-form answers cannot be recognized by the evaluation script and will be marked as **WRONG**.
The evaluation script is available at https://github.com/open-compass/VLMEvalKit/tree/main/scripts/mmb_eval_gradio.py
Please contact `opencompass@pjlab.org.cn` for any inquiries about this script.
"""
def upload_file(file):
file_path = file.name
return file_path
def prepare_file(file_name):
file_md5 = md5(file_name)
root = LMUDataRoot()
root = osp.join(root, 'eval_server')
os.makedirs(root, exist_ok=True)
suffix = file_name.split('.')[-1]
if suffix not in ['xlsx', 'tsv', 'csv']:
return False, "Please submit a file that ends with `.xlsx`, `.tsv`, or `.csv`"
new_file_name = osp.join(root, f'{file_md5}.{suffix}')
shutil.move(file_name, new_file_name)
eval_file = new_file_name
try:
data = load(eval_file)
except:
return False, "Your excel file can not be successfully loaded by `pd.read_excel`, please double check and submit again. "
for k in data.keys():
data[k.lower() if k not in 'ABCD' else k] = data.pop(k)
if "index" not in data:
return False, "Your excel file should have a column named `index`, please double check and submit again" , {}
if "prediction" not in data:
return False, "Your excel file should have a column named `prediction`, please double check and submit again" , {}
for ch in 'ABCD':
if ch not in data:
return False, f"Your excel file should have a column named `{ch}`, please double check and submit again" , {}
dump(data, eval_file)
return True, eval_file
def determine_dataset(eval_file):
data = load(eval_file)
def cn_ratio(data):
iscn = [cn_string(x) for x in data['question']]
return np.mean(iscn)
if len(data) < 2500 and 'l2-category' not in data:
return 'CCBench' if cn_ratio(data) > 0.5 else "Unknown"
else:
return 'MMBench_CN' if cn_ratio(data) > 0.5 else "MMBench"
def reformat_acc(acc):
splits = set(acc['split'])
keys = list(acc.keys())
keys.remove('split')
nacc = {'Category': []}
for sp in splits:
nacc[sp.upper()] = []
for k in keys:
nacc['Category'].append(k)
for sp in splits:
nacc[sp.upper()].append(acc[acc['split'] == sp].iloc[0][k] * 100)
return pd.DataFrame(nacc)
def evaluate(file):
file_name = file.name
flag, eval_file = prepare_file(file_name)
if not flag:
return "Error: " + eval_file
dataset = determine_dataset(eval_file)
if dataset == 'Unknown':
return "Error: Cannot determine the dataset given your submitted file. "
eval_id = eval_file.split('/')[-1].split('.')[0]
ret = f"Evaluation ID: {eval_id}\n"
timestamp = datetime.datetime.now().strftime('%Y.%m.%d %H:%M:%S')
ret += f'Evaluation Timestamp: {timestamp}\n'
acc = multiple_choice_eval(eval_file, dataset=dataset, model='exact_matching')
nacc = reformat_acc(acc).round(1)
return ret, nacc
with gr.Blocks() as demo:
gr.Markdown(HEADER)
file_output = gr.File()
upload_button = gr.UploadButton("Click to upload your prediction files for a supported benchmark")
upload_button.upload(upload_file, upload_button, file_output)
btn = gr.Button("🚀 Evaluate")
eval_log = gr.Textbox(label="Evaluation Log", placeholder="Your evaluation log will be displayed here")
df_empty = pd.DataFrame([], columns=['Evaluation Result'])
eval_result = gr.components.DataFrame(value=df_empty)
btn.click(evaluate, inputs=[file_output], outputs=[eval_log, eval_result])
if __name__ == '__main__':
demo.launch(server_name='0.0.0.0', debug=True, show_error=True)
#!/bin/bash
set -x
export GPU=$(nvidia-smi --list-gpus | wc -l)
torchrun --nproc-per-node=$GPU run.py ${@:1}
#!/bin/bash
set -x
srun -n1 --ntasks-per-node=1 --partition $1 --gres=gpu:8 --quotatype=reserved --job-name vlmeval --cpus-per-task=64 torchrun --nproc-per-node=8 run.py ${@:2}
from vlmeval.smp import *
from vlmeval.dataset import dataset_URLs
def get_score(model, dataset):
file_name = f'{model}/{model}_{dataset}'
if listinstr([
'CCBench', 'MMBench', 'SEEDBench_IMG', 'MMMU', 'ScienceQA',
'AI2D_TEST', 'MMStar', 'RealWorldQA', 'BLINK'
], dataset):
file_name += '_acc.csv'
elif listinstr(['MME', 'Hallusion', 'LLaVABench'], dataset):
file_name += '_score.csv'
elif listinstr(['MMVet', 'MathVista'], dataset):
file_name += '_gpt-4-turbo_score.csv'
elif listinstr(['COCO', 'OCRBench'], dataset):
file_name += '_score.json'
else:
raise NotImplementedError
if not osp.exists(file_name):
return {}
data = load(file_name)
ret = {}
if dataset == 'CCBench':
ret[dataset] = data['Overall'][0] * 100
elif dataset == 'MMBench':
for n, a in zip(data['split'], data['Overall']):
if n == 'dev':
ret['MMBench_DEV_EN'] = a * 100
elif n == 'test':
ret['MMBench_TEST_EN'] = a * 100
elif dataset == 'MMBench_CN':
for n, a in zip(data['split'], data['Overall']):
if n == 'dev':
ret['MMBench_DEV_CN'] = a * 100
elif n == 'test':
ret['MMBench_TEST_CN'] = a * 100
elif listinstr(['SEEDBench', 'ScienceQA', 'MMBench', 'AI2D_TEST', 'MMStar', 'RealWorldQA', 'BLINK'], dataset):
ret[dataset] = data['Overall'][0] * 100
elif 'MME' == dataset:
ret[dataset] = data['perception'][0] + data['reasoning'][0]
elif 'MMVet' == dataset:
data = data[data['Category'] == 'Overall']
ret[dataset] = float(data.iloc[0]['acc'])
elif 'HallusionBench' == dataset:
data = data[data['split'] == 'Overall']
for met in ['aAcc', 'qAcc', 'fAcc']:
ret[dataset + f' ({met})'] = float(data.iloc[0][met])
elif 'MMMU' in dataset:
data = data[data['split'] == 'validation']
ret['MMMU (val)'] = float(data.iloc[0]['Overall']) * 100
elif 'MathVista' in dataset:
data = data[data['Task&Skill'] == 'Overall']
ret[dataset] = float(data.iloc[0]['acc'])
elif 'LLaVABench' in dataset:
data = data[data['split'] == 'overall'].iloc[0]
ret[dataset] = float(data['Relative Score (main)'])
elif 'OCRBench' in dataset:
ret[dataset] = data['Final Score']
return ret
def parse_args():
parser = argparse.ArgumentParser()
parser.add_argument('--data', type=str, nargs='+', default=[])
parser.add_argument("--model", type=str, nargs='+', required=True)
args = parser.parse_args()
return args
def gen_table(models, datasets):
res = defaultdict(dict)
for m in models:
for d in datasets:
try:
res[m].update(get_score(m, d))
except:
pass
keys = []
for m in models:
for d in res[m]:
keys.append(d)
keys = list(set(keys))
keys.sort()
final = defaultdict(list)
for m in models:
final['Model'].append(m)
for k in keys:
if k in res[m]:
final[k].append(res[m][k])
else:
final[k].append(None)
final = pd.DataFrame(final)
dump(final, 'summ.csv')
if len(final) >= len(final.iloc[0].keys()):
print(tabulate(final))
else:
print(tabulate(final.T))
if __name__ == '__main__':
args = parse_args()
if args.data == []:
args.data = list(dataset_URLs)
gen_table(args.model, args.data)
{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import json\n",
"import copy as cp\n",
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"import matplotlib.font_manager as fm\n",
"\n",
"def download_file(url, filename=None):\n",
" from urllib.request import urlretrieve\n",
" if filename is None:\n",
" filename = url.split('/')[-1]\n",
" urlretrieve(url, filename)\n",
"\n",
"font_URL = 'http://opencompass.openxlab.space/utils/Fonts/segoepr.ttf'\n",
"download_file(font_URL)\n",
"\n",
"font12 = fm.FontProperties(fname='segoepr.ttf', size=12)\n",
"font15 = fm.FontProperties(fname='segoepr.ttf', size=15, weight='bold')\n",
"font18 = fm.FontProperties(fname='segoepr.ttf', size=18, weight='bold')\n",
"\n",
"DATA_URL = 'http://opencompass.openxlab.space/utils/OpenVLM.json'\n",
"download_file(DATA_URL)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def pre_normalize(raw_data, labels):\n",
" data_list = cp.deepcopy(raw_data)\n",
" minimum, maximum, max_range, range_map = {}, {}, 0, {}\n",
" for lb in labels:\n",
" minimum[lb] = min([x[lb] for x in data_list])\n",
" maximum[lb] = max([x[lb] for x in data_list])\n",
" max_range = max(max_range, maximum[lb] - minimum[lb])\n",
" max_range *= 1.25\n",
" for lb in labels:\n",
" mid = (minimum[lb] + maximum[lb]) / 2\n",
" new_range = (mid - max_range / 2, mid + max_range / 2) if (mid + max_range / 2) < 100 else (100 - max_range, 100)\n",
" range_map[lb] = new_range\n",
" for item in data_list:\n",
" assert new_range[0] <= item[lb] <= new_range[1]\n",
" item[lb] = (item[lb] - new_range[0]) / max_range * 100\n",
" return data_list, range_map"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Draw MMBench Radar Graph\n",
"data = json.loads(open('OpenVLM.json').read())['results']\n",
"models = list(data)\n",
"print(models)\n",
"\n",
"model2vis = [\n",
" 'GPT-4v (detail: low)', 'GeminiProVision', 'Qwen-VL-Plus', \n",
" 'InternLM-XComposer2-VL', 'LLaVA-v1.5-13B', 'CogVLM-17B-Chat',\n",
" 'mPLUG-Owl2', 'Qwen-VL-Chat', 'IDEFICS-80B-Instruct'\n",
"]\n",
"colors = [\n",
" '#1f77b4', '#ff7f0e', '#2ca02c', '#d62728', '#9467bd', '#8c564b', \n",
" '#e377c2', '#7f7f7f', '#bcbd22'\n",
"]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"split = 'MMBench_TEST_EN'\n",
"data_sub = {k: v[split] for k, v in data.items()}\n",
"\n",
"labels = list(data_sub[model2vis[0]])\n",
"labels.remove('Overall')\n",
"num_vars = len(labels)\n",
"\n",
"raw_data = [data_sub[m] for m in model2vis]\n",
"data_list, range_map = pre_normalize(raw_data, labels)\n",
"\n",
"alpha = 0.25\n",
"angles = np.linspace(0, 2 * np.pi, num_vars, endpoint=False).tolist()\n",
"angles_deg = np.linspace(0, 360, num_vars, endpoint=False).tolist()\n",
"fig, ax_base = plt.subplots(nrows=1, ncols=1, figsize=(10, 10), subplot_kw=dict(polar=True))\n",
"\n",
"for i in range(len(data_list)):\n",
" item = data_list[i]\n",
" model_name = model2vis[i]\n",
" color = colors[i]\n",
" tmp_angles = angles[:] + [angles[0]]\n",
" tmp_values = [item[lb] for lb in labels] + [item[labels[0]]]\n",
" ax_base.plot(tmp_angles, tmp_values, color=color, linewidth=1, linestyle='solid', label=model_name)\n",
" ax_base.fill(tmp_angles, tmp_values, color=color, alpha=alpha)\n",
" \n",
"angles += [angles[0]]\n",
"ax_base.set_ylim(0, 100)\n",
"ax_base.set_yticks([40, 60, 80, 100])\n",
"ax_base.set_yticklabels([''] * 4)\n",
"\n",
"ax_base.tick_params(pad=25)\n",
"ax_base.set_xticks(angles[:-1])\n",
"ax_base.set_xticklabels(labels, fontproperties=font18)\n",
"\n",
"leg = ax_base.legend(loc='center right', bbox_to_anchor=(1.6, 0.5), prop=font15, ncol=1, frameon=True, labelspacing=1.2)\n",
"for line in leg.get_lines():\n",
" line.set_linewidth(2.5)\n",
"\n",
"cx, cy, sz = 0.44, 0.435, 0.34\n",
"axes = [fig.add_axes([cx - sz, cy - sz, cx + sz, cy + sz], projection='polar', label='axes%d' % i) for i in range(num_vars)]\n",
" \n",
"for ax, angle, label in zip(axes, angles_deg, labels):\n",
" ax.patch.set_visible(False)\n",
" ax.grid(False)\n",
" ax.xaxis.set_visible(False)\n",
" cur_range = range_map[label]\n",
" label_list = [cur_range[0] + (cur_range[1] - cur_range[0]) / 5 * i for i in range(2, 6)]\n",
" label_list = [f'{x:.1f}' for x in label_list]\n",
" ax.set_rgrids(range(40, 120, 20), angle=angle, labels=label_list, font_properties=font12)\n",
" ax.spines['polar'].set_visible(False)\n",
" ax.set_ylim(0, 100)\n",
"\n",
"title_text = f'{len(model2vis)} Representative VLMs on MMBench Test.'\n",
"plt.figtext(.7, .95, title_text, fontproperties=font18, ha='center')\n",
"plt.show()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"labels = ['SEEDBench_IMG', 'CCBench', 'MMBench_TEST_EN', 'MMBench_TEST_CN', 'MME', 'MMVet', 'MMMU_VAL', 'MathVista', 'HallusionBench', 'LLaVABench']\n",
"num_vars = len(labels)\n",
"\n",
"raw_data = [{k: data[m][k]['Overall'] for k in labels} for m in model2vis]\n",
"data_list, range_map = pre_normalize(raw_data, labels)\n",
"\n",
"alpha = 0.25\n",
"angles = np.linspace(0, 2 * np.pi, num_vars, endpoint=False).tolist()\n",
"angles_deg = np.linspace(0, 360, num_vars, endpoint=False).tolist()\n",
"fig, ax_base = plt.subplots(nrows=1, ncols=1, figsize=(10, 10), subplot_kw=dict(polar=True))\n",
"\n",
"for i in range(len(data_list)):\n",
" item = data_list[i]\n",
" model_name = model2vis[i]\n",
" color = colors[i]\n",
" tmp_angles = angles[:] + [angles[0]]\n",
" tmp_values = [item[lb] for lb in labels] + [item[labels[0]]]\n",
" ax_base.plot(tmp_angles, tmp_values, color=color, linewidth=1, linestyle='solid', label=model_name)\n",
" ax_base.fill(tmp_angles, tmp_values, color=color, alpha=alpha)\n",
" \n",
"angles += [angles[0]]\n",
"ax_base.set_ylim(0, 100)\n",
"ax_base.set_yticks([40, 60, 80, 100])\n",
"ax_base.set_yticklabels([''] * 4)\n",
"\n",
"ax_base.tick_params(pad=15)\n",
"ax_base.set_xticks(angles[:-1])\n",
"ax_base.set_xticklabels(labels, fontproperties=font18)\n",
"\n",
"dataset_map = {\n",
" 'MMBench_TEST_EN': 'MMBench (Test)', \n",
" 'MMBench_TEST_CN': 'MMBenchCN (Test)', \n",
" 'MathVista': 'MathVista (TestMini)', \n",
" 'MMMU_VAL': 'MMMU (Val)'\n",
"}\n",
"for i, label in enumerate(ax_base.get_xticklabels()):\n",
" x,y = label.get_position()\n",
" text = label.get_text()\n",
" text = dataset_map[text] if text in dataset_map else text\n",
" lab = ax_base.text(x, y, text, transform=label.get_transform(),\n",
" ha=label.get_ha(), va=label.get_va(), font_properties=font15)\n",
" lab.set_rotation(360 / num_vars * i + 270)\n",
" labels.append(lab)\n",
"ax_base.set_xticklabels([])\n",
"\n",
"leg = ax_base.legend(loc='center right', bbox_to_anchor=(1.6, 0.5), prop=font15, ncol=1, frameon=True, labelspacing=1.2)\n",
"for line in leg.get_lines():\n",
" line.set_linewidth(2.5)\n",
"\n",
"cx, cy, sz = 0.44, 0.435, 0.34\n",
"axes = [fig.add_axes([cx - sz, cy - sz, cx + sz, cy + sz], projection='polar', label='axes%d' % i) for i in range(num_vars)]\n",
" \n",
"for ax, angle, label in zip(axes, angles_deg, labels):\n",
" ax.patch.set_visible(False)\n",
" ax.grid(False)\n",
" ax.xaxis.set_visible(False)\n",
" cur_range = range_map[label]\n",
" label_list = [cur_range[0] + (cur_range[1] - cur_range[0]) / 5 * i for i in range(2, 6)]\n",
" label_list = [f'{x:.1f}' for x in label_list]\n",
" ax.set_rgrids(range(40, 120, 20), angle=angle, labels=label_list, font_properties=font12)\n",
" ax.spines['polar'].set_visible(False)\n",
" ax.set_ylim(0, 100)\n",
"\n",
"title_text = f'{len(model2vis)} Representative VLMs on {num_vars} Benchmarks in OpenCompass Multi-Modal Leaderboard.'\n",
"plt.figtext(.7, .95, title_text, fontproperties=font18, ha='center')\n",
"plt.show()"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "base",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.5"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
import re
import sys
from os.path import exists
from setuptools import find_packages, setup
def parse_requirements(fname='requirements.txt', with_version=True):
"""Parse the package dependencies listed in a requirements file but strips
specific versioning information.
Args:
fname (str): path to requirements file
with_version (bool, default=True): if True include version specs
Returns:
List[str]: list of requirements items
CommandLine:
python -c "import setup; print(setup.parse_requirements())"
"""
require_fpath = fname
def parse_line(line):
"""Parse information from a line in a requirements text file."""
if line.startswith('-r '):
# Allow specifying requirements in other files
target = line.split(' ')[1]
for info in parse_require_file(target):
yield info
else:
info = {'line': line}
if line.startswith('-e '):
info['package'] = line.split('#egg=')[1]
elif '@git+' in line:
info['package'] = line
else:
# Remove versioning from the package
pat = '(' + '|'.join(['>=', '==', '>']) + ')'
parts = re.split(pat, line, maxsplit=1)
parts = [p.strip() for p in parts]
info['package'] = parts[0]
if len(parts) > 1:
op, rest = parts[1:]
if ';' in rest:
# Handle platform specific dependencies
# http://setuptools.readthedocs.io/en/latest/setuptools.html#declaring-platform-specific-dependencies
version, platform_deps = map(str.strip,
rest.split(';'))
info['platform_deps'] = platform_deps
else:
version = rest # NOQA
info['version'] = (op, version)
yield info
def parse_require_file(fpath):
with open(fpath, 'r') as f:
for line in f.readlines():
line = line.strip()
if line and not line.startswith('#'):
for info in parse_line(line):
yield info
def gen_packages_items():
if exists(require_fpath):
for info in parse_require_file(require_fpath):
parts = [info['package']]
if with_version and 'version' in info:
parts.extend(info['version'])
if not sys.version.startswith('3.4'):
# apparently package_deps are broken in 3.4
platform_deps = info.get('platform_deps')
if platform_deps is not None:
parts.append(';' + platform_deps)
item = ''.join(parts)
yield item
packages = list(gen_packages_items())
return packages
with open('README.md') as f:
readme = f.read()
def do_setup():
setup(
name='vlmeval',
version='0.1.0',
description='OpenCompass VLM Evaluation Kit',
author='Haodong Duan',
author_email='dhd.efz@gmail.com',
maintainer='Haodong Duan',
maintainer_email='dhd.efz@gmail.com',
long_description=readme,
long_description_content_type='text/markdown',
cmdclass={},
install_requires=parse_requirements('requirements.txt'),
setup_requires=[],
python_requires='>=3.7.0',
packages=find_packages(exclude=[
'test*',
'paper_test*',
]),
keywords=['AI', 'NLP', 'in-context learning'],
entry_points={
'console_scripts': ['vlmutil = vlmeval:cli']
},
classifiers=[
'Programming Language :: Python :: 3.7',
'Programming Language :: Python :: 3.8',
'Programming Language :: Python :: 3.9',
'Programming Language :: Python :: 3.10',
'Intended Audience :: Developers',
'Intended Audience :: Education',
'Intended Audience :: Science/Research',
])
if __name__ == '__main__':
do_setup()
try:
import torch
except ImportError:
pass
from .smp import *
from .api import *
from .dataset import *
from .utils import *
from .vlm import *
from .config import *
from .tools import cli
load_env()
__version__ = '0.2rc1'
from .gpt import OpenAIWrapper, GPT4V
from .hf_chat_model import HFChatModel
from .gemini import GeminiWrapper, GeminiProVision
from .qwen_vl_api import QwenVLWrapper, QwenVLAPI, Qwen2VLAPI
from .qwen_api import QwenAPI
from .claude import Claude_Wrapper, Claude3V
from .reka import Reka
from .glm_vision import GLMVisionAPI
from .cloudwalk import CWWrapper
from .sensechat_vision import SenseChatVisionAPI
from .hunyuan import HunyuanVision
from .bluelm_v_api import BlueLMWrapper, BlueLM_V_API
__all__ = [
'OpenAIWrapper', 'HFChatModel', 'GeminiWrapper', 'GPT4V',
'GeminiProVision', 'QwenVLWrapper', 'QwenVLAPI', 'QwenAPI',
'Claude3V', 'Claude_Wrapper', 'Reka', 'GLMVisionAPI',
'CWWrapper', 'SenseChatVisionAPI', 'HunyuanVision', 'Qwen2VLAPI',
'BlueLMWrapper', 'BlueLM_V_API',
]
import time
import random as rd
from abc import abstractmethod
import os.path as osp
import copy as cp
from ..smp import get_logger, parse_file, concat_images_vlmeval
class BaseAPI:
allowed_types = ['text', 'image']
INTERLEAVE = True
INSTALL_REQ = False
def __init__(self,
retry=10,
wait=3,
system_prompt=None,
verbose=True,
fail_msg='Failed to obtain answer via API.',
**kwargs):
"""Base Class for all APIs.
Args:
retry (int, optional): The retry times for `generate_inner`. Defaults to 10.
wait (int, optional): The wait time after each failed retry of `generate_inner`. Defaults to 3.
system_prompt (str, optional): Defaults to None.
verbose (bool, optional): Defaults to True.
fail_msg (str, optional): The message to return when failed to obtain answer.
Defaults to 'Failed to obtain answer via API.'.
**kwargs: Other kwargs for `generate_inner`.
"""
self.wait = wait
self.retry = retry
self.system_prompt = system_prompt
self.verbose = verbose
self.fail_msg = fail_msg
self.logger = get_logger('ChatAPI')
if len(kwargs):
self.logger.info(f'BaseAPI received the following kwargs: {kwargs}')
self.logger.info('Will try to use them as kwargs for `generate`. ')
self.default_kwargs = kwargs
@abstractmethod
def generate_inner(self, inputs, **kwargs):
"""The inner function to generate the answer.
Returns:
tuple(int, str, str): ret_code, response, log
"""
self.logger.warning('For APIBase, generate_inner is an abstract method. ')
assert 0, 'generate_inner not defined'
ret_code, answer, log = None, None, None
# if ret_code is 0, means succeed
return ret_code, answer, log
def working(self):
"""If the API model is working, return True, else return False.
Returns:
bool: If the API model is working, return True, else return False.
"""
self.old_timeout = None
if hasattr(self, 'timeout'):
self.old_timeout = self.timeout
self.timeout = 120
retry = 5
while retry > 0:
ret = self.generate('hello')
if ret is not None and ret != '' and self.fail_msg not in ret:
if self.old_timeout is not None:
self.timeout = self.old_timeout
return True
retry -= 1
if self.old_timeout is not None:
self.timeout = self.old_timeout
return False
def check_content(self, msgs):
"""Check the content type of the input. Four types are allowed: str, dict, liststr, listdict.
Args:
msgs: Raw input messages.
Returns:
str: The message type.
"""
if isinstance(msgs, str):
return 'str'
if isinstance(msgs, dict):
return 'dict'
if isinstance(msgs, list):
types = [self.check_content(m) for m in msgs]
if all(t == 'str' for t in types):
return 'liststr'
if all(t == 'dict' for t in types):
return 'listdict'
return 'unknown'
def preproc_content(self, inputs):
"""Convert the raw input messages to a list of dicts.
Args:
inputs: raw input messages.
Returns:
list(dict): The preprocessed input messages. Will return None if failed to preprocess the input.
"""
if self.check_content(inputs) == 'str':
return [dict(type='text', value=inputs)]
elif self.check_content(inputs) == 'dict':
assert 'type' in inputs and 'value' in inputs
return [inputs]
elif self.check_content(inputs) == 'liststr':
res = []
for s in inputs:
mime, pth = parse_file(s)
if mime is None or mime == 'unknown':
res.append(dict(type='text', value=s))
else:
res.append(dict(type=mime.split('/')[0], value=pth))
return res
elif self.check_content(inputs) == 'listdict':
for item in inputs:
assert 'type' in item and 'value' in item
mime, s = parse_file(item['value'])
if mime is None:
assert item['type'] == 'text', item['value']
else:
assert mime.split('/')[0] == item['type']
item['value'] = s
return inputs
else:
return None
# May exceed the context window size, so try with different numbers of turns.
def chat_inner(self, inputs, **kwargs):
_ = kwargs.pop('dataset', None)
while len(inputs):
try:
return self.generate_inner(inputs, **kwargs)
except:
inputs = inputs[1:]
while len(inputs) and inputs[0]['role'] != 'user':
inputs = inputs[1:]
continue
return -1, self.fail_msg + ': ' + 'Failed with all possible conversation turns.', None
def chat(self, messages, **kwargs1):
"""The main function for multi-turn chatting. Will call `chat_inner` with the preprocessed input messages."""
assert hasattr(self, 'chat_inner'), 'The API model should have the `chat_inner` method. '
for msg in messages:
assert isinstance(msg, dict) and 'role' in msg and 'content' in msg, msg
assert self.check_content(msg['content']) in ['str', 'dict', 'liststr', 'listdict'], msg
msg['content'] = self.preproc_content(msg['content'])
# merge kwargs
kwargs = cp.deepcopy(self.default_kwargs)
kwargs.update(kwargs1)
answer = None
# a very small random delay [0s - 0.5s]
T = rd.random() * 0.5
time.sleep(T)
assert messages[-1]['role'] == 'user'
for i in range(self.retry):
try:
ret_code, answer, log = self.chat_inner(messages, **kwargs)
if ret_code == 0 and self.fail_msg not in answer and answer != '':
if self.verbose:
print(answer)
return answer
elif self.verbose:
if not isinstance(log, str):
try:
log = log.text
except:
self.logger.warning(f'Failed to parse {log} as an http response. ')
self.logger.info(f'RetCode: {ret_code}\nAnswer: {answer}\nLog: {log}')
except Exception as err:
if self.verbose:
self.logger.error(f'An error occurred during try {i}:')
self.logger.error(err)
# delay before each retry
T = rd.random() * self.wait * 2
time.sleep(T)
return self.fail_msg if answer in ['', None] else answer
def generate(self, message, **kwargs1):
"""The main function to generate the answer. Will call `generate_inner` with the preprocessed input messages.
Args:
message: raw input messages.
Returns:
str: The generated answer, or the fail message if no answer could be obtained.
"""
assert self.check_content(message) in ['str', 'dict', 'liststr', 'listdict'], f'Invalid input type: {message}'
message = self.preproc_content(message)
assert message is not None and self.check_content(message) == 'listdict'
for item in message:
assert item['type'] in self.allowed_types, f'Invalid input type: {item["type"]}'
# merge kwargs
kwargs = cp.deepcopy(self.default_kwargs)
kwargs.update(kwargs1)
answer = None
# a very small random delay [0s - 0.5s]
T = rd.random() * 0.5
time.sleep(T)
for i in range(self.retry):
try:
ret_code, answer, log = self.generate_inner(message, **kwargs)
if ret_code == 0 and self.fail_msg not in answer and answer != '':
if self.verbose:
print(answer)
return answer
elif self.verbose:
if not isinstance(log, str):
try:
log = log.text
except:
self.logger.warning(f'Failed to parse {log} as an http response. ')
self.logger.info(f'RetCode: {ret_code}\nAnswer: {answer}\nLog: {log}')
except Exception as err:
if self.verbose:
self.logger.error(f'An error occurred during try {i}:')
self.logger.error(err)
# delay before each retry
T = rd.random() * self.wait * 2
time.sleep(T)
return self.fail_msg if answer in ['', None] else answer
def message_to_promptimg(self, message, dataset=None):
assert not self.INTERLEAVE
model_name = self.__class__.__name__
import warnings
warnings.warn(
f'Model {model_name} does not support interleaved input. '
'Will use the first image and aggregated texts as prompt. ')
num_images = len([x for x in message if x['type'] == 'image'])
if num_images == 0:
prompt = '\n'.join([x['value'] for x in message if x['type'] == 'text'])
image = None
elif num_images == 1:
prompt = '\n'.join([x['value'] for x in message if x['type'] == 'text'])
image = [x['value'] for x in message if x['type'] == 'image'][0]
else:
prompt = '\n'.join([x['value'] if x['type'] == 'text' else '<image>' for x in message])
if dataset == 'BLINK':
image = concat_images_vlmeval(
[x['value'] for x in message if x['type'] == 'image'],
target_size=512)
else:
image = [x['value'] for x in message if x['type'] == 'image'][0]
return prompt, image