"docs/source/apis/volumetric_marching.rst" did not exist on "cbaedda7ec67974547da3d6292d7fc5815911fe6"
Commit 81028572 authored by luopl's avatar luopl
Browse files

init

parents
Pipeline #1722 canceled with stages
.. role:: hidden
:class: hidden-section
.. currentmodule:: {{ module }}
{{ name | underline}}
.. autoclass:: {{ name }}
:members:
:special-members: __call__
..
autogenerated from _templates/callable.rst
note it does not have :inherited-members:
# Contributors
## Contributors w. 3+ Major Contributions
> In this section, we list all the contributors who have made significant contributions (3+) to the development of VLMEvalKit.
New Qualified Contributors (2024.09):
1. [amitbcp](https://github.com/amitbcp): The contributor helped support MUIRBench, Phi-3.5, Idefics3, VILA, and xGen-MM
2. [czczup](https://github.com/czczup): The contributor helped support the InternVL Series (V1.5, Mini-InternVL, V2, etc.)
3. [DseidLi](https://github.com/DseidLi): The contributor helped support LLaVA-OneVision, GQA, and developed the readthedocs site for VLMEvalKit
4. [mayubo2333](https://github.com/mayubo2333): The contributor helped support MMLongBench, SlideVQA, and DUDE
5. [sun-hailong](https://github.com/sun-hailong): The contributor helped support A-OKVQA, Parrot, MMMB, and MTL-MMBench
6. [PhoenixZ810](https://github.com/PhoenixZ810): The contributor helped support Video-ChatGPT, Chat-UniVI, and Llama-VID
7. [Cuiunbo](https://github.com/Cuiunbo): The contributor helped support OmniLMM-12B, MiniCPM-V Series (V1, V2, V2.5)
## Full Contributor List
> In this section, we list all the contributors as well as their corresponding contributions to the development of VLMEvalKit.
TBD.
# 🛠️ How to implement a new Benchmark / VLM in VLMEvalKit?
## Implement a new benchmark
Example PR: **Math-Vision Benchmark** ([#292](https://github.com/open-compass/VLMEvalKit/pull/292/files))
In VLMEvalKit, benchmarks are organized as dataset classes. When you implement a new benchmark, you can either reuse an existing dataset class (*e.g.*, you can reuse `ImageMCQDataset` when implementing a new multi-choice benchmark) or implement a new dataset class. Each dataset must have the following two member functions (either reuse those of the parent class or implement your own; a minimal sketch follows the list below):
- `build_prompt(self, line)`: The function input `line` is an integer (the sample index) or a `pd.Series` object (the raw record of the sample). The function outputs a `multi-modal message`, serving as the input of an MLLM. The `multi-modal message` is an interleaved list of multi-modal messages adopting the following format (the example includes an image and a text message): `[dict(type='image', value=IMAGE_PTH), dict(type='text', value=prompt)]`.
- `evaluate(self, eval_file, **judge_kwargs)`: The function input `eval_file` is the MLLM prediction (typically in `.xlsx` format). If the benchmark requires an external LLM (typically GPT) for evaluation, then `judge_kwargs` can pass the arguments for the LLM. The function outputs the benchmark evaluation results (metrics) in the form of `dict` or `pd.DataFrame`.
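Before walking through the steps, here is a minimal sketch of the reuse case mentioned above: a new multi-choice benchmark that inherits everything from `ImageMCQDataset`. The import path and the dict-style `DATASET_URL` / `DATASET_MD5` attributes (described in Section 3 below) are assumptions about the toolkit layout, and the URL / checksum values are placeholders.
```python
from vlmeval.dataset.image_mcq import ImageMCQDataset  # assumed import path


class MyMCQBenchmark(ImageMCQDataset):
    """Hypothetical multi-choice benchmark that reuses the default
    `build_prompt` / `evaluate` implementations of `ImageMCQDataset`."""

    TYPE = 'MCQ'  # placeholder dataset type
    # Placeholders: point these at your own TSV file and its MD5 checksum.
    DATASET_URL = {'MyMCQBenchmark': 'https://example.com/MyMCQBenchmark.tsv'}
    DATASET_MD5 = {'MyMCQBenchmark': '<md5-of-the-tsv>'}
```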
We now briefly describe the typical steps to implement a new benchmark in VLMEvalKit:
### 1. Prepare your benchmark tsv file
Currently, we organize each benchmark as a single TSV file. During inference, the data file will be automatically downloaded from the defined `DATASET_URL` link to `$LMUData` (the default path is `$HOME/LMUData` if not set explicitly). You can upload the prepared TSV file to a downloadable address (e.g., Hugging Face) or send it to us at <opencompass@pjlab.org.cn>, and we will assist in uploading the dataset to the server. You can also customize the `LMUData` path via the environment variable `LMUData=/path/to/your/data`.
The contents of the TSV file consist of:
| Dataset Name \ Fields | index | image | image_path | question | hint | multi-choice<br>options | answer | category | l2-category | split |
| --------------------------------------- | ----- | ----- | ---------- | -------- | ---- | ----------------------- | ------ | -------- | ----------- | ----- |
| MMBench_DEV_[CN/EN] | ✅ | ✅ | | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| MMBench_TEST_[CN/EN] | ✅ | ✅ | | ✅ | ✅ | ✅ | | ✅ | ✅ | ✅ |
| CCBench | ✅ | ✅ | | ✅ | | ✅ | ✅ | ✅ | | |
| SEEDBench_IMG | ✅ | ✅ | | ✅ | | ✅ | ✅ | ✅ | | |
| MME | ✅ | ✅ | | ✅ | | | ✅ | ✅ | | |
| CORE_MM | ✅ | ✅ | ✅ | ✅ | | | | ✅ | | |
| MMVet | ✅ | ✅ | | ✅ | | | ✅ | ✅ | | |
| MMMU_DEV_VAL | ✅ | ✅ | ✅ | ✅ | | ✅ | ✅ | ✅ | ✅ | ✅ |
| COCO_VAL | ✅ | ✅ | | | | | ✅ | | | |
| OCRVQA_[TEST/TESTCORE] | ✅ | ✅ | | ✅ | | | ✅ | | | |
| TextVQA_VAL | ✅ | ✅ | | ✅ | | | ✅ | | | |
| VCR_[EN/ZH]\_[EASY/HARD]\_[ALL/500/100] | ✅ | ✅ | | ✅ | | | ✅ | | | |
| MMMB_[en/cn/pt/ar/tr/ru] | ✅ | ✅ | | ✅ | ✅ | ✅ | ✅ | ✅ | |✅ |
| MMBench_dev_[en/cn/pt/ar/tr/ru] | ✅ | ✅ | | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |✅ |
<div align="center"><b>Table 1. TSV fields of supported datasets.</b></div>
**Intro to mandatory fields in the `TSV` file:**
- **index:** An integer, unique for each line in the `tsv`
- **image:** The base64 encoding of the image. You can use the APIs implemented in `vlmeval/smp/vlm.py` for encoding and decoding (see the sketch after this list):
  - Encoding: `encode_image_to_base64` (for PIL Image) / `encode_image_file_to_base64` (for image file path)
  - Decoding: `decode_base64_to_image` (for PIL Image) / `decode_base64_to_image_file` (for image file path)
- **question**: The question corresponding to the image, a string
- **answer**: The answer to the question, a string. The `test` split does not need this field
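As an illustration, the sketch below assembles a tiny multi-choice TSV with pandas, using the `encode_image_file_to_base64` helper mentioned above (its import path is an assumption, and the per-option columns `A`–`D` are one possible layout of the multi-choice options field):
```python
import pandas as pd

from vlmeval.smp.vlm import encode_image_file_to_base64  # assumed import path

records = [
    dict(
        index=0,
        image=encode_image_file_to_base64('images/sample_0.jpg'),  # base64 of the image
        question='What fruit is shown in the image?',
        A='an apple', B='a banana', C='a car', D='a dog',
        answer='A',
    ),
    dict(
        index=1,
        image=encode_image_file_to_base64('images/sample_1.jpg'),
        question='How many objects are visible?',
        A='one', B='two', C='three', D='four',
        answer='C',
    ),
]
# Benchmarks are stored as a single tab-separated file.
pd.DataFrame(records).to_csv('MyBenchmark.tsv', sep='\t', index=False)
```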
### 2. Customize your benchmark prompt
`ImageBaseDataset` defines the default prompt format. If you need to add prompts specific to the dataset or input data in the `Interleave` format to the model, you can implement this through the `build_prompt(line)` function. This function takes a line from a TSV file as input, containing fields such as index, image, question, etc. The function returns a dictionary list of multimodal messages `msg` in the format `[dict(type='image', value=IMAGE_PTH), dict(type='text', value=prompt)]`, including the image path and the text prompt to be input into VLMs. For interleave type inputs, you can directly place the dictionary of the image path at the image token position.
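A hedged sketch of such an override is given below: it prepends a dataset-specific instruction, optionally inserts the hint, and places the image dict exactly where the image should appear. The field names (`question`, `hint`) follow Table 1, while `self.data` and `self.dump_image(line)` are assumptions about helpers provided by the base class.
```python
import pandas as pd

from vlmeval.dataset.image_base import ImageBaseDataset


class MyInterleavedBenchmark(ImageBaseDataset):
    """Hypothetical dataset with a dataset-specific, interleaved prompt."""

    def build_prompt(self, line):
        if isinstance(line, int):
            line = self.data.iloc[line]      # assumption: the loaded TSV lives in self.data
        image_path = self.dump_image(line)   # assumption: dumps the base64 image to a local file
        msgs = [dict(type='text', value='Answer the question based on the picture below.')]
        if 'hint' in line and pd.notna(line['hint']):
            msgs.append(dict(type='text', value=f"Hint: {line['hint']}"))
        # Place the image dict exactly where the image token should appear.
        msgs.append(dict(type='image', value=image_path))
        msgs.append(dict(type='text', value=line['question']))
        return msgs
```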
### 3. Customize your benchmark metrics
To add evaluation for a new benchmark, you need to customize a class object to implement the dataset’s metrics calculation. Multimodal datasets inherit from the `ImageBaseDataset` object in `vlmeval/dataset/image_base.py`. The TYPE defines the type of dataset, `DATASET_URL` is the download address of the dataset, and `DATASET_MD5` is the MD5 checksum for consistency checking of the dataset file.
In this class, **you need to implement** the `evaluate(eval_file, **judge_kwargs)` class function to calculate metrics and output results for the custom dataset. The function input `eval_file` is the path to the model prediction results file `{model_name}_{dataset}.xlsx`. This file can be read as a pandas.DataFrame using the `load(eval_file)` method, containing fields such as index, question, answer, category, prediction, etc. The judge_kwargs will pass a dictionary related to evaluation, such as the name of the `judge model`, the number of API request threads, etc. **The return value** of the function is the calculated accuracy and other metrics, formatted as a dictionary composed of lists, organized into a pandas.DataFrame.
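For a simple exact-match benchmark, such an `evaluate` implementation might look like the hedged sketch below; `pandas.read_excel` stands in for the toolkit's `load(eval_file)` helper, and the per-category breakdown is illustrative.
```python
import pandas as pd

from vlmeval.dataset.image_base import ImageBaseDataset


class MyBenchmark(ImageBaseDataset):
    """Hypothetical dataset: exact-match accuracy, overall and per category."""

    def evaluate(self, eval_file, **judge_kwargs):
        # eval_file: path to {model_name}_{dataset}.xlsx produced by the inference stage
        data = pd.read_excel(eval_file)
        hit = (data['prediction'].astype(str).str.strip()
               == data['answer'].astype(str).str.strip())
        result = {'split': ['Overall'], 'accuracy': [100 * hit.mean()]}
        if 'category' in data.columns:
            for cat, sub in data.groupby('category'):
                result['split'].append(str(cat))
                result['accuracy'].append(100 * hit[sub.index].mean())
        return pd.DataFrame(result)
```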
## Implement a new model
Example PR: **Support LLaVA-Next-Interleave** ([#294](https://github.com/open-compass/VLMEvalKit/pull/294))
**1. Support `generate_inner` API (mandatory).**
All existing models are implemented in `vlmeval/vlm`. For a minimal model, your model class **must implement the method** `generate_inner(msgs, dataset=None)`. In this function, you feed a multi-modal message to your VLM and return the VLM prediction (which is a string). The optional argument `dataset` can be used as the flag for the model to switch among various inference strategies.
The multi-modal message `msgs` is a list of dictionaries; each dictionary has two keys: type and value:
- `type`: We currently support two types, choices are ["image", "text"].
- `value`: When type=='text' , the value is the text message (a single string); when type=='image', the value can be the local path of an image file, or the image URL.
Currently, a multi-modal message may contain arbitrarily interleaved images and texts. If your model does not support that, a common practice is to take the first image and the concatenated text messages as the input. You can set `INTERLEAVE = False` in your model class and use `self.message_to_promptimg(message, dataset=dataset)` to build your prompt and obtain the first image's path.
Here are some examples of multi-modal messages:
```python
IMAGE_PTH = 'assets/apple.jpg'
IMAGE_URL = 'https://raw.githubusercontent.com/open-compass/VLMEvalKit/main/assets/apple.jpg'
msg1 = [
dict(type='image', value=IMAGE_PTH),
dict(type='text', value='What is in this image?')
]
msg2 = [
dict(type='image', value=IMAGE_URL),
dict(type='image', value=IMAGE_URL),
dict(type='text', value='How many apples are there in these images?')
]
response = model.generate(msg1)
```
For convenience's sake, we also support taking a list of strings as input. In that case, we check whether each string is an image path or an image URL and automatically convert it to the `list[dict]` format:
```python
IMAGE_PTH = 'assets/apple.jpg'
IMAGE_URL = 'https://raw.githubusercontent.com/open-compass/VLMEvalKit/main/assets/apple.jpg'
msg1 = [IMAGE_PTH, 'What is in this image?']
msg2 = [IMAGE_URL, IMAGE_URL, 'How many apples are there in these images?']
response = model.generate(msg1)
```
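Putting this together, here is a hedged sketch of a minimal model wrapper. The `BaseModel` parent class, the `INTERLEAVE` flag, and `message_to_promptimg` are the pieces mentioned above (the import path is an assumption), while the constructor and the stub response merely mark where your own VLM loading and forward pass would go.
```python
from vlmeval.vlm.base import BaseModel  # assumed import path for the model base class


class MyVLM(BaseModel):
    """Hypothetical wrapper showing the required generate_inner API."""

    INTERLEAVE = False  # this model consumes only one image plus text

    def __init__(self, model_path='my-org/my-vlm', **kwargs):
        # Placeholder: load your tokenizer / model here (e.g., via transformers).
        self.model_path = model_path

    def generate_inner(self, message, dataset=None):
        # Since INTERLEAVE is False, collapse the message into one prompt + the first image.
        prompt, image_path = self.message_to_promptimg(message, dataset=dataset)
        # Placeholder for the real forward pass; `dataset` can be used to switch
        # among different prompting / decoding strategies per benchmark.
        return f'[stub answer for {image_path}] {prompt[:40]}...'
```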
**2. Support custom prompt building (optional).**
Besides, your model can support **custom prompt building** by implementing two optional methods: `use_custom_prompt(dataset)` and `build_prompt(line, dataset=None)`.
Both functions take the dataset name as the input:
- `use_custom_prompt(dataset)` returns a boolean flag, indicating whether the model should use the custom prompt building strategy.
- If `use_custom_prompt(dataset)` returns True, `build_prompt(line, dataset)` should return a custom-built multi-modal message for the corresponding `dataset`, given `line`, which is a dictionary that includes the necessary information of a data sample. If `use_custom_prompt(dataset)` returns False, the default prompt building strategy will be used (a sketch of both methods follows this list).
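A hedged sketch of the two optional methods for a model that wants its own multi-choice prompt; the dataset-name rule, the per-option columns (`A`, `B`, ...), and the `self.dump_image(line, dataset)` helper are illustrative assumptions.
```python
import string

import pandas as pd

from vlmeval.vlm.base import BaseModel  # assumed import path


class MyVLM(BaseModel):
    def use_custom_prompt(self, dataset):
        # Illustrative rule: only customize prompts for MMBench-style multi-choice sets.
        return dataset is not None and dataset.startswith('MMBench')

    def build_prompt(self, line, dataset=None):
        assert self.use_custom_prompt(dataset)
        # `line` is a dict-like record of one sample; option columns A, B, C, ... are assumed.
        options = {c: line[c] for c in string.ascii_uppercase
                   if c in line and not pd.isna(line[c])}
        prompt = line['question'] + '\n' + '\n'.join(f'{k}. {v}' for k, v in options.items())
        prompt += "\nAnswer with the option's letter from the given choices directly."
        image_path = self.dump_image(line, dataset)  # assumed helper on the base class
        return [dict(type='image', value=image_path), dict(type='text', value=prompt)]
```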
**3. Support multi-turn chatting (optional).**
You can also support multi-turn chatting and evaluation with your VLM by implementing the `chat_inner(message, dataset)` function. The function outputs a single string response, and `message` is a list of the chat history, following the format below.
```python
# Assume msg1, msg2, msg3, ... are multi-modal messages following the previously described format
# `chat_inner` takes the following chat history list as input:
message = [
dict(role='user', content=msg1),
dict(role='assistant', content=msg2),
dict(role='user', content=msg3),
dict(role='assistant', content=msg4),
......
dict(role='user', content=msgn),
]
# `message` should contain an odd number of chat utterances; the roles should alternate between "user" and "assistant", with the last utterance coming from "user".
# The chat function will call `chat_inner`
response = model.chat(message)
```
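For models without a native multi-turn template, a hedged fallback for `chat_inner` is to flatten the history into a single prompt plus a list of images, as sketched below; the `BaseModel` import path, the `<image>` placeholder token, and the stub response are assumptions.
```python
from vlmeval.vlm.base import BaseModel  # assumed import path


class MyChatVLM(BaseModel):
    def chat_inner(self, message, dataset=None):
        # Flatten the chat history into one text prompt and a list of image paths/URLs.
        prompt, images = '', []
        for turn in message:
            pieces = []
            for item in turn['content']:
                if item['type'] == 'text':
                    pieces.append(item['value'])
                else:
                    images.append(item['value'])
                    pieces.append('<image>')
            prompt += f"{turn['role'].upper()}: {' '.join(pieces)}\n"
        prompt += 'ASSISTANT: '
        # Placeholder for the real model call on (prompt, images).
        return f'[stub reply to {len(message)} turn(s) with {len(images)} image(s)]'
```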
### Example PRs:
- VLM that doesn't support interleaved images and texts, and does not use custom prompts: [[Model] Support glm-4v-9b](https://github.com/open-compass/VLMEvalKit/pull/221)
- VLM that supports interleaved images and texts and custom prompts: [Add MiniCPM-Llama3-V-2.5](https://github.com/open-compass/VLMEvalKit/pull/205)
- VLM API: [Feature add glmv](https://github.com/open-compass/VLMEvalKit/pull/201)
## Contribute to VLMEvalKit
If you want to contribute code to **VLMEvalKit**, please run the pre-commit check before you submit a PR. That helps to keep the code tidy.
```bash
# Under the directory of VLMEvalKit, install the pre-commit hook:
pip install pre-commit
pre-commit install
pre-commit run --all-files
# Then you can commit your code.
```
# flake8: noqa
# Configuration file for the Sphinx documentation builder.
#
# This file only contains a selection of the most common options. For a full
# list see the documentation:
# https://www.sphinx-doc.org/en/master/usage/configuration.html
# -- Path setup --------------------------------------------------------------
# If extensions (or modules to document with autodoc) are in another directory,
# add these directories to sys.path here. If the directory is relative to the
# documentation root, use os.path.abspath to make it absolute, like shown here.
#
import os
import ast
import subprocess
import sys
import pytorch_sphinx_theme
from sphinx.builders.html import StandaloneHTMLBuilder
sys.path.insert(0, os.path.abspath('../../'))
# -- Project information -----------------------------------------------------
project = 'VLMEvalKit'
copyright = '2023, VLMEvalKit'
author = 'VLMEvalKit Authors'
# The full version, including alpha/beta/rc tags
version_file = '../../vlmeval/__init__.py'
def get_version():
with open(version_file, 'r') as f:
file_content = f.read()
# Parse the file content into an abstract syntax tree (AST)
tree = ast.parse(file_content, filename=version_file)
# Iterate through the body of the AST, looking for an assignment to __version__
for node in tree.body:
if isinstance(node, ast.Assign):
for target in node.targets:
if isinstance(target, ast.Name) and target.id == '__version__':
return node.value.s
raise ValueError('__version__ not found')
release = get_version()
# -- General configuration ---------------------------------------------------
# Add any Sphinx extension module names here, as strings. They can be
# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
# ones.
extensions = [
'sphinx.ext.autodoc',
'sphinx.ext.autosummary',
'sphinx.ext.intersphinx',
'sphinx.ext.napoleon',
'sphinx.ext.viewcode',
'myst_parser',
'sphinx_copybutton',
'sphinx_tabs.tabs',
'notfound.extension',
'sphinxcontrib.jquery',
'sphinx_design',
]
# Add any paths that contain templates here, relative to this directory.
templates_path = ['_templates']
# The suffix(es) of source filenames.
# You can specify multiple suffix as a list of string:
#
source_suffix = {
'.rst': 'restructuredtext',
'.md': 'markdown',
}
language = 'en'
# The master toctree document.
root_doc = 'index'
html_context = {
'github_version': 'latest',
}
# List of patterns, relative to source directory, that match files and
# directories to ignore when looking for source files.
# This pattern also affects html_static_path and html_extra_path.
exclude_patterns = ['_build', 'Thumbs.db', '.DS_Store']
# -- Options for HTML output -------------------------------------------------
# The theme to use for HTML and HTML Help pages. See the documentation for
# a list of builtin themes.
#
html_theme = 'pytorch_sphinx_theme'
html_theme_path = [pytorch_sphinx_theme.get_html_theme_path()]
# Theme options are theme-specific and customize the look and feel of a theme
# further. For a list of options available for each theme, see the
# documentation.
# yapf: disable
html_theme_options = {
'menu': [
{
'name': 'GitHub',
'url': 'https://github.com/open-compass/VLMEvalKit'
},
],
# Specify the language of shared menu
'menu_lang': 'en',
# Disable the default edit on GitHub
'default_edit_on_github': False,
}
# yapf: enable
# Add any paths that contain custom static files (such as style sheets) here,
# relative to this directory. They are copied after the builtin static files,
# so a file named "default.css" will overwrite the builtin "default.css".
html_static_path = ['_static']
html_css_files = [
'https://cdn.datatables.net/v/bs4/dt-1.12.1/datatables.min.css',
'css/readthedocs.css'
]
html_js_files = [
'https://cdn.datatables.net/v/bs4/dt-1.12.1/datatables.min.js',
'js/custom.js'
]
# -- Options for HTMLHelp output ---------------------------------------------
# Output file base name for HTML help builder.
htmlhelp_basename = 'vlmevalkitdoc'
# -- Options for LaTeX output ------------------------------------------------
latex_elements = {
# The paper size ('letterpaper' or 'a4paper').
#
# 'papersize': 'letterpaper',
# The font size ('10pt', '11pt' or '12pt').
#
# 'pointsize': '10pt',
# Additional stuff for the LaTeX preamble.
#
# 'preamble': '',
}
# Grouping the document tree into LaTeX files. List of tuples
# (source start file, target name, title,
# author, documentclass [howto, manual, or own class]).
latex_documents = [
(root_doc, 'vlmevalkit.tex', 'VLMEvalKit Documentation', author,
'manual'),
]
# -- Options for manual page output ------------------------------------------
# One entry per manual page. List of tuples
# (source start file, name, description, authors, manual section).
man_pages = [(root_doc, 'vlmevalkit', 'VLMEvalKit Documentation', [author],
1)]
# -- Options for Texinfo output ----------------------------------------------
# Grouping the document tree into Texinfo files. List of tuples
# (source start file, target name, title, author,
# dir menu entry, description, category)
texinfo_documents = [
(root_doc, 'vlmevalkit', 'VLMEvalKit Documentation', author,
'VLMEvalKit Authors', 'AGI evaluation toolbox and benchmark.',
'Miscellaneous'),
]
# -- Options for Epub output -------------------------------------------------
# Bibliographic Dublin Core info.
epub_title = project
# The unique identifier of the text. This can be a ISBN number
# or the project homepage.
#
# epub_identifier = ''
# A unique identification for the text.
#
# epub_uid = ''
# A list of files that should not be packed into the epub file.
epub_exclude_files = ['search.html']
# set priority when building html
StandaloneHTMLBuilder.supported_image_types = [
'image/svg+xml', 'image/gif', 'image/png', 'image/jpeg'
]
# -- Extension configuration -------------------------------------------------
# Ignore >>> when copying code
copybutton_prompt_text = r'>>> |\.\.\. '
copybutton_prompt_is_regexp = True
# Auto-generated header anchors
myst_heading_anchors = 3
# Enable "colon_fence" extension of myst.
myst_enable_extensions = ['colon_fence', 'dollarmath']
# Configuration for intersphinx
intersphinx_mapping = {
'python': ('https://docs.python.org/3', None),
'numpy': ('https://numpy.org/doc/stable', None),
'torch': ('https://pytorch.org/docs/stable/', None),
'mmengine': ('https://mmengine.readthedocs.io/en/latest/', None),
'transformers':
('https://huggingface.co/docs/transformers/main/en/', None),
}
napoleon_custom_sections = [
# Custom sections for data elements.
('Meta fields', 'params_style'),
('Data fields', 'params_style'),
]
# Disable docstring inheritance
autodoc_inherit_docstrings = False
# Mock some imports during generate API docs.
autodoc_mock_imports = ['rich', 'attr', 'einops']
# Disable displaying type annotations, these can be very verbose
autodoc_typehints = 'none'
# The not found page
notfound_template = '404.html'
[html writers]
table_style: colwidths-auto
# Quickstart
Before running the evaluation script, you need to **configure** the VLMs and set the model_paths properly.
After that, you can use a single script, `run.py`, to run inference and evaluation for multiple VLMs and benchmarks at the same time.
## Step 0. Installation & Setup essential keys
**Installation.**
```bash
git clone https://github.com/open-compass/VLMEvalKit.git
cd VLMEvalKit
pip install -e .
```
**Setup Keys.**
To infer with API models (GPT-4v, Gemini-Pro-V, etc.) or use LLM APIs as the **judge or choice extractor**, you need to first set up API keys. VLMEvalKit will use a judge **LLM** to extract answers from the output if you set the key; otherwise, it uses the **exact matching** mode (finding "Yes", "No", "A", "B", "C", ... in the output strings). **The exact matching mode can only be applied to the Yes-or-No tasks and the multi-choice tasks.**
- You can place the required keys in `$VLMEvalKit/.env` or directly set them as the environment variable. If you choose to create a `.env` file, its content will look like:
```bash
# The .env file, place it under $VLMEvalKit
# API Keys of Proprietary VLMs
# QwenVL APIs
DASHSCOPE_API_KEY=
# Gemini w. Google Cloud Backends
GOOGLE_API_KEY=
# OpenAI API
OPENAI_API_KEY=
OPENAI_API_BASE=
# StepAI API
STEPAI_API_KEY=
# REKA API
REKA_API_KEY=
# GLMV API
GLMV_API_KEY=
# CongRong API
CW_API_BASE=
CW_API_KEY=
# SenseChat-V API
SENSECHAT_AK=
SENSECHAT_SK=
# Hunyuan-Vision API
HUNYUAN_SECRET_KEY=
HUNYUAN_SECRET_ID=
# You can also set a proxy for calling api models during the evaluation stage
EVAL_PROXY=
```
- Fill in the blanks with your API keys (if necessary). The API keys will be automatically loaded during inference and evaluation.
## Step 1. Configuration
**VLM Configuration**: All VLMs are configured in `vlmeval/config.py`. For some VLMs, you need to configure the code root (MiniGPT-4, PandaGPT, etc.) or the model_weight root (LLaVA-v1-7B, etc.) before conducting the evaluation. During evaluation, you should use the model name specified in `supported_VLM` in `vlmeval/config.py` to select the VLM. For MiniGPT-4 and InstructBLIP, you also need to modify the config files in `vlmeval/vlm/misc` to configure the LLM path and ckpt path.
The following VLMs require the configuration step:
**Code Preparation & Installation**: InstructBLIP ([LAVIS](https://github.com/salesforce/LAVIS)), LLaVA ([LLaVA](https://github.com/haotian-liu/LLaVA)), MiniGPT-4 ([MiniGPT-4](https://github.com/Vision-CAIR/MiniGPT-4)), mPLUG-Owl2 ([mPLUG-Owl2](https://github.com/X-PLUG/mPLUG-Owl/tree/main/mPLUG-Owl2)), OpenFlamingo-v2 ([OpenFlamingo](https://github.com/mlfoundations/open_flamingo)), PandaGPT-13B ([PandaGPT](https://github.com/yxuansu/PandaGPT)), TransCore-M ([TransCore-M](https://github.com/PCIResearch/TransCore-M)).
**Manual Weight Preparation & Configuration**: InstructBLIP, LLaVA-v1-7B, MiniGPT-4, PandaGPT-13B
## Step 2. Evaluation
We use `run.py` for evaluation. You can run it as `$VLMEvalKit/run.py` or create a soft link to the script (to use it anywhere):
**Arguments**
- `--data (list[str])`: Set the dataset names that are supported in VLMEvalKit (defined in `vlmeval/utils/dataset_config.py`).
- `--model (list[str])`: Set the VLM names that are supported in VLMEvalKit (defined in `supported_VLM` in `vlmeval/config.py`).
- `--mode (str, default to 'all', choices are ['all', 'infer'])`: When `mode` is set to "all", both inference and evaluation are performed; when set to "infer", only inference is performed.
- `--nproc (int, default to 4)`: The number of threads for OpenAI API calling.
- `--work-dir (str, default to '.')`: The directory to save evaluation results.
- `--nframe (int, default to 8)`: The number of frames to sample from a video, only applicable to the evaluation of video benchmarks.
- `--pack (bool, store_true)`: A video may be associated with multiple questions; if `pack==True`, all questions for a video are asked in a single query.
**Command for Evaluating Image Benchmarks**
You can run the script with `python` or `torchrun`:
```bash
# When running with `python`, only one VLM instance is instantiated, and it might use multiple GPUs (depending on its default behavior).
# That is recommended for evaluating very large VLMs (like IDEFICS-80B-Instruct).
# IDEFICS-80B-Instruct on MMBench_DEV_EN, MME, and SEEDBench_IMG, Inference and Evaluation
python run.py --data MMBench_DEV_EN MME SEEDBench_IMG --model idefics_80b_instruct --verbose
# IDEFICS-80B-Instruct on MMBench_DEV_EN, MME, and SEEDBench_IMG, Inference only
python run.py --data MMBench_DEV_EN MME SEEDBench_IMG --model idefics_80b_instruct --verbose --mode infer
# When running with `torchrun`, one VLM instance is instantiated on each GPU. It can speed up the inference.
# However, that is only suitable for VLMs that consume small amounts of GPU memory.
# IDEFICS-9B-Instruct, Qwen-VL-Chat, mPLUG-Owl2 on MMBench_DEV_EN, MME, and SEEDBench_IMG. On a node with 8 GPUs. Inference and Evaluation.
torchrun --nproc-per-node=8 run.py --data MMBench_DEV_EN MME SEEDBench_IMG --model idefics_9b_instruct qwen_chat mPLUG-Owl2 --verbose
# Qwen-VL-Chat on MME. On a node with 2 GPUs. Inference and Evaluation.
torchrun --nproc-per-node=2 run.py --data MME --model qwen_chat --verbose
```
**Command for Evaluating Video Benchmarks**
```bash
# When running with `python`, only one VLM instance is instantiated, and it might use multiple GPUs (depending on its default behavior).
# That is recommended for evaluating very large VLMs (like IDEFICS-80B-Instruct).
# IDEFICS2-8B on MMBench-Video, with 8 frames as inputs and vanilla evaluation. On a node with 8 GPUs.
torchrun --nproc-per-node=8 run.py --data MMBench-Video --model idefics2_8b --nframe 8
# GPT-4o (API model) on MMBench-Video, with 16 frames as inputs and pack evaluation (all questions of a video in a single query).
python run.py --data MMBench-Video --model GPT4o --nframe 16 --pack
```
The evaluation results will be printed as logs. Besides, **result files** will also be generated in the directory `$YOUR_WORKING_DIRECTORY/{model_name}`. Files ending with `.csv` contain the evaluated metrics.
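As a quick, hypothetical way to inspect those result files from Python (the directory layout follows the convention above, with the default work dir `.` and `qwen_chat` used only as an example model name):
```python
import glob

import pandas as pd

# Result files live under $YOUR_WORKING_DIRECTORY/{model_name}; '.csv' files hold the metrics.
for csv_path in sorted(glob.glob('qwen_chat/*.csv')):
    print(f'== {csv_path} ==')
    print(pd.read_csv(csv_path))
```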
## Deploy a local language model as the judge / choice extractor
The default setting mentioned above uses OpenAI's GPT as the judge LLM. However, you can also deploy a local judge LLM with [LMDeploy](https://github.com/InternLM/lmdeploy).
First, install LMDeploy and the OpenAI SDK:
```
pip install lmdeploy openai
```
Then deploy a local judge LLM with a single line of code. LMDeploy will automatically download the model from Hugging Face. Assume we use internlm2-chat-1_8b as the judge, port 23333, and the key sk-123456 (the key must start with "sk-" and can be followed by any numbers you like):
```
lmdeploy serve api_server internlm/internlm2-chat-1_8b --server-port 23333
```
You need to get the model name registered by LMDeploy with the following Python code:
```
from openai import OpenAI
client = OpenAI(
api_key='sk-123456',
base_url="http://0.0.0.0:23333/v1"
)
model_name = client.models.list().data[0].id
```
Now set some environment variables to tell VLMEvalKit how to use the local judge LLM. As mentioned above, you can also set them in the `$VLMEvalKit/.env` file:
```
OPENAI_API_KEY=sk-123456
OPENAI_API_BASE=http://0.0.0.0:23333/v1/chat/completions
LOCAL_LLM=<model_name you get>
```
Finally, you can run the commands in step 2 to evaluate your VLM with the local judge LLM.
Note that
- If you want to deploy the judge LLM on a single GPU and evaluate your VLM on other GPUs because of limited GPU memory, try `CUDA_VISIBLE_DEVICES=x`, like:
```
CUDA_VISIBLE_DEVICES=0 lmdeploy serve api_server internlm/internlm2-chat-1_8b --server-port 23333
CUDA_VISIBLE_DEVICES=1,2,3 torchrun --nproc-per-node=3 run.py --data HallusionBench --model qwen_chat --verbose
```
- If the local judge LLM is not good enough at following the instructions, the evaluation may fail. Please report such failures (e.g., by raising issues).
- It's possible to deploy the judge LLM in different ways, e.g., using a private LLM (not from Hugging Face) or a quantized LLM. Please refer to the [LMDeploy doc](https://lmdeploy.readthedocs.io/en/latest/serving/api_server.html). You can also use any other deployment framework as long as it supports the OpenAI API.
Welcome to the VLMEvalKit Tutorial!
==========================================
VLMEvalKit Getting Started Guide
----------------------------------------
To help users get started quickly, we recommend the following process:
- For users who want to use VLMEvalKit, we recommend reading the "Start Your First Step" section to set up the environment and start a mini-experiment to familiarize yourself with the process.
- If you want to customize more modules, such as adding datasets and models, we provide an "Advanced Tutorial."
We always welcome users' PRs (Pull Requests) and Issues to improve VLMEvalKit!
.. _Start Your First Step:
.. toctree::
:maxdepth: 1
:caption: Start Your First Step
get_started/Quickstart.md
.. .. _Tutorials:
.. .. toctree::
.. :maxdepth: 1
.. :caption: Tutorials
.. user_guides/framework_overview.md
.. _Advanced Tutorial:
.. toctree::
:maxdepth: 1
:caption: Advanced Tutorial
advanced_guides/Development.md
.. .. _Other Notes:
.. .. toctree::
.. :maxdepth: 1
.. :caption: Other Notes
.. notes/contribution_guide.md
Index and Tables
==================
* :ref:`genindex`
* :ref:`search`
version: 2
# Set the version of Python and other tools you might need
build:
os: ubuntu-22.04
tools:
python: "3.8"
formats:
- epub
sphinx:
configuration: docs/zh-CN/conf.py
python:
install:
- requirements: requirements/docs.txt
# Minimal makefile for Sphinx documentation
#
# You can set these variables from the command line, and also
# from the environment for the first two.
SPHINXOPTS ?=
SPHINXBUILD ?= sphinx-build
SOURCEDIR = .
BUILDDIR = _build
# Put it first so that "make" without argument is like "make help".
help:
@$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
.PHONY: help Makefile
# Catch-all target: route all unknown targets to Sphinx using the new
# "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS).
%: Makefile
@$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
.header-logo {
background-image: url("../image/logo.svg");
background-size: 275px 80px;
height: 80px;
width: 275px;
}
@media screen and (min-width: 1100px) {
.header-logo {
top: -25px;
}
}
pre {
white-space: pre;
}
@media screen and (min-width: 2000px) {
.pytorch-content-left {
width: 1200px;
margin-left: 30px;
}
article.pytorch-article {
max-width: 1200px;
}
.pytorch-breadcrumbs-wrapper {
width: 1200px;
}
.pytorch-right-menu.scrolling-fixed {
position: fixed;
top: 45px;
left: 1580px;
}
}
article.pytorch-article section code {
padding: .2em .4em;
background-color: #f3f4f7;
border-radius: 5px;
}
/* Disable the change in tables */
article.pytorch-article section table code {
padding: unset;
background-color: unset;
border-radius: unset;
}
table.autosummary td {
width: 50%
}
img.align-center {
display: block;
margin-left: auto;
margin-right: auto;
}
article.pytorch-article p.rubric {
font-weight: bold;
}
<?xml version="1.0" encoding="UTF-8"?>
<svg id="_图层_2" data-name="图层 2" xmlns="http://www.w3.org/2000/svg" viewBox="0 0 34.59 36">
<defs>
<style>
.cls-1 {
fill: #36569b;
}
.cls-2 {
fill: #1b3882;
}
.cls-3 {
fill: #5878b4;
}
</style>
</defs>
<g id="_图层_1-2" data-name="图层 1">
<g>
<g id="_3" data-name="3">
<path class="cls-3" d="m16.53,22.65l-6.37,3.07,5.27-.16,1.1-2.91Zm-4.19,10.95l1.12-2.91-5.27.17,4.15,2.74Zm9.3-.29l6.37-3.07-5.27.16-1.1,2.91Zm4.19-10.95l-1.12,2.91,5.27-.17-4.15-2.74Zm5.72,3.81l-7.08.23-1.73-1.14,1.5-3.95-2.06-1.36-3.16,1.53-1.48,3.89-2.67,1.29-7.14.23-3.16,1.53,2.07,1.36,7.13-.23h0s1.69,1.11,1.69,1.11l-1.51,3.98,2.06,1.36,3.16-1.53,1.5-3.95h0s2.56-1.24,2.56-1.24h0s7.23-.24,7.23-.24l3.16-1.53-2.06-1.36Zm-11.29,2.56c-.99.48-2.31.52-2.96.1-.65-.42-.37-1.15.62-1.63.99-.48,2.31-.52,2.96-.1.65.42.37,1.15-.62,1.63Z"/>
</g>
<g id="_2" data-name="2">
<path class="cls-1" d="m33.5,19.84l-1.26-6.51-1.46,1.88,2.72,4.63Zm-6.05-14.69l-4.16-2.74,2.71,4.64,1.45-1.89Zm-6.73.58l1.26,6.51,1.46-1.88-2.72-4.63Zm6.05,14.69l4.16,2.74-2.71-4.64-1.45,1.89Zm7.19,1.91l-3.63-6.2h0s-.53-2.74-.53-2.74l1.96-2.56-.63-3.23-2.07-1.36-1.96,2.56-1.69-1.11-3.71-6.33-2.07-1.36.63,3.23,3.68,6.28h0s.51,2.62.51,2.62h0s-1.99,2.6-1.99,2.6l.63,3.23,2.06,1.36,1.95-2.54,1.73,1.14,3.69,6.29,2.07,1.36-.63-3.23Zm-6.47-7.7c-.65-.42-1.33-1.59-1.52-2.6-.2-1.01.17-1.49.81-1.06.65.42,1.33,1.59,1.52,2.6.2,1.01-.17,1.49-.81,1.06Z"/>
</g>
<g id="_1" data-name="1">
<path class="cls-2" d="m11.96,2.82l-6.37,3.07,3.81,1.74,2.55-4.81ZM1.07,14.37l1.26,6.53,2.56-4.8-3.82-1.73Zm7.99,9.59l6.37-3.07-3.81-1.74-2.55,4.81Zm10.89-11.55l-1.26-6.53-2.56,4.8,3.82,1.73Zm.45,2.53l-5.13-2.32h0s-.53-2.71-.53-2.71l3.47-6.53-.63-3.24-3.16,1.53-3.42,6.43-2.67,1.29h0s-5.17-2.34-5.17-2.34l-3.16,1.53.63,3.24,5.17,2.33.51,2.65h0s-3.49,6.57-3.49,6.57l.63,3.24,3.16-1.53,3.46-6.52,2.56-1.24h0s5.24,2.37,5.24,2.37l3.16-1.53-.63-3.24Zm-9.52.24c-.99.48-1.95.04-2.14-.97-.2-1.01.44-2.22,1.43-2.69.99-.48,1.95-.04,2.14.97.2,1.01-.44,2.22-1.43,2.7Z"/>
</g>
</g>
</g>
</svg>
var collapsedSections = [];
$(document).ready(function () {
$('.model-summary').DataTable({
"stateSave": false,
"lengthChange": false,
"pageLength": 20,
"order": []
});
});
{% extends "layout.html" %}
{% block body %}
<h1>Page Not Found</h1>
<p>
The page you are looking for cannot be found.
</p>
<p>
If you just switched documentation versions, it is likely that the page you were on is moved. You can look for it in
the content table left, or go to <a href="{{ pathto(root_doc) }}">the homepage</a>.
</p>
<!-- <p>
If you cannot find documentation you want, please <a
href="">open an issue</a> to tell us!
</p> -->
{% endblock %}
.. role:: hidden
:class: hidden-section
.. currentmodule:: {{ module }}
{{ name | underline}}
.. autoclass:: {{ name }}
:members:
..
autogenerated from _templates/autosummary/class.rst
note it does not have :inherited-members:
.. role:: hidden
:class: hidden-section
.. currentmodule:: {{ module }}
{{ name | underline}}
.. autoclass:: {{ name }}
:members:
:special-members: __call__
..
autogenerated from _templates/callable.rst
note it does not have :inherited-members:
# 🛠️ How to implement a new Benchmark or multi-modal model (VLM) in VLMEvalKit
## Implement a new benchmark
Example PR: **Add the Math-Vision Benchmark** ([#292](https://github.com/open-compass/VLMEvalKit/pull/292/files))
Currently, benchmarks in VLMEvalKit take the form of dataset classes. When you add a new benchmark, you can either reuse an existing dataset class (e.g., a multi-choice benchmark can reuse `ImageMCQDataset`) or implement a new dataset class. Your dataset class must support the following two methods (reuse those of the parent class or implement your own):
- `build_prompt(self, line)`: The input `line` is an int (the data index) or a `pd.Series` (the raw record of the data). The method outputs a `multi-modal message` as the input of the multi-modal model; the `multi-modal message` is an interleaved list of images and texts, e.g., in the following format (one image and one text): `[dict(type='image', value=IMAGE_PTH), dict(type='text', value=prompt)]`
- `evaluate(self, eval_file, **judge_kwargs)`: The input `eval_file` is the prediction result of the multi-modal model (usually a `.xlsx` file). If the benchmark evaluation requires the assistance of a large language model (usually GPT), `judge_kwargs` passes the arguments for the LLM. The method outputs the evaluation results of the benchmark, in the form of a `dict` or `pd.DataFrame`.
Below, we briefly describe the typical steps for adding a new dataset:
### 1. Prepare the TSV data file (image-text benchmarks)
Currently, we set up each benchmark dataset as a single TSV file. During inference, the data file will be automatically downloaded from the `DATASET_URL` link defined for the dataset to `$LMUData` (the default path is `$HOME/LMUData` if not set explicitly). You can upload the prepared TSV file to a downloadable address (e.g., Hugging Face), or send it to us at <opencompass@pjlab.org.cn>, and we will help upload the dataset to the server. You can also customize the download path via the environment variable `LMUData=/path/to/your/data`.
The contents of the TSV file consist of:
| Dataset Name \ Fields | index | image | image_path | question | hint | multi-choice<br>options | answer | category | l2-category | split |
| ---------------------- | ----- | ----- | ---------- | -------- | ---- | ----------------------- | ------ | -------- | ----------- | ----- |
| MMBench_DEV_[CN/EN] | ✅ | ✅ | | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| MMBench_TEST_[CN/EN] | ✅ | ✅ | | ✅ | ✅ | ✅ | | ✅ | ✅ | ✅ |
| CCBench | ✅ | ✅ | | ✅ | | ✅ | ✅ | ✅ | | |
| SEEDBench_IMG | ✅ | ✅ | | ✅ | | ✅ | ✅ | ✅ | | |
| MME | ✅ | ✅ | | ✅ | | | ✅ | ✅ | | |
| CORE_MM | ✅ | ✅ | ✅ | ✅ | | | | ✅ | | |
| MMVet | ✅ | ✅ | | ✅ | | | ✅ | ✅ | | |
| MMMU_DEV_VAL | ✅ | ✅ | ✅ | ✅ | | ✅ | ✅ | ✅ | ✅ | ✅ |
| COCO_VAL | ✅ | ✅ | | | | | ✅ | | | |
| OCRVQA_[TEST/TESTCORE] | ✅ | ✅ | | ✅ | | | ✅ | | | |
| TextVQA_VAL | ✅ | ✅ | | ✅ | | | ✅ | | | |
| VCR_[EN/ZH]\_[EASY/HARD]_[ALL/500/100] | ✅ | ✅ | | ✅ | | | ✅ | | | |
<div align="center"><b>表 1. 支持的数据集的 TSV 字段。</b></div>
**TSV 中必须字段的介绍:**
- **index:** 一个整数,`tsv` 中每一行的唯一标识
- **image:** 图片的 base64 编码,你可以使用 `vlmeval/smp/vlm.py` 中实现的API进行编码和解码:
- 编码:`encode_image_to_base64`(对于PIL Image)/ `encode_image_file_to_base64`(对于图片文件路径)
- 解码:`decode_base64_to_image`(对于PIL Image)/ `decode_base64_to_image_file`(对于图片文件路径)
- **question:** 针对图像所提取出的问题,类型为字符串
- **answer:** 问题的答案,类型为字符串,Test 集可缺失这一字段
### 2. 自定义数据集的 prompt 构建
`ImageBaseDataset` 定义了默认的 prompt 格式。如果需要针对数据集添加 prompt,或给模型输入 `Interleave` 的数据格式,可以通过 `build_prompt(line)` 函数实现。该函数输入为,每次给定 TSV 文件中的一行,包含 index, image, question 等内容作为 line。该函数将返回一个多模态消息 `msg` 的字典列表 `[dict(type='image', value=IMAGE_PTH), dict(type='text', value=prompt)]`,包括图片路径和将被输入到 VLMs 的文本 prompt。对于 interleave 类型输入,可以直接将图片路径的字典放置到 image token 位置。
### 3. 自定义数据集的指标实现
增加对 benchmark 的评测需要自定义一个该数据集的 class 对象,从而实现数据集的指标计算。图文多模态数据集均继承自 `vlmeval/dataset/image_base.py` 中的 `ImageBaseDataset` 对象。其中 `TYPE` 定义了数据集的类型;`DATASET_URL` 为数据集的下载地址;`DATASET_MD5` 为数据集文件的 md5 一致性编码检查。
在 class 中**需要实现** `evaluate(eval_file, **judge_kwargs)` 类函数,对自定义的数据集结果进行指标计算和结果输出。函数输入 `eval_file` 为模型预测结果 `{model_name}_{dataset}.xlsx` 的路径。可以通过 `load(eval_file)` 文件将其读取为 panda.DataFrames 类型,其中包含 index, question, answer, category, prediction 等字段。`judge_kwargs` 参数将传递一个评测相关的字典,如:judge 模型的名称,api 请求线程数等。**函数的返回值**为评估完成的准确度等指标,其格式为由 list 组成的字典,并组织成 panda.DataFrames 类型。
## 实现一个新的模型
示例 PR: **支持 LLaVA-Next-Interleave** ([#294](https://github.com/open-compass/VLMEvalKit/pull/294))
**1. 支持 `generate_inner` API (必须)**
现有所有的模型都在 `vlmeval/vlm` 中实现。对于一个最基本的模型,你的模型类**应该实现方法** `generate_inner(msgs, dataset=None)`。这个函数将向 VLM 输入一个多模态数据,并返回 VLM 的预测(一个字符串)。可选参数 `dataset` 可以用作模型在不同推理策略之间切换的标志。
其中多模态消息 `msgs` 是一个字典列表,每个字典有两个键:类型和值:
- `type`:我们目前支持两种类型,选项是 ["image", "text"]。
- `value`:当类型为 `text` 时,值是文本消息(一个字符串);当类型为 `image` 时,值可以是图像文件的本地路径,或者是图像的URL。
> 目前,一个多模态消息可能包含任意交错的图像和文本。如果你的模型不支持这一点,我们推荐的做法是取第一张图像和连接的文本消息作为模型的输入。你可以在模型的 class 中设置 `INTERLEAVE = False` 并调用 `self.message_to_promptimg(message, dataset=dataset)` 函数来获取你的 prompt 和第一张图片的地址。
一些多模态消息的例子:
```python
IMAGE_PTH = 'assets/apple.jpg'
IMAGE_URL = 'https://raw.githubusercontent.com/open-compass/VLMEvalKit/main/assets/apple.jpg'
msg1 = [
dict(type='image', value=IMAGE_PTH),
dict(type='text', value='What is in this image?')
]
msg2 = [
dict(type='image', value=IMAGE_URL),
dict(type='image', value=IMAGE_URL),
dict(type='text', value='How many apples are there in these images?')
]
response = model.generate(msg1)
```
For convenience's sake, we also support taking a list of strings as input. In that case, we check whether each string is an image path or an image URL and automatically convert it to the `list[dict]` format:
```python
IMAGE_PTH = 'assets/apple.jpg'
IMAGE_URL = 'https://raw.githubusercontent.com/open-compass/VLMEvalKit/main/assets/apple.jpg'
msg1 = [IMAGE_PTH, 'What is in this image?']
msg2 = [IMAGE_URL, IMAGE_URL, 'How many apples are there in these images?']
response = model.generate(msg1)
```
**2. Support custom prompt building (optional).**
In addition, your model can support custom prompt building by implementing two optional methods: `use_custom_prompt(dataset)` and `build_prompt(line, dataset=None)`.
- `use_custom_prompt(dataset)` returns a boolean value indicating whether the model should use the custom prompt building strategy.
- If `use_custom_prompt(dataset)` returns True, `build_prompt(line, dataset)` should return a custom-built multi-modal message for the corresponding dataset, where `line` is a dictionary containing the necessary information of a data sample. If `use_custom_prompt(dataset)` returns False, the default prompt building strategy will be used.
**3. Support multi-turn chatting (optional).**
You can also add multi-turn chatting to your model and make it compatible with multi-turn evaluation by supporting the `chat_inner(message, dataset)` API. This API outputs a single string response, and `message` is a list of the chat history, in the following format:
```python
# Assume msg1, msg2, msg3, ... are multi-modal messages following the previously described format
# `chat_inner` takes the following chat history list as input:
message = [
dict(role='user', content=msg1),
dict(role='assistant', content=msg2),
dict(role='user', content=msg3),
dict(role='assistant', content=msg4),
......
dict(role='user', content=msgn),
]
# `message` should contain an odd number of chat utterances; the roles should alternate between "user" and "assistant", with the last utterance coming from "user".
# The chat function will call `chat_inner`
response = model.chat(message)
```
### Example PRs:
- VLM that does not support interleaved images and texts and does not use custom prompts: [[Model] Support glm-4v-9b](https://github.com/open-compass/VLMEvalKit/pull/221)
- VLM that supports interleaved images and texts and custom prompts: [Add MiniCPM-Llama3-V-2.5](https://github.com/open-compass/VLMEvalKit/pull/205)
- VLM API: [Feature add glmv](https://github.com/open-compass/VLMEvalKit/pull/201)
## Contribute to VLMEvalKit
If you want to contribute code to **VLMEvalKit**, please run the pre-commit checks before submitting a PR. This helps to keep the code tidy.
```bash
# Under the directory of VLMEvalKit, install the pre-commit hook:
pip install pre-commit
pre-commit install
pre-commit run --all-files
# Then you can commit your code.
```
# flake8: noqa
# Configuration file for the Sphinx documentation builder.
#
# This file only contains a selection of the most common options. For a full
# list see the documentation:
# https://www.sphinx-doc.org/en/master/usage/configuration.html
# -- Path setup --------------------------------------------------------------
# If extensions (or modules to document with autodoc) are in another directory,
# add these directories to sys.path here. If the directory is relative to the
# documentation root, use os.path.abspath to make it absolute, like shown here.
#
import os
import ast
import subprocess
import sys
import pytorch_sphinx_theme
from sphinx.builders.html import StandaloneHTMLBuilder
sys.path.insert(0, os.path.abspath('../../'))
# -- Project information -----------------------------------------------------
project = 'VLMEvalKit'
copyright = '2023, VLMEvalKit'
author = 'VLMEvalKit Authors'
# The full version, including alpha/beta/rc tags
version_file = '../../vlmeval/__init__.py'
def get_version():
with open(version_file, 'r') as f:
file_content = f.read()
# Parse the file content into an abstract syntax tree (AST)
tree = ast.parse(file_content, filename=version_file)
# Iterate through the body of the AST, looking for an assignment to __version__
for node in tree.body:
if isinstance(node, ast.Assign):
for target in node.targets:
if isinstance(target, ast.Name) and target.id == '__version__':
return node.value.s
raise ValueError('__version__ not found')
release = get_version()
# -- General configuration ---------------------------------------------------
# Add any Sphinx extension module names here, as strings. They can be
# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
# ones.
extensions = [
'sphinx.ext.autodoc',
'sphinx.ext.autosummary',
'sphinx.ext.intersphinx',
'sphinx.ext.napoleon',
'sphinx.ext.viewcode',
'myst_parser',
'sphinx_copybutton',
'sphinx_tabs.tabs',
'notfound.extension',
'sphinxcontrib.jquery',
'sphinx_design',
]
# Add any paths that contain templates here, relative to this directory.
templates_path = ['_templates']
# The suffix(es) of source filenames.
# You can specify multiple suffix as a list of string:
#
source_suffix = {
'.rst': 'restructuredtext',
'.md': 'markdown',
}
language = 'cn'
# The master toctree document.
root_doc = 'index'
html_context = {
'github_version': 'latest',
}
# List of patterns, relative to source directory, that match files and
# directories to ignore when looking for source files.
# This pattern also affects html_static_path and html_extra_path.
exclude_patterns = ['_build', 'Thumbs.db', '.DS_Store']
# -- Options for HTML output -------------------------------------------------
# The theme to use for HTML and HTML Help pages. See the documentation for
# a list of builtin themes.
#
html_theme = 'pytorch_sphinx_theme'
html_theme_path = [pytorch_sphinx_theme.get_html_theme_path()]
# Theme options are theme-specific and customize the look and feel of a theme
# further. For a list of options available for each theme, see the
# documentation.
# yapf: disable
html_theme_options = {
'menu': [
{
'name': 'GitHub',
'url': 'https://github.com/open-compass/VLMEvalKit'
},
],
# Specify the language of shared menu
'menu_lang': 'cn',
# Disable the default edit on GitHub
'default_edit_on_github': False,
}
# yapf: enable
# Add any paths that contain custom static files (such as style sheets) here,
# relative to this directory. They are copied after the builtin static files,
# so a file named "default.css" will overwrite the builtin "default.css".
html_static_path = ['_static']
html_css_files = [
'https://cdn.datatables.net/v/bs4/dt-1.12.1/datatables.min.css',
'css/readthedocs.css'
]
html_js_files = [
'https://cdn.datatables.net/v/bs4/dt-1.12.1/datatables.min.js',
'js/custom.js'
]
# -- Options for HTMLHelp output ---------------------------------------------
# Output file base name for HTML help builder.
htmlhelp_basename = 'vlmevalkitdoc'
# -- Options for LaTeX output ------------------------------------------------
latex_elements = {
# The paper size ('letterpaper' or 'a4paper').
#
# 'papersize': 'letterpaper',
# The font size ('10pt', '11pt' or '12pt').
#
# 'pointsize': '10pt',
# Additional stuff for the LaTeX preamble.
#
# 'preamble': '',
}
# Grouping the document tree into LaTeX files. List of tuples
# (source start file, target name, title,
# author, documentclass [howto, manual, or own class]).
latex_documents = [
(root_doc, 'vlmevalkit.tex', 'VLMEvalKit Documentation', author,
'manual'),
]
# -- Options for manual page output ------------------------------------------
# One entry per manual page. List of tuples
# (source start file, name, description, authors, manual section).
man_pages = [(root_doc, 'vlmevalkit', 'VLMEvalKit Documentation', [author],
1)]
# -- Options for Texinfo output ----------------------------------------------
# Grouping the document tree into Texinfo files. List of tuples
# (source start file, target name, title, author,
# dir menu entry, description, category)
texinfo_documents = [
(root_doc, 'vlmevalkit', 'VLMEvalKit Documentation', author,
'VLMEvalKit Authors', 'AGI evaluation toolbox and benchmark.',
'Miscellaneous'),
]
# -- Options for Epub output -------------------------------------------------
# Bibliographic Dublin Core info.
epub_title = project
# The unique identifier of the text. This can be a ISBN number
# or the project homepage.
#
# epub_identifier = ''
# A unique identification for the text.
#
# epub_uid = ''
# A list of files that should not be packed into the epub file.
epub_exclude_files = ['search.html']
# set priority when building html
StandaloneHTMLBuilder.supported_image_types = [
'image/svg+xml', 'image/gif', 'image/png', 'image/jpeg'
]
# -- Extension configuration -------------------------------------------------
# Ignore >>> when copying code
copybutton_prompt_text = r'>>> |\.\.\. '
copybutton_prompt_is_regexp = True
# Auto-generated header anchors
myst_heading_anchors = 3
# Enable "colon_fence" extension of myst.
myst_enable_extensions = ['colon_fence', 'dollarmath']
# Configuration for intersphinx
intersphinx_mapping = {
'python': ('https://docs.python.org/3', None),
'numpy': ('https://numpy.org/doc/stable', None),
'torch': ('https://pytorch.org/docs/stable/', None),
'mmengine': ('https://mmengine.readthedocs.io/en/latest/', None),
'transformers':
('https://huggingface.co/docs/transformers/main/en/', None),
}
napoleon_custom_sections = [
# Custom sections for data elements.
('Meta fields', 'params_style'),
('Data fields', 'params_style'),
]
# Disable docstring inheritance
autodoc_inherit_docstrings = False
# Mock some imports during generate API docs.
autodoc_mock_imports = ['rich', 'attr', 'einops']
# Disable displaying type annotations, these can be very verbose
autodoc_typehints = 'none'
# The not found page
notfound_template = '404.html'
def builder_inited_handler(app):
subprocess.run(['./cp_origin_docs.sh'])
def setup(app):
app.connect('builder-inited', builder_inited_handler)