[flake8]
ignore = E501, F403, C901, W504, W605, E251, E122, E126, E127, E722, W503, E128, E741
select = E1, E3, E502, E7, E9, W1, W5, W6
max-line-length = 180
exclude=*.egg/*,build,dist,detection/configs/*
## Contributing to InternLM
Welcome to the InternLM community! All kinds of contributions are welcome, including but not limited to the following.
**Fix bug**
You can directly post a Pull Request to fix typos in code or documents.
The steps to fix a bug in the code implementation are as follows.
1. If the modification involves significant changes, you should create an issue first and describe the error information and how to trigger the bug. Other developers will discuss it with you and propose a proper solution.
2. Post a pull request after fixing the bug and adding the corresponding unit test.
**New Feature or Enhancement**
1. If the modification involves significant changes, you should create an issue first and discuss the design with our developers to agree on a proper approach.
2. Post a Pull Request after implementing the new feature or enhancement and add the corresponding unit test.
**Document**
You can directly post a pull request to fix the documentation. If you want to add a document, you should first create an issue to check whether it is reasonable.
### Pull Request Workflow
If you're not familiar with Pull Requests, don't worry! The following guidance will tell you how to create a Pull Request step by step. If you want to dive into the development mode of Pull Requests, you can refer to the [official documents](https://docs.github.com/en/github/collaborating-with-issues-and-pull-requests/about-pull-requests).
#### 1. Fork and clone
If you are posting a pull request for the first time, you should fork the lmdeploy repository by clicking the **Fork** button in the top right corner of the GitHub page, and the forked repository will appear under your GitHub profile.
<img src="https://user-images.githubusercontent.com/57566630/167305749-43c7f4e9-449b-4e98-ade5-0c9276d5c9ce.png" width="1200">
Then, you can clone the repository to your local machine:
```shell
git clone git@github.com:{username}/lmdeploy.git
```
After that, you should add the official repository as the upstream repository:
```bash
git remote add upstream git@github.com:InternLM/lmdeploy.git
```
Check whether the remote repository has been added successfully with `git remote -v`:
```bash
origin git@github.com:{username}/lmdeploy.git (fetch)
origin git@github.com:{username}/lmdeploy.git (push)
upstream git@github.com:InternLM/lmdeploy.git (fetch)
upstream git@github.com:InternLM/lmdeploy.git (push)
```
> Here's a brief introduction to origin and upstream. When we use `git clone`, we create an "origin" remote by default, which points to the repository we cloned from. As for "upstream", we add it ourselves to point to the target repository. Of course, if you don't like the name "upstream", you can name it as you wish. Usually, we push code to "origin". If the pushed code conflicts with the latest code in the official repository ("upstream"), we should pull the latest code from upstream to resolve the conflicts and then push to "origin" again. The posted Pull Request will be updated automatically.
#### 2. Configure pre-commit
You should configure [pre-commit](https://pre-commit.com/#intro) in the local development environment to make sure the code style matches that of InternLM. **Note**: The following code should be executed under the lmdeploy directory.
```shell
pip install -U pre-commit
pre-commit install
```
Check that pre-commit is configured successfully, and install the hooks defined in `.pre-commit-config.yaml`.
```shell
pre-commit run --all-files
```
<img src="https://user-images.githubusercontent.com/57566630/173660750-3df20a63-cb66-4d33-a986-1f643f1d8aaf.png" width="1200">
<img src="https://user-images.githubusercontent.com/57566630/202368856-0465a90d-8fce-4345-918e-67b8b9c82614.png" width="1200">
If the installation process is interrupted, you can repeatedly run `pre-commit run ... ` to continue the installation.
If the code does not conform to the code style specification, pre-commit will raise a warning and fix some of the errors automatically.
<img src="https://user-images.githubusercontent.com/57566630/202369176-67642454-0025-4023-a095-263529107aa3.png" width="1200">
If we want to commit our code bypassing the pre-commit hook, we can use the `--no-verify` option (**only for temporary commits**).
```shell
git commit -m "xxx" --no-verify
```
#### 3. Create a development branch
After configuring pre-commit, we should create a branch based on the master branch to develop the new feature or fix the bug. The proposed branch name is `username/pr_name`:
```shell
git checkout -b yhc/refactor_contributing_doc
```
In subsequent development, if the master branch of the local repository falls behind the master branch of "upstream", pull from upstream to synchronize before creating the branch:
```shell
git pull upstream master
```
#### 4. Commit the code and pass the unit test
- lmdeploy introduces mypy to do static type checking to increase the robustness of the code. Therefore, we need to add type hints to our code and pass the mypy check. If you are not familiar with type hints, you can refer to [this tutorial](https://docs.python.org/3/library/typing.html). A minimal example is shown after this list.
- The committed code should pass the unit tests:
```shell
# Pass all unit tests
pytest tests
# Pass the unit test of runner
pytest tests/test_runner/test_runner.py
```
If the unit tests fail due to missing dependencies, you can install them by referring to the [guidance](#unit-test).
- If documents are modified/added, we should check the rendering result by referring to the [guidance](#document-rendering).
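For example, a function annotated in the style mypy expects might look like this (a generic illustration, not code taken from lmdeploy):
```python
from typing import Optional


def truncate(text: str, max_len: int, suffix: Optional[str] = None) -> str:
    """Clip `text` to at most `max_len` characters, optionally appending `suffix`."""
    clipped = text[:max_len]
    return clipped + suffix if suffix is not None else clipped
```
Running `mypy` on a file with such annotations will flag calls that pass the wrong types.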
#### 5. Push the code to remote
We can push the local commits to the remote branch after passing the unit tests and pre-commit checks. You can associate the local branch with the remote branch by adding the `-u` option.
```shell
git push -u origin {branch_name}
```
This will allow you to use the `git push` command to push code directly next time, without having to specify a branch or the remote repository.
#### 6. Create a Pull Request
(1) Create a pull request in GitHub's Pull request interface
<img src="https://user-images.githubusercontent.com/57566630/201533288-516f7ac4-0b14-4dc8-afbd-912475c368b5.png" width="1200">
(2) Modify the PR description according to the guidelines so that other developers can better understand your changes
<img src="https://user-images.githubusercontent.com/57566630/202242953-c91a18ff-e388-4ff9-8591-5fae0ead6c1e.png" width="1200">
Find more details about Pull Request description in [pull request guidelines](#pr-specs).
**Note**
(a) The Pull Request description should contain the reason for the change, the content of the change, and the impact of the change, and be associated with the relevant Issue (see [documentation](https://docs.github.com/en/issues/tracking-your-work-with-issues/linking-a-pull-request-to-an-issue))
(b) If it is your first contribution, please sign the CLA
<img src="https://user-images.githubusercontent.com/57566630/167307569-a794b967-6e28-4eac-a942-00deb657815f.png" width="1200">
(c) Check whether the Pull Request passes the CI
<img src="https://user-images.githubusercontent.com/57566630/167307490-f9ebf9fa-63c0-4d83-8ba1-081ea169eb3a.png" width="1200">
InternLM will run unit tests for the posted Pull Request on different platforms (Linux, Windows, macOS) with different versions of Python, PyTorch, and CUDA to make sure the code is correct. We can see the specific test information by clicking `Details` in the above image so that we can fix the code accordingly.
(3) If the Pull Request passes the CI, then you can wait for the review from other developers. You'll modify the code based on the reviewer's comments, and repeat the steps [4](#4-commit-the-code-and-pass-the-unit-test)-[5](#5-push-the-code-to-remote) until all reviewers approve it. Then, we will merge it ASAP.
<img src="https://user-images.githubusercontent.com/57566630/202145400-cc2cd8c4-10b0-472f-ba37-07e6f50acc67.png" width="1200">
#### 7. Resolve conflicts
If your local branch conflicts with the latest master branch of "upstream", you'll need to resolve them. There are two ways to do this:
```shell
git fetch --all --prune
git rebase upstream/master
```
or
```shell
git fetch --all --prune
git merge upstream/master
```
If you are very good at handling conflicts, you can use rebase to resolve them, as this will keep your commit history tidy. If you are not familiar with `rebase`, you can use `merge` to resolve conflicts.
### Guidance
#### Document rendering
If the documents are modified/added, we should check the rendering result. We could install the dependencies and run the following command to render the documents and check the results:
```shell
pip install -r requirements/docs.txt
cd docs/zh_cn/
# or docs/en
make html
# check file in ./docs/zh_cn/_build/html/index.html
```
### Code style
#### Python
We adopt [PEP8](https://www.python.org/dev/peps/pep-0008/) as the preferred code style.
We use the following tools for linting and formatting:
- [flake8](https://github.com/PyCQA/flake8): A wrapper around some linter tools.
- [isort](https://github.com/timothycrosley/isort): A Python utility to sort imports.
- [yapf](https://github.com/google/yapf): A formatter for Python files.
- [codespell](https://github.com/codespell-project/codespell): A Python utility to fix common misspellings in text files.
- [mdformat](https://github.com/executablebooks/mdformat): Mdformat is an opinionated Markdown formatter that can be used to enforce a consistent style in Markdown files.
- [docformatter](https://github.com/myint/docformatter): A formatter for docstrings.
We use [pre-commit hooks](https://pre-commit.com/) that check and format code with `flake8`, `yapf`, and `isort`, check markdown files, fix `trailing whitespace`, `end-of-files`, `double-quoted-strings`, `python-encoding-pragma`, and `mixed-line-ending`, and sort `requirements.txt` automatically on every commit.
The config for a pre-commit hook is stored in [.pre-commit-config](../.pre-commit-config.yaml).
#### C++ and CUDA
The clang-format config is stored in [.clang-format](../.clang-format). And it's recommended to use clang-format version **11**. Please do not use older or newer versions as they will result in differences after formatting, which can cause the [lint](https://github.com/InternLM/lmdeploy/blob/main/.github/workflows/lint.yml#L25) to fail.
### PR Specs
1. Use [pre-commit](https://pre-commit.com) hooks to avoid code style issues
2. One short-lived branch should be matched with only one PR
3. Accomplish one detailed change in one PR. Avoid large PRs
- Bad: Support Faster R-CNN
- Acceptable: Add a box head to Faster R-CNN
- Good: Add a parameter to box head to support custom conv-layer number
4. Provide clear and significant commit messages
5. Provide a clear and meaningful PR description
- The task name should be clarified in the title. The general format is: \[Prefix\] Short description of the PR (Suffix)
- Prefix: new feature \[Feature\], bug fix \[Fix\], documentation \[Docs\], work in progress \[WIP\] (which will not be reviewed temporarily)
- Introduce the main changes, results, and influences on other modules in the short description
- Associate related issues and pull requests with a milestone
name: 🐞 Bug report
description: Create a report to help us reproduce and fix the bug
title: "[Bug] "
labels: ['Bug']

body:
  - type: checkboxes
    attributes:
      label: Checklist
      options:
        - label: 1. I have searched related issues but cannot get the expected help.
        - label: 2. The bug has not been fixed in the latest version.
        - label: 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
  - type: textarea
    attributes:
      label: Describe the bug
      description: A clear and concise description of what the bug is.
    validations:
      required: true
  - type: textarea
    attributes:
      label: Reproduction
      description: |
        1. What command or script did you run?
      placeholder: |
        A placeholder for the command.
    validations:
      required: true
  - type: textarea
    attributes:
      label: Environment
      description: |
        1. Please run `lmdeploy check_env` to collect necessary environment information and paste it here.
        2. You may add additional information that may be helpful for locating the problem, such as
          - Which **model** are you using?
          - How you installed PyTorch \[e.g., pip, conda, source\]
          - Other environment variables that may be related (such as `$PATH`, `$LD_LIBRARY_PATH`, `$PYTHONPATH`, etc.)
      placeholder: Environment here.
      render: Shell
    validations:
      required: true
  - type: textarea
    attributes:
      label: Error traceback
      description: |
        If applicable, paste the error traceback here.
      placeholder: Logs and traceback here.
      render: Shell
  - type: markdown
    attributes:
      value: >
        If you have already identified the reason, you can provide the information here. If you are willing to create a PR to fix it, please also leave a comment here and that would be much appreciated!

        Thanks for your bug report. We appreciate it a lot.
name: 🚀 Feature request
description: Suggest an idea for this project
title: "[Feature] "

body:
  - type: markdown
    attributes:
      value: |
        We strongly appreciate you creating a PR to implement this feature [here](https://github.com/OpenGVLab/InternVL/pulls)!
        If you need our help, please fill in as much of the following form as you're able to.

        **The less clear the description, the longer it will take to solve it.**
  - type: textarea
    attributes:
      label: Motivation
      description: |
        A clear and concise description of the motivation of the feature.
        Ex1. It is inconvenient when \[....\].
    validations:
      required: true
  - type: textarea
    attributes:
      label: Related resources
      description: |
        If there is an official code release or third-party implementations, please also provide the information here, which would be very helpful.
  - type: textarea
    attributes:
      label: Additional context
      description: |
        Add any other context or screenshots about the feature request here.
        If you would like to implement the feature and create a PR, please leave a comment here and that would be much appreciated.
name: 📚 Documentation
description: Report an issue related to the documentation.
labels: "kind/doc,status/unconfirmed"
title: "[Docs] "

body:
  - type: textarea
    attributes:
      label: 📚 The doc issue
      description: >
        A clear and concise description of the issue.
    validations:
      required: true
  - type: textarea
    attributes:
      label: Suggest a potential alternative/fix
      description: >
        Tell us how we could improve the documentation in this regard.
  - type: markdown
    attributes:
      value: >
        Thanks for contributing 🎉!
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class
# C extensions
*.so
# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST
# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec
# Installer logs
pip-log.txt
pip-delete-this-directory.txt
# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cache/
cover/
# Translations
*.mo
*.pot
# Django stuff:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal
# Flask stuff:
instance/
.webassets-cache
# Scrapy stuff:
.scrapy
# Sphinx documentation
docs/_build/
# PyBuilder
.pybuilder/
target/
# Jupyter Notebook
.ipynb_checkpoints
# IPython
profile_default/
ipython_config.py
# pyenv
# For a library or package, you might want to ignore these files since the code is
# intended to run in multiple environments; otherwise, check them in:
# .python-version
# pipenv
# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
# However, in case of collaboration, if having platform-specific dependencies or dependencies
# having no cross-platform support, pipenv may install dependencies that don't work, or not
# install all needed dependencies.
#Pipfile.lock
# poetry
# Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control.
# This is especially recommended for binary packages to ensure reproducibility, and is more
# commonly ignored for libraries.
# https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control
#poetry.lock
# pdm
# Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control.
#pdm.lock
# pdm stores project-wide configurations in .pdm.toml, but it is recommended to not include it
# in version control.
# https://pdm.fming.dev/#use-with-ide
.pdm.toml
# PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm
__pypackages__/
# Celery stuff
celerybeat-schedule
celerybeat.pid
# SageMath parsed files
*.sage.py
# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/
# Spyder project settings
.spyderproject
.spyproject
# Rope project settings
.ropeproject
# mkdocs documentation
/site
# mypy
.mypy_cache/
.dmypy.json
dmypy.json
# Pyre type checker
.pyre/
# pytype static type analyzer
.pytype/
# Cython debug symbols
cython_debug/
# PyCharm
# JetBrains specific template is maintained in a separate JetBrains.gitignore that can
# be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
# and can be added to the global gitignore or merged into this file. For a more nuclear
# option (not recommended) you can uncomment the following to ignore the entire idea folder.
#.idea/
.idea/
.DS_Store
data_process/
internvl_chat/work_dirs/
internvl_chat/unittest/
internvl_chat/data/
Husky2/*
*distillation*
[isort]
line-length = 180
multi_line_output = 0
extra_standard_library = setuptools
known_third_party = PIL,asynctest,cityscapesscripts,cv2,gather_models,matplotlib,mmcv,numpy,onnx,onnxruntime,pycocotools,pytest,pytorch_sphinx_theme,requests,scipy,seaborn,six,terminaltables,torch,ts,yaml
no_lines_before = STDLIB,LOCALFOLDER
default_section = THIRDPARTY
[yapf]
BASED_ON_STYLE = pep8
BLANK_LINE_BEFORE_NESTED_CLASS_OR_DEF = true
SPLIT_BEFORE_EXPRESSION_AFTER_OPENING_PAREN = true
[codespell]
skip = *.ipynb
quiet-level = 3
ignore-words-list = patten,nd,ty,mot,hist,formating,winn,gool,datas,wan,confids,TOOD,tood
exclude: ^internvl_chat_llava/
repos:
  - repo: https://github.com/PyCQA/flake8
    rev: 5.0.4
    hooks:
      - id: flake8
  - repo: https://github.com/PyCQA/isort
    rev: 5.11.5
    hooks:
      - id: isort
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v4.3.0
    hooks:
      - id: trailing-whitespace
      - id: check-yaml
      - id: end-of-file-fixer
      - id: requirements-txt-fixer
      - id: double-quote-string-fixer
      - id: check-merge-conflict
      - id: fix-encoding-pragma
        args: ["--remove"]
      - id: mixed-line-ending
        args: ["--fix=lf"]
  - repo: https://github.com/executablebooks/mdformat
    rev: 0.7.9
    hooks:
      - id: mdformat
        args: ["--number"]
        additional_dependencies:
          - mdformat-openmmlab
          - mdformat_frontmatter
          - linkify-it-py
## 🛠️ Installation
- Clone this repository:
```bash
git clone https://github.com/OpenGVLab/InternVL.git
```
- Create a conda virtual environment and activate it:
```bash
conda create -n internvl python=3.9 -y
conda activate internvl
```
- Install dependencies using `requirements.txt`:
```bash
pip install -r requirements.txt
```
By default, our `requirements.txt` file includes the following dependencies:
- `-r requirements/internvl_chat.txt`
- `-r requirements/streamlit_demo.txt`
- `-r requirements/classification.txt`
- `-r requirements/segmentation.txt`
The `clip_benchmark.txt` is **not** included in the default installation. If you require the `clip_benchmark` functionality, please install it manually by running the following command:
```bash
pip install -r requirements/clip_benchmark.txt
```
### Additional Instructions
- Install `flash-attn==2.3.6`:
```bash
pip install flash-attn==2.3.6 --no-build-isolation
```
Alternatively, you can compile it from source:
```bash
git clone https://github.com/Dao-AILab/flash-attention.git
cd flash-attention
git checkout v2.3.6
python setup.py install
```
- Install `mmcv-full==1.6.2` (optional, for `segmentation`):
```bash
pip install -U openmim
mim install mmcv-full==1.6.2
```
- Install `apex` (optional, for `segmentation`):
```bash
git clone https://github.com/NVIDIA/apex.git
cd apex
git checkout 2386a912164b0c5cfcd8be7a2b890fbac5607c82 # https://github.com/NVIDIA/apex/issues/1735
pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" ./
```
If you encounter `ModuleNotFoundError: No module named 'fused_layer_norm_cuda'`, it is because apex's CUDA extensions were not installed successfully. You can try uninstalling apex, and the code will fall back to the PyTorch implementation of RMSNorm. Alternatively, if you prefer using apex, try adding a few lines to `setup.py` and then recompiling.
<img src=https://github.com/OpenGVLab/InternVL/assets/23737120/c04a989c-8024-49fa-b62c-2da623e63729 width=50%>
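The fallback described above usually follows a try/except import pattern; here is a minimal sketch of that idea (illustrative only, not the actual InternVL model code):
```python
import torch
import torch.nn as nn

try:
    # apex provides a fused CUDA kernel for RMSNorm
    from apex.normalization import FusedRMSNorm as RMSNorm
except ImportError:
    class RMSNorm(nn.Module):
        """Plain-PyTorch RMSNorm used when apex is unavailable."""

        def __init__(self, hidden_size: int, eps: float = 1e-6):
            super().__init__()
            self.weight = nn.Parameter(torch.ones(hidden_size))
            self.eps = eps

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            input_dtype = x.dtype
            x = x.float()
            variance = x.pow(2).mean(-1, keepdim=True)
            x = x * torch.rsqrt(variance + self.eps)
            return (self.weight * x).to(input_dtype)
```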
MIT License
Copyright (c) 2023 OpenGVLab
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
# InternVL2
InternVL2 is an open-source multimodal large language model that aims to close the gap between open-source models and commercial proprietary models in multimodal understanding. It can be used for OCR, video understanding, and document question answering.
## Papers
- [InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks](https://arxiv.org/abs/2312.14238)
- [How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites](https://arxiv.org/abs/2404.16821)
## Model Architecture
The InternVL 2.0 architecture integrates a pre-trained vision transformer (InternViT-6B) with a pre-trained language model (InternLM2-20B). The two models are connected by a randomly initialized multi-layer perceptron (MLP) projector. InternViT-6B is a vision foundation model (VFM) that is improved during pre-training through a continuous learning strategy, which strengthens its understanding of visual content and its adaptability to different language models. InternLM2-20B serves as the language foundation model and provides strong initial language processing capabilities. During training, the MLP projector is used to optimize visual feature extraction and align the outputs of the vision encoder with the inputs of the language model.
<div align="center">
<img src="./images/model2.png"/>
</div>
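As a rough illustration of how such a projector bridges the two models, here is a minimal PyTorch sketch (the hidden sizes and exact layer stack are assumptions for illustration, not the actual InternVL implementation):
```python
import torch
import torch.nn as nn

vit_dim, llm_dim = 3200, 6144  # hypothetical hidden sizes for illustration

# a randomly initialized MLP projector: maps merged ViT features
# (4 neighboring patches merged per token, hence vit_dim * 4) into the LLM embedding space
projector = nn.Sequential(
    nn.LayerNorm(vit_dim * 4),
    nn.Linear(vit_dim * 4, llm_dim),
    nn.GELU(),
    nn.Linear(llm_dim, llm_dim),
)

visual_tokens = torch.randn(1, 256, vit_dim * 4)  # 256 visual tokens per 448x448 tile
llm_inputs = projector(visual_tokens)             # shape: [1, 256, llm_dim]
```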
## Algorithm
InternVL 2.0 adopts a dynamic high-resolution training strategy that splits an image into 448×448-pixel tiles, with the number of tiles varying from 1 to 12 depending on the aspect ratio and resolution of the input image. At test time, this can be extended to 40 tiles (i.e., 4K resolution). To improve scalability at high resolution, the model applies a simple pixel-shuffle operation that reduces the number of visual tokens to one quarter of the original count, so a 448×448 tile is represented by 256 visual tokens. In the fine-tuning stage, the model is trained on carefully selected datasets covering image captioning, general question answering, scientific image understanding, chart interpretation, mathematical problem solving, knowledge-based question answering, OCR, and document understanding, to strengthen its performance on multimodal tasks.
<div align=center>
<img src="./images/train.png"/>
</div>
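For illustration, the token reduction described above can be sketched as a pixel-shuffle (space-to-depth) style merge; the following is a minimal PyTorch example written for this README, not the actual InternVL code:
```python
import torch


def merge_tokens(x: torch.Tensor, scale: int = 2) -> torch.Tensor:
    """Merge each scale x scale group of visual tokens into one token.

    x is a [batch, h, w, c] grid of visual features; the token count drops
    by scale**2 (4x for scale=2) while the channel dim grows by the same factor.
    """
    b, h, w, c = x.shape
    x = x.reshape(b, h // scale, scale, w // scale, scale, c)
    x = x.permute(0, 1, 3, 2, 4, 5).contiguous()
    return x.reshape(b, h // scale, w // scale, c * scale * scale)


# a 448x448 tile gives a 32x32 grid of ViT patches (1024 tokens);
# after merging, 16x16 = 256 tokens remain, matching the description above
tokens = torch.randn(1, 32, 32, 1024)
print(merge_tokens(tokens).shape)  # torch.Size([1, 16, 16, 4096])
```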
## Environment Setup
### Docker (Option 1)
Pull the Docker image from [光源](https://www.sourcefind.cn/#/service-details) and follow these steps:
```
docker pull image.sourcefind.cn:5000/dcu/admin/base/pytorch:2.1.0-ubuntu20.04-dtk24.04.1-py3.10
docker run -it -v /path/your_code_data/:/path/your_code_data/ -v /opt/hyhal/:/opt/hyhal/:ro --shm-size=128G --privileged=true --device=/dev/kfd --device=/dev/dri/ --group-add video --name internvl2 <your imageID> bash
cd /path/your_code_data/
pip install -r requirements.txt -i http://mirrors.aliyun.com/pypi/simple/ --trusted-host mirrors.aliyun.com
```
### Dockerfile (Option 2)
```
cd /path/your_code_data/docker
docker build --no-cache -t internvl2:latest .
docker run --shm-size=128G --name internvl2 -v /opt/hyhal:/opt/hyhal:ro --privileged=true --device=/dev/kfd --device=/dev/dri/ --group-add video -v /path/your_code_data/:/path/your_code_data/ -it internvl2 bash
```
### Anaconda (Option 3)
The special deep learning libraries required for the DCU cards used in this project can be downloaded and installed from the [光合](https://developer.hpccube.com/tool/) developer community.
```
DTK driver: dtk24.04
python: 3.10
torch: 2.1
torchvision: 0.16.0
deepspeed: 0.12.3
```
`Tip: the DTK driver, Python, and other DCU-related tool versions listed above must strictly correspond to one another.`
```
conda create -n internvl2 python=3.10
conda activate internvl2
cd /path/your_code_data/
pip install -r requirements.txt -i http://mirrors.aliyun.com/pypi/simple
```
## Dataset
Test dataset: [ai2d](https://allenai.org/data/diagrams)
For pre-training/fine-tuning, prepare your training data: place all image samples under `playground/data/` and store the text annotations as jsonl files using the directory layout below. **ai2d_train_12k.jsonl** can be found in `playground/opensource`. For details, refer to the official guide [Fine-tune on a Custom Dataset](https://internvl.readthedocs.io/en/latest/internvl2.0/finetune.html).
```
playground/
├── opensource
│ ├── ai2d_train_12k.jsonl
├── data
│ ├── ai2d
│ │ ├── abc_images
│ │ └── images
```
After downloading the pre-trained model, prepare your custom SFT (supervised fine-tuning) data. Then create a JSON file in `internvl_chat/shell/data/` with the following format and name it **internvl_1_2_finetune_custom.json**:
```
{
  "ai2d_train_12k": {
    "root": "playground/data/ai2d/",
    "annotation": "playground/opensource/ai2d_train_12k.jsonl",
    "data_augment": false,
    "repeat_time": 1,
    "length": 12413
  }
}
```
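If you build your own jsonl annotation file, the `length` field in this meta JSON should equal the number of samples in the file; a small helper sketch (using the paths from the example above):
```python
import json

annotation = 'playground/opensource/ai2d_train_12k.jsonl'

num_samples = 0
with open(annotation, encoding='utf-8') as f:
    for line in f:
        if line.strip():
            json.loads(line)   # validate that every line is well-formed JSON
            num_samples += 1

print(num_samples)  # use this value for the "length" field (12413 for ai2d_train_12k)
```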
## Training
Modify the weight-related paths in the script according to your setup.
### Single node, multiple cards
```
sh finetune_lora_multi_dcu.sh
```
## Inference
### Single node, multiple cards
Before running inference, modify the model path and image path in the script:
```
path = 'OpenGVLab/InternVL2-40B'
pixel_values = load_image('./examples/image1.jpg', max_num=12).to(torch.bfloat16).cuda()
generation_config = dict(max_new_tokens=1024, do_sample=False)
```
```
python internvl_chat.py
```
## Results
### OCR
<div align=center>
<img src="./images/ocr_result.png"/>
</div>
### Question Answering
<div align=center>
<img src="./images/qa_result.png"/>
</div>
### Accuracy
Test data: [ai2d](https://allenai.org/data/diagrams); accelerator cards used: K100AI / A800.
| device | train_loss | samples/second | samples/step |
| :------: | :------: | :------: | :------: |
| K100AI | 0.1223 | 0.118 |0.019 |
| A800 | 0.1245 | 0.249 | 0.041 |
## Application Scenarios
### Algorithm Category
`OCR`
### Key Application Industries
`Finance, Education, Transportation, Government`
## Pre-trained Weights
- [OpenGVLab/InternVL2-40B](https://modelscope.cn/models/OpenGVLab/InternVL2-40B)
Fast download center for pre-trained weights: [SCNet AIModels](http://113.200.138.88:18080/aimodels)
The pre-trained weights used in this project can also be downloaded via the fast download channel: [OpenGVLab/InternVL2-40B](http://113.200.138.88:18080/aimodels/opengvlab/internvl2-40b)
## Source Repository and Issue Feedback
- https://developer.hpccube.com/codes/modelzoo/internvl2_pytorch
## References
- [OpenGVLab/InternVL github](https://github.com/OpenGVLab/InternVL)
- [InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks](https://arxiv.org/abs/2312.14238)
- [How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites](https://arxiv.org/abs/2404.16821)
<div align="center">
# <img width="60" alt="image" src="https://github.com/OpenGVLab/InternVL/assets/47669167/7037290e-f474-4d11-b90f-1d8316087bf8"> InternVL Family: Closing the Gap to Commercial Multimodal Models with Open-Source Suites —— A Pioneering Open-Source Alternative to GPT-4o
[\[🆕 Blog\]](https://internvl.github.io/blog/) [\[🚀 InternVL2 Blog\]](https://internvl.github.io/blog/2024-07-02-InternVL-2.0/) [\[🤗 HF Chat Demo\]](https://huggingface.co/spaces/OpenGVLab/InternVL) [\[🗨️ Chat Demo\]](https://internvl.opengvlab.com/) [\[🌐 API\]](./document/How_to_use_InternVL_API.md) [\[🚀 Quick Start\]](#quick-start-with-huggingface)
[\[📜 InternVL 1.0 Paper\]](https://arxiv.org/abs/2312.14238) [\[📜 InternVL 1.5 Report\]](https://arxiv.org/abs/2404.16821) [\[📖 1.0 中文解读\]](https://zhuanlan.zhihu.com/p/702946079) [\[📖 1.5 中文解读\]](https://zhuanlan.zhihu.com/p/699439759) [\[📖 2.0 中文解读\]](https://zhuanlan.zhihu.com/p/706547971)
[Switch to the Chinese version (切换至中文版)](/README_zh.md)
<a href="https://trendshift.io/repositories/9803" target="_blank"><img src="https://trendshift.io/api/badge/repositories/9803" alt="OpenGVLab%2FInternVL | Trendshift" style="width: 250px; height: 55px;" width="250" height="55"/></a>
<img height="55" alt="image" src="https://github.com/user-attachments/assets/bd62ab46-f0ea-40c6-ab10-7fde671716cc">
![opencompass](https://github.com/user-attachments/assets/7ce93c05-84ae-4997-a480-53897d1d3a1c)
</div>
## News 🚀🚀🚀
- `2024/07/18`: 🔥🔥 InternVL2-40B achieved SOTA performance among open-source models on the [Video-MME](https://github.com/BradyFU/Video-MME) dataset, scoring 61.2 when inputting 16 frames and 64.4 when inputting 32 frames. It significantly outperforms other open-source models and is the closest open-source model to GPT-4o mini.
- `2024/07/18`: 🔥 InternVL2-Pro achieved the SOTA performance on the [DocVQA](https://rrc.cvc.uab.es/?ch=17&com=evaluation&task=1) and [InfoVQA](https://rrc.cvc.uab.es/?ch=17&com=evaluation&task=3) benchmarks.
- `2024/07/04`: 🚀 We release the [InternVL2 series](https://huggingface.co/collections/OpenGVLab/internvl-20-667d3961ab5eb12c7ed1463e). InternVL2-Pro achieved a 62.0% accuracy on the MMMU benchmark, matching the performance of leading closed-source commercial models like GPT-4o. The free API of this model can be applied for by filling in the ([application form](https://docs.google.com/forms/d/e/1FAIpQLSfMCzhPr1OOEKau_6jwTU0EiZMSFckDo-HMlc_hUudhF_97rw/viewform?usp=sf_link)) / ([申请表](https://wj.qq.com/s2/14910502/25a4/)). Other models are available at [HF link](https://huggingface.co/collections/OpenGVLab/internvl-20-667d3961ab5eb12c7ed1463e).
- `2024/06/19`: We propose Needle In A Multimodal Haystack ([MM-NIAH](https://github.com/OpenGVLab/MM-NIAH)), the first benchmark designed to systematically evaluate the capability of existing MLLMs to comprehend long multimodal documents.
- `2024/05/30`: We release [ShareGPT-4o](https://sharegpt4o.github.io/), a large-scale dataset that we plan to open-source with 200K images, 10K videos, and 10K audios with detailed descriptions.
- `2024/05/29`: We release the Mini-InternVL series, which includes two chat models: [Mini-InternVL-Chat-2B-V1-5](https://huggingface.co/OpenGVLab/Mini-InternVL-Chat-2B-V1-5) and [Mini-InternVL-Chat-4B-V1-5](https://huggingface.co/OpenGVLab/Mini-InternVL-Chat-4B-V1-5). These models achieve impressive performance with minimal size: the 2B model delivers 80% of the performance with only 8% of the model size, and the 4B model achieves 90% of the performance with just 16% of the model size. For more details, please check our [blog](https://internvl.github.io/blog/2024-05-25-Mini-InternVL-1.5/).
- `2024/05/28`: Thanks to the [lmdeploy](https://github.com/InternLM/lmdeploy) team for providing AWQ quantization support. The 4-bit model is available at [OpenGVLab/InternVL-Chat-V1-5-AWQ](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-5-AWQ).
- `2024/05/13`: InternVL 1.0 can now be used as the [text encoder](https://huggingface.co/OpenGVLab/InternVL-14B-224px) for diffusion models to support multilingual generation natively in over 110 languages worldwide. See [MuLan](https://github.com/mulanai/MuLan) for more details.
- `2024/04/18`: InternVL-Chat-V1-5 has been released at [HF link](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-5), approaching the performance of GPT-4V and Gemini Pro on various benchmarks like MMMU, DocVQA, ChartQA, MathVista, etc.
- `2024/02/27`: InternVL is accepted by CVPR 2024 (Oral)! 🎉
- `2024/02/24`: InternVL-Chat models have been included in the [VLMEvalKit](https://github.com/open-compass/VLMEvalKit).
- `2024/02/21`: [InternVL-Chat-V1-2-Plus](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-2-Plus) achieved SOTA performance on MathVista (59.9), MMBench (83.8), and MMVP (58.7). See our [blog](https://internvl.github.io/blog/2024-02-21-InternVL-1.2/) for more details.
- `2024/02/12`: InternVL-Chat-V1-2 has been released. It achieves 51.6 on MMMU val and 82.3 on MMBench test. For more details, please refer to our [blog](https://internvl.github.io/blog/2024-02-21-InternVL-1.2/) and [SFT data](./internvl_chat#prepare-training-datasets). The model is now available on [HuggingFace](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-2), and both training / evaluation data and scripts are open-sourced.
- `2024/01/24`: InternVL-Chat-V1-1 is released. It supports Chinese and has stronger OCR capability; see [here](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-1).
- `2024/01/16`: We release our [customized mmcv/mmsegmentation/mmdetection code](https://github.com/OpenGVLab/InternVL-MMDetSeg), integrated with DeepSpeed, which can be used for training large-scale detection and segmentation models.
## TODO List
- [ ] Support vLLM and Ollama
- [ ] Rebuild documents using readthedocs
- [x] Support fine-tuning different LLMs with LoRA
- [ ] Support video and PDF input in online demo
- [ ] Release InternVL2 with VisionLLMv2 integration
- [x] Release `requirements.txt` for InternVL2
- [x] Release training / evaluation code for InternVL2 series
- [x] Release Streamlit web UI for InternVL1.5 and InternVL2
## Documents
- Installation
- How to install the environment? [\[link\]](./INSTALLATION.md) [\[requirements.txt\]](./requirements.txt)
- Training or Fine-tuning
- How to reproduce the SFT stage of InternVL-Chat-V1-2? [\[link\]](./internvl_chat#start-training)
- How to fine-tune InternVL-Chat-V1-2 on a custom dataset? [\[link\]](./document/How_to_finetune_internvl_chat_v1_2_on_a_custom_dataset.md)
- How to fine-tune the Mini-InternVL-Chat series on a custom dataset? [\[link\]](./document/How_to_finetune_mini_internvl_chat_v1_5_on_a_custom_dataset.md)
- Benchmark Test
> Due to minor implementation differences between this codebase and VLMEvalKit, slight discrepancies in performance metrics may occur when testing the same model.
- How to evaluate InternVL-Chat-V1-5? [\[link\]](./document/How_to_evaluate_internvl_chat_v1_5.md)
- How to evaluate InternVL-Chat-V1-5 using VLMEvalKit? (Recommend) [\[link\]](./document/How_to_evaluate_internvl_chat_v1_5_using_vlmevalkit.md)
- How to evaluate Mini-InternVL-Chat-2B-V1-5 using VLMEvalKit? (Recommend) [\[link\]](./document/How_to_evaluate_mini_internvl_chat_2b_v1_5_using_vlmevalkit.md)
- How to evaluate Mini-InternVL-Chat-4B-V1-5 using VLMEvalKit? (Recommend) [\[link\]](./document/How_to_evaluate_mini_internvl_chat_4b_v1_5_using_vlmevalkit.md)
- Deployment
- How to use InternVL API? [\[link\]](./document/How_to_use_InternVL_API.md)
- How to deploy a local demo? [\[link\]](./document/How_to_deploy_a_local_demo.md)
- How to run InternVL-1.5 8bit with Nvidia V100 GPU? [\[link\]](https://github.com/OpenGVLab/InternVL/issues/144) [\[中文教程\]](https://zhuanlan.zhihu.com/p/697188143)
- How to perform batch inference? [\[link\]](./README.md?plain=1#L849)
## Compared with SOTA VLLMs
![waic_performance](https://github.com/user-attachments/assets/7b24ad6c-45dd-4bcd-aa77-79da1ec856ee)
## Model Zoo
#### Multimodal Large Language Model (InternVL 2.0)
<table>
<tr>
<th>Model Name</th>
<th>Vision Part</th>
<th>Language Part</th>
<th>HF&nbsp;Link</th>
<th>MS&nbsp;Link</th>
<th>Document</th>
</tr>
<tr>
<td>InternVL2&#8209;1B</td>
<td><a href="https://huggingface.co/OpenGVLab/InternViT-300M-448px">InternViT&#8209;300M&#8209;448px</a></td>
<td><a href="https://huggingface.co/Qwen/Qwen2-0.5B-Instruct">Qwen2&#8209;0.5B&#8209;Instruct</a></td>
<td><a href="https://huggingface.co/OpenGVLab/InternVL2-1B">🤗 link</a></td>
<td><a href="https://modelscope.cn/models/OpenGVLab/InternVL2-1B">🤖 link</a></td>
<td><a href="https://huggingface.co/OpenGVLab/InternVL2-1B#quick-start">📖 doc</a></td>
</tr>
<tr>
<td>InternVL2&#8209;2B</td>
<td><a href="https://huggingface.co/OpenGVLab/InternViT-300M-448px">InternViT&#8209;300M&#8209;448px</a></td>
<td><a href="https://huggingface.co/internlm/internlm2-chat-1_8b">internlm2&#8209;chat&#8209;1&#8209;8b</a></td>
<td><a href="https://huggingface.co/OpenGVLab/InternVL2-2B">🤗 link</a></td>
<td><a href="https://modelscope.cn/models/OpenGVLab/InternVL2-2B">🤖 link</a></td>
<td><a href="https://huggingface.co/OpenGVLab/InternVL2-2B#quick-start">📖 doc</a></td>
</tr>
<tr>
<td>InternVL2&#8209;4B</td>
<td><a href="https://huggingface.co/OpenGVLab/InternViT-300M-448px">InternViT&#8209;300M&#8209;448px</a></td>
<td><a href="https://huggingface.co/microsoft/Phi-3-mini-128k-instruct">Phi&#8209;3&#8209;mini&#8209;128k&#8209;instruct</a></td>
<td><a href="https://huggingface.co/OpenGVLab/InternVL2-4B">🤗 link</a></td>
<td><a href="https://modelscope.cn/models/OpenGVLab/InternVL2-4B">🤖 link</a></td>
<td><a href="https://huggingface.co/OpenGVLab/InternVL2-4B#quick-start">📖 doc</a></td>
</tr>
<tr>
<td>InternVL2&#8209;8B</td>
<td><a href="https://huggingface.co/OpenGVLab/InternViT-300M-448px">InternViT&#8209;300M&#8209;448px</a></td>
<td><a href="https://huggingface.co/internlm/internlm2_5-7b-chat">internlm2_5&#8209;7b&#8209;chat</a></td>
<td><a href="https://huggingface.co/OpenGVLab/InternVL2-8B">🤗 link</a></td>
<td><a href="https://modelscope.cn/models/OpenGVLab/InternVL2-8B">🤖 link</a></td>
<td><a href="https://huggingface.co/OpenGVLab/InternVL2-8B#quick-start">📖 doc</a></td>
</tr>
<tr>
<td>InternVL2&#8209;26B</td>
<td><a href="https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-5">InternViT&#8209;6B&#8209;448px&#8209;V1&#8209;5</a></td>
<td><a href="https://huggingface.co/internlm/internlm2-chat-20b">internlm2&#8209;chat&#8209;20b</a></td>
<td><a href="https://huggingface.co/OpenGVLab/InternVL2-26B">🤗 link</a></td>
<td><a href="https://modelscope.cn/models/OpenGVLab/InternVL2-26B">🤖 link</a></td>
<td><a href="https://huggingface.co/OpenGVLab/InternVL2-26B#quick-start">📖 doc</a></td>
</tr>
<tr>
<td>InternVL2&#8209;40B</td>
<td><a href="https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-5">InternViT&#8209;6B&#8209;448px&#8209;V1&#8209;5</a></td>
<td><a href="https://huggingface.co/NousResearch/Nous-Hermes-2-Yi-34B">Nous&#8209;Hermes&#8209;2&#8209;Yi&#8209;34B</a></td>
<td><a href="https://huggingface.co/OpenGVLab/InternVL2-40B">🤗 link</a></td>
<td><a href="https://modelscope.cn/models/OpenGVLab/InternVL2-40B">🤖 link</a></td>
<td><a href="https://huggingface.co/OpenGVLab/InternVL2-40B#quick-start">📖 doc</a></td>
</tr>
<tr>
<td>InternVL2-Llama3-76B</td>
<td><a href="https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-5">InternViT&#8209;6B&#8209;448px&#8209;V1&#8209;5</a></td>
<td><a href="https://huggingface.co/NousResearch/Hermes-2-Theta-Llama-3-70B">Hermes‑2‑Theta‑<br>Llama‑3‑70B</a></td>
<td><a href="https://huggingface.co/OpenGVLab/InternVL2-Llama3-76B">🤗 link</a></td>
<td><a href="https://modelscope.cn/models/OpenGVLab/InternVL2-Llama3-76B">🤖 link</a></td>
<td><a href="https://huggingface.co/OpenGVLab/InternVL2-Llama3-76B#quick-start">📖 doc</a></td>
</tr>
</table>
#### InternVL2-Pro API
We encourage everyone to use our API for research. For better management, please submit ([application form](https://docs.google.com/forms/d/e/1FAIpQLSfMCzhPr1OOEKau_6jwTU0EiZMSFckDo-HMlc_hUudhF_97rw/viewform?usp=sf_link)) / ([申请表](https://wj.qq.com/s2/14910502/25a4/)) to obtain free API access.
#### Multimodal Large Language Model (InternVL 1.0-1.5)
<table>
<tr>
<th>Model</th>
<th>Date</th>
<th>HF&nbsp;Link</th>
<th>MS&nbsp;Link</th>
<th>Note</th>
</tr>
<tr>
<td>Mini&#8209;InternVL&#8209;Chat&#8209;4B&#8209;V1&#8209;5</td>
<td>2024.05.28</td>
<td><a href="https://huggingface.co/OpenGVLab/Mini-InternVL-Chat-4B-V1-5">🤗 link</a></td>
<td><a href="https://modelscope.cn/models/OpenGVLab/Mini-InternVL-Chat-4B-V1-5">🤖 link</a></td>
<td>🚀🚀 16% of the model size, 90% of the performance</td>
</tr>
<tr>
<td>Mini&#8209;InternVL&#8209;Chat&#8209;2B&#8209;V1&#8209;5</td>
<td>2024.05.19</td>
<td><a href="https://huggingface.co/OpenGVLab/Mini-InternVL-Chat-2B-V1-5">🤗 link</a></td>
<td><a href="https://modelscope.cn/models/OpenGVLab/Mini-InternVL-Chat-2B-V1-5">🤖 link</a></td>
<td>🚀 8% of the model size, 80% of the performance</td>
</tr>
<tr>
<td>InternVL&#8209;Chat&#8209;V1&#8209;5</td>
<td>2024.04.18</td>
<td><a href="https://huggingface.co/OpenGVLab/InternVL-Chat-V1-5">🤗 link</a></td>
<td><a href="https://modelscope.cn/models/OpenGVLab/InternVL-Chat-V1-5">🤖 link</a></td>
<td>support 4K image; super strong OCR; Approaching the performance of GPT-4V and Gemini Pro on various benchmarks like MMMU, DocVQA, ChartQA, MathVista, etc.</td>
</tr>
<tr>
<td>InternVL&#8209;Chat&#8209;V1&#8209;2&#8209;Plus</td>
<td>2024.02.21</td>
<td><a href="https://huggingface.co/OpenGVLab/InternVL-Chat-V1-2-Plus">🤗 link</a></td>
<td><a href="https://modelscope.cn/models/OpenGVLab/InternVL-Chat-V1-2-Plus">🤖 link</a></td>
<td>more SFT data and stronger</td>
</tr>
<tr>
<td>InternVL&#8209;Chat&#8209;V1&#8209;2</td>
<td>2024.02.11</td>
<td><a href="https://huggingface.co/OpenGVLab/InternVL-Chat-V1-2">🤗 link</a></td>
<td><a href="https://modelscope.cn/models/OpenGVLab/InternVL-Chat-V1-2">🤖 link</a></td>
<td>scaling up LLM to 34B</td>
</tr>
<tr>
<td>InternVL&#8209;Chat&#8209;V1&#8209;1</td>
<td>2024.01.24</td>
<td><a href="https://huggingface.co/OpenGVLab/InternVL-Chat-V1-1">🤗 link</a></td>
<td><a href="https://modelscope.cn/models/OpenGVLab/InternVL-Chat-V1-1">🤖 link</a></td>
<td>support Chinese and stronger OCR</td>
</tr>
<tr>
<td>InternVL&#8209;Chat&#8209;19B</td>
<td>2023.12.25</td>
<td><a href="https://huggingface.co/OpenGVLab/InternVL-Chat-ViT-6B-Vicuna-13B">🤗 link</a></td>
<td><a href="https://modelscope.cn/models/OpenGVLab/InternVL-Chat-ViT-6B-Vicuna-13B">🤖 link</a></td>
<td>English multimodal dialogue</td>
</tr>
<tr>
<td>InternVL&#8209;Chat&#8209;13B</td>
<td>2023.12.25</td>
<td><a href="https://huggingface.co/OpenGVLab/InternVL-Chat-ViT-6B-Vicuna-7B">🤗 link</a></td>
<td><a href="https://modelscope.cn/models/OpenGVLab/InternVL-Chat-ViT-6B-Vicuna-7B">🤖 link</a></td>
<td>English multimodal dialogue</td>
</tr>
</table>
#### Vision Foundation Model (InternVL 1.0-1.5)
<table>
<tr>
<th>Model</th>
<th>Date</th>
<th>HF&nbsp;Link</th>
<th>MS&nbsp;Link</th>
<th>Note</th>
</tr>
<tr>
<td>InternViT&#8209;300M&#8209;448px</td>
<td>2024.05.25</td>
<td><a href="https://huggingface.co/OpenGVLab/InternViT-300M-448px">🤗 link</a></td>
<td><a href="https://modelscope.cn/models/OpenGVLab/InternViT-300M-448px">🤖 link</a></td>
<td>distilled small vision foundation model with 300M parameters (🔥new)</td>
</tr>
<tr>
<td>InternViT&#8209;6B&#8209;448px&#8209;V1&#8209;5</td>
<td>2024.04.20</td>
<td><a href="https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-5">🤗 link</a></td>
<td><a href="https://modelscope.cn/models/OpenGVLab/InternViT-6B-448px-V1-5">🤖 link</a></td>
<td>support dynamic resolution and super strong OCR feature extraction capability by incremental pre-training (🔥new)</td>
</tr>
<tr>
<td>InternViT&#8209;6B&#8209;448px&#8209;V1&#8209;2</td>
<td>2024.02.11</td>
<td><a href="https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-2">🤗 link</a></td>
<td><a href="https://modelscope.cn/models/OpenGVLab/InternViT-6B-448px-V1-2">🤖 link</a></td>
<td>support 448 resolution by incremental pre-training</td>
</tr>
<tr>
<td>InternViT&#8209;6B&#8209;448px&#8209;V1&#8209;0</td>
<td>2024.01.30</td>
<td><a href="https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-0">🤗 link</a></td>
<td><a href="https://modelscope.cn/models/OpenGVLab/InternViT-6B-448px-V1-0">🤖 link</a></td>
<td>support 448 resolution by incremental pre-training</td>
</tr>
<tr>
<td>InternViT&#8209;6B&#8209;224px</td>
<td>2023.12.22</td>
<td><a href="https://huggingface.co/OpenGVLab/InternViT-6B-224px">🤗 link</a></td>
<td><a href="https://modelscope.cn/models/OpenGVLab/InternViT-6B-224px">🤖 link</a></td>
<td>the first version of InternViT-6B, extracted from InternVL‑14B‑224px</td>
</tr>
</table>
#### Vision-Language Foundation Model (InternVL 1.0)
<table>
<tr>
<th>Model</th>
<th>Date</th>
<th>HF&nbsp;Link</th>
<th>MS&nbsp;Link</th>
<th>Note</th>
</tr>
<tr>
<td>InternVL&#8209;14B&#8209;224px</td>
<td>2023.12.22</td>
<td><a href="https://huggingface.co/OpenGVLab/InternVL-14B-224px">🤗 link</a></td>
<td><a href="https://modelscope.cn/models/OpenGVLab/InternVL-14B-224px">🤖 link</a></td>
<td>vision-language foundation model, InternViT-6B + QLLaMA, can be used for image-text retrieval like CLIP</td>
</tr>
</table>
## What can InternVL do?
<details>
<summary>Visual Perception (click to expand)</summary>
- Linear-Probe Image Classification [\[see details\]](./classification#-evaluation)
ViT-22B uses the private JFT-3B dataset.
| method | #param | IN-1K | IN-ReaL | IN-V2 | IN-A | IN-R | IN-Sketch |
| ------------------- | :----: | :---: | :-----: | :---: | :--: | :--: | :-------: |
| OpenCLIP-G | 1.8B | 86.2 | 89.4 | 77.2 | 63.8 | 87.8 | 66.4 |
| DINOv2-g | 1.1B | 86.5 | 89.6 | 78.4 | 75.9 | 78.8 | 62.5 |
| EVA-01-CLIP-g | 1.1B | 86.5 | 89.3 | 77.4 | 70.5 | 87.7 | 63.1 |
| MAWS-ViT-6.5B | 6.5B | 87.8 | - | - | - | - | - |
| ViT-22B\* | 21.7B | 89.5 | 90.9 | 83.2 | 83.8 | 87.4 | - |
| InternViT-6B (ours) | 5.9B | 88.2 | 90.4 | 79.9 | 77.5 | 89.8 | 69.1 |
- Semantic Segmentation [\[see details\]](./segmentation#-evaluation)
| method | decoder | #param (train/total) | crop size | mIoU |
| --------------------- | :-----: | :------------------: | :-------: | ------------ |
| OpenCLIP-G (frozen) | Linear | 0.3M / 1.8B | 512 | 39.3 |
| ViT-22B (frozen) | Linear | 0.9M / 21.7B | 504 | 34.6 |
| InternViT-6B (frozen) | Linear | 0.5M / 5.9B | 504 | 47.2 (+12.6) |
| ViT-22B (frozen) | UperNet | 0.8B / 22.5B | 504 | 52.7 |
| InternViT-6B (frozen) | UperNet | 0.4B / 6.3B | 504 | 54.9 (+2.2) |
| ViT-22B | UperNet | 22.5B / 22.5B | 504 | 55.3 |
| InternViT-6B | UperNet | 6.3B / 6.3B | 504 | 58.9 (+3.6) |
- Zero-Shot Image Classification [\[see details\]](./clip_benchmark#imagenet-variants-and-objectnet)
| method | IN-1K | IN-A | IN-R | IN-V2 | IN-Sketch | ObjectNet |
| ----------------- | :---: | :--: | :--: | :---: | :-------: | :-------: |
| OpenCLIP-G | 80.1 | 69.3 | 92.1 | 73.6 | 68.9 | 73.0 |
| EVA-02-CLIP-E+ | 82.0 | 82.1 | 94.5 | 75.7 | 71.6 | 79.6 |
| ViT-22B\* | 85.9 | 90.1 | 96.0 | 80.9 | - | 87.6 |
| InternVL-C (ours) | 83.2 | 83.8 | 95.5 | 77.3 | 73.9 | 80.6 |
- Multilingual Zero-Shot Image Classification [\[see details\]](./clip_benchmark#multilingual-imagenet-1k)
EN: English, ZH: Chinese, JP: Japanese, AR: Arabic, IT: Italian
| method | IN-1K (EN) | IN-1K (ZH) | IN-1K (JP) | IN-1K (AR) | IN-1K (IT) |
| ----------------- | :--------: | :--------: | :--------: | :--------: | :--------: |
| Taiyi-CLIP-ViT-H | - | 54.4 | - | - | - |
| WuKong-ViT-L-G | - | 57.5 | - | - | - |
| CN-CLIP-ViT-H | - | 59.6 | - | - | - |
| AltCLIP-ViT-L | 74.5 | 59.6 | - | - | - |
| EVA-02-CLIP-E+ | 82.0 | - | - | - | 41.2 |
| OpenCLIP-XLM-R-H | 77.0 | 55.7 | 53.1 | 37.0 | 56.8 |
| InternVL-C (ours) | 83.2 | 64.5 | 61.5 | 44.9 | 65.7 |
- Zero-Shot Video Classification
| method | #frame | K400 | K600 | K700 |
| ----------------- | :----: | :--: | :--: | :--: |
| OpenCLIP-G | 1 | 65.9 | 66.1 | 59.2 |
| EVA-02-CLIP-E+ | 1 | 69.8 | 69.3 | 63.4 |
| InternVL-C (ours) | 1 | 71.0 | 71.3 | 65.7 |
| ViCLIP | 8 | 75.7 | 73.5 | 66.4 |
| InternVL-C (ours) | 8 | 79.4 | 78.8 | 71.5 |
</details>
<details>
<summary>Cross-Modal Retrieval (click to expand)</summary>
- English Zero-Shot Image-Text Retrieval [\[see details\]](./clip_benchmark#flickr30k--coco)
<table>
<tr align=center>
<td rowspan="3" align=left><b>model</b></td>
<td colspan="6" align=center><b>Flickr30K</b></td>
<td colspan="6" align=center><b>COCO</b></td>
<td rowspan="3" align=center><b>avg</b></td>
</tr>
<tr align=center>
<td colspan="3" align=center><b>image-to-text</b></td>
<td colspan="3" align=center><b>text-to-image</b></td>
<td colspan="3" align=center><b>image-to-text</b></td>
<td colspan="3" align=center><b>text-to-image</b></td>
</tr>
<tr>
<td>R@1</td>
<td>R@5</td>
<td>R@10</td>
<td>R@1</td>
<td>R@5</td>
<td>R@10</td>
<td>R@1</td>
<td>R@5</td>
<td>R@10</td>
<td>R@1</td>
<td>R@5</td>
<td>R@10</td>
</tr>
<tr align=center>
<td align=left>OpenCLIP-G</td>
<td>92.9</td>
<td>99.3</td>
<td>99.8</td>
<td>79.5</td>
<td>95.0</td>
<td>97.1</td>
<td>67.3</td>
<td>86.9</td>
<td>92.6</td>
<td>51.4</td>
<td>74.9</td>
<td>83.0</td>
<td>85.0</td>
</tr>
<tr align=center>
<td align=left>EVA-02-CLIP-E+</td>
<td>93.9</td>
<td>99.4</td>
<td>99.8</td>
<td>78.8</td>
<td>94.2</td>
<td>96.8</td>
<td>68.8</td>
<td>87.8</td>
<td>92.8</td>
<td>51.1</td>
<td>75.0</td>
<td>82.7</td>
<td>85.1</td>
</tr>
<tr align=center>
<td align=left>EVA-CLIP-8B</td>
<td>95.6</td>
<td>99.6</td>
<td>99.9</td>
<td>80.8</td>
<td>95.5</td>
<td>97.6</td>
<td>70.3</td>
<td>89.3</td>
<td>93.9</td>
<td>53.0</td>
<td>76.0</td>
<td>83.4</td>
<td>86.2</td>
</tr>
<tr align=center>
<td align=left>InternVL-C (ours)</td>
<td>94.7</td>
<td>99.6</td>
<td>99.9</td>
<td>81.7</td>
<td>96.0</td>
<td>98.2</td>
<td>70.6</td>
<td>89.0</td>
<td>93.5</td>
<td>54.1</td>
<td>77.3</td>
<td>84.6</td>
<td>86.6</td>
</tr>
<tr align=center>
<td align=left>InternVL-G (ours)</td>
<td>95.7</td>
<td>99.7</td>
<td>99.9</td>
<td>85.0</td>
<td>97.0</td>
<td>98.6</td>
<td>74.9</td>
<td>91.3</td>
<td>95.2</td>
<td>58.6</td>
<td>81.3</td>
<td>88.0</td>
<td>88.8</td>
</tr>
</table>
- Chinese Zero-Shot Image-Text Retrieval [\[see details\]](./clip_benchmark#flickr30k-cn--coco-cn)
<table>
<tr align=center>
<td rowspan="3" align=left><b>model</b></td>
<td colspan="6" align=center><b>Flickr30K-CN</b></td>
<td colspan="6" align=center><b>COCO-CN</b></td>
<td rowspan="3" align=center><b>avg</b></td>
</tr>
<tr align=center>
<td colspan="3" align=center><b>image-to-text</b></td>
<td colspan="3" align=center><b>text-to-image</b></td>
<td colspan="3" align=center><b>image-to-text</b></td>
<td colspan="3" align=center><b>text-to-image</b></td>
</tr>
<tr>
<td>R@1</td>
<td>R@5</td>
<td>R@10</td>
<td>R@1</td>
<td>R@5</td>
<td>R@10</td>
<td>R@1</td>
<td>R@5</td>
<td>R@10</td>
<td>R@1</td>
<td>R@5</td>
<td>R@10</td>
</tr>
<tr align=center>
<td align=left>CN-CLIP-ViT-H</td>
<td>81.6</td>
<td>97.5</td>
<td>98.8</td>
<td>71.2</td>
<td>91.4</td>
<td>95.5</td>
<td>63.0</td>
<td>86.6</td>
<td>92.9</td>
<td>69.2</td>
<td>89.9</td>
<td>96.1</td>
<td>86.1</td>
</tr>
<tr align=center>
<td align=left>OpenCLIP-XLM-R-H</td>
<td>86.1</td>
<td>97.5</td>
<td>99.2</td>
<td>71.0</td>
<td>90.5</td>
<td>94.9</td>
<td>70.0</td>
<td>91.5</td>
<td>97.0</td>
<td>66.1</td>
<td>90.8</td>
<td>96.0</td>
<td>87.6</td>
</tr>
<tr align=center>
<td align=left>InternVL-C (ours)</td>
<td>90.3</td>
<td>98.8</td>
<td>99.7</td>
<td>75.1</td>
<td>92.9</td>
<td>96.4</td>
<td>68.8</td>
<td>92.0</td>
<td>96.7</td>
<td>68.9</td>
<td>91.9</td>
<td>96.5</td>
<td>89.0</td>
</tr>
<tr align=center>
<td align=left>InternVL-G (ours)</td>
<td>92.9</td>
<td>99.4</td>
<td>99.8</td>
<td>77.7</td>
<td>94.8</td>
<td>97.3</td>
<td>71.4</td>
<td>93.9</td>
<td>97.7</td>
<td>73.8</td>
<td>94.4</td>
<td>98.1</td>
<td>90.9</td>
</tr>
</table>
- Multilingual Zero-Shot Image-Text Retrieval on XTD [\[see details\]](./clip_benchmark#xtd)
| method | EN | ES | FR | ZH | IT | KO | RU | JP | average |
| ----------------- | :--: | :--: | :--: | :--: | :--: | :--: | :--: | :--: | :-----: |
| AltCLIP | 95.4 | 94.1 | 92.9 | 95.1 | 94.2 | 94.4 | 91.8 | 91.7 | 93.7 |
| OpenCLIP-XLM-R-H | 97.3 | 96.1 | 94.5 | 94.7 | 96.0 | 90.2 | 93.9 | 94.0 | 94.6 |
| InternVL-C (ours) | 97.3 | 95.7 | 95.1 | 95.6 | 96.0 | 92.2 | 93.3 | 95.5 | 95.1 |
| InternVL-G (ours) | 98.6 | 97.7 | 96.5 | 96.7 | 96.9 | 95.1 | 94.8 | 96.1 | 96.6 |
</details>
<details>
<summary>Multimodal Dialogue</summary>
See ["Compared with SOTA VLLMs"](#compared-with-sota-vllms) section.
</details>
## Quick Start with HuggingFace
<details>
<summary>using InternViT-6B for visual feature extraction (click to expand)</summary>
```python
import torch
from PIL import Image
from transformers import AutoModel, CLIPImageProcessor
model = AutoModel.from_pretrained(
    'OpenGVLab/InternViT-6B-448px-V1-5',
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True).cuda().eval()
image = Image.open('./examples/image1.jpg').convert('RGB')
image_processor = CLIPImageProcessor.from_pretrained('OpenGVLab/InternViT-6B-448px-V1-5')
pixel_values = image_processor(images=image, return_tensors='pt').pixel_values
pixel_values = pixel_values.to(torch.bfloat16).cuda()
outputs = model(pixel_values)
```
</details>
<details>
<summary>using InternVL-C(ontrastive) and InternVL-G(enerative) for cross-modal retrieval (click to expand)</summary>
```python
import torch
from PIL import Image
from transformers import AutoModel, CLIPImageProcessor
from transformers import AutoTokenizer
model = AutoModel.from_pretrained(
    'OpenGVLab/InternVL-14B-224px',
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True).cuda().eval()
image_processor = CLIPImageProcessor.from_pretrained('OpenGVLab/InternVL-14B-224px')
tokenizer = AutoTokenizer.from_pretrained(
    'OpenGVLab/InternVL-14B-224px', use_fast=False, add_eos_token=True)
tokenizer.pad_token_id = 0  # set pad_token_id to 0
images = [
    Image.open('./examples/image1.jpg').convert('RGB'),
    Image.open('./examples/image2.jpg').convert('RGB'),
    Image.open('./examples/image3.jpg').convert('RGB')
]
prefix = 'summarize:'
texts = [
    prefix + 'a photo of a red panda',  # English
    prefix + '一张熊猫的照片',  # Chinese
    prefix + '二匹の猫の写真'  # Japanese
]
pixel_values = image_processor(images=images, return_tensors='pt').pixel_values
pixel_values = pixel_values.to(torch.bfloat16).cuda()
input_ids = tokenizer(texts, return_tensors='pt', max_length=80,
                      truncation=True, padding='max_length').input_ids.cuda()
# InternVL-C
logits_per_image, logits_per_text = model(
image=pixel_values, text=input_ids, mode='InternVL-C')
probs = logits_per_image.softmax(dim=-1)
# tensor([[9.9609e-01, 5.2185e-03, 6.0070e-08],
# [2.2949e-02, 9.7656e-01, 5.9903e-06],
# [3.2932e-06, 7.4863e-05, 1.0000e+00]], device='cuda:0',
# dtype=torch.bfloat16, grad_fn=<SoftmaxBackward0>)
# InternVL-G
logits_per_image, logits_per_text = model(
image=pixel_values, text=input_ids, mode='InternVL-G')
probs = logits_per_image.softmax(dim=-1)
# tensor([[9.9609e-01, 3.1738e-03, 3.6322e-08],
# [8.6060e-03, 9.9219e-01, 2.8759e-06],
# [1.7583e-06, 3.1233e-05, 1.0000e+00]], device='cuda:0',
# dtype=torch.bfloat16, grad_fn=<SoftmaxBackward0>)
# please set add_eos_token to False for generation
tokenizer.add_eos_token = False
image = Image.open('./examples/image1.jpg').convert('RGB')
pixel_values = image_processor(images=image, return_tensors='pt').pixel_values
pixel_values = pixel_values.to(torch.bfloat16).cuda()
tokenized = tokenizer("English caption:", return_tensors='pt')
pred = model.generate(
    pixel_values=pixel_values,
    input_ids=tokenized.input_ids.cuda(),
    attention_mask=tokenized.attention_mask.cuda(),
    num_beams=5,
    min_new_tokens=8,
)
caption = tokenizer.decode(pred[0].cpu(), skip_special_tokens=True).strip()
# English caption: a red panda sitting on top of a wooden platform
```
</details>
<details>
<summary>using InternVL-Chat for multimodal chat (click to expand)</summary>
Here, we take the smaller OpenGVLab/InternVL2-8B as an example:
```python
import numpy as np
import torch
import torchvision.transforms as T
from decord import VideoReader, cpu
from PIL import Image
from torchvision.transforms.functional import InterpolationMode
from transformers import AutoModel, AutoTokenizer
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)
def build_transform(input_size):
MEAN, STD = IMAGENET_MEAN, IMAGENET_STD
transform = T.Compose([
T.Lambda(lambda img: img.convert('RGB') if img.mode != 'RGB' else img),
T.Resize((input_size, input_size), interpolation=InterpolationMode.BICUBIC),
T.ToTensor(),
T.Normalize(mean=MEAN, std=STD)
])
return transform
def find_closest_aspect_ratio(aspect_ratio, target_ratios, width, height, image_size):
best_ratio_diff = float('inf')
best_ratio = (1, 1)
area = width * height
for ratio in target_ratios:
target_aspect_ratio = ratio[0] / ratio[1]
ratio_diff = abs(aspect_ratio - target_aspect_ratio)
if ratio_diff < best_ratio_diff:
best_ratio_diff = ratio_diff
best_ratio = ratio
elif ratio_diff == best_ratio_diff:
if area > 0.5 * image_size * image_size * ratio[0] * ratio[1]:
best_ratio = ratio
return best_ratio
def dynamic_preprocess(image, min_num=1, max_num=6, image_size=448, use_thumbnail=False):
orig_width, orig_height = image.size
aspect_ratio = orig_width / orig_height
# calculate the existing image aspect ratio
target_ratios = set(
(i, j) for n in range(min_num, max_num + 1) for i in range(1, n + 1) for j in range(1, n + 1) if
i * j <= max_num and i * j >= min_num)
target_ratios = sorted(target_ratios, key=lambda x: x[0] * x[1])
# find the closest aspect ratio to the target
target_aspect_ratio = find_closest_aspect_ratio(
aspect_ratio, target_ratios, orig_width, orig_height, image_size)
# calculate the target width and height
target_width = image_size * target_aspect_ratio[0]
target_height = image_size * target_aspect_ratio[1]
blocks = target_aspect_ratio[0] * target_aspect_ratio[1]
# resize the image
resized_img = image.resize((target_width, target_height))
processed_images = []
for i in range(blocks):
box = (
(i % (target_width // image_size)) * image_size,
(i // (target_width // image_size)) * image_size,
((i % (target_width // image_size)) + 1) * image_size,
((i // (target_width // image_size)) + 1) * image_size
)
# split the image
split_img = resized_img.crop(box)
processed_images.append(split_img)
assert len(processed_images) == blocks
if use_thumbnail and len(processed_images) != 1:
thumbnail_img = image.resize((image_size, image_size))
processed_images.append(thumbnail_img)
return processed_images
def load_image(image_file, input_size=448, max_num=6):
image = Image.open(image_file).convert('RGB')
transform = build_transform(input_size=input_size)
images = dynamic_preprocess(image, image_size=input_size, use_thumbnail=True, max_num=max_num)
pixel_values = [transform(image) for image in images]
pixel_values = torch.stack(pixel_values)
return pixel_values
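# Illustrative sanity check of the tiling logic above (not required for inference):
# a 1280x720 image has aspect ratio ~1.78, so the closest grid with at most
# max_num=6 tiles is 2x1; with use_thumbnail=True a square thumbnail is appended,
# giving 3 tiles of 448x448 in total.
_demo_tiles = dynamic_preprocess(Image.new('RGB', (1280, 720)), image_size=448, max_num=6, use_thumbnail=True)
print(len(_demo_tiles), [tile.size for tile in _demo_tiles])  # 3 [(448, 448), (448, 448), (448, 448)]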
path = 'OpenGVLab/InternVL2-8B'
model = AutoModel.from_pretrained(
path,
torch_dtype=torch.bfloat16,
low_cpu_mem_usage=True,
trust_remote_code=True).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)
# set the max number of tiles in `max_num`
pixel_values = load_image('./examples/image1.jpg', max_num=6).to(torch.bfloat16).cuda()
generation_config = dict(
num_beams=1,
max_new_tokens=1024,
do_sample=False,
)
# pure-text conversation (纯文本对话)
question = 'Hello, who are you?'
response, history = model.chat(tokenizer, None, question, generation_config, history=None, return_history=True)
print(f'User: {question}')
print(f'Assistant: {response}')
question = 'Can you tell me a story?'
response, history = model.chat(tokenizer, None, question, generation_config, history=history, return_history=True)
print(f'User: {question}')
print(f'Assistant: {response}')
# single-image single-round conversation (单图单轮对话)
question = '<image>\nPlease describe the image shortly.'
response = model.chat(tokenizer, pixel_values, question, generation_config)
print(f'User: {question}')
print(f'Assistant: {response}')
# single-image multi-round conversation (单图多轮对话)
question = '<image>\nPlease describe the image in detail.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=None, return_history=True)
print(f'User: {question}')
print(f'Assistant: {response}')
question = 'Please write a poem according to the image.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=history, return_history=True)
print(f'User: {question}')
print(f'Assistant: {response}')
# multi-image multi-round conversation, combined images (多图多轮对话,拼接图像)
pixel_values1 = load_image('./examples/image1.jpg', max_num=6).to(torch.bfloat16).cuda()
pixel_values2 = load_image('./examples/image2.jpg', max_num=6).to(torch.bfloat16).cuda()
pixel_values = torch.cat((pixel_values1, pixel_values2), dim=0)
question = '<image>\nDescribe the two images in detail.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config,
history=None, return_history=True)
question = 'What are the similarities and differences between these two images.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config,
history=history, return_history=True)
print(f'User: {question}')
print(f'Assistant: {response}')
# multi-image multi-round conversation, separate images (多图多轮对话,独立图像)
pixel_values1 = load_image('./examples/image1.jpg', max_num=6).to(torch.bfloat16).cuda()
pixel_values2 = load_image('./examples/image2.jpg', max_num=6).to(torch.bfloat16).cuda()
pixel_values = torch.cat((pixel_values1, pixel_values2), dim=0)
num_patches_list = [pixel_values1.size(0), pixel_values2.size(0)]
question = 'Image-1: <image>\nImage-2: <image>\nDescribe the two images in detail.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config,
num_patches_list=num_patches_list,
history=None, return_history=True)
print(f'User: {question}')
print(f'Assistant: {response}')
question = 'What are the similarities and differences between these two images.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config,
num_patches_list=num_patches_list,
history=history, return_history=True)
print(f'User: {question}')
print(f'Assistant: {response}')
# batch inference, single image per sample (单图批处理)
pixel_values1 = load_image('./examples/image1.jpg', max_num=6).to(torch.bfloat16).cuda()
pixel_values2 = load_image('./examples/image2.jpg', max_num=6).to(torch.bfloat16).cuda()
num_patches_list = [pixel_values1.size(0), pixel_values2.size(0)]
pixel_values = torch.cat((pixel_values1, pixel_values2), dim=0)
questions = ['<image>\nDescribe the image in detail.'] * len(num_patches_list)
responses = model.batch_chat(tokenizer, pixel_values,
num_patches_list=num_patches_list,
questions=questions,
generation_config=generation_config)
for question, response in zip(questions, responses):
print(f'User: {question}')
print(f'Assistant: {response}')
# video multi-round conversation (视频多轮对话)
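# get_index samples one frame from the midpoint of each of `num_segments` equal
# time windows between `bound` (start/end in seconds) or over the whole clip when
# bound is None; load_video then tiles every sampled frame with dynamic_preprocess
# and records each frame's tile count in num_patches_list.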
def get_index(bound, fps, max_frame, first_idx=0, num_segments=32):
if bound:
start, end = bound[0], bound[1]
else:
start, end = -100000, 100000
start_idx = max(first_idx, round(start * fps))
end_idx = min(round(end * fps), max_frame)
seg_size = float(end_idx - start_idx) / num_segments
frame_indices = np.array([
int(start_idx + (seg_size / 2) + np.round(seg_size * idx))
for idx in range(num_segments)
])
return frame_indices
def load_video(video_path, bound=None, input_size=448, max_num=1, num_segments=32):
vr = VideoReader(video_path, ctx=cpu(0), num_threads=1)
max_frame = len(vr) - 1
fps = float(vr.get_avg_fps())
pixel_values_list, num_patches_list = [], []
transform = build_transform(input_size=input_size)
frame_indices = get_index(bound, fps, max_frame, first_idx=0, num_segments=num_segments)
for frame_index in frame_indices:
img = Image.fromarray(vr[frame_index].asnumpy()).convert('RGB')
img = dynamic_preprocess(img, image_size=input_size, use_thumbnail=True, max_num=max_num)
pixel_values = [transform(tile) for tile in img]
pixel_values = torch.stack(pixel_values)
num_patches_list.append(pixel_values.shape[0])
pixel_values_list.append(pixel_values)
pixel_values = torch.cat(pixel_values_list)
return pixel_values, num_patches_list
video_path = './examples/red-panda.mp4'
# pixel_values, num_patches_list = load_video(video_path, num_segments=32, max_num=1)
pixel_values, num_patches_list = load_video(video_path, num_segments=8, max_num=1)
pixel_values = pixel_values.to(torch.bfloat16).cuda()
video_prefix = ''.join([f'Frame{i+1}: <image>\n' for i in range(len(num_patches_list))])
question = video_prefix + 'What is the red panda doing?'
# Frame1: <image>\nFrame2: <image>\n...\nFrame8: <image>\n{question}
response, history = model.chat(tokenizer, pixel_values, question, generation_config,
num_patches_list=num_patches_list,
history=None, return_history=True)
print(f'User: {question}')
print(f'Assistant: {response}')
question = 'Describe this video in detail. Don\'t repeat.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config,
num_patches_list=num_patches_list,
history=history, return_history=True)
print(f'User: {question}')
print(f'Assistant: {response}')
```
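For longer interactive sessions, the repeated `model.chat` calls above can be wrapped in a small helper that threads `history` through each turn. This is only an illustrative sketch: it assumes the `model`, `tokenizer`, `generation_config`, and `load_image` objects created in the example above are still in scope, and it simply forwards the optional `num_patches_list` argument used there for multi-image and video inputs.
```python
def chat_turn(question, history=None, pixel_values=None, num_patches_list=None):
    """Run one conversation turn and return (response, updated history)."""
    response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                                   num_patches_list=num_patches_list,
                                   history=history, return_history=True)
    print(f'User: {question}')
    print(f'Assistant: {response}')
    return response, history

# Multi-turn usage with a freshly loaded single image.
pixel_values = load_image('./examples/image1.jpg', max_num=6).to(torch.bfloat16).cuda()
_, hist = chat_turn('<image>\nPlease describe the image shortly.', pixel_values=pixel_values)
_, hist = chat_turn('What stands out most in this image?', history=hist, pixel_values=pixel_values)
```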
</details>
## License
This project is released under the [MIT license](LICENSE). Parts of this project contain code and models from other sources, which are subject to their respective licenses.
## Citation
If you find this project useful in your research, please consider citing:
```BibTeX
@article{chen2023internvl,
title={InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks},
author={Chen, Zhe and Wu, Jiannan and Wang, Wenhai and Su, Weijie and Chen, Guo and Xing, Sen and Zhong, Muyan and Zhang, Qinglong and Zhu, Xizhou and Lu, Lewei and Li, Bin and Luo, Ping and Lu, Tong and Qiao, Yu and Dai, Jifeng},
journal={arXiv preprint arXiv:2312.14238},
year={2023}
}
@article{chen2024far,
title={How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites},
author={Chen, Zhe and Wang, Weiyun and Tian, Hao and Ye, Shenglong and Gao, Zhangwei and Cui, Erfei and Tong, Wenwen and Hu, Kongzhi and Luo, Jiapeng and Ma, Zheng and others},
journal={arXiv preprint arXiv:2404.16821},
year={2024}
}
```
## Acknowledgement
InternVL is built with reference to the code of the following projects: [OpenAI CLIP](https://github.com/openai/CLIP), [Open CLIP](https://github.com/mlfoundations/open_clip), [CLIP Benchmark](https://github.com/LAION-AI/CLIP_benchmark), [EVA](https://github.com/baaivision/EVA/tree/master), [InternImage](https://github.com/OpenGVLab/InternImage), [ViT-Adapter](https://github.com/czczup/ViT-Adapter), [MMSegmentation](https://github.com/open-mmlab/mmsegmentation), [Transformers](https://github.com/huggingface/transformers), [DINOv2](https://github.com/facebookresearch/dinov2), [BLIP-2](https://github.com/salesforce/LAVIS/tree/main/projects/blip2), [Qwen-VL](https://github.com/QwenLM/Qwen-VL/tree/master/eval_mm), and [LLaVA-1.5](https://github.com/haotian-liu/LLaVA). Thanks for their awesome work!
______________________________________________________________________
If you want to join our WeChat group, please scan the QR code below to add our assistant as a WeChat friend:
<p align="center"><img width="300" alt="image" src="https://github.com/OpenGVLab/DragGAN/assets/26198430/e3f0807f-956a-474e-8fd2-1f7c22d73997"></p>
<div align="center">
# <img width="60" alt="image" src="https://github.com/OpenGVLab/InternVL/assets/47669167/7037290e-f474-4d11-b90f-1d8316087bf8"> InternVL家族:通过开源组件缩小与商业多模态模型的差距 —— GPT-4o的开源替代方案
[\[🆕 博客\]](https://internvl.github.io/blog/) [\[🚀 InternVL2 博客\]](https://internvl.github.io/blog/2024-07-02-InternVL-2.0/) [\[🤗 HF 对话Demo\]](https://huggingface.co/spaces/OpenGVLab/InternVL) [\[🗨️ 对话Demo\]](https://internvl.opengvlab.com/) [\[🌐 API\]](./document/How_to_use_InternVL_API.md) [\[🚀 快速开始\]](#使用-huggingface-快速开始)
[\[📜 InternVL 1.0 论文\]](https://arxiv.org/abs/2312.14238) [\[📜 InternVL 1.5 报告\]](https://arxiv.org/abs/2404.16821) [\[📖 1.0 中文解读\]](https://zhuanlan.zhihu.com/p/702946079) [\[📖 1.5 中文解读\]](https://zhuanlan.zhihu.com/p/699439759) [\[📖 2.0 中文解读\]](https://zhuanlan.zhihu.com/p/706547971)
[Switch to the English version (切换至英文版)](/README.md)
<a href="https://trendshift.io/repositories/9803" target="_blank"><img src="https://trendshift.io/api/badge/repositories/9803" alt="OpenGVLab%2FInternVL | Trendshift" style="width: 250px; height: 55px;" width="250" height="55"/></a>
<img height="55" alt="image" src="https://github.com/user-attachments/assets/bd62ab46-f0ea-40c6-ab10-7fde671716cc">
![opencompass](https://github.com/user-attachments/assets/7ce93c05-84ae-4997-a480-53897d1d3a1c)
</div>
## 最新消息 🚀🚀🚀
- `2024/07/18`: 🔥🔥 InternVL2-40B 在 [Video-MME](https://github.com/BradyFU/Video-MME) 数据集中实现了开源模型中的 SOTA 性能,当输入 16 帧时得分为 61.2,输入 32 帧时得分为 64.4,大幅领先其它开源模型,是最接近 GPT-4o mini 的开源模型。
- `2024/07/18`: 🔥 InternVL2-Pro 在 [DocVQA](https://rrc.cvc.uab.es/?ch=17&com=evaluation&task=1) 和 [InfoVQA](https://rrc.cvc.uab.es/?ch=17&com=evaluation&task=3) 的基准测试中实现了 SOTA 性能。
- `2024/07/04`: 🚀 我们发布了 InternVL2 系列模型。InternVL2-Pro 在 MMMU 基准测试中达到了 62.0% 的准确率,实现了与 GPT-4o 等领先闭源商业模型比肩的性能。该模型的免费 API 可以通过填写 ([英文申请表](https://docs.google.com/forms/d/e/1FAIpQLSfMCzhPr1OOEKau_6jwTU0EiZMSFckDo-HMlc_hUudhF_97rw/viewform?usp=sf_link)) / ([中文申请表](https://wj.qq.com/s2/14910502/25a4/)) 来申请。其它模型可在 [HF 链接](https://huggingface.co/collections/OpenGVLab/internvl-20-667d3961ab5eb12c7ed1463e) 中下载。
- `2024/06/19`: 我们提出了 Needle In A Multimodal Haystack ([MM-NIAH](https://github.com/OpenGVLab/MM-NIAH)),这是第一个针对模型关于长多模态文档理解能力的评测基准。
- `2024/05/30`: 我们发布了 [ShareGPT-4o](https://sharegpt4o.github.io/),这是一个大规模、高质量的多模态数据集。我们计划开源一批使用 GPT-4o 精心标注的数据,包括 200K 条图像详细描述、10K 条视频详细描述,以及 10K 条音频详细描述。
- `2024/05/29`: 我们开源了 Mini-InternVL 系列,包括以下两个对话模型:[Mini-InternVL-Chat-2B-V1-5](https://huggingface.co/OpenGVLab/Mini-InternVL-Chat-2B-V1-5) 和 [Mini-InternVL-Chat-4B-V1-5](https://huggingface.co/OpenGVLab/Mini-InternVL-Chat-4B-V1-5)。这些模型在极小的尺寸下实现了令人印象深刻的性能:2B 模型以 8% 的模型尺寸实现了 80% 的性能,4B 模型以 16% 的模型尺寸实现了 90% 的性能。更多细节请查看我们的[博客](https://internvl.github.io/blog/2024-05-25-Mini-InternVL-1.5/)。
- `2024/05/28`: 感谢 [lmdeploy](https://github.com/InternLM/lmdeploy) 团队提供的 AWQ 量化支持。InternVL 1.5 的 4-bit 模型发布在 [OpenGVLab/InternVL-Chat-V1-5-AWQ](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-5-AWQ)
- `2024/05/13`: InternVL 1.0 现在可以作为扩散模型的 [文本编码器](https://huggingface.co/OpenGVLab/InternVL-14B-224px),支持全球超过 110 种语言的多语言生成。详情请看 [MuLan](https://github.com/mulanai/MuLan)
- `2024/04/18`: InternVL-Chat-V1-5 已经在 [HuggingFace](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-5) 发布,在 MMMU、DocVQA、ChartQA、MathVista 等各种基准测试中,性能接近 GPT-4V 和 Gemini Pro。
- `2024/02/27`: InternVL 已被 CVPR 2024 (Oral) 接收!🎉
- `2024/02/24`: InternVL-Chat 系列模型已经接入 [VLMEvalKit](https://github.com/open-compass/VLMEvalKit) 评测框架。
- `2024/02/21`: [InternVL-Chat-V1-2-Plus](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-2-Plus) 在 MathVista(59.9)、MMBench(83.8)和 MMVP(58.7)上实现了 SOTA 性能。详情请看我们的[博客](https://internvl.github.io/blog/2024-02-21-InternVL-1.2/)
- `2024/02/12`: InternVL-Chat-V1-2 已经发布,它在 MMMU 验证集上达到了 51.6,在 MMBench 测试集上达到了 82.3。 更多信息请参考我们的[博客](https://internvl.github.io/blog/2024-02-21-InternVL-1.2/)以及 [SFT 数据](./internvl_chat#prepare-training-datasets)。该模型已经在 [HuggingFace](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-2) 发布,训练、测评的数据和脚本均已开源。
- `2024/01/24`: InternVL-Chat-V1-1 已经发布,它支持中文对话,并具备强大的 OCR 能力,详情请看[这里](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-1)
- `2024/01/16`: 我们发布了 [定制的 mmcv/mmsegmentation/mmdetection 代码库](https://github.com/OpenGVLab/InternVL-MMDetSeg),集成了 DeepSpeed,可以用于训练检测和分割大模型。
## TODO 列表
- [ ] 支持 vLLM 和 Ollama
- [ ] 使用 readthedocs 重新构建文档
- [x] 支持使用 LoRA 微调不同的 LLMs
- [ ] 在 Demo 中支持视频和 PDF 输入
- [ ] 发布集成 VisionLLMv2 的 InternVL2
- [x] 发布 InternVL2 的 `requirements.txt`
- [x] 发布 InternVL2 系列的训练 / 评估代码
- [x] 发布 InternVL1.5 和 InternVL2 的 Streamlit 网页 UI
## 使用文档
- 安装
- 如何搭建运行环境? [\[link\]](./INSTALLATION.md) [\[requirements.txt\]](./requirements.txt)
- 训练或微调
- 如何复现 InternVL-Chat-V1-2 的SFT阶段? [\[link\]](./internvl_chat#start-training)
- 如何在自定义数据集上微调 InternVL-Chat-V1-2? [\[link\]](./document/How_to_finetune_internvl_chat_v1_2_on_a_custom_dataset.md)
- 如何在自定义数据集上微调 Mini-InternVL-Chat 系列? [\[link\]](./document/How_to_finetune_mini_internvl_chat_v1_5_on_a_custom_dataset.md)
- Benchmark 测评
> 由于此代码库与 VLMEvalKit 之间存在细微的实现差异,在测试同一模型时,性能指标可能会出现轻微差异。
- 如何评测 InternVL-Chat-V1-5? [\[link\]](./document/How_to_evaluate_internvl_chat_v1_5.md)
- 如何使用 VLMEvalKit 评测 InternVL-Chat-V1-5? (推荐) [\[link\]](./document/How_to_evaluate_internvl_chat_v1_5_using_vlmevalkit.md)
- 如何使用 VLMEvalKit 评测 Mini-InternVL-Chat-2B-V1-5? (推荐) [\[link\]](./document/How_to_evaluate_mini_internvl_chat_2b_v1_5_using_vlmevalkit.md)
- 如何使用 VLMEvalKit 评测 Mini-InternVL-Chat-4B-V1-5? (推荐) [\[link\]](./document/How_to_evaluate_mini_internvl_chat_4b_v1_5_using_vlmevalkit.md)
- 模型部署
- 如何使用 InternVL API? [\[link\]](./document/How_to_use_InternVL_API.md)
- 如何部署本地的 demo? [\[link\]](./document/How_to_deploy_a_local_demo.md)
- 如何用 Nvidia V100 GPU 运行 InternVL-1.5 8bit? [\[link\]](https://github.com/OpenGVLab/InternVL/issues/144) [\[中文教程\]](https://zhuanlan.zhihu.com/p/697188143)
- 如何进行批量推理? [\[link\]](./README.md?plain=1#L849)
## 和 SOTA 多模态大模型对比
![waic_performance](https://github.com/user-attachments/assets/7b24ad6c-45dd-4bcd-aa77-79da1ec856ee)
## 模型库
#### 多模态大语言模型 (InternVL 2.0)
<table>
<tr>
<th>Model Name</th>
<th>Vision Part</th>
<th>Language Part</th>
<th>HF&nbsp;Link</th>
<th>MS&nbsp;Link</th>
<th>Document</th>
</tr>
<tr>
<td>InternVL2&#8209;1B</td>
<td><a href="https://huggingface.co/OpenGVLab/InternViT-300M-448px">InternViT&#8209;300M&#8209;448px</a></td>
<td><a href="https://huggingface.co/Qwen/Qwen2-0.5B-Instruct">Qwen2&#8209;0.5B&#8209;Instruct</a></td>
<td><a href="https://huggingface.co/OpenGVLab/InternVL2-1B">🤗 link</a></td>
<td><a href="https://modelscope.cn/models/OpenGVLab/InternVL2-1B">🤖 link</a></td>
<td><a href="https://huggingface.co/OpenGVLab/InternVL2-1B#quick-start">📖 doc</a></td>
</tr>
<tr>
<td>InternVL2&#8209;2B</td>
<td><a href="https://huggingface.co/OpenGVLab/InternViT-300M-448px">InternViT&#8209;300M&#8209;448px</a></td>
<td><a href="https://huggingface.co/internlm/internlm2-chat-1_8b">internlm2&#8209;chat&#8209;1&#8209;8b</a></td>
<td><a href="https://huggingface.co/OpenGVLab/InternVL2-2B">🤗 link</a></td>
<td><a href="https://modelscope.cn/models/OpenGVLab/InternVL2-2B">🤖 link</a></td>
<td><a href="https://huggingface.co/OpenGVLab/InternVL2-2B#quick-start">📖 doc</a></td>
</tr>
<tr>
<td>InternVL2&#8209;4B</td>
<td><a href="https://huggingface.co/OpenGVLab/InternViT-300M-448px">InternViT&#8209;300M&#8209;448px</a></td>
<td><a href="https://huggingface.co/microsoft/Phi-3-mini-128k-instruct">Phi&#8209;3&#8209;mini&#8209;128k&#8209;instruct</a></td>
<td><a href="https://huggingface.co/OpenGVLab/InternVL2-4B">🤗 link</a></td>
<td><a href="https://modelscope.cn/models/OpenGVLab/InternVL2-4B">🤖 link</a></td>
<td><a href="https://huggingface.co/OpenGVLab/InternVL2-4B#quick-start">📖 doc</a></td>
</tr>
<tr>
<td>InternVL2&#8209;8B</td>
<td><a href="https://huggingface.co/OpenGVLab/InternViT-300M-448px">InternViT&#8209;300M&#8209;448px</a></td>
<td><a href="https://huggingface.co/internlm/internlm2_5-7b-chat">internlm2_5&#8209;7b&#8209;chat</a></td>
<td><a href="https://huggingface.co/OpenGVLab/InternVL2-8B">🤗 link</a></td>
<td><a href="https://modelscope.cn/models/OpenGVLab/InternVL2-8B">🤖 link</a></td>
<td><a href="https://huggingface.co/OpenGVLab/InternVL2-8B#quick-start">📖 doc</a></td>
</tr>
<tr>
<td>InternVL2&#8209;26B</td>
<td><a href="https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-5">InternViT&#8209;6B&#8209;448px&#8209;V1&#8209;5</a></td>
<td><a href="https://huggingface.co/internlm/internlm2-chat-20b">internlm2&#8209;chat&#8209;20b</a></td>
<td><a href="https://huggingface.co/OpenGVLab/InternVL2-26B">🤗 link</a></td>
<td><a href="https://modelscope.cn/models/OpenGVLab/InternVL2-26B">🤖 link</a></td>
<td><a href="https://huggingface.co/OpenGVLab/InternVL2-26B#quick-start">📖 doc</a></td>
</tr>
<tr>
<td>InternVL2&#8209;40B</td>
<td><a href="https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-5">InternViT&#8209;6B&#8209;448px&#8209;V1&#8209;5</a></td>
<td><a href="https://huggingface.co/NousResearch/Nous-Hermes-2-Yi-34B">Nous&#8209;Hermes&#8209;2&#8209;Yi&#8209;34B</a></td>
<td><a href="https://huggingface.co/OpenGVLab/InternVL2-40B">🤗 link</a></td>
<td><a href="https://modelscope.cn/models/OpenGVLab/InternVL2-40B">🤖 link</a></td>
<td><a href="https://huggingface.co/OpenGVLab/InternVL2-40B#quick-start">📖 doc</a></td>
</tr>
<tr>
<td>InternVL2-Llama3-76B</td>
<td><a href="https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-5">InternViT&#8209;6B&#8209;448px&#8209;V1&#8209;5</a></td>
<td><a href="https://huggingface.co/NousResearch/Hermes-2-Theta-Llama-3-70B">Hermes‑2‑Theta‑<br>Llama‑3‑70B</a></td>
<td><a href="https://huggingface.co/OpenGVLab/InternVL2-Llama3-76B">🤗 link</a></td>
<td><a href="https://modelscope.cn/models/OpenGVLab/InternVL2-Llama3-76B">🤖 link</a></td>
<td><a href="https://huggingface.co/OpenGVLab/InternVL2-Llama3-76B#quick-start">📖 doc</a></td>
</tr>
</table>
#### InternVL2-Pro API
我们诚挚邀请大家将 InternVL2-Pro 的 API 用于学术研究。为了更好地管理,请提交[英文申请表](https://docs.google.com/forms/d/e/1FAIpQLSfMCzhPr1OOEKau_6jwTU0EiZMSFckDo-HMlc_hUudhF_97rw/viewform?usp=sf_link)/[中文申请表](https://wj.qq.com/s2/14910502/25a4/)以获得免费 API 访问权限。
#### 多模态大语言模型 (InternVL 1.0-1.5)
<table>
<tr>
<th>Model</th>
<th>Date</th>
<th>HF&nbsp;Link</th>
<th>MS&nbsp;Link</th>
<th>Note</th>
</tr>
<tr>
<td>Mini&#8209;InternVL&#8209;Chat&#8209;4B&#8209;V1&#8209;5</td>
<td>2024.05.28</td>
<td><a href="https://huggingface.co/OpenGVLab/Mini-InternVL-Chat-4B-V1-5">🤗 link</a></td>
<td><a href="https://modelscope.cn/models/OpenGVLab/Mini-InternVL-Chat-4B-V1-5">🤖 link</a></td>
<td>🚀🚀 16% 的模型大小, 90% 的性能</td>
</tr>
<tr>
<td>Mini&#8209;InternVL&#8209;Chat&#8209;2B&#8209;V1&#8209;5</td>
<td>2024.05.19</td>
<td><a href="https://huggingface.co/OpenGVLab/Mini-InternVL-Chat-2B-V1-5">🤗 link</a></td>
<td><a href="https://modelscope.cn/models/OpenGVLab/Mini-InternVL-Chat-2B-V1-5">🤖 link</a></td>
<td>🚀 8% 的模型大小, 80% 的性能</td>
</tr>
<tr>
<td>InternVL&#8209;Chat&#8209;V1&#8209;5</td>
<td>2024.04.18</td>
<td><a href="https://huggingface.co/OpenGVLab/InternVL-Chat-V1-5">🤗 link</a></td>
<td><a href="https://modelscope.cn/models/OpenGVLab/InternVL-Chat-V1-5">🤖 link</a></td>
<td>支持 4K 图像;超强的 OCR 能力;在 MMMU、DocVQA、ChartQA、MathVista 等各种基准测试中,性能接近 GPT-4V 和 Gemini Pro</td>
</tr>
<tr>
<td>InternVL&#8209;Chat&#8209;V1&#8209;2&#8209;Plus</td>
<td>2024.02.21</td>
<td><a href="https://huggingface.co/OpenGVLab/InternVL-Chat-V1-2-Plus">🤗 link</a></td>
<td><a href="https://modelscope.cn/models/OpenGVLab/InternVL-Chat-V1-2-Plus">🤖 link</a></td>
<td>更多的 SFT 数据和更强的性能</td>
</tr>
<tr>
<td>InternVL&#8209;Chat&#8209;V1&#8209;2</td>
<td>2024.02.11</td>
<td><a href="https://huggingface.co/OpenGVLab/InternVL-Chat-V1-2">🤗 link</a></td>
<td><a href="https://modelscope.cn/models/OpenGVLab/InternVL-Chat-V1-2">🤖 link</a></td>
<td>将 LLM 扩展到 34B</td>
</tr>
<tr>
<td>InternVL&#8209;Chat&#8209;V1&#8209;1</td>
<td>2024.01.24</td>
<td><a href="https://huggingface.co/OpenGVLab/InternVL-Chat-V1-1">🤗 link</a></td>
<td><a href="https://modelscope.cn/models/OpenGVLab/InternVL-Chat-V1-1">🤖 link</a></td>
<td>支持中文和更强的 OCR 能力</td>
</tr>
<tr>
<td>InternVL&#8209;Chat&#8209;19B</td>
<td>2023.12.25</td>
<td><a href="https://huggingface.co/OpenGVLab/InternVL-Chat-ViT-6B-Vicuna-13B">🤗 link</a></td>
<td><a href="https://modelscope.cn/models/OpenGVLab/InternVL-Chat-ViT-6B-Vicuna-13B">🤖 link</a></td>
<td>英语多模态对话</td>
</tr>
<tr>
<td>InternVL&#8209;Chat&#8209;13B</td>
<td>2023.12.25</td>
<td><a href="https://huggingface.co/OpenGVLab/InternVL-Chat-ViT-6B-Vicuna-7B">🤗 link</a></td>
<td><a href="https://modelscope.cn/models/OpenGVLab/InternVL-Chat-ViT-6B-Vicuna-7B">🤖 link</a></td>
<td>英语多模态对话</td>
</tr>
</table>
#### 视觉基础模型 (InternVL 1.0-1.5)
<table>
<tr>
<th>Model</th>
<th>Date</th>
<th>HF&nbsp;Link</th>
<th>MS&nbsp;Link</th>
<th>Note</th>
</tr>
<tr>
<td>InternViT&#8209;300M&#8209;448px</td>
<td>2024.05.25</td>
<td><a href="https://huggingface.co/OpenGVLab/InternViT-300M-448px">🤗 link</a></td>
<td><a href="https://modelscope.cn/models/OpenGVLab/InternViT-300M-448px">🤖 link</a></td>
<td>蒸馏的小型视觉基础模型,具有 300M 参数(🔥新)</td>
</tr>
<tr>
<td>InternViT&#8209;6B&#8209;448px&#8209;V1&#8209;5</td>
<td>2024.04.20</td>
<td><a href="https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-5">🤗 link</a></td>
<td><a href="https://modelscope.cn/models/OpenGVLab/InternViT-6B-448px-V1-5">🤖 link</a></td>
<td>通过增量预训练支持动态分辨率和超强的 OCR 特征提取能力(🔥新)</td>
</tr>
<tr>
<td>InternViT&#8209;6B&#8209;448px&#8209;V1&#8209;2</td>
<td>2024.02.11</td>
<td><a href="https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-2">🤗 link</a></td>
<td><a href="https://modelscope.cn/models/OpenGVLab/InternViT-6B-448px-V1-2">🤖 link</a></td>
<td>通过增量预训练支持 448 分辨率</td>
</tr>
<tr>
<td>InternViT&#8209;6B&#8209;448px&#8209;V1&#8209;0</td>
<td>2024.01.30</td>
<td><a href="https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-0">🤗 link</a></td>
<td><a href="https://modelscope.cn/models/OpenGVLab/InternViT-6B-448px-V1-0">🤖 link</a></td>
<td>通过增量预训练支持 448 分辨率</td>
</tr>
<tr>
<td>InternViT&#8209;6B&#8209;224px</td>
<td>2023.12.22</td>
<td><a href="https://huggingface.co/OpenGVLab/InternViT-6B-224px">🤗 link</a></td>
<td><a href="https://modelscope.cn/models/OpenGVLab/InternViT-6B-224px">🤖 link</a></td>
<td>InternViT-6B 的第一个版本,提取自 InternVL‑14B‑224px</td>
</tr>
</table>
#### 视觉语言基础模型 (InternVL 1.0)
<table>
<tr>
<th>Model</th>
<th>Date</th>
<th>HF&nbsp;Link</th>
<th>MS&nbsp;Link</th>
<th>Note</th>
</tr>
<tr>
<td>InternVL&#8209;14B&#8209;224px</td>
<td>2023.12.22</td>
<td><a href="https://huggingface.co/OpenGVLab/InternVL-14B-224px">🤗 link</a></td>
<td><a href="https://modelscope.cn/models/OpenGVLab/InternVL-14B-224px">🤖 link</a></td>
<td>视觉-语言基础模型,InternViT-6B + QLLaMA,可以用于类似 CLIP 的图文检索</td>
</tr>
</table>
## InternVL 可以做什么?
<details>
<summary>视觉感知 (点击展开)</summary>
- 线性探针图像分类 [\[查看详情\]](./classification#-evaluation)
ViT-22B uses the private JFT-3B dataset.
| method | #param | IN-1K | IN-ReaL | IN-V2 | IN-A | IN-R | IN-Sketch |
| ------------------- | :----: | :---: | :-----: | :---: | :--: | :--: | :-------: |
| OpenCLIP-G | 1.8B | 86.2 | 89.4 | 77.2 | 63.8 | 87.8 | 66.4 |
| DINOv2-g | 1.1B | 86.5 | 89.6 | 78.4 | 75.9 | 78.8 | 62.5 |
| EVA-01-CLIP-g | 1.1B | 86.5 | 89.3 | 77.4 | 70.5 | 87.7 | 63.1 |
| MAWS-ViT-6.5B | 6.5B | 87.8 | - | - | - | - | - |
| ViT-22B\* | 21.7B | 89.5 | 90.9 | 83.2 | 83.8 | 87.4 | - |
| InternViT-6B (ours) | 5.9B | 88.2 | 90.4 | 79.9 | 77.5 | 89.8 | 69.1 |
- 语义分割 [\[查看详情\]](./segmentation#-evaluation)
| method | decoder | #param (train/total) | crop size | mIoU |
| --------------------- | :-----: | :------------------: | :-------: | ------------ |
| OpenCLIP-G (frozen) | Linear | 0.3M / 1.8B | 512 | 39.3 |
| ViT-22B (frozen) | Linear | 0.9M / 21.7B | 504 | 34.6 |
| InternViT-6B (frozen) | Linear | 0.5M / 5.9B | 504 | 47.2 (+12.6) |
| ViT-22B (frozen) | UperNet | 0.8B / 22.5B | 504 | 52.7 |
| InternViT-6B (frozen) | UperNet | 0.4B / 6.3B | 504 | 54.9 (+2.2) |
| ViT-22B | UperNet | 22.5B / 22.5B | 504 | 55.3 |
| InternViT-6B | UperNet | 6.3B / 6.3B | 504 | 58.9 (+3.6) |
- 零样本图像分类 [\[查看详情\]](./clip_benchmark#imagenet-variants-and-objectnet)
| method | IN-1K | IN-A | IN-R | IN-V2 | IN-Sketch | ObjectNet |
| ----------------- | :---: | :--: | :--: | :---: | :-------: | :-------: |
| OpenCLIP-G | 80.1 | 69.3 | 92.1 | 73.6 | 68.9 | 73.0 |
| EVA-02-CLIP-E+ | 82.0 | 82.1 | 94.5 | 75.7 | 71.6 | 79.6 |
| ViT-22B\* | 85.9 | 90.1 | 96.0 | 80.9 | - | 87.6 |
| InternVL-C (ours) | 83.2 | 83.8 | 95.5 | 77.3 | 73.9 | 80.6 |
- 多语言零样本图像分类 [\[查看详情\]](./clip_benchmark#multilingual-imagenet-1k)
EN: English, ZH: Chinese, JP: Japanese, AR: Arabic, IT: Italian
| method | IN-1K (EN) | IN-1K (ZH) | IN-1K (JP) | IN-1K (AR) | IN-1K (IT) |
| ----------------- | :--------: | :--------: | :--------: | :--------: | :--------: |
| Taiyi-CLIP-ViT-H | - | 54.4 | - | - | - |
| WuKong-ViT-L-G | - | 57.5 | - | - | - |
| CN-CLIP-ViT-H | - | 59.6 | - | - | - |
| AltCLIP-ViT-L | 74.5 | 59.6 | - | - | - |
| EVA-02-CLIP-E+ | 82.0 | - | - | - | 41.2 |
| OpenCLIP-XLM-R-H | 77.0 | 55.7 | 53.1 | 37.0 | 56.8 |
| InternVL-C (ours) | 83.2 | 64.5 | 61.5 | 44.9 | 65.7 |
- 零样本视频分类
| method | #frame | K400 | K600 | K700 |
| ----------------- | :----: | :--: | :--: | :--: |
| OpenCLIP-G | 1 | 65.9 | 66.1 | 59.2 |
| EVA-02-CLIP-E+ | 1 | 69.8 | 69.3 | 63.4 |
| InternVL-C (ours) | 1 | 71.0 | 71.3 | 65.7 |
| ViCLIP | 8 | 75.7 | 73.5 | 66.4 |
| InternVL-C (ours) | 8 | 79.4 | 78.8 | 71.5 |
</details>
<details>
<summary>跨模态检索 (点击展开)</summary>
- 英语零样本图文检索 [\[查看详情\]](./clip_benchmark#flickr30k--coco)
<table>
<tr align=center>
<td rowspan="3" align=left><b>model</b></td>
<td colspan="6" align=center><b>Flickr30K</b></td>
<td colspan="6" align=center><b>COCO</b></td>
<td rowspan="3" align=center><b>avg</b></td>
</tr>
<tr align=center>
<td colspan="3" align=center><b>image-to-text</b></td>
<td colspan="3" align=center><b>text-to-image</b></td>
<td colspan="3" align=center><b>image-to-text</b></td>
<td colspan="3" align=center><b>text-to-image</b></td>
</tr>
<tr>
<td>R@1</td>
<td>R@5</td>
<td>R@10</td>
<td>R@1</td>
<td>R@5</td>
<td>R@10</td>
<td>R@1</td>
<td>R@5</td>
<td>R@10</td>
<td>R@1</td>
<td>R@5</td>
<td>R@10</td>
</tr>
<tr align=center>
<td align=left>OpenCLIP-G</td>
<td>92.9</td>
<td>99.3</td>
<td>99.8</td>
<td>79.5</td>
<td>95.0</td>
<td>97.1</td>
<td>67.3</td>
<td>86.9</td>
<td>92.6</td>
<td>51.4</td>
<td>74.9</td>
<td>83.0</td>
<td>85.0</td>
</tr>
<tr align=center>
<td align=left>EVA-02-CLIP-E+</td>
<td>93.9</td>
<td>99.4</td>
<td>99.8</td>
<td>78.8</td>
<td>94.2</td>
<td>96.8</td>
<td>68.8</td>
<td>87.8</td>
<td>92.8</td>
<td>51.1</td>
<td>75.0</td>
<td>82.7</td>
<td>85.1</td>
</tr>
<tr align=center>
<td align=left>EVA-CLIP-8B</td>
<td>95.6</td>
<td>99.6</td>
<td>99.9</td>
<td>80.8</td>
<td>95.5</td>
<td>97.6</td>
<td>70.3</td>
<td>89.3</td>
<td>93.9</td>
<td>53.0</td>
<td>76.0</td>
<td>83.4</td>
<td>86.2</td>
</tr>
<tr align=center>
<td align=left>InternVL-C (ours)</td>
<td>94.7</td>
<td>99.6</td>
<td>99.9</td>
<td>81.7</td>
<td>96.0</td>
<td>98.2</td>
<td>70.6</td>
<td>89.0</td>
<td>93.5</td>
<td>54.1</td>
<td>77.3</td>
<td>84.6</td>
<td>86.6</td>
</tr>
<tr align=center>
<td align=left>InternVL-G (ours)</td>
<td>95.7</td>
<td>99.7</td>
<td>99.9</td>
<td>85.0</td>
<td>97.0</td>
<td>98.6</td>
<td>74.9</td>
<td>91.3</td>
<td>95.2</td>
<td>58.6</td>
<td>81.3</td>
<td>88.0</td>
<td>88.8</td>
</tr>
</table>
- 中文零样本图文检索 [\[查看详情\]](./clip_benchmark#flickr30k-cn--coco-cn)
<table>
<tr align=center>
<td rowspan="3" align=left><b>model</b></td>
<td colspan="6" align=center><b>Flickr30K-CN</b></td>
<td colspan="6" align=center><b>COCO-CN</b></td>
<td rowspan="3" align=center><b>avg</b></td>
</tr>
<tr align=center>
<td colspan="3" align=center><b>image-to-text</b></td>
<td colspan="3" align=center><b>text-to-image</b></td>
<td colspan="3" align=center><b>image-to-text</b></td>
<td colspan="3" align=center><b>text-to-image</b></td>
</tr>
<tr>
<td>R@1</td>
<td>R@5</td>
<td>R@10</td>
<td>R@1</td>
<td>R@5</td>
<td>R@10</td>
<td>R@1</td>
<td>R@5</td>
<td>R@10</td>
<td>R@1</td>
<td>R@5</td>
<td>R@10</td>
</tr>
<tr align=center>
<td align=left>CN-CLIP-ViT-H</td>
<td>81.6</td>
<td>97.5</td>
<td>98.8</td>
<td>71.2</td>
<td>91.4</td>
<td>95.5</td>
<td>63.0</td>
<td>86.6</td>
<td>92.9</td>
<td>69.2</td>
<td>89.9</td>
<td>96.1</td>
<td>86.1</td>
</tr>
<tr align=center>
<td align=left>OpenCLIP-XLM-R-H</td>
<td>86.1</td>
<td>97.5</td>
<td>99.2</td>
<td>71.0</td>
<td>90.5</td>
<td>94.9</td>
<td>70.0</td>
<td>91.5</td>
<td>97.0</td>
<td>66.1</td>
<td>90.8</td>
<td>96.0</td>
<td>87.6</td>
</tr>
<tr align=center>
<td align=left>InternVL-C (ours)</td>
<td>90.3</td>
<td>98.8</td>
<td>99.7</td>
<td>75.1</td>
<td>92.9</td>
<td>96.4</td>
<td>68.8</td>
<td>92.0</td>
<td>96.7</td>
<td>68.9</td>
<td>91.9</td>
<td>96.5</td>
<td>89.0</td>
</tr>
<tr align=center>
<td align=left>InternVL-G (ours)</td>
<td>92.9</td>
<td>99.4</td>
<td>99.8</td>
<td>77.7</td>
<td>94.8</td>
<td>97.3</td>
<td>71.4</td>
<td>93.9</td>
<td>97.7</td>
<td>73.8</td>
<td>94.4</td>
<td>98.1</td>
<td>90.9</td>
</tr>
</table>
- 多语言零样本图文对检索 [\[查看详情\]](./clip_benchmark#xtd)
| method | EN | ES | FR | ZH | IT | KO | RU | JP | average |
| ----------------- | :--: | :--: | :--: | :--: | :--: | :--: | :--: | :--: | :-----: |
| AltCLIP | 95.4 | 94.1 | 92.9 | 95.1 | 94.2 | 94.4 | 91.8 | 91.7 | 93.7 |
| OpenCLIP-XLM-R-H | 97.3 | 96.1 | 94.5 | 94.7 | 96.0 | 90.2 | 93.9 | 94.0 | 94.6 |
| InternVL-C (ours) | 97.3 | 95.7 | 95.1 | 95.6 | 96.0 | 92.2 | 93.3 | 95.5 | 95.1 |
| InternVL-G (ours) | 98.6 | 97.7 | 96.5 | 96.7 | 96.9 | 95.1 | 94.8 | 96.1 | 96.6 |
</details>
<details>
<summary>多模态对话</summary>
请看 ["和SOTA多模态大模型对比"](#和-sota-多模态大模型对比)
</details>
## 使用 HuggingFace 快速开始
<details>
<summary>使用 InternViT-6B 提取视觉特征 (点击展开)</summary>
```python
import torch
from PIL import Image
from transformers import AutoModel, CLIPImageProcessor
model = AutoModel.from_pretrained(
'OpenGVLab/InternViT-6B-448px-V1-5',
torch_dtype=torch.bfloat16,
low_cpu_mem_usage=True,
trust_remote_code=True).cuda().eval()
image = Image.open('./examples/image1.jpg').convert('RGB')
image_processor = CLIPImageProcessor.from_pretrained('OpenGVLab/InternViT-6B-448px-V1-5')
pixel_values = image_processor(images=image, return_tensors='pt').pixel_values
pixel_values = pixel_values.to(torch.bfloat16).cuda()
outputs = model(pixel_values)
```
</details>
<details>
<summary>使用 InternVL-C(ontrastive) 和 InternVL-G(enerative) 进行跨模态检索 (点击展开)</summary>
```python
import torch
from PIL import Image
from transformers import AutoModel, CLIPImageProcessor
from transformers import AutoTokenizer
model = AutoModel.from_pretrained(
'OpenGVLab/InternVL-14B-224px',
torch_dtype=torch.bfloat16,
low_cpu_mem_usage=True,
trust_remote_code=True).cuda().eval()
image_processor = CLIPImageProcessor.from_pretrained('OpenGVLab/InternVL-14B-224px')
tokenizer = AutoTokenizer.from_pretrained(
'OpenGVLab/InternVL-14B-224px', use_fast=False, add_eos_token=True)
tokenizer.pad_token_id = 0 # set pad_token_id to 0
images = [
Image.open('./examples/image1.jpg').convert('RGB'),
Image.open('./examples/image2.jpg').convert('RGB'),
Image.open('./examples/image3.jpg').convert('RGB')
]
prefix = 'summarize:'
texts = [
prefix + 'a photo of a red panda', # English
prefix + '一张熊猫的照片', # Chinese
prefix + '二匹の猫の写真' # Japanese
]
pixel_values = image_processor(images=images, return_tensors='pt').pixel_values
pixel_values = pixel_values.to(torch.bfloat16).cuda()
input_ids = tokenizer(texts, return_tensors='pt', max_length=80,
truncation=True, padding='max_length').input_ids.cuda()
# InternVL-C
logits_per_image, logits_per_text = model(
image=pixel_values, text=input_ids, mode='InternVL-C')
probs = logits_per_image.softmax(dim=-1)
# tensor([[9.9609e-01, 5.2185e-03, 6.0070e-08],
# [2.2949e-02, 9.7656e-01, 5.9903e-06],
# [3.2932e-06, 7.4863e-05, 1.0000e+00]], device='cuda:0',
# dtype=torch.bfloat16, grad_fn=<SoftmaxBackward0>)
# InternVL-G
logits_per_image, logits_per_text = model(
image=pixel_values, text=input_ids, mode='InternVL-G')
probs = logits_per_image.softmax(dim=-1)
# tensor([[9.9609e-01, 3.1738e-03, 3.6322e-08],
# [8.6060e-03, 9.9219e-01, 2.8759e-06],
# [1.7583e-06, 3.1233e-05, 1.0000e+00]], device='cuda:0',
# dtype=torch.bfloat16, grad_fn=<SoftmaxBackward0>)
# please set add_eos_token to False for generation
tokenizer.add_eos_token = False
image = Image.open('./examples/image1.jpg').convert('RGB')
pixel_values = image_processor(images=image, return_tensors='pt').pixel_values
pixel_values = pixel_values.to(torch.bfloat16).cuda()
tokenized = tokenizer("English caption:", return_tensors='pt')
pred = model.generate(
pixel_values=pixel_values,
input_ids=tokenized.input_ids.cuda(),
attention_mask=tokenized.attention_mask.cuda(),
num_beams=5,
min_new_tokens=8,
)
caption = tokenizer.decode(pred[0].cpu(), skip_special_tokens=True).strip()
# English caption: a red panda sitting on top of a wooden platform
```
</details>
<details>
<summary>使用 InternVL-Chat 进行多模态对话 (点击展开)</summary>
这里我们以较小的 OpenGVLab/InternVL2-8B 为例:
```python
import numpy as np
import torch
import torchvision.transforms as T
from decord import VideoReader, cpu
from PIL import Image
from torchvision.transforms.functional import InterpolationMode
from transformers import AutoModel, AutoTokenizer
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)
def build_transform(input_size):
MEAN, STD = IMAGENET_MEAN, IMAGENET_STD
transform = T.Compose([
T.Lambda(lambda img: img.convert('RGB') if img.mode != 'RGB' else img),
T.Resize((input_size, input_size), interpolation=InterpolationMode.BICUBIC),
T.ToTensor(),
T.Normalize(mean=MEAN, std=STD)
])
return transform
def find_closest_aspect_ratio(aspect_ratio, target_ratios, width, height, image_size):
best_ratio_diff = float('inf')
best_ratio = (1, 1)
area = width * height
for ratio in target_ratios:
target_aspect_ratio = ratio[0] / ratio[1]
ratio_diff = abs(aspect_ratio - target_aspect_ratio)
if ratio_diff < best_ratio_diff:
best_ratio_diff = ratio_diff
best_ratio = ratio
elif ratio_diff == best_ratio_diff:
if area > 0.5 * image_size * image_size * ratio[0] * ratio[1]:
best_ratio = ratio
return best_ratio
def dynamic_preprocess(image, min_num=1, max_num=6, image_size=448, use_thumbnail=False):
orig_width, orig_height = image.size
aspect_ratio = orig_width / orig_height
# calculate the existing image aspect ratio
target_ratios = set(
(i, j) for n in range(min_num, max_num + 1) for i in range(1, n + 1) for j in range(1, n + 1) if
i * j <= max_num and i * j >= min_num)
target_ratios = sorted(target_ratios, key=lambda x: x[0] * x[1])
# find the closest aspect ratio to the target
target_aspect_ratio = find_closest_aspect_ratio(
aspect_ratio, target_ratios, orig_width, orig_height, image_size)
# calculate the target width and height
target_width = image_size * target_aspect_ratio[0]
target_height = image_size * target_aspect_ratio[1]
blocks = target_aspect_ratio[0] * target_aspect_ratio[1]
# resize the image
resized_img = image.resize((target_width, target_height))
processed_images = []
for i in range(blocks):
box = (
(i % (target_width // image_size)) * image_size,
(i // (target_width // image_size)) * image_size,
((i % (target_width // image_size)) + 1) * image_size,
((i // (target_width // image_size)) + 1) * image_size
)
# split the image
split_img = resized_img.crop(box)
processed_images.append(split_img)
assert len(processed_images) == blocks
if use_thumbnail and len(processed_images) != 1:
thumbnail_img = image.resize((image_size, image_size))
processed_images.append(thumbnail_img)
return processed_images
def load_image(image_file, input_size=448, max_num=6):
image = Image.open(image_file).convert('RGB')
transform = build_transform(input_size=input_size)
images = dynamic_preprocess(image, image_size=input_size, use_thumbnail=True, max_num=max_num)
pixel_values = [transform(image) for image in images]
pixel_values = torch.stack(pixel_values)
return pixel_values
path = 'OpenGVLab/InternVL2-8B'
model = AutoModel.from_pretrained(
path,
torch_dtype=torch.bfloat16,
low_cpu_mem_usage=True,
trust_remote_code=True).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)
# set the max number of tiles in `max_num`
pixel_values = load_image('./examples/image1.jpg', max_num=6).to(torch.bfloat16).cuda()
generation_config = dict(
num_beams=1,
max_new_tokens=1024,
do_sample=False,
)
# pure-text conversation (纯文本对话)
question = 'Hello, who are you?'
response, history = model.chat(tokenizer, None, question, generation_config, history=None, return_history=True)
print(f'User: {question}')
print(f'Assistant: {response}')
question = 'Can you tell me a story?'
response, history = model.chat(tokenizer, None, question, generation_config, history=history, return_history=True)
print(f'User: {question}')
print(f'Assistant: {response}')
# single-image single-round conversation (单图单轮对话)
question = '<image>\nPlease describe the image shortly.'
response = model.chat(tokenizer, pixel_values, question, generation_config)
print(f'User: {question}')
print(f'Assistant: {response}')
# single-image multi-round conversation (单图多轮对话)
question = '<image>\nPlease describe the image in detail.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=None, return_history=True)
print(f'User: {question}')
print(f'Assistant: {response}')
question = 'Please write a poem according to the image.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=history, return_history=True)
print(f'User: {question}')
print(f'Assistant: {response}')
# multi-image multi-round conversation, combined images (多图多轮对话,拼接图像)
pixel_values1 = load_image('./examples/image1.jpg', max_num=6).to(torch.bfloat16).cuda()
pixel_values2 = load_image('./examples/image2.jpg', max_num=6).to(torch.bfloat16).cuda()
pixel_values = torch.cat((pixel_values1, pixel_values2), dim=0)
question = '<image>\nDescribe the two images in detail.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config,
history=None, return_history=True)
question = 'What are the similarities and differences between these two images.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config,
history=history, return_history=True)
print(f'User: {question}')
print(f'Assistant: {response}')
# multi-image multi-round conversation, separate images (多图多轮对话,独立图像)
pixel_values1 = load_image('./examples/image1.jpg', max_num=6).to(torch.bfloat16).cuda()
pixel_values2 = load_image('./examples/image2.jpg', max_num=6).to(torch.bfloat16).cuda()
pixel_values = torch.cat((pixel_values1, pixel_values2), dim=0)
num_patches_list = [pixel_values1.size(0), pixel_values2.size(0)]
question = 'Image-1: <image>\nImage-2: <image>\nDescribe the two images in detail.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config,
num_patches_list=num_patches_list,
history=None, return_history=True)
print(f'User: {question}')
print(f'Assistant: {response}')
question = 'What are the similarities and differences between these two images.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config,
num_patches_list=num_patches_list,
history=history, return_history=True)
print(f'User: {question}')
print(f'Assistant: {response}')
# batch inference, single image per sample (单图批处理)
pixel_values1 = load_image('./examples/image1.jpg', max_num=6).to(torch.bfloat16).cuda()
pixel_values2 = load_image('./examples/image2.jpg', max_num=6).to(torch.bfloat16).cuda()
num_patches_list = [pixel_values1.size(0), pixel_values2.size(0)]
pixel_values = torch.cat((pixel_values1, pixel_values2), dim=0)
questions = ['<image>\nDescribe the image in detail.'] * len(num_patches_list)
responses = model.batch_chat(tokenizer, pixel_values,
num_patches_list=num_patches_list,
questions=questions,
generation_config=generation_config)
for question, response in zip(questions, responses):
print(f'User: {question}')
print(f'Assistant: {response}')
# video multi-round conversation (视频多轮对话)
def get_index(bound, fps, max_frame, first_idx=0, num_segments=32):
if bound:
start, end = bound[0], bound[1]
else:
start, end = -100000, 100000
start_idx = max(first_idx, round(start * fps))
end_idx = min(round(end * fps), max_frame)
seg_size = float(end_idx - start_idx) / num_segments
frame_indices = np.array([
int(start_idx + (seg_size / 2) + np.round(seg_size * idx))
for idx in range(num_segments)
])
return frame_indices
def load_video(video_path, bound=None, input_size=448, max_num=1, num_segments=32):
vr = VideoReader(video_path, ctx=cpu(0), num_threads=1)
max_frame = len(vr) - 1
fps = float(vr.get_avg_fps())
pixel_values_list, num_patches_list = [], []
transform = build_transform(input_size=input_size)
frame_indices = get_index(bound, fps, max_frame, first_idx=0, num_segments=num_segments)
for frame_index in frame_indices:
img = Image.fromarray(vr[frame_index].asnumpy()).convert('RGB')
img = dynamic_preprocess(img, image_size=input_size, use_thumbnail=True, max_num=max_num)
pixel_values = [transform(tile) for tile in img]
pixel_values = torch.stack(pixel_values)
num_patches_list.append(pixel_values.shape[0])
pixel_values_list.append(pixel_values)
pixel_values = torch.cat(pixel_values_list)
return pixel_values, num_patches_list
video_path = './examples/red-panda.mp4'
# pixel_values, num_patches_list = load_video(video_path, num_segments=32, max_num=1)
pixel_values, num_patches_list = load_video(video_path, num_segments=8, max_num=1)
pixel_values = pixel_values.to(torch.bfloat16).cuda()
video_prefix = ''.join([f'Frame{i+1}: <image>\n' for i in range(len(num_patches_list))])
question = video_prefix + 'What is the red panda doing?'
# Frame1: <image>\nFrame2: <image>\n...\nFrame8: <image>\n{question}
response, history = model.chat(tokenizer, pixel_values, question, generation_config,
num_patches_list=num_patches_list,
history=None, return_history=True)
print(f'User: {question}')
print(f'Assistant: {response}')
question = 'Describe this video in detail. Don\'t repeat.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config,
num_patches_list=num_patches_list,
history=history, return_history=True)
print(f'User: {question}')
print(f'Assistant: {response}')
```
</details>
## 许可证
本项目以 [MIT](LICENSE) 许可证发布。项目中的部分代码和模型来自其它来源,受其原始许可证的约束。
## 引用
如果您在研究中发现本项目有用,请考虑引用:
```BibTeX
@article{chen2023internvl,
title={InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks},
author={Chen, Zhe and Wu, Jiannan and Wang, Wenhai and Su, Weijie and Chen, Guo and Xing, Sen and Zhong, Muyan and Zhang, Qinglong and Zhu, Xizhou and Lu, Lewei and Li, Bin and Luo, Ping and Lu, Tong and Qiao, Yu and Dai, Jifeng},
journal={arXiv preprint arXiv:2312.14238},
year={2023}
}
@article{chen2024far,
title={How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites},
author={Chen, Zhe and Wang, Weiyun and Tian, Hao and Ye, Shenglong and Gao, Zhangwei and Cui, Erfei and Tong, Wenwen and Hu, Kongzhi and Luo, Jiapeng and Ma, Zheng and others},
journal={arXiv preprint arXiv:2404.16821},
year={2024}
}
```
## 致谢
InternVL 的代码构建参考了以下的项目: [OpenAI CLIP](https://github.com/openai/CLIP)、[Open CLIP](https://github.com/mlfoundations/open_clip)、[CLIP Benchmark](https://github.com/LAION-AI/CLIP_benchmark)、[EVA](https://github.com/baaivision/EVA/tree/master)、[InternImage](https://github.com/OpenGVLab/InternImage)、[ViT-Adapter](https://github.com/czczup/ViT-Adapter)、[MMSegmentation](https://github.com/open-mmlab/mmsegmentation)、[Transformers](https://github.com/huggingface/transformers)、[DINOv2](https://github.com/facebookresearch/dinov2)、[BLIP-2](https://github.com/salesforce/LAVIS/tree/main/projects/blip2)、[Qwen-VL](https://github.com/QwenLM/Qwen-VL/tree/master/eval_mm) 和 [LLaVA-1.5](https://github.com/haotian-liu/LLaVA),感谢这些杰出的工作。
______________________________________________________________________
如果您想加入我们的项目微信群,请扫描下方二维码添加我们的小助手:
<p align="center"><img width="300" alt="image" src="https://github.com/OpenGVLab/DragGAN/assets/26198430/e3f0807f-956a-474e-8fd2-1f7c22d73997"></p>
# InternViT-6B for Image Classification
This folder contains the implementation of InternViT-6B for image classification, which corresponds to Section 4.2.1 of our [InternVL 1.0 paper](https://arxiv.org/pdf/2312.14238).
The codebase for this part is derived from [InternImage](https://github.com/OpenGVLab/InternImage), with some code references to [EVA](https://github.com/baaivision/EVA/tree/master) and [DINOv2](https://github.com/facebookresearch/dinov2). Thanks for their great work.
In this part, we validate the visual perception capabilities of InternViT-6B, the core vision component of InternVL 1.0.
We evaluate the quality of the visual representations produced by InternViT-6B on the ImageNet-1K dataset. Following common practice, we adopt linear probing, i.e., training a linear classifier while keeping the backbone frozen. In addition to the ImageNet-1K validation set,
we also report performance on several ImageNet variants to benchmark domain generalization capability.
InternViT-6B follows the structure of a vanilla ViT; its hyperparameters are listed in the table below.
<img width="558" alt="image" src="https://github.com/OpenGVLab/InternVL/assets/23737120/e6bb0151-ab2f-4436-982f-6c68c5a69bc4">
## 🛠️ Installation
Follow the [installation guide](../INSTALLATION.md) to set up the environment.
## 📦 Data Preparation
> Please prepare the datasets below according to your needs.
- `ImageNet-1K`: We use the standard ImageNet dataset; you can download it from [http://image-net.org/](http://image-net.org/).
- `ImageNet-A`: Download it from [https://people.eecs.berkeley.edu/~hendrycks/imagenet-a.tar](https://people.eecs.berkeley.edu/~hendrycks/imagenet-a.tar).
- `ImageNet-R`: Download it from [https://people.eecs.berkeley.edu/~hendrycks/imagenet-r.tar](https://people.eecs.berkeley.edu/~hendrycks/imagenet-r.tar).
- `ImageNetV2`: Download it from [https://imagenetv2public.s3-us-west-2.amazonaws.com/imagenetv2-matched-frequency.tar.gz](https://imagenetv2public.s3-us-west-2.amazonaws.com/imagenetv2-matched-frequency.tar.gz).
- `ImageNet-Sketch`: Download it using `gdown`.
```shell
# GDown is needed to download the dataset.
# Please install it via `pip install gdown`
gdown --id 1Mj0i5HBthqH1p_yeXzsg22gZduvgoNeA
```
First, please prepare the `ImageNet-1K`, `ImageNet-A`, `ImageNet-R`, `ImageNetV2`, and `ImageNet-Sketch` datasets following the directory structure outlined below.
```bash
$ tree data
data
├── imagenet-1k
│ ├── train
│ ├── n01498041
│ └── ...
│ └── val
│ ├── ILSVRC2012_val_00000001.JPEG
│ └── ...
├── imagenet-a
│ ├── n01498041
│ └── ...
├── imagenet-r
│ ├── n01443537
│ └── ...
├── imagenet-sketch
│ ├── n01440764
│ └── ...
└── imagenetv2
└── ImageNetV2-matched-frequency
```
Then, unzip the `train.txt.zip` and `val.txt.zip` in `meta_data/`.
```shell
cd meta_data/
unzip train.txt.zip
unzip val.txt.zip
```
## 📦 Model Preparation
| model name | type | download | size |
| ---------------------------- | ------- | ---------------------------------------------------------------------------------------------- | :-----: |
| intern_vit_6b_224px.pth | pytorch | 🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL/blob/main/intern_vit_6b_224px.pth) | 12 GB |
| intern_vit_6b_224px_head.pth | pytorch | 🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL/blob/main/intern_vit_6b_224px_head.pth) | 25.7 MB |
Please download the above model weights and place them in the `pretrained/` folder.
```sh
cd pretrained
wget https://huggingface.co/OpenGVLab/InternVL/resolve/main/intern_vit_6b_224px.pth
wget https://huggingface.co/OpenGVLab/InternVL/resolve/main/intern_vit_6b_224px_head.pth
```
The directory structure is:
```sh
pretrained
├── intern_vit_6b_224px_head.pth
└── intern_vit_6b_224px.pth
```
## 🔍 Linear Probing on ImageNet-1K
> **Warning**: Please install `apex` before training (see [installation guide](../INSTALLATION.md#additional-instructions) for details).
To train a linear classifier for `InternViT-6B` on ImageNet with 8 GPUs, run:
```bash
python -m torch.distributed.launch --nproc_per_node 8 --master_port 12345 main.py --cfg configs/intern_vit_6b_1k_224.yaml
# or manage jobs with slurm
GPUS=8 sh train_in1k.sh <partition> <job-name> configs/intern_vit_6b_1k_224.yaml --launcher slurm
```
Note: it is normal for the following message to appear during training, and it can be safely ignored:
> \_IncompatibleKeys(missing_keys=\[\], unexpected_keys=\['clip_projector.norm1_q.weight', 'clip_projector.norm1_q.bias', 'clip_projector.norm1_k.weight', 'clip_projector.norm1_k.bias', 'clip_projector.norm1_v.weight', 'clip_projector.norm1_v.bias', 'clip_projector.cross_attn.q_bias', 'clip_projector.cross_attn.k_bias', 'clip_projector.cross_attn.v_bias', 'clip_projector.cross_attn.q.weight', 'clip_projector.cross_attn.k.weight', 'clip_projector.cross_attn.v.weight', 'clip_projector.cross_attn.proj.weight', 'clip_projector.cross_attn.proj.bias'\])
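If you are curious where those unexpected keys come from, you can inspect the downloaded backbone checkpoint directly. The snippet below is a quick diagnostic sketch; it assumes the `.pth` file stores a plain PyTorch state dict (possibly nested under a key such as `'model'`, `'module'`, or `'state_dict'`). The `clip_projector.*` tensors appear to come from the contrastive pretraining head, which the linear-probing model does not define, so they are reported as unexpected and can be ignored.
```python
import torch

ckpt = torch.load('pretrained/intern_vit_6b_224px.pth', map_location='cpu')
# Unwrap common nesting conventions if present (assumption, adjust as needed).
for key in ('model', 'module', 'state_dict'):
    if isinstance(ckpt, dict) and key in ckpt and isinstance(ckpt[key], dict):
        ckpt = ckpt[key]
        break

projector_keys = [k for k in ckpt.keys() if k.startswith('clip_projector')]
print(f'{len(projector_keys)} clip_projector.* tensors in the checkpoint')
print(projector_keys[:5])
```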
## 📊 Evaluation
> **Warning**: Please install `apex` before evaluation (see [installation guide](../INSTALLATION.md#additional-instructions) for details).
| model name | IN-1K | IN-ReaL | IN-V2 | IN-A | IN-R | IN-Sketch | download |
| -------------------------------------------------------------- | :---: | :-----: | :---: | :--: | :--: | :-------: | :--------------------------------------------------------------------------------------------------------------------------------------------------: |
| [intern_vit_6b_1k_224.yaml](configs/intern_vit_6b_1k_224.yaml) | 88.2 | 90.4 | 79.9 | 77.5 | 89.8 | 69.1 | [ckpt](https://huggingface.co/OpenGVLab/InternVL/resolve/main/intern_vit_6b_224px_head.pth) \| [log](./work_dirs/intern_vit_6b_1k_224/log_rank0.txt) |
<details>
<summary>Evaluate InternViT-6B on <b>ImageNet-1K val</b> with 8 GPUs (click to expand).</summary>
```bash
python -m torch.distributed.launch --nproc_per_node 8 --master_port 12345 main.py --eval \
--cfg configs/intern_vit_6b_1k_224.yaml --resume pretrained/intern_vit_6b_224px_head.pth
# or manage jobs with slurm
GPUS=8 sh train_in1k.sh <partition> <job-name> configs/intern_vit_6b_1k_224.yaml --eval \
--resume pretrained/intern_vit_6b_224px_head.pth --launcher slurm
```
Expected results:
```
* Acc@1 88.230 Acc@5 98.474
Accuracy of the network on the 50000 test images: 88.2%
```
</details>
<details>
<summary>Evaluate InternViT-6B on <b>ImageNet-ReaL</b> with 1 GPU (click to expand).</summary>
**Note: ImageNet-ReaL now only supports single-GPU testing.**
```bash
python -m torch.distributed.launch --nproc_per_node 1 --master_port 12345 main.py --eval \
--cfg configs/intern_vit_6b_1k_224_test_imagenet_real.yaml --resume pretrained/intern_vit_6b_224px_head.pth
# or manage jobs with slurm
GPUS=1 GPUS_PER_NODE=1 sh train_in1k.sh <partition> <job-name> configs/intern_vit_6b_1k_224_test_imagenet_real.yaml --eval \
--resume pretrained/intern_vit_6b_224px_head.pth --launcher slurm
```
Expected results:
```
* ReaL Acc@1 90.437 Acc@5 98.567 loss 0.605
ReaL Accuracy of the network on the 50000 test images: 90.4%
```
</details>
<details>
<summary>Evaluate InternViT-6B on <b>ImageNetV2</b> with 8 GPUs (click to expand).</summary>
```bash
python -m torch.distributed.launch --nproc_per_node 8 --master_port 12345 main.py --eval \
--cfg configs/intern_vit_6b_1k_224_test_imagenetv2.yaml --resume pretrained/intern_vit_6b_224px_head.pth
# or manage jobs with slurm
GPUS=8 sh train_in1k.sh <partition> <job-name> configs/intern_vit_6b_1k_224_test_imagenetv2.yaml --eval \
--resume pretrained/intern_vit_6b_224px_head.pth --launcher slurm
```
Expected results:
```
* Acc@1 79.940 Acc@5 95.340
Accuracy of the network on the 10000 test images: 79.9%
```
</details>
<details>
<summary>Evaluate InternViT-6B on <b>ImageNet-A</b> with 8 GPUs (click to expand).</summary>
```bash
python -m torch.distributed.launch --nproc_per_node 8 --master_port 12345 main.py --eval \
--cfg configs/intern_vit_6b_1k_224_test_imagenet_a.yaml --resume pretrained/intern_vit_6b_224px_head.pth
# or manage jobs with slurm
GPUS=8 sh train_in1k.sh <partition> <job-name> configs/intern_vit_6b_1k_224_test_imagenet_a.yaml --eval \
--resume pretrained/intern_vit_6b_224px_head.pth --launcher slurm
```
Expected results:
```
* Acc@1 77.479 Acc@5 92.737
Accuracy of the network on the 7500 test images: 77.5%
```
</details>
<details>
<summary>Evaluate InternViT-6B on <b>ImageNet-R</b> with 8 GPUs (click to expand).</summary>
```bash
python -m torch.distributed.launch --nproc_per_node 8 --master_port 12345 main.py --eval \
--cfg configs/intern_vit_6b_1k_224_test_imagenet_r.yaml --resume pretrained/intern_vit_6b_224px_head.pth
# or manage jobs with slurm
GPUS=8 sh train_in1k.sh <partition> <job-name> configs/intern_vit_6b_1k_224_test_imagenet_r.yaml --eval \
--resume pretrained/intern_vit_6b_224px_head.pth --launcher slurm
```
Expected results:
```
* Acc@1 89.777 Acc@5 97.023
Accuracy of the network on the 30000 test images: 89.8%
```
</details>
<details>
<summary>Evaluate InternViT-6B on <b>ImageNet-Sketch</b> with 8 GPUs (click to expand).</summary>
```bash
python -m torch.distributed.launch --nproc_per_node 8 --master_port 12345 main.py --eval \
--cfg configs/intern_vit_6b_1k_224_test_imagenet_sketch.yaml --resume pretrained/intern_vit_6b_224px_head.pth
# or manage jobs with slurm
GPUS=8 sh train_in1k.sh <partition> <job-name> configs/intern_vit_6b_1k_224_test_imagenet_sketch.yaml --eval \
--resume pretrained/intern_vit_6b_224px_head.pth --launcher slurm
```
Expected results:
```
* Acc@1 69.117 Acc@5 88.341
Accuracy of the network on the 50889 test images: 69.1%
```
</details>
# --------------------------------------------------------
# InternVL
# Copyright (c) 2022 OpenGVLab
# Licensed under The MIT License [see LICENSE for details]
# --------------------------------------------------------
import os
import yaml
from yacs.config import CfgNode as CN
_C = CN()
# Base config files
_C.BASE = ['']
# -----------------------------------------------------------------------------
# Data settings
# -----------------------------------------------------------------------------
_C.DATA = CN()
# Batch size for a single GPU, could be overwritten by command line argument
_C.DATA.BATCH_SIZE = 128
# Path to dataset, could be overwritten by command line argument
_C.DATA.DATA_PATH = ''
# Dataset name
_C.DATA.DATASET = 'imagenet'
# Input image size
_C.DATA.IMG_SIZE = 224
# Interpolation to resize image (random, bilinear, bicubic)
_C.DATA.INTERPOLATION = 'bicubic'
# Use zipped dataset instead of folder dataset
# could be overwritten by command line argument
_C.DATA.ZIP_MODE = False
# Cache Data in Memory, could be overwritten by command line argument
_C.DATA.CACHE_MODE = 'part'
# Pin CPU memory in DataLoader for more efficient (sometimes) transfer to GPU.
_C.DATA.PIN_MEMORY = True
# Number of data loading threads
_C.DATA.NUM_WORKERS = 8
# Load data to memory
_C.DATA.IMG_ON_MEMORY = False
# Name of the build_transform function
_C.DATA.TRANSFORM = 'build_transform'
# -----------------------------------------------------------------------------
# Model settings
# -----------------------------------------------------------------------------
_C.MODEL = CN()
# Model type
_C.MODEL.TYPE = 'intern_vit_6b'
# Model name
_C.MODEL.NAME = 'intern_vit_6b'
# Pretrained weight from checkpoint, could be imagenet22k pretrained weight
# could be overwritten by command line argument
_C.MODEL.PRETRAINED = ''
# Checkpoint to resume, could be overwritten by command line argument
_C.MODEL.RESUME = ''
# Number of classes, overwritten in data preparation
_C.MODEL.NUM_CLASSES = 1000
# Dropout rate
_C.MODEL.DROP_RATE = 0.0
# Drop path rate
_C.MODEL.DROP_PATH_RATE = 0.1
# Drop path type
_C.MODEL.DROP_PATH_TYPE = 'linear' # linear, uniform
# Label Smoothing
_C.MODEL.LABEL_SMOOTHING = 0.1
# INTERN_VIT_6B parameters
_C.MODEL.INTERN_VIT_6B = CN()
_C.MODEL.INTERN_VIT_6B.PATCH_SIZE = 14
_C.MODEL.INTERN_VIT_6B.PRETRAIN_SIZE = 224
_C.MODEL.INTERN_VIT_6B.QKV_BIAS = False
_C.MODEL.INTERN_VIT_6B.EMBED_DIM = 3200
_C.MODEL.INTERN_VIT_6B.NUM_HEADS = 25
_C.MODEL.INTERN_VIT_6B.MLP_RATIO = 4
_C.MODEL.INTERN_VIT_6B.INIT_VALUES = 0.1
_C.MODEL.INTERN_VIT_6B.QK_NORMALIZATION = True
_C.MODEL.INTERN_VIT_6B.DEPTH = 48
_C.MODEL.INTERN_VIT_6B.USE_FLASH_ATTN = True
_C.MODEL.INTERN_VIT_6B.FREEZE_VIT = True
_C.MODEL.INTERN_VIT_6B.PRETRAINED = None
_C.MODEL.INTERN_VIT_6B.CLS_TARGET = 'cls_patch_concat'
_C.MODEL.INTERN_VIT_6B.HEAD_NORM_TYPE = 'bn'
# -----------------------------------------------------------------------------
# Training settings
# -----------------------------------------------------------------------------
_C.TRAIN = CN()
_C.TRAIN.START_EPOCH = 0
_C.TRAIN.EPOCHS = 300
_C.TRAIN.WARMUP_EPOCHS = 20
_C.TRAIN.WEIGHT_DECAY = 0.05
_C.TRAIN.BASE_LR = 5e-4
_C.TRAIN.WARMUP_LR = 5e-7
_C.TRAIN.MIN_LR = 5e-6
# Clip gradient norm
_C.TRAIN.CLIP_GRAD = 5.0
# Auto resume from latest checkpoint
_C.TRAIN.AUTO_RESUME = True
# Gradient accumulation steps
# could be overwritten by command line argument
_C.TRAIN.ACCUMULATION_STEPS = 0
# Whether to use gradient checkpointing to save memory
# could be overwritten by command line argument
_C.TRAIN.USE_CHECKPOINT = False
# LR scheduler
_C.TRAIN.LR_SCHEDULER = CN()
_C.TRAIN.LR_SCHEDULER.NAME = 'cosine'
# Epoch interval to decay LR, used in StepLRScheduler
_C.TRAIN.LR_SCHEDULER.DECAY_EPOCHS = 30
# LR decay rate, used in StepLRScheduler
_C.TRAIN.LR_SCHEDULER.DECAY_RATE = 0.1
# Optimizer
_C.TRAIN.OPTIMIZER = CN()
_C.TRAIN.OPTIMIZER.NAME = 'adamw'
# Optimizer Epsilon
_C.TRAIN.OPTIMIZER.EPS = 1e-8
# Optimizer Betas
_C.TRAIN.OPTIMIZER.BETAS = (0.9, 0.999)
# SGD momentum
_C.TRAIN.OPTIMIZER.MOMENTUM = 0.9
# Use ZeRO optimizer state partitioning
_C.TRAIN.OPTIMIZER.USE_ZERO = False
# Freeze the backbone during training
_C.TRAIN.OPTIMIZER.FREEZE_BACKBONE = None
# Learning rate multiplier for DCN layers
_C.TRAIN.OPTIMIZER.DCN_LR_MUL = None
# Exponential moving average (EMA) of model weights
_C.TRAIN.EMA = CN()
_C.TRAIN.EMA.ENABLE = False
_C.TRAIN.EMA.DECAY = 0.9998
# Layer-wise learning rate decay
_C.TRAIN.LR_LAYER_DECAY = False
_C.TRAIN.LR_LAYER_DECAY_RATIO = 0.875
# Randomly initialize the fine-tuning head weights
_C.TRAIN.RAND_INIT_FT_HEAD = False
# -----------------------------------------------------------------------------
# Augmentation settings
# -----------------------------------------------------------------------------
_C.AUG = CN()
# Color jitter factor
_C.AUG.COLOR_JITTER = 0.4
# AutoAugment / RandAugment policy (timm format), e.g. 'v0', 'original', or 'rand-m9-mstd0.5-inc1'
_C.AUG.AUTO_AUGMENT = 'rand-m9-mstd0.5-inc1'
# Random erase prob
_C.AUG.REPROB = 0.25
# Random erase mode
_C.AUG.REMODE = 'pixel'
# Random erase count
_C.AUG.RECOUNT = 1
# Mixup alpha, mixup enabled if > 0
_C.AUG.MIXUP = 0.8
# Cutmix alpha, cutmix enabled if > 0
_C.AUG.CUTMIX = 1.0
# Cutmix min/max ratio, overrides alpha and enables cutmix if set
_C.AUG.CUTMIX_MINMAX = None
# Probability of performing mixup or cutmix when either/both is enabled
_C.AUG.MIXUP_PROB = 1.0
# Probability of switching to cutmix when both mixup and cutmix enabled
_C.AUG.MIXUP_SWITCH_PROB = 0.5
# How to apply mixup/cutmix params. Per "batch", "pair", or "elem"
_C.AUG.MIXUP_MODE = 'batch'
# RandomResizedCrop
_C.AUG.RANDOM_RESIZED_CROP = False
_C.AUG.MEAN = (0.485, 0.456, 0.406)
_C.AUG.STD = (0.229, 0.224, 0.225)
# -----------------------------------------------------------------------------
# Testing settings
# -----------------------------------------------------------------------------
_C.TEST = CN()
# Whether to use center crop when testing
_C.TEST.CROP = True
# Whether to use SequentialSampler as validation sampler
_C.TEST.SEQUENTIAL = False
# -----------------------------------------------------------------------------
# Misc
# -----------------------------------------------------------------------------
# Mixed precision opt level ('O0', 'O1', 'O2'); 'O0' disables AMP
# overwritten by command line argument
_C.AMP_OPT_LEVEL = ''
# Path to output folder, overwritten by command line argument
_C.OUTPUT = ''
# Tag of experiment, overwritten by command line argument
_C.TAG = 'default'
# Frequency to save checkpoint
_C.SAVE_FREQ = 1
# Frequency to log info
_C.PRINT_FREQ = 10
# Frequency to run evaluation
_C.EVAL_FREQ = 1
# Fixed random seed
_C.SEED = 0
# Perform evaluation only, overwritten by command line argument
_C.EVAL_MODE = False
# Test throughput only, overwritten by command line argument
_C.THROUGHPUT_MODE = False
# local rank for DistributedDataParallel, given by command line argument
_C.LOCAL_RANK = 0
_C.EVAL_22K_TO_1K = False
_C.AMP_TYPE = 'float16'
def _update_config_from_file(config, cfg_file):
config.defrost()
with open(cfg_file, 'r') as f:
yaml_cfg = yaml.load(f, Loader=yaml.FullLoader)
for cfg in yaml_cfg.setdefault('BASE', ['']):
if cfg:
_update_config_from_file(
config, os.path.join(os.path.dirname(cfg_file), cfg))
print('=> merge config from {}'.format(cfg_file))
config.merge_from_file(cfg_file)
config.freeze()
def update_config(config, args):
_update_config_from_file(config, args.cfg)
config.defrost()
if hasattr(args, 'opts') and args.opts:
config.merge_from_list(args.opts)
# merge from specific arguments
if hasattr(args, 'batch_size') and args.batch_size:
config.DATA.BATCH_SIZE = args.batch_size
if hasattr(args, 'dataset') and args.dataset:
config.DATA.DATASET = args.dataset
if hasattr(args, 'data_path') and args.data_path:
config.DATA.DATA_PATH = args.data_path
if hasattr(args, 'zip') and args.zip:
config.DATA.ZIP_MODE = True
if hasattr(args, 'cache_mode') and args.cache_mode:
config.DATA.CACHE_MODE = args.cache_mode
if hasattr(args, 'pretrained') and args.pretrained:
config.MODEL.PRETRAINED = args.pretrained
if hasattr(args, 'resume') and args.resume:
config.MODEL.RESUME = args.resume
if hasattr(args, 'accumulation_steps') and args.accumulation_steps:
config.TRAIN.ACCUMULATION_STEPS = args.accumulation_steps
if hasattr(args, 'use_checkpoint') and args.use_checkpoint:
config.TRAIN.USE_CHECKPOINT = True
if hasattr(args, 'amp_opt_level') and args.amp_opt_level:
config.AMP_OPT_LEVEL = args.amp_opt_level
if hasattr(args, 'output') and args.output:
config.OUTPUT = args.output
if hasattr(args, 'tag') and args.tag:
config.TAG = args.tag
if hasattr(args, 'eval') and args.eval:
config.EVAL_MODE = True
if hasattr(args, 'throughput') and args.throughput:
config.THROUGHPUT_MODE = True
if hasattr(args, 'save_ckpt_num') and args.save_ckpt_num:
config.SAVE_CKPT_NUM = args.save_ckpt_num
if hasattr(args, 'use_zero') and args.use_zero:
config.TRAIN.OPTIMIZER.USE_ZERO = True
# set local rank for distributed training
if hasattr(args, 'local_rank') and args.local_rank:
config.LOCAL_RANK = args.local_rank
# output folder
config.MODEL.NAME = args.cfg.split('/')[-1].replace('.yaml', '')
config.OUTPUT = os.path.join(config.OUTPUT, config.MODEL.NAME)
# config.OUTPUT = os.path.join(config.OUTPUT, config.MODEL.NAME, config.TAG)
config.freeze()
def get_config(args):
"""Get a yacs CfgNode object with default values."""
# Return a clone so that the defaults will not be altered
# This is for the "local variable" use pattern
config = _C.clone()
update_config(config, args)
return config
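For reference, `get_config(args)` expects an argparse-style namespace: `update_config` only looks at attributes it recognizes (via `hasattr`), so any parser that supplies `cfg` plus whichever optional flags you need will work. The sketch below is a minimal, hypothetical wiring; the flag names are assumptions derived from the attributes `update_config` reads, and the repository's own `main.py` defines its own (likely larger) parser.
```python
# Illustrative only: a minimal argparse front end for get_config(args).
# Assumes the module above is saved as config.py; flag names are assumptions.
import argparse

from config import get_config


def parse_args():
    parser = argparse.ArgumentParser('InternViT-6B evaluation (sketch)')
    parser.add_argument('--cfg', required=True, help='path to a YAML config file')
    parser.add_argument('--batch-size', dest='batch_size', type=int, default=None)
    parser.add_argument('--data-path', dest='data_path', default=None)
    parser.add_argument('--resume', default=None, help='checkpoint to resume from')
    parser.add_argument('--eval', action='store_true', help='evaluation-only mode')
    parser.add_argument('--output', default='output')
    parser.add_argument('--tag', default=None)
    parser.add_argument('--local_rank', type=int, default=0)
    parser.add_argument('opts', nargs=argparse.REMAINDER,
                        help='extra KEY VALUE pairs merged via merge_from_list')
    return parser.parse_args()


if __name__ == '__main__':
    args = parse_args()
    # Merge order: hard-coded defaults -> YAML (including BASE includes) -> CLI overrides.
    config = get_config(args)
    print(config.MODEL.NAME, config.DATA.DATASET, config.EVAL_MODE)
```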
DATA:
IMG_ON_MEMORY: False
BATCH_SIZE: 128
TRANSFORM: 'build_transform_for_linear_probe'
DATA_PATH: './data/imagenet-1k'
MODEL:
TYPE: intern_vit_6b
DROP_PATH_RATE: 0.0
INTERN_VIT_6B:
FREEZE_VIT: True
PATCH_SIZE: 14
PRETRAIN_SIZE: 224
QKV_BIAS: False
EMBED_DIM: 3200
NUM_HEADS: 25
MLP_RATIO: 4
INIT_VALUES: 0.1
QK_NORMALIZATION: True
DEPTH: 48
USE_FLASH_ATTN: True
PRETRAINED: "./pretrained/intern_vit_6b_224px.pth"
CLS_TARGET: 'cls_patch_concat'
TRAIN:
EMA:
ENABLE: False
DECAY: 0.998
EPOCHS: 10
WARMUP_EPOCHS: 1
WEIGHT_DECAY: 0.0
BASE_LR: 0.1 # 512
WARMUP_LR: .0
MIN_LR: .0
LR_LAYER_DECAY: false
OPTIMIZER:
NAME: 'sgd'
DATA:
IMG_ON_MEMORY: False
BATCH_SIZE: 128
DATASET: 'imagenet_a'
TRANSFORM: 'build_transform_for_linear_probe'
DATA_PATH: './data/imagenet-a'
MODEL:
TYPE: intern_vit_6b
DROP_PATH_RATE: 0.0
INTERN_VIT_6B:
FREEZE_VIT: True
PATCH_SIZE: 14
PRETRAIN_SIZE: 224
QKV_BIAS: False
EMBED_DIM: 3200
NUM_HEADS: 25
MLP_RATIO: 4
INIT_VALUES: 0.1
QK_NORMALIZATION: True
DEPTH: 48
USE_FLASH_ATTN: True
PRETRAINED: "./pretrained/intern_vit_6b_224px.pth"
CLS_TARGET: 'cls_patch_concat'
TRAIN:
EMA:
ENABLE: False
DECAY: 0.998
EPOCHS: 10
WARMUP_EPOCHS: 1
WEIGHT_DECAY: 0.0
BASE_LR: 0.1 # 512
WARMUP_LR: .0
MIN_LR: .0
LR_LAYER_DECAY: false
OPTIMIZER:
NAME: 'sgd'
DATA:
IMG_ON_MEMORY: False
BATCH_SIZE: 128
DATASET: 'imagenet_r'
TRANSFORM: 'build_transform_for_linear_probe'
DATA_PATH: './data/imagenet-r'
MODEL:
TYPE: intern_vit_6b
DROP_PATH_RATE: 0.0
INTERN_VIT_6B:
FREEZE_VIT: True
PATCH_SIZE: 14
PRETRAIN_SIZE: 224
QKV_BIAS: False
EMBED_DIM: 3200
NUM_HEADS: 25
MLP_RATIO: 4
INIT_VALUES: 0.1
QK_NORMALIZATION: True
DEPTH: 48
USE_FLASH_ATTN: True
PRETRAINED: "./pretrained/intern_vit_6b_224px.pth"
CLS_TARGET: 'cls_patch_concat'
TRAIN:
EMA:
ENABLE: False
DECAY: 0.998
EPOCHS: 10
WARMUP_EPOCHS: 1
WEIGHT_DECAY: 0.0
BASE_LR: 0.1 # 512
WARMUP_LR: .0
MIN_LR: .0
LR_LAYER_DECAY: false
OPTIMIZER:
NAME: 'sgd'
DATA:
IMG_ON_MEMORY: False
BATCH_SIZE: 128
DATASET: 'imagenet-real'
TRANSFORM: 'build_transform_for_linear_probe'
DATA_PATH: './data/imagenet-1k'
MODEL:
TYPE: intern_vit_6b
DROP_PATH_RATE: 0.0
INTERN_VIT_6B:
FREEZE_VIT: True
PATCH_SIZE: 14
PRETRAIN_SIZE: 224
QKV_BIAS: False
EMBED_DIM: 3200
NUM_HEADS: 25
MLP_RATIO: 4
INIT_VALUES: 0.1
QK_NORMALIZATION: True
DEPTH: 48
USE_FLASH_ATTN: True
PRETRAINED: "./pretrained/intern_vit_6b_224px.pth"
CLS_TARGET: 'cls_patch_concat'
TRAIN:
EMA:
ENABLE: False
DECAY: 0.998
EPOCHS: 10
WARMUP_EPOCHS: 1
WEIGHT_DECAY: 0.0
BASE_LR: 0.1 # 512
WARMUP_LR: .0
MIN_LR: .0
LR_LAYER_DECAY: false
OPTIMIZER:
NAME: 'sgd'
DATA:
IMG_ON_MEMORY: False
BATCH_SIZE: 128
DATASET: 'imagenet_sketch'
TRANSFORM: 'build_transform_for_linear_probe'
DATA_PATH: './data/imagenet-sketch'
MODEL:
TYPE: intern_vit_6b
DROP_PATH_RATE: 0.0
INTERN_VIT_6B:
FREEZE_VIT: True
PATCH_SIZE: 14
PRETRAIN_SIZE: 224
QKV_BIAS: False
EMBED_DIM: 3200
NUM_HEADS: 25
MLP_RATIO: 4
INIT_VALUES: 0.1
QK_NORMALIZATION: True
DEPTH: 48
USE_FLASH_ATTN: True
PRETRAINED: "./pretrained/intern_vit_6b_224px.pth"
CLS_TARGET: 'cls_patch_concat'
TRAIN:
EMA:
ENABLE: False
DECAY: 0.998
EPOCHS: 10
WARMUP_EPOCHS: 1
WEIGHT_DECAY: 0.0
BASE_LR: 0.1 # 512
WARMUP_LR: .0
MIN_LR: .0
LR_LAYER_DECAY: false
OPTIMIZER:
NAME: 'sgd'
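For context, each of the YAML snippets above overrides only a handful of the defaults declared in config.py; every key it omits (optimizer betas, augmentation settings, and so on) keeps its default value until the command line overrides it. A minimal sketch of that merge, assuming the ImageNet-A config is stored at the path used by the evaluation commands earlier:
```python
# Illustrative only: merging one of the per-dataset YAMLs above on top of the
# defaults in config.py. The path is taken from the evaluation commands shown
# earlier in this document; adjust it to wherever the file actually lives.
from config import _C, _update_config_from_file

cfg = _C.clone()
_update_config_from_file(cfg, 'configs/intern_vit_6b_1k_224_test_imagenet_a.yaml')

print(cfg.DATA.DATASET)                    # 'imagenet_a'   (set by the YAML)
print(cfg.TRAIN.OPTIMIZER.NAME)            # 'sgd'          (default was 'adamw')
print(cfg.MODEL.INTERN_VIT_6B.PRETRAINED)  # './pretrained/intern_vit_6b_224px.pth'
```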